Dec 1, 2023
At OpenPipe we’re building a fully managed fine-tuning platform for developers. OpenPipe lets you replace your existing prompt with a fine-tuned model in just a few minutes of work. We capture your existing prompts and completions, synthesize them into a dataset, and fine-tune models that are drop-in replacements for your prompt.
Starting today, we also automate the process of evaluating your model’s performance. Here’s what an eval looks like at the top level:
How it Works
In every OpenPipe dataset, we hold out a random “test set” that isn’t used in training. We then let you view the output of your models (as well as comparison models like GPT-3.5) on your test set:
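To make the hold-out idea concrete, here is a minimal Python sketch of splitting a dataset into training and test rows. It is an illustration only, not OpenPipe's implementation: the `split_holdout` helper, the 10% test fraction, and the row format are all assumptions made for this example.

```python
import random

def split_holdout(rows, test_fraction=0.1, seed=0):
    """Shuffle a copy of the rows and hold out a fraction as the test set.

    Hypothetical helper: the fraction and seed are illustrative choices.
    """
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    shuffled = list(rows)      # copy so the caller's list is not mutated
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    # Held-out rows come first in the shuffled copy; the rest are for training.
    return shuffled[n_test:], shuffled[:n_test]

# Illustrative rows standing in for captured prompts and completions.
rows = [{"input": f"prompt {i}", "output": f"completion {i}"} for i in range(20)]
train, test = split_holdout(rows, test_fraction=0.1)
```

Because the test rows never reach training, a model's outputs on them show how it handles inputs it has not seen.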
Evaluations use GPT-4 as the judge: for each test input, it compares every model’s output against every other model’s output, one pair at a time. Research shows that, used this way, GPT-4 agrees with human consensus about as often as humans agree with each other (with the important caveat that it tends to prefer its own answers slightly more than a human would!).
When you create an evaluation, you can choose which models you want to compare, the number of samples, as well as any custom instructions GPT-4 should use while evaluating.
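The pairwise setup can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not OpenPipe's code: `build_comparisons`, `win_rates`, and the data layout are hypothetical, ties are assumed to count as half a win for each side, and the `judge` callable stands in for a GPT-4 call (here replaced by a toy rule so the example runs offline).

```python
import random
from itertools import combinations

def build_comparisons(outputs, models, num_samples, seed=0):
    """Pick test inputs and pair up every two models on each of them.

    outputs maps model name -> {input_id: output text} (assumed layout).
    """
    # Only inputs that every selected model has an output for can be compared.
    shared = sorted(set.intersection(*(set(outputs[m]) for m in models)))
    rng = random.Random(seed)
    sampled = rng.sample(shared, min(num_samples, len(shared)))
    return [(i, a, b) for i in sampled for a, b in combinations(models, 2)]

def win_rates(comparisons, outputs, judge, instructions=""):
    """Score each model: a win counts 1, a tie counts 0.5 for both sides."""
    wins, totals = {}, {}
    for input_id, a, b in comparisons:
        verdict = judge(outputs[a][input_id], outputs[b][input_id], instructions)
        for m in (a, b):
            totals[m] = totals.get(m, 0) + 1
            wins.setdefault(m, 0.0)
        if verdict == "a":
            wins[a] += 1
        elif verdict == "b":
            wins[b] += 1
        else:
            wins[a] += 0.5
            wins[b] += 0.5
    return {m: wins[m] / totals[m] for m in totals}

def toy_judge(output_a, output_b, instructions):
    """Placeholder for the GPT-4 judge: prefers the longer output."""
    if len(output_a) == len(output_b):
        return "tie"
    return "a" if len(output_a) > len(output_b) else "b"

# Made-up outputs for three models on four test inputs.
models = ["fine-tuned", "gpt-3.5", "base"]
outputs = {
    "fine-tuned": {i: "xxx" for i in range(4)},
    "gpt-3.5": {i: "xx" for i in range(4)},
    "base": {i: "x" for i in range(4)},
}
comparisons = build_comparisons(outputs, models, num_samples=2)
rates = win_rates(comparisons, outputs, toy_judge, instructions="Prefer concise answers.")
```

With three models there are three pairs per sampled input, so the number of judge calls grows quickly as you add models or samples.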
You can also click into any specific result to see which model won, along with GPT-4’s reasoning.
We're really excited to see the types of evals our users build. Happy evaluating!