Announcing Automatic Evals for Fine-tuned Models
Kyle Corbitt
Dec 1, 2023
At OpenPipe we’re building a fully-managed fine-tuning platform for developers. OpenPipe lets you replace an existing prompt with a fine-tuned model in just a few minutes of work. We capture your existing prompts and completions, synthesize them into a dataset, and fine-tune models that are a drop-in replacement for your prompt.
Starting today, we also automate the process of evaluating your model’s performance. Here’s what an eval looks like at the top level:

How it Works
In every OpenPipe dataset, we hold out a random “test set” that isn’t used in training. We then let you view the outputs of your models (as well as those of comparison models like GPT-3.5) on that test set:

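Under the hood, the idea of a held-out split is simple. Here’s a minimal sketch in Python of what holding out a random test set could look like; the split_dataset helper and the 10% fraction are illustrative assumptions, since OpenPipe manages this split for you automatically.

```python
import random

def split_dataset(rows, test_fraction=0.1, seed=42):
    """Randomly hold out a test set that is never used for training.

    Illustrative sketch only -- OpenPipe handles this split automatically.
    """
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = max(1, int(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]  # (train, test)

# Example: 1,000 captured prompt/completion pairs -> 900 train / 100 test
train, test = split_dataset(
    [{"prompt": f"p{i}", "completion": f"c{i}"} for i in range(1000)]
)
```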
When you create an evaluation, we use GPT-4 to compare each model’s output with every other model’s output for the same input in a pairwise fashion. Research shows that when used in this way, GPT-4 agrees with human consensus as frequently as humans agree with each other (with the important caveat that it often prefers its own answers slightly more than a human would!).
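To make the pairwise comparison concrete, here’s a minimal sketch of how an LLM judge like this can be set up with the OpenAI Python client. The judge_pair helper, the prompt wording, and the A/B/TIE output format are illustrative assumptions, not OpenPipe’s actual implementation.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_pair(prompt, output_a, output_b, custom_instructions=""):
    """Ask GPT-4 which of two model outputs better answers the same input.

    Illustrative sketch; OpenPipe's real judging prompt differs.
    """
    judge_prompt = (
        "You are comparing two model outputs for the same input.\n"
        f"{custom_instructions}\n\n"
        f"Input:\n{prompt}\n\n"
        f"Output A:\n{output_a}\n\n"
        f"Output B:\n{output_b}\n\n"
        "Reply with exactly one of: A, B, or TIE, followed by a one-sentence justification."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": judge_prompt}],
        temperature=0,
    )
    return response.choices[0].message.content

# Example: compare a fine-tuned model's output against GPT-3.5's for one test row
verdict = judge_pair(
    prompt="Summarize this support ticket in one sentence: ...",
    output_a="Customer reports a billing error on their March invoice.",
    output_b="The customer is unhappy.",
)
print(verdict)
```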
When setting up an evaluation, you can choose which models to compare, how many samples to use, and any custom instructions GPT-4 should follow while judging.

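Those options map naturally onto the judge sketched above. Here’s a hedged example of how a model list, sample count, and custom instructions might drive a round of pairwise comparisons; the eval_config shape and the run_eval helper are hypothetical illustrations, not OpenPipe’s API.

```python
from itertools import combinations

# Hypothetical configuration mirroring the options described above;
# not OpenPipe's actual API.
eval_config = {
    "models": ["my-fine-tuned-model", "gpt-3.5-turbo"],
    "num_samples": 100,
    "custom_instructions": "Prefer concise answers that follow the requested format.",
}

def run_eval(test_rows, outputs_by_model, config):
    """Compare every pair of models on a sample of held-out test rows.

    `outputs_by_model[model][i]` is that model's output for test_rows[i].
    Reuses the judge_pair sketch from above.
    """
    results = []
    for i, row in enumerate(test_rows[: config["num_samples"]]):
        for model_a, model_b in combinations(config["models"], 2):
            verdict = judge_pair(
                prompt=row["prompt"],
                output_a=outputs_by_model[model_a][i],
                output_b=outputs_by_model[model_b][i],
                custom_instructions=config["custom_instructions"],
            )
            results.append({"row": i, "pair": (model_a, model_b), "verdict": verdict})
    return results
```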
It’s also really easy to click into a specific result and see which model won, as well as GPT-4’s reasoning.

We're really excited to see the types of evals our users build. Happy evaluating!