Announcing Automatic Evals for Fine-tuned Models

Kyle Corbitt

Dec 1, 2023

3 minutes

At OpenPipe we’re building a fully managed fine-tuning platform for developers. OpenPipe lets you replace your existing prompt with a fine-tuned model in just a few minutes of work. We capture your existing prompts and completions, synthesize them into a dataset, and fine-tune models that are drop-in replacements for your prompt.

Starting today, we also automate the process of evaluating your model’s performance. Here’s what an eval looks like at the top level:

How it Works

In every OpenPipe dataset, we hold out a random “test set” that isn’t used in training. We then let you view the output of your models (as well as comparison models like GPT-3.5) on your test set.
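To make the idea of a held-out test set concrete, here is a minimal sketch of a random split. The function name `split_holdout`, the 10% fraction, and the entry format are assumptions for illustration, not OpenPipe's actual implementation.

```python
import random


def split_holdout(entries, test_fraction=0.1, seed=0):
    """Shuffle a copy of the entries and hold out a fraction as the test set.

    Hypothetical helper: the fraction and seeding are illustrative choices.
    """
    rng = random.Random(seed)
    shuffled = list(entries)  # copy so the caller's list is not modified
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    # The first n_test shuffled entries are never shown to the trainer.
    return shuffled[n_test:], shuffled[:n_test]


# Made-up dataset entries standing in for captured prompts and completions.
entries = [{"input": f"prompt {i}", "output": f"completion {i}"} for i in range(50)]
train, test = split_holdout(entries)
print(len(train), len(test))
```

Because the test entries never appear in training, a model's output on them reflects how it handles inputs it has not seen.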

When you create an evaluation, we use GPT-4 to compare each model’s output with every other model’s output on the same input, one pair at a time. Research shows that when used in this way, GPT-4 agrees with human consensus as frequently as humans agree with each other (with the important caveat that it often prefers its own answers slightly more than a human would!).
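The pairwise scheme described above can be sketched roughly as follows. This is an illustration under stated assumptions: `stub_judge` is a placeholder that stands in for the GPT-4 call so the example runs offline, and scoring a tie as half a win for each side is one possible convention, not necessarily the one OpenPipe uses.

```python
from collections import defaultdict
from itertools import combinations


def stub_judge(prompt, output_a, output_b):
    """Placeholder for the GPT-4 judge: returns "a", "b", or "tie".

    It simply prefers the longer output so the example needs no API access.
    """
    if len(output_a) == len(output_b):
        return "tie"
    return "a" if len(output_a) > len(output_b) else "b"


def pairwise_win_rates(test_set, outputs, judge):
    """Compare every pair of models on every test input.

    outputs maps a model name to a list of outputs aligned with test_set.
    """
    wins = defaultdict(float)
    comparisons = defaultdict(int)
    for model_a, model_b in combinations(sorted(outputs), 2):
        for i, prompt in enumerate(test_set):
            verdict = judge(prompt, outputs[model_a][i], outputs[model_b][i])
            comparisons[model_a] += 1
            comparisons[model_b] += 1
            if verdict == "a":
                wins[model_a] += 1
            elif verdict == "b":
                wins[model_b] += 1
            else:  # assumed convention: a tie counts as half a win for each
                wins[model_a] += 0.5
                wins[model_b] += 0.5
    return {model: wins[model] / comparisons[model] for model in outputs}


# Made-up inputs and outputs, purely for illustration.
test_set = ["Summarize: cats sleep a lot.", "Translate 'hello' to French."]
outputs = {
    "fine-tuned": ["Cats sleep often.", "bonjour"],
    "gpt-3.5": ["Cats sleep.", "bonjour"],
    "base": ["Cats.", "salut!"],
}
rates = pairwise_win_rates(test_set, outputs, stub_judge)
print(rates)
```

With three models and two test inputs, this makes six judge calls; the number of pairs grows quadratically with the number of models, which is one reason to limit the sample count.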

For each evaluation, you can choose which models to compare, the number of samples, and any custom instructions GPT-4 should use while evaluating.
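As a rough sketch of how those three options might fit together, the snippet below groups them in a small config object, samples test entries, and folds the custom instructions into a judge prompt. `EvalConfig`, `sample_test_entries`, `build_judge_prompt`, and the prompt wording are all hypothetical; OpenPipe exposes these options through its UI, and its internal prompt is not described here.

```python
import random
from dataclasses import dataclass


@dataclass
class EvalConfig:
    """Hypothetical container for the three options; not OpenPipe's API."""
    models: list[str]
    num_samples: int
    instructions: str = ""


def sample_test_entries(test_set, config, seed=0):
    """Pick at most num_samples entries from the test set, without replacement."""
    k = min(config.num_samples, len(test_set))
    return random.Random(seed).sample(test_set, k)


def build_judge_prompt(config, prompt, output_a, output_b):
    """Assemble an illustrative judge prompt; the real wording is unknown."""
    parts = [
        "Compare the two responses to the input below and pick the better one.",
        config.instructions,  # skipped below when empty
        f"Input: {prompt}",
        f"Response A: {output_a}",
        f"Response B: {output_b}",
    ]
    return "\n".join(part for part in parts if part)


config = EvalConfig(
    models=["fine-tuned", "gpt-3.5"],
    num_samples=3,
    instructions="Prefer concise answers.",
)
test_inputs = [f"prompt {i}" for i in range(10)]
sampled = sample_test_entries(test_inputs, config)
judge_prompt = build_judge_prompt(config, sampled[0], "short answer", "a much longer answer")
print(judge_prompt)
```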

It’s also really easy to click into a specific result and see which model won, as well as GPT-4’s reasoning.

We're really excited to see the types of evals our users build. Happy evaluating!

About OpenPipe

OpenPipe is the easiest way to train and deploy your own fine-tuned models. It takes only a few minutes to get started and can save you 25x relative to OpenAI while delivering higher quality.
