News

Announcing Llama 3.1 and GPT-4o Mini fine-tuning through OpenPipe!

Kyle Corbitt

Jul 24, 2024

4 minutes

Hi everyone,

We’re excited to announce three major new models that went live on the OpenPipe platform yesterday: Llama 3.1 8B and 70B, and GPT-4o mini. Here’s what you need to know:

High-level Stats

Model Quality

The good news is that all three models are extremely high quality. The bad news is that they saturate most of the standard evals we ran, which makes comparing them difficult! In fact, both Llama 3.1 variants saturated all three of our standard evals, and GPT-4o mini saturated two of the three.

What do we mean by saturate? For any given input, you can imagine there is a potential “perfect” output (or set of outputs) that cannot be improved upon. The more complex the task, the more difficult it is for a model to generate a perfect output. Once a model is strong enough to consistently generate a perfect output for a task, we consider that task saturated for that model. In our LLM-as-judge evals, this usually shows up as a cluster of models all doing about the same on the task, with no model significantly outperforming the others.
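To make the idea concrete, here’s a hypothetical sketch of how you might flag a task as saturated from head-to-head win rates. The model names, numbers, and the 6% band are illustrative only (the band is borrowed from the win-rate spread discussed below), not our actual eval results or methodology:

```python
# Hypothetical saturation check: a task counts as saturated when every
# model's LLM-as-judge win rate falls within a small band of the best
# model's, i.e. no model significantly outperforms the cluster.

def is_saturated(win_rates, band=0.06):
    """win_rates maps model name -> head-to-head win rate in [0, 1]."""
    best = max(win_rates.values())
    return all(best - rate <= band for rate in win_rates.values())

# Illustrative numbers, not real eval results:
data_extraction = {"llama-3.1-8b": 0.58, "llama-3.1-70b": 0.60, "gpt-4o-mini": 0.56}
chatbot = {"llama-3.1-8b": 0.64, "llama-3.1-70b": 0.66, "gpt-4o-mini": 0.48}

print(is_saturated(data_extraction))  # True: all win rates within 6% of the best
print(is_saturated(chatbot))          # False: one model trails the cluster
```

In practice the band would be chosen with statistical significance in mind, since win rates estimated from a small prompt set are noisy.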

And in fact, that’s what we see in the evaluations below:

In the chart above, all three fine-tuned models perform within 6% of each other (by win rate) on both the Resume Summarization and Data Extraction tasks. On Chatbot Responses, however, both Llama 3.1 variants significantly outperform GPT-4o mini. So the Chatbot Responses task isn’t saturated for GPT-4o mini, but every other task-and-model combination is.

This is very significant: we chose these tasks precisely because older models on our platform, like Mistral 7B and Llama 3 8B, did not saturate them! There are two main reasons we’re seeing this saturation now:

  • The new models we’re testing here are stronger than the previous generation of models available on-platform.

  • Our benchmark models are now all trained on datasets relabeled with Mixture of Agents, which substantially improves the quality of the dataset and thus the fine-tuned model.
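For intuition on the second point, here’s a hypothetical sketch of a Mixture-of-Agents-style relabeling pass: generate several candidate completions per row, have a judge pick the best, and use that winner as the new training label. The `agents` and `judge` below are trivial stand-ins for real LLM calls, and the function names are our own, not an OpenPipe API:

```python
# Hypothetical Mixture-of-Agents-style relabeling sketch. Each dataset row's
# completion is replaced by the judge's favorite among several candidates.

def relabel_dataset(rows, generators, judge):
    relabeled = []
    for row in rows:
        # Draw one candidate completion from each "agent".
        candidates = [gen(row["prompt"]) for gen in generators]
        # The judge selects the strongest candidate as the new label.
        best = judge(row["prompt"], candidates)
        relabeled.append({"prompt": row["prompt"], "completion": best})
    return relabeled

# Trivial stand-ins for LLM agents and an LLM judge:
agents = [lambda p: p.upper(), lambda p: p + " (expanded answer)"]
judge = lambda prompt, cands: max(cands, key=len)  # prefers the longer answer

rows = [{"prompt": "summarize this resume"}]
print(relabel_dataset(rows, agents, judge))
```

Training on the judge-selected completions is what lifts the quality ceiling of the resulting fine-tuned model: the model learns from a dataset that is, on average, better than any single generator would produce.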

We’re working on developing better benchmarks (please email us if you’d like us to adapt your dataset into an internal benchmark!). In the meantime, since all three models are available through OpenPipe, feel free to test them all and see which one does best on your specific task. We’re excited to see what you build!

Introducing Direct Preference Optimization (DPO) Support on OpenPipe

About OpenPipe

OpenPipe is the easiest way to train and deploy your own fine-tuned models. It only takes a few minutes to get started, and it can cut your costs by 25x relative to OpenAI while delivering higher quality.
