News
Announcing Llama 3.1 and GPT-4o Mini fine-tuning through OpenPipe!
Kyle Corbitt
Jul 24, 2024
4 minutes
Hi everyone,
We’re excited to announce three new major models that went live on the OpenPipe platform yesterday: Llama 3.1 8B and 70B, and GPT-4o mini. Here’s what you need to know:
High-level Stats
Model Quality
The good news is that all 3 models are extremely high quality. The bad news is that they saturate most of the standard evals we ran, which makes comparing them difficult! In fact, both Llama 3.1 variants saturated all 3 of the standard evals we ran, and GPT-4o mini saturated 2 of the 3.
What do we mean by saturate? For any given input, you can imagine there is a potential “perfect” output (or set of outputs) that cannot be improved upon. The more complex the task, the more difficult it is for a model to generate a perfect output. However, once a model is strong enough to consistently generate a perfect output for a task, we consider the task saturated for that model. In our LLM-as-judge evals, saturation usually shows up as a cluster of models all doing about the same on the task, with no model significantly outperforming the others.
And in fact, that’s what we see in the evaluations below:
In the chart above, all 3 fine-tuned models do about as well as each other (win rates within 6%) on both the Resume Summarization and Data Extraction tasks. On Chatbot Responses, however, both Llama 3.1 variants significantly outperform GPT-4o mini. So the Chatbot Responses task isn’t saturated for GPT-4o mini, but all other tasks and models are.
This is very significant: we chose these tasks explicitly because older models on our platform, like Mistral 7B and Llama 3 8B, did not saturate them! There are two main reasons why we’re seeing this saturation now:
1. The new models we’re testing here are stronger than the previous generation of models available on-platform.
2. Our benchmark models are now all trained on datasets relabeled with Mixture of Agents, which substantially improves the quality of the dataset and thus the fine-tuned model.
We’re working on developing better benchmarks (please email us if you’re interested in having us adapt your dataset into an internal benchmark!). In the meantime, since all 3 models are available through OpenPipe, feel free to test them all and see which one does best on your specific task. We’re excited to see what you build!
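A quick way to run such a head-to-head test is to send the same prompt to each of the three new models and compare the outputs side by side. In the sketch below, `complete` is a stand-in for whatever client you use to call OpenPipe (its API is OpenAI-compatible); the model names are the base models from this announcement, not your fine-tuned model IDs, and the prompt is a placeholder.

```python
from typing import Callable

# The three models announced in this post.
NEW_MODELS = ["llama-3.1-8b", "llama-3.1-70b", "gpt-4o-mini"]

def compare_models(prompt: str,
                   complete: Callable[[str, str], str]) -> dict[str, str]:
    """Send the same prompt to each model and collect the completions.

    `complete(model, prompt)` is assumed to return the model's text
    response; wire it up to your own API client."""
    return {model: complete(model, prompt) for model in NEW_MODELS}

# Example with a stub in place of a real API client:
outputs = compare_models(
    "Summarize this resume: ...",
    lambda model, prompt: f"[{model}] stub response",
)
for model, text in outputs.items():
    print(model, "->", text)
```

Swapping the stub for a real client call lets you eyeball all three models on your own inputs before committing to one for fine-tuning.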