Comparison
Mixtral Curious? Comparing Mistral 7B and Mixtral for fine-tuning
Kyle Corbitt
Feb 29, 2024
4 minutes
Mistral and Mixtral: sibling rivalry vibes
Mistral 7B and Mixtral, both released by Mistral AI, are the two most popular base models that OpenPipe users train today. Mixtral is the “big brother” of the two: with roughly 2x the inference compute and 6x the parameters, it’s a lot more expensive to train and serve. On the other hand, it’s also much stronger on standard benchmarks.
Over the last few months we’ve fine-tuned hundreds of Mistral 7B models (using our improved Mistral 7B Optimized variant) and dozens of Mixtral models. At this point we have enough data to get a real sense of how much stronger Mixtral is than Mistral as a fine-tuning target [1].
Evals are all you need
Evals can be tricky, but we’ve found that LLM-as-judge is the most reliable way to compare two outputs head to head in the absence of a golden test set or human expert feedback. OpenPipe has a built-in automated eval system which our customers can use to define GPT-4-powered evaluations with custom criteria.
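Under the hood, the LLM-as-judge pattern is simple. Here’s a minimal sketch of the idea (not OpenPipe’s production code) using the OpenAI Python client; the prompt wording and the `judge` helper are illustrative.

```python
# Minimal LLM-as-judge sketch (illustrative, not OpenPipe's production code).
from openai import OpenAI

client = OpenAI()

def judge(prompt: str, output_a: str, output_b: str, criteria: str) -> str:
    """Ask GPT-4 to pick the better output; returns 'A', 'B', or 'TIE'."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": (
                "You are comparing two model outputs against these criteria: "
                f"{criteria}. Reply with exactly one token: A, B, or TIE."
            )},
            {"role": "user", "content": (
                f"Prompt:\n{prompt}\n\n"
                f"Output A:\n{output_a}\n\nOutput B:\n{output_b}"
            )},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip()
```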
To get a better sense of Mixtral-based model performance, we grabbed a representative sample of user-defined evaluations with the following criteria:
The evaluation contains at least one Mixtral-based fine-tune and one Mistral 7B Optimized-based fine-tune
Both fine-tunes were trained on the same dataset
The training dataset contained at least 10K examples
The evaluation was run against at least 100 randomly selected test set entries
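For concreteness, the selection logic amounts to something like the following sketch; the `evals` DataFrame and its column names are hypothetical, not our actual schema.

```python
# Hypothetical selection logic; the `evals` DataFrame and its column
# names are illustrative, not OpenPipe's actual schema.
import pandas as pd

def select_evals(evals: pd.DataFrame) -> pd.DataFrame:
    return evals[
        evals["has_mixtral_finetune"]
        & evals["has_mistral_7b_optimized_finetune"]
        & evals["same_training_dataset"]
        & (evals["training_examples"] >= 10_000)
        & (evals["test_entries_compared"] >= 100)
    ]
```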
By relying on our users’ custom-defined evaluations, we can have high confidence that the results correlate with actual response quality.
Results (just the facts, please)
Let’s pull the aggregated data for the resulting evaluations! In the figure below, the green bar represents the fraction of comparisons in each eval that the Mixtral-based model won, the red bar is the same for the Mistral-based model, and blue represents ties (either they gave literally the exact same answer, or GPT-4 judged their answers to be equally good).
We can see from the data that Mixtral is clearly stronger, but both base models are pretty dang good. Disregarding ties, Mixtral wins 60.1% of its head-to-head comparisons and Mistral wins 39.9%. However, when factoring in ties (which generally happen when both models did a good job), Mixtral’s win rate is only 53%.
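To make the relationship between those two win rates concrete, here’s the arithmetic with hypothetical counts chosen to reproduce the reported percentages:

```python
# Illustrative counts chosen to match the reported percentages;
# the actual per-eval counts are aggregated in the figure above.
mixtral_wins, mistral_wins, ties = 601, 399, 134

# Excluding ties: 601 / (601 + 399) = 60.1%
excl_ties = mixtral_wins / (mixtral_wins + mistral_wins)

# Including ties in the denominator: 601 / 1134 ≈ 53%
incl_ties = mixtral_wins / (mixtral_wins + mistral_wins + ties)

print(f"{excl_ties:.1%}, {incl_ties:.1%}")  # 60.1%, 53.0%
```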
And in fact, there are a few wildcards that might make Mixtral even stronger than it appears here [1].
Tell me how you really feel
What can we take away from this? Here are some thoughts:
It’s always worth trying a smaller model first. Some of our users created accounts specifically to fine-tune Mixtral, only to find that Mistral 7B Optimized was actually plenty good enough!
For some tasks even smaller models might be a better fit. For Eval 6 and Eval 7 in particular, it seems like even Mistral 7B has pretty much “maxed out” the evaluation. A smaller model like Phi-2 or Gemma 2B might be a good match here to get even lower costs and latency.
All that said, if you need the best, Mixtral is the model to beat. When the highest quality is non-negotiable and cost isn’t a concern, Mixtral is the stronger model to fine-tune.
Is there another model you’d like evaluated or added to our platform? Tag us on Twitter and let us know!
——————————
Footnotes
[1]: There are a few reasons why Mixtral might be under-performing relative to its capabilities in our evaluation:
We’re comparing Mixtral to Mistral 7B Optimized, which is already significantly stronger than the standard Mistral 7B Instruct as a fine-tuning base. Over time I expect we’ll be able to produce a Mixtral Optimized as well.
For Mixtral, we skip fine-tuning the router layers and only tune the linear layers within each expert. This is primarily because of a limitation in vLLM’s Mixtral LoRA support, which doesn’t allow serving LoRAs on router layers (see the configuration sketch after these notes). However, there is reason to believe that enabling training of the router layers would boost performance even further.
We’ve had more time to tune the heuristics we use for setting training hyperparameters for Mistral than for Mixtral. Over time our Mixtral fine-tuning will likely continue to become more effective.
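For the curious, here’s a hypothetical peft LoRA configuration for Mixtral matching the setup described in note 2 above: the attention projections and per-expert feed-forward layers are targeted, while the router (“gate”) is left out. The r/alpha/dropout values are placeholders, not our tuned hyperparameters.

```python
# Hypothetical LoRA config for Mixtral that skips the router layers,
# mirroring the constraint described in note 2. Module names match the
# Hugging Face Mixtral implementation; r/alpha values are placeholders.
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    # Attention projections plus the per-expert feed-forward layers.
    # The router ("gate") is deliberately excluded, since vLLM can't
    # serve LoRAs on router layers.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "w1", "w2", "w3"],
    task_type="CAUSAL_LM",
)
```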