Llama 2 vs Mistral: Believe the Hype

Kyle Corbitt

Nov 5, 2023

TL;DR: If you’re fine-tuning a task-specific LLM under 34B parameters, you should probably be using Mistral. The performance improvements vs. Llama 2 are significant and generalize across many task types.

At OpenPipe we make it easy to replace your existing prompt with a task-specific fine-tuned model in just a few minutes. We launched in August with support for Llama 2, and have since added support for GPT-3.5 and most recently Mistral. Mistral was exciting because it (1) outperforms similar small models on benchmarks and (2) has support for an 8K context window out of the box.

Over the 3 weeks since we landed Mistral support, we’ve worked with our existing customers to proactively test Mistral fine-tunes on their existing datasets. Here’s the aggregate accuracy on two types of deterministic tasks (tasks for which there is a single right answer, making accuracy easy to calculate):

For non-deterministic tasks (summarization, chat, question answering, etc.) we asked our users to manually review outputs on the test set and compare Mistral to Llama 2 13B. Users consistently either preferred the Mistral output or considered Mistral and Llama 2 13B about the same.

There are some areas we haven’t evaluated yet. For example, we don’t currently have any multilingual tasks, so we haven’t evaluated Mistral’s performance there. Overall though, Mistral is an extremely strong model that we’re encouraging as a default for future fine-tunes!
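To make the two evaluation styles above concrete, here is a minimal Python sketch. It is an illustration only, not OpenPipe's actual evaluation code: the function names, the whitespace-trimming rule, and the verdict labels (`mistral`, `llama`, `same`) are all assumptions made for the example, and the sample inputs are made up.

```python
from collections import Counter


def exact_match_accuracy(predictions, references):
    """Deterministic tasks: fraction of outputs equal to the single right answer.

    Assumption: outputs are compared as strings after trimming whitespace.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return correct / len(references)


def preference_tally(votes):
    """Non-deterministic tasks: share of each manual-review verdict.

    Assumption: each reviewer verdict is one of 'mistral', 'llama', or 'same'.
    """
    counts = Counter(votes)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: counts[label] / total for label in ("mistral", "llama", "same")}


# Hypothetical data, for illustration only.
accuracy = exact_match_accuracy(["spam", "ham "], ["spam", "spam"])
shares = preference_tally(["mistral", "same", "mistral", "llama"])
print(accuracy, shares)
```

Exact match only works when there is a single right answer, which is why the open-ended tasks fall back to side-by-side human review and a simple tally of verdicts.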

PS. If you’re using GPT-3.5 or GPT-4 at scale and are interested in saving significant money, improving accuracy and decreasing latency, reach out! We decreased one YC customer’s monthly inference bill from $80K to $15K while drastically improving their accuracy vs. the GPT-3.5 prompt they were using previously.