Fine-tune your own Llama 2 to replace GPT-3.5/4

Sep 12, 2023

Originally published on Hacker News.

There has been a lot of interest on HN in fine-tuning open-source LLMs recently (e.g., Anyscale's post at https://news.ycombinator.com/item?id=37090632). I've been playing around with fine-tuning models for a couple of years, and wanted to share some insights and practical code. I’ve condensed what I’ve learned into a small set of notebooks at https://github.com/OpenPipe/OpenPipe/tree/main/examples/clas..., covering labeling data, fine-tuning, running efficient inference, and evaluating costs/performance. The 7B model we train here matches GPT-4’s labels 95% of the time on the test set, and in the 5% of cases where they disagree, it’s often because the correct answer is genuinely ambiguous.
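To make the "matches GPT-4’s labels 95% of the time" figure concrete, here is a minimal sketch of how an agreement rate between two sets of labels could be computed. This is not the code from the notebooks: the function name `agreement_rate` and the example labels are made up for illustration, and it assumes each example gets exactly one label.

```python
def agreement_rate(reference_labels, candidate_labels):
    """Return (rate, disagreements) for two equal-length lists of labels.

    rate is the fraction of positions where the labels match exactly.
    disagreements is a list of (index, reference, candidate) tuples, handy
    for checking by hand whether the misses are genuinely ambiguous cases.
    """
    if len(reference_labels) != len(candidate_labels):
        raise ValueError("label lists must have the same length")
    if not reference_labels:
        raise ValueError("label lists must not be empty")

    disagreements = [
        (i, ref, cand)
        for i, (ref, cand) in enumerate(zip(reference_labels, candidate_labels))
        if ref != cand
    ]
    rate = 1 - len(disagreements) / len(reference_labels)
    return rate, disagreements


# Made-up labels for illustration only, not taken from the recipe dataset.
gpt4_labels = ["dessert", "main", "main", "side", "dessert"]
finetuned_labels = ["dessert", "main", "side", "side", "dessert"]

rate, disagreements = agreement_rate(gpt4_labels, finetuned_labels)
print(f"agreement: {rate:.0%}")
for index, ref, cand in disagreements:
    print(f"example {index}: GPT-4 said {ref!r}, fine-tuned model said {cand!r}")
```

Returning the disagreements alongside the rate is a deliberate choice here: reading through them is how you find out whether the misses are real errors or just ambiguous examples.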

What is fine-tuning? You can think of it as a more powerful form of prompting, where instead of writing your instructions in text, you encode them in the weights of the model itself. You do this by training an existing model on example input/output pairs that demonstrate the task you want your fine-tuned model to learn. Fine-tuning can work with as few as 50 examples, but I usually try to get 1000+ if possible.
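As a rough illustration of what "example input/output pairs" can look like on disk, here is a small sketch that serializes a few pairs as JSON Lines, one example per line. The field names `input` and `output`, the recipe text, and the labels are all assumptions made up for this example; the exact format depends on whichever training library you use.

```python
import json

# Hypothetical training pairs. Each one shows the model an input and the
# output we want it to learn to produce for that input.
examples = [
    {
        "input": "Recipe: Chocolate chip cookies. Ingredients: flour, butter, sugar, chocolate chips.",
        "output": "dessert",
    },
    {
        "input": "Recipe: Roast chicken. Ingredients: whole chicken, lemon, garlic, thyme.",
        "output": "main",
    },
    {
        "input": "Recipe: Garlic mashed potatoes. Ingredients: potatoes, garlic, butter, milk.",
        "output": "side",
    },
]


def to_jsonl(pairs):
    """Serialize a list of dicts as JSON Lines text (one JSON object per line)."""
    return "".join(json.dumps(pair) + "\n" for pair in pairs)


jsonl_text = to_jsonl(examples)
print(jsonl_text, end="")
```

In practice you would write `jsonl_text` to a file and keep some of the examples out of training as a held-out test set, so you can measure how well the fine-tuned model does on inputs it has not seen.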

Prompting still has some big advantages over fine-tuning. It's way easier/faster to iterate on your instructions than to label data and re-train a model. And operationally, it's easier to deploy one big model and just adjust its behavior as necessary than to deploy many small fine-tuned models that will likely each get lower utilization.

Fine-tuning has one huge advantage though: it is far more effective at guiding a model's behavior than prompting, so you can often get away with a much smaller model. That gets you faster responses and lower inference costs. A fine-tuned Llama 7B model is 50x cheaper than GPT-3.5 on a per-token basis, and for many use cases can produce results that are as good as or better!

For example, classifying the 2M recipes at https://huggingface.co/datasets/corbt/all-recipes with GPT-4 would cost $23k. Even with GPT-3.5 it would cost over $1k. The model we fine-tuned performs similarly to GPT-4 and costs just $19 to run over the entire dataset.
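To put those totals side by side, here is a quick back-of-the-envelope calculation using only the numbers quoted above. The GPT-3.5 figure is stated as "over $1k", so the script treats $1,000 as a lower bound; the variable names are just for illustration.

```python
NUM_RECIPES = 2_000_000

# Total cost in USD to classify the whole dataset, as quoted in the post.
# GPT-3.5 is "over $1k", so 1_000 is a lower bound rather than an exact figure.
total_cost_usd = {
    "GPT-4": 23_000,
    "GPT-3.5": 1_000,
    "fine-tuned 7B": 19,
}

baseline = total_cost_usd["fine-tuned 7B"]

# Cost per 1,000 recipes, and how many times more expensive each option is
# than the fine-tuned model.
cost_per_1k = {name: total / NUM_RECIPES * 1000 for name, total in total_cost_usd.items()}
cost_ratio = {name: total / baseline for name, total in total_cost_usd.items()}

for name in total_cost_usd:
    print(f"{name:>14}: ${cost_per_1k[name]:.4f} per 1k recipes, {cost_ratio[name]:.0f}x the fine-tuned cost")
```

With these inputs, GPT-4 comes out at roughly 1,200x the cost of the fine-tuned model, and GPT-3.5 at a little over 50x, which lines up with the per-token comparison mentioned earlier.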

Disclaimer: My brother David and I are working on an open-source product called OpenPipe (https://github.com/openpipe/openpipe) to help engineers adopt fine-tuning as simply as possible. But none of the information above depends on our startup. This post is just about sharing what we’ve learned about fine-tuning. I hope it’s useful!