Basics

Fine-Tuning in a Nutshell

Mar 28, 2024

5 minutes

Hi, I’m David, the CTO of OpenPipe. This is a post for individuals who want to build high-level intuition around fine-tuning.

In particular, we'll address the top-of-mind questions that enterprises think through when adopting any new technology. We can cluster them into three high-level groups:

  • What is fine-tuning?

  • How are people currently using fine-tuned models? What hasn’t worked?

  • Does fine-tuning make sense for me?

A surprising observation I've heard from researchers working in this field is how far the comparison between an LLM and a person can take you in understanding the behavior of your model. We'll follow their example by comparing LLMs to employees at a business.

What is fine-tuning?

Fine-tuning is the process of teaching an LLM to behave in a certain way. The most common method is known as supervised fine-tuning, which entails providing an LLM with example instructions, each paired with a desirable response. In effect, you’re showing the LLM how it ought to behave.

[Image: an example training dataset]
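
To make that concrete, here is a minimal sketch of what a supervised fine-tuning dataset can look like. The chat-style JSONL layout mirrors the format most fine-tuning pipelines expect; the extraction task, file name, and example listings are made up for illustration.

```python
import json

# A minimal, illustrative training set: each example pairs an instruction
# (system + user messages) with the response we want the model to imitate.
# The task and file name are invented for illustration only.
examples = [
    {
        "messages": [
            {"role": "system", "content": "Extract the price from the listing."},
            {"role": "user", "content": "Charming 2BR bungalow, $450,000, near downtown."},
            {"role": "assistant", "content": '{"price": 450000}'},
        ]
    },
    {
        "messages": [
            {"role": "system", "content": "Extract the price from the listing."},
            {"role": "user", "content": "Spacious loft with skyline views, listed at $820,000."},
            {"role": "assistant", "content": '{"price": 820000}'},
        ]
    },
]

# Write one JSON object per line -- the shape most fine-tuning tools expect.
with open("train.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```

Each row is one demonstration: the instruction and input the model will see, followed by the exact response you want it to learn to produce.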

At a high level, fine-tuning a model is quite similar to training a new employee. The employee starts with a broad understanding of the world at large, but isn’t sure exactly what’s expected in their new job. In addition to providing verbal instructions (prompting), you train your new employee by demonstrating how to handle common scenarios (fine-tuning). By teaching a generally capable LLM exactly how to handle a wide array of sample inputs, you prepare it for novel inputs that are similar to the data it was trained on (the same way an employee can generalize to slight variations in their work).

How are people currently using fine-tuned models? What hasn’t worked?

This is a great question, and the answer is changing quickly. When fine-tuning LLMs was popularized after the release of the Llama 1 7B models, it was primarily used as a tactic to get around censorship limitations imposed by Meta. I’ll leave the reader to do their own research on exactly which kinds of censorship were most commonly avoided by fine-tuning.

With the release of more capable open-weight base models such as Mistral 7B and Mixtral, fine-tuning has become a general-purpose solution for improving performance while reducing cost and latency. Small fine-tuned models are successfully deployed as chatbots, data analysts, summarizers, and almost everything else you can think of. In addition to stronger base models, new fine-tuning techniques like those pioneered by Wing at Axolotl and Daniel at Unsloth have increased the effectiveness of fine-tuning while reducing its cost.

Things small fine-tuned models are good at:

  • Learning desired behavior

  • Developing expertise in a subject

  • Consistency

  • Speed

  • Cost

Things small fine-tuned models are bad at:

  • Handling out-of-domain inputs

  • Deep reasoning ability

What doesn't work? In general, fine-tuned models perform poorly when they’re faced with data that differs significantly from anything they were trained on. For instance, if you fine-tune a model to extract data from real estate listings, then try to use that model as a chatbot for your fast-food restaurant, the results won’t be pretty. The fine-tuned model will perform noticeably worse at a novel task than the original base model would have, because it has forgotten some of the general behavior it possessed before fine-tuning began (a phenomenon known as catastrophic forgetting). This means that if you need your model to frequently adjust to new types of data it wasn’t trained on, fine-tuning isn’t the right solution for you.

Does fine-tuning make sense for me?

Across dozens of cases evaluated by our customers using LLM-as-judge, expert fine-tuned models reliably beat GPT-3.5 Turbo and perform as well as or better than the GPT-4 variants. Engineers fine-tune a different model for each prompt they were previously sending to GPT-4, creating hyper-specialized experts that perform extremely well at their individual tasks. To go back to the employee analogy, this is the equivalent of training one person to be your software developer, another to be your accountant, and a third to be your social media manager. These jobs require different skillsets, and there isn’t a lot of overlap in capability. By contrast, using GPT-4 is like contracting a world-class PhD/engineer/accountant/marketer/doctor/etc. who can do almost anything with no training at all.

But like a highly skilled contractor, GPT-4 commands a premium for its services. GPT-4 is a massive model and occupies more GPUs for a longer period of time to produce each token of its output than a smaller expert model. And keep in mind, you’re paying by the hour (or by the token).

When you ask GPT-4 to perform a task, you’re under-utilizing all the weights that aren’t related to that task in particular. By contrast, when you use an expert model you’ll get an excellent response that required fewer unrelated weights to be loaded onto GPUs, leading to a dramatically lower cost per token and a drop in latency. Even without advanced optimizations like pruning rules, fine-tuning a 7B model on OpenPipe leads to an average 14x savings compared to GPT-4-Turbo (that’s a whopping 32x compared to GPT-4-0613). And as the open-source ecosystem for model hosting continues to develop, we expect those costs to drop rapidly.
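
To see where a multiple like that comes from, here’s a back-of-the-envelope sketch. The per-token prices below are illustrative placeholders, not quoted rates; plug in current list prices to reproduce the exact multiples above.

```python
# Back-of-the-envelope cost comparison. All prices are illustrative
# placeholders (USD per 1M tokens), not quoted rates.
PRICES = {
    "gpt-4-turbo": {"input": 10.00, "output": 30.00},
    "fine-tuned-7b": {"input": 0.80, "output": 1.20},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one request at the assumed per-1M-token prices."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# A typical extraction-style request: a long prompt, a short structured answer.
big = request_cost("gpt-4-turbo", input_tokens=1500, output_tokens=200)
small = request_cost("fine-tuned-7b", input_tokens=1500, output_tokens=200)
print(f"GPT-4-Turbo: ${big:.5f}  fine-tuned 7B: ${small:.5f}  savings: {big / small:.0f}x")
```

The mechanics are simple: the small model charges far less per token, and for prompt-heavy workloads that difference compounds on every request.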

Should everyone fine-tune? If you’re making fewer than 100 requests a day to GPT-4 and you aren’t sensitive to latency, I wouldn’t bother training a model yet. Fine-tuning also isn’t a good fit for engineers who are rapidly updating their prompt as they prototype. If you fall into either of those two categories, I would focus on prompt engineering until you’ve reached scale and your LLM costs, latency or response quality start to hurt.

Conclusion

We'll get to LoRAs, DPO and data flywheels in another post. If you think fine-tuning could be a good fit, start capturing your data through the OpenPipe SDK. It’s a drop-in replacement for the OpenAI SDK, except that it also records your LLM input/output pairs to use as a fine-tuning dataset. As with everything else, the more data the better.
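
For a sense of what that drop-in swap looks like, here’s a sketch in Python. The exact import path, client options, and key names are assumptions based on the drop-in-replacement description above, so check the current SDK docs before relying on them; the API keys and prompt are placeholders.

```python
# A sketch of the drop-in pattern described above. Import path and options
# are assumptions -- verify against the current OpenPipe SDK reference.
from openpipe import OpenAI  # instead of: from openai import OpenAI

client = OpenAI(
    api_key="sk-...",                 # placeholder: your normal OpenAI key
    openpipe={"api_key": "opk-..."},  # placeholder: enables request/response capture
)

# This call behaves like the standard OpenAI SDK; the input/output pair is
# also recorded as a row in your future fine-tuning dataset.
completion = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": "Summarize this support ticket: ..."}],
)
print(completion.choices[0].message.content)
```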

I hope this was helpful and I wish you the best of luck!


About OpenPipe

OpenPipe is the easiest way to train and deploy your own fine-tuned models. It only takes a few minutes to get started and can cut your costs 25x relative to OpenAI while delivering higher quality.
