
Fine-tuning Best Practices Chapter 2: Models

Reid Mayo

Aug 28, 2024

10 minutes

Welcome back! This week we’re continuing our series on the best practices surrounding LLM fine-tuning with a focus specifically on models themselves.

Keep in mind that the best way to improve the quality of a fine-tuned model is through better selection and curation of your training data. So if you missed it, I highly recommend reviewing the first chapter in our series which covers training datasets. Even so, there are meaningful implications to consider when choosing a base model to fine-tune, so today we’re sharing insight into the factors that will help you make the best decision there.

Accordingly, I sat down with Kyle Corbitt, OpenPipe Co-Founder and CEO, to flesh out this topic in detail.

Let’s dive right in!

Chapter 2 Full Interview with Kyle Corbitt

Chapter 2 Model Best Practices

OpenAI vs Open Source Models: Model Performance

TL;DR - Lots of data? Go open source! Less data? Go OpenAI! Need the highest performance? Go open source!

In comparing proprietary to open source, we’ll set aside non-OpenAI proprietary models like Google’s Gemini Flash, because as it currently stands, the vast majority of real-world proprietary model fine-tunes are still built on OpenAI.

Currently OpenAI allows for fine-tuning GPT-4o and GPT-4o mini. We’ll skip discussion of older models like GPT-3.5 Turbo since there’s not much of a reason to use them outside of legacy use cases.

OpenAI models have strong performance with less data

We’ve found OpenAI’s SOTA models are able to achieve very strong performance with relatively small training datasets. For example, many of our customers achieve great results by fine-tuning with just 10 or 20 high-quality examples. This is likely because the base models themselves are already quite strong, so it simply doesn’t take much extra training to reach a sufficient performance threshold on your specific task.

OpenAI models hit lower performance ceilings

Strong performance on small data is a real strength, but the downside we’ve seen helping our customers train hundreds of these models at OpenPipe is that diminishing returns on additional training data kick in quickly. After several hundred examples in a training dataset, additional performance gains come more slowly, and they often plateau after a few thousand examples.

Open Source models have higher performance ceilings but require more data

When fine-tuning open models such as Llama 3.1 8B or 70B, the ceiling on performance is higher, but it may take more training data to get there. Typically you’ll need a few hundred to a few thousand examples before a fine-tuned open source model reaches performance parity with a fine-tuned OpenAI model, but from there you can push open source performance even further and hit a higher ceiling by supplying additional data.

So ultimately, if you’ve got the data, fine-tuned open source models outcompete fine-tuned OpenAI models on the same downstream tasks (when fine-tuned using the same dataset).

OpenAI vs Open Source Models: Other Factors

TL;DR - most orgs fine-tune open source models today, and control and ownership of technology is the decisive factor.

Ease of Use

Initially it was very difficult to fine-tune LLMs without making use of a proprietary provider like OpenAI. While it was certainly possible to roll your own solution, the nuances of the problem were difficult to get right. Implementation mistakes were easy to make and commonly led to problems like "catastrophic forgetting," where a ham-fisted fine-tune effectively overwrites the general-purpose knowledge the model relies on.

More recently, open source model providers have caught up on the ease-of-use front. OpenPipe is an easy platform to get started with for fine-tuning, and many inference providers now offer OpenAI-compatible endpoints for open source models, which makes them easy to drop into existing tooling.
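
For instance, here’s a minimal sketch of swapping a fine-tuned open source model into code already written against the OpenAI Python SDK. The endpoint URL and model ID below are placeholders, not real values:

```python
from openai import OpenAI

# Point the standard OpenAI client at an OpenAI-compatible provider.
client = OpenAI(
    base_url="https://your-inference-provider.example.com/v1",  # placeholder URL
    api_key="YOUR_PROVIDER_API_KEY",
)

response = client.chat.completions.create(
    model="your-org/your-finetuned-llama-3.1-8b",  # placeholder model ID
    messages=[
        {"role": "user", "content": "Route this email: 'My invoice total looks wrong.'"},
    ],
)
print(response.choices[0].message.content)
```

The rest of your application code stays exactly the same; only the base URL, key, and model name change.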

From an ease-of-use point of view, fine-tuning both proprietary and open-source models is very approachable at this point.

Cost

Originally, both ease of use and cost were major factors when deciding between fine-tuning open source and proprietary models. When open source was most challenging to deploy was also when the largest cost optimizations were possible. Perhaps for that very reason – and this is 100% speculation on my part – as open source has become more accessible, OpenAI and other closed providers have released newer models at price points notably closer to what you experience using open source models.

To be clear, you can achieve substantial cost savings using open source over proprietary models, anywhere from roughly 30% cheaper to a third of the price, but it’s not the "order of magnitude" difference in cost it used to be.

Control

Control can be an important factor when determining whether to use an open-source or proprietary model.

There are material risks involved in building Gen AI applications and technology on top of closed providers like OpenAI. Models can be deprecated or switched out behind the scenes with little visibility or notice.

Many companies choose to fine-tune open source models for this reason. We see more companies deploying fine-tuned open source models than proprietary ones, and this trend appears to be growing stronger over time.

Furthermore, if strict data access control is required during the entire fine-tuning process, companies can deploy end-to-end fine-tuning solutions such as OpenPipe entirely within their VPC.

Given this shift from proprietary to open source models, we’ll focus entirely on open source models for the rest of this article.

Model Size: How Large is Large Enough?

TL;DR - always use the smallest model you can "get away with" to optimize on a variety of dimensions.

As a general rule you typically want to use the smallest model that provides acceptable performance for your specific task. Increasing model size increases costs across a variety of dimensions that we’ll cover below.

The Decisive Factors in Choosing (Training Dataset & Task Difficulty)

On Training Datasets

If you have a smaller or perhaps lower quality training dataset and you want to use open source models (maybe as a hard requirement for control reasons), then all things equal you will need to fine-tune a larger base model.

On Task Difficulty

Some tasks are inherently difficult. Imagine you’re a law firm building a model to analyze data from a plaintiff intake form and initial consultation to determine whether the plaintiff has a strong case that you should take on. The complexity of the task means it will likely require a larger base model.

Putting It All Together

If your task is relatively simple, e.g. "take a support email and route it to either customer support or billing," then there's a strong chance a small fine-tuned model will work great for your task. If you have a lot of existing data, such as a massive database of historical emails and the departments they were originally routed to, you can easily whip that data into a training dataset that will get the small model performing correctly essentially 100% of the time.
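
As a rough sketch of what that conversion could look like, here’s one way to turn historical routed emails into chat-format training examples. The file names and field names are hypothetical, not from any actual schema:

```python
import json

SYSTEM_PROMPT = "Route the email to exactly one department: 'support' or 'billing'."

# Hypothetical input: one JSON object per line with the email body and
# the department it was historically routed to.
with open("historical_emails.jsonl") as src, open("train.jsonl", "w") as out:
    for line in src:
        record = json.loads(line)  # e.g. {"body": "...", "department": "billing"}
        example = {
            "messages": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": record["body"]},
                {"role": "assistant", "content": record["department"]},
            ]
        }
        out.write(json.dumps(example) + "\n")
```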

Large Model "Costs"

The reason you always want to use the smallest model that can successfully achieve your task is that larger models impose larger costs:

Financial Cost of Inference

Larger models cost more at inference time because there’s fundamentally more compute that has to occur during inference. Simply put, running inference on a larger model takes more GPU resources – and you’re going to pay for that.

Latency Cost of Inference

Not only does running inference on a larger model require more (expensive) compute resources, there are hard limits to how much of this compute can be done in parallel. As such, larger models are materially slower than smaller models at inference time. There are some techniques like speculative decoding, Mixture of Experts and certain types of quantization that can help with this, but typically you will see higher latency with larger models, all else being equal.

Infrastructure Flexibility Challenges

Larger models require more complicated hardware setups to host. A Llama 3.1 70B model is typically deployed across four to eight H100 GPUs; with quantization you can maybe get that down to two H100s.

On the other hand, you can throw a Llama 3.1 8B model on a single cheaper GPU like an A10 or 4090. Depending on your performance requirements, you can even run it on CPU on any random server. You just have way more flexibility in your realistic deployment options if your model is smaller.
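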

Hyperparameter Tuning

TL;DR - use platform-provided sensible defaults (OpenPipe’s are great!), or quickly test a range of values in parallel. Don’t spend too much time here!

I’ve left the topic of hyperparameter (hparam) tuning for the end, because it’s often the last thing to optimize for after you’ve got your dataset and base model right.

There are many hparams involved in fine-tuning that can be tweaked, but the following three are commonly changed (the sketch after this list shows where each one is set in a typical open-source training stack):

  1. Number of Epochs: This is the number of times your model sees each entry in the training dataset. A higher number increases model performance on tasks similar to your training data, but go too high and your model will overfit (do well on training data but struggle to generalize to new, unseen data). Common values are between 1 and 5 depending on the task and the amount of data available. In general, if you have large amounts of quality data it’s better to use more data and fewer epochs. However, if you only have a small amount of data, it can be better to run it for more epochs.

  2. Learning Rate: This scales how large a step the model’s parameters take during each update. Too low and the model won’t learn enough during tuning; too high and parameter updates can overshoot the ideal values (and the next step can overshoot back the other way, so training never self-corrects). The right value is model-, optimizer-, and task-dependent, but it’s often in the 1e-6 to 1e-3 range. At OpenPipe we often use 2e-4 as a starting point in experiments.

  3. LoRA Rank: This hparam is specific to training Low-Rank Adapters (LoRAs), a common technique that limits training to a small fraction of the total weights. LoRA fine-tuning is popular because it requires less memory and can be easier to deploy. When training LoRAs, the number of trainable parameters each adapter has is proportional to its rank. Higher ranks train more parameters and better approximate a full fine-tune; lower ranks are easier to train and deploy, and are less susceptible to "catastrophic forgetting." Sebastian Raschka has an excellent post detailing the tradeoffs. Common values for LoRA rank are 8 to 64.
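
To make those three hparams concrete, here’s a minimal sketch of where each one is set in one common open-source stack, Hugging Face transformers plus peft. The specific values are illustrative, not recommendations for your task:

```python
from transformers import AutoModelForCausalLM, TrainingArguments
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

# LoRA config: only a small fraction of weights is trained.
lora_config = LoraConfig(
    r=16,                          # LoRA rank: common values are 8 to 64
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

training_args = TrainingArguments(
    output_dir="out",
    num_train_epochs=3,            # epochs: commonly 1 to 5
    learning_rate=2e-4,            # the starting point mentioned above
    per_device_train_batch_size=4,
)
```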

Use Sensible Defaults Most of the Time

Fine-tuning platforms like OpenPipe provide high-performing hparam defaults based on a number of heuristics, such as the base model size, dataset properties, and other factors. It’s usually a good idea to start with these defaults. However, we also provide the ability to override them for deeper experimentation.

Even When Maximum Performance is Required — Keep it Simple!

Before optimizing hparams, it’s important to have a robust evals suite in place to measure the performance of your fine-tuned LLM with tight precision. You’ll need it to detect the (often small) gains that tweaking hparams can provide.
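
As a trivial example of the kind of metric such a suite might compute for the email-routing task above (a real evals suite would track much more, such as per-class accuracy and confidence intervals):

```python
def exact_match_accuracy(predictions: list[str], labels: list[str]) -> float:
    """Fraction of predictions that exactly match their labels (case-insensitive)."""
    assert len(predictions) == len(labels) and labels
    matches = sum(p.strip().lower() == l.strip().lower()
                  for p, l in zip(predictions, labels))
    return matches / len(labels)

# e.g. exact_match_accuracy(model_outputs, held_out_departments) -> 0.97
```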

Furthermore, you should have a fine-tuning budget you’re comfortable spending a bit less efficiently. For many teams it may not be worth the engineering effort required to develop an hparam experiment plan. Your time is often better spent reinvesting in improving the quality of your training data.

OpenPipe MoA: Outperform GPT-4 at 1/25th the Cost

About OpenPipe

OpenPipe is the easiest way to train and deploy your own fine-tuned models. It only takes a few minutes to get started, and it can cut your costs by 25x relative to OpenAI while delivering higher quality.
