
Fine-tuning Best Practices Series Introduction and Chapter 1: Training Data

Reid Mayo

Aug 1, 2024

10 minutes

Hi there friend! We’re going to be releasing a series of 3 recorded conversations with Kyle Corbitt, OpenPipe Co-Founder & CEO, on the topic of LLM fine-tuning best practices. Each conversation will be accompanied by a related article. Today's conversation and article will focus on the first chapter (Training Data) listed in the series below:

  1. Chapter 1: Training Data - We’ll explore how to choose the best data, common methods for collecting it, and common methods for shaping it (automated, relabeling, human shaping)

  2. Chapter 2: Models - We’ll explore how to choose the best base model to fine-tune, including trade-off decisions to make, and some tweaks such as hyperparameter optimization

  3. Chapter 3: Evaluations - We’ll explore how to evaluate fine-tuned model performance against baselines/benchmarks, and how to monitor ongoing performance and prevent data drift

Without further ado, let’s jump into it!

Full Interview with Kyle Corbitt

Chapter 1: Training Data Best Practices

What Does Training Data Actually Look Like?

While a variety of input and output data shapes can be used to fine-tune a large language model, the most common format the Gen AI industry is standardizing around is the OpenAI chat completion messages format, where data is typically organized as JSON with the following structure:

Input (i.e. “feature”) data looks like a list of messages. Each message includes a “role”, which indicates the purpose of the message. Roles include system (high-level directives that guide LLM responses, such as formatting or tone, across all assistant responses), user (actual user input), assistant (LLM responses), and tool (helpful for integrating external tools). This is essentially the same data you would supply when making an API call to an OpenAI chat completions endpoint.

Output (i.e. “label”) data again looks like the desired output you would receive from the OpenAI chat completions endpoint: namely, the assistant response message in the OpenAI format.
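For example, a single training example in this format might look like the following (the task and field values here are purely illustrative):

```json
{
  "messages": [
    {"role": "system", "content": "You are a customer support assistant. Answer concisely."},
    {"role": "user", "content": "Where is my order #1234?"},
    {"role": "assistant", "content": "Your order shipped yesterday and should arrive within 2 business days."}
  ]
}
```

The system and user messages are the input, and the final assistant message is the output label the model is trained to reproduce.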

To be clear, while using OpenAI's format is a requirement for fine-tuning on OpenPipe's platform, if you decide to roll your own fine-tune using a lower-level framework you can get away with basic raw input/output data; the OpenAI messages format is not a hard requirement in that scenario. However, you do get a performance boost during the fine-tune by including the full prompt and additional meta information such as the system message.

Strategy for Collecting Optimal Training Data

So now that we know what individual training examples concretely look like, the next question is: what is the optimal strategy for building a high-performance collection of training data?

The fundamental objective of training data is to match the real world input domain you expect to see (or have already seen) as closely as possible. Put another way, the types of prompts your LLM will see in production should be well-represented in your training data.

If you have an existing LLM-based application in production, then you can start by simply logging all the calls sent to your existing LLMs. Log both the input/prompt and the output/completion. By reviewing and analyzing your logs, you'll learn usage patterns and grow your understanding of what areas your training data should cover. Additionally, these real-world logs can later be used as training data!

For developer convenience, OpenPipe provides an SDK that automatically logs all prompt/completion data sent to and received from OpenAI's APIs. Our SDK is fully backwards compatible with the OpenAI SDK, allowing for immediate use as a drop-in replacement. As such, it takes only minutes to start collecting your application's real-world production data for usage as training data.
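As a rough sketch of what that swap can look like (the exact import path and configuration options shown here are assumptions; check the OpenPipe docs for the current SDK interface):

```python
# pip install openpipe
# Drop-in replacement for the OpenAI client: requests go to OpenAI as usual,
# while each prompt/completion pair is also logged for later use as training data.
from openpipe import OpenAI

client = OpenAI(
    # The standard OpenAI key is still read from OPENAI_API_KEY;
    # the OpenPipe key (placeholder below) enables request/response logging.
    openpipe={"api_key": "opk_..."},
)

completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize this support ticket for me."},
    ],
)
print(completion.choices[0].message.content)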

High-Level Optimization Considerations

Comprehensive Coverage of the Task

Training data should have adequate coverage reflective of the actual inputs expected in the real world. If you expect high cardinality of input data in the real world, make sure your training data provides sufficient coverage for all these different cases.

Match Real-World Probability Distributions

Training data should match the probability distribution of input types expected in the real world. For example, if your task is to classify an email as either “spam” or “not spam”, and in the real world only 5% of emails are “spam”, then roughly 5% of your training data set should be examples of “spam”.
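A quick sanity check of this kind is easy to automate. The sketch below assumes a JSONL training file where each example's final assistant message is the classification label (the file name and label values are hypothetical):

```python
import json
from collections import Counter

def label_distribution(path: str) -> Counter:
    """Count how often each label appears as the final assistant message."""
    counts = Counter()
    with open(path) as f:
        for line in f:
            example = json.loads(line)
            label = example["messages"][-1]["content"].strip()
            counts[label] += 1
    return counts

counts = label_distribution("train.jsonl")
total = sum(counts.values())
for label, n in counts.most_common():
    # Compare these shares against what you observe in production (e.g. ~5% "spam").
    print(f"{label}: {n} ({n / total:.1%})")
```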

Allocate More Data to Underperforming Areas

Notwithstanding the above statement about probability distribution of training data matching real world usage, if there’s a specific case where the LLM is performing poorly, additional examples that cover that specific case should be added to the training data – even if doing this throws off the probability distribution.

Match Real-World Formats

Training data should be in the same format as expected in the real world (align as closely as possible, including the same delimiters in the prompt for example). This isn’t a hard requirement, and dropping meta information such as system messages is okay. But all things equal, the closer the alignment between training data input format and real-world input format, the higher the quality of fine-tuned model performance.
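One simple way to keep the two aligned is to build both your production prompts and your training examples from the same template, so delimiters and structure can never drift apart. A minimal sketch (the template text and fields below are made up for illustration):

```python
# Shared between the production code path and the training-data export script,
# so the delimiters and layout are identical in both places.
EMAIL_PROMPT_TEMPLATE = (
    "Classify the email below as 'spam' or 'not spam'.\n"
    "### EMAIL START ###\n{email}\n### EMAIL END ###"
)

def build_messages(email: str) -> list[dict]:
    """Construct the message list used both at inference time and in training data."""
    return [
        {"role": "system", "content": "You are an email classifier."},
        {"role": "user", "content": EMAIL_PROMPT_TEMPLATE.format(email=email)},
    ]
```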

Highly Targeted Manual Optimization

You can patch specific edge cases where the LLM is not performing well. If you have evidence that your LLM is underperforming on certain types of tasks or prompts, whether surfaced by automated evals running in production or by human feedback such as user ratings on LLM responses (or angry emails), manual intervention goes surprisingly far. Modern LLMs are extremely sample-efficient when fine-tuning. Identifying even just 4-5 examples of each failure case, manually correcting the outputs, and then fine-tuning on those corrected examples can often solve the problem going forward.
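Mechanically, patching usually amounts to appending a handful of hand-corrected examples to your existing training file and kicking off another fine-tune. A minimal sketch (file names and the example content are assumptions):

```python
import json

# Flagged production cases where the model got it wrong, with the assistant
# outputs corrected by hand during review.
corrected_examples = [
    {
        "messages": [
            {"role": "system", "content": "You are an email classifier."},
            {"role": "user", "content": "Re: your invoice is attached, please open immediately..."},
            {"role": "assistant", "content": "spam"},  # hand-corrected label
        ]
    },
    # ...4-5 examples per failure mode is often enough
]

# Append the corrections to the existing training set before re-running the fine-tune.
with open("train.jsonl", "a") as f:
    for example in corrected_examples:
        f.write(json.dumps(example) + "\n")
```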

The Fundamental Common Mistake

Often the biggest mistakes we make when building a fine-tuning training dataset come from assuming characteristics about real-world usage that simply turn out not to be accurate. Frequently this means users don't actually prompt our LLMs with the topics or tasks we expect them to. But assumptions can also break down at a more granular level: we might correctly guess the topic or task, yet be wrong about the structure, format, or shape of the data as it arrives from real users.

Fine-tuning is best understood as an optimization strategy to boost performance of your LLMs and lower their cost and latency.

As Donald Knuth famously put it, “premature optimization is the root of all evil.” Consider starting with prompt engineering and getting your application into production so you can start monitoring real-world LLM usage BEFORE fine-tuning your LLMs. When it comes time to fine-tune, it will be much faster to accomplish, and your results will be far superior to fine-tuning up front based on assumptions about usage (and likely 100% synthetic data).

How Much Training Data Do I Need?

The amount of training data required to get the best performance varies considerably depending on the type of task you are trying to accomplish and on the base model you are fine-tuning.

The first general rule is, the larger the base model, and the better the “raw” model performs on your task, the less training data you need to boost ultimate model performance on your task.

For example, on a lot of tasks where a base model like Llama 3.1 70B has decent but not yet acceptable performance, even a training dataset with 100 examples can be sufficient to boost the fine-tuned model's performance to an acceptable level.

But in the same scenario with a smaller base model like Llama 3.1 8B, you’re more likely to need a minimum of 500 examples, and possibly many more.

As noted earlier, when “patching” edge case scenarios, where prompt engineering alone works most of the time but you still see wonky or unacceptable responses from time to time, even just 5-10 examples could get performance where you need it consistently.

The other general rule is that the more complex a task is, the more training data is required to get performance to where it needs to be. On extremely complex tasks, we’ve seen users on OpenPipe’s platform continue to get performance increases at 10,000 and even 20,000 examples, though this is uncommon. Beyond that, noticeable gains from adding additional examples are very rare unless the data mix required to fully represent the task(s) expected in real-world usage is unusually diverse.

How Do I Actually Collect Training Data?

The best source of training data is pre-existing data from your production application. You can collect this data automatically using OpenPipe’s SDK. Or perhaps you have an observability or application performance monitoring solution in place that is logging real-world usage data you can mine from. Consider even just pulling relevant data out of your database.
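If your requests and responses are already sitting in a database, converting them into training examples is usually a short script. A sketch under assumed table and column names (your schema will differ):

```python
import json
import sqlite3

# Assumed schema: llm_calls(prompt_messages TEXT, completion TEXT), where
# prompt_messages is the JSON-encoded list of messages sent to the model.
conn = sqlite3.connect("app.db")
rows = conn.execute("SELECT prompt_messages, completion FROM llm_calls")

with open("train.jsonl", "w") as f:
    for prompt_messages, completion in rows:
        messages = json.loads(prompt_messages)
        # Append the logged completion as the assistant message (the label).
        messages.append({"role": "assistant", "content": completion})
        f.write(json.dumps({"messages": messages}) + "\n")
```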

What About Human Generated Data?

In addition to being expensive to generate, training data created manually by armies of non-specialists really just doesn’t work very well. At this point it’s fairly rare to find a task that GPT-4 performs poorly on but that a human non-domain-expert could generate quality training data for.

Where humans are genuinely valuable for generating training data is as domain experts, or people with proprietary knowledge of the subject matter. Even then, the best approach isn’t to have them manually create a thousand examples by hand; typically the quality of the examples drops off a cliff as the human gets exhausted by the task. A better approach, where possible, is to have human experts review outputs generated by an LLM (perhaps RAG-assisted) and either fix the output manually or simply provide thumbs up/thumbs down feedback.

What About Synthetic Data?

On the “input” side of the training data equation, using synthetic data to scale up your training set is generally not very effective, and it is certainly not a best practice. For example, having a human manually create a set of seed prompts and then using an LLM to scale up the volume by “fuzzing” those inputs for wider theoretical coverage can lead to model quality issues: the resulting LLM won't understand your proprietary task because it was trained primarily on inputs that aren't representative of production workloads. As mentioned before, the ideal source of data on the "input" side is real-world logs of usage from your real users; this is the actual data we’re trying to do better on.

However, on the “output” side of the training data equation, synthetic data can be quite effective. Human-generated output data is expensive and sometimes imperfect, especially if created by non-experts. Even experts are typically better-leveraged by having them review the quality of output that was generated by an LLM. During review they can either manually fix the output for use as training data, or score it in some way. Having humans manually patch edge cases where the LLM is poorly performing is fine too.
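In practice, generating synthetic outputs usually means replaying your real production inputs through the strongest model available and keeping its responses as the labels. A hedged sketch using the OpenAI Python SDK (the file names, prompts, and choice of model are illustrative):

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# production_inputs.jsonl: one {"messages": [...]} per line, without an assistant reply yet.
with open("production_inputs.jsonl") as src, open("train.jsonl", "w") as dst:
    for line in src:
        example = json.loads(line)
        completion = client.chat.completions.create(
            model="gpt-4o",  # strongest model available acts as the "teacher"
            messages=example["messages"],
        )
        # Attach the generated output as the training label.
        example["messages"].append(
            {"role": "assistant", "content": completion.choices[0].message.content}
        )
        dst.write(json.dumps(example) + "\n")
```

The resulting outputs can then be spot-checked or corrected by human experts before fine-tuning.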

Automated Training Data Quality Improvements

The objective of training data is to provide the highest quality outputs, or labels, for given inputs (ideally real-world inputs). This is a truism, given that the entire point of fine-tuning is to improve LLM output quality.

As you might expect, when generating synthetic training data outputs, the more intelligent the LLM used to create output, typically the higher quality the output will be.

However even SOTA models have limitations, and human experts can’t realistically be expected to relabel thousands of outputs manually.

A cutting-edge way to achieve higher quality outputs than a single current SOTA model can produce is to take an agentic approach to output generation.

For example, OpenPipe’s Mixture-of-Agents (MoA) technology, which orchestrates multiple LLMs to generate a superior result, is capable of producing outputs that outperform those of SOTA LLMs from OpenAI and Anthropic. To be clear, using MoA to generate an output is much slower in real time than using a single SOTA LLM. But latency doesn’t matter when you’re building a training dataset; quality does.
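To make the general mixture-of-agents pattern concrete (this is a toy illustration of the idea, not OpenPipe's actual implementation; the model names and synthesis prompt are assumptions), several models each propose an answer and an aggregator model synthesizes a final response:

```python
from openai import OpenAI

client = OpenAI()

def mixture_of_agents(messages: list[dict], proposers: list[str], aggregator: str) -> str:
    """Toy sketch: several proposer models answer the same prompt, then an
    aggregator model combines their answers into a single, higher-quality output."""
    proposals = []
    for model in proposers:
        resp = client.chat.completions.create(model=model, messages=messages)
        proposals.append(resp.choices[0].message.content)

    synthesis_prompt = (
        "You are given several candidate responses to the same request. "
        "Combine their strengths into one best response.\n\n"
        + "\n\n".join(f"Candidate {i + 1}:\n{p}" for i, p in enumerate(proposals))
    )
    final = client.chat.completions.create(
        model=aggregator,
        messages=messages + [{"role": "user", "content": synthesis_prompt}],
    )
    return final.choices[0].message.content
```

Since this is run offline to build a dataset, the extra latency of multiple calls per example is an acceptable trade for higher-quality labels.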


About OpenPipe

OpenPipe is the easiest way to train and deploy your own fine-tuned models. It only takes a few minutes to get started and can save you 25x relative to OpenAI with higher quality.
