Analyzing OpenAI’s Reinforcement Fine-Tuning: Less Data, Better Results
Kyle Corbitt
Dec 30, 2024
6 minutes
A few weeks ago, OpenAI announced Reinforcement Fine-Tuning (RFT). This post covers the technical details of how it works and the types of tasks it's best suited for, and explains why it's a major breakthrough that lets LLMs be applied to even more complex custom tasks.
At a high level, RFT helps a reasoning model such as o1 adapt its thinking to new domains more effectively. And importantly, compared to standard supervised fine-tuning (SFT), RFT allows models to learn from very small datasets—“a few dozen” examples, according to OpenAI.
This is a significant breakthrough. Training and inference costs have dropped precipitously, but collecting enough high-quality labeled data remains a bottleneck for applying AI to complex, novel tasks. So reducing the data required by an order of magnitude or more is a big deal!
RFT: How Does It Work?
OpenAI’s internal implementation of RFT is not public, but we can make an educated guess about how it works based on their public description and related research. Here are the high-level steps:
The user uploads a dataset with easily verifiable outputs (this part is important!). Tasks like classification or information extraction, where there is a clear “right” or “wrong” answer, work well. OpenAI puts it this way: “Reinforcement Fine-Tuning excels at tasks where the outcome has an objectively ‘correct’ answer that most experts would agree with.” In the launch video, OpenAI’s example task involved mapping a medical case report to a specific genetic mutation.
Using that dataset, the RFT training process performs the following loop (sketched in code after the list):
Take a batch of dataset inputs, and generate a reasoning trace and output for each one. Here's a toy example: given a short passage of text, the model reasons about which emotion the author is expressing and outputs a ranked list of its top three guesses, while the dataset label is a single correct emotion.
The grader scores each generated output. The earlier the correct answer appears in the list, the higher the score. In our example we give a score of 1 if the correct answer is in the first position, 0.5 if it's in the second position, and 0 if it's in the third position or not listed at all.
Use PPO or a similar reinforcement learning technique to update the model weights, favoring generated outputs that receive higher grades.
Repeat until the model stops improving.
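To make that loop concrete, here's a minimal sketch in Python. It's my own guess at the structure, not OpenAI's implementation: generate_with_reasoning, grade, and rl_update are hypothetical placeholders standing in for the reasoning model's sampler, the grader, and a PPO-style optimizer.

```python
import random

# Toy dataset for the emotion-ranking example: each item pairs an input
# text with the single correct emotion label.
dataset = [
    {"text": "I can't believe they cancelled my flight again.", "label": "frustration"},
    {"text": "She got the job offer this morning!", "label": "joy"},
    # ...a few dozen hand-labeled examples in total
]

def generate_with_reasoning(model, text: str) -> list[str]:
    """Placeholder: sample a reasoning trace from the model and return its
    final answer, a ranked list of candidate emotions."""
    raise NotImplementedError

def grade(ranking: list[str], label: str) -> float:
    """Placeholder grader; a concrete version appears in the next section."""
    raise NotImplementedError

def rl_update(model, generations: list[dict], rewards: list[float]) -> None:
    """Placeholder: PPO-style weight update that reinforces generations
    with higher rewards."""
    raise NotImplementedError

def rft_training_loop(model, dataset, batch_size: int = 8, num_steps: int = 1000):
    for _ in range(num_steps):
        # 1. Take a batch of inputs and generate a reasoning trace + output for each.
        batch = random.sample(dataset, k=min(batch_size, len(dataset)))
        generations, rewards = [], []
        for example in batch:
            ranking = generate_with_reasoning(model, example["text"])
            generations.append({"example": example, "ranking": ranking})
            # 2. The grader scores each generated output.
            rewards.append(grade(ranking, example["label"]))
        # 3. Update the model weights to favor higher-scoring generations.
        rl_update(model, generations, rewards)
        # 4. In practice you'd track a validation metric and stop once it plateaus.
```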
Let’s Talk About That Grading Function!
You may be wondering why, in our toy example, we ask the model to output a ranked list, instead of simply asking it to predict the most likely emotion—after all, the dataset output we’re training on just includes a single emotion. We do this because it allows our grader to assign partial credit to an answer that includes the correct response in a later position.
It turns out that being able to assign partial credit is really useful. In reinforcement learning terms, this makes our reward function more dense. By giving partial credit for a correct answer that isn't in the top position, we provide a more granular reward signal, one that acknowledges the model is "on the right track" even if it isn't fully correct. This helps stabilize and speed up training, since the model doesn't have to wait for a perfect response to receive positive feedback. Instead, it learns from incremental improvements toward the correct output.
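As a concrete illustration, here's what the toy grader described above might look like (my own sketch; OpenAI's actual grader configuration may differ):

```python
def grade(ranking: list[str], label: str) -> float:
    """Score a ranked list of predicted emotions against the single correct label:
    full credit if the correct answer is first, partial credit if it's second,
    nothing if it's third or missing entirely."""
    if ranking and ranking[0] == label:
        return 1.0
    if len(ranking) > 1 and ranking[1] == label:
        return 0.5
    return 0.0
```

A grader that only checked the top prediction would return 0 for every near miss, giving the model far less signal to learn from early in training.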
When Should You Use RFT?
There are three main qualifications that make a task a good match for RFT:
The task is difficult. (If the task is simple, you may not need any fine-tuning at all.)
Outputs are easy to verify. Because RFT requires grading each output, the task should have a clear verification mechanism. This works well for classification or structured information extraction, but might be less feasible for tasks like summarization or open-ended conversation.
Labeled data is hard to collect. If you have a lot of pre-labeled data, you will likely get good results with SFT and won't need to resort to RFT, which is more complicated, slower, and more expensive at both training and inference time.
One interesting implication of the third point above is that for very high-volume tasks, RFT may be a useful "stepping stone" towards a more optimized, classical SFT model. Let's say you have 5 million PDFs, and you need to perform a complicated data extraction task on each of them. You might design your pipeline in the following way (sketched in code after the list):
Use an expert human to hand-label 50-100 examples.
Use those examples as the training data to create an RFT model that performs the task well.
Use your new RFT model to machine-label an additional 20K examples.
Use those 20K examples to train a simpler, faster LLM to do the same task using SFT.
Use your simpler, faster LLM to label the remaining ~5M documents.
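Here's that pipeline as a rough sketch. Every function is a hypothetical stand-in for the corresponding stage (none of these are real APIs); the point is just how data flows from a tiny hand-labeled seed set to a fast model that can label millions of documents.

```python
def hand_label(docs: list) -> list:
    """Step 1: an expert hand-labels a small seed set (50-100 examples)."""
    raise NotImplementedError

def train_rft(seed_examples: list):
    """Step 2: reinforcement fine-tune a reasoning model on the seed set."""
    raise NotImplementedError

def train_sft(labeled_examples: list):
    """Step 4: supervised fine-tune a simpler, faster model on machine labels."""
    raise NotImplementedError

def run_pipeline(all_docs: list) -> list:
    seed = hand_label(all_docs[:100])                          # step 1: expert labels
    rft_model = train_rft(seed)                                # step 2: RFT from the tiny seed set
    silver = [rft_model(doc) for doc in all_docs[100:20_100]]  # step 3: machine-label ~20K docs
    fast_model = train_sft(silver)                             # step 4: distill into a faster SFT model
    return [fast_model(doc) for doc in all_docs]               # step 5: label the remaining ~5M
```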
The Road Ahead: Open Source RFT?
Hopefully this post has helped build intuition around the types of scenarios where RFT is potentially useful. But there's one more thing I'm excited to share!
We’re leading an active project to develop an open-source RFT implementation to fine-tune reasoning models like Qwen’s QwQ. Early results are promising, but we want to test on more datasets before releasing. If you’re a researcher interested in collaborating on this project—or you have a dataset well suited to RFT—please reach out to me directly at kyle@openpipe.ai. I’d love to get you involved!