One Right Answer or Many? A Useful Distinction for Evaluating and Fine-Tuning LLMs
Kyle Corbitt
Jan 14, 2025
9 minutes
I recently wrote about reinforcement fine-tuning (RFT). In that post, I defined the class of tasks for which RFT is a good fit in this way:
[RFT] works well for classification or structured information extraction, but might be less feasible for tasks like summarization or open-ended conversation.
This distinction actually comes up a lot, not just in relation to RFT! “Best practices” around evaluation and fine-tuning techniques tend to differ significantly depending on whether your task has one right answer (I’ll call these deterministic tasks), or many potential right answers (I’ll refer to these as freeform).
In this post I’ll work to sharpen that distinction and share some of the techniques that best lend themselves to each.
Deterministic or freeform? Let’s define terms.
Deterministic tasks have one (or a few) correct outputs for a given input, and the model should always produce the correct output. This category includes most classical ML workloads such as classification and information extraction, as well as newer flows such as copilots that convert a natural-language user intent into a concrete action.
(Figure: example deterministic tasks)
Freeform tasks have many (sometimes infinite) “correct” outputs for a given input. Things like writing summaries, drafting emails, or writing chatbot responses are all freeform tasks.
(Figure: example freeform tasks)
A good heuristic is to ask yourself the following question: “given a known-correct output, can I write a simple piece of code that checks whether an unknown output is also correct?” If the answer is “yes”, your task is deterministic. Otherwise, it’s probably freeform.
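To make that heuristic concrete, here's a minimal sketch (in Python, with made-up field names) of what such a check might look like for a structured-extraction task:

```python
import json

def is_correct(known_good: dict, candidate_output: str) -> bool:
    """Check whether a candidate model output matches the known-correct answer
    for a structured-extraction task. The field names here are illustrative."""
    try:
        candidate = json.loads(candidate_output)
    except json.JSONDecodeError:
        return False  # malformed JSON is automatically wrong
    # Exact match on the fields we care about; extra keys and key order don't matter.
    return all(candidate.get(key) == value for key, value in known_good.items())

# If you can write a check like this for your task, it's deterministic.
print(is_correct(
    {"invoice_id": "A-1042", "total": 99.5},
    '{"invoice_id": "A-1042", "total": 99.5, "currency": "USD"}',
))  # True
```

If the best you can do is ask a human (or another model) whether the output reads well, you're probably in freeform territory.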
So how common are these?
LLMs are widely used for both deterministic and freeform tasks. To get a sense of their relative popularity, I wrote a quick script to go through 1000 recent datasets on OpenPipe and classify each one as “deterministic” or “freeform”, as well as its high-level category. The breakdown showed that (among OpenPipe users) about 63% of tasks are freeform, and 37% are deterministic.
Interestingly though, when limiting to the top 30 datasets by inference usage, we find the opposite pattern—60% are actually deterministic tasks! We don’t have visibility into specific tasks to know exactly what’s going on here, but I’d hypothesize that it’s because computers usually consume the output of deterministic tasks but humans usually consume the output of freeform tasks, and computers can process a lot more information than humans.
With that established, let’s get into the concrete differences in technique!
Difference #1: ideal temperature
One relatively well-known difference between deterministic and freeform tasks is the ideal temperature to use at generation time. A higher “temperature” makes a model more creative by introducing a bit of randomness into the sampled tokens. This can be helpful for freeform tasks, but is almost always harmful for deterministic ones. We recently evaluated a deterministic task where dropping the temperature from 0.7 (the default used by OpenAI and many other providers) to 0 increased benchmark performance from 71% to 76%, a significant jump!
In general, I recommend using temperature=0 for deterministic tasks and evaluating temperatures in the 0.7 to 1.0 range for freeform ones.
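With an OpenAI-compatible client, that just means pinning the temperature parameter per task; the model name and prompts below are placeholders:

```python
from openai import OpenAI

client = OpenAI()

# Deterministic task (e.g. classification): remove sampling randomness entirely.
classification = client.chat.completions.create(
    model="gpt-4o-mini",  # example model name
    temperature=0,
    messages=[{"role": "user", "content": "Classify this support ticket: 'My card was charged twice.'"}],
)

# Freeform task (e.g. drafting an email): keep some randomness for variety.
draft = client.chat.completions.create(
    model="gpt-4o-mini",
    temperature=0.8,
    messages=[{"role": "user", "content": "Draft a friendly follow-up email to a customer."}],
)
```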
Difference #2: evaluations
Evaluating deterministic tasks
The beautiful thing about deterministic tasks is that model performance can be relatively easily evaluated. Typically this involves gathering a high-quality “golden dataset” containing many inputs and known-correct outputs. Once this is done you can easily evaluate a new prompt or model by running it against all of your inputs, gathering the new outputs, and programmatically comparing them to your known-correct outputs.
It’s slightly more complex than that, of course: you may care much more about some error types than others, you need to make sure your golden dataset is reasonably representative of the inputs your model will see in production, you need to ensure important edge cases are covered, and so on. But these are all solvable problems, and ones that can be improved over time by continually augmenting the golden dataset with failing examples you find.
This ease of evaluation makes optimization much easier. You can experiment on different prompts and models with more confidence if you have a reliable way of measuring performance! Because of this, for deterministic tasks I recommend building at least a seed golden dataset and evaluation suite first, before even starting to iterate on prompts or models.
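As a rough sketch (assuming you already have a golden dataset of input/expected pairs, a generate() function wrapping the prompt or model under test, and a programmatic is_correct() check like the one earlier), the core evaluation loop is only a few lines:

```python
def evaluate(golden_dataset, generate, is_correct):
    """Run the current prompt/model over every golden example and report accuracy.

    golden_dataset: list of {"input": ..., "expected": ...} dicts
    generate:       calls the prompt/model under test on a single input
    is_correct:     programmatic check comparing an output to the expected answer
    """
    failures = []
    for example in golden_dataset:
        output = generate(example["input"])
        if not is_correct(example["expected"], output):
            failures.append({**example, "got": output})
    accuracy = 1 - len(failures) / len(golden_dataset)
    # The failures are the most valuable part: inspect them, and fold the
    # interesting ones back into the golden dataset over time.
    return accuracy, failures
```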
Evaluating freeform tasks
On the other hand, freeform tasks are much harder to evaluate in practice. Having a “golden dataset” may not be that helpful: if you’re writing a chatbot, knowing that one assistant message is good doesn’t really help you decide whether a different candidate message is also good.
Despite this difficulty, some kind of evaluation feedback loop is necessary to iterate on your system without introducing regressions. There are a number of approaches we see users take in practice:
Vibe checks: this is the “default” evaluation method. Run several different inputs through your system, look at the outputs, and decide if they look reasonable. If they don’t, try another model or tweak the prompt. This is often the most practical place to start.
LLM-as-judge: Use another model to review the freeform outputs and decide whether each one is “good” or “bad”. This is particularly useful for detecting specific, concrete failure modes. For example, if you want to make sure your chatbot always talks in friendly, informal language, you might define an LLM-as-judge prompt that flags messages that are too formal (a minimal sketch follows this list). If you go this direction, it’s important to align your judge prompt to ensure it’s detecting the failure cases you expect. Who Validates the Validators is a good paper on this, and software tools like AlignEval or OpenPipe’s Criteria can help with the process as well.
User Feedback: allow end users to upvote/downvote responses, propose changes, request regenerations, etc. Track this feedback over time to determine whether model performance is getting better or worse.
Business Metrics: your system is designed to achieve some outcome (solve support tickets, sell a product, increase user engagement, whatever). Measure that metric, and see whether changes to your AI system have any impact on it.
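To make the LLM-as-judge option concrete, here's a minimal sketch for the "too formal" criterion mentioned above; the judge model and prompt wording are illustrative, and in practice you'd want to align them against human labels first:

```python
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are reviewing a single chatbot message sent to a customer.
Reply with exactly one word: PASS or FAIL.
FAIL if the message is stiff, overly formal, or corporate-sounding.
PASS if it reads as friendly and informal."""

def passes_informality_check(message: str) -> bool:
    """Return True if the judge model thinks the message is suitably informal."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # example judge model
        temperature=0,        # judging pass/fail is itself a deterministic task
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": message},
        ],
    )
    return response.choices[0].message.content.strip().upper() == "PASS"
```

Note that the judge's own pass/fail decision is effectively a deterministic task, which is what lets you check it programmatically at scale.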
These methods all have relative pros and cons: the later ones take more work to set up, but give you progressively higher-signal feedback. In practice, when deploying LLMs on freeform tasks, we see most people start with vibe checks and then roll out the later methods as necessary.
Difference #3: fine-tuning
Fine-tuning is the process of specializing an LLM on a specific task by training it on many example inputs and outputs. Fine-tuning can be a good fit for both deterministic and freeform tasks. And the most basic techniques, such as using supervised fine-tuning to train a smaller LLM on a larger model’s outputs, work great for both. If you’re looking for better performance beyond that though, the techniques to explore can look quite different!
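For reference, basic supervised fine-tuning just needs a file of example conversations. In the chat-format JSONL accepted by most fine-tuning APIs (OpenAI's included), each line is one conversation ending in the assistant output you want the model to imitate; the content below is made up:

```python
import json

example = {
    "messages": [
        {"role": "system", "content": "Extract the invoice fields as JSON."},
        {"role": "user", "content": "Invoice A-1042, total due $99.50."},
        {"role": "assistant", "content": '{"invoice_id": "A-1042", "total": 99.5}'},
    ]
}

# Append one conversation per line to build up the training file.
with open("train.jsonl", "a") as f:
    f.write(json.dumps(example) + "\n")
```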
Fine-tuning for deterministic tasks
When fine-tuning on deterministic tasks, the following approaches often work well:
RFT. Reinforcement fine-tuning is a new technique that trains a reasoning model like o1 to operate in a new domain. Its training process requires outputs to be easily verifiable as good or bad, which makes it a good fit for deterministic tasks.
Smaller models. While not universally true, you can often train a far smaller model for deterministic tasks like classification and information extraction without degrading performance. We’ve seen projects where customers successfully move from GPT-4o to a fine-tuned Llama 3.2 3B, cutting costs by >100x and dropping latency significantly while maintaining quality.
Design for logprobs. If you’re designing a classification task, consider asking the model to output a single distinct token for each possible class [1]. That way you can examine the logprobs of each token and get an idea not only of the most likely class, but also the model’s confidence in it and the runners-up (see the sketch after this list). If you’re going to self-deploy, you can also slightly modify the architecture of your model and train it with a classification head instead of a language modeling head, which makes it output class probabilities directly instead of tokens and has some small theoretical benefits in model performance.
Consider alternative architectures. Tasks like classification and span extraction can be completed effectively by relatively small encoder models like ModernBERT, oftentimes at similar or higher levels of performance even when compared to larger decoder models like Qwen 2.5 or Llama 3. Because of their small size, they have far higher throughput, lower latency, lower costs, and can easily be deployed on the edge. Deploying a new architecture may require more substantial changes to your application but can be worth it for high-volume or low-latency applications.
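Here's what the logprobs approach from the list above can look like in practice, assuming an OpenAI-compatible API and a made-up ticket-routing task where each class is a single distinct token:

```python
import math
from openai import OpenAI

client = OpenAI()

PROMPT = """Classify the support ticket into exactly one category.
Reply with a single letter: A (billing), B (technical), C (other).

Ticket: My card was charged twice this month."""

response = client.chat.completions.create(
    model="gpt-4o-mini",  # example model; a fine-tuned model is used the same way
    temperature=0,
    max_tokens=1,
    logprobs=True,
    top_logprobs=3,
    messages=[{"role": "user", "content": PROMPT}],
)

# The top logprobs of the single generated token give a probability for each
# class, not just the winner.
for candidate in response.choices[0].logprobs.content[0].top_logprobs:
    print(candidate.token, f"{math.exp(candidate.logprob):.1%}")
```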
Fine-tuning for freeform tasks
Consider DPO. Direct Preference Optimization (DPO) is a technique for training your model on pairs of outputs, one “accepted” and one “rejected”. This allows you to directly “show, not tell” the model what good looks like. In practice this can work much better than prompting to nudge the model in a desired direction (see the example pair below).[2]
Consider RLHF. DPO operates on static preference pairs, but there are many other ways to incorporate user feedback into model behavior. We’ve had success training a reward model to predict actual business metrics (e.g. case resolution, user engagement, appointments booked), and then improving the fine-tuned model with PPO or one of its modern variants, optimizing directly against that reward model.
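To make the DPO bullet concrete, a single training example is just a prompt plus a preferred and a rejected completion. This is roughly the prompt/chosen/rejected layout expected by common open-source trainers (e.g. TRL's DPOTrainer), with made-up content nudging the model toward a friendlier tone:

```python
preference_pair = {
    "prompt": "Write a short reply to a customer asking about their refund status.",
    # The output we want the model to move toward.
    "chosen": (
        "Hey Sam! Good news: your refund went out this morning and should hit "
        "your account in 2-3 business days. Give me a shout if it doesn't show up!"
    ),
    # The output we want the model to move away from.
    "rejected": (
        "Dear Customer, per our records your refund request has been processed "
        "in accordance with company policy. Please allow 5-10 business days."
    ),
}
```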
Key takeaways
Use temperature=0 for deterministic tasks and higher temperatures for freeform tasks.
Evaluations differ: deterministic tasks can leverage golden datasets, while freeform tasks may need vibe checks, LLM-as-judge, or user feedback.
Fine-tuning options: RFT works well for deterministic tasks, while freeform tasks benefit from preference-based methods like DPO or RLHF.
Smaller fine-tuned models can drastically cut inference costs and latency, especially for deterministic tasks.
We’ve found the deterministic vs freeform distinction very helpful in practice when evaluating different techniques and deciding which ones to try for a specific problem. I hope you find it useful as well!
[1]: You do have to be careful that each option maps to a distinct token. If two categories that the LLM might output share a token prefix, their probabilities will “add up” on that shared first token and make those options look artificially more probable than they actually are.
[2]: One practical advantage of DPO is that it is much more data efficient than SFT: a few examples with selected and rejected outputs can meaningfully change model behavior. However, this can be a double-edged sword; it’s also much easier to overfit while training with DPO and start making the model worse instead of better. One way around this is online DPO, which iteratively improves a model by judging pairs of outputs from the model undergoing training. However, this requires a more complicated training loop with some way to provide feedback in real time on arbitrary generations.