Technical

Using GRPO to Beat o1, o3-mini and R1 at "Temporal Clue"

Brad Hilton, Kyle Corbitt

Mar 6, 2025

7 minutes


In this post we’ll discuss how we used Group Relative Policy Optimization (GRPO) to surpass R1, o1, and o3-mini, and come within a couple percentage points of Sonnet 3.7, on a reasoning-heavy game called “Temporal Clue”, while being over 100x cheaper to run at inference time. We’ll include specific lessons learned about task design and hyperparameters we’ve found to work well. Finally, we share the training recipe we used to achieve these results, built on top of torchtune.

Background

Since OpenAI launched its powerful new o-series of reasoning models last year, we've seen rapid progress in Large Language Models (LLMs) trained with Reinforcement Learning (RL). Leading organizations like Google DeepMind, Alibaba, DeepSeek, and Anthropic quickly followed suit, training their own advanced models to reason with long “chains-of-thought” (CoT), taught with reinforcement learning on verifiable problems. Many previously challenging benchmarks—in areas like mathematics and coding—now approach saturation.

Yet despite these impressive strides, logical deduction remains stubbornly difficult for even today's best models. LLMs typically struggle to consistently attend to all relevant details, maintain logically sound reasoning chains, or reliably link multiple deduction steps. Even state-of-the-art models generating outputs 10–100 times longer than a typical response frequently introduce elementary mistakes that a human solver would easily catch.

Intrigued by this unsolved mystery, we donned our deerstalker caps and set out to investigate: Could smaller, open-weight models reach frontier-level deduction performance with the latest reinforcement learning techniques? We began with substantially weaker models and iteratively trained them on a novel deduction task. Over time, we observed clear improvements in their detective prowess, eventually matching or even exceeding some of the strongest proprietary models.

Now we're happy to share our findings, including our experiments, training recipe, dataset, and model weights, all freely available under the MIT license, along with key practical insights (right here). Grab your magnifying glass, detective; the game is afoot!

Benchmarking

To begin our experiments, we first had to identify a challenging reasoning task with clearly verifiable solutions and scalable complexity. As it happened (not coincidentally), one of the authors previously created a puzzle set named Temporal Clue that matched these needs perfectly. Beyond meeting the criteria of ground truth clarity, new puzzles can be created as needed—a neat bonus.

Temporal Clue is inspired by the popular board game Clue (Cluedo), in which players race to uncover who killed Mr. Boddy in his palatial estate. Temporal Clue turns the game into a solitary logic puzzle that extends beyond the standard dimensions—who, (with) what, and where—and incorporates two additional dimensions: when (time) and why (motive). Puzzles are randomly generated, and minimal yet sufficient clues are selected with OR-Tools' CP-SAT solver.

On a dark winter night, wealthy and enigmatic Mr. John Q. Boddy hosted a small, but lavish, dinner party for some of his closest associates. However, the night ended in tragedy when Mr. Boddy was found dead in one of the rooms of Tudor Mansion in the early hours of the morning. The following persons of interest have been identified as suspects…

To establish the current state of the art for this deduction task, we benchmarked leading reasoning models—including DeepSeek R1, OpenAI’s o1 and o3-mini, and Anthropic’s Claude Sonnet 3.7. We also benchmarked the 14B and 32B Qwen models, which we later improved using reinforcement learning; the table below includes a preview of those final results:

| Organization | Model | Reasoning Effort | Avg. Accuracy | Avg. Cost |
|---|---|---|---|---|
| DeepSeek | R1 | Default | 51.6% | $0.029 |
| OpenAI | o1 | Default | 54.9% | $0.901 |
| OpenAI | o3-mini | Medium | 55.9% | $0.068 |
| OpenAI | o3-mini | High | 56.7% | $0.170 |
| Anthropic | Sonnet 3.7 | None | 51.7% | $0.017 |
| Anthropic | Sonnet 3.7 | 16k Token Budget | 61.7% | $0.222 |
| Anthropic | Sonnet 3.7 | 64k Token Budget | 69.5% | $0.392 |
| Alibaba | Qwen 2.5 14B Instruct | None | 28.1% → 59.4% | $0.001 |
| Alibaba | Qwen 2.5 32B Instruct | None | 37.3% → 67.1% | $0.002 |

From these benchmarks, we saw that Claude Sonnet 3.7 with a 64k token thinking budget performed best on our task, but that all the leading models showed room for improvement. DeepSeek R1, a popular open-weight model, performed nearly as well as OpenAI's o1 and o3-mini. However, the untuned Qwen 2.5 Instruct models’ performance was unimpressive in comparison. The big question is: Can we train these smaller, open-weight models to frontier-level performance? Elementary, our dear reader—we just need the right approach.

Training

To train a frontier-level deduction model, we turned to reinforcement learning—an approach that allows agents to learn from their own experience inside a controlled environment. Here, the LLMs were our agents, and the puzzles were our environment. We guided the LLMs’ learning by having them generate multiple responses for each puzzle, exploring the problem landscape. We reinforced deductions leading to correct solutions and penalized reasoning that took the models astray.

Among various RL methods, we selected the popular Group Relative Policy Optimization (GRPO) algorithm developed by DeepSeek. GRPO simplifies the training process compared to more traditional methods like Proximal Policy Optimization (PPO), while still providing robust performance. To speed up our experiments, we omitted the  Kullback–Leibler (KL) divergence penalty, although our training recipe supports it for interested readers.
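
To make the “group relative” idea concrete, here is a minimal PyTorch sketch of group-normalized advantages and the clipped policy-gradient objective (with the KL term omitted, as in our runs). Function and variable names are illustrative, not taken from our recipe.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """Normalize rewards within a group of completions sampled for the same prompt.

    rewards: shape (group_size,), one scalar reward per sampled completion.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def clipped_pg_loss(logprobs: torch.Tensor,
                    old_logprobs: torch.Tensor,
                    advantages: torch.Tensor,
                    clip_eps: float = 0.2) -> torch.Tensor:
    """PPO-style clipped surrogate, averaged over assistant tokens.

    logprobs / old_logprobs: shape (num_tokens,), log-probs of the sampled tokens
    under the current policy and the policy used at sampling time.
    advantages: shape (num_tokens,), each completion's advantage broadcast to its tokens.
    """
    ratio = torch.exp(logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```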

At a high level, our training loop followed these basic steps, which we sketch in code after the list:

  • Generate model responses to puzzle tasks

  • Grade responses and estimate advantages for each group of chat completions (that’s the “Group Relative” part in GRPO)

  • Fine-tune the model using clipped policy gradients guided by these advantage estimates

  • Repeat these steps with new puzzles and the latest version of the model until we reach peak performance
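
Putting the steps together, the outer loop looks roughly like the sketch below, reusing the helpers from the previous snippet. `sample_puzzles`, `generate_responses`, `grade`, and `train_step` are hypothetical stand-ins for the vLLM sampling, puzzle grading, and torchtune fine-tuning described in the rest of this section.

```python
# Hypothetical outer loop; the function names are illustrative stand-ins.
for iteration in range(num_iterations):
    puzzles = sample_puzzles(n=32)                      # Tasks per Iteration
    groups = [generate_responses(model, p, n=50)        # Samples per Task
              for p in puzzles]
    for puzzle, completions in zip(puzzles, groups):
        rewards = torch.tensor([grade(puzzle, c) for c in completions])
        advantages = grpo_advantages(rewards)           # the "Group Relative" step
        # Skip groups where every completion earned the same reward (all-zero advantage).
        if advantages.abs().sum().item() == 0:
            continue
        train_step(model, completions, advantages)      # clipped policy gradients
    model.save_checkpoint(iteration)
```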

For generating responses, we used the popular vLLM inference engine. We tuned our parameter choices to maximize throughput and minimize startup time. Prefix caching was particularly important because we sampled many responses for each task, and caching prompts helps avoid redundant computation.
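
As a rough illustration of this setup (the parameter values here are placeholders, not our tuned settings), offline generation with vLLM and prefix caching enabled might look like:

```python
from vllm import LLM, SamplingParams

# Illustrative configuration; our tuned values differed.
llm = LLM(
    model="Qwen/Qwen2.5-32B-Instruct",
    enable_prefix_caching=True,      # reuse KV cache for the shared puzzle prompt
    gpu_memory_utilization=0.9,
)

params = SamplingParams(
    n=50,                # many samples per puzzle, so prefix caching pays off
    temperature=1.0,
    max_tokens=2048,
)

# puzzle_prompts: list[str] of rendered puzzle prompts (assumed to exist)
outputs = llm.generate(puzzle_prompts, params)
completions = [[o.text for o in out.outputs] for out in outputs]
```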

We observed that overwhelming vLLM with too many requests forces preemption or swapping out of in-progress requests. To address this, we limited requests using a semaphore tuned to maintain high key-value (KV) cache utilization while minimizing swaps. More advanced scheduling mechanisms could yield even higher utilization while still supporting flexible generation lengths.
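
One simple way to impose such a limit, if the model is served behind vLLM's OpenAI-compatible server, is an asyncio semaphore around each request. This is a minimal sketch; the concurrency cap below is a placeholder you would tune against KV cache utilization.

```python
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
semaphore = asyncio.Semaphore(8)  # placeholder cap; tune to keep KV cache busy without swaps

async def sample_completions(prompt: str, n: int = 50) -> list[str]:
    async with semaphore:  # wait here instead of overwhelming the engine
        response = await client.chat.completions.create(
            model="Qwen/Qwen2.5-32B-Instruct",
            messages=[{"role": "user", "content": prompt}],
            n=n,
            max_tokens=2048,
        )
    return [choice.message.content for choice in response.choices]

async def main(prompts: list[str]) -> list[list[str]]:
    return await asyncio.gather(*(sample_completions(p) for p in prompts))
```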

After sampling, we processed completions using the standard HuggingFace Transformers AutoTokenizer. Its chat template feature, which renders message objects as a prompt string, includes an assistant mask for determining which tokens the LLM generated. We found the models lacked the necessary {% generation %} tags in their default templates, so we modified them during the tokenization step. The resulting assistant mask was included in the dictionary of tensors used for tuning, identifying which positions required loss calculations.
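
For example, assuming the chat template has been patched with {% generation %} tags, extracting the assistant mask with Transformers looks roughly like this (the `puzzle_prompt` and `completion` variables are placeholders):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-32B-Instruct")
# Assumes tokenizer.chat_template wraps assistant turns in
# {% generation %} ... {% endgeneration %} tags.
encoded = tokenizer.apply_chat_template(
    [
        {"role": "user", "content": puzzle_prompt},
        {"role": "assistant", "content": completion},
    ],
    tokenize=True,
    return_dict=True,
    return_assistant_tokens_mask=True,  # 1 for tokens the model generated
)
input_ids = encoded["input_ids"]
assistant_mask = encoded["assistant_masks"]  # loss is computed only where this is 1
```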

After tokenizing responses and obtaining assistant masks, we packed the data for tuning. In addition to including multiple prompt/response pairs in each packed sequence, we identified shared prompt tokens and assigned each token a Parent ID alongside the standard Group ID. Particularly for tasks like Temporal Clue—averaging over 1,000 tokens per puzzle—generating numerous responses per task and efficiently packing tensors significantly reduced redundancy. Once packed with all necessary information, we could visualize our training dataset two-dimensionally, with each row being a sequence of tokens potentially containing multiple prompts and completions.
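
The sketch below illustrates the packing idea under our own naming (the recipe's actual data structures differ): each token carries a group ID identifying the prompt or completion it belongs to, and a parent ID pointing at the shared prompt its completion descends from.

```python
from dataclasses import dataclass, field

@dataclass
class PackedSequence:
    """Illustrative container; field names are ours, not torchtune's."""
    tokens: list[int] = field(default_factory=list)
    group_ids: list[int] = field(default_factory=list)   # which prompt/completion a token belongs to
    parent_ids: list[int] = field(default_factory=list)  # which shared prompt it descends from

def pack(prompt_tokens: list[int], completions: list[list[int]],
         prompt_group_id: int, next_group_id: int) -> tuple[PackedSequence, int]:
    """Pack one shared prompt plus all of its sampled completions into a single row."""
    seq = PackedSequence()
    seq.tokens += prompt_tokens
    seq.group_ids += [prompt_group_id] * len(prompt_tokens)
    seq.parent_ids += [prompt_group_id] * len(prompt_tokens)
    for completion_tokens in completions:
        seq.tokens += completion_tokens
        seq.group_ids += [next_group_id] * len(completion_tokens)
        seq.parent_ids += [prompt_group_id] * len(completion_tokens)  # all share one prompt
        next_group_id += 1
    return seq, next_group_id
```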

With tightly-packed data in hand, we could proceed to tuning. Our models were already pre-trained, instruction-tuned, fairly intelligent, and adept at following instructions. However, they could not yet reliably solve Temporal Clue puzzles. Still, they occasionally succeeded, and that was enough. By increasing the probability of good reasoning and decreasing the probability of “not good” reasoning, we incrementally steered the models toward Master Detective status. We achieved this using standard machine learning techniques, employing policy gradient methods to compute loss and shift the weights beneficially.

For training, we used the torchtune library provided by the PyTorch team. Torchtune features efficient decoder-only transformer implementations for popular models including Llama, Gemma, Phi, and more. Although we primarily used the Qwen models for this project, we also ran experiments with 8B and 70B Llama models. Torchtune also provides a number of memory-saving and performance-enhancing utilities; see the README here for the full list of supported optimizations.

Additionally, Torchtune supports multi-device (and now multi-node) training, making it ideal for larger models. It supports both Fully Sharded Data Parallel (FSDP) and Tensor Parallel (TP) training, which can be combined. The torchtune team also provides over a dozen recipes, encouraging users to copy and customize them for their use cases. We created a modified version of their full fine-tune recipes supporting:

  • Both multi-device and single-device training

  • Reference model loading and weight swapping for calculating KL divergences

  • Advanced causal mask calculations using group and parent IDs

  • GRPO loss integration and component logging

The recipe can be seen here. In the future, we would like to add tensor parallelism support and explore PEFT and quantization.
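
As an illustration of the group/parent-ID masking mentioned above, a mask of this kind can be built so that a token attends to earlier tokens in its own group, or to the shared prompt its completion descends from. This is a minimal sketch, not the recipe's actual implementation.

```python
import torch

def build_attention_mask(group_ids: torch.Tensor, parent_ids: torch.Tensor) -> torch.Tensor:
    """Boolean (seq_len, seq_len) mask: True where attention is allowed.

    A token may attend to an earlier token if they share a group
    (same prompt or same completion), or if the earlier token belongs
    to the shared prompt this completion descends from.
    """
    seq_len = group_ids.shape[0]
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    same_group = group_ids.unsqueeze(1) == group_ids.unsqueeze(0)
    attends_parent = parent_ids.unsqueeze(1) == group_ids.unsqueeze(0)
    return causal & (same_group | attends_parent)
```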

The RL training process involves selecting a myriad of hyperparameters. While training our models, we tested various configurations and largely settled upon the following:

  • Models: Qwen 2.5 Instruct 14B & 32B

  • Tasks per Iteration: 32

  • Samples per Task per Iteration: 50

  • Total Samples per Iteration: 32 * 50 = 1600

  • Learning Rate: 6e-6

  • Micro-Batch Size: 4 sequences for 14B model, 8 for 32B model

  • Batch Size: Variable, depending on the number of sequences

The batch size is variable because response lengths can vary during training, sequence packing efficiency fluctuates each iteration, and responses with zero advantage are discarded. For one run, we tried dynamically adjusting learning rates inversely proportional to batch size, but this resulted in excessively high learning rates for small batch sizes, requiring a cap. The capped version didn’t meaningfully differ from using a constant learning rate, but tuning batch size and learning rate remains an interesting area for future experimentation.
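
For reference, the capped inverse scaling we experimented with looks roughly like the following; the reference batch size and cap here are placeholders rather than our actual values.

```python
def scaled_lr(batch_size: int,
              base_lr: float = 6e-6,
              reference_batch_size: int = 64,
              lr_cap: float = 2e-5) -> float:
    """Scale the learning rate inversely with batch size, capped to avoid blow-ups
    when few sequences survive packing and zero-advantage filtering.
    reference_batch_size and lr_cap are illustrative placeholders."""
    return min(base_lr * reference_batch_size / batch_size, lr_cap)
```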

We also ran brief experiments increasing tasks per iteration while reducing samples per task—and vice versa—keeping total samples per iteration roughly equal. Over a short training horizon, these variations showed no meaningful differences, suggesting the recipe is robust to different balances between the number of tasks and samples per task.

Results

After training our models for over 100 iterations, we reached frontier-level deduction performance.

Our models quickly improved before accuracy gains started to taper off and eventually degrade, sometimes aggressively. At its best, the 14B model approached Claude Sonnet 3.7’s performance with a 16k token thinking budget, and the 32B model nearly matched Sonnet's results with the larger 64k budget.

While training, performance gains followed a power law, forming a linear relationship on a log-log chart (before deteriorating).

We suspect the models may have converged too early on greedy strategies that worked out of the gate but potentially limited their long-term prospects. A logical next step would be to explore approaches that encourage diverse responses, build capabilities incrementally (as with curriculum learning), or assign larger rewards to particularly outstanding solutions to incentivize thorough exploration.

Additionally, we noted interesting patterns in output length during training. Initially, responses grew longer, then stabilized, before diverging near the end of training, with the 14B model’s responses getting longer and the 32B model’s response lengths collapsing, especially after reaching peak performance.

To qualitatively assess improvements in logical reasoning, we asked the strongest frontier model, Claude Sonnet 3.7, to identify and evaluate the soundness of deductions made by the Qwen 32B model—before and after training for 100+ iterations—on similar puzzles. Sonnet identified 6 deductions from the base model, with all but one judged erroneous; conversely, it identified 7 deductions from the trained model, with all but one judged logically sound.

Finally, assuming sufficient throughput with on-demand deployments, we estimated Qwen model costs from Fireworks AI’s serverless pricing tiers. We plotted accuracy against the natural logarithm of the average inference cost per response, and observed a clear linear Pareto frontier among untuned models. By successfully training open-weight models to frontier-level accuracy, we dramatically improved the cost–accuracy trade-off.

Here, after sharing satisfied nods for a job well done, we hail a hansom cab and return to Baker Street—the perfect place to contemplate our findings.

Conclusion

In our investigation, we set out to explore whether smaller, open-weight language models could achieve frontier-level deductive reasoning through reinforcement learning. After training Qwen 14B and 32B models on challenging Temporal Clue puzzles—using carefully selected hyperparameters and the GRPO method—we achieved impressive performance gains. These improvements brought open-weight models to the cutting edge of reasoning performance, at significantly reduced costs. Our findings highlight the promising potential for reinforcement learning to efficiently train open models on complex deduction tasks.

As mentioned previously, the dataset, experiments, training recipe, and model weights (14B, 32B) are freely available under the MIT license. We encourage you to try reproducing and improving on our results.

Additionally, we’ve held out one particularly exciting finding for the end. We discovered that meaningful performance improvements, as high as 10–15%, can be achieved with as few as 16 training examples. This means you don’t need a lot of data to get started; just some intuition about the problem you’d like to solve.

Are you interested in using reinforcement learning to train your own models, or would like some help getting started? Feel free to reach out to us at OpenPipe—we'd love to chat!


Now, dear reader, please keep your deerstalker cap and magnifying glass handy; there's much more to explore. The game remains very much afoot.


About OpenPipe

OpenPipe is the easiest way to train and deploy your own fine-tuned models. It only takes a few minutes to get started and can save you 25x relative to OpenAI with higher quality.
