
ART Trainer: A New RL Trainer for Agents

Kyle Corbitt

Apr 14, 2025

We’re excited to announce Agent Reinforcement Trainer (ART), a reinforcement learning framework that allows for easy training of LLM-based agents using GRPO (as well as PPO and other techniques in the future). This release is an early alpha focused on best-in-class training efficiency and agentic multi-turn support.

Motivation

There are many excellent projects focused on training LLMs with RL, such as GRPOTrainer and verl. We’ve used these frameworks extensively for customer-facing projects at OpenPipe, as well as our own custom in-house framework based on torchtune. So, why build a new one? We’ve identified three key limitations of existing frameworks in real-world use, all of which we’re addressing with ART.

Limitation 1: multi-turn rollouts. Existing trainers focus on single-turn interactions, where an LLM is given some context, produces an output, and then that output is graded. This is sufficient for some patterns (e.g. learning to reason through math problems) but not for agentic flows, which are multi-turn by nature.

Limitation 2: GPU efficiency. Even after adapting existing trainers to handle multi-turn rollouts, we’ve found that GPU utilization rates can still be very low during the rollout phase while training agents. This is because agents often have to take real-world actions, such as navigating the web or submitting forms, after each model response. While waiting for these actions to execute, the GPUs dedicated to inference may sit idle.

Limitation 3: integration with existing codebases. Agents in real-world use are often embedded within complex codebases that include lots of custom tool definitions and frameworks such as CrewAI, Mastra, or the OpenAI Agents SDK. Existing RL training pipelines can’t meet those codebases where they are; they require significant refactoring and lower-level manipulation to train an agent.

A new ART-chitecture

To address the limitations above, we’ve developed a new architecture that is significantly easier to embed in existing codebases, supports multi-turn rollouts, and paves the way to increased GPU utilization.

The new architecture is composed of two key innovations:

  1. Separate backend and frontend. We’ve separated the GRPO training loop into two halves: a “frontend” composed mostly of user-defined code that defines agentic rollouts and computes rewards, and a “backend” responsible for LLM inference and training. Importantly, it will be possible to deploy these on separate machines. This means that the frontend can live in your local environment with all the tooling it needs, while the backend runs on a cloud GPU.

  2. OpenAI-compatible inference endpoints. To maximize ecosystem compatibility, we provision an industry-standard OpenAI-compatible chat completion endpoint that can be used to perform rollouts during training. This makes it possible to train a model using your existing inference logic with minimal changes. You can use code that relies on the OpenAI SDK or wrappers like litellm out-of-the-box.
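
For example, a rollout written against the OpenAI Python SDK only needs its base URL pointed at the endpoint served during training. Here is a minimal sketch; the URL, API key, and model name are placeholders rather than ART’s actual defaults:

    from openai import OpenAI

    # Point the standard OpenAI client at the OpenAI-compatible endpoint the
    # training backend exposes. These values are placeholders.
    client = OpenAI(
        base_url="http://localhost:8000/v1",       # hypothetical endpoint URL
        api_key="not-needed-for-local-training",   # hypothetical key
    )

    # The rollout code stays exactly what you'd write against any
    # OpenAI-compatible server.
    response = client.chat.completions.create(
        model="my-agent-in-training",  # hypothetical model name
        messages=[{"role": "user", "content": "Write an HN title for this article: ..."}],
    )
    print(response.choices[0].message.content)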

Under the hood, we use vLLM for inference and a combination of TRL + Unsloth for training. Building on Unsloth’s optimizations, we’ve made additional improvements to memory usage by offloading vLLM’s KV cache to CPU memory during training. This saves up to 4GB of VRAM when training a 7B model with an 8K context length, which is enough to train 7B models on Google Colab’s free tier without issues!
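
The offload itself is conceptually simple, even though the real implementation lives inside the inference engine. The snippet below is purely illustrative of the idea, using plain PyTorch and an arbitrary tensor shape; it is not vLLM’s or ART’s actual mechanism:

    import torch

    # Purely conceptual: park a GPU tensor (standing in for the KV cache) in CPU
    # memory while a training step needs the VRAM, then bring it back afterwards.
    kv_cache = torch.zeros(1024, 1024, 64, dtype=torch.bfloat16, device="cuda")  # arbitrary shape

    cpu_copy = kv_cache.to("cpu")  # copy off the GPU before training
    del kv_cache
    torch.cuda.empty_cache()       # release the VRAM for the training step

    # ... run the training step here ...

    kv_cache = cpu_copy.to("cuda")  # restore before the next round of rollouts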

Early results

We have a number of public examples of GRPO-trained models on various tasks. More details on all of the examples below will be coming in future blog posts!

HN title generation

Building on our RLHF part 1 post, we’ve trained a SOTA model to generate the HN titles most likely to get upvotes given an article’s text. (code)

2048

We've trained Qwen 7B to play the popular solo game "2048". This is a good example of multi-turn rollouts! (runnable notebook)
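
To give a feel for what a multi-turn rollout looks like here, the sketch below plays one game of 2048 through the same OpenAI-compatible client shown earlier. The environment helpers (new_game, apply_move, score) and the model name are hypothetical stand-ins, not ART’s actual API:

    # Simplified multi-turn rollout: the model proposes a move, the environment
    # advances, and the updated board is appended to the conversation until the
    # game ends. `new_game`, `apply_move`, and `score` are hypothetical helpers.
    messages = [{"role": "system", "content": "You are playing 2048. Reply with one move: up, down, left, or right."}]
    board = new_game()

    while not board.game_over:
        messages.append({"role": "user", "content": board.render()})
        completion = client.chat.completions.create(model="my-agent-in-training", messages=messages)
        move = completion.choices[0].message.content.strip().lower()
        messages.append({"role": "assistant", "content": move})
        board = apply_move(board, move)

    # The finished conversation plus a scalar reward (e.g. the highest tile reached)
    # is what gets handed back for training.
    reward = score(board)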

Tic-tac-toe

We’ve trained a 7B model that surpasses gpt-4o in tic-tac-toe playing ability. (code)

Clue

We trained a 14B model to surpass most frontier models in a variant of Clue called “Temporal Clue”. We’ve ported the code for training this model to ART. (code)

Join the community

It’s still very early for both ART and open-source RL, and we invite everyone to play with it and help us shape the project’s direction. The best way is by joining our Discord or submitting issues on GitHub. We have a lot of plans already on the roadmap (TypeScript support, full fine-tuning, direct integration with agentic frameworks), but community requests will have a huge impact on how we prioritize different features. We’re excited to work together to push forward the state of the ART!

FAQs

How does ART work under the hood?

At a high level, ART optimizes your agent as follows. Your code is responsible for actually running the agent in the environment it will operate in, as well as scoring each resulting trajectory (deciding whether the agent did a good job or not). ART then takes those trajectories and scores and uses them to iteratively train your agent and improve its performance.
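
In rough pseudocode, the loop looks something like this (the function names are illustrative, not ART’s real interface):

    # Illustrative outer loop, not ART's actual API.
    for step in range(num_training_steps):
        # 1. Your code: run the agent in its real environment several times,
        #    recording each multi-turn trajectory.
        trajectories = [run_my_agent(scenario) for scenario in sample_scenarios()]

        # 2. Your code: score each trajectory (did the agent do a good job?).
        rewards = [score_trajectory(t) for t in trajectories]

        # 3. ART backend: run a GRPO update on the trajectories and their rewards,
        #    then serve the improved weights for the next round of rollouts.
        update_model(trajectories, rewards)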

Why separate frontend from backend? Doesn’t that increase complexity?

By splitting the ART backend out into its own service, we’ve been able to keep the ART frontend extremely narrow and clean. This makes it much easier to embed in existing production applications, while allowing the heavy backend to run on separate, powerful machines with appropriate GPU resources. We’ve included an open-source ART backend, and over time we expect more providers to implement hosted ART backends as well, giving users choice and convenience in where their models are trained. Note, however, that the initial release assumes that the frontend and backend are running on the same machine with a GPU available.

How do I know whether ART can help my agent improve performance?

To get good results with ART, we recommend first ensuring your task meets the following requirements:

  1. Open source models can complete the task at least 30% of the time already. If you try to use ART on a task that is too far out of distribution, it likely won’t be able to teach your model efficiently.

  2. You can easily verify whether a task was completed successfully. ART, like all reinforcement-learning approaches, works by training a model to maximize a reward. To use it successfully, you need to be able to define some kind of quantifiable reward for the model to optimize against. Rewards can be objective (“does this output match the golden data from my training set”) or subjective (“does this output satisfy my LLM-as-judge”) but must be consistent and quantifiable.

  3. Your agent can be run many times without affecting the real world. ART currently builds on GRPO, an RL algorithm that involves running many agents in parallel and then using the difference in the rewards they achieve as a stable training signal (sketched below). This means that you need to be able to run the model many times in the same scenario as part of training, which isn’t a good fit for agents that make changes in the external world.
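
To make the “difference in rewards” signal in point 3 concrete, here is a simplified version of the group-normalization step GRPO uses to turn a group of rewards from the same scenario into per-rollout advantages (a sketch of the standard formulation, not ART’s internal code):

    import statistics

    def group_relative_advantages(rewards: list[float]) -> list[float]:
        # Each rollout is scored relative to its group: rollouts that beat the
        # group's mean reward get a positive advantage, weaker ones a negative one.
        mean = statistics.mean(rewards)
        std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
        return [(r - mean) / std for r in rewards]

    # Four parallel rollouts of the same scenario with different rewards:
    print(group_relative_advantages([0.0, 0.5, 0.5, 1.0]))
    # -> roughly [-1.41, 0.0, 0.0, 1.41]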

Which pieces of ART are open source?

We are committed to maintaining ART as a full-featured open source project. We will also deploy an optional hosted ART backend for users who don’t want to manage GPU infrastructure on their own.

How expensive are ART training runs?

RL can often be more expensive than other training methods like SFT for a given dataset size. In our experiments, training runs often cost between $15 and $200 in GPU time. We are actively working on improving efficiency and welcome contributions in this area; there is still a lot of low-hanging fruit to pick here.

Can I use ART to train on user feedback in production directly?

This is theoretically possible, and an area that we’re interested in exploring! However, there are practical challenges that make this a bit of a longer-term project. For now, we recommend training a reward model on your production feedback, and then using ART to optimize your model or agent against that reward model.

Is ART just for agents? I’m interested in training a non-agentic model with RL.

We designed the ART architecture to make training agents trivial. But under the hood we are using GRPO, which is a general purpose RL technique and the same one used to train R1, the frontier open-source reasoning model. ART is very effective at optimizing any LLM task for which you can define a quantifiable reward signal.


