Overview
ART (Agent Reinforcement Trainer) is an open-source Python framework from OpenPipe for post-training LLM agents on multi-step tasks. It uses GRPO (Group Relative Policy Optimization) so a model can learn from experience: you run the agent, score its trajectories with a reward function, and ART updates the model toward higher-scoring behavior.
It is aimed at engineers who already have an agent loop and want it to become more reliable at a specific task, rather than relying only on prompt engineering. You define the data, environment, and reward, and ART provides the harness that plugs GRPO into an ordinary Python application.
Within RLHF and alignment tooling, ART focuses on agentic, multi-step rewards rather than single-turn preference tuning. Rewards can be rule-based or judge-based (the project ships a RULER approach for LLM-judged scoring), and training runs either on your own GPUs or through W&B Training (Serverless RL).
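To make "updates the model toward higher-scoring behavior" concrete, the sketch below shows the group-relative advantage that gives GRPO its name: each trajectory's reward is compared with the mean of the group sampled for the same task, then scaled by the group's spread. This is plain Python for illustration, not ART's implementation; the function name and the `eps` stabilizer are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other rollouts in its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every rollout scored the same.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical rollouts of the same task, scored by a reward function.
rewards = [0.0, 1.0, 1.0, 0.0]
advantages = group_relative_advantages(rewards)
```

Rollouts that beat the group mean get a positive advantage and are reinforced; those below it get a negative one. A group in which every rollout scores the same yields advantages of zero, so it contributes no learning signal.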
What it does
- Trains multi-step agents with GRPO so models improve from their own trajectories
- Wraps an existing Python agent loop instead of forcing a new framework
- Supports rule-based and judge-based (RULER) reward functions
- Offers a Serverless RL option via W&B Training that manages training and inference infrastructure for you
- Ships runnable example notebooks (email search, 2048, MCP, Codenames, Tic Tac Toe) on open Qwen models
- Integrates with agent stacks such as LangGraph and MCP servers
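As an illustration of the rule-based rewards mentioned in the list above, here is a minimal sketch of a scoring function for an email-search-style task. `ToyTrajectory`, `rule_based_reward`, and the penalty values are hypothetical and are not part of ART's API; a real reward would read whatever your own agent loop records.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTrajectory:
    """Hypothetical stand-in for a recorded agent run (not ART's own type)."""
    final_answer: str
    tool_calls: list = field(default_factory=list)


def rule_based_reward(traj, expected_answer, max_tool_calls=5):
    """Return 0.0 for a wrong answer, else 1.0 minus 0.1 per extra tool call."""
    if traj.final_answer.strip().lower() != expected_answer.strip().lower():
        return 0.0
    extra_calls = max(0, len(traj.tool_calls) - max_tool_calls)
    return max(0.0, 1.0 - 0.1 * extra_calls)


# A correct answer found in a single tool call earns the full reward.
score = rule_based_reward(ToyTrajectory("Paris", ["search"]), "paris")
```

Keeping the reward a pure function of the trajectory makes it easy to unit-test before spending any compute on training.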
Getting started
Install the package, then define a trainable model and register a backend that runs the GRPO training loop. The fastest path is one of the example notebooks linked in the README.
Install ART
Install the package from PyPI into your Python environment.
pip install openpipe-art
Define a trainable model and register a backend
Create a TrainableModel for your base model, then register a backend that handles training and inference. This example uses the W&B Serverless RL backend shown in the README.
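import art
from art.serverless.backend import ServerlessBackend

# Describe the model to train: project, run name, and base checkpoint.
model = art.TrainableModel(
    project="voice-agent",
    name="agent-001",
    base_model="Qwen/Qwen3.6-27B",
)

# The serverless backend handles training and inference on W&B's side.
backend = ServerlessBackend(
    api_key="your_wandb_api_key",
)

# If register() is a coroutine in your ART version, call it inside an
# async function as: await model.register(backend)
model.register(backend)
Start from an example notebook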
Open one of the example notebooks (such as email search, 2048, or MCP-RL) to see a full loop of running the agent, scoring trajectories, and training. See the docs at art.openpipe.ai for details.
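The run, score, train loop those notebooks walk through can be sketched with a toy policy in plain Python. Nothing here calls ART: the two actions, the reward, and the update rule (a softmax policy nudged by group-normalized rewards) are illustrative assumptions that only mirror the shape of the loop.

```python
import math
import random
from statistics import mean, pstdev

ACTIONS = ["search_inbox", "guess"]  # hypothetical agent actions


def action_probs(logits):
    """Softmax over the toy policy's logits."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reward(action):
    """Toy rule-based reward: only searching the inbox solves the task."""
    return 1.0 if action == "search_inbox" else 0.0


def train(steps=30, group_size=8, lr=0.5, seed=0):
    rng = random.Random(seed)
    logits = [0.0, 0.0]  # start with no preference between actions
    for _ in range(steps):
        probs = action_probs(logits)
        # 1. Run the agent: sample a group of rollouts for the same task.
        picks = rng.choices(range(len(ACTIONS)), weights=probs, k=group_size)
        # 2. Score each rollout.
        rewards = [reward(ACTIONS[i]) for i in picks]
        # 3. Train: compare each reward with its group, then nudge the policy.
        mu, sigma = mean(rewards), pstdev(rewards)
        for i, r in zip(picks, rewards):
            adv = (r - mu) / (sigma + 1e-6)
            for k in range(len(logits)):
                # Gradient of log-softmax with respect to logit k.
                grad = (1.0 if k == i else 0.0) - probs[k]
                logits[k] += lr * adv * grad / group_size
    return action_probs(logits)[0]


p_search = train()  # probability of choosing "search_inbox" after training
```

In a real run, the sampling step is your agent loop calling the model, the reward is your rule-based or judge-based scorer, and the update is performed by the registered backend rather than by hand.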
Commands and code are distilled from the project's own documentation; always check the official repo for the latest.
When to use it
- Improving the reliability of a multi-step agent (e.g. email search or tool use) on a specific task
- Teaching an open model to operate an MCP server or a LangGraph workflow through reinforcement learning
- Training an agent with a custom reward when prompt engineering alone plateaus
- Running RL post-training without managing GPU infrastructure via the serverless backend
How ART compares
ART alongside the other open-source RLHF and alignment tools that AI/TLDR tracks, ranked by GitHub stars.
| Tool | Stars | What it does |
|---|---|---|
| Open-R1 | ★ 26.3k | An open reproduction of the DeepSeek-R1 reasoning pipeline, with scripts for GRPO training and reasoning-data generation. |
| verl | ★ 22.1k | Volcano Engine's RL post-training framework (HybridFlow) for building GRPO, PPO, and other RL pipelines on top of FSDP, Megatron, and vLLM. |
| TRL | ★ 18.7k | Hugging Face's post-training library with trainers for SFT, reward modeling, DPO, PPO, and GRPO to align language models with preferences. |
| Agent Lightning | ★ 17.3k | An open-source trainer from Microsoft that improves AI agents built with any framework using reinforcement learning, prompt optimization, and supervised fine-tuning. |
| ART | ★ 10.1k | OpenPipe's reinforcement learning framework for training multi-step LLM agents from experience with GRPO. |
| OpenRLHF | ★ 9.7k | A Ray- and vLLM-based RLHF framework that scales PPO, GRPO, and REINFORCE++ training to models with 70B+ parameters. |
| Alignment Handbook | ★ 5.6k | A set of recipes and scripts from Hugging Face showing how to run the full SFT-then-preference-alignment pipeline used to build aligned chat models. |
| Verifiers | ★ 4.2k | A library for defining verifiable-reward environments and running reinforcement-learning fine-tuning of LLMs against those rewards. |