ART

Reinforcement learning to train multi-step LLM agents from experience with GRPO

Overview

ART (Agent Reinforcement Trainer) is an open-source Python framework from OpenPipe for post-training LLM agents on multi-step tasks. It uses GRPO (Group Relative Policy Optimization) so a model can learn from experience: you run the agent, score its trajectories with a reward function, and ART updates the model toward higher-scoring behavior.
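The "group relative" part of GRPO can be illustrated with a few lines of plain Python. This is a minimal sketch of the normalization step as it is commonly described for GRPO, not ART's internal code: the function name `group_relative_advantages` and the `eps` parameter are illustrative assumptions, and ART's actual implementation may differ.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Score each attempt relative to the other attempts in its group.

    Illustrative helper, not part of ART: subtract the group mean and
    divide by the group standard deviation (eps avoids division by zero).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four attempts at the same task, each scored by a reward function.
rewards = [0.0, 1.0, 1.0, 0.0]
advantages = group_relative_advantages(rewards)
print(advantages)  # successful attempts get positive values, failed ones negative
```

Because each attempt is compared only with its own group, a group in which every attempt earns the same reward yields all-zero advantages and therefore no learning signal.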

It is aimed at engineers who already have an agent loop and want it to become more reliable at a specific task, rather than relying only on prompt engineering. You define the data, environment, and reward, and ART provides the harness that plugs GRPO into an ordinary Python application.

Within RLHF and alignment tooling, ART focuses on agentic, multi-step rewards rather than single-turn preference tuning. Rewards can be rule-based or judge-based (the project ships a RULER approach for LLM-judged scoring), and training runs either on your own GPUs or through W&B Training (Serverless RL).
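A rule-based reward for a multi-step agent can be as simple as a function over the finished episode. The sketch below is hypothetical and does not use ART's types: `Episode`, `rule_based_reward`, the `max_calls` budget, and the 0.1 penalty are all made up for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Episode:
    """Hypothetical record of one agent run; not an ART type."""
    tool_calls: list[str] = field(default_factory=list)
    final_answer: str = ""

def rule_based_reward(episode, expected_answer, max_calls=5):
    """Return 0.0 for a wrong answer, otherwise up to 1.0 minus a step penalty."""
    correct = episode.final_answer.strip().lower() == expected_answer.strip().lower()
    if not correct:
        return 0.0
    # Penalize tool calls beyond the budget, but keep a floor for correct answers.
    extra_calls = max(0, len(episode.tool_calls) - max_calls)
    return max(0.1, 1.0 - 0.1 * extra_calls)

episode = Episode(tool_calls=["search_inbox", "read_email"], final_answer="Paris")
print(rule_based_reward(episode, expected_answer="paris"))
```

A judge-based reward would replace the string comparison with a score from an LLM judge while keeping the same "episode in, number out" shape.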

What it does

Trains multi-step agents with GRPO so models improve from their own trajectories
Wraps an existing Python agent loop instead of forcing a new framework
Supports rule-based and judge-based (RULER) reward functions
Offers a Serverless RL option via W&B Training that manages training and inference infrastructure for you
Ships runnable example notebooks (email search, 2048, MCP, Codenames, Tic Tac Toe) on open Qwen models
Integrates with agent stacks such as LangGraph and MCP servers

Getting started

Install the package, then define a trainable model and register a backend that runs the GRPO training loop. The fastest path is one of the example notebooks linked in the README.

Install ART

Install the package from PyPI into your Python environment.

bash

pip install openpipe-art
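To confirm the install landed in the interpreter you intend to use, a quick check along these lines may help. It is a sketch built only on the standard library; `describe_install` is a hypothetical helper, and it assumes the `openpipe-art` distribution is imported as `art`, as the import in the next step suggests.

```python
import importlib.util
from importlib import metadata

def describe_install(module_name, dist_name):
    """Report whether a module can be imported and which version is installed."""
    if importlib.util.find_spec(module_name) is None:
        return f"{module_name} is not importable; try: pip install {dist_name}"
    try:
        version = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        # The module exists but was not installed from this distribution.
        version = "unknown version"
    return f"{module_name} is importable ({dist_name} {version})"

print(describe_install("art", "openpipe-art"))
```

If the report shows an unknown version, a different package that also provides an `art` module may be installed in that environment.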

Define a trainable model and register a backend

Create a TrainableModel for your base model, then register a backend that handles training and inference. This example uses the W&B Serverless RL backend shown in the README.

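python

import art
from art.serverless.backend import ServerlessBackend

# Describe the model to train: a project, a run name, and the base model.
model = art.TrainableModel(
    project="voice-agent",
    name="agent-001",
    base_model="Qwen/Qwen3.6-27B",
)

# The serverless backend handles training and inference remotely.
backend = ServerlessBackend(
    api_key="your_wandb_api_key"
)

# Note: if register() is a coroutine in your ART version, call it as
# `await model.register(backend)` from async code instead.
model.register(backend)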
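The placeholder string above is fine for a first read, but in practice you would likely keep the key out of source code. One possible approach, sketched here, reads it from the `WANDB_API_KEY` environment variable that W&B tooling conventionally uses; `load_api_key` is a hypothetical helper, not part of ART.

```python
import os

def load_api_key(env_var="WANDB_API_KEY"):
    """Read the API key from the environment and fail early if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set {env_var} before creating the backend")
    return key

# Intended use with the backend from the example above:
# backend = ServerlessBackend(api_key=load_api_key())
```

Failing early with a clear message tends to be easier to debug than an authentication error surfacing partway through a training run.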

Start from an example notebook

Open one of the example notebooks (such as email search, 2048, or MCP-RL) to see a full loop of running the agent, scoring trajectories, and training. See the docs at art.openpipe.ai for details.
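Before opening a notebook, it can help to see the run, score, train cycle in miniature. The toy below is not ART code and calls no ART APIs: it trains a three-action softmax policy with group-normalized rewards, and every name in it (`softmax`, `train_toy_policy`, the hyperparameters) is an illustrative assumption.

```python
import math
import random

def softmax(logits):
    """Turn logits into probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train_toy_policy(reward_fn, n_actions=3, steps=200, group_size=8, lr=0.5, seed=0):
    rng = random.Random(seed)
    logits = [0.0] * n_actions
    for _ in range(steps):
        probs = softmax(logits)
        # 1. Run: sample a group of attempts from the current policy.
        actions = rng.choices(range(n_actions), weights=probs, k=group_size)
        # 2. Score: apply the reward function to every attempt.
        rewards = [reward_fn(a) for a in actions]
        # 3. Train: compare each attempt with its group, then nudge the policy.
        mu = sum(rewards) / group_size
        sigma = math.sqrt(sum((r - mu) ** 2 for r in rewards) / group_size)
        if sigma == 0:
            continue  # identical rewards carry no learning signal
        for a, r in zip(actions, rewards):
            advantage = (r - mu) / sigma
            for j in range(n_actions):
                # Gradient of log softmax(logits)[a] with respect to logit j.
                grad = (1.0 if j == a else 0.0) - probs[j]
                logits[j] += lr * advantage * grad / group_size
    return softmax(logits)

# Reward only action 2; its probability should rise over training.
final_probs = train_toy_policy(lambda action: 1.0 if action == 2 else 0.0)
print(final_probs)
```

In a real setup the "actions" are whole multi-step trajectories produced by your agent, and the parameter update is handled by the training backend rather than by hand.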

Commands and code are distilled from the project's own documentation — always check the official repo for the latest.

When to use it

Improving the reliability of a multi-step agent (e.g. email search or tool use) on a specific task
Teaching an open model to operate an MCP server or a LangGraph workflow through reinforcement learning
Training an agent with a custom reward when prompt engineering alone plateaus
Running RL post-training through the serverless backend, without managing GPU infrastructure yourself

How ART compares

ART alongside the other open-source RLHF and alignment tools that AI/TLDR tracks, ranked by GitHub stars.

Tool	Stars	What it does
Open-R1	★ 26.3k	An open reproduction of the DeepSeek-R1 reasoning pipeline, with scripts for GRPO training and reasoning-data generation.
verl	★ 22.1k	Volcano Engine's RL post-training framework (HybridFlow) for building GRPO, PPO, and other RL pipelines on top of FSDP, Megatron, and vLLM.
TRL	★ 18.7k	Hugging Face's post-training library with trainers for SFT, reward modeling, DPO, PPO, and GRPO to align language models with preferences.
Agent Lightning	★ 17.3k	An open-source trainer from Microsoft that improves AI agents built with any framework using reinforcement learning, prompt optimization, and supervised fine-tuning.
ART	★ 10.1k	OpenPipe's reinforcement learning framework for training multi-step LLM agents from experience with GRPO.
OpenRLHF	★ 9.7k	A Ray- and vLLM-based RLHF framework that scales PPO, GRPO, and REINFORCE++ training to models with 70B+ parameters.
Alignment Handbook	★ 5.6k	A set of recipes and scripts from Hugging Face showing how to run the full SFT-then-preference-alignment pipeline used to build aligned chat models.
Verifiers	★ 4.2k	A library for defining verifiable-reward environments and running reinforcement-learning fine-tuning of LLMs against those rewards.
