AI/TLDR

ART

Reinforcement learning to train multi-step LLM agents from experience with GRPO

Overview

ART (Agent Reinforcement Trainer) is an open-source Python framework from OpenPipe for post-training LLM agents on multi-step tasks. It uses GRPO (Group Relative Policy Optimization) so a model can learn from experience: you run the agent, score its trajectories with a reward function, and ART updates the model toward higher-scoring behavior.

It is aimed at engineers who already have an agent loop and want it to become more reliable at a specific task, rather than relying only on prompt engineering. You define the data, environment, and reward, and ART provides the harness that plugs GRPO into an ordinary Python application.

Within RLHF and alignment tooling, ART focuses on agentic, multi-step rewards rather than single-turn preference tuning. Rewards can be rule-based or judge-based (the project ships a RULER approach for LLM-judged scoring), and training runs either on your own GPUs or through W&B Training (Serverless RL).

What it does

  • Trains multi-step agents with GRPO so models improve from their own trajectories
  • Wraps an existing Python agent loop instead of forcing a new framework
  • Supports rule-based and judge-based (RULER) reward functions
  • Serverless RL option via W&B Training manages training and inference infra for you
  • Ships runnable example notebooks (email search, 2048, MCP, Codenames, Tic Tac Toe) on open Qwen models
  • Integrates with agent stacks such as LangGraph and MCP servers

Getting started

Install the package, then define a trainable model and register a backend that runs the GRPO training loop. The fastest path is one of the example notebooks linked in the README.

Install ART

Install the package from PyPI into your Python environment.

bashbash
pip install openpipe-art

Define a trainable model and register a backend

Create a TrainableModel for your base model, then register a backend that handles training and inference. This example uses the W&B Serverless RL backend shown in the README.

pythonpython
from art.serverless.backend import ServerlessBackend

model = art.TrainableModel(
  project="voice-agent",
  name="agent-001",
  base_model="Qwen/Qwen3.6-27B"
)

backend = ServerlessBackend(
    api_key="your_wandb_api_key"
)
model.register(backend)

Start from an example notebook

Open one of the example notebooks (such as email search, 2048, or MCP-RL) to see a full loop of running the agent, scoring trajectories, and training. See the docs at art.openpipe.ai for details.

Commands and code are distilled from the project's own documentation — always check the official repo for the latest.

When to use it

  • Improving the reliability of a multi-step agent (e.g. email search or tool use) on a specific task
  • Teaching an open model to operate an MCP server or a LangGraph workflow through reinforcement learning
  • Training an agent with a custom reward when prompt engineering alone plateaus
  • Running RL post-training without managing GPU infrastructure via the serverless backend

How ART compares

ART alongside other open-source rlhf & alignment tools AI/TLDR tracks, ranked by GitHub stars.

ToolStarsWhat it does
Open-R1★ 26.3kAn open reproduction of the DeepSeek-R1 reasoning pipeline, with scripts for GRPO training and reasoning-data generation.
verl★ 22.1kVolcano Engine's RL post-training framework (HybridFlow) for building GRPO, PPO, and other RL pipelines on top of FSDP, Megatron, and vLLM.
TRL★ 18.7kHugging Face's post-training library with trainers for SFT, reward modeling, DPO, PPO, and GRPO to align language models with preferences.
Agent Lightning★ 17.3kAn open-source trainer from Microsoft that improves AI agents built with any framework using reinforcement learning, prompt optimization, and supervised fine-tuning.
ART★ 10.1kReinforcement learning to train multi-step LLM agents from experience with GRPO
OpenRLHF★ 9.7kA Ray- and vLLM-based RLHF framework that scales PPO, GRPO, and REINFORCE++ training to models with 70B+ parameters.
Alignment Handbook★ 5.6kA set of recipes and scripts from Hugging Face showing how to run the full SFT-then-preference-alignment pipeline used to build aligned chat models.
Verifiers★ 4.2kA library for defining verifiable-reward environments and running reinforcement-learning fine-tuning of LLMs against those rewards.