TRL

Post-train and align language models with SFT, DPO, GRPO, and reward modeling

github.com/huggingface/trl★ 18.7k huggingface.co/docs/trl

Overview

TRL (Transformers Reinforcement Learning) is a Hugging Face library for post-training foundation models. It gives you ready-made trainer classes for the common alignment methods: Supervised Fine-Tuning (SFT), reward modeling, Direct Preference Optimization (DPO), and Group Relative Policy Optimization (GRPO).

It is built on top of the Transformers ecosystem, so each trainer is a light wrapper around the standard Transformers trainer and works with many model architectures. It is aimed at machine learning engineers and researchers who want to fine-tune or align a model on their own dataset without writing a training loop from scratch.

In the RLHF and alignment space, TRL covers the full pipeline from instruction tuning through preference optimization. It uses Accelerate to scale from a single GPU to multi-node clusters, and integrates with PEFT for LoRA/QLoRA and Unsloth for faster training.

What it does

Dedicated trainers for the main post-training methods: SFTTrainer, RewardTrainer, DPOTrainer, and GRPOTrainer
GRPOTrainer implements Group Relative Policy Optimization, which is more memory-efficient than PPO
Scales from a single GPU to multi-node clusters via Accelerate, with support for DDP, DeepSpeed ZeRO, and FSDP
PEFT integration enables training large models on modest hardware through quantization and LoRA/QLoRA
Unsloth integration uses optimized kernels to speed up training
Command line interface lets you run SFT or DPO without writing code

Getting started

Install TRL with pip, then pick a trainer for your method. The examples below come straight from the project README.

Install the package

Install TRL from PyPI.

bashbash

pip install trl

Run supervised fine-tuning

Use SFTTrainer to fine-tune a model on a dataset. Pass a model id and a training dataset, then call train().

pythonpython

from trl import SFTTrainer
from datasets import load_dataset

dataset = load_dataset("trl-lib/Capybara", split="train")

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    train_dataset=dataset,
)
trainer.train()

Align with preferences using DPO

DPOTrainer runs Direct Preference Optimization on a preference dataset.

pythonpython

from datasets import load_dataset
from trl import DPOTrainer

dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

trainer = DPOTrainer(
    model="Qwen/Qwen3-0.6B",
    train_dataset=dataset,
)
trainer.train()

Or skip the code with the CLI

The CLI runs post-training methods like SFT directly from the terminal.

bashbash

trl sft --model_name_or_path Qwen/Qwen2.5-0.5B \
    --dataset_name trl-lib/Capybara \
    --output_dir Qwen2.5-0.5B-SFT

Commands and code are distilled from the project's own documentation — always check the official repo for the latest.

When to use it

Instruction-tune a base model on your own dataset with SFTTrainer
Align a model to human preferences using DPO or a trained reward model
Apply GRPO to train reasoning or math models in a memory-efficient way
Fine-tune large models on limited hardware via PEFT LoRA/QLoRA and Unsloth

How TRL compares

TRL alongside other open-source rlhf & alignment tools AI/TLDR tracks, ranked by GitHub stars.

Tool	Stars	What it does
Open-R1	★ 26.3k	An open reproduction of the DeepSeek-R1 reasoning pipeline, with scripts for GRPO training and reasoning-data generation.
verl	★ 22k	Volcano Engine's RL post-training framework (HybridFlow) for building GRPO, PPO, and other RL pipelines on top of FSDP, Megatron, and vLLM.
TRL	★ 18.7k	Post-train and align language models with SFT, DPO, GRPO, and reward modeling
Agent Lightning	★ 17.3k	An open-source trainer from Microsoft that improves AI agents built with any framework using reinforcement learning, prompt optimization, and supervised fine-tuning.
ART	★ 10k	OpenPipe's Agent Reinforcement Trainer for post-training LLM agents on multi-step tasks using GRPO and rule- or judge-based rewards.
OpenRLHF	★ 9.7k	A Ray- and vLLM-based RLHF framework that scales PPO, GRPO, and REINFORCE++ training to models with 70B+ parameters.
Alignment Handbook	★ 5.6k	A set of recipes and scripts from Hugging Face showing how to run the full SFT-then-preference-alignment pipeline used to build aligned chat models.
Verifiers	★ 4.2k	A library for defining verifiable-reward environments and running reinforcement-learning fine-tuning of LLMs against those rewards.

// Overview

// What it does

// Getting started