Overview
TRL (Transformer Reinforcement Learning) is a Hugging Face library for post-training foundation models. It gives you ready-made trainer classes for the common alignment methods: Supervised Fine-Tuning (SFT), reward modeling, Direct Preference Optimization (DPO), and Group Relative Policy Optimization (GRPO).
It is built on top of the Transformers ecosystem, so each trainer is a light wrapper around the standard Transformers trainer and works with many model architectures. It is aimed at machine learning engineers and researchers who want to fine-tune or align a model on their own dataset without writing a training loop from scratch.
In the RLHF and alignment space, TRL covers the full pipeline from instruction tuning through preference optimization. It uses Accelerate to scale from a single GPU to multi-node clusters, and integrates with PEFT for LoRA/QLoRA and Unsloth for faster training.
What it does
- Dedicated trainers for the main post-training methods: SFTTrainer, RewardTrainer, DPOTrainer, and GRPOTrainer
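- GRPOTrainer implements Group Relative Policy Optimization, which is more memory-efficient than PPO because it drops the separate value (critic) model and scores each completion against the other completions sampled for the same prompt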
- Scales from a single GPU to multi-node clusters via Accelerate, with support for DDP, DeepSpeed ZeRO, and FSDP
- PEFT integration enables training large models on modest hardware through quantization and LoRA/QLoRA
- Unsloth integration uses optimized kernels to speed up training
- A command-line interface lets you run SFT or DPO without writing code
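To make the "group relative" part of GRPO concrete, here is a minimal sketch of the advantage calculation, written in plain Python rather than taken from TRL. The function name `group_relative_advantages`, the use of the population standard deviation, and the `eps` value are assumptions for illustration; the library's own implementation may differ in these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Score each completion against the other completions for the same prompt.

    Hypothetical helper: the advantage is the reward minus the group mean,
    divided by the group standard deviation. No value model is involved.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; eps avoids division by zero
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions sampled for one prompt, with made-up rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# When every completion earns the same reward, no completion is preferred.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Completions that beat the group average get a positive advantage and the rest get a negative one, which is the signal the policy update uses in place of a learned critic's estimate.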
Getting started
Install TRL with pip, then pick a trainer for your method. The examples below come straight from the project README.
Install the package
Install TRL from PyPI.
pip install trl
Run supervised fine-tuning
Use SFTTrainer to fine-tune a model on a dataset. Pass a model id and a training dataset, then call train().
from trl import SFTTrainer
from datasets import load_dataset
# Load the training split of the example dataset
dataset = load_dataset("trl-lib/Capybara", split="train")
# Pass the model id directly; the trainer loads the model for you
trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    train_dataset=dataset,
)
trainer.train()
Align with preferences using DPO
DPOTrainer runs Direct Preference Optimization on a preference dataset.
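As background for the trainer call that follows, here is a rough sketch of the per-example loss that Direct Preference Optimization minimizes, in plain Python. It is not TRL's code: the function name `dpo_loss`, the argument names, and the example log-probabilities are hypothetical, and `beta=0.1` is only an illustrative value.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log(sigmoid(beta * margin)).

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) equals softplus(-m); this form stays stable for large |m|.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Policy identical to the reference: the loss is log(2).
neutral = dpo_loss(-11.0, -11.0, -11.0, -11.0)

# Policy has shifted probability toward the chosen response: the loss drops.
improved = dpo_loss(-10.0, -12.0, -11.0, -11.0)
```

The loss falls as the policy raises the chosen response's likelihood relative to the reference and lowers the rejected one's, which is why a preference dataset with chosen and rejected responses is all the method needs.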
from datasets import load_dataset
from trl import DPOTrainer
# Load a preference dataset
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
trainer = DPOTrainer(
    model="Qwen/Qwen3-0.6B",
    train_dataset=dataset,
)
trainer.train()
Or skip the code with the CLI
The CLI runs post-training methods like SFT directly from the terminal.
trl sft --model_name_or_path Qwen/Qwen2.5-0.5B \
    --dataset_name trl-lib/Capybara \
    --output_dir Qwen2.5-0.5B-SFT
Commands and code are distilled from the project's own documentation. Always check the official repo for the latest.
When to use it
- Instruction-tune a base model on your own dataset with SFTTrainer
- Align a model to human preferences using DPO or a trained reward model
- Apply GRPO to train reasoning or math models in a memory-efficient way
- Fine-tune large models on limited hardware via PEFT LoRA/QLoRA and Unsloth
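The arithmetic behind the limited-hardware point can be sketched as follows. LoRA freezes a weight matrix of shape `d_out x d_in` and trains two small factors of rank `r` instead, so the trainable parameter count for that matrix drops from `d_out * d_in` to `r * (d_in + d_out)`. The helper names and the 4096-wide layer below are hypothetical, chosen only to show the scale of the saving.

```python
def full_params(d_out, d_in):
    """Parameters updated when fine-tuning the whole weight matrix."""
    return d_out * d_in


def lora_params(d_out, d_in, rank):
    """Parameters updated by LoRA: B is (d_out x rank), A is (rank x d_in)."""
    return rank * (d_in + d_out)


# Hypothetical square projection layer and an illustrative rank.
d_out, d_in, rank = 4096, 4096, 16

full = full_params(d_out, d_in)
lora = lora_params(d_out, d_in, rank)
fraction = lora / full

print(f"full fine-tuning: {full:,} trainable parameters")
print(f"LoRA (rank {rank}): {lora:,} trainable parameters")
print(f"fraction trained: {fraction:.4%}")
```

For this layer the adapter trains under 1% of the parameters, and gradients and optimizer state shrink with it. QLoRA additionally keeps the frozen base weights in a quantized format to cut memory further.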
How TRL compares
TRL alongside the other open-source RLHF and alignment tools that AI/TLDR tracks, ranked by GitHub stars.
| Tool | Stars | What it does |
|---|---|---|
| Open-R1 | ★ 26.3k | An open reproduction of the DeepSeek-R1 reasoning pipeline, with scripts for GRPO training and reasoning-data generation. |
| verl | ★ 22k | Volcano Engine's RL post-training framework (HybridFlow) for building GRPO, PPO, and other RL pipelines on top of FSDP, Megatron, and vLLM. |
| TRL | ★ 18.7k | Post-train and align language models with SFT, DPO, GRPO, and reward modeling. |
| Agent Lightning | ★ 17.3k | An open-source trainer from Microsoft that improves AI agents built with any framework using reinforcement learning, prompt optimization, and supervised fine-tuning. |
| ART | ★ 10k | OpenPipe's Agent Reinforcement Trainer for post-training LLM agents on multi-step tasks using GRPO and rule- or judge-based rewards. |
| OpenRLHF | ★ 9.7k | A Ray- and vLLM-based RLHF framework that scales PPO, GRPO, and REINFORCE++ training to models with 70B+ parameters. |
| Alignment Handbook | ★ 5.6k | A set of recipes and scripts from Hugging Face showing how to run the full SFT-then-preference-alignment pipeline used to build aligned chat models. |
| Verifiers | ★ 4.2k | A library for defining verifiable-reward environments and running reinforcement-learning fine-tuning of LLMs against those rewards. |