AI/TLDR

TRL

Post-train and align language models with SFT, DPO, GRPO, and reward modeling

Overview

TRL (Transformer Reinforcement Learning) is a Hugging Face library for post-training foundation models. It gives you ready-made trainer classes for the common alignment methods: Supervised Fine-Tuning (SFT), reward modeling, Direct Preference Optimization (DPO), and Group Relative Policy Optimization (GRPO).

It is built on top of the Transformers ecosystem, so each trainer is a light wrapper around the standard Transformers trainer and works with many model architectures. It is aimed at machine learning engineers and researchers who want to fine-tune or align a model on their own dataset without writing a training loop from scratch.

In the RLHF and alignment space, TRL covers the full pipeline from instruction tuning through preference optimization. It uses Accelerate to scale from a single GPU to multi-node clusters, and integrates with PEFT for LoRA/QLoRA and Unsloth for faster training.

What it does

  • Dedicated trainers for the main post-training methods: SFTTrainer, RewardTrainer, DPOTrainer, and GRPOTrainer
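  • GRPOTrainer implements Group Relative Policy Optimization, which uses less memory than PPO because it estimates advantages from a group of sampled completions instead of training a separate value model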
  • Scales from a single GPU to multi-node clusters via Accelerate, with support for DDP, DeepSpeed ZeRO, and FSDP
  • PEFT integration enables training large models on modest hardware through quantization and LoRA/QLoRA
  • Unsloth integration uses optimized kernels to speed up training
  • Command line interface lets you run SFT or DPO without writing code
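
To make the GRPO bullet concrete, here is a minimal sketch of the group-relative advantage idea: several completions are sampled for one prompt, and each reward is normalized against the mean and spread of its own group. The function name, the `eps` value, and the use of the population standard deviation are assumptions made for illustration. This is not TRL's implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize each reward against the mean and std of its own group.

    Illustrative only: eps and the population std are assumptions,
    not values taken from TRL.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every completion gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions sampled for one prompt, scored by some reward function.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Completions that score above their group's average get a positive advantage and the rest get a negative one, so no learned value model is needed to provide the baseline.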

Getting started

Install TRL with pip, then pick a trainer for your method. The examples below come straight from the project README.

Install the package

Install TRL from PyPI.

bash
pip install trl

Run supervised fine-tuning

Use SFTTrainer to fine-tune a model on a dataset. Pass a model id and a training dataset, then call train().

python
from trl import SFTTrainer
from datasets import load_dataset

dataset = load_dataset("trl-lib/Capybara", split="train")

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    train_dataset=dataset,
)
trainer.train()
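
If the full model does not fit on your hardware, the PEFT integration mentioned earlier lets SFTTrainer train LoRA adapters instead of all weights. The sketch below first does the arithmetic that explains the saving, then shows one way the wiring might look. The helper names, the 4096 x 4096 layer size, and the LoRA hyperparameters are illustrative assumptions, and `peft_config` behavior can differ between TRL versions.

```python
def lora_extra_params(d_in, d_out, rank):
    """Parameters added by one LoRA adapter on a d_out x d_in weight matrix.

    LoRA learns two low-rank factors: A (rank x d_in) and B (d_out x rank).
    """
    return rank * (d_in + d_out)


# Hypothetical 4096 x 4096 projection layer with a rank-16 adapter.
full = 4096 * 4096
extra = lora_extra_params(4096, 4096, rank=16)
print(f"full matrix: {full:,} params, LoRA adapter: {extra:,} params")


def build_lora_sft_trainer():
    """Hypothetical wiring. Imports are local so the arithmetic above
    runs even without trl, peft, or datasets installed."""
    from datasets import load_dataset
    from peft import LoraConfig
    from trl import SFTTrainer

    dataset = load_dataset("trl-lib/Capybara", split="train")
    # Illustrative hyperparameters, not recommendations.
    peft_config = LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        task_type="CAUSAL_LM",
    )
    return SFTTrainer(
        model="Qwen/Qwen2.5-0.5B",
        train_dataset=dataset,
        peft_config=peft_config,
    )
```

Calling `build_lora_sft_trainer().train()` would download the model and dataset, so it is left out of the runnable part.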

Align with preferences using DPO

DPOTrainer runs Direct Preference Optimization on a preference dataset.

python
from datasets import load_dataset
from trl import DPOTrainer

dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

trainer = DPOTrainer(
    model="Qwen/Qwen3-0.6B",
    train_dataset=dataset,
)
trainer.train()
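
For intuition about what the trainer optimizes, the per-pair DPO objective can be written in a few lines of plain Python. This is a sketch of the published loss, not DPOTrainer's code. The function name is hypothetical, the inputs are assumed to be summed sequence log-probabilities, and `beta=0.1` is an illustrative value.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen_logratio - rejected_logratio)))."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow for large |m|.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Policy identical to the reference: zero margin, so the loss is log(2).
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy has moved toward the chosen answer and away from the rejected one.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)

print(baseline, improved)
```

The loss drops as the policy raises the chosen completion's likelihood relative to the reference model faster than it raises the rejected one's, which is why a preference dataset with chosen and rejected responses is enough and no separate reward model is required.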

Or skip the code with the CLI

The CLI runs post-training methods like SFT directly from the terminal.

bash
trl sft --model_name_or_path Qwen/Qwen2.5-0.5B \
    --dataset_name trl-lib/Capybara \
    --output_dir Qwen2.5-0.5B-SFT

Commands and code are distilled from the project's own documentation. Always check the official repo for the latest versions.

When to use it

  • Instruction-tune a base model on your own dataset with SFTTrainer
  • Align a model to human preferences using DPO or a trained reward model
  • Apply GRPO to train reasoning or math models in a memory-efficient way
  • Fine-tune large models on limited hardware via PEFT LoRA/QLoRA and Unsloth
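
The GRPO use case has no example above, so here is a hedged sketch of how it might look. GRPOTrainer takes a reward function that receives the sampled completions and returns one score per completion. The toy length-based reward, the `trl-lib/tldr` dataset, the model id, and the output directory below are assumptions chosen for illustration, and argument names may differ between TRL versions. A real reasoning or math setup would score correctness instead of length.

```python
def reward_len(completions, **kwargs):
    """Toy reward: completions closer to 20 characters score higher.

    Assumes plain-string completions; extra keyword arguments are ignored.
    """
    return [float(-abs(20 - len(completion))) for completion in completions]


print(reward_len(["short", "x" * 20]))


def build_grpo_trainer():
    """Hypothetical wiring. Imports are local so the reward function
    above can be run and tested without trl or datasets installed."""
    from datasets import load_dataset
    from trl import GRPOConfig, GRPOTrainer

    dataset = load_dataset("trl-lib/tldr", split="train")
    args = GRPOConfig(output_dir="Qwen2-0.5B-GRPO")
    return GRPOTrainer(
        model="Qwen/Qwen2-0.5B-Instruct",
        reward_funcs=reward_len,
        args=args,
        train_dataset=dataset,
    )
```

Keeping the reward as an ordinary function makes it easy to unit-test before spending GPU time on a training run.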

How TRL compares

TRL alongside the other open-source RLHF and alignment tools that AI/TLDR tracks, ranked by GitHub stars.

| Tool | Stars | What it does |
| --- | --- | --- |
| Open-R1 | ★ 26.3k | An open reproduction of the DeepSeek-R1 reasoning pipeline, with scripts for GRPO training and reasoning-data generation. |
| verl | ★ 22k | Volcano Engine's RL post-training framework (HybridFlow) for building GRPO, PPO, and other RL pipelines on top of FSDP, Megatron, and vLLM. |
| TRL | ★ 18.7k | Post-train and align language models with SFT, DPO, GRPO, and reward modeling. |
| Agent Lightning | ★ 17.3k | An open-source trainer from Microsoft that improves AI agents built with any framework using reinforcement learning, prompt optimization, and supervised fine-tuning. |
| ART | ★ 10k | OpenPipe's Agent Reinforcement Trainer for post-training LLM agents on multi-step tasks using GRPO and rule- or judge-based rewards. |
| OpenRLHF | ★ 9.7k | A Ray- and vLLM-based RLHF framework that scales PPO, GRPO, and REINFORCE++ training to models with 70B+ parameters. |
| Alignment Handbook | ★ 5.6k | A set of recipes and scripts from Hugging Face showing how to run the full SFT-then-preference-alignment pipeline used to build aligned chat models. |
| Verifiers | ★ 4.2k | A library for defining verifiable-reward environments and running reinforcement-learning fine-tuning of LLMs against those rewards. |