verl

Reinforcement learning post-training for LLMs, from GRPO and PPO to large MoE models

github.com/verl-project/verl | ★ 22.1k | verl.readthedocs.io

Overview

verl is a reinforcement learning training library for large language models, started by the ByteDance Seed team and maintained by the verl community. It is the open-source implementation of the HybridFlow paper and focuses on the RL post-training stage that follows pretraining and supervised fine-tuning.

It is aimed at researchers and ML engineers who want to run RL algorithms such as GRPO and PPO on top of existing model infrastructure. verl decouples computation from data dependencies in the RL pipeline, so it plugs into training backends such as FSDP and Megatron-LM and inference engines such as vLLM and SGLang, and it works with HuggingFace models.

Within the RLHF and alignment category, verl handles the training loop itself: generating rollouts, scoring them, and updating the policy. It supports flexible mapping of models onto GPUs, which lets the same code scale from a single 24 GB GPU demo to large clusters training MoE models.

What it does

Build GRPO, PPO, and other RL dataflows in a few lines of code with the hybrid-controller programming model
Integrate with existing training backends (FSDP, Megatron-LM) and inference engines (vLLM, SGLang) through modular APIs
Map models flexibly onto different sets of GPUs for better resource use across cluster sizes
Reshard the actor model with 3D-HybridEngine to cut memory redundancy and communication overhead when switching between training and generation
Use HuggingFace models out of the box
Scale from a single-GPU example up to large MoE models such as DeepSeek-671B and Qwen3-235B with the Megatron backend

Getting started

Install verl from source, preprocess a dataset, then launch a PPO training run. A GPU with at least 24 GB of memory is recommended for the demo; Python >= 3.10 and CUDA >= 12.8 are required.
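The requirements above can be checked before installing anything. The following is a minimal preflight sketch, not part of verl: the helper names `version_ge` and `report` are hypothetical, the 24000 MiB threshold is an assumed stand-in for "24 GB", and the comparison relies on GNU `sort -V`. Tools that are not installed are reported as "not detected" rather than treated as errors.

```bash
#!/usr/bin/env bash
# Preflight sketch: compare detected versions against the minimums quoted above.

# version_ge A B: succeeds when dotted version A >= B (uses GNU `sort -V`).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# report NAME FOUND MINIMUM: print one status line per requirement.
report() {
  if [ -z "$2" ]; then
    echo "$1: not detected (need >= $3)"
  elif version_ge "$2" "$3"; then
    echo "$1: $2 ok (need >= $3)"
  else
    echo "$1: $2 is below the minimum $3"
  fi
}

# Major.minor of the python3 on PATH; empty if python3 is missing.
py_version="$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])' 2>/dev/null)"
# nvcc prints a line containing "release X.Y,"; empty if nvcc is missing.
cuda_version="$(nvcc --version 2>/dev/null | sed -n 's/.*release \([0-9.]*\).*/\1/p')"
# Total memory of the first GPU in MiB; empty if nvidia-smi is missing.
gpu_mib="$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits 2>/dev/null | head -n 1)"

report "Python" "$py_version" "3.10"
report "CUDA toolkit" "$cuda_version" "12.8"
report "GPU memory (MiB)" "$gpu_mib" "24000"   # assumed threshold for a 24 GB card
```

A version-aware comparison matters here because a plain string comparison would rank 3.9 above 3.10.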

Install from source

Clone the repository and install it in editable mode. Add the vllm or sglang extra to pull in an inference engine.

bash

git clone https://github.com/verl-project/verl.git
cd verl
pip install --no-deps -e .
pip install -e ".[vllm]"
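After the install, it can help to confirm that the packages are visible to the interpreter you plan to train with. This is a small sketch under the assumption that the packages import as `verl` and `vllm`; `has_module` is a hypothetical helper that only locates a module without importing it.

```bash
#!/usr/bin/env bash
# has_module NAME: succeeds if python3 can locate the top-level module NAME.
has_module() {
  python3 -c 'import importlib.util, sys; sys.exit(0 if importlib.util.find_spec(sys.argv[1]) else 1)' "$1" 2>/dev/null
}

for module in verl vllm; do
  if has_module "$module"; then
    echo "$module: importable"
  else
    echo "$module: not found in this environment"
  fi
done
```

If you installed the sglang extra instead, swap `vllm` for `sglang` in the loop.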

Prepare the dataset

Preprocess the GSM8K math dataset into the parquet files verl expects.

bash

python3 examples/data_preprocess/gsm8k.py --local_save_dir ~/data/gsm8k
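Before launching training, a quick check that the preprocessing step produced both files can save a failed run. The sketch below assumes the output names `train.parquet` and `test.parquet` used by the training command in this guide; `check_gsm8k_dir` and the `GSM8K_DIR` variable are hypothetical names for illustration.

```bash
#!/usr/bin/env bash
# check_gsm8k_dir DIR: succeeds only if DIR holds non-empty train.parquet and test.parquet.
check_gsm8k_dir() {
  local dir="$1" missing=0 name
  for name in train.parquet test.parquet; do
    if [ -s "$dir/$name" ]; then
      echo "found   $dir/$name"
    else
      echo "missing $dir/$name"
      missing=1
    fi
  done
  return "$missing"
}

# Default to the save directory used in the preprocessing command.
check_gsm8k_dir "${GSM8K_DIR:-$HOME/data/gsm8k}" || echo "re-run the preprocessing step before training"
```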

Run a PPO training job

Launch the PPO trainer on a small Qwen model. This is the minimal single-GPU demo from the docs; tune the config keys for your hardware.

bash

PYTHONUNBUFFERED=1 python3 -m verl.trainer.main_ppo \
  data.train_files=$HOME/data/gsm8k/train.parquet \
  data.val_files=$HOME/data/gsm8k/test.parquet \
  actor_rollout_ref.model.path=Qwen/Qwen2.5-0.5B-Instruct \
  actor_rollout_ref.rollout.name=vllm \
  critic.model.path=Qwen/Qwen2.5-0.5B-Instruct \
  trainer.n_gpus_per_node=1 \
  trainer.nnodes=1 \
  trainer.total_epochs=15 2>&1 | tee verl_demo.log
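One way to adjust the demo for different hardware is to generate the overrides from a few variables. The sketch below reuses only the config keys from the command above; `ppo_args` and the `MODEL`, `DATA_DIR`, `N_GPUS`, and `EPOCHS` environment variables are hypothetical names, not verl options. It prints the command as a dry run instead of starting a job.

```bash
#!/usr/bin/env bash
# ppo_args: print one key=value override per line, using the keys from the demo command.
# MODEL, DATA_DIR, N_GPUS and EPOCHS are hypothetical variables for this sketch.
ppo_args() {
  local model="${MODEL:-Qwen/Qwen2.5-0.5B-Instruct}"
  local data_dir="${DATA_DIR:-$HOME/data/gsm8k}"
  printf '%s\n' \
    "data.train_files=$data_dir/train.parquet" \
    "data.val_files=$data_dir/test.parquet" \
    "actor_rollout_ref.model.path=$model" \
    "actor_rollout_ref.rollout.name=vllm" \
    "critic.model.path=$model" \
    "trainer.n_gpus_per_node=${N_GPUS:-1}" \
    "trainer.nnodes=1" \
    "trainer.total_epochs=${EPOCHS:-15}"
}

# Dry run: show the command that would be executed.
echo "python3 -m verl.trainer.main_ppo $(ppo_args | tr '\n' ' ')"

# To launch for real (bash 4+), read the overrides into an array:
#   mapfile -t overrides < <(ppo_args)
#   PYTHONUNBUFFERED=1 python3 -m verl.trainer.main_ppo "${overrides[@]}" 2>&1 | tee verl_demo.log
```

For example, `N_GPUS=2 EPOCHS=3 bash script.sh` (with the block saved under a name of your choice) would print the same command with those two values changed.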

Commands and code are distilled from the project's own documentation — always check the official repo for the latest.

When to use it

Post-train an instruction-tuned model with GRPO or PPO to improve reasoning on math and coding tasks
Reproduce or build on published RL recipes such as DAPO and ReTool
Scale RL training of large MoE models across multi-GPU clusters using the Megatron backend
Experiment with custom reward functions and RL dataflows while reusing existing FSDP, vLLM, or SGLang infrastructure

How verl compares

verl alongside the other open-source RLHF and alignment tools that AI/TLDR tracks, ranked by GitHub stars.

Tool	Stars	What it does
Open-R1	★ 26.3k	An open reproduction of the DeepSeek-R1 reasoning pipeline, with scripts for GRPO training and reasoning-data generation.
verl	★ 22.1k	Reinforcement learning post-training for LLMs, from GRPO and PPO to large MoE models
TRL	★ 18.7k	Hugging Face's post-training library with trainers for SFT, reward modeling, DPO, PPO, and GRPO to align language models with preferences.
Agent Lightning	★ 17.3k	An open-source trainer from Microsoft that improves AI agents built with any framework using reinforcement learning, prompt optimization, and supervised fine-tuning.
ART	★ 10.1k	OpenPipe's Agent Reinforcement Trainer for post-training LLM agents on multi-step tasks using GRPO and rule- or judge-based rewards.
OpenRLHF	★ 9.7k	A Ray- and vLLM-based RLHF framework that scales PPO, GRPO, and REINFORCE++ training to models with 70B+ parameters.
Alignment Handbook	★ 5.6k	A set of recipes and scripts from Hugging Face showing how to run the full SFT-then-preference-alignment pipeline used to build aligned chat models.
Verifiers	★ 4.2k	A library for defining verifiable-reward environments and running reinforcement-learning fine-tuning of LLMs against those rewards.
