AI/TLDR

verl

Reinforcement learning post-training for LLMs, from GRPO and PPO to large MoE models

Overview

verl is a reinforcement learning training library for large language models, started by the ByteDance Seed team and maintained by the verl community. It is the open-source version of the HybridFlow paper, and it focuses on the RL post-training stage that comes after pretraining and supervised fine-tuning.

It is aimed at researchers and ML engineers who want to run RL algorithms such as GRPO and PPO on top of existing model infrastructure. verl decouples the computation and data parts of an RL pipeline, so it can plug into training backends like FSDP and Megatron-LM and inference engines like vLLM and SGLang, and it works with HuggingFace models.

Within the RLHF and alignment category, verl handles the training loop itself: generating rollouts, scoring them, and updating the policy. It supports flexible mapping of models onto GPUs, which lets the same code scale from a single 24 GB GPU demo to large clusters training MoE models.
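The generate, score, update cycle described above can be illustrated with a toy example. This is a minimal sketch in plain Python, not verl code: `toy_rollouts` and `toy_score` are hypothetical stand-ins for an inference engine and a reward function, and the policy update itself is left out. It shows only the group-relative advantage step that GRPO-style methods use, where each reward is normalized against the other samples drawn for the same prompt. Using the population standard deviation and the `eps` value are assumptions of this sketch.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own group's mean and std (GRPO-style)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std: an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]


def toy_rollouts(prompt, n):
    # Hypothetical stand-in for an inference engine: returns n candidate answers.
    return [str(3 + i) for i in range(n)]


def toy_score(response, ground_truth):
    # Rule-based reward: 1.0 for an exact match, otherwise 0.0.
    return 1.0 if response == ground_truth else 0.0


# 1. Generate a group of rollouts for one prompt.
responses = toy_rollouts("What is 2 + 2?", n=4)
# 2. Score each rollout.
rewards = [toy_score(r, "4") for r in responses]
# 3. Convert rewards to advantages; a real trainer would feed these to the policy update.
advantages = group_relative_advantages(rewards)

for response, adv in zip(responses, advantages):
    print(response, round(adv, 3))
```

The correct answer ends up with a positive advantage and the incorrect ones with negative advantages, which is the signal a policy-gradient update would then use.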

What it does

  • Build GRPO, PPO, and other RL dataflows with the hybrid-controller programming model in a few lines of code
  • Integrates with existing training backends (FSDP, Megatron-LM) and inference engines (vLLM, SGLang) through modular APIs
  • Flexible device mapping places models on different sets of GPUs for better resource use across cluster sizes
  • 3D-HybridEngine reshards the actor model to cut memory redundancy and communication when switching between training and generation
  • Works with HuggingFace models out of the box
  • Scales from a single-GPU example up to large MoE models such as DeepSeek-671B and Qwen3-235B with the Megatron backend

Getting started

Install verl from source, preprocess a dataset, then launch a PPO training run. A GPU with at least 24 GB of memory is recommended for the demo; Python >= 3.10 and CUDA >= 12.8 are required.

Install from source

Clone the repository and install it in editable mode. Add the vllm or sglang extra to pull in an inference engine.

bash
git clone https://github.com/verl-project/verl.git
cd verl
pip install --no-deps -e .
pip install -e ".[vllm]"

Prepare the dataset

Preprocess the GSM8K math dataset into the parquet files verl expects.

bash
python3 examples/data_preprocess/gsm8k.py --local_save_dir ~/data/gsm8k

Run a PPO training job

Launch the PPO trainer on a small Qwen model. This is the minimal single-GPU demo from the docs; tune the config keys for your hardware.

bash
PYTHONUNBUFFERED=1 python3 -m verl.trainer.main_ppo \
  data.train_files=$HOME/data/gsm8k/train.parquet \
  data.val_files=$HOME/data/gsm8k/test.parquet \
  actor_rollout_ref.model.path=Qwen/Qwen2.5-0.5B-Instruct \
  actor_rollout_ref.rollout.name=vllm \
  critic.model.path=Qwen/Qwen2.5-0.5B-Instruct \
  trainer.n_gpus_per_node=1 \
  trainer.nnodes=1 \
  trainer.total_epochs=15 2>&1 | tee verl_demo.log
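When tuning the config keys for different hardware, it can help to keep the overrides in one place and generate the command from them. The following is a minimal sketch, assuming the same `key=value` override style as the command above; `build_command` is a hypothetical helper, not part of verl, and it only prints the command rather than running it.

```python
import os
import shlex


def build_command(overrides, module="verl.trainer.main_ppo"):
    """Turn a dict of config overrides into an argv list of key=value pairs."""
    argv = ["python3", "-m", module]
    argv += [f"{key}={value}" for key, value in overrides.items()]
    return argv


data_dir = os.path.expanduser("~/data/gsm8k")

# Same keys as the demo command; adjust the values for your hardware.
overrides = {
    "data.train_files": os.path.join(data_dir, "train.parquet"),
    "data.val_files": os.path.join(data_dir, "test.parquet"),
    "actor_rollout_ref.model.path": "Qwen/Qwen2.5-0.5B-Instruct",
    "actor_rollout_ref.rollout.name": "vllm",
    "critic.model.path": "Qwen/Qwen2.5-0.5B-Instruct",
    "trainer.n_gpus_per_node": 1,
    "trainer.nnodes": 1,
    "trainer.total_epochs": 15,
}

argv = build_command(overrides)
print(shlex.join(argv))  # shell-quoted command line, ready to paste or pass to subprocess
```

Changing a value such as `trainer.n_gpus_per_node` in the dict then regenerates the full command without editing a long shell line by hand.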

Commands and code are distilled from the project's own documentation; always check the official repo for the latest.

When to use it

  • Post-train an instruction-tuned model with GRPO or PPO to improve reasoning on math and coding tasks
  • Reproduce or build on published RL recipes such as DAPO and ReTool
  • Scale RL training of large MoE models across multi-GPU clusters using the Megatron backend
  • Experiment with custom reward functions and RL dataflows while reusing existing FSDP, vLLM, or SGLang infrastructure
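For the custom reward functions mentioned in the last item, math tasks often use a simple rule-based check. The sketch below assumes GSM8K-style text in which the final answer follows a `####` marker; the function names and the two-argument signature are hypothetical and chosen for illustration, so check the verl documentation for the interface it actually expects.

```python
import re

# GSM8K-style final-answer marker, e.g. "... #### 1,234"
_ANSWER_RE = re.compile(r"####\s*(-?[0-9][0-9.,]*)")


def extract_answer(text):
    """Return the last '#### <number>' value in text with commas removed, or None."""
    matches = _ANSWER_RE.findall(text)
    if not matches:
        return None
    # Drop thousands separators and any trailing sentence period.
    return matches[-1].replace(",", "").rstrip(".")


def compute_score(solution_str, ground_truth):
    """Hypothetical rule-based reward: 1.0 if the extracted answer matches, else 0.0."""
    answer = extract_answer(solution_str)
    return 1.0 if answer is not None and answer == ground_truth else 0.0


print(compute_score("6 * 12 = 72 #### 72", "72"))
print(compute_score("I think the answer is 70", "72"))
```

A binary reward like this is easy to verify, but it gives no partial credit; a format bonus or a tolerance for numeric comparison are common variations.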

How verl compares

verl alongside the other open-source RLHF & alignment tools that AI/TLDR tracks, ranked by GitHub stars.

Tool | Stars | What it does
Open-R1 | ★ 26.3k | An open reproduction of the DeepSeek-R1 reasoning pipeline, with scripts for GRPO training and reasoning-data generation.
verl | ★ 22.1k | Reinforcement learning post-training for LLMs, from GRPO and PPO to large MoE models
TRL | ★ 18.7k | Hugging Face's post-training library with trainers for SFT, reward modeling, DPO, PPO, and GRPO to align language models with preferences.
Agent Lightning | ★ 17.3k | An open-source trainer from Microsoft that improves AI agents built with any framework using reinforcement learning, prompt optimization, and supervised fine-tuning.
ART | ★ 10.1k | OpenPipe's Agent Reinforcement Trainer for post-training LLM agents on multi-step tasks using GRPO and rule- or judge-based rewards.
OpenRLHF | ★ 9.7k | A Ray- and vLLM-based RLHF framework that scales PPO, GRPO, and REINFORCE++ training to models with 70B+ parameters.
Alignment Handbook | ★ 5.6k | A set of recipes and scripts from Hugging Face showing how to run the full SFT-then-preference-alignment pipeline used to build aligned chat models.
Verifiers | ★ 4.2k | A library for defining verifiable-reward environments and running reinforcement-learning fine-tuning of LLMs against those rewards.