Overview
OpenRLHF is an open-source framework for reinforcement learning from human feedback (RLHF), the process of aligning a language model with human preferences after pretraining. It is built on Ray for distributed scheduling and vLLM for fast generation, with DeepSpeed handling training. This split lets it place the actor, reward, reference, and critic models on separate GPUs and scale to models with 70B+ parameters.
It is aimed at ML engineers and researchers who want to run the full RLHF pipeline—supervised fine-tuning, reward modeling, and reinforcement learning—without building the distributed plumbing themselves. It ships command-line entry points and example scripts for common setups, so you can start from a working recipe and adjust it.
Within the RLHF and alignment category, OpenRLHF focuses on the training infrastructure rather than on datasets or evaluation. It supports several RL algorithms (PPO, GRPO, REINFORCE++, RLOO) and includes a hybrid engine mode that lets models and vLLM engines share GPU resources to reduce idle time on limited hardware.
What it does
- Built on a Ray + vLLM + DeepSpeed distributed stack that separates actor, reward, reference, and critic models across GPUs
- Supports multiple RL algorithms: PPO, GRPO, REINFORCE++, and RLOO
- Scales RLHF training to models with 70B+ parameters
- Hybrid engine scheduling lets models and vLLM engines share GPUs to cut idle time
- Covers the full pipeline: supervised fine-tuning, reward modeling, and RL training
- Async RLHF and agent-based RLHF via --train.async_enable and --train.agent_func_path, plus optional LoRA
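To make the difference between two of the listed algorithms concrete, here is a minimal sketch of how GRPO and RLOO turn a group of rewards for the same prompt into per-sample advantages. It follows the textbook formulations, not OpenRLHF's actual code: the function names, the use of the population standard deviation, and the `eps` value are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: standardize each reward within its group.

    Assumption: population standard deviation; real implementations may
    use the sample standard deviation or a different epsilon.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def rloo_advantages(rewards):
    """Leave-one-out advantages: the baseline for sample i is the mean
    reward of all the other samples drawn for the same prompt."""
    n = len(rewards)
    if n < 2:
        raise ValueError("RLOO needs at least two samples per prompt")
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


# Hypothetical rewards for four responses sampled from one prompt.
group = [1.0, 0.0, 0.0, 1.0]
print(grpo_advantages(group))
print(rloo_advantages(group))
```

Both approaches avoid training a separate critic by using the other samples in the group as a baseline; they differ in whether the advantage is also rescaled by the group's spread.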
Getting started
Install OpenRLHF with pip (vLLM is an optional extra), then launch one of its CLI training entry points such as supervised fine-tuning.
Install with pip
Install the base package, or add the vLLM extra to accelerate generation during RL training.
pip install openrlhf # Basic
pip install openrlhf[vllm] # + vLLM
Or install from source
Clone the repository and install in editable mode if you want to modify the code or run the example scripts.
git clone https://github.com/OpenRLHF/OpenRLHF.git
cd OpenRLHF
pip install -e .
Run supervised fine-tuning
Launch the SFT entry point with DeepSpeed. This example fine-tunes Llama 3 8B on the OpenOrca dataset.
deepspeed --module openrlhf.cli.train_sft \
--data.max_len 4096 \
--data.dataset Open-Orca/OpenOrca \
--data.input_key question \
--data.output_key response \
--train.batch_size 256 \
--train.micro_batch_size 2 \
--actor.model_name_or_path meta-llama/Meta-Llama-3-8B \
--ckpt.output_dir ./checkpoint/llama3-8b-sft \
--ds.zero_stage 2 \
--train.max_epochs 1 \
--adam.lr 5e-6
Commands and code are distilled from the project's own documentation; always check the official repo for the latest.
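The batch flags in the SFT command interact with the number of GPUs. As a rough sketch, assuming `--train.batch_size` is the global batch size and `--train.micro_batch_size` is the per-GPU micro-batch (which matches DeepSpeed's relation train batch = micro-batch per GPU × gradient accumulation steps × number of GPUs), the implied gradient accumulation steps can be worked out as follows. The helper name and the GPU counts are hypothetical.

```python
def grad_accum_steps(global_batch, micro_batch_per_gpu, num_gpus):
    """Gradient accumulation steps implied by the batch settings.

    Assumes: global_batch = micro_batch_per_gpu * accum_steps * num_gpus.
    """
    samples_per_step = micro_batch_per_gpu * num_gpus
    if global_batch % samples_per_step != 0:
        raise ValueError(
            "global batch must be divisible by micro_batch_per_gpu * num_gpus"
        )
    return global_batch // samples_per_step


# Values from the SFT command: batch size 256, micro-batch size 2.
for gpus in (1, 4, 8):
    print(gpus, "GPU(s):", grad_accum_steps(256, 2, gpus), "accumulation steps")
```

Under these assumptions, adding GPUs lowers the number of accumulation steps while the effective batch size stays at 256, so the learning rate does not need to change when you scale out.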
When to use it
- Aligning an open-weight LLM with human preferences using PPO, GRPO, REINFORCE++, or RLOO
- Running RLHF training on large models (70B+ parameters) across multiple GPUs without writing the distributed scheduling yourself
- Reproducing reasoning-model RL recipes (e.g. DeepSeek-R1-style training) from the provided example scripts
- Training reward models and running supervised fine-tuning as the first stages of an alignment pipeline
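For readers new to the reward-modeling stage, the sketch below shows the standard pairwise (Bradley-Terry) objective commonly used to train reward models on preference pairs: the loss is low when the chosen response scores higher than the rejected one. This is the general formulation, not necessarily the exact loss OpenRLHF uses, and the function name and example scores are hypothetical.

```python
import math


def pairwise_reward_loss(chosen_score, rejected_score):
    """Compute -log(sigmoid(chosen - rejected)) in a numerically stable way."""
    margin = chosen_score - rejected_score
    if margin > 0:
        # log(1 + exp(-margin)); exp(-margin) cannot overflow here.
        return math.log1p(math.exp(-margin))
    # Same quantity rewritten so exp() only sees a non-positive argument.
    return -margin + math.log1p(math.exp(margin))


# Hypothetical scalar scores from a reward model for one preference pair.
print(pairwise_reward_loss(2.0, 0.0))  # chosen ranked higher: small loss
print(pairwise_reward_loss(0.0, 2.0))  # chosen ranked lower: larger loss
```

Only the difference between the two scores matters, which is why reward models trained this way produce scores that are meaningful relative to each other rather than on an absolute scale.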
How OpenRLHF compares
OpenRLHF alongside the other open-source RLHF and alignment tools that AI/TLDR tracks, ranked by GitHub stars.
| Tool | Stars | What it does |
|---|---|---|
| Open-R1 | ★ 26.3k | An open reproduction of the DeepSeek-R1 reasoning pipeline, with scripts for GRPO training and reasoning-data generation. |
| verl | ★ 22.1k | Volcano Engine's RL post-training framework (HybridFlow) for building GRPO, PPO, and other RL pipelines on top of FSDP, Megatron, and vLLM. |
| TRL | ★ 18.7k | Hugging Face's post-training library with trainers for SFT, reward modeling, DPO, PPO, and GRPO to align language models with preferences. |
| Agent Lightning | ★ 17.3k | An open-source trainer from Microsoft that improves AI agents built with any framework using reinforcement learning, prompt optimization, and supervised fine-tuning. |
| ART | ★ 10.1k | OpenPipe's Agent Reinforcement Trainer for post-training LLM agents on multi-step tasks using GRPO and rule- or judge-based rewards. |
| OpenRLHF | ★ 9.7k | Scalable RLHF training for large language models on Ray and vLLM. |
| Alignment Handbook | ★ 5.6k | A set of recipes and scripts from Hugging Face showing how to run the full SFT-then-preference-alignment pipeline used to build aligned chat models. |
| Verifiers | ★ 4.2k | A library for defining verifiable-reward environments and running reinforcement-learning fine-tuning of LLMs against those rewards. |