AI/TLDR

OpenRLHF

Scalable RLHF training for large language models on Ray and vLLM

Overview

OpenRLHF is an open-source framework for reinforcement learning from human feedback (RLHF), the process of aligning a language model with human preferences after pretraining. It is built on Ray for distributed scheduling and vLLM for fast generation, with DeepSpeed handling training. This split lets it place the actor, reward, reference, and critic models on separate GPUs and scale to models with 70B+ parameters.

It is aimed at ML engineers and researchers who want to run the full RLHF pipeline—supervised fine-tuning, reward modeling, and reinforcement learning—without building the distributed plumbing themselves. It ships command-line entry points and example scripts for common setups, so you can start from a working recipe and adjust it.

Within the RLHF and alignment category, OpenRLHF focuses on the training infrastructure rather than on datasets or evaluation. It supports several RL algorithms (PPO, GRPO, REINFORCE++, RLOO) and includes a hybrid engine mode that lets models and vLLM engines share GPU resources to reduce idle time on limited hardware.

What it does

  • Built on a Ray + vLLM + DeepSpeed distributed stack that separates actor, reward, reference, and critic models across GPUs
  • Supports multiple RL algorithms: PPO, GRPO, REINFORCE++, and RLOO
  • Scales RLHF training to models with 70B+ parameters
  • Hybrid engine scheduling lets models and vLLM engines share GPUs to cut idle time
  • Covers the full pipeline: supervised fine-tuning, reward modeling, and RL training
  • Supports async RLHF (--train.async_enable) and agent-based RLHF (--train.agent_func_path), with optional LoRA

Getting started

Install OpenRLHF with pip (vLLM is an optional extra), then launch one of its CLI training entry points such as supervised fine-tuning.

Install with pip

Install the base package, or add the vLLM extra to accelerate generation during RL training.

pip install openrlhf              # base package
pip install "openrlhf[vllm]"      # base package plus the vLLM extra (quoted so shells like zsh do not glob the brackets)

Or install from source

Clone the repository and install in editable mode if you want to modify the code or run the example scripts.

git clone https://github.com/OpenRLHF/OpenRLHF.git
cd OpenRLHF
pip install -e .
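
Whichever install path you use, a quick check that the packages are visible to Python can save a failed launch later. The sketch below is not part of OpenRLHF: `check_module` is a hypothetical helper, and it assumes `python3` on your PATH is the interpreter you installed into. It only looks the module up and does not import it, so it will not catch a broken CUDA or driver setup.

```shell
# Hypothetical helper (not part of OpenRLHF): report whether a Python
# module can be found by the python3 interpreter on PATH.
check_module() {
  if python3 -c 'import importlib.util, sys; sys.exit(0 if importlib.util.find_spec(sys.argv[1]) else 1)' "$1" 2>/dev/null; then
    echo "$1: installed"
  else
    echo "$1: missing"
  fi
}

check_module openrlhf   # the base package
check_module vllm       # only present if you installed the vLLM extra
```

If `vllm` is reported as missing, the base package is still usable for the supervised fine-tuning example in the next section, which does not reference vLLM.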

Run supervised fine-tuning

Launch the SFT entry point with DeepSpeed. This example fine-tunes Llama 3 8B on the OpenOrca dataset.

deepspeed --module openrlhf.cli.train_sft \
   --data.max_len 4096 \
   --data.dataset Open-Orca/OpenOrca \
   --data.input_key question \
   --data.output_key response \
   --train.batch_size 256 \
   --train.micro_batch_size 2 \
   --actor.model_name_or_path meta-llama/Meta-Llama-3-8B \
   --ckpt.output_dir ./checkpoint/llama3-8b-sft \
   --ds.zero_stage 2 \
   --train.max_epochs 1 \
   --adam.lr 5e-6
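
The two batch-size flags in this command are related through gradient accumulation. As a rough sketch, assuming `--train.batch_size` is the global batch size and `--train.micro_batch_size` is the per-GPU batch size (the usual DeepSpeed convention; confirm against the official repo), the arithmetic looks like this. The GPU count of 8 is an assumption for illustration and does not appear in the command above.

```shell
# Values taken from the SFT command above
batch_size=256        # --train.batch_size (assumed to be the global batch size)
micro_batch_size=2    # --train.micro_batch_size (assumed to be per GPU)
num_gpus=8            # assumption: a single 8-GPU node

# Samples processed per forward/backward pass across all GPUs
per_step=$(( micro_batch_size * num_gpus ))

# Warn if the global batch size does not divide evenly
if [ $(( batch_size % per_step )) -ne 0 ]; then
  echo "batch_size is not a multiple of micro_batch_size * num_gpus" >&2
fi

# Micro-batches accumulated before each optimizer update
accum_steps=$(( batch_size / per_step ))
echo "gradient accumulation steps: $accum_steps"
```

Under these assumptions, raising `--train.micro_batch_size` uses more memory per GPU but reduces the number of accumulation steps per optimizer update.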

Commands and code are distilled from the project's own documentation — always check the official repo for the latest.

When to use it

  • Aligning an open-weight LLM with human preferences using PPO, GRPO, REINFORCE++, or RLOO
  • Running RLHF training on large models (70B+ parameters) across multiple GPUs without writing the distributed scheduling yourself
  • Reproducing reasoning-model RL recipes (e.g. DeepSeek-R1-style training) from the provided example scripts
  • Training reward models and running supervised fine-tuning as the first stages of an alignment pipeline

How OpenRLHF compares

OpenRLHF alongside other open-source RLHF and alignment tools that AI/TLDR tracks, ranked by GitHub stars.

| Tool | Stars | What it does |
| --- | --- | --- |
| Open-R1 | ★ 26.3k | An open reproduction of the DeepSeek-R1 reasoning pipeline, with scripts for GRPO training and reasoning-data generation. |
| verl | ★ 22.1k | Volcano Engine's RL post-training framework (HybridFlow) for building GRPO, PPO, and other RL pipelines on top of FSDP, Megatron, and vLLM. |
| TRL | ★ 18.7k | Hugging Face's post-training library with trainers for SFT, reward modeling, DPO, PPO, and GRPO to align language models with preferences. |
| Agent Lightning | ★ 17.3k | An open-source trainer from Microsoft that improves AI agents built with any framework using reinforcement learning, prompt optimization, and supervised fine-tuning. |
| ART | ★ 10.1k | OpenPipe's Agent Reinforcement Trainer for post-training LLM agents on multi-step tasks using GRPO and rule- or judge-based rewards. |
| OpenRLHF | ★ 9.7k | Scalable RLHF training for large language models on Ray and vLLM. |
| Alignment Handbook | ★ 5.6k | A set of recipes and scripts from Hugging Face showing how to run the full SFT-then-preference-alignment pipeline used to build aligned chat models. |
| Verifiers | ★ 4.2k | A library for defining verifiable-reward environments and running reinforcement-learning fine-tuning of LLMs against those rewards. |