Overview
Open-R1 is a Hugging Face project that rebuilds the missing pieces of the DeepSeek-R1 reasoning model pipeline so anyone can reproduce it and build on it. The repo is small by design: it provides training scripts (`sft.py`, `grpo.py`), a synthetic-data generation script (`generate.py`), and a Makefile that wires the steps together.
It is aimed at ML researchers and engineers who want to train reasoning models that think step by step. The project follows the DeepSeek-R1 tech report in three stages: distill a corpus from R1, replicate the pure-RL pipeline behind R1-Zero, and show a path from base model to RL-tuned model through multi-stage training.
As an RLHF and alignment toolkit, it builds on accelerate, DeepSpeed, vLLM, and Distilabel, and ships ready-to-use recipes and datasets such as Mixture-of-Thoughts and OpenR1-Math-220k for training and evaluating reasoning models.
What it does
- SFT training script (`sft.py`) for supervised fine-tuning on R1-distilled reasoning datasets
- GRPO training script (`grpo.py`) to apply reinforcement learning on a chosen dataset
- Synthetic reasoning-data generation (`generate.py`) built on Distilabel
- Multi-GPU training with accelerate, supporting DDP and DeepSpeed ZeRO-2 / ZeRO-3
- Makefile commands that chain each step of the R1 pipeline together
- Released recipes and datasets (Mixture-of-Thoughts, OpenR1-Math-220k, CodeForces-CoTs) for reproduction
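To make the difference between DDP, ZeRO-2, and ZeRO-3 in the list above concrete, here is a rough back-of-the-envelope sketch of per-GPU memory for model states. It assumes mixed-precision Adam with 2 bytes per parameter for weights, 2 for gradients, and 12 for optimizer states; those byte counts, the round 7e9 parameter count, and the 8-GPU setup are illustrative assumptions, and the function name is hypothetical. Activations and framework overhead are ignored, so real usage will be higher.

```python
# Rough per-GPU memory for model states only (weights, gradients, optimizer).
# Assumed byte counts for mixed-precision Adam; not measured on Open-R1.
BYTES_PARAMS = 2   # 16-bit weights
BYTES_GRADS = 2    # 16-bit gradients
BYTES_OPTIM = 12   # fp32 master weights + Adam momentum + Adam variance


def model_state_gb(num_params, num_gpus, strategy):
    """Return estimated model-state memory per GPU in GB (1e9 bytes)."""
    params = BYTES_PARAMS * num_params
    grads = BYTES_GRADS * num_params
    optim = BYTES_OPTIM * num_params
    if strategy == "ddp":
        # Every GPU holds a full replica of everything.
        total = params + grads + optim
    elif strategy == "zero2":
        # Gradients and optimizer states are sharded; weights are replicated.
        total = params + (grads + optim) / num_gpus
    elif strategy == "zero3":
        # Weights, gradients, and optimizer states are all sharded.
        total = (params + grads + optim) / num_gpus
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return total / 1e9


# Hypothetical setup: a 7e9-parameter model on 8 GPUs.
for strategy in ("ddp", "zero2", "zero3"):
    print(strategy, model_state_gb(7e9, 8, strategy))
```

Under these assumptions a full replica does not fit on a single 80 GB card, which is one reason a 7B-scale recipe would reach for a ZeRO config rather than plain DDP.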
Getting started
Open-R1 runs on a CUDA 12.4 GPU setup and is installed into a uv virtual environment. The steps below follow the project's README.
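Before starting, it can help to confirm that the command-line tools the steps rely on are actually on your `PATH`. The sketch below uses only the Python standard library; the helper name `find_tools` and the choice of tools to check (`uv` and `nvidia-smi`) are assumptions for illustration and not part of Open-R1. Finding `nvidia-smi` only shows that an NVIDIA driver is installed; it does not verify the CUDA 12.4 requirement.

```python
import shutil


def find_tools(tools=("uv", "nvidia-smi")):
    """Map each tool name to True if an executable with that name is on PATH."""
    return {tool: shutil.which(tool) is not None for tool in tools}


# Print a short report; missing tools should be installed before continuing.
for name, found in find_tools().items():
    print(f"{name}: {'found' if found else 'missing'}")
```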
Create a virtual environment
Create and activate a Python 3.11 environment with uv, then upgrade pip.
uv venv openr1 --python 3.11 && source openr1/bin/activate && uv pip install --upgrade pip
Install vLLM and FlashAttention
Install the pinned vLLM build (which also brings PyTorch 2.6.0) and FlashAttention.
uv pip install vllm==0.8.5.post1
uv pip install setuptools && uv pip install flash-attn --no-build-isolation
Install the project dependencies
Install Open-R1 in editable mode with the dev extras, then log in to Hugging Face and Weights & Biases.
GIT_LFS_SKIP_SMUDGE=1 uv pip install -e ".[dev]"
huggingface-cli login
wandb login
Run SFT training
Launch supervised fine-tuning on an R1-distilled dataset such as Mixture-of-Thoughts using accelerate with a ZeRO-3 config.
accelerate launch --config_file=recipes/accelerate_configs/zero3.yaml src/open_r1/sft.py \
--model_name_or_path open-r1/Qwen2.5-Math-7B-RoPE-300k \
--dataset_name open-r1/Mixture-of-Thoughts \
--dataset_config all \
--learning_rate 4.0e-5 \
--num_train_epochs 5 \
--max_seq_length 32768 \
--per_device_train_batch_size 2 \
--gradient_checkpointing \
--bf16 \
--output_dir data/Open
Commands and code are distilled from the project's own documentation; always check the official repo for the latest.
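The SFT command sets `--per_device_train_batch_size 2` and `--max_seq_length 32768`, but the number of sequences per optimizer step also depends on how many GPUs you launch on and on gradient accumulation, neither of which appears in the command. The sketch below works through that arithmetic; the 8-GPU count and accumulation of 1 are assumptions for illustration, and the function name is hypothetical.

```python
def batch_per_step(per_device_batch, num_gpus, grad_accum_steps, max_seq_length):
    """Return (sequences per optimizer step, upper bound on tokens per step)."""
    sequences = per_device_batch * num_gpus * grad_accum_steps
    # Upper bound only: it is reached when every sequence fills max_seq_length.
    return sequences, sequences * max_seq_length


# Values from the command above, plus an assumed 8 GPUs and no accumulation.
sequences, max_tokens = batch_per_step(
    per_device_batch=2, num_gpus=8, grad_accum_steps=1, max_seq_length=32768
)
print(sequences, max_tokens)
```

If you run on fewer GPUs than a recipe was tuned for, raising gradient accumulation by the same factor keeps the number of sequences per step unchanged.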
When to use it
- Reproduce DeepSeek-R1's reasoning models from open recipes and datasets
- Fine-tune a base model with SFT on distilled reasoning traces
- Apply GRPO reinforcement learning to improve a model's step-by-step reasoning
- Generate synthetic reasoning data from an R1 model to build your own training corpus
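For readers new to GRPO (Group Relative Policy Optimization), its central idea is to sample several completions for the same prompt and score each one relative to the others in that group, instead of training a separate value model. The sketch below shows only that normalization step. It is not the code in `grpo.py`; the function name, the example rewards, the small `eps`, and the use of the population standard deviation are assumptions for illustration, and real implementations may differ in these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize each reward against the group sampled for the same prompt.

    Completions that score above the group mean get a positive advantage,
    those below get a negative one. eps avoids division by zero when every
    completion in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption in this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group: four completions for one prompt, two judged correct.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

When every completion in a group gets the same reward, all advantages are zero and that prompt contributes no learning signal, which is why reward functions that separate good from bad completions matter for this kind of training.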
How Open-R1 compares
Open-R1 alongside the other open-source RLHF and alignment tools that AI/TLDR tracks, ranked by GitHub stars.
| Tool | Stars | What it does |
|---|---|---|
| Open-R1 | ★ 26.3k | A fully open reproduction of the DeepSeek-R1 reasoning pipeline. |
| verl | ★ 22.1k | Volcano Engine's RL post-training framework (HybridFlow) for building GRPO, PPO, and other RL pipelines on top of FSDP, Megatron, and vLLM. |
| TRL | ★ 18.7k | Hugging Face's post-training library with trainers for SFT, reward modeling, DPO, PPO, and GRPO to align language models with preferences. |
| Agent Lightning | ★ 17.3k | An open-source trainer from Microsoft that improves AI agents built with any framework using reinforcement learning, prompt optimization, and supervised fine-tuning. |
| ART | ★ 10.1k | OpenPipe's Agent Reinforcement Trainer for post-training LLM agents on multi-step tasks using GRPO and rule- or judge-based rewards. |
| OpenRLHF | ★ 9.7k | A Ray- and vLLM-based RLHF framework that scales PPO, GRPO, and REINFORCE++ training to models with 70B+ parameters. |
| Alignment Handbook | ★ 5.6k | A set of recipes and scripts from Hugging Face showing how to run the full SFT-then-preference-alignment pipeline used to build aligned chat models. |
| Verifiers | ★ 4.2k | A library for defining verifiable-reward environments and running reinforcement-learning fine-tuning of LLMs against those rewards. |