AI/TLDR

Open-R1

A fully open reproduction of the DeepSeek-R1 reasoning pipeline

Overview

Open-R1 is a Hugging Face project that rebuilds the missing pieces of the DeepSeek-R1 reasoning model pipeline so anyone can reproduce it and build on it. The repo is small by design: it provides training scripts (`sft.py`, `grpo.py`), a synthetic-data generation script (`generate.py`), and a Makefile that wires the steps together.

It is aimed at ML researchers and engineers who want to train reasoning models that think step by step. The project follows the DeepSeek-R1 tech report in three stages: distill a corpus from R1, replicate the pure-RL pipeline behind R1-Zero, and show a path from base model to RL-tuned model through multi-stage training.

As an RLHF and alignment toolkit, it leans on the wider open-source training stack (accelerate, DeepSpeed, vLLM, and Distilabel) and ships ready-to-use recipes and datasets, such as Mixture-of-Thoughts and OpenR1-Math-220k, for training and evaluating reasoning models.

What it does

  • SFT training script (`sft.py`) for supervised fine-tuning on R1-distilled reasoning datasets
  • GRPO training script (`grpo.py`) to apply reinforcement learning on a chosen dataset
  • Synthetic reasoning-data generation (`generate.py`) built on Distilabel
  • Multi-GPU training with accelerate, supporting DDP and DeepSpeed ZeRO-2 / ZeRO-3
  • Makefile commands that chain each step of the R1 pipeline together
  • Released recipes and datasets (Mixture-of-Thoughts, OpenR1-Math-220k, CodeForces-CoTs) for reproduction
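
GRPO (Group Relative Policy Optimization), named in the list above, samples several completions per prompt and scores each one relative to the others in its group. The sketch below illustrates only that normalization step. It is not the code in `grpo.py`, and the function name, the `eps` constant, and the choice of population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize the rewards of completions sampled for ONE prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    `eps` (an illustrative value) avoids division by zero when every
    completion in the group receives the same reward.
    """
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four completions for one prompt: two judged correct, two incorrect.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

With rewards `[1.0, 0.0, 0.0, 1.0]`, the two rewarded completions get positive advantages and the other two get negative ones. When every completion in a group earns the same reward, all advantages are zero, so that group provides no policy-gradient signal.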

Getting started

Open-R1's dependencies expect CUDA 12.4, and the project is installed into a uv virtual environment. The steps below follow the project's README.

Create a virtual environment

Create and activate a Python 3.11 environment with uv, then upgrade pip.

```bash
uv venv openr1 --python 3.11 && source openr1/bin/activate && uv pip install --upgrade pip
```

Install vLLM and FlashAttention

Install the pinned vLLM build (which also brings PyTorch 2.6.0) and FlashAttention.

```bash
uv pip install vllm==0.8.5.post1
uv pip install setuptools && uv pip install flash-attn --no-build-isolation
```
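
After this step it can help to confirm that the pinned versions actually landed in the environment. The following is a minimal sketch, not part of Open-R1: the helper name `check_pins` and the `EXPECTED` table are hypothetical, and the pins are the ones stated in this section (vLLM 0.8.5.post1 and PyTorch 2.6.0).

```python
from importlib.metadata import PackageNotFoundError, version

# Pins taken from the install step described above.
EXPECTED = {"vllm": "0.8.5.post1", "torch": "2.6.0"}


def check_pins(expected):
    """Return {package: (wanted, installed_or_None, ok)} for each pin."""
    report = {}
    for name, want in expected.items():
        try:
            have = version(name)
        except PackageNotFoundError:
            have = None  # package is not installed in this environment
        # Wheels may carry a local build suffix (for example "+cu124"),
        # so only the part before "+" is compared.
        ok = have is not None and have.split("+")[0] == want
        report[name] = (want, have, ok)
    return report


if __name__ == "__main__":
    for name, (want, have, ok) in check_pins(EXPECTED).items():
        status = "ok" if ok else "MISMATCH"
        print(f"{name}: wanted {want}, found {have} [{status}]")
```

Run it inside the activated `openr1` environment. A `MISMATCH` line usually means a later install replaced one of the pinned packages.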

Install the project dependencies

Install Open-R1 in editable mode with the dev extras, then log in to Hugging Face and Weights & Biases.

```bash
GIT_LFS_SKIP_SMUDGE=1 uv pip install -e ".[dev]"
huggingface-cli login
wandb login
```

Run SFT training

Launch supervised fine-tuning on an R1-distilled dataset such as Mixture-of-Thoughts using accelerate with a ZeRO-3 config.

```bash
accelerate launch --config_file=recipes/accelerate_configs/zero3.yaml src/open_r1/sft.py \
    --model_name_or_path open-r1/Qwen2.5-Math-7B-RoPE-300k \
    --dataset_name open-r1/Mixture-of-Thoughts \
    --dataset_config all \
    --learning_rate 4.0e-5 \
    --num_train_epochs 5 \
    --max_seq_length 32768 \
    --per_device_train_batch_size 2 \
    --gradient_checkpointing \
    --bf16 \
    --output_dir data/Open
```

Commands and code are distilled from the project's own documentation — always check the official repo for the latest.
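
The SFT command sets `--per_device_train_batch_size 2` and `--max_seq_length 32768`, but how much data each optimizer step sees also depends on the number of GPUs and on gradient accumulation, neither of which appears in the command. The sketch below works through that arithmetic. The function names are illustrative, and the 8-GPU figure in the example is a hypothetical value, not something stated by the project.

```python
def effective_batch_size(per_device_batch_size, num_gpus, grad_accum_steps=1):
    """Sequences consumed per optimizer step across all devices."""
    return per_device_batch_size * num_gpus * grad_accum_steps


def max_tokens_per_step(per_device_batch_size, max_seq_length, num_gpus,
                        grad_accum_steps=1):
    """Upper bound on tokens per optimizer step (every sequence at max length)."""
    sequences = effective_batch_size(per_device_batch_size, num_gpus,
                                     grad_accum_steps)
    return sequences * max_seq_length


if __name__ == "__main__":
    # Values from the command above, plus a HYPOTHETICAL 8-GPU node.
    print(effective_batch_size(2, num_gpus=8))        # 16 sequences
    print(max_tokens_per_step(2, 32768, num_gpus=8))  # 524288 tokens
```

On that hypothetical 8-GPU node, each step processes 16 sequences and at most 524,288 tokens. When moving to a different GPU count, adjusting gradient accumulation is one way to keep the effective batch size unchanged.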

When to use it

  • Reproduce DeepSeek-R1's reasoning models from open recipes and datasets
  • Fine-tune a base model with SFT on distilled reasoning traces
  • Apply GRPO reinforcement learning to improve a model's step-by-step reasoning
  • Generate synthetic reasoning data from an R1 model to build your own training corpus

How Open-R1 compares

Open-R1 alongside the other open-source RLHF and alignment tools that AI/TLDR tracks, ranked by GitHub stars.

| Tool | Stars | What it does |
|---|---|---|
| Open-R1 | ★ 26.3k | A fully open reproduction of the DeepSeek-R1 reasoning pipeline |
| verl | ★ 22.1k | Volcano Engine's RL post-training framework (HybridFlow) for building GRPO, PPO, and other RL pipelines on top of FSDP, Megatron, and vLLM. |
| TRL | ★ 18.7k | Hugging Face's post-training library with trainers for SFT, reward modeling, DPO, PPO, and GRPO to align language models with preferences. |
| Agent Lightning | ★ 17.3k | An open-source trainer from Microsoft that improves AI agents built with any framework using reinforcement learning, prompt optimization, and supervised fine-tuning. |
| ART | ★ 10.1k | OpenPipe's Agent Reinforcement Trainer for post-training LLM agents on multi-step tasks using GRPO and rule- or judge-based rewards. |
| OpenRLHF | ★ 9.7k | A Ray- and vLLM-based RLHF framework that scales PPO, GRPO, and REINFORCE++ training to models with 70B+ parameters. |
| Alignment Handbook | ★ 5.6k | A set of recipes and scripts from Hugging Face showing how to run the full SFT-then-preference-alignment pipeline used to build aligned chat models. |
| Verifiers | ★ 4.2k | A library for defining verifiable-reward environments and running reinforcement-learning fine-tuning of LLMs against those rewards. |