Overview
The Alignment Handbook is a collection of training recipes and scripts from Hugging Face for building aligned chat models. It walks through the whole pipeline used to produce models like Zephyr-7B: continued pretraining, supervised fine-tuning (SFT) for chat, and preference alignment with DPO or ORPO.
It is aimed at ML engineers and researchers who want to reproduce known aligned models or train their own on custom datasets. Each training run is described by a single YAML recipe file that holds all the parameters, so you change configs rather than rewriting code.
Within the fine-tuning and RLHF/alignment space, it sits one level above raw training libraries: the scripts are thin wrappers over the Hugging Face stack, and they support full-weight distributed training with DeepSpeed ZeRO-3 as well as parameter-efficient LoRA/QLoRA.
What it does
- Covers the full pipeline: continued pretraining, SFT, reward modeling, rejection sampling, DPO, and ORPO
- Reproducible recipes as YAML files (e.g. Zephyr-7B, SmolLM, StarChat2) that hold every parameter for a run
- Ready-made scripts for SFT (sft.py) and preference alignment (dpo.py) launched with accelerate
- Supports full-weight distributed training via DeepSpeed ZeRO-3, plus LoRA/QLoRA for parameter-efficient tuning
- Instructions and formatting guidance for fine-tuning chat models on your own datasets
- Built on the Hugging Face ecosystem with documented dataset and model collections
Getting started
Set up a Python environment with the pinned dependencies, then launch a training run from one of the provided recipe YAML files. The example below reproduces the Zephyr-7B-beta pipeline.
Create a virtual environment and install dependencies
Use uv to create an environment, install the pinned PyTorch build, then install the handbook package and Flash Attention 2.
uv venv handbook --python 3.11 && source handbook/bin/activate && uv pip install --upgrade pip
uv pip install torch==2.6.0 --index-url https://download.pytorch.org/whl/cu126
uv pip install .
uv pip install "flash-attn==2.7.4.post1" --no-build-isolationLog in to Hugging Face
Authenticate so you can pull base models and datasets and push your trained model.
huggingface-cli loginRun supervised fine-tuning (SFT)
Launch the SFT script with a recipe config, using the ZeRO-3 accelerate config for distributed training.
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/zero3.yaml scripts/sft.py --config recipes/zephyr-7b-beta/sft/config_full.yamlAlign with DPO
Take the SFT model and align it to preferences using direct preference optimization.
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/zero3.yaml scripts/dpo.py --config recipes/zephyr-7b-beta/dpo/config_full.yamlCommands and code are distilled from the project's own documentation — always check the official repo for the latest.
When to use it
- Reproduce a published aligned chat model such as Zephyr-7B from its recipe
- Fine-tune and align an open base model on your own instruction and preference datasets
- Compare alignment methods (DPO vs. KTO vs. IPO, or ORPO) using the included recipes
- Adapt a model to a new language or domain through continued pretraining, then SFT and DPO
How Alignment Handbook compares
Alignment Handbook alongside other open-source rlhf & alignment tools AI/TLDR tracks, ranked by GitHub stars.
| Tool | Stars | What it does |
|---|---|---|
| Open-R1 | ★ 26.3k | An open reproduction of the DeepSeek-R1 reasoning pipeline, with scripts for GRPO training and reasoning-data generation. |
| verl | ★ 22.1k | Volcano Engine's RL post-training framework (HybridFlow) for building GRPO, PPO, and other RL pipelines on top of FSDP, Megatron, and vLLM. |
| TRL | ★ 18.7k | Hugging Face's post-training library with trainers for SFT, reward modeling, DPO, PPO, and GRPO to align language models with preferences. |
| Agent Lightning | ★ 17.3k | An open-source trainer from Microsoft that improves AI agents built with any framework using reinforcement learning, prompt optimization, and supervised fine-tuning. |
| ART | ★ 10.1k | OpenPipe's Agent Reinforcement Trainer for post-training LLM agents on multi-step tasks using GRPO and rule- or judge-based rewards. |
| OpenRLHF | ★ 9.7k | A Ray- and vLLM-based RLHF framework that scales PPO, GRPO, and REINFORCE++ training to models with 70B+ parameters. |
| Alignment Handbook | ★ 5.6k | Hugging Face recipes and scripts for the full SFT-then-preference-alignment pipeline |
| Verifiers | ★ 4.2k | A library for defining verifiable-reward environments and running reinforcement-learning fine-tuning of LLMs against those rewards. |