AI/TLDR

What Is RLHF? How Models Learn from Human Feedback

Follow the full RLHF pipeline — human rankings, reward model, RL updates — and understand why it made chatbots feel helpful instead of feral.

Beginner · 10 min read · Updated 2026-06-11

In plain English

RLHF stands for Reinforcement Learning from Human Feedback. It's the training step that turns a raw language model — one that just predicts the next word — into something that actually behaves like a helpful assistant: it answers your question instead of rambling, refuses obviously harmful requests, and follows your instructions instead of ignoring them.

Here's the everyday version. Imagine a brilliant but completely unsupervised intern. They've read the entire internet, so they can write about anything — but they have no manners. Ask them a question and they might answer it, or might just continue your sentence, or echo something toxic they saw online. You can't write a rulebook for "be helpful and polite" — it's too fuzzy. So instead you do something simpler: you show them two of their own attempts and say "this one's better." Do that thousands of times and the intern figures out the pattern of what you like, without you ever spelling out the rules. RLHF is that thumbs-up / thumbs-down training, applied to a model's weights.

Why ranking instead of writing the perfect answer? Because judging is far easier than authoring. You probably can't write the ideal response to "explain quantum tunneling to a 10-year-old," but you can instantly tell which of two responses is better. RLHF is built entirely around that asymmetry: humans don't write answers, they compare them. The model learns from the comparisons.

Why it matters

A model fresh out of pretraining is technically powerful and practically useless as an assistant. It learned one objective — predict the next token across the whole internet — and that objective doesn't include "be helpful" or "don't be harmful." Ask a raw base model a question and you might get the answer, or you might get five more questions just like yours, because on the internet questions are often followed by more questions. People call this the base-model-feels-feral problem.

RLHF is, more than any other single trick, the thing that made chatbots usable. The jump from "impressive autocomplete" to "assistant you can actually talk to" — the moment large language models went mainstream — was driven by this technique. It's the layer that closes the gap between what the model can do and what humans actually want, which is the core problem of AI alignment.

What RLHF actually buys you

  • Helpfulness. It answers the question you asked, in a useful format, instead of continuing your text or dodging.
  • Harmlessness. It learns to refuse or safely handle dangerous requests, because human raters consistently mark harmful answers as worse.
  • Honesty (sort of). Raters prefer answers that admit uncertainty over confident nonsense, so the model leans away from some — not all — hallucination.
  • Instruction-following. "Reply in JSON," "be concise," "use British spelling" — the model learns that obeying scores higher than ignoring.

If you're building applications, you usually consume RLHF rather than run it: the major hosted models reached you already preference-tuned. But understanding it explains a lot of model behavior you'll hit in practice — why models hedge, why they sometimes over-refuse, why two providers' models have such different "personalities," and why a heavily fine-tuned open model can suddenly feel blunt and rule-less again.

How it works

Classic RLHF is a three-stage pipeline. Each stage produces an input for the next. The key insight tying it together: we can't directly score "good answer," so we train a second model to imitate human judgment, then use that model as an automatic grader at massive scale.

Stage 1 — Supervised fine-tuning (the warm-up)

First, the base model gets supervised fine-tuning (SFT) on a few thousand to tens of thousands of high-quality example conversations written by humans. This teaches the model the basic shape of being an assistant: turn-taking, answering directly, a reasonable tone. SFT gets you a model that's okay. It's the floor RLHF builds on, not the finish line.
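As a rough illustration of what SFT optimizes, the sketch below computes the usual next-token loss over a single example. It assumes, as is common but not universal, that prompt tokens are masked out so only the human-written answer is trained on. The function name and the numbers are hypothetical.

```python
import math

def sft_loss(token_logprobs, is_response_token):
    """Average negative log-likelihood over the response tokens only.

    token_logprobs: log-probability the model gave each target token.
    is_response_token: True for tokens of the human-written answer,
        False for prompt tokens (masked out of the loss).
    """
    picked = [lp for lp, keep in zip(token_logprobs, is_response_token) if keep]
    if not picked:
        raise ValueError("no response tokens to train on")
    return -sum(picked) / len(picked)

# Hypothetical example: 2 prompt tokens followed by 3 response tokens.
logprobs = [math.log(0.9), math.log(0.8),
            math.log(0.5), math.log(0.25), math.log(0.5)]
mask = [False, False, True, True, True]
loss = sft_loss(logprobs, mask)
print(round(loss, 4))
```

Lower loss means the model assigns higher probability to the demonstrated answer, which is all "do it like this" training amounts to.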

Stage 2 — Train the reward model (the judge)

Now collect human preferences. Show people a prompt and two model responses, A and B, and ask which is better. Repeat across tens of thousands of comparisons. Then train a separate model, the reward model (RM), to predict those human choices. Feed it a prompt and a candidate response and it outputs a single number: a reward score estimating how much a human would like that response. The reward model is the heart of RLHF: it's a learned, automatic stand-in for a human rater, so you can score millions of outputs without paying a human for each one.
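Reward models are commonly trained with a pairwise (Bradley-Terry style) loss: the model is penalized whenever the rejected answer scores close to or above the chosen one. The sketch below shows that loss on plain numbers; the function names are illustrative, and a real trainer applies the same formula to batches of model outputs.

```python
import math

def pairwise_loss(score_chosen, score_rejected):
    """-log(sigmoid(chosen - rejected)): small when chosen outscores rejected."""
    margin = score_chosen - score_rejected
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

def preference_probability(score_chosen, score_rejected):
    """Probability the reward model assigns to the human's actual choice."""
    return 1.0 / (1.0 + math.exp(-(score_chosen - score_rejected)))

print(round(pairwise_loss(2.0, 0.0), 4))  # chosen clearly ahead: low loss
print(round(pairwise_loss(0.0, 2.0), 4))  # ranking inverted: high loss
```

Only the difference between the two scores matters, which is why reward scores are meaningful relative to each other rather than on an absolute scale.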

Stage 3 — Reinforcement learning (the practice)

Now the actual reinforcement learning. The SFT model (now called the policy) generates answers to prompts. The reward model scores each one. A reinforcement-learning algorithm, historically PPO (Proximal Policy Optimization), nudges the policy's weights so it produces higher-scoring answers more often. This is the "reinforcement" part: behavior that earns a high reward gets reinforced, like training a dog with treats, except the "treat" is a number from the reward model and there are millions of rounds. To keep the policy from drifting into strange text that merely fools the reward model, the update also carries a KL penalty: a cost for straying too far from the frozen SFT model, often described as a leash.
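To make the objective concrete, here is a toy sketch that is deliberately not PPO: a "policy" that chooses among three canned answers, updated by exact gradient ascent on expected reward minus beta times the KL divergence from a frozen reference. The rewards, step size, and beta values are made-up assumptions; real RLHF estimates the same quantity from sampled text.

```python
import math

def softmax(logits):
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train_toy_policy(rewards, ref_probs, beta, steps=500, lr=0.2):
    """Gradient ascent on E[reward] - beta * KL(policy || reference)."""
    logits = [math.log(p) for p in ref_probs]  # start at the reference policy
    for _ in range(steps):
        probs = softmax(logits)
        # Reward minus the KL penalty term for each candidate answer.
        shaped = [r - beta * math.log(p / q)
                  for r, p, q in zip(rewards, probs, ref_probs)]
        baseline = sum(p * s for p, s in zip(probs, shaped))
        # Exact gradient for a softmax policy: p_k * (shaped_k - baseline).
        logits = [x + lr * p * (s - baseline)
                  for x, p, s in zip(logits, probs, shaped)]
    return softmax(logits)

rewards = [-1.0, 0.0, 2.0]           # hypothetical reward-model scores
reference = [1 / 3, 1 / 3, 1 / 3]    # frozen SFT policy over three answers
loose = train_toy_policy(rewards, reference, beta=0.5)  # weak leash
tight = train_toy_policy(rewards, reference, beta=5.0)  # strong leash
print([round(p, 3) for p in loose])
print([round(p, 3) for p in tight])
```

With a small beta the policy piles almost all of its probability onto the top-scoring answer; with a large beta it shifts toward that answer but stays much closer to the reference.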

What RLHF looks like in code

You almost never write the RL math by hand. A widely used open-source toolkit is Hugging Face's TRL (Transformer Reinforcement Learning) library, which ships a RewardTrainer for stage 2 and trainers like PPOTrainer and GRPOTrainer for stage 3 (class names and import paths shift between TRL releases, so check the docs for your installed version). The two snippets below are deliberately minimal, since real runs add config, evaluation, and far more data, but they show the rough shape of each stage.

Stage 2 first: the preference data is just (prompt, chosen, rejected) triples — for each prompt, the answer a human preferred and the one they didn't. The reward model learns to score chosen higher than rejected.

preferences.jsonl (one comparison per line), JSON:
{"prompt": "How do I reset my password?", "chosen": "Click 'Forgot password' on the login page and follow the email link.", "rejected": "why would you forget your own password lol"}
{"prompt": "Is it safe to eat raw cookie dough?", "chosen": "It's risky — raw eggs and flour can carry bacteria. Use an edible recipe instead.", "rejected": "sure go for it, nothing bad ever happens"}

train_reward_model.py, Python:
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardTrainer, RewardConfig

base = "Qwen/Qwen3-0.6B"
tokenizer = AutoTokenizer.from_pretrained(base)
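# A reward model is a base model with a single-number "score" head bolted on.
model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=1)
# The classification head needs to know which token is padding when batching.
model.config.pad_token_id = tokenizer.pad_token_id

# Each row has a 'chosen' and a 'rejected' answer for the same prompt.
# Some TRL versions expect the prompt folded into both answers instead of a
# separate 'prompt' field; check the dataset format for your version.
data = load_dataset("json", data_files="preferences.jsonl", split="train")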

config = RewardConfig(output_dir="my-reward-model", num_train_epochs=1)
trainer = RewardTrainer(model=model, args=config, train_dataset=data,
                        processing_class=tokenizer)
trainer.train()  # learns to score 'chosen' above 'rejected'
trainer.save_model("my-reward-model")

Stage 3 then points an RL trainer at the policy model and the reward model you just trained. The policy generates, the reward model grades, the trainer updates: the loop described under Stage 3 above, in a few lines.

train_policy_ppo.py (shape only), Python:
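from trl import PPOTrainer, PPOConfig

# Shape only: policy_model, value_model, reward_model, sft_model, prompts and
# tokenizer are placeholders you would load yourself before this point.
# The policy is the SFT model from stage 1; the reward model is from stage 2.
trainer = PPOTrainer(
    args=PPOConfig(output_dir="aligned-model"),
    model=policy_model,          # the model we're improving
    reward_model=reward_model,   # the learned human-preference scorer
    ref_model=sft_model,         # frozen reference for the KL leash
    value_model=value_model,     # critic; recent TRL releases expect one (check your version)
    train_dataset=prompts,
    processing_class=tokenizer,
)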

# Internally each step: generate answers -> score with reward_model
# -> PPO update toward higher reward, with a KL penalty to stay sane.
trainer.train()
trainer.save_model("aligned-model")

RLHF vs DPO (and the newer alternatives)

Classic PPO-based RLHF works, but it's fiddly: you juggle three or four models at once, the RL loop is unstable, and a single bad hyperparameter can collapse the whole run. So researchers asked: can we learn the same preferences without the reward model and the RL loop? The popular answer is DPO — Direct Preference Optimization.

DPO uses the exact same (prompt, chosen, rejected) data, but skips stages 2 and 3 entirely. Instead of training a reward model and then doing RL against it, DPO reframes the math so you can directly fine-tune the policy to prefer chosen over rejected with an ordinary supervised-style loss. No separate judge, no PPO, far fewer ways to blow up — which is why it has become the default for most open-model preference tuning.
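The "supervised-style loss" can be written in a few lines. The sketch below computes the DPO loss for one comparison from four sequence log-probabilities: the policy's and the frozen reference model's, each for the chosen and the rejected answer. The numbers and the beta value are illustrative assumptions.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for a single (prompt, chosen, rejected) comparison."""
    # How much more (or less) likely each answer is under the policy
    # than under the frozen reference model.
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(margin)), written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Policy identical to the reference: no preference learned yet.
print(round(dpo_loss(-11.0, -11.0, -11.0, -11.0), 4))
# Policy has raised 'chosen' and lowered 'rejected': loss falls.
print(round(dpo_loss(-10.0, -12.0, -11.0, -11.0), 4))
```

The formula has the same shape as the reward model's pairwise loss, except that the "score" is the policy's own log-probability gain over the reference, so no separate judge is needed.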

Is DPO strictly better? Not always. Many frontier labs still use full RLHF (or hybrids) because the reward model gives them a reusable, flexible scorer and the RL loop can squeeze out gains DPO can't. A newer family — GRPO (Group Relative Policy Optimization) — drops the value network and scores a group of answers relative to each other, and it's become central to training the reasoning models that think step by step before answering. The landscape is: DPO if you want simple and stable, PPO/GRPO if you want maximum control and can afford the complexity.

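| Method | Reward model? | RL loop? | Best for |
| --- | --- | --- | --- |
| RLHF (PPO) | Yes | Yes | Frontier labs, maximum control |
| DPO | No | No | Most open-model preference tuning |
| GRPO | Yes, or a rule-based checker (group scores) | Yes | Reasoning models, math/code rewards |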
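The group-relative scoring that gives GRPO its name can be sketched as follows: sample several answers to the same prompt, score them, and judge each one against the group's average instead of asking a value network for a baseline. The normalization below (population standard deviation plus a small epsilon) is one reasonable choice and an assumption here, not a quote of any particular implementation.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other answers to the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps avoids division by zero when every answer got the same reward.
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled answers to one prompt; 1.0 = correct, 0.0 = incorrect.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print([round(a, 3) for a in advantages])
```

Answers that beat their group get a positive advantage and are reinforced; answers below the group average are pushed down. If every answer in the group scores the same, all advantages are zero and that prompt contributes no learning signal.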

Going deeper

Once the three-stage picture clicks, here are the deeper currents — the parts that are still active research, and the failure modes that bite real teams.

RLAIF — replacing the human with an AI. Human labeling is slow and expensive, so a major direction swaps the human rater for another model generating the preferences, guided by a written set of principles. Anthropic's Constitutional AI is the well-known version: the model critiques and ranks its own outputs against a "constitution" of rules, drastically cutting the human labeling needed for harmlessness. The H in RLHF quietly becomes AI feedback.

Reward hacking and the alignment tax. Because the reward model is only an approximation of human values, the policy will exploit its blind spots — padding answers with hedging, length, or flattery because those score well, not because they're better. This is why RLHF'd models sometimes feel sycophantic or over-cautious. Pushing reward too hard also incurs an alignment tax: the model can get measurably worse at raw capabilities (reasoning, factual recall) as it's optimized to be agreeable. Balancing the two is an open problem, closely tied to LLM evaluation.
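A stylized sketch of reward hacking: suppose the learned reward model has picked up an unintended bonus for longer answers. Everything below is invented for illustration, including the "true_quality" numbers, which in practice nobody can observe directly (that is the whole problem).

```python
candidates = [
    {"text": "Restart the router, then re-enter the Wi-Fi password.",
     "true_quality": 0.9},
    {"text": "Great question! You are so smart to ask. " * 3
             + "Maybe try restarting something?",
     "true_quality": 0.3},
]

def proxy_reward(candidate, length_bonus=0.05):
    """Imperfect reward model: true quality plus an accidental per-word bonus."""
    words = len(candidate["text"].split())
    return candidate["true_quality"] + length_bonus * words

picked_by_proxy = max(candidates, key=proxy_reward)
picked_by_truth = max(candidates, key=lambda c: c["true_quality"])
print(picked_by_proxy["text"][:40])
print(picked_by_truth["text"][:40])
```

The flattering, padded answer wins under the proxy even though it is the worse answer. A policy optimized hard against that proxy would learn to produce more of it.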

It's the same loop on harder rewards. The latest reasoning models extend RLHF's machinery to objectively checkable rewards (did the code pass the tests? is the math answer correct?), an approach often called RLVR (reinforcement learning with verifiable rewards). Here the "judge" isn't a learned preference model but a deterministic checker, which removes the learned reward model's blind spots (though a policy can still game a weak checker, such as an incomplete test suite) and lets models train themselves on problems with a right answer. GRPO is the workhorse algorithm for this.
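A deterministic checker can be very small. The sketch below is a hypothetical reward function for arithmetic problems: it takes the last number in the model's output as the final answer and returns 1.0 only on an exact match. The "last number" convention is an assumption made for this example; real setups usually require a stricter answer format.

```python
import re

def math_reward(completion, expected_answer):
    """Return 1.0 if the last number in the completion equals the expected answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    if not numbers:
        return 0.0  # no answer found, no reward
    return 1.0 if float(numbers[-1]) == float(expected_answer) else 0.0

print(math_reward("17 + 25 = 42", 42))
print(math_reward("The answer is 41.", 42))
print(math_reward("I am not sure.", 42))
```

Scores like these can be fed to an RL trainer in place of a learned reward model's output.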

Where it sits in the bigger picture. RLHF is one layer of the broader fine-tuning toolkit — it changes a model's behavior and values, not its knowledge (for facts you still want RAG). It also doesn't make a model safe by itself: alignment tuning can be undone by further fine-tuning, and determined users still find jailbreaks around it. RLHF made models usable; making them robustly trustworthy is still very much unfinished work.

FAQ

What is RLHF in simple terms?

RLHF (Reinforcement Learning from Human Feedback) is the training step that teaches a raw language model to be a helpful, polite assistant. Humans rank pairs of the model's answers ("this one's better"), a reward model learns to imitate those rankings, and reinforcement learning then nudges the model to produce more of the high-ranked behavior.

Why do LLMs need RLHF?

A pretrained model only learned to predict the next word, so it has no sense of being helpful, honest, or harmless — it might continue your text instead of answering, or repeat toxic content. RLHF closes the gap between what the model can do and what humans actually want, which is what made chatbots usable in the first place.

What is the difference between RLHF and fine-tuning?

Plain supervised fine-tuning teaches a model from example answers — "do it like this." RLHF teaches from preferences — "this answer is better than that one." RLHF is actually a type of fine-tuning that's stacked on top: labs do supervised fine-tuning first, then apply RLHF to align the behavior with human preferences.

What is the difference between RLHF and DPO?

Both learn from the same preference data (chosen vs. rejected answers), but RLHF trains a separate reward model and runs a reinforcement-learning loop, while DPO skips both and fine-tunes the model directly with a supervised-style loss in a single training stage. DPO is simpler and more stable, so it's the default for most open-model tuning; full RLHF is still common at frontier labs.

Is RLHF still used in 2026?

Yes. The core idea — optimize a model against human (or AI) preferences — is still how frontier models are aligned. The specific algorithms have evolved: DPO replaced PPO for many open-model workflows, and GRPO and verifiable-reward methods now drive reasoning models. But it's all the same preference-training family that RLHF started.

What is reward hacking in RLHF?

Reward hacking is when the model finds outputs that score high with the reward model but aren't actually good — padding answers with hedging, flattery, or extra length because those happen to earn reward. It happens because the reward model is only an approximation of human values, and it's why RLHF'd models can feel sycophantic or over-cautious. A KL penalty and careful evaluation help limit it.

Further reading