What Is Model Distillation? Big-Model Quality at Small-Model Prices

Learn how a small student model copies a big teacher's behavior, and why distillation is behind most of the fast, cheap models you use.

BEGINNER | 12 MIN READ | UPDATED 2026-06-11

In plain English

Model distillation is how you take a huge, slow, expensive language model and bottle its behavior into a tiny, fast, cheap one. The big model is the teacher. The small model is the student. You let the teacher answer thousands of questions, then train the student to give the same answers. The student ends up punching far above its size — not because it's secretly huge, but because it learned from someone who was.

Here's the everyday version. A world-class chess grandmaster can beat almost anyone, but you can't carry one around in your pocket. So instead, the grandmaster writes a pile of lessons: in this position, play this move, and here's roughly how good every other move was. A talented club player studies those lessons. They'll never out-think the grandmaster on the hardest puzzles — but on the positions that come up 99% of the time, they now play almost identically, and they do it instantly. Distillation is that apprenticeship, applied to neural networks.

The technical version: distillation is a form of fine-tuning where the training targets come from another model instead of from humans. You feed the teacher a big list of prompts, capture its outputs, and use those outputs as the "correct answers" to train the student. The student copies not just what the teacher said, but increasingly how it decided — its style, its reasoning shape, its quirks. The result is a model that feels like a smaller echo of a much bigger one.

Why it matters

Big models are accurate but painful to run. Every extra parameter means more memory, more compute, higher latency, and a bigger bill on every single API call. For a chatbot you ping a few times a day that's fine. For a feature that runs on every user message, every uploaded document, or every row in a database, the big model is a non-starter — too slow and too expensive at scale.

Distillation breaks that trade-off for narrow tasks. You can't shrink a model and keep it brilliant at everything. But you usually don't need everything — you need it to be great at the one job you actually ship. A student model can match the teacher on that job while being a fraction of the size, which means it's cheaper to run, faster to respond, and small enough to deploy in places the teacher could never fit.

Cost. A model with 10x fewer parameters can cost roughly an order of magnitude less to run per request. At high volume that's the difference between a profitable feature and a money pit.
Latency. Smaller models generate tokens faster. For anything user-facing — autocomplete, live chat, voice — the speed-up is the whole point.
Where it can run. A distilled model may fit on a single GPU, a phone, or in a browser, enabling local, offline inference that a frontier model could never do.
Privacy and control. Once you own a distilled open model, you run it on your own hardware. Data never leaves your environment, and no provider can deprecate it out from under you.
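
To make the cost point concrete, here is a back-of-envelope sketch. Every number in it (request volume, tokens per request, and both per-token prices) is a made-up placeholder rather than any provider's real rate, and the function name is hypothetical.

```python
def monthly_cost(requests_per_day, tokens_per_request, price_per_million_tokens, days=30):
    """Rough monthly spend in dollars for a flat per-token price."""
    total_tokens = requests_per_day * tokens_per_request * days
    return total_tokens / 1_000_000 * price_per_million_tokens

# Hypothetical workload: 2 million requests a day, 500 tokens each.
REQUESTS_PER_DAY = 2_000_000
TOKENS_PER_REQUEST = 500

# Assumed prices per million tokens, chosen only to show a 10x gap.
teacher_cost = monthly_cost(REQUESTS_PER_DAY, TOKENS_PER_REQUEST, 10.00)
student_cost = monthly_cost(REQUESTS_PER_DAY, TOKENS_PER_REQUEST, 1.00)

print(f"teacher: ${teacher_cost:,.0f}/month")
print(f"student: ${student_cost:,.0f}/month")
print(f"savings: ${teacher_cost - student_cost:,.0f}/month")
```

The absolute figures mean nothing outside this toy setup. The point is that a fixed per-request saving is multiplied by volume, so the gap grows with every extra request you serve.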

This is not a niche trick; it's everywhere. A large fraction of the small, fast, cheap models you use day to day were distilled, in whole or in part, from larger siblings. The famous early example is DistilBERT, a compressed version of BERT that its authors reported as about 40% smaller and 60% faster while retaining roughly 97% of BERT's language-understanding performance. Many of today's small open models lean on distillation from a bigger teacher to feel smarter than their parameter count suggests.

How it works

At the core, distillation is just fine-tuning the student (the same training loop, the same nudging of weights) with one twist: the training targets are generated by the teacher instead of written by hand. The pipeline has three stages, after which the student is deployed on its own.

// The distillation pipeline

Collect prompts (many realistic inputs) -> Teacher answers (big model generates outputs) -> Train student (small model copies them) -> Deploy student (cheap + fast in prod)

First, you gather a big set of representative prompts — ideally close to what real users will send. Second, you run the teacher over all of them and save its responses. That pile of (prompt, teacher output) pairs is now a synthetic dataset: examples written by a model, not a human. Third, you fine-tune the student on that dataset until it reliably produces teacher-like answers. The teacher is only needed during step two; once training is done, you ship the student alone.

Hard labels vs. soft labels

There are two depths of distillation, and the difference is what the student gets to copy.

Response distillation (hard labels). The student trains only on the teacher's final text answer. This is the simple, common approach for LLMs — you just need the teacher's outputs, which you can get from any API. No special access required.
Logit distillation (soft labels). Instead of just the final answer, the student copies the teacher's full probability distribution over next tokens: its confidence across every option, not just its top pick. This carries far more information, but you need access to the teacher's logits, so in practice it is limited to models you can run yourself, typically open-weight ones. It is also simplest when teacher and student share a tokenizer, so the two distributions cover the same set of tokens.

Why are soft labels so much richer? When a teacher answers a multiple-choice question, the hard label says only "the answer is B." The soft label says "B is 80% likely, C is 15%, A is 4%, D is basically 0." That extra signal — these wrong answers are plausible, these are absurd — is sometimes called dark knowledge, and it lets the student learn the teacher's whole sense of the problem from far fewer examples.
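
A small sketch can show what the hard label throws away. It reuses the multiple-choice numbers above (with "basically 0" taken to be 1%, an assumption) and scores two hypothetical students that both pick B with 80% confidence but disagree about which wrong answers are plausible.

```python
import math

options = ["A", "B", "C", "D"]

# Hard label: only the teacher's top pick survives.
hard = {"A": 0.0, "B": 1.0, "C": 0.0, "D": 0.0}
# Soft label: the teacher's full distribution ("basically 0" assumed to be 1%).
soft = {"A": 0.04, "B": 0.80, "C": 0.15, "D": 0.01}

def cross_entropy(target, predicted):
    """Loss the student pays for `predicted` when trained against `target`."""
    return -sum(target[o] * math.log(predicted[o]) for o in options if target[o] > 0)

# Two hypothetical students. Both put 80% on B, but the second one thinks
# the absurd option D is the plausible runner-up.
student_sensible = {"A": 0.04, "B": 0.80, "C": 0.15, "D": 0.01}
student_confused = {"A": 0.01, "B": 0.80, "C": 0.04, "D": 0.15}

for name, student in [("sensible", student_sensible), ("confused", student_confused)]:
    print(
        name,
        "| hard-label loss:", round(cross_entropy(hard, student), 3),
        "| soft-label loss:", round(cross_entropy(soft, student), 3),
    )
```

Against the hard label the two students are indistinguishable, because only their confidence in B is scored. Against the soft label the confused student pays a higher loss, which is the extra training signal the paragraph above calls dark knowledge.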

Temperature: blurring the teacher's confidence

Soft-label distillation uses a knob called temperature to control how much of that dark knowledge gets through. A confident teacher's raw probabilities are spiky (99% on one token, near-zero on the rest), which hides the interesting differences between the runner-up options. Dividing the teacher's raw scores (its logits) by a temperature greater than 1 before the softmax softens the distribution, spreading the probability so the student can see the relative ordering of the also-rans. The student is then trained to match that softened distribution, typically by minimizing KL divergence, a measure of how far apart two probability distributions are.
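
The effect of temperature is easy to see in a few lines of plain Python. This is a minimal sketch, not a training loop: the logits are invented, the temperature of 4 is an arbitrary choice, and the helper names are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q): how far the student's q is from the teacher's p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Invented logits over four candidate tokens.
teacher_logits = [9.0, 4.0, 3.0, -2.0]
student_logits = [6.0, 1.0, 3.5, 0.0]

T = 4.0  # assumed temperature; any value above 1 softens both distributions

print("teacher at T=1:", [round(p, 3) for p in softmax(teacher_logits)])
print("teacher at T=4:", [round(p, 3) for p in softmax(teacher_logits, T)])

loss = kl_divergence(softmax(teacher_logits, T), softmax(student_logits, T))
print("distillation loss at T=4:", round(loss, 4))
```

At temperature 1 nearly all of the teacher's probability sits on the first token; at temperature 4 the runner-ups become visible. In a real training loop the same loss would be computed per token position with a tensor library, and Hinton and colleagues additionally scale the soft-label term by the square of the temperature when mixing it with a hard-label loss.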

A minimal example: distill via a synthetic dataset

The most accessible form of distillation for LLM beginners needs no special model internals at all — just the teacher's text outputs. The plan: pick a hard task, have a strong teacher produce gold answers for a batch of inputs, save them as a training file, and fine-tune a small student on it. Step one is generating the synthetic dataset.

1_generate_dataset.py (Python)

import json
from openai import OpenAI  # any LLM SDK works the same way

client = OpenAI()  # reads the key from the OPENAI_API_KEY environment variable
TEACHER = "a-large-capable-model"   # the big, expensive model

# Real prompts your small model will need to handle in production.
prompts = [
    "Classify the sentiment of: 'shipping was slow but the product is great'",
    "Classify the sentiment of: 'never buying from here again'",
    # ...thousands more, ideally drawn from real traffic
]

with open("distill.jsonl", "w") as f:
    for p in prompts:
        # Let the TEACHER produce the gold answer.
        resp = client.chat.completions.create(
            model=TEACHER,
            messages=[{"role": "user", "content": p}],
        )
        answer = resp.choices[0].message.content
        # Save (prompt -> teacher answer) as a training example.
        row = {"messages": [
            {"role": "user", "content": p},
            {"role": "assistant", "content": answer},
        ]}
        f.write(json.dumps(row) + "\n")

Now distill.jsonl is an ordinary fine-tuning file — except every answer came from the teacher, not a human labeler. Step two is fine-tuning a small student on it, exactly as you would any other fine-tune. With a small open model you can do this locally, often with a LoRA adapter to keep it cheap.

2_train_student.py (Python)

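from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)
from peft import LoraConfig, get_peft_model

student_name = "Qwen/Qwen3-0.6B"   # a tiny, cheap student model
tok = AutoTokenizer.from_pretrained(student_name)
student = AutoModelForCausalLM.from_pretrained(student_name, dtype="auto")

# Train a small LoRA adapter instead of all the weights.
student = get_peft_model(student, LoraConfig(
    r=8, lora_alpha=16, task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj"]))  # attention projections

# The synthetic data the teacher generated in step 1.
data = load_dataset("json", data_files="distill.jsonl", split="train")

def tokenize(batch):
    # Each row holds a "messages" list, not a "text" field. Render it with the
    # student's chat template so training text matches inference-time prompts.
    texts = [tok.apply_chat_template(m, tokenize=False) for m in batch["messages"]]
    return tok(texts, truncation=True, max_length=512)

data = data.map(tokenize, batched=True, remove_columns=data.column_names)

# Pads each batch and copies input_ids into labels (mlm=False = causal LM).
collator = DataCollatorForLanguageModeling(tok, mlm=False)

args = TrainingArguments(output_dir="student", num_train_epochs=3,
                         per_device_train_batch_size=4, learning_rate=2e-4)
Trainer(model=student, args=args, train_dataset=data,
        data_collator=collator).train()
student.save_pretrained("student")  # your distilled model (the LoRA adapter)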
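
Before spending compute on the training run, it can be worth checking that the teacher actually produced usable rows. The sketch below validates lines in the format written by the dataset script (a "messages" list with one user turn and one assistant turn). The function name and the specific checks are assumptions for illustration, not part of any library.

```python
import json

def check_rows(lines):
    """Split JSONL lines into usable training rows and a list of problems."""
    valid, problems = [], []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            row = json.loads(line)
        except json.JSONDecodeError:
            problems.append((number, "not valid JSON"))
            continue
        messages = row.get("messages") if isinstance(row, dict) else None
        if not isinstance(messages, list) or not all(isinstance(m, dict) for m in messages):
            problems.append((number, "missing or malformed 'messages'"))
            continue
        roles = [m.get("role") for m in messages]
        contents = [m.get("content") for m in messages]
        if roles != ["user", "assistant"]:
            problems.append((number, f"unexpected roles: {roles}"))
        elif not all(isinstance(c, str) and c.strip() for c in contents):
            problems.append((number, "empty prompt or answer"))
        else:
            valid.append(row)
    return valid, problems

# In-memory sample so the sketch runs without a file on disk.
sample = [
    '{"messages": [{"role": "user", "content": "hi"}, {"role": "assistant", "content": "hello"}]}',
    '{"messages": [{"role": "user", "content": "hi"}, {"role": "assistant", "content": "  "}]}',
    'not json at all',
]
valid, problems = check_rows(sample)
print(len(valid), "usable rows;", len(problems), "problems:", problems)
```

To run it on the real file, pass an open file handle instead of the sample list, for example `check_rows(open("distill.jsonl"))`. Empty teacher answers are the case most worth catching, since the student would otherwise be trained to say nothing.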

Distillation vs. its cousins

Distillation is one of several ways to make a model smaller or cheaper, and beginners mix them up constantly. They're not competitors — they're often combined — but they do different things.

Technique	What it changes	What you get
Distillation	Trains a new, smaller model to imitate a bigger one	A cheaper model with similar behavior on your task
Quantization	Stores the same model's weights in lower precision (e.g. 4-bit)	The same model, smaller in memory, slightly less precise
Fine-tuning	Adjusts an existing model's weights on new examples	Same-size model, better at your task
RAG	Feeds the model documents at query time	Same model, given fresh knowledge it can look up

The cleanest way to keep them straight: distillation makes a smaller model, quantization makes a model take less space, fine-tuning makes a model better at a task, and RAG gives a model knowledge. Production teams routinely distill a model and then quantize the student to squeeze it even further — the techniques stack.

// Distillation vs. quantization (both shrink, differently)

Distillation

Builds a brand-new, smaller model
Student can change architecture entirely
Needs training data + compute up front
Big drop in size and cost possible

Quantization

Keeps the exact same model
Just stores weights in fewer bits
Usually no training: applied after the fact
Size drop capped by the bit width (about 4x from 16-bit to 4-bit), fast to apply
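
The way the two techniques stack can be sketched with simple arithmetic: weight memory is roughly parameter count times bits per weight. The 70B teacher size is a hypothetical figure, the 0.6B student matches the size of the student used in the training script, and the estimate ignores activations, caches, and quantization overhead.

```python
def weight_memory_gb(num_parameters, bits_per_weight):
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    return num_parameters * bits_per_weight / 8 / 1e9

TEACHER_PARAMS = 70e9    # hypothetical 70B-parameter teacher
STUDENT_PARAMS = 0.6e9   # roughly the size of the 0.6B student

for label, params, bits in [
    ("teacher, 16-bit", TEACHER_PARAMS, 16),
    ("teacher, 4-bit (quantized only)", TEACHER_PARAMS, 4),
    ("student, 16-bit (distilled only)", STUDENT_PARAMS, 16),
    ("student, 4-bit (distilled, then quantized)", STUDENT_PARAMS, 4),
]:
    print(f"{label}: {weight_memory_gb(params, bits):.1f} GB")
```

Quantization alone shrinks this teacher by a factor tied to the bit width, while distillation changes the parameter count itself. Applying both multiplies the two reductions.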

Common mistakes beginners make

Distillation looks deceptively easy — generate data, train, ship — and that's exactly why it goes wrong. Most failures come from the dataset, not the training.

Inheriting the teacher's mistakes. The student copies everything, including the teacher's hallucinations, biases, and bad habits. A student can never reliably exceed its teacher — garbage teacher, garbage student.
Prompts that don't match reality. If your synthetic prompts don't resemble real traffic, the student aces a test that never happens in production. Draw prompts from actual usage whenever you can.
Expecting it to stay general. A student distilled on sentiment classification will be great at sentiment and worse at everything else. Distillation narrows the model on purpose; don't be surprised when breadth disappears.
Skipping evaluation. "The loss went down" tells you almost nothing. You need a held-out evaluation set measuring the student against the teacher on the real task — that's the only number that matters.
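
The evaluation point can be sketched in a few lines. The example below assumes a classification-style task where exact match is a fair comparison, and it uses invented held-out answers in place of real model calls. The function name is hypothetical.

```python
def agreement_rate(teacher_answers, student_answers):
    """Fraction of held-out prompts where the student matches the teacher."""
    if not teacher_answers or len(teacher_answers) != len(student_answers):
        raise ValueError("need one student answer per teacher answer, and at least one")
    matches = sum(
        t.strip().lower() == s.strip().lower()
        for t, s in zip(teacher_answers, student_answers)
    )
    return matches / len(teacher_answers)

# Invented results on held-out prompts the student never saw during training.
teacher_out = ["positive", "negative", "mixed", "negative", "positive"]
student_out = ["positive", "negative", "positive", "negative", "Positive"]

rate = agreement_rate(teacher_out, student_out)
print(f"student agrees with teacher on {rate:.0%} of held-out prompts")
```

Exact match only works for short, closed-form answers such as labels. For free-form text you would need a different scorer, and where human-verified labels exist it is worth scoring both teacher and student against those as well, since agreement with a wrong teacher is not accuracy.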

Going deeper

The synthetic-data recipe above is the beginner on-ramp. Once it clicks, here's the frontier of how distillation is actually used to build the models you rely on.

White-box vs. black-box distillation. Everything you can do with just an API is black-box distillation — you only see the teacher's text. White-box distillation requires the teacher's internals: its logits, and sometimes its intermediate hidden states. Matching those internals (not just the final answer) transfers far more knowledge per example, which is why labs distilling their own open models can do it so efficiently. The original distillation paper by Hinton and colleagues used exactly this soft-label, logit-matching approach.

Reasoning distillation. A modern and powerful variant: let a strong teacher solve hard problems with its full chain-of-thought reasoning shown, then train a small student on those traces. The student learns not just final answers but the step-by-step process of getting there. This is a big reason small open models have recently become startlingly good at math and code — they were trained on the worked solutions of much larger reasoning models.
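
In data terms, the difference from plain response distillation is only what goes into the assistant turn. The sketch below builds one training row that keeps the worked steps. The "Reasoning:" and "Answer:" layout is an assumed convention for illustration (real models use their own trace formats), and the function name is hypothetical.

```python
import json

def reasoning_row(problem, reasoning, final_answer):
    """Build one chat-format training row that keeps the teacher's worked steps."""
    # Assumed layout: the reasoning trace first, then the final answer.
    target = f"Reasoning:\n{reasoning.strip()}\n\nAnswer: {final_answer.strip()}"
    return {"messages": [
        {"role": "user", "content": problem},
        {"role": "assistant", "content": target},
    ]}

# A hand-written stand-in for a teacher's trace.
row = reasoning_row(
    problem="A train travels 120 km in 1.5 hours. What is its average speed?",
    reasoning="Average speed is distance divided by time.\n120 km / 1.5 h = 80 km/h.",
    final_answer="80 km/h",
)
print(json.dumps(row, indent=2))
```

A student trained on rows like this is pushed to produce the steps before the answer, which is the behavior reasoning distillation is trying to transfer.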

Self-distillation and data filtering. You can distill a model from itself or a same-size sibling — generating many candidate answers, keeping only the best ones (often picked by an LLM-as-judge), and retraining on that filtered set. The model effectively teaches itself its own best behavior. Combined with preference methods like RLHF, this is a core loop in how labs iteratively improve models without armies of human labelers.
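
The generate, judge, and keep loop described above can be sketched as a best-of-n filter. The generator and judge here are toy stand-ins so the code runs without a model. In practice both would be model calls, and the names, the score scale, and the threshold are all assumptions.

```python
def best_of_n(prompt, generate, judge, n=4, min_score=0.5):
    """Sample n candidates, score each, and keep the best one if it clears the bar."""
    candidates = [generate(prompt, attempt) for attempt in range(n)]
    scored = [(judge(prompt, candidate), candidate) for candidate in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best if best_score >= min_score else None

def toy_generate(prompt, attempt):
    """Stand-in for sampling the model several times."""
    canned = ["4", "four-ish?", "2 + 2 = 4", "5"]
    return canned[attempt % len(canned)]

def toy_judge(prompt, answer):
    """Stand-in for an LLM judge: rewards the right result and shown working."""
    score = 0.0
    if "4" in answer:
        score += 0.6
    if "=" in answer:
        score += 0.3
    return score

kept = best_of_n("What is 2 + 2?", toy_generate, toy_judge)
print("kept for retraining:", kept)
```

Prompts whose best candidate still scores below the threshold return `None` and are dropped, so the retraining set contains only answers the judge was reasonably happy with. The quality of the whole loop therefore rests on the judge.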

The open problems. Distillation has real limits. The student is capped by the teacher — you can't distill new capabilities into existence, only transfer existing ones. Picking which prompts to distill on is still mostly art: cover too little and the student has blind spots, cover too much and you've rebuilt an expensive general model. And measuring whether a distilled model is truly safe — not just accurate — remains genuinely hard, which is why a serious evaluation and red-teaming pass is non-negotiable before any distilled model ships.

FAQ

What is model distillation in simple terms?

It's training a small, cheap model (the student) to copy the answers of a big, expensive model (the teacher). You run the teacher over many prompts, save its outputs, then fine-tune the student on those outputs. The student ends up behaving like a smaller version of the teacher on your task, while being far faster and cheaper to run.

What's the difference between the teacher and student model?

The teacher is the large, high-quality model whose behavior you want to capture. The student is the smaller model you're actually going to deploy. The teacher generates the training examples; the student learns from them. After training you ship only the student — the teacher's job is done.

Is distillation the same as fine-tuning?

Distillation is a type of fine-tuning. Ordinary fine-tuning trains a model on human-written examples; distillation trains it on examples generated by another model (the teacher). The training loop is the same — what changes is where the 'correct answers' come from.

How is distillation different from quantization?

Distillation creates a brand-new, smaller model that imitates a bigger one. Quantization keeps the exact same model but stores its weights in lower precision (like 4-bit) to save memory. Distillation needs training; quantization is applied after the fact. Teams often do both — distill first, then quantize the student.

Can a distilled model be as good as the original?

On a narrow task, it can come close — that's the whole point. But it can never reliably beat its teacher, and it loses the teacher's breadth across other tasks. Distillation trades general ability for efficiency on the specific job you trained it for.

Is it legal to distill from a commercial model's outputs?

It depends on the provider's terms, and often the answer is no. Several commercial providers' terms of service forbid using their model outputs to train a competing model, so read the terms and the license before you start. Distilling from a permissively licensed open model avoids that problem and is the safer default for experimentation.
