AI/TLDR

LLM Pretraining vs Fine-Tuning: How Models Are Actually Trained

Understand the two-stage paradigm behind every modern LLM: expensive pretraining that builds a base model, then cheap fine-tuning that shapes its behavior.

INTERMEDIATE · 9 MIN READ · UPDATED 2026-06-12

In plain English

Every modern large language model is built in two very different stages, and the distinction between LLM pretraining and fine-tuning is the single most useful mental model you can have about how these systems are made.

Think of raising a person. Pretraining is like the first eighteen years: the model reads a huge slice of the internet and slowly absorbs grammar, facts, code, and reasoning patterns. It is slow, brutally expensive, and produces a base model that knows an enormous amount but has no idea how to behave in a conversation. Fine-tuning (also called post-training) is like job training: a comparatively tiny, carefully chosen set of examples that teaches the already-knowledgeable model how to follow instructions, stay helpful, and act the way you want.

Why it matters

If you are building with LLMs, this distinction decides where your money and effort go. Pretraining a frontier model from scratch costs millions of dollars and is done by a handful of labs. Fine-tuning is something a small team can do on a weekend for the price of a few hundred GPU-hours. Confusing the two leads to one of the most common and costly mistakes in applied AI: trying to teach the model new facts with fine-tuning, when fine-tuning is mostly about teaching behavior and format.

  • You will almost never pretrain. Pretraining is a lab-scale activity. Knowing what it does, though, tells you what is already baked into the model you are using and what is not.
  • Fine-tuning is a real lever, but a narrow one. It is excellent for changing tone, output format, or specializing on a task. It is a poor and expensive way to inject knowledge that changes often.
  • Most teams should reach for prompting or RAG first. Understanding the training pipeline is exactly what lets you make that call correctly instead of fine-tuning by reflex.

It also explains real behaviors you have already seen. A model's knowledge cutoff is set during pretraining. Its refusal style, its chatty persona, and its tendency to use bullet points all come from post-training. Once you can attribute a behavior to the right stage, debugging gets dramatically easier.

How it works

The full pipeline is a sequence: a randomly initialized transformer is pretrained into a base model, then post-trained through one or more behavior-shaping steps. Here is the canonical flow.

Stage 1: Pretraining (capability)

Pretraining uses a self-supervised objective called next-token prediction: given a chunk of text, predict the following token, check the answer, and nudge billions of weights to do better. The genius is that the data labels itself. Every sentence on the internet is simultaneously an input (everything before a given word) and a label (that word), so no humans need to annotate anything. That is what makes training on trillions of tokens feasible.
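To make "the data labels itself" concrete, here is a minimal sketch of how one sentence expands into training pairs. The helper name `next_token_pairs`, the `context_size` parameter, and the whitespace "tokenizer" are illustrative assumptions; real pipelines use subword tokenizers and far longer contexts.

```python
def next_token_pairs(tokens, context_size):
    """Turn one token sequence into (context, target) training pairs."""
    pairs = []
    for i in range(1, len(tokens)):
        # Everything before position i (capped at context_size) is the input...
        context = tokens[max(0, i - context_size):i]
        # ...and the token at position i is the label. No human annotation needed.
        pairs.append((context, tokens[i]))
    return pairs


# Whitespace splitting stands in for a real subword tokenizer (assumption).
tokens = "the cat sat on the mat".split()
pairs = next_token_pairs(tokens, context_size=3)
for context, target in pairs:
    print(context, "->", target)
```

A six-token sentence yields five labeled examples for free, which is why raw text scales to trillions of training signals without annotators.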

The scale is the headline. Pretraining is the most compute-intensive stage by far. As a public, verifiable reference point, DeepSeek's own technical report states that DeepSeek-V3 was pretrained on 14.8 trillion tokens using roughly 2.66 million H800 GPU-hours, with total training at around 2.79 million GPU-hours. At the report's assumed rental price of $2 per GPU-hour, that is about $5.6M. Frontier runs from larger labs are widely estimated to cost tens to hundreds of millions of dollars. This is why pretraining lives behind a moat of capital and GPUs (see why LLMs need GPUs).
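The arithmetic behind that reference point is easy to check. The sketch below uses only the figures quoted above; the variable names are illustrative, and the derived tokens-per-GPU-hour number is a back-of-the-envelope ratio rather than a figure from the report.

```python
# Figures quoted above from the DeepSeek-V3 technical report.
pretraining_gpu_hours = 2.66e6
total_gpu_hours = 2.79e6
tokens_trained = 14.8e12

# Rental price used in the text, in US dollars per H800 GPU-hour.
price_per_gpu_hour = 2.0

total_cost = total_gpu_hours * price_per_gpu_hour
pretraining_share = pretraining_gpu_hours / total_gpu_hours
tokens_per_gpu_hour = tokens_trained / pretraining_gpu_hours

print(f"Total cost: ${total_cost / 1e6:.2f}M")
print(f"Pretraining share of GPU-hours: {pretraining_share:.1%}")
print(f"Tokens per pretraining GPU-hour: {tokens_per_gpu_hour / 1e6:.2f}M")
```

The share line makes the article's point numerically: pretraining accounts for roughly 95% of the GPU-hours in this run, leaving only a small remainder for everything that follows.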

Stage 2: Fine-tuning / post-training (behavior)

The base model that falls out of pretraining is a brilliant autocomplete and a terrible assistant. Post-training fixes that in two broad moves. First, supervised fine-tuning (SFT): show the model tens of thousands of high-quality (instruction, ideal response) pairs so it learns the shape of being helpful — answering questions, following formats, refusing harmful requests. Second, preference training: collect comparisons where humans (or another model) rank responses, then push the model toward the preferred ones. The classic version is RLHF, and a simpler, increasingly common alternative is DPO (Direct Preference Optimization), which skips the separate reward model and optimizes preferences directly.
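Preference training needs data in a different shape from SFT. As a rough sketch, a human ranking of several responses can be expanded into pairwise comparisons. The function name `ranking_to_pairs` and the `prompt`/`chosen`/`rejected` field names are assumptions chosen for illustration, not a fixed standard, and the responses are made up.

```python
from itertools import combinations


def ranking_to_pairs(prompt, ranked_responses):
    """Expand a best-to-worst ranking into (chosen, rejected) preference pairs."""
    pairs = []
    # combinations() keeps input order, so `better` always outranks `worse`.
    for better, worse in combinations(ranked_responses, 2):
        pairs.append({"prompt": prompt, "chosen": better, "rejected": worse})
    return pairs


# Hypothetical labeler ranking, best response first.
ranked = [
    "Paris is the capital of France.",
    "It's Paris, I think?",
    "France is a country in Europe.",
]
preference_pairs = ranking_to_pairs("What is the capital of France?", ranked)
for pair in preference_pairs:
    print(pair["chosen"], ">", pair["rejected"])
```

Both RLHF reward models and DPO can consume comparisons of this kind; they differ in what they do with them.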

Crucially, post-training touches a tiny fraction of the data and compute of pretraining, yet it is responsible for almost everything users perceive as the model's personality, safety, and usefulness.

LLM pretraining vs fine-tuning, side by side

The two stages share the same underlying training machinery (gradient descent on a transformer) but differ on every dimension that matters in practice.

| Dimension | Pretraining | Fine-tuning / post-training |
| --- | --- | --- |
| Goal | Build broad capability (language, facts, reasoning) | Shape behavior, format, tone, or domain specialization |
| Data | Trillions of tokens, mostly unlabeled web/code/books | Thousands to millions of curated, labeled examples |
| Objective | Self-supervised next-token prediction | Supervised (SFT) + preference optimization (RLHF/DPO) |
| Compute | Millions of GPU-hours; weeks to months | Hundreds to thousands of GPU-hours; hours to days |
| Cost | Millions to hundreds of millions of dollars | Tens to thousands of dollars (with LoRA, often less) |
| Output | A base model (raw, not chat-ready) | An instruct/chat/aligned model |
| Who does it | A handful of well-funded labs | Almost any team with a GPU and a dataset |

When to fine-tune vs prompt vs RAG

This is the decision that actually shows up in your work. Reaching for fine-tuning first is the most common anti-pattern in applied LLMs. Walk this ladder from cheapest and fastest to most involved, and stop at the first rung that solves your problem.

| You need to... | Reach for | Why |
| --- | --- | --- |
| Change tone, format, or style | Prompting, then fine-tuning | Often a system prompt is enough; fine-tune to make it consistent and cheaper |
| Give the model current or private facts | RAG | Knowledge changes too often to bake into weights; retrieval keeps it fresh |
| Hit a strict output schema every time | Fine-tuning (or structured outputs) | Behavior, not knowledge: exactly what fine-tuning is good at |
| Specialize on a narrow domain task | Fine-tuning | Teaches a repeatable skill the base model can already half-do |
| Cut tokens / latency from a huge prompt | Fine-tuning | Bake the instructions into weights instead of resending them every call |
| Teach a fundamentally new capability | Realistically, a bigger base model | Fine-tuning amplifies existing ability; it rarely creates new ability |

The shape of a fine-tuning example (SFT), in Python:
# Pretraining data is raw text. Fine-tuning data is structured
# conversations that teach the model how to BEHAVE.
example = {
    "messages": [
        {"role": "system", "content": "You are a terse SQL assistant."},
        {"role": "user", "content": "Top 5 customers by revenue?"},
        {"role": "assistant",
         "content": "SELECT customer_id, SUM(amount) AS rev\n"
                    "FROM orders GROUP BY customer_id\n"
                    "ORDER BY rev DESC LIMIT 5;"}
    ]
}
# A few thousand examples like this teach FORMAT and STYLE.
# They do NOT teach the model new facts about your database.
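In practice, examples like this are usually checked and written out one per line before training. The sketch below assumes a JSON Lines file and a handful of sanity rules; `validate_example`, `to_jsonl`, and the specific checks are hypothetical helpers, so follow whatever schema your fine-tuning toolchain actually documents.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_example(example):
    """Check the basic shape of one chat-style SFT example (illustrative rules)."""
    messages = example.get("messages")
    if not isinstance(messages, list) or not messages:
        raise ValueError("example needs a non-empty 'messages' list")
    for message in messages:
        if message.get("role") not in ALLOWED_ROLES:
            raise ValueError(f"unexpected role: {message.get('role')!r}")
        if not isinstance(message.get("content"), str) or not message["content"]:
            raise ValueError("every message needs non-empty string content")
    # The assistant turn is the behavior being taught, so it must come last.
    if messages[-1]["role"] != "assistant":
        raise ValueError("last message must be the assistant's target response")


def to_jsonl(examples):
    """Serialize validated examples, one JSON object per line."""
    lines = []
    for example in examples:
        validate_example(example)
        lines.append(json.dumps(example, ensure_ascii=False))
    return "\n".join(lines) + "\n"


examples = [
    {"messages": [
        {"role": "system", "content": "You are a terse SQL assistant."},
        {"role": "user", "content": "How many orders are there?"},
        {"role": "assistant", "content": "SELECT COUNT(*) FROM orders;"},
    ]},
]
jsonl_text = to_jsonl(examples)
print(jsonl_text, end="")
```

Catching malformed examples before training is cheap insurance: a few broken records in a dataset of a few thousand can noticeably skew the behavior the model picks up.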

Going deeper

Why post-training is wildly cost-effective

The most striking result in this whole area is from the InstructGPT work: outputs from a 1.3B-parameter post-trained model were preferred by humans over the raw 175B-parameter base model — a model over 100x larger. The lesson is that pretraining loads the model with latent capability, and a small, well-aimed dose of post-training elicits far more of that latent value than throwing more pretraining compute at it would. Capability and helpfulness are different axes.

SFT vs RLHF vs DPO, briefly

SFT does imitation: copy the demonstrated good answers. But imitation can't easily express "answer A is better than answer B," and it can't reward responses better than the ones in your dataset. Preference methods fix that. RLHF trains a separate reward model on human comparisons, then uses reinforcement learning to maximize that reward. DPO mathematically reframes the same preference data as a direct classification-style loss, removing the reward model and RL loop — simpler and cheaper, though RLHF reward models can generalize better out-of-distribution. A typical modern recipe is SFT first, then one preference stage on top.
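The "classification-style loss" can be written in a few lines. This is a sketch of the published DPO objective for a single preference pair, assuming the inputs are summed token log-probabilities of each full response under the model being trained (the policy) and under a frozen reference copy. The numeric log-probabilities and the `beta=0.1` default are illustrative assumptions.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for a single (chosen, rejected) response pair."""
    # How much more likely each response became relative to the reference.
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(margin)), written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy identical to the reference: zero margin, so the loss is log(2).
untrained = dpo_loss(-12.0, -10.0, -12.0, -10.0)

# Policy has raised the chosen response and lowered the rejected one.
improved = dpo_loss(-9.0, -14.0, -12.0, -10.0)

print(f"untrained: {untrained:.4f}")
print(f"improved:  {improved:.4f}")
```

The loss falls only when the policy widens the gap between chosen and rejected responses relative to the reference, which is how DPO captures "A is better than B" without a separate reward model or RL loop.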

Catastrophic forgetting and the fine-tuning tightrope

Fine-tune too hard and the model forgets general skills it learned during pretraining — it gets great at your narrow task and worse at everything else. This is catastrophic forgetting. It is a major reason LoRA and other parameter-efficient methods are popular: by freezing the original weights and training small adapters, they preserve the expensive pretraining knowledge while still steering behavior.
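A small sketch shows why adapters are both cheap and gentle on pretrained knowledge. It follows the LoRA idea of adding a low-rank product B·A alongside a frozen weight matrix W. The layer size, rank, and helper names are illustrative assumptions, and plain Python lists stand in for tensors.

```python
def lora_param_count(d_out, d_in, rank):
    """Trainable parameters in a rank-`rank` adapter for a d_out x d_in layer."""
    # B is d_out x rank and A is rank x d_in.
    return rank * (d_out + d_in)


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, scale=1.0):
    """Frozen path W @ x plus trainable low-rank path B @ (A @ x)."""
    base = matvec(W, x)                # pretrained weights, never updated
    update = matvec(B, matvec(A, x))   # the only part that trains
    return [b + scale * u for b, u in zip(base, update)]


# Hypothetical 4096 x 4096 projection with a rank-8 adapter.
full = 4096 * 4096
adapter = lora_param_count(4096, 4096, rank=8)
print(f"full: {full:,}  adapter: {adapter:,}  fraction: {adapter / full:.2%}")

# Tiny example: with B at zero (the usual starting point), the output
# is exactly the frozen model's output.
W = [[1, 0, 0], [0, 1, 0]]
A = [[0.5, 0.5, 0.5]]
x = [1, 2, 3]
at_init = lora_forward(W, A, [[0.0], [0.0]], x)
after_training = lora_forward(W, A, [[1.0], [-1.0]], x)
print(at_init, after_training)
```

Because W is never modified, removing the adapter recovers the original model exactly, which is the sense in which the pretraining knowledge is preserved.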

The blurring middle: mid-training and continued pretraining

The clean two-stage story is getting fuzzier. Labs now run continued pretraining (extra next-token training on a domain corpus, e.g. legal or biomedical text) and mid-training steps that sit between pretraining and post-training. These are still capability-building, self-supervised steps — just applied after the main run. For a builder, the practical takeaway is unchanged: capability comes from large-scale next-token training; behavior comes from small-scale, labeled, preference-shaped training. Knowing which stage owns which behavior is what makes you effective with these models, whether you ever train one or just call the API.

FAQ

What is the difference between LLM pretraining and fine-tuning?

Pretraining trains a model from scratch on trillions of unlabeled tokens using self-supervised next-token prediction, producing a base model with broad capability. Fine-tuning (post-training) then adapts that base model with a much smaller set of curated examples to shape behavior, tone, format, or domain skill. Pretraining builds capability; fine-tuning shapes behavior.

Is fine-tuning cheaper than pretraining?

Dramatically. Pretraining a frontier model runs into millions of GPU-hours and millions of dollars. Fine-tuning typically uses hundreds to thousands of GPU-hours, and with parameter-efficient methods like LoRA it can cost tens to a few hundred dollars — well within reach of a small team.

Should I fine-tune or use RAG to add knowledge?

For knowledge that changes or is private, use RAG. Fine-tuning bakes facts into frozen weights that are hard to update and prone to confident errors. Fine-tuning shines for behavior — consistent format, tone, or a repeatable task — not for injecting information that has a date on it.
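The contrast is easiest to see in code. Below is a toy sketch of the RAG pattern, assuming a word-overlap retriever and a hand-written prompt template; the `retrieve` and `build_prompt` helpers and the sample documents are made up for illustration, and production systems typically use embedding search instead.

```python
def retrieve(query, documents, top_k=1):
    """Rank documents by how many query words they share (toy retriever)."""
    query_words = set(query.lower().split())
    scored = []
    for doc in documents:
        overlap = len(query_words & set(doc.lower().split()))
        scored.append((overlap, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for overlap, doc in scored[:top_k] if overlap > 0]


def build_prompt(query, documents, top_k=1):
    """Place the retrieved facts in the prompt instead of in the weights."""
    context = retrieve(query, documents, top_k)
    context_block = "\n".join(f"- {doc}" for doc in context)
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context_block}\n"
        f"Question: {query}"
    )


# Hypothetical knowledge base.
documents = [
    "Refund window is 30 days as of March.",
    "Support hours are 9 to 5 on weekdays.",
    "Shipping is free over 50 dollars.",
]
query = "what is the refund window"
prompt = build_prompt(query, documents)
print(prompt)

# Updating knowledge means editing a document, not retraining the model.
documents[0] = "Refund window is 60 days as of June."
updated_prompt = build_prompt(query, documents)
```

The model's weights never change here; only the text it is shown does, which is why dated or private facts belong in retrieval.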

What is the difference between SFT and RLHF?

SFT (supervised fine-tuning) trains the model to imitate good demonstrated answers. RLHF (and the simpler DPO) goes further by using human preference comparisons to push the model toward better responses than the demonstrations alone could teach. A common recipe runs SFT first, then a preference-training stage.

What is a base model versus an instruct model?

A base model is the raw output of pretraining: it predicts text well but does not reliably follow instructions or behave like a chat assistant. An instruct (or chat/aligned) model is a base model after post-training — SFT plus preference training — which is the version you actually converse with.

Further reading