In plain English
Training a large language model happens in two broad phases: pretraining and post-training. Pretraining is long and expensive — the model reads hundreds of billions of words and learns to predict what comes next. Post-training is shorter and more targeted — the model is shaped into an assistant that follows instructions and behaves safely.

A useful analogy: think of a new employee who just graduated. Pretraining is the degree — years of reading textbooks, news, code, and arguments, absorbing the structure of knowledge and language. Post-training is the onboarding at their first job: here's how we answer the phone, here's what we don't say to customers, here's when to escalate. The degree is what makes them capable; the onboarding is what makes them useful and professional.
The output of pretraining is called a base model (sometimes a foundation model). It can complete text, but it isn't an assistant — ask it a question and it might write more questions. Post-training converts the base model into an instruct model or a chat model: something that understands requests and generates helpful, on-topic responses.
Why it matters for builders
If you're building on top of an LLM API, understanding the training pipeline tells you what you're actually working with and where behaviors come from. That matters in at least three ways.
- Prompt design: a base model completes text; an instruct model responds to instructions. Prompts written for one don't always transfer to the other.
- Fine-tuning decisions: adding your own supervised fine-tuning (SFT) layer on top of a base model is a different undertaking than fine-tuning an already-instruction-tuned model. The starting point shapes the outcome.
- Behavioral expectations: biases, knowledge cutoffs, and capability limits all trace back to specific training decisions — what data was included, when training stopped, how the reward model was set up.
The pipeline also explains why training large models costs hundreds of millions of dollars. Pretraining Llama 3 (405B parameters) involved training on roughly 15 trillion tokens — that's an estimated 15 quadrillion characters of text, processed on thousands of GPUs over months. Post-training is comparatively cheap, but it's where most of the behavioral investment happens, including the human labelers who evaluate outputs.
For smaller teams, the practical implication is that you almost never pretrain from scratch — you start from a publicly released base model or a hosted instruct model and add your own SFT or prompting layer on top. Knowing where in the pipeline you're entering helps you pick the right tool.
How the pipeline works, stage by stage
Stage 1: Pretraining
Pretraining is self-supervised learning at a scale that is hard to fathom. The training objective is simple: given a sequence of tokens, predict the next one. That's it. But doing this billions of times across trillions of tokens forces the model to implicitly learn grammar, facts, logic, code syntax, and countless other patterns — because they are all needed to predict text accurately.
The data comes primarily from the public web. Common Crawl, a non-profit that snapshots the web monthly, underpins most major training runs. It currently holds over 300 billion webpages and grows by 3–5 billion pages per month. Labs filter out near-duplicates, low-quality pages, hate speech, and adult content, then mix in higher-quality sources: books, academic papers (e.g. ArXiv), Wikipedia, and curated code repositories.
Training at this scale requires thousands of GPUs running in parallel for months. The weights are updated via gradient descent: the model makes a prediction, compares it against the actual next token, and nudges its billions of parameters slightly in the direction that reduces the error. Repeat for 15 trillion examples and the model gradually encodes a rich, compressed model of human knowledge.
Stage 2: Supervised fine-tuning (SFT)
A freshly pretrained base model is a powerful text completer, not an assistant. SFT is the first post-training step that changes this. The model is trained — the same gradient-descent process, but on a much smaller, carefully curated dataset — on prompt–response pairs written or approved by human annotators.
Example: a prompt might be "Summarize the following article in three bullet points" followed by an ideal summary. By seeing thousands of such pairs, the model learns the general format of following instructions: here is a request, here is the appropriate kind of response. SFT is fast by comparison with pretraining — the whole run might take hours rather than months — but the quality of the demonstrations matters enormously. Garbage in, garbage out applies even more sharply here.
The output is an SFT model that can follow instructions competently but hasn't yet been shaped by preference feedback. It may still produce verbose, sycophantic, or occasionally harmful text because it's learned to imitate good responses, not to be rewarded by human judgment.
Stage 3: Alignment — RLHF and its alternatives
Reinforcement Learning from Human Feedback (RLHF) was popularized by OpenAI's InstructGPT paper (2022) and remains the canonical approach for the final alignment stage. The process has two sub-steps.
- Train a reward model. Human raters compare pairs of model outputs for the same prompt and pick the better one. A separate neural network — the reward model — learns to predict which output a human would prefer. It becomes an automated proxy for human judgment.
- Optimize the policy. The SFT model (now called the 'policy') generates outputs, the reward model scores them, and reinforcement learning (typically PPO — Proximal Policy Optimization) updates the policy's weights to produce higher-scoring outputs. A KL-divergence penalty stops the model drifting too far from its SFT starting point.
RLHF is powerful but complex: the RL training loop is unstable, the reward model can be exploited (the policy finds outputs that score well but aren't actually good — a phenomenon called reward hacking), and it requires a significant human labeling budget.
Direct Preference Optimization (DPO), introduced in 2023 and now widely used, simplifies this by eliminating the separate reward model entirely. DPO directly fine-tunes the policy on pairs of preferred vs. rejected outputs, using a mathematical reformulation that achieves similar alignment without the RL instability. By 2025, DPO-based approaches had become the dominant alignment method in open-source model training.
What data goes in — and what gets filtered out
The composition of the pretraining corpus has an outsized effect on what the model knows, which languages it handles well, and what biases it carries. Labs are increasingly deliberate about corpus design.
| Data source | What it contributes | Typical mix notes |
|---|---|---|
| Web crawl (Common Crawl) | Broad world knowledge, contemporary language | Heavily filtered — raw crawl contains spam, hate speech, and near-duplicates |
| Books and long-form text | Coherent long-range reasoning, narrative structure | Hard to license at scale; some labs use open-access sources like Project Gutenberg |
| Code repositories | Programming ability, structured logic | GitHub and similar; strongly boosts coding and reasoning benchmarks |
| Wikipedia and encyclopedias | High-accuracy factual knowledge | Small volume but high signal; often upweighted |
| Academic papers | STEM reasoning, scientific knowledge | ArXiv, PubMed; boosts domain depth |
| Synthetic data | Targeted capability gaps, reasoning chains | Increasingly used for math and logic; must be validated carefully to avoid noise amplification |
Because English dominates the web, most training corpora are heavily English-weighted — often 40–50% even after diversity efforts. This directly explains why frontier models perform better in English than in lower-resource languages. Some labs now deliberately oversample other languages during pretraining to improve multilingual performance.
Base model vs instruct model: what the difference feels like
The practical difference between a base model and an instruct model is large enough that many builders never interact with a base model at all. Here is what each does when you send the same input.
- Continues text as if completing a document
- May output more questions instead of answering
- Does not follow a system prompt
- Useful for: researchers, fine-tuning starting points
- Examples: Llama 4 base, Mistral base
- Responds to the prompt as a request
- Formats output to match user intent
- Respects system prompts and safety rules
- Useful for: production apps, assistants, agents
- Examples: Llama 4 Instruct, GPT-5.5, Claude Sonnet 4.6
For most production use cases, you start with an instruct model and add your own behavior via prompting or a thin SFT layer on top. Starting from a base model and doing the full alignment stack yourself is reserved for research labs or teams building specialized models where general-purpose alignment would conflict with the use case.
Going deeper
Constitutional AI and RLAIF
Anthropic's Constitutional AI (CAI) approach, described in their 2022 paper, replaces some human preference labeling with a set of written principles (the 'constitution'). The model critiques and revises its own outputs according to these principles, and those revised outputs become training data. This scales the feedback process without requiring a human to evaluate every response pair.
More broadly, Reinforcement Learning from AI Feedback (RLAIF) replaces human raters with a capable AI model. Google DeepMind research showed RLAIF can match RLHF performance on several benchmarks while dramatically cutting the labeling cost. As models improve, their self-critiques become more reliable — so RLAIF quality tends to rise with each generation.
Group Relative Policy Optimization (GRPO) and reasoning models
DeepSeek's open release in early 2025 popularized GRPO, a variant of the RL alignment step that eliminates the need for a separate critic model. Instead of comparing a policy output against a learned value function, GRPO compares a batch of outputs against each other — averaging out noise across the group. This is simpler and more memory-efficient than PPO and has been adopted by several open-source training pipelines.
Reasoning models — such as those in OpenAI's GPT-5 series and DeepSeek-R1 — add an additional stage: test-time compute scaling, where the model generates long internal reasoning chains (sometimes called 'thinking' or 'chain-of-thought') before producing its final answer. These reasoning models are trained with RL rewards specifically designed to reinforce correct reasoning steps, not just correct final answers. This is a distinct post-training stage on top of the standard SFT + RLHF pipeline.
Mid-training and continued pretraining
The clean three-stage picture — pretrain, SFT, RLHF — is an approximation. In practice, large labs insert additional stages. Meta described a mid-training phase for Llama 3 that runs after initial pretraining and before SFT: the model is trained on synthetic reasoning data and domain-specific corpora to sharpen capabilities before the alignment work begins. Similarly, continued pretraining lets you take a publicly released base model and run additional pretraining on a domain-specific corpus (medical records, legal documents, code in a specific language) before adding SFT on top.
Why the cost asymmetry matters
Pretraining a frontier model costs hundreds of millions of dollars in compute. The entire post-training pipeline — SFT data curation, reward model training, RLHF or DPO runs, safety red-teaming — typically costs a small fraction of that. Yet post-training is where most of the behavioral differentiation between models happens. Two models pretrained on identical data will behave very differently after different post-training choices. This is why fine-tuning an existing base model is so economically attractive: you're piggybacking on the expensive pretraining and spending your budget where it changes behavior most.
FAQ
How long does it take to pretrain an LLM?
For frontier models, pretraining runs for weeks to months on thousands of specialized GPUs or TPUs. Llama 3's 405B model, for example, used clusters of tens of thousands of H100 GPUs. Smaller models (7–13B parameters) can be pretrained in days on a few hundred GPUs, which is why open-source mid-size models are practical for research teams.
What is the difference between fine-tuning and pretraining?
Pretraining trains a model from scratch on a massive general-purpose corpus — this is where foundational language knowledge is acquired. Fine-tuning starts from an already-trained model and continues training on a smaller, more targeted dataset to shift behavior toward a specific task, style, or domain. Fine-tuning is orders of magnitude cheaper because the model's weights are already close to useful.
Can I pretrain my own LLM?
Technically yes, practically almost never for frontier-scale models. Pretraining at the scale of a frontier model like the GPT-5 series or Claude's Opus models requires infrastructure that costs hundreds of millions of dollars. Research teams working with smaller budgets typically use 1–7B parameter models trained on filtered web data, or start from a publicly released base model like Llama 4. For most builders, continued pretraining or SFT on top of an existing base is the realistic path.
What is RLHF and why is it needed?
Reinforcement Learning from Human Feedback (RLHF) is an alignment technique where human raters compare model outputs and a reward model is trained on their preferences. The main model is then optimized to produce outputs the reward model scores highly. It's needed because SFT alone teaches the model to imitate good responses, but doesn't reliably surface which outputs humans actually prefer — especially for subjective qualities like helpfulness, tone, and safety.
What replaced RLHF in modern models?
Direct Preference Optimization (DPO) is the most widely adopted alternative. It achieves similar alignment to RLHF without training a separate reward model, by directly fine-tuning the policy on preferred vs. rejected output pairs. DPO is simpler, more stable, and has become dominant in open-source training pipelines. Some labs also use Constitutional AI (Anthropic) or RLAIF, where an AI model rather than human raters provides the preference signal.
Does post-training change what the model knows?
Not in a fundamental way. Post-training shapes how the model uses its knowledge — following instructions, being concise, refusing harmful requests — but the underlying knowledge base comes from pretraining. You can teach a model new behaviors in SFT, but reliably injecting new factual knowledge is hard and often causes the model to hallucinate. For up-to-date knowledge, retrieval-augmented generation (RAG) is more reliable than fine-tuning.