In plain English
Every AI model lives two completely separate lives. Training is when the model is built: it reads enormous amounts of text and slowly adjusts billions of internal numbers until it gets good at predicting what comes next. Inference is when the model is used: you send it a prompt, it runs the math, and an answer comes back. Training happens once, in a lab, over weeks or months. Inference happens millions of times a day, every time anyone hits send.
Here's the analogy that makes it stick. Training is medical school: ten brutal years of studying, exams, and corrections, costing a fortune, producing a doctor. Inference is the appointment: you walk in, describe your symptoms, and the doctor applies everything they already know to your specific case in fifteen minutes. The doctor does not re-attend medical school for each patient — and crucially, seeing you does not change what the doctor knows. They walk out of your appointment with exactly the knowledge they walked in with.
That last part trips people up constantly. During inference, the model's knowledge — its parameters, or weights — is completely frozen. ChatGPT answering your question is not "learning" anything in the sense of permanently updating itself. It's a doctor seeing a patient, not a student in a lecture.
Why it matters
Once you separate the two phases, a whole cluster of confusing AI facts suddenly makes sense:
- Why models have a knowledge cutoff. The model only knows what was in its training data, and training ended on a specific date. Asking about yesterday's news fails because no inference request can add knowledge to frozen weights. See knowledge cutoffs explained.
- Why you pay per token, not per model. The training bill was paid once by the lab. Your API bill covers inference: the GPU-seconds spent running your specific request. More tokens in and out means more compute means more cost.
- Why correcting a chatbot doesn't fix it for next time. Your correction lives in the conversation's context window, not in the weights. Open a new chat and the model is back to its trained state.
- Why "the model got smarter overnight" usually means a new model. Improving the actual weights requires another training run, after which the provider ships a new checkpoint. The old one didn't learn; it got replaced.
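The per-token billing in the list above comes down to simple arithmetic. Here is a rough sketch; the prices are made-up placeholders rather than any provider's real rates, and `request_cost` is a hypothetical helper, not a real API.

```python
# Hypothetical prices in USD per million tokens (assumption for illustration;
# real rates vary by provider and model).
PRICE_PER_M_INPUT = 3.00
PRICE_PER_M_OUTPUT = 15.00


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the inference cost of one request under the assumed prices."""
    return (
        input_tokens * PRICE_PER_M_INPUT + output_tokens * PRICE_PER_M_OUTPUT
    ) / 1_000_000


# One request: a 2,000-token prompt and a 500-token answer.
single = request_cost(2_000, 500)

# The same request repeated a million times in a day.
daily = single * 1_000_000

print(f"per request: ${single:.4f}")
print(f"per day at 1M requests: ${daily:,.0f}")
```

The single request costs a fraction of a cent, and the daily total is that same fraction multiplied by traffic. That multiplication is the recurring cost the training bill never had.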
Who should care? Anyone budgeting an AI product, because inference is the recurring cost that scales with users — for a successful product, total inference spend dwarfs what training cost the lab. Anyone debugging weird model behavior, because "is this a training problem or an inference problem?" is the first triage question. And anyone evaluating vendor claims, because "our model learns from your feedback" almost always means "we'll use your data in a future training run," not "the model updates live."
Classic software had no equivalent of this split: a program's logic was written by hand and executed as written. LLMs replaced hand-written logic with learned logic, which forced the build phase (training) and the run phase (inference) apart, with different hardware, different teams, and different economics.
How it works
Training: the expensive loop
Training a large language model is one loop, repeated trillions of times. Feed the model a chunk of text from the training data. Ask it to predict the next token. Measure how wrong it was (the loss). Then run backpropagation — an algorithm that traces the error backwards through the network and computes, for every single one of the billions of weights, which direction to nudge it to be slightly less wrong. An optimizer applies those nudges, and the loop starts over with the next chunk.
Each pass through the loop changes the model only microscopically, which is why training takes weeks or months on clusters of thousands of GPUs working in lockstep. The backward pass costs roughly twice the compute of the forward pass, so a full training step is about three times the work of a forward pass alone. The optimizer also needs extra memory for gradients and statistics, so training a model takes several times more GPU memory than merely running it. This is a big part of why LLMs need GPUs in such absurd quantities, and why training costs run so high.
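To put rough numbers on that memory gap, here is a back-of-the-envelope sketch. It assumes a 7-billion-parameter model and one common mixed-precision setup with an Adam-style optimizer; both are illustrative assumptions, and activation memory is left out entirely.

```python
def gib(num_bytes: float) -> float:
    """Convert bytes to gibibytes."""
    return num_bytes / 2**30


PARAMS = 7_000_000_000  # assumed model size, for illustration only

# Assumed bytes stored per parameter.
INFERENCE_BYTES = 2  # 16-bit weights only
TRAINING_BYTES = (
    2    # 16-bit weights
    + 2  # 16-bit gradients
    + 4  # 32-bit master copy of the weights
    + 4  # Adam first-moment estimate
    + 4  # Adam second-moment estimate
)

inference_gib = gib(PARAMS * INFERENCE_BYTES)
training_gib = gib(PARAMS * TRAINING_BYTES)

print(f"inference weights: {inference_gib:.1f} GiB")
print(f"training state:    {training_gib:.1f} GiB")
print(f"ratio:             {training_gib / inference_gib:.0f}x")
```

Under these assumptions the training state is eight times the size of the inference weights before counting activations. Other precision choices or optimizers would shift the exact ratio.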
Inference: the cheap-ish loop
Inference keeps only the first step. Your prompt goes in, the model runs a forward pass — one trip through the network, weights untouched — and out comes a probability for every possible next token. A sampler picks one, the token is appended to the text, and the model runs again to pick the token after that. Generation is this loop: one forward pass per token, until the model emits a stop token or hits a length limit.
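The generation loop can be sketched in a few lines of plain Python. Everything here is a toy: `toy_forward` is a hypothetical stand-in for a real model's forward pass, and the five-word vocabulary is invented for the example. Only the shape of the loop matters.

```python
import math
import random

VOCAB = ["the", "cat", "sat", "down", "<stop>"]  # toy vocabulary (assumption)


def toy_forward(tokens: list[str]) -> list[float]:
    """Stand-in for a forward pass: one score (logit) per vocabulary entry.

    A real model computes these from its frozen weights; this toy simply
    favours the entry whose index matches the current sequence length.
    """
    favoured = min(len(tokens), len(VOCAB) - 1)
    return [4.0 if i == favoured else 0.0 for i in range(len(VOCAB))]


def softmax(logits: list[float]) -> list[float]:
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def generate(prompt, max_new_tokens=10, greedy=True, seed=0):
    """Run one forward pass per new token until a stop token or the length limit."""
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = softmax(toy_forward(tokens))  # forward pass, nothing is updated
        if greedy:
            next_id = probs.index(max(probs))  # pick the most likely token
        else:
            next_id = rng.choices(range(len(VOCAB)), weights=probs)[0]
        token = VOCAB[next_id]
        if token == "<stop>":
            break
        tokens.append(token)  # the new token becomes part of the next input
    return tokens


print(generate(["the"]))
```

Passing `greedy=False` swaps the argmax for random sampling from the same probabilities, which is roughly what a temperature-based sampler does.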
No loss, no gradients, no optimizer. That's why a model that needed a 10,000-GPU cluster to train can answer questions on a handful of GPUs — or, for smaller open models, on a laptop. The flip side: because generation is one-token-at-a-time and every step must read all the weights from memory, inference speed is usually limited by memory bandwidth, not raw compute. That single fact drives most of modern inference engineering.
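The memory-bandwidth limit gives a quick ceiling on decode speed: if every token requires reading all the weights once, tokens per second cannot exceed bandwidth divided by model size. The sketch below assumes a 7-billion-parameter model in 16-bit precision and a round 1,000 GB/s of memory bandwidth; both numbers are illustrative, not measurements of any particular GPU.

```python
PARAMS = 7_000_000_000        # assumed model size
BYTES_PER_PARAM = 2           # 16-bit weights
MEMORY_BANDWIDTH_GBPS = 1000  # assumed memory bandwidth in GB/s (illustrative)

weight_bytes = PARAMS * BYTES_PER_PARAM


def max_tokens_per_second(weight_bytes: int, bandwidth_gb_per_s: float) -> float:
    """Upper bound for one decode stream: each token reads all weights once."""
    return bandwidth_gb_per_s * 1e9 / weight_bytes


ceiling = max_tokens_per_second(weight_bytes, MEMORY_BANDWIDTH_GBPS)
print(f"at most about {ceiling:.0f} tokens/second for a single stream")
```

This is a ceiling for a single unbatched stream and ignores the KV cache and compute time. Real throughput lands below it, and serving several requests together lets one read of the weights produce several tokens.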
The split, in a few lines of code
The cleanest way to see the difference is in PyTorch, where the two phases are literally different function calls. A training step does forward, backward, and update. An inference step does forward only, inside torch.no_grad(), which tells PyTorch not to track gradients at all:
```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # stand-in for a billion-parameter LLM
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# --- TRAINING STEP: forward + backward + weight update ---
x, y = torch.randn(32, 10), torch.randint(0, 2, (32,))
logits = model(x)          # forward pass
loss = loss_fn(logits, y)  # how wrong were we?
loss.backward()            # backprop: a gradient for EVERY weight
opt.step()                 # nudge all weights: the model just learned
opt.zero_grad()            # clear gradients before the next step

# --- INFERENCE STEP: forward only, weights frozen ---
model.eval()
with torch.no_grad():  # skip gradient tracking: less memory, no learning
    prediction = model(torch.randn(1, 10)).argmax()
```
Everything else — the trillion-token datasets, the GPU clusters, the serving infrastructure — is scaling and plumbing around these two code paths. loss.backward() plus opt.step() is learning. Its absence is inference.
Training vs inference: the cheat sheet
| | Training | Inference |
|---|---|---|
| What happens | Weights are repeatedly updated to reduce prediction error | Frozen weights transform a prompt into output tokens |
| How often | Once per model version (weeks to months) | Every single request, millions per day |
| Compute shape | Forward + backward pass + optimizer, huge batches | Forward pass only, one token at a time |
| Hardware | Thousands of tightly interconnected GPUs | One to a few GPUs per replica (or a laptop for small models) |
| Memory needs | Weights + gradients + optimizer state (several times the model size) | Weights + KV cache (model size plus conversation state) |
| Bottleneck | Total compute (FLOPs) and interconnect | Memory bandwidth and latency |
| Who pays | The lab, as a one-time capital expense | You, per token, on every API call |
| Done by | A handful of labs and well-funded teams | Everyone who uses AI, constantly |
One row deserves a flag: the KV cache. During a long conversation, the model stores the attention keys and values for every token of context so it doesn't recompute them on each step. For long chats this cache can rival the model weights in size, and it's the main reason long contexts make inference slower and more expensive. The KV cache article unpacks it.
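A rough size estimate shows how the cache grows. The sketch assumes a hypothetical model with 32 layers, 32 key/value heads, a head dimension of 128, and 16-bit cache entries; these figures are chosen for illustration and do not describe any specific model.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    bytes_per_value: int = 2,
) -> int:
    """Estimate KV cache size for one sequence.

    The leading 2 counts one key vector and one value vector,
    stored per token, per layer, per head.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Assumed configuration (hypothetical, for illustration only).
per_token = kv_cache_bytes(32, 32, 128, seq_len=1)
long_chat = kv_cache_bytes(32, 32, 128, seq_len=32_000)

print(f"per token:      {per_token / 2**20:.2f} MiB")
print(f"32,000 tokens:  {long_chat / 2**30:.3f} GiB")
```

Under these assumptions each token of context costs half a mebibyte, so a 32,000-token conversation needs more cache than a 7-billion-parameter model needs for its 16-bit weights. Models that share key/value heads across attention heads shrink this figure considerably.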
Going deeper
The line between the phases is blurrier than the cheat sheet admits. Fine-tuning is a second, small training phase run on top of a finished model — same backward pass, same optimizer, just fewer steps and less data. The pretraining vs post-training pipeline (instruction tuning, RLHF) is also training, which is why providers can honestly say your thumbs-down feedback "improves the model": it gets folded into the next training run, weeks later.
Inference itself has two distinct regimes. Prefill processes your whole prompt in one parallel pass — compute-bound, and the reason long prompts add latency before the first token appears. Decode then generates tokens one at a time — memory-bandwidth-bound, since each step re-reads all the weights to produce a single token. Production serving systems like vLLM and Hugging Face's TGI attack the decode problem with continuous batching (interleaving many users' decode steps so the GPU never idles) and PagedAttention (managing KV cache memory like an operating system manages RAM, so more concurrent conversations fit on one GPU).
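A toy simulation makes the benefit of continuous batching concrete. It is a sketch of the scheduling idea only, not how vLLM or TGI are implemented: it counts decode steps, ignores prefill and memory limits, and the request lengths are invented.

```python
from collections import deque


def static_batching_steps(lengths: list[int], batch_size: int) -> int:
    """Each batch runs until its longest request finishes; short ones sit idle."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        total += max(lengths[start:start + batch_size])
    return total


def continuous_batching_steps(lengths: list[int], batch_size: int) -> int:
    """A finished request frees its slot immediately for the next waiting one."""
    waiting = deque(lengths)
    active: list[int] = []  # remaining decode steps for each running request
    steps = 0
    while waiting or active:
        # Fill any free slots before the next decode step.
        while waiting and len(active) < batch_size:
            active.append(waiting.popleft())
        steps += 1
        # One decode step yields one token per active request;
        # requests with a single step left finish and drop out.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


# Output lengths (in tokens) for six hypothetical requests, two GPU slots.
lengths = [2, 10, 3, 9, 2, 8]
print("static:    ", static_batching_steps(lengths, batch_size=2))
print("continuous:", continuous_batching_steps(lengths, batch_size=2))
```

With these made-up lengths, static batching needs 27 decode steps and continuous batching needs 20, because short requests no longer hold a slot hostage while a long neighbour finishes.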
Squeezing the frozen model is its own discipline. Quantization stores weights in fewer bits — 8-bit or 4-bit instead of 16 — shrinking memory and speeding up decode with modest quality loss; this is how llama.cpp fits capable models onto laptops. Distillation goes further: a large "teacher" model generates outputs that train a small "student" to imitate it, trading a one-time training cost for permanently cheaper inference. Speculative decoding runs a small draft model ahead of the big one and lets the big model verify several drafted tokens in a single forward pass — same output distribution, fewer expensive steps.
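Quantization is the easiest of these to show in miniature. The sketch below uses the simplest scheme, symmetric 8-bit quantization with a single scale for the whole tensor. Production tools generally use finer-grained scales and lower bit widths, so treat this as an illustration of the idea rather than what llama.cpp actually does; the weight values are invented.

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale.

    Assumes at least one weight is nonzero.
    """
    scale = max(abs(w) for w in weights) / 127
    return [round(w / scale) for w in weights], scale


def dequantize(q_weights: list[int], scale: float) -> list[float]:
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in q_weights]


weights = [0.12, -0.5, 0.33, 0.9, -1.27]  # made-up example values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("stored integers:", q)
print(f"scale: {scale:.4f}, worst round-trip error: {max_error:.6f}")
```

Each weight now fits in one byte instead of two, at the price of a rounding error no larger than half the scale. That error is the "modest quality loss" in the trade.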
And the newest twist: spending more at inference on purpose. Reasoning models generate long hidden chains of thought before answering — deliberately burning extra inference compute to buy accuracy. This test-time compute trade is the mirror image of scaling laws, which only governed training: now labs choose between making the model bigger (training cost, paid once) and letting it think longer (inference cost, paid every request). The open production problem is routing — deciding per-request how much thinking a question actually deserves, because most questions don't need a hundred thousand tokens of deliberation, and at scale the inference bill, not the training bill, is what decides whether an AI product survives.
FAQ
Does ChatGPT learn from my conversations?
Not during the conversation. At inference time the model's weights are frozen — it adapts to your chat via the context window only, and that adaptation evaporates when the conversation ends. Providers may collect conversations (subject to their data settings) to use as data in a future training run, but that improves a later model version, not the one you're talking to.
Why is inference expensive if training is already paid for?
Because every request consumes GPU time. Generation runs one forward pass per output token, and each pass has to read the model's billions of weights from GPU memory. Multiply that by millions of users and inference becomes a recurring cost that, for any successful product, eventually exceeds what the original training run cost the lab.
Is fine-tuning training or inference?
Training. Fine-tuning runs the same loop — forward pass, loss, backpropagation, weight update — just on a much smaller dataset for far fewer steps, starting from finished weights. Anything that changes weights is training; anything that runs frozen weights is inference. In-context learning, despite the name, is inference.
Why does training need thousands of GPUs but inference only a few?
Training adds a backward pass and optimizer state, multiplying both compute and memory several times over, and it must churn through trillions of tokens in a reasonable timeframe — so the work is sharded across thousands of tightly networked GPUs. Inference only needs enough memory to hold the weights plus the KV cache, and enough speed to serve requests at acceptable latency.
What is an inference engine like vLLM?
Software that serves a trained model efficiently to many simultaneous users. Engines like vLLM, TGI, and llama.cpp don't change what the model knows — they optimize how the frozen forward pass executes, using techniques like continuous batching, PagedAttention for KV cache memory, quantization, and speculative decoding to get more tokens per second out of the same hardware.
What's the difference between prefill and decode in LLM inference?
Prefill processes your entire prompt in one parallel pass and determines the wait before the first token appears; it's limited by compute. Decode then generates output tokens one at a time, each step re-reading the full model weights; it's limited by memory bandwidth. That's why a long prompt delays the start of a response, while a long answer takes time token by token.