AI/TLDR

How Do LLMs Actually Work? Next-Token Prediction Explained

See the single mechanism — predicting the next token — that powers every answer an LLM has ever given you.

Beginner · 10 min read · Updated 2026-06-12

In plain English

A large language model has exactly one trick: given some text, it guesses what comes next. Not the next idea, not the next fact — the next token, a chunk of text roughly the size of a short word or word-piece. That's it. Everything ChatGPT, Claude, or Gemini has ever told you came out of that single repeated guess.

Think of the autocomplete on your phone. You type "running" and it suggests "late". An LLM is the same idea scaled up by a factor of millions: instead of one tiny suggestion, it has read a huge slice of the internet and learned, for any stretch of text, which token is most likely to follow. Type "The capital of France is" and the most probable next token is "Paris" — not because the model "knows geography" the way you do, but because in everything it read, that pattern showed up overwhelmingly.

Why it matters

Understanding next-token prediction isn't trivia — it's the single most useful mental model you can carry into every other AI topic. Once you see that an LLM is a probability machine, a lot of confusing behavior stops being mysterious.

  • Hallucinations make sense. The model produces the most plausible-sounding next token, not the true one. When plausible and true diverge, it confidently makes something up. That's the root of why LLMs hallucinate.
  • Randomness has a dial. The same prompt can give different answers because the model samples from a probability distribution. That dial is temperature.
  • Counting letters is hard. The model sees tokens, not letters, which is why it fumbles the "how many R's in strawberry" question.
  • Context is everything. The model only predicts based on the text in front of it, so giving it the right context — and staying inside the context window — is most of the job.

If you build, buy, or prompt AI for a living, this is the floor everything else stands on. Prompt engineering, RAG, agents, fine-tuning — they're all just clever ways of steering one relentless next-token guesser. Start with what an LLM is if you want the bird's-eye view first.

How it works

Let's trace one token end to end. You send a prompt; the model turns it into tokens, runs them through a giant neural network, gets a score for every possible next token, turns those scores into probabilities, and picks one. Then it does the whole thing again with your prompt plus the new token. The loop repeats until the model emits a special end-of-sequence token or hits a length limit.

Step 1: text becomes tokens

The model never sees raw letters. A tokenizer chops your text into tokens — sub-word chunks — and maps each to a number (a token ID). "Tokenization" might become ["Token", "ization"]. If this is new, read what is a token and tokens vs words. This matters because every limit and price you'll ever hit is counted in tokens, not words.
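To make the chopping concrete, here's a minimal sketch of a greedy longest-match tokenizer. The vocabulary and its IDs are invented for illustration; real tokenizers learn tens of thousands of pieces from data and may split the same text differently.

```python
# Toy vocabulary: token string -> token ID. These pieces and IDs are made up
# for illustration; a real tokenizer learns its vocabulary from data.
TOY_VOCAB = {"Token": 0, "ization": 1, "iz": 2, "ation": 3, " is": 4, " fun": 5}

def tokenize(text, vocab=TOY_VOCAB):
    """Greedy longest-match: at each position, take the longest known piece."""
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until one is in the vocab.
        for end in range(len(text), i, -1):
            if text[i:end] in vocab:
                pieces.append(text[i:end])
                i = end
                break
        else:
            # No piece in the toy vocabulary starts here.
            raise ValueError(f"no token covers {text[i]!r}")
    return pieces

pieces = tokenize("Tokenization is fun")
ids = [TOY_VOCAB[p] for p in pieces]
print(pieces)  # ['Token', 'ization', ' is', ' fun']
print(ids)     # [0, 1, 4, 5]
```

Notice that the spaces travel with the tokens and that the model downstream only ever receives the ID list, never the letters.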

Step 2: the network produces logits

The token IDs flow through the model — a stack of transformer layers using attention to let each token "look at" the others. The final layer outputs logits: one raw, unbounded score for every token in the vocabulary. Vocabularies are large — often on the order of 100,000+ tokens — so this is a list of 100,000+ numbers, one per candidate. A high logit means "this token fits well here."
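As a rough sketch of where logits come from: the network's output at the last position is a vector, and a common design scores each vocabulary token by taking that vector's dot product with a per-token weight vector. Every number below is invented, and the names `hidden` and `output_weights` are illustrative; real models use thousands of dimensions and a vocabulary of 100,000+ rows.

```python
# Toy sizes: a 3-dimensional hidden state and a 4-token vocabulary.
# Every number here is made up for illustration.
vocab = ["Paris", "London", "the", "banana"]
hidden = [1.0, 0.5, -1.0]  # final-layer output at the last position

# One weight vector per vocabulary token (the output projection).
output_weights = [
    [2.0, 1.0, -1.0],   # Paris
    [1.0, 1.0, 0.0],    # London
    [0.5, 0.0, 0.5],    # the
    [-1.0, 0.0, 1.0],   # banana
]

def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

# One logit per token: higher means "fits better here".
logits = [dot(hidden, w) for w in output_weights]
for tok, z in zip(vocab, logits):
    print(f"{tok:>7}: {z:+.1f}")
```

Here "Paris" gets the highest score (3.5) and "banana" the lowest (-2.0). Nothing is a probability yet; that's the next step.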

Step 3: softmax turns logits into probabilities

Logits aren't probabilities yet — they can be any number, positive or negative. The softmax function fixes that: it exponentiates each logit (so everything is positive) and divides by the total, producing a clean probability distribution that sums to exactly 1. A big logit becomes a big probability; the long tail of unlikely tokens splits the leftover sliver. Now the model can say things like "Paris: 92%, France: 3%, the: 1%, …"

Step 4: sample one token, then repeat

Finally the model picks a token from that distribution. The simplest rule, greedy decoding, always grabs the single highest-probability token — deterministic, but it tends to loop and sound flat. Most real systems instead sample: roll a weighted die so likely tokens win often but not always. The chosen token is appended to the input, and the entire loop runs again for the next one. This token-by-token, feed-the-output-back-in style is called autoregressive generation.
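The loop itself is tiny. Here's a minimal sketch with a stand-in "model" that is just a hand-written lookup table keyed on the last token. That table, the `<start>` and `<end>` markers, and the `generate` helper are all hypothetical; a real LLM conditions on the whole context and computes its probabilities with a neural network.

```python
import random

# A stand-in "model": next-token probabilities that depend only on the last
# token. This table is invented; a real LLM conditions on the whole context.
TOY_MODEL = {
    "<start>": {"The": 1.0},
    "The": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "dog": {"ran": 0.8, "sat": 0.2},
    "sat": {"<end>": 1.0},
    "ran": {"<end>": 1.0},
}

def generate(greedy=True, max_tokens=10, seed=0):
    """Autoregressive loop: predict, pick, append, repeat."""
    rng = random.Random(seed)
    tokens = ["<start>"]
    for _ in range(max_tokens):
        probs = TOY_MODEL[tokens[-1]]
        if greedy:
            # Greedy decoding: always take the most probable token.
            nxt = max(probs, key=probs.get)
        else:
            # Sampling: roll a weighted die over the candidates.
            nxt = rng.choices(list(probs), weights=list(probs.values()))[0]
        if nxt == "<end>":
            break               # the model "decided to stop"
        tokens.append(nxt)      # feed the output back in as input
    return tokens[1:]

print(generate(greedy=True))   # ['The', 'cat', 'sat']
print(generate(greedy=False))  # varies with the seed
```

Greedy always produces the same sentence; sampling can wander to "The dog ran" because the die sometimes lands on the less likely token.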

See it in code

You don't need a GPU to feel how this works. Here's the entire core of an LLM's output step — logits to softmax to a sampled token — in a few lines of NumPy. This is the real math, just with a toy 5-token vocabulary instead of 100,000.

next_token.py (Python)
import numpy as np

# A toy vocabulary and the model's raw scores (logits)
# for what should follow: "The capital of France is"
vocab  = ["Paris", "France", "the", "a", "London"]
logits = np.array([8.0, 3.0, 2.0, 1.5, 4.0])

def softmax(z, temperature=1.0):
    z = z / temperature          # temperature reshapes the curve
    e = np.exp(z - z.max())      # subtract max for numerical stability
    return e / e.sum()           # normalize so it sums to 1

probs = softmax(logits)
for tok, p in zip(vocab, probs):
    print(f"{tok:>7}: {p:6.1%}")

# Sample one token using the probabilities as weights
choice = np.random.choice(vocab, p=probs)
print("\npicked:", choice)

Run it and Paris wins the lion's share of the probability (about 97%), with London a distant second (under 2%). Crank temperature up toward 2.0 and the gap shrinks: the model gets more willing to gamble on London. Push it toward 0 and it all but always picks Paris. Don't pass exactly 0 to this function, though, since it would divide by zero; real systems typically treat temperature 0 as a special case meaning greedy decoding. That one parameter is the difference between a boring, repeatable assistant and a creative, occasionally-off-the-rails one. The same idea, with extra controls, becomes top-p vs top-k sampling.
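If you'd rather see the dial move than take it on faith, here's a small sweep over three temperatures using the same toy logits. It's a standard-library rewrite of the softmax above so it runs on its own; the `by_temp` name is just for illustration.

```python
import math

vocab = ["Paris", "France", "the", "a", "London"]
logits = [8.0, 3.0, 2.0, 1.5, 4.0]

def softmax(z, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in z]
    m = max(scaled)                            # subtract max for stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Probabilities at a low, neutral, and high temperature.
by_temp = {t: softmax(logits, t) for t in (0.5, 1.0, 2.0)}
for t, probs in by_temp.items():
    print(f"T={t}: Paris={probs[0]:.1%}, London={probs[4]:.1%}")
```

Paris holds roughly 97% at T=1.0 and drops to roughly 77% at T=2.0, while London climbs from about 2% to about 10%. At T=0.5 Paris takes nearly everything.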

Where the knowledge comes from

If the model is just predicting the next token, why does it seem to know things? Because of how it was trained. The training itself is next-token prediction, run on a staggering amount of text. To get good at guessing the next word across books, code, and conversations, the network is forced to absorb grammar, facts, and reasoning patterns into its parameters. There's no separate "knowledge database" — the knowledge is the learned weights that shape those probabilities.
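What does "training on next-token prediction" look like numerically? The article doesn't name a loss function, so treat this as an assumption: the usual choice is cross-entropy, which is simply the negative log of the probability the model gave to the token that actually came next. Here's a minimal sketch for a single position; the function name `next_token_loss` and the logits are illustrative.

```python
import math

def softmax(z):
    """Turn a list of logits into probabilities."""
    m = max(z)
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits, target_index):
    """Cross-entropy for one position: -log(probability of the true token)."""
    probs = softmax(logits)
    return -math.log(probs[target_index])

vocab = ["Paris", "France", "the", "a", "London"]
target = vocab.index("Paris")   # the token that really came next in the text

confident = [8.0, 3.0, 2.0, 1.5, 4.0]   # model strongly favors "Paris"
unsure = [1.0, 1.0, 1.0, 1.0, 1.0]      # model has no preference at all

loss_confident = next_token_loss(confident, target)
loss_unsure = next_token_loss(unsure, target)
print(f"confident: {loss_confident:.3f}")  # small loss
print(f"unsure:    {loss_unsure:.3f}")     # larger loss, equal to log(5)
```

Training nudges the weights to shrink this number across a vast number of positions, which is how the pressure to "know" that Paris follows the prompt ends up in the parameters.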

But raw next-token training only gives you a base model: a brilliant autocomplete that will happily continue your text without ever answering your question. Turning it into a helpful assistant takes two more stages on top of pretraining.

  • Pretraining — self-supervised next-token prediction on a huge corpus. This is where almost all the raw capability and world knowledge gets baked in. It's also where scaling laws kick in: more data and compute reliably buy more capability.
  • Instruction tuning — supervised fine-tuning on curated prompt → good answer pairs, which flips the model from "continue this text" to "answer this request."
  • Preference training — methods like RLHF nudge the model toward responses humans actually prefer: helpful, honest, and not toxic.

After all three stages, the underlying mechanism is unchanged — it's still predicting one token at a time. Tuning just reshapes the probabilities so the most likely next token is now a useful, well-mannered one.

The 2026 landscape

As of mid-2026, every mainstream chat model you can name — across the major labs — is a next-token predictor at heart. What's changed isn't the core mechanism; it's the scale and the plumbing around it. The frontier families and a headline fact each:

Model family (mid-2026) | Maker | Headline context window
Claude Opus / Sonnet 4.x | Anthropic | 200K standard, ~1M in beta
GPT-5.x | OpenAI | ~1M tokens
Gemini 2.5 Pro | Google | 1M tokens
Llama 4 Scout (open weights) | Meta | Up to 10M tokens

Two trends are reshaping the plumbing without touching the next-token idea. First, context windows have exploded: Meta's open-weight Llama 4 Scout advertises up to a 10M-token window, GPT and Gemini sit around the 1M mark, and Claude reaches ~1M in beta on top of its 200K standard window. Bigger windows mean the model can condition its next-token guess on far more text, though "lost in the middle" issues, where details buried mid-context get used less reliably, mean more isn't automatically better.

Second, mixture-of-experts (MoE) has gone mainstream: Llama 4 is natively MoE, activating only a slice of its parameters per token. It still predicts one token at a time — it just routes each prediction through a smaller, specialized subset of the network, which is cheaper to run. The mechanism in this article hasn't changed since the first GPT; the engineering around it has.
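To give a feel for "activating only a slice", here's a toy sketch of top-k expert routing. It is not Llama 4's implementation: the experts are stand-in scalar functions, the router scores are hand-picked, and the `route` helper is hypothetical. Real routers are learned layers, and the details differ between models.

```python
import math

# Four toy "experts". Real experts are neural-network blocks; these
# scalar functions are stand-ins invented for illustration.
EXPERTS = [
    lambda x: x + 1.0,
    lambda x: x * 2.0,
    lambda x: x - 3.0,
    lambda x: x * x,
]

def route(router_scores, x, top_k=2):
    """Run only the top_k highest-scoring experts and blend their outputs."""
    # Indices of the top_k experts by router score.
    chosen = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i], reverse=True)[:top_k]
    # Softmax over just the chosen experts' scores gives the mixing weights.
    m = max(router_scores[i] for i in chosen)
    exps = {i: math.exp(router_scores[i] - m) for i in chosen}
    total = sum(exps.values())
    weights = {i: e / total for i, e in exps.items()}
    # Only the chosen experts are evaluated; the rest are skipped entirely.
    output = sum(w * EXPERTS[i](x) for i, w in weights.items())
    return chosen, output

chosen, output = route(router_scores=[0.1, 2.0, -1.0, 2.0], x=3.0)
print(chosen, output)  # experts 1 and 3 run; the other two are never called
```

Half the experts sit idle for this input, which is where the compute savings come from. The result still feeds the same next-token machinery described above.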

Going deeper

A few subtleties that separate a working mental model from a precise one.

"Predicting the next token" is doing real reasoning

It's tempting to dismiss next-token prediction as shallow pattern-matching. But to do well at predicting the next token in "The murderer was, in the final twist, revealed to be the…", a model has to track the whole plot. Researchers in 2026 are actively studying how much lookahead and planning is implicitly happening inside a single forward pass; some work even reframes autoregressive models as energy-based models to explain why they appear to "plan" beyond the immediate token. Compressing the world well enough to predict it turns out to require something that looks a lot like understanding.

The bottleneck: one token at a time

Autoregression is inherently sequential: token N+1 can't be computed until token N exists. That's why generation feels like typing and why long answers take time. Each step is also a full pass through the network, which is a big reason LLMs need GPUs. It's also why speculative decoding and multi-token prediction are hot research areas. In speculative decoding, a small draft model proposes several tokens, the big model verifies them in one pass, and you get a speed-up without changing the output distribution.
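Here's a minimal sketch of the draft-and-verify idea in its simplest form, the greedy variant, where "same output" just means "same tokens the big model would have picked on its own". Both models are hypothetical stand-in functions, and a "pass" is only a counter. The sampling version uses an accept/reject rule to preserve the distribution, which is not shown here.

```python
TEXT = "the cat sat on the mat".split()

def target_next(context):
    """Stand-in for the big model: always continues TEXT correctly."""
    return TEXT[len(context)] if len(context) < len(TEXT) else "<end>"

def draft_next(context):
    """Stand-in for the small draft model: same, but wrong about 'on'."""
    tok = target_next(context)
    return "in" if tok == "on" else tok

def speculative_greedy(k=3):
    """Return (tokens, number of big-model verification passes)."""
    out, target_passes = [], 0
    while not out or out[-1] != "<end>":
        # 1. The draft model proposes k tokens, one at a time (cheap).
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(out + proposal))
        # 2. The target checks every proposed position. In a real system
        #    this is one batched forward pass, so it counts as one here.
        target_passes += 1
        base = list(out)
        for i, tok in enumerate(proposal):
            expected = target_next(base + proposal[:i])
            out.append(expected)   # on a match this equals the draft token
            if tok != expected or expected == "<end>":
                break              # discard the rest of the proposal
    return out[:-1], target_passes

tokens, passes = speculative_greedy(k=3)
print(" ".join(tokens), "| big-model passes:", passes)
```

With `k=3` the sentence comes out in 3 verification passes instead of the 7 that one-token-at-a-time decoding needs (6 tokens plus the stop token), even though the draft gets "on" wrong. The output is identical either way; only the number of expensive steps changes.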

Sampling order matters

In production serving, the order of operations is usually: compute logits, divide by temperature, apply top-p / top-k truncation, then sample, though the exact order can vary between serving stacks. Temperature reshapes the curve before you chop off the tail, so the two interact. Setting temperature to 0 collapses everything to greedy decoding, which, counterintuitively, can produce more repetition loops than moderate sampling on some models. The takeaway: these knobs are part of how the model "works," not afterthoughts.
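That pipeline fits in one function. This is a sketch under stated assumptions, not any particular library's sampler: the name `sample_next` and its parameters are made up, temperature 0 is special-cased as greedy, and top-p is measured on the probabilities from before the top-k cut (some implementations renormalize first).

```python
import math
import random

def sample_next(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Temperature -> top-k -> top-p -> sample. Returns a token index."""
    if temperature == 0:
        # Treat 0 as greedy decoding to avoid dividing by zero.
        return max(range(len(logits)), key=lambda i: logits[i])

    # 1. Temperature-scaled softmax.
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Candidate indices, most likely first.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # 2. Top-k: keep only the k most likely tokens.
    if top_k is not None:
        order = order[:top_k]

    # 3. Top-p: keep the smallest prefix whose probability reaches top_p.
    if top_p is not None:
        kept, cumulative = [], 0.0
        for i in order:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        order = kept

    # 4. Sample among the survivors; choices() renormalizes the weights.
    return rng.choices(order, weights=[probs[i] for i in order])[0]

vocab = ["Paris", "France", "the", "a", "London"]
logits = [8.0, 3.0, 2.0, 1.5, 4.0]
idx = sample_next(logits, temperature=0.8, top_k=3, top_p=0.95,
                  rng=random.Random(0))
print(vocab[idx])  # Paris: it alone exceeds the 0.95 top-p cutoff
```

Raise the temperature to 2.0 with the same `top_p` and Paris no longer clears 0.95 by itself, so other tokens survive the cut. That's the interaction between the knobs in action.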

FAQ

Is ChatGPT just autocomplete?

Mechanically, yes — it predicts the next token one at a time, just like phone autocomplete. The difference is scale: it was trained on a huge slice of human text with billions of parameters, then tuned to follow instructions, so its 'autocomplete' includes grammar, reasoning, and a lot of factual recall. 'Just' undersells what next-token prediction at that scale can do.

How does an LLM generate text step by step?

It tokenizes your prompt, runs it through the network to get a logit (raw score) for every possible next token, applies softmax to turn those scores into probabilities, samples one token, appends it to the text, and repeats the whole loop. This token-by-token feedback process is called autoregressive generation.

Why do LLMs give different answers to the same question?

Because they sample from a probability distribution instead of always picking the single most likely token. The temperature setting controls how random that sampling is. Set temperature to 0 and you get near-deterministic 'greedy' output; raise it and the model takes more chances, producing varied (and sometimes more creative or more wrong) answers.

If an LLM only predicts the next word, how does it know facts?

The facts are baked into its parameters during pretraining. To get good at predicting the next token across a huge slice of the internet, the network had to internalize grammar, world facts, and reasoning patterns. There's no lookup database; the "knowledge" is the learned weights that shape each next-token probability, which is also why it can confidently state things that are wrong.

What is a logit in an LLM?

A logit is the raw, unbounded score the model assigns to each possible next token before any normalization. There's one logit per token in the vocabulary (often 100,000+ of them). Softmax converts the full list of logits into probabilities that sum to 1, and the model samples its next token from that distribution.

What does temperature do when an LLM generates text?

Temperature divides the logits before softmax. A value below 1 sharpens the distribution so the top token dominates (more focused, repeatable). A value above 1 flattens it so less likely tokens get a real chance (more diverse, riskier). Temperature 0 is effectively greedy decoding — always the single most probable token.

Further reading