AI/TLDR

What Is Next-Token Prediction? How LLMs Actually Generate Text

The one idea behind every LLM: predict the next token, append it, and repeat.

BEGINNER · 9 MIN READ · UPDATED 2026-06-12

In plain English

Next-token prediction is the single trick behind every large language model. Given everything written so far, the model guesses what small chunk of text comes next, writes it down, and then asks the same question again with that new chunk added. That is it. There is no deeper magic layered on top.

Think of the phone keyboard that suggests the next word as you type. You write "I'll be there in five" and it offers "minutes". A large language model is that idea taken to an extreme: it is trained on a vast amount of text, predicts tokens (word-pieces) rather than whole words, and does it well enough that the suggestions chain together into essays, code, and answers.

Why it matters

If you understand next-token prediction, the rest of how an LLM behaves stops being mysterious. Lots of confusing things suddenly make sense:

  • Why models sometimes make things up. The model is choosing a plausible next token, not looking up a verified fact. A fluent-sounding wrong answer is, to the model, just a high-probability continuation. This is the root of why LLMs hallucinate.
  • Why prompting works. Everything you type becomes the context the model conditions its prediction on. Better context steers the probabilities toward better next tokens.
  • Why output appears word by word. The model genuinely produces one token at a time and feeds each one back in, which is why streamed responses arrive in a trickle.
  • Why models have no built-in memory. Each prediction only sees the text currently in the context window. Nothing outside it exists for the model.

Next-token prediction is the foundation that the whole picture of how an LLM works is built on. Master this one idea and concepts like temperature, hallucination, and context engineering all click into place faster.

How next-token prediction works

Start with a quick recap. Before any prediction happens, your text is split into tokens (numbered chunks; a token is roughly 3 to 4 characters of English on average, so a token is often a short word or a word-piece). The model never sees letters or words directly. It only ever works with these token IDs. See what is a token for the full story.
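
To make "text becomes token IDs" concrete, here is a toy sketch. The vocabulary `TOY_VOCAB` and the longest-match rule are assumptions chosen for readability; real tokenizers learn their vocabularies from data and use more involved algorithms, so treat this only as an illustration of the input/output shape.

```python
# Hypothetical toy vocabulary; real vocabularies are learned and far larger.
TOY_VOCAB = ["The", " sky", " is", " un", "believ", "ably", " blue"]
TOKEN_ID = {piece: i for i, piece in enumerate(TOY_VOCAB)}


def tokenize(text):
    """Split text into token IDs by repeatedly taking the longest matching piece."""
    ids = []
    pos = 0
    while pos < len(text):
        # Longest vocabulary piece that matches at the current position.
        match = max(
            (piece for piece in TOY_VOCAB if text.startswith(piece, pos)),
            key=len,
            default=None,
        )
        if match is None:
            raise ValueError(f"no token covers {text[pos:]!r}")
        ids.append(TOKEN_ID[match])
        pos += len(match)
    return ids


ids = tokenize("The sky is unbelievably blue")
pieces = [TOY_VOCAB[i] for i in ids]
print(ids)
print(pieces)
```

Note how the long word is split into several word-pieces while the short words each map to a single token. The model only ever receives the list of integers.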

A probability distribution over the whole vocabulary

Here is the key idea most people get wrong: the model does not output one next token. At every step, it outputs a score for every single token in its vocabulary at once (a vocabulary is typically tens of thousands to a few hundred thousand tokens). Those raw scores are called logits. A function called softmax turns the whole pile of logits into a clean probability distribution: a list of numbers that are all positive and add up to 1.
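
A minimal sketch of softmax in plain Python. The four logit values are made up for illustration (a real model would produce one logit per vocabulary token), and subtracting the maximum before exponentiating is a common numerical-stability trick rather than something specific to any one model.

```python
import math


def softmax(logits):
    """Turn raw scores into positive numbers that sum to 1."""
    peak = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for four candidate tokens.
logits = [5.0, 1.6, 0.5, -0.4]
probs = softmax(logits)

print([round(p, 3) for p in probs])
print(sum(probs))
```

The largest logit ends up with the largest probability, and the gaps between logits, not their absolute values, decide how the probability is shared out.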

So after the prompt "The capital of France is", the model's output might look like the table below: a probability assigned to every candidate token, with the obvious answer getting most of the weight.

Candidate next token            Probability
Paris                           0.91
the                             0.03
a                               0.01
Lyon                            0.004
(every other token in vocab)    the remaining ~0.046

A picker then chooses one token from this distribution. Always taking the single highest-probability token is called greedy decoding; rolling a weighted die instead is sampling. That choice is its own topic, covered in greedy decoding vs sampling and temperature. For now just remember: predict a distribution, then pick one token from it.
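
A small sketch of the two pickers, using the probabilities from the "The capital of France is" example. Lumping every other vocabulary token into a single `<other>` bucket is an assumption made to keep the example short, and `greedy_pick` and `sample_pick` are illustrative names, not a library API.

```python
import random

candidates = ["Paris", "the", "a", "Lyon", "<other>"]
probs = [0.91, 0.03, 0.01, 0.004, 0.046]


def greedy_pick(tokens, probs):
    """Always return the single highest-probability token."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return tokens[best]


def sample_pick(tokens, probs, rng):
    """Roll a weighted die: each token is chosen in proportion to its probability."""
    return rng.choices(tokens, weights=probs, k=1)[0]


rng = random.Random(0)  # fixed seed so the run is repeatable
greedy = greedy_pick(candidates, probs)
samples = [sample_pick(candidates, probs, rng) for _ in range(1000)]
paris_share = samples.count("Paris") / len(samples)

print(greedy)
print(paris_share)
```

Greedy decoding returns the same token every time, while sampling returns "Paris" most of the time and occasionally something else. That occasional variation is what makes sampled output differ from run to run.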

The predict-append-repeat loop

Generation is that single step run over and over. The chosen token is glued onto the end of the context, and the whole thing is fed back in to predict the next one. This loop is what the sibling article on autoregressive generation covers in depth, but the shape of it is simple: predict a distribution, pick one token, append it to the context, and repeat.

The loop ends when the model predicts a special end-of-sequence token, or when it hits a length limit you set. Every word you see streaming out of a chatbot is one trip around this cycle.

Where the ability comes from: the training objective

The model only knows which tokens are likely because it was trained to. During pretraining it reads an enormous amount of text and plays one repetitive game: cover up the next token, try to predict it, and check the answer. Crucially, the correct answer is already sitting right there in the text. Nobody has to label anything by hand. This is why pretraining is called self-supervised: the data labels itself.

That comparison is measured by a loss function, almost always cross-entropy loss. In plain terms: the model is penalized in proportion to how surprised it was by the real next token. If it had given the true token a high probability, the penalty is tiny; if it gave it a near-zero probability, the penalty is large. Training repeats this across trillions of tokens, each time slightly adjusting the model's internal numbers (its weights) to make the real continuations more likely. There is no separate "understand the world" step bolted on; squeezing this loss down across a huge corpus is the entire learning signal.
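
For a single prediction, cross-entropy loss reduces to the negative log of the probability the model gave to the token that actually came next. A quick sketch with two illustrative probabilities (the values 0.91 and 0.004 are borrowed from the earlier example table, not from a real training run):

```python
import math


def cross_entropy(prob_of_true_token):
    """Loss for one prediction: the negative log-probability of the real next token."""
    return -math.log(prob_of_true_token)


confident = cross_entropy(0.91)   # model gave the real token a high probability
surprised = cross_entropy(0.004)  # model gave the real token a near-zero probability

print(round(confident, 3))
print(round(surprised, 3))
```

The first loss is close to zero and the second is dozens of times larger, which is the "penalized for being surprised" behavior described above. Training averages this quantity over many tokens and nudges the weights to reduce it.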

A worked example

Here is the loop written out for the prompt "The sky is". Each row is one trip around the cycle: predict a distribution, pick the top token, append it, go again.

Step  Context so far            Top predicted token
1     The sky is                blue
2     The sky is blue           today
3     The sky is blue today     .
4     The sky is blue today.    (end of sequence)

Notice that the model never planned the sentence in advance. "blue" was simply the most probable token after "The sky is", and once it was committed, "today" became a natural continuation of that. Coherent sentences emerge from a chain of locally good guesses, not from a master outline.

The same loop in code is short. This pseudocode shows the structure (real APIs hide it inside a single call, but underneath this is what runs):

The next-token loop, simplified (Python-style pseudocode):

context = tokenize("The sky is")      # list of token IDs
MAX_NEW_TOKENS = 50                   # the length limit you set

for _ in range(MAX_NEW_TOKENS):
    logits = model(context)           # vocab-sized scores at each position
    probs = softmax(logits[-1])       # distribution for the NEXT token
    next_token = pick(probs)          # greedy or sampling
    if next_token == END_OF_SEQUENCE:
        break                         # the model chose to stop
    context.append(next_token)        # predict-append-repeat

print(detokenize(context))
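
For readers who want something that actually runs, here is one way to fill in the blanks of the pseudocode above. Everything model-related is a stand-in: `toy_model` is a hypothetical lookup table that only looks at the last token, and the seven-entry `VOCAB` is invented for this example. A real LLM conditions on the whole context and has a vocabulary thousands of times larger.

```python
import math

VOCAB = ["The", " sky", " is", " blue", " today", ".", "<eos>"]
EOS = VOCAB.index("<eos>")

# Hypothetical stand-in for trained weights: last token ID -> favored next token ID.
FAVORED_NEXT = {2: 3, 3: 4, 4: 5, 5: EOS}


def toy_model(context):
    """Return one logit per vocabulary entry for the next token."""
    favored = FAVORED_NEXT.get(context[-1], EOS)
    return [4.0 if i == favored else 0.0 for i in range(len(VOCAB))]


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def generate(prompt_ids, max_new_tokens=10):
    context = list(prompt_ids)
    for _ in range(max_new_tokens):
        probs = softmax(toy_model(context))
        next_token = max(range(len(probs)), key=probs.__getitem__)  # greedy pick
        if next_token == EOS:
            break
        context.append(next_token)  # predict-append-repeat
    return context


prompt = [0, 1, 2]  # token IDs for "The sky is"
output_ids = generate(prompt)
text = "".join(VOCAB[t] for t in output_ids)
print(text)
```

The run reproduces the four steps of the worked example: three tokens are appended one at a time, and the fourth prediction is the end-of-sequence token, which stops the loop.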

Why such a simple objective yields complex ability

It feels like a cheat that "guess the next word" could lead to writing code or solving puzzles. The resolution is this: to predict the next token really well across all of human text, the model has no choice but to learn the patterns that produce that text.

  • To finish "The opposite of hot is" you must learn antonyms.
  • To finish "2 + 2 =" correctly you must pick up some arithmetic.
  • To finish a half-written function you must absorb the syntax and logic of programming.
  • To finish "In 1969, humans first landed on the" you need some world knowledge.

None of these were taught as separate lessons. They are all just side effects of getting better at one objective. When a single simple goal, pursued at massive scale, produces abilities nobody explicitly programmed, researchers call these emergent abilities. This is also why bigger models trained on more data tend to be more capable: more capacity means more room to internalize the patterns, a relationship studied as scaling laws.

Going deeper

Training predicts every position at once; generation does not

During generation the loop is strictly one token at a time, because each new token depends on everything before it, including the token just produced. During training, though, the model can predict the next token for every position in a sentence simultaneously, in a single pass. The full target text is already known, so there is no need to wait. This parallelism is a big reason transformers are so efficient to train: a single training pass yields a prediction, and a learning signal, for every position at once, whereas generation gets only one new token per pass.
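
One way to picture this: because the whole text is known during training, the targets are just the inputs shifted by one position, so every position has a ready-made answer. The sketch below uses a word-level split as a stand-in for a real tokenizer (an assumption for readability).

```python
tokens = "The sky is blue today .".split()

inputs = tokens[:-1]   # what the model sees at each position
targets = tokens[1:]   # the token it should predict at that position

# Position i may only use inputs[0..i] and is scored against targets[i].
training_pairs = [(inputs[: i + 1], targets[i]) for i in range(len(inputs))]

for context, target in training_pairs:
    print(" ".join(context), "->", target)
```

In a real transformer all of these rows are computed in one forward pass, with a causal mask ensuring each position only sees earlier tokens. The Python loop here only lists them; it does not model the parallelism itself.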

Teacher forcing and exposure bias

Because training feeds the model the real previous tokens (not its own guesses) when predicting each next one, the technique is called teacher forcing. The catch: at generation time the model must instead feed on its own outputs, mistakes included. A single early wrong token can nudge the context into a worse region and compound. This gap between training conditions and generation conditions is known as exposure bias, and it partly explains why long generations can drift off the rails.
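
A toy illustration of the gap. The "model" here is a hypothetical lookup table from the previous word to a predicted next word, with one deliberate mistake planted in it. Real models condition on the whole context rather than one word, so this is only a sketch of how a single error behaves under the two regimes.

```python
# Hypothetical toy model: previous word -> predicted next word.
toy_model = {
    "the": "sky",
    "sky": "was",       # the one deliberate mistake (the real text says "is")
    "is": "blue",
    "blue": "today",
    "was": "falling",   # a plausible continuation of the wrong branch
    "falling": "down",
}

real_text = ["the", "sky", "is", "blue", "today"]
targets = real_text[1:]


def teacher_forced(real):
    """Each prediction is conditioned on the REAL previous word."""
    return [toy_model[prev] for prev in real[:-1]]


def free_running(first_word, steps):
    """Each prediction is conditioned on the model's OWN previous output."""
    out = [first_word]
    for _ in range(steps):
        out.append(toy_model[out[-1]])
    return out[1:]


forced = teacher_forced(real_text)
free = free_running("the", len(targets))

forced_errors = sum(p != t for p, t in zip(forced, targets))
free_errors = sum(p != t for p, t in zip(free, targets))

print(forced, forced_errors)
print(free, free_errors)
```

With teacher forcing the mistake costs one wrong prediction, because the next step is fed the correct word again. In free-running mode the same mistake sends every later prediction down the wrong branch.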

Perplexity: scoring next-token prediction

How good is a model at next-token prediction? The classic metric is perplexity, derived directly from the cross-entropy loss: it is the exponential of the average per-token loss. Loosely, it measures how surprised the model is by real text; you can read it as the effective number of equally likely tokens it was choosing between. Lower perplexity means the model assigned higher probability to the text that actually occurred. It is the most direct measure of the very objective the model was trained on, distinct from downstream benchmarks that test whole-task ability.
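
A short sketch of the calculation, assuming you already have the probability the model assigned to each real next token. The probability lists are invented for illustration.

```python
import math


def perplexity(true_token_probs):
    """exp of the average negative log-probability given to the real tokens."""
    avg_loss = -sum(math.log(p) for p in true_token_probs) / len(true_token_probs)
    return math.exp(avg_loss)


# A model that spreads its bets evenly over 50 tokens at every step.
uniform = perplexity([1 / 50] * 10)

# A model that usually gives the real token a high probability.
sharp = perplexity([0.9, 0.8, 0.95])

print(round(uniform, 3))
print(round(sharp, 3))
```

The uniform guesser scores a perplexity of 50, matching the "effective number of equally likely tokens" reading, while the sharper model scores close to 1, the best possible value.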

Pretraining is only step one

Pure next-token prediction on raw text gives you a model that continues text but does not reliably follow instructions or stay helpful. Turning a raw predictor into a usable assistant takes further stages (instruction fine-tuning and preference training such as RLHF) layered on top. But even after all that polish, every modern frontier model, from the GPT, Claude, and Gemini families to open-weight models, is still, at its core, a next-token predictor running this same loop.

FAQ

What is next-token prediction in simple terms?

It is the core task every LLM performs: given all the text so far, the model predicts the most likely next token (a word or word-piece), adds it to the text, and repeats. Chaining these guesses produces fluent sentences, code, and answers.

Does the model only predict one next token?

No. At each step it produces a probability for every token in its vocabulary at once, forming a full distribution. A picker (greedy or sampling) then selects one token from that distribution to actually output.

Why is next-token prediction called self-supervised?

Because the correct answer is already present in the training text itself. To make an example, you just hide the next token and the rest of the text becomes the label. No human has to annotate anything, so it scales to trillions of tokens.

How can predicting the next token lead to reasoning and knowledge?

To predict the next token well across all human text, the model is forced to internalize grammar, facts, arithmetic, and code patterns. Those abilities emerge as side effects of minimizing one objective at massive scale.

Is next-token prediction the same as autoregressive generation?

They are two views of the same idea. Next-token prediction is the per-step objective; autoregressive generation is the loop of feeding each predicted token back in to predict the next. See the autoregressive generation article for the loop in detail.

Why do LLMs hallucinate if they predict the next token?

The objective rewards plausible-sounding continuations, not verified facts. A confident but wrong answer can be a high-probability continuation, so the model will produce it. That is the structural reason hallucinations happen.

Further reading