
What Is Autoregressive Generation? How LLMs Write One Token at a Time

How LLMs generate text strictly left-to-right, one token at a time, by feeding each output back in as the next input.

INTERMEDIATE | 9 MIN READ | UPDATED 2026-06-12

In plain English

Autoregressive generation is how almost every large language model writes: it produces text one token at a time, and every token it writes gets fed right back in as part of the input for the next token. The word breaks down nicely: auto means self, and regressive comes from regression, the statistical term for predicting a value from other values. So the model literally regresses on its own output, conditioning each new step on everything it has said so far.

Picture someone telling a story they're making up on the spot. They say one word, hear it out loud, and that word shapes the next word they choose. They can never un-say a word or skip ahead to the ending and fill in the middle later. They move strictly forward, left to right, each word committing them a little further. That is exactly what an LLM does, just with sub-word tokens instead of whole words.

Why it matters

Autoregression is the single design choice that explains the most surprising things about how LLMs behave in practice, and understanding it changes how you build with them.

  • It explains the latency. Because each token depends on the one before it, the model genuinely cannot compute your 200-token answer in one shot. It must take 200 sequential steps. This is why long answers take longer, and why streaming responses appear word by word rather than all at once.
  • It explains cost shape. Input tokens and output tokens are priced and processed differently. The whole prompt can be read in parallel, but every output token costs its own sequential step, which is why output tokens are usually more expensive. See LLM API pricing.
  • It explains why early mistakes snowball. Once the model commits to a wrong token, that token becomes part of the input for everything after it. A small slip early on can derail a whole answer, a phenomenon closely tied to why LLMs hallucinate.
  • It explains streaming. LLM streaming exists precisely because tokens are produced one at a time, so the API can hand you each token the instant it's ready.

How autoregressive generation works

At each step the model runs a forward pass through a transformer, producing a score for every token in its vocabulary. Those raw scores are called logits, and a softmax turns them into a probability distribution. A sampling rule (greedy, or temperature/top-p sampling) picks one token from that distribution. That chosen token is then appended to the input, and the whole thing runs again. The output of step N becomes part of the input to step N+1. That feedback loop is the entire mechanism.
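The step from logits to a chosen token can be sketched in a few lines. This is a minimal illustration using only the standard library; the four-token vocabulary and the logit values are made up, and top-p sampling is left out for brevity.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert raw logits into probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(logits):
    """Greedy decoding: always take the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])

def sample(logits, temperature=1.0, rng=random):
    """Temperature sampling: draw a token index in proportion to its probability."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical 4-token vocabulary and made-up logits for a single step.
vocab = ["mat", "floor", "roof", "moon"]
logits = [3.0, 2.0, 0.5, -1.0]

print(vocab[greedy(logits)])                                   # highest logit wins
print([round(p, 3) for p in softmax(logits, temperature=0.7)])  # full distribution
print(vocab[sample(logits, temperature=0.7, rng=random.Random(0))])
```

Greedy always returns "mat" here because it has the largest logit. Sampling can return any of the four tokens, with lower temperatures concentrating more probability on the top choice.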

In real inference engines this loop is split into two distinct phases with very different performance profiles: prefill and decode.

Prefill vs decode

Prefill processes your entire prompt in parallel. Because all the prompt tokens already exist, the model can compute attention over all of them in one big batched pass on the GPU. This phase is compute-bound and fast per token. Decode is the generation phase: it produces new tokens one at a time, each requiring its own forward pass. Decode is inherently sequential, and each step handles a single token yet still has to read the model weights and the cache from memory, so it's memory-bandwidth-bound and dominates the latency of long answers.

| Property | Prefill | Decode |
| --- | --- | --- |
| Tokens handled | All prompt tokens at once | One new token per step |
| Parallelism | Highly parallel | Strictly sequential |
| Bottleneck | Compute (matmuls) | Memory bandwidth |
| Drives | Time to first token | Time between tokens |
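To see how the two phases add up, here is a rough latency estimate. The function name and the throughput figures are hypothetical, chosen only to show the shape of the calculation; real engines have additional overheads.

```python
def estimate_latency_s(prompt_tokens, output_tokens,
                       prefill_tokens_per_s, decode_tokens_per_s):
    """Rough end-to-end latency: one parallel prefill, then sequential decode."""
    time_to_first_token = prompt_tokens / prefill_tokens_per_s
    # The first output token comes out of prefill; each remaining token
    # needs its own decode step.
    decode_time = max(output_tokens - 1, 0) / decode_tokens_per_s
    return time_to_first_token, time_to_first_token + decode_time

# Hypothetical throughput figures, for illustration only.
ttft, total = estimate_latency_s(
    prompt_tokens=2000, output_tokens=201,
    prefill_tokens_per_s=10_000, decode_tokens_per_s=50,
)
print(f"time to first token: {ttft:.2f}s, total: {total:.2f}s")
# prints "time to first token: 0.20s, total: 4.20s"
```

Under these assumed numbers, a prompt ten times longer than the answer still accounts for only a small share of the total time, because decode pays per token.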

The KV cache: making decode affordable

Here's the problem the KV cache solves. Attention works by comparing the current token's query vector against a key vector for every prior token, then using those scores to mix the corresponding value vectors. Naively, at each decode step the model would recompute the keys and values for the entire sequence from scratch. Across a full answer that's repeated work that grows quadratically (roughly O(n²) in sequence length), which would make generation painfully slow.

The insight: for any token that's already in the sequence, its key and value vectors never change on later steps. So compute them once and store them. That store is the KV cache. On each new decode step the model computes only the new token's query, key, and value, appends the new key/value to the cache, and attends against everything cached. The key/value projection work per step becomes constant instead of growing with the sequence, cutting the total key/value computation over a full answer from O(n²) to O(n). The attention read itself still scans the whole cache, so that part of each step keeps growing linearly with sequence length.
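The idea can be checked on a toy single-head attention layer. This is a sketch under simplifying assumptions: the 2-d projection matrices and embeddings are made up, and a real model has many layers and heads. It compares recomputing every key/value on each step with appending to a cache, and counts the key/value projections each approach performs.

```python
import math

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def attend(q, keys, values):
    """Softmax attention of one query over all keys and values."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]

# Made-up projection matrices and token embeddings (assumptions for the demo).
W_Q = [[1.0, 0.0], [0.5, 1.0]]
W_K = [[0.0, 1.0], [1.0, 0.5]]
W_V = [[1.0, 1.0], [0.0, 2.0]]
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -1.0]]

def last_output_naive(xs, counter):
    """Recompute K and V for every token in the sequence on each step."""
    keys = [matvec(W_K, x) for x in xs]
    values = [matvec(W_V, x) for x in xs]
    counter["kv_projections"] += len(xs)
    return attend(matvec(W_Q, xs[-1]), keys, values)

def last_output_cached(x_new, cache, counter):
    """Project only the new token, append it to the cache, attend over the cache."""
    cache["keys"].append(matvec(W_K, x_new))
    cache["values"].append(matvec(W_V, x_new))
    counter["kv_projections"] += 1
    return attend(matvec(W_Q, x_new), cache["keys"], cache["values"])

naive_count = {"kv_projections": 0}
cached_count = {"kv_projections": 0}
cache = {"keys": [], "values": []}
for t in range(1, len(embeddings) + 1):
    naive = last_output_naive(embeddings[:t], naive_count)
    cached = last_output_cached(embeddings[t - 1], cache, cached_count)
    # Both paths must give the same attention output for the newest token.
    assert all(abs(a - b) < 1e-12 for a, b in zip(naive, cached))

print(naive_count["kv_projections"], cached_count["kv_projections"])  # prints "10 4"
```

For four tokens the naive path does 1 + 2 + 3 + 4 = 10 projections while the cached path does 4. Note that `attend` still loops over every cached entry, so the cache removes the recomputation but not the scan.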

A worked example

Let's watch the loop run on a tiny prompt. Suppose the prompt is "The cat sat on the" and the model generates three tokens. The first row is prefill and each later row is one decode step; the new token is appended and becomes input for the next step.

| Step | Input the model sees | Top prediction | New cache entry |
| --- | --- | --- | --- |
| Prefill | The cat sat on the | mat | K,V for 5 prompt tokens |
| Decode 1 | The cat sat on the mat | and | K,V for 'mat' |
| Decode 2 | The cat sat on the mat and | fell | K,V for 'and' |
| Decode 3 | The cat sat on the mat and fell | <eos> | K,V for 'fell' |

Notice three things. The input grows by exactly one token each step. The model never revisits or rewrites earlier tokens. And generation stops when it emits a special end-of-sequence token (or hits your max-tokens limit). Below is a minimal pseudo-implementation of the same loop.

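autoregressive_loop.py

```python
# Pseudocode: `model`, `tokenizer`, `sample`, `EOS`, and `max_new_tokens`
# are placeholders for whatever your inference stack provides.
tokens = tokenizer.encode("The cat sat on the")
cache = None                # KV cache: empty at first
new_tokens = list(tokens)   # step 0 feeds the whole prompt (prefill)

for step in range(max_new_tokens):
    # Forward pass over only the tokens not yet in the cache.
    # Step 0 is prefill (all prompt tokens); later steps are decode (one token).
    logits, cache = model.forward(new_tokens, kv_cache=cache)

    # Look only at the LAST position: that's the next-token prediction.
    next_logits = logits[-1]

    # Sampling rule turns logits into one chosen token.
    next_token = sample(next_logits, temperature=0.7)

    if next_token == EOS:
        break

    tokens.append(next_token)    # output becomes next input
    new_tokens = [next_token]    # only this token is missing from the cache

print(tokenizer.decode(tokens))
```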

Autoregressive generation vs non-autoregressive and diffusion alternatives

Autoregression's strength (each token sees all prior tokens) is also its bottleneck (steps can't run in parallel). Researchers have long chased ways to break the sequential chain so that hardware can fill in multiple tokens at once.

Non-autoregressive (NAR) generation tries to predict many output positions simultaneously rather than left-to-right. The catch is that, without seeing what it just wrote, the model can produce locally fluent but globally inconsistent text. Diffusion language models take a different route: they start from a sequence of masked or noised tokens and iteratively refine the whole sequence over several denoising steps, in principle updating many positions per step. Google DeepMind's experimental Gemini Diffusion, shown at Google I/O in May 2025, is a high-profile example, pitched as generating text several times faster than a comparable autoregressive model by refining noise step by step rather than emitting one token at a time.

Going deeper

Why decode is memory-bound, not compute-bound

During decode the model processes a single new token but must load the entire set of model weights (and read the growing KV cache) from GPU memory for each step. The arithmetic per step is tiny relative to the bytes moved, so the GPU's compute units sit idle waiting on memory. That's why decode is memory-bandwidth-bound: you're limited by how fast you can move data, not how fast you can multiply. It's also why batching many requests together helps so much: you amortize one weight-load across many sequences, raising GPU utilization. This is part of why LLMs need GPUs with high memory bandwidth, not just high raw FLOPs.
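A back-of-the-envelope bound makes this concrete. The sketch below assumes every decode step streams all weights from memory exactly once and ignores KV-cache reads and compute time, so it gives an upper limit rather than a prediction. The model size, precision, and bandwidth are hypothetical.

```python
def max_decode_tokens_per_s(param_count, bytes_per_param,
                            memory_bandwidth_bytes_per_s, batch_size=1):
    """Upper bound on decode speed when each step streams all weights once.

    Ignores KV-cache reads and compute time, so real throughput is lower.
    """
    weight_bytes = param_count * bytes_per_param
    steps_per_s = memory_bandwidth_bytes_per_s / weight_bytes
    # One weight load serves every sequence in the batch.
    return steps_per_s * batch_size

# Hypothetical setup: 7B parameters, 2 bytes each, 1 TB/s of memory bandwidth.
single = max_decode_tokens_per_s(7e9, 2, 1e12)
batched = max_decode_tokens_per_s(7e9, 2, 1e12, batch_size=8)
print(f"{single:.1f} tokens/s alone, {batched:.1f} tokens/s across a batch of 8")
# prints "71.4 tokens/s alone, 571.4 tokens/s across a batch of 8"
```

The linear gain from batching holds only while the step stays memory-bound. With a large enough batch, compute or KV-cache traffic becomes the limit instead.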

Speculative decoding: cheating the sequential limit

One clever way to speed up the sequential decode loop without abandoning autoregression is speculative decoding. A small, fast 'draft' model proposes several tokens ahead; the large target model then verifies all of them in a single parallel forward pass and accepts the longest prefix that matches what it would have produced itself. When the draft guesses well, you get multiple tokens for roughly the cost of one big-model step. With greedy decoding the output is identical to plain autoregressive decoding; with sampling, a suitable accept/reject rule keeps the output distribution the same. It exploits the fact that verification can be parallel even when generation can't.
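Here is a toy sketch of the greedy variant. Both "models" are hypothetical deterministic rules over integer tokens, standing in for real networks, and the verification loop only simulates what a real engine would do in one batched forward pass. The sampling variant, which needs a probabilistic accept/reject rule, is not shown.

```python
def target_next(tokens):
    """Hypothetical 'large' model: a deterministic toy rule."""
    return (tokens[-1] * 3 + len(tokens)) % 7

def draft_next(tokens):
    """Hypothetical 'small' model: agrees with the target except when the
    prefix length is a multiple of 4."""
    guess = target_next(tokens)
    return (guess + 1) % 7 if len(tokens) % 4 == 0 else guess

def plain_decode(prompt, n_new):
    """Baseline: one target call per generated token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(target_next(tokens))
    return tokens

def speculative_decode(prompt, n_new, k=3):
    tokens, target_passes = list(prompt), 0
    goal = len(prompt) + n_new
    while len(tokens) < goal:
        # 1. The draft model proposes k tokens sequentially (cheap).
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2. The target checks every drafted position. A real engine does
        #    this in one batched pass; this loop only simulates it.
        target_passes += 1
        base = list(tokens)
        for i in range(k + 1):
            expected = target_next(base + draft[:i])
            tokens.append(expected)  # the target's own token is what gets kept
            # Stop at the length limit, after the bonus token, or on a mismatch.
            if len(tokens) == goal or i == k or draft[i] != expected:
                break
    return tokens, target_passes

prompt = [1, 2]
plain = plain_decode(prompt, 12)
spec, passes = speculative_decode(prompt, 12, k=3)
assert spec == plain  # same output as plain greedy decoding
print(f"{passes} target passes instead of 12")
```

Every token that lands in the output is the target's own choice for that prefix, which is why the result matches plain decoding exactly. The draft only decides how many positions each target pass gets to confirm.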

Exposure bias and error accumulation

Models are trained on perfect ground-truth prefixes (a setup called teacher forcing) but at inference they must condition on their own possibly-flawed outputs. This mismatch is called exposure bias: once the model strays off the kind of text it saw in training, errors can compound step after step. It's one structural reason long generations can wander, and it connects autoregression directly to reliability concerns like hallucination.

The causal mask: how training stays autoregressive

Decoder-only LLMs enforce the left-to-right rule with a causal attention mask that blocks each position from attending to any future position. This lets the model train on a whole sequence in parallel (predicting every next token at once) while guaranteeing that, at any given position, it only ever 'sees' tokens to its left, exactly the constraint it must obey at inference time. The same mask that makes training efficient is what makes generation autoregressive. To see the broader picture of training and inference, read how LLMs work.
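A causal mask is easy to build by hand. The sketch below uses plain Python lists and made-up attention scores. Real implementations apply the same idea to tensors, typically by setting blocked scores to negative infinity before the softmax.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask_row):
    """Softmax over one row of scores, giving blocked positions zero weight."""
    masked = [s if keep else -math.inf for s, keep in zip(scores, mask_row)]
    m = max(masked)
    exps = [math.exp(s - m) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

tokens = ["The", "cat", "sat", "on"]
mask = causal_mask(len(tokens))
for tok, row in zip(tokens, mask):
    # 'x' marks positions this token may attend to, '.' marks blocked ones.
    print(f"{tok:>4}", "".join("x" if keep else "." for keep in row))

# Made-up attention scores for the position of "sat" (index 2). The large
# score at index 3 belongs to a future token, so the mask must zero it out.
weights = masked_softmax([0.2, 1.5, 0.3, 9.0], mask[2])
print([round(w, 3) for w in weights])
```

The printed mask is lower-triangular: each row allows one more position than the row above it. The final weight for "sat" is exactly zero at index 3, no matter how large the raw score for that future token is.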

FAQ

What is autoregressive generation in simple terms?

It's the way LLMs write text one token at a time, strictly left to right. Each token the model produces is fed back in as part of the input for predicting the next token, so the model conditions every new step on everything it has written so far.

Why is autoregressive generation slow?

Because it's inherently sequential. Each output token depends on the previous one, so the model genuinely cannot compute a long answer in a single shot; it must take one forward pass per token. Latency scales with the number of output tokens, so long answers take longer.

What is the difference between prefill and decode?

Prefill reads your entire prompt in parallel in one batched pass and produces the first token, driving time-to-first-token. Decode then generates new tokens one at a time, each requiring its own forward pass; it's sequential and memory-bandwidth-bound, and it dominates the latency of long responses.

What does the KV cache do?

It stores the key and value vectors for tokens already in the sequence so the model doesn't recompute them on every decode step. Since those vectors never change once a token exists, caching them cuts the total key/value computation from roughly O(n²) to O(n), at the price of extra GPU memory.

Is non-autoregressive or diffusion text generation faster?

It can be, because diffusion and non-autoregressive models refine many positions per step instead of one token at a time, decoupling latency from output length. But as of 2026, matching autoregressive quality, especially on reasoning, remains an open problem, and fast diffusion text models often drift back toward left-to-right decoding.

Further reading