AI/TLDR

How Does Attention Work in LLMs? A Visual Beginner's Guide

Build an intuition for self-attention — queries, keys, and values — and understand the paper title everyone quotes.

BEGINNER · 10 MIN READ · UPDATED 2026-06-12

In plain English

Attention is the trick that lets a language model figure out, for every word it reads, which other words actually matter. Take the sentence "The trophy didn't fit in the suitcase because it was too big." What does "it" refer to — the trophy or the suitcase? You know instantly: the trophy. Attention is the machinery that lets a model make the same call, by having the word "it" glance back across the whole sentence and decide where to lean.

Here's the everyday analogy. Imagine a roomful of people at a conference. You (a single word) walk in holding a sticky note that says what you're looking for — "I need the long-context expert." That's your query. Everyone else wears a name tag describing what they know — "vector databases," "GPU kernels," "context windows." Those tags are the keys. You read every tag, find the best matches, and then actually listen to what those people say — that spoken content is the value. You walk away with a blend of information, weighted toward the people who matched your note best. That, almost exactly, is self-attention.

Why it matters

Before attention, the leading way to process text was to read it strictly left to right, squeezing everything seen so far into a single fixed-size memory and updating it word by word. That had two fatal problems. First, by the time the model reached the end of a long paragraph, the start had faded — early words were a blurry memory. Second, you had to process word 1 before word 2 before word 3, which made training painfully slow because nothing could run in parallel.

Attention fixes both at once. Every word can look directly at every other word, no matter how far apart they sit, so the connection between "it" and "trophy" twenty words back is just as cheap as the connection to its neighbor. And because every pair of words is compared in one big simultaneous operation, the whole thing runs in parallel on a GPU. That parallelism is exactly why these models are so hungry for hardware, a story we tell in "Why LLMs need GPUs."

This is the breakthrough behind the 2017 paper "Attention Is All You Need." Its headline claim, quoted endlessly ever since, is that you can throw away the old sequential machinery entirely and build a model based solely on attention. That architecture is the transformer, and every major model in mid-2026 (Claude, GPT, Gemini, Llama, Qwen, DeepSeek) is a descendant of it. If attention is the engine, the transformer is the car built around it; see "What is a transformer" for the full vehicle.

How it works

Let's walk through one round of attention for a single word. The model has already turned each word into a list of numbers, called an embedding (more on that in "What is a token"). From each word's embedding, the model produces three new vectors, a query, a key, and a value, by multiplying the embedding with three separate learned weight matrices.
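To make the projection step concrete, here is a minimal NumPy sketch. The sizes (3 words, 8-dimensional embeddings, 4-dimensional queries/keys/values) and the names X, W_q, W_k, and W_v are assumptions chosen for illustration, and the random numbers stand in for weights a real model would learn.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes, chosen only for illustration.
n_words, d_model, d_k = 3, 8, 4

# One embedding per word (random stand-ins for real learned embeddings).
X = rng.standard_normal((n_words, d_model))

# Three separate weight matrices; in a real model these are learned in training.
W_q = rng.standard_normal((d_model, d_k))
W_k = rng.standard_normal((d_model, d_k))
W_v = rng.standard_normal((d_model, d_k))

Q = X @ W_q   # row i: what word i is looking for
K = X @ W_k   # row i: what word i advertises
V = X @ W_v   # row i: the content word i hands over

print(Q.shape, K.shape, V.shape)
```

Each word keeps one row in each of Q, K, and V, so the three matrices have the same number of rows as there are words.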

The score step is the heart of it. The query for "it" is compared against the key of every word in the sentence using a dot product — a single number measuring how aligned two vectors are. A big number means "strong match." The word "trophy" will score high; "the" will score low.

Those raw scores get divided by a scaling factor and passed through softmax, which turns them into positive weights that add up to 1 — think of it as splitting 100% of attention across all the words. Finally the model takes a weighted sum of the value vectors: mostly "trophy"'s value, a little of everything else. The result is a brand-new vector for "it" that has quietly absorbed the meaning of "trophy." The word now knows what it refers to.
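The score, softmax, and blend steps can be traced by hand for a single query. The sketch below uses hand-picked 2-dimensional toy vectors for three words; the numbers are assumptions made up so that the key for "trophy" lines up with the query for "it", not values from any real model.

```python
import numpy as np

words = ["trophy", "suitcase", "it"]

# Hand-picked toy vectors (illustrative assumptions, not real model values).
q_it = np.array([1.0, 0.0])        # query for "it"
K = np.array([[2.0, 0.0],          # key for "trophy": aligned with the query
              [0.5, 1.0],          # key for "suitcase": partly aligned
              [0.0, 1.0]])         # key for "it": not aligned
V = np.array([[10.0, 0.0],         # value for "trophy"
              [0.0, 10.0],         # value for "suitcase"
              [5.0, 5.0]])         # value for "it"

# Score: dot product of the query with every key, divided by sqrt(d_k).
scores = K @ q_it / np.sqrt(q_it.shape[0])

# Softmax: turn scores into positive weights that sum to 1.
weights = np.exp(scores - scores.max())
weights /= weights.sum()

# Blend: weighted sum of the value vectors.
new_it = weights @ V

for word, weight in zip(words, weights):
    print(f"{word:>8}: {weight:.2f}")
print("new vector for 'it':", new_it.round(2))
```

The largest weight lands on "trophy", so the new vector for "it" is pulled mostly toward the value of "trophy".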

If you like one line of math, the whole operation is the famous formula softmax(Q·Kᵀ / √dₖ)·V. Q, K, and V are the stacks of query, key, and value vectors; Q·Kᵀ is every query dotted with every key (the score grid); √dₖ is the scaling factor that keeps the numbers from blowing up; softmax normalizes; and multiplying by V blends the values. That's it — no magic beyond "compare, weight, blend."

Why multiple heads?

One round of attention captures one kind of relationship. But language has many at once: "it" needs to track what noun it refers to, while a verb needs to track its subject, and an adjective needs to find the thing it describes. Asking a single attention pass to juggle all of these is like asking one highlighter to mark grammar, topic, and tone in three different colors — it can't.

Multi-head attention runs several independent attention operations in parallel — the original transformer used 8 — each with its own query/key/value projections. One head might learn to track pronoun references, another long-range grammar, another local word order. Their outputs are concatenated and mixed back together. More heads, more kinds of relationship the model can notice simultaneously.
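As a rough sketch of the split, attend, concatenate, and mix pattern described above, the NumPy code below runs 8 heads over a toy input. The function name multi_head_attention, the output matrix W_o, and all sizes other than the 8 heads are assumptions for illustration; the weights are random rather than learned.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    n, d_model = X.shape
    d_head = d_model // n_heads

    def split(M):
        # (n, d_model) -> (n_heads, n, d_head): give each head its own slice
        return M.reshape(n, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(X @ W_q), split(X @ W_k), split(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)     # (n_heads, n, n)
    weights = softmax(scores)                               # one score grid per head
    heads = weights @ V                                     # (n_heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)   # glue the heads together
    return concat @ W_o, weights                            # W_o mixes the heads

rng = np.random.default_rng(0)
n, d_model, n_heads = 5, 16, 8   # 8 heads as in the original transformer; other sizes are toy values
X = rng.standard_normal((n, d_model))
W_q, W_k, W_v, W_o = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4))

out, weights = multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads)
print(out.shape)       # (5, 16)
print(weights.shape)   # (8, 5, 5)
```

The weights array holds one full attention grid per head, which is what lets each head attend to a different kind of relationship.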

Modern models don't store every head's keys and values separately, though — that would be ruinously expensive during generation, because the model caches the keys and values of every token it has already produced (the KV cache). To shrink that cache, today's models share key/value projections across groups of heads. The spectrum below is the standard vocabulary you'll meet in any 2026 model card.

Variant              | How keys/values are shared            | Trade-off
Multi-head (MHA)     | Every query head has its own K and V  | Highest quality, biggest KV cache
Multi-query (MQA)    | All query heads share one K and V     | Tiny cache, some quality loss
Grouped-query (GQA)  | Groups of query heads share a K and V | The sweet spot; the de-facto default in mid-2026
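The three variants in the table differ only in how many key/value heads exist. A minimal sketch, assuming a hypothetical helper grouped_query_attention and toy sizes, treats MHA and MQA as the two extremes of the same function:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def grouped_query_attention(Q, K, V):
    """Q: (n_q_heads, n, d_head); K and V: (n_kv_heads, n, d_head)."""
    n_q_heads, n_kv_heads = Q.shape[0], K.shape[0]
    group = n_q_heads // n_kv_heads
    # Each K/V head is reused by `group` consecutive query heads.
    K_shared = np.repeat(K, group, axis=0)   # (n_q_heads, n, d_head)
    V_shared = np.repeat(V, group, axis=0)
    scores = Q @ K_shared.transpose(0, 2, 1) / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V_shared

rng = np.random.default_rng(0)
n, d_head, n_q_heads = 6, 4, 8   # toy sizes, assumptions for illustration
Q = rng.standard_normal((n_q_heads, n, d_head))

for name, n_kv_heads in [("MHA", 8), ("GQA", 2), ("MQA", 1)]:
    K = rng.standard_normal((n_kv_heads, n, d_head))
    V = rng.standard_normal((n_kv_heads, n, d_head))
    out = grouped_query_attention(Q, K, V)
    print(f"{name}: output {out.shape}, numbers to cache: {K.size + V.size}")
```

The output shape is the same in all three cases; only the amount of K and V that has to be stored changes.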

As of mid-2026, grouped-query attention (GQA) is the default across open-weight families like Llama, Mistral, and Qwen — it keeps nearly all of MHA's quality while slashing the memory the cache eats. This directly affects how big a context window a model can serve affordably.
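A back-of-the-envelope calculation shows why this matters for serving cost. The model shape below (32 layers, 32 query heads, 128 dimensions per head, 8,192 tokens, 2 bytes per number) is a hypothetical example, not the configuration of any specific model, and kv_cache_bytes is a helper written for this sketch.

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, bytes_per_value=2):
    # Factor of 2: one tensor for keys and one for values, in every layer.
    return 2 * n_layers * n_kv_heads * d_head * seq_len * bytes_per_value

# Hypothetical model shape, chosen only for illustration.
n_layers, d_head, seq_len = 32, 128, 8192

for name, n_kv_heads in [("MHA", 32), ("GQA", 8), ("MQA", 1)]:
    gib = kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len) / 2**30
    print(f"{name}: {n_kv_heads:>2} K/V heads -> {gib:.3f} GiB per sequence")
```

Under these assumptions the cache shrinks in direct proportion to the number of key/value heads, so going from 32 to 8 cuts it to a quarter.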

See it in code

Self-attention is shorter than people expect. Here is the entire scaled dot-product attention in a few lines of NumPy: no deep-learning framework, no GPU, just the math from the "How it works" section. Run it and watch the attention weights for each word add up to 1.

self_attention.py (Python)
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # every query dot every key
    weights = softmax(scores)         # rows sum to 1
    return weights @ V, weights       # blended values + the weights

# 3 words, each a tiny 4-dim vector (pretend these are Q, K, V projections)
rng = np.random.default_rng(0)
Q = K = V = rng.standard_normal((3, 4))

out, weights = attention(Q, K, V)
print("attention weights:\n", weights.round(2))
print("each row sums to:", weights.sum(axis=1).round(2))  # -> [1. 1. 1.]

That weights matrix is the whole story: row i tells you how much word i paid attention to each word in the sentence, itself included. In a real model the Q, K, and V matrices are three different learned projections of the embeddings rather than one random matrix reused three times, and this block is stacked dozens of times. The arithmetic you just ran, though, is exactly what's happening inside the engine.

Attention in mid-2026 models

The original formula is mathematically simple but expensive: comparing every word to every other word means the cost grows with the square of the sequence length. Double the context and you roughly quadruple the attention work. Most progress since 2022 has been about making that cheaper while changing the answer as little as possible, or not at all.
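The quadratic growth is easy to check with plain arithmetic. The helper score_grid_entries below is a name made up for this sketch; it simply counts the entries in the query-by-key score grid for one head in one layer.

```python
def score_grid_entries(n_tokens):
    # Every token's query is compared with every token's key.
    return n_tokens * n_tokens

for n_tokens in [1_000, 2_000, 4_000, 1_000_000]:
    entries = score_grid_entries(n_tokens)
    print(f"{n_tokens:>9,} tokens -> {entries:>17,} scores per head per layer")
```

Each doubling of the token count multiplies the number of scores by four, which is why million-token contexts need more than the naive algorithm.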

  • FlashAttention (and FlashAttention-3, used widely as of mid-2026) computes the exact same attention result, but reorders the work to minimize slow reads and writes to GPU memory: same math, far faster. See "What is FlashAttention."
  • Grouped-query attention (GQA) shrinks the KV cache by sharing keys and values across head groups — now the default in Llama, Mistral, Qwen, and other open families.
  • Multi-head latent attention (MLA), introduced by DeepSeek in V2 and used in V3, compresses keys and values into a small shared latent space and caches only that — a different route to the same goal of a smaller cache.
  • These efficiency tricks are part of why frontier models in mid-2026 can offer 1M-token contexts (e.g. Claude Opus 4.6, Gemini 2.5 Pro) and even 2M (e.g. Gemini 3.1 Pro) — see million-token context windows.
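To give a feel for how the FlashAttention bullet's "same math, reordered work" can be possible, here is a NumPy sketch of blockwise attention with a running (online) softmax. It is only an illustration of that one idea, not the actual GPU kernel: the function names and block size are assumptions, and nothing here models GPU memory.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_reference(Q, K, V):
    # Standard attention: builds the full (n, n) score grid at once.
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def attention_blockwise(Q, K, V, block=4):
    # Visits keys/values one block at a time and never holds the full grid.
    n, d_k = Q.shape
    out = np.zeros((n, V.shape[-1]))
    running_max = np.full((n, 1), -np.inf)   # largest score seen so far, per query
    running_sum = np.zeros((n, 1))           # softmax denominator so far, per query
    for start in range(0, K.shape[0], block):
        K_blk, V_blk = K[start:start + block], V[start:start + block]
        s = Q @ K_blk.T / np.sqrt(d_k)                        # scores for this block only
        new_max = np.maximum(running_max, s.max(axis=-1, keepdims=True))
        correction = np.exp(running_max - new_max)            # rescale earlier partial results
        p = np.exp(s - new_max)
        running_sum = running_sum * correction + p.sum(axis=-1, keepdims=True)
        out = out * correction + p @ V_blk
        running_max = new_max
    return out / running_sum

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((10, 8)) for _ in range(3))
same = np.allclose(attention_blockwise(Q, K, V), attention_reference(Q, K, V))
print("blockwise matches reference:", same)
```

The two functions agree up to floating-point rounding, which is the sense in which this family of methods changes the schedule of the computation rather than its result.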

Going deeper

A few subtleties separate the beginner picture from the real thing. First, attention is permutation-blind: the dot-product mechanism has no built-in notion of word order, so "dog bites man" and "man bites dog" would look identical to it. Models fix this by injecting positional information. The original transformer added a position signal to the embeddings before attention ran; modern models mostly use rotary position embeddings (RoPE), which instead rotate the query and key vectors inside each attention layer by an angle that depends on position, a clean way to bake "how far apart are these two words?" directly into the scores.
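A small sketch can show the key property of rotary embeddings: after rotation, the query-key score depends on how far apart two positions are, not on where they sit. The function rope below is a simplified single-vector version written for this illustration; pairing adjacent dimensions and using base 10000 follow the common convention, but real implementations differ in layout and details.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive pairs of dimensions of x by position-dependent angles."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)   # one rotation speed per pair
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)

score_a = rope(q, 5) @ rope(k, 2)       # positions 5 and 2: three apart
score_b = rope(q, 105) @ rope(k, 102)   # positions 105 and 102: also three apart
score_c = rope(q, 5) @ rope(k, 4)       # positions 5 and 4: one apart

print("same offset, same score:", np.isclose(score_a, score_b))
print("different offset, same score:", np.isclose(score_a, score_c))
```

Shifting both positions by the same amount leaves the score unchanged, while changing the gap between them changes it.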

Second, the √dₖ scaling factor isn't decoration. When the query and key vectors are high-dimensional, raw dot products can grow large, which pushes softmax into a saturated region: one weight sits near 1, the rest near 0, and the gradients become vanishingly small, so learning stalls. Dividing by the square root of the key dimension keeps the scores in a sane range. It's a small detail with an outsized effect on whether the model trains at all.
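The effect of the scaling factor can be seen numerically. This sketch assumes query and key entries drawn from a standard normal distribution, which is an idealization; under that assumption the spread of raw dot products grows like √dₖ, and dividing by √dₖ brings it back to about 1.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)

# Spread of 1000 random dot products, before and after scaling.
spread = {}
for d_k in [4, 64, 1024]:
    q = rng.standard_normal((1000, d_k))
    k = rng.standard_normal((1000, d_k))
    raw = (q * k).sum(axis=-1)
    spread[d_k] = (raw.std(), (raw / np.sqrt(d_k)).std())
    print(f"d_k={d_k:>4}: raw std {spread[d_k][0]:6.2f}, scaled std {spread[d_k][1]:.2f}")

# Effect on softmax for one query against 8 keys at d_k = 1024.
d_k = 1024
q = rng.standard_normal(d_k)
K = rng.standard_normal((8, d_k))
unscaled_peak = softmax(K @ q).max()
scaled_peak = softmax(K @ q / np.sqrt(d_k)).max()
print(f"largest weight without scaling: {unscaled_peak:.3f}")
print(f"largest weight with scaling:    {scaled_peak:.3f}")
```

Without scaling, the largest weight is always at least as big as with scaling, and at high dimension it tends to swallow nearly all of the attention.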

Third, attention alone isn't the whole transformer block. Each attention layer is followed by a small feed-forward network applied to every position, plus residual connections and normalization that keep deep stacks trainable. Attention moves information between words; the feed-forward layers think about each word with that gathered context. Stack the pair dozens of times and you get an LLM.
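A stripped-down block might look like the sketch below. Several choices here are assumptions made to keep it short: a single attention head with no output projection, normalization applied before each sub-layer (one common arrangement; the original transformer applied it after), a ReLU feed-forward network, layer norm without learned scale and shift, and random weights in place of learned ones.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalize each position's vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def self_attention(x, W_q, W_k, W_v):
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def feed_forward(x, W1, W2):
    # Two-layer network with a ReLU, applied to every position independently.
    return np.maximum(0.0, x @ W1) @ W2

def transformer_block(x, p):
    x = x + self_attention(layer_norm(x), p["W_q"], p["W_k"], p["W_v"])  # move info between words
    x = x + feed_forward(layer_norm(x), p["W1"], p["W2"])                # process each word alone
    return x

def random_params(rng, d_model, d_ff, scale=0.1):
    shapes = {"W_q": (d_model, d_model), "W_k": (d_model, d_model),
              "W_v": (d_model, d_model), "W1": (d_model, d_ff), "W2": (d_ff, d_model)}
    return {name: rng.standard_normal(shape) * scale for name, shape in shapes.items()}

rng = np.random.default_rng(0)
n_words, d_model, d_ff, n_layers = 5, 8, 32, 4   # toy sizes
x = rng.standard_normal((n_words, d_model))
blocks = [random_params(rng, d_model, d_ff) for _ in range(n_layers)]

h = x
for p in blocks:          # stack the same block structure several times
    h = transformer_block(h, p)
print(h.shape)
```

Because each sub-layer adds its output to its input, the shape never changes, which is what makes stacking dozens of blocks straightforward.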

Finally, a frontier of 2025-2026 research is sparse and linear attention — schemes where each token attends to only a carefully chosen subset of others, or where the quadratic cost is approximated away, so context can grow toward millions of tokens without the cost exploding. The exact-vs-approximate trade-off is one of the most active areas in model architecture today. If you want to keep climbing, mixture-of-experts is the natural next stop — a different way of making giant models affordable to run.
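One of the simplest sparse patterns is a sliding window, where each token attends only to neighbors within a fixed distance. The sketch below is an assumption-laden illustration of that one pattern: it masks a full score grid for clarity, so it still does quadratic work, whereas a real implementation would compute only the allowed band.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sliding_window_attention(Q, K, V, window):
    n = Q.shape[0]
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    idx = np.arange(n)
    allowed = np.abs(idx[:, None] - idx[None, :]) <= window   # True near the diagonal
    scores = np.where(allowed, scores, -np.inf)               # blocked pairs get zero weight
    weights = softmax(scores)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((6, 4)) for _ in range(3))

out, weights = sliding_window_attention(Q, K, V, window=1)
print(weights.round(2))
print("nonzero weights per row:", (weights > 0).sum(axis=1))
```

With a window of 1, every row has at most three nonzero weights no matter how long the sequence gets, so the useful work grows linearly with length instead of quadratically.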

FAQ

What are queries, keys, and values in attention?

For every word, the model makes three vectors. The query says what that word is looking for, the key advertises what each word offers, and the value is the actual content a word hands over. The query is matched against all keys to decide how much of each value to blend in. A handy analogy: a query is a search box, keys are the titles of documents, and values are the documents themselves.

What does 'Attention Is All You Need' actually mean?

It's the title of the 2017 paper that introduced the transformer. The claim is literal: you can build a powerful sequence model using only attention, dropping the older recurrent and convolutional machinery entirely. That design — the transformer — underpins essentially every major LLM in mid-2026, from Claude and GPT to Gemini, Llama, and DeepSeek.

What is multi-head attention and why use more than one head?

Multi-head attention runs several attention operations in parallel, each with its own learned projections. Different heads can specialize in different relationships: one might track which noun a pronoun refers to, another subject-verb agreement, and so on. The original transformer used 8 heads; their outputs are combined so the model can capture many kinds of word relationship at once.

Is self-attention the same as the attention in older translation models?

Not quite. Earlier 'attention' connected two different sequences (for example, a source sentence and its translation). Self-attention applies the same idea within a single sequence — every word attends to the other words in the same text. That self-referential version is what powers transformers and modern LLMs.
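In code, the difference comes down to where the queries, keys, and values come from. The sketch below reuses the scaled dot-product form from this article for both cases and skips the learned projections; older translation models used their own scoring functions, so treat this as an illustration of the wiring only.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return weights @ V, weights

rng = np.random.default_rng(0)
source = rng.standard_normal((5, 4))   # e.g. 5 words of a source sentence
target = rng.standard_normal((3, 4))   # e.g. 3 words of the translation so far

# Attention across two sequences: queries from one, keys and values from the other.
_, cross_weights = attention(target, source, source)

# Self-attention: queries, keys, and values all come from the same sequence.
_, self_weights = attention(source, source, source)

print(cross_weights.shape)   # (3, 5)
print(self_weights.shape)    # (5, 5)
```

The cross-sequence grid is rectangular because it relates target words to source words, while the self-attention grid is square because a sequence is compared with itself.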

Why is attention so computationally expensive?

Standard attention compares every token to every other token, so the work grows with the square of the sequence length — double the input and you roughly quadruple the attention cost. That's why long contexts are pricey, and why techniques like FlashAttention, grouped-query attention, and multi-head latent attention exist: they cut the memory and compute cost without changing the result much.

Further reading