In plain English
A transformer is the neural-network design inside basically every AI model you've heard of — GPT, Claude, Gemini, Llama, all of them. It was introduced in a 2017 Google research paper called Attention Is All You Need, and it's not a product or a model. It's a blueprint: a repeatable recipe for building a machine that reads a sequence of tokens and predicts what comes next.
Here's the everyday version. Take the sentence "The bank by the river was muddy." To understand the word bank, you need to notice river — otherwise you might picture a building with ATMs. Older language models read like one person with a flashlight, one word at a time, left to right, trying to hold everything in short-term memory. By the time the flashlight reaches muddy, the memory of bank has gone fuzzy.
A transformer works more like a conference table. Every word in the sentence sits down at once. Then each word looks around the table at every other word simultaneously and updates its own meaning based on what it sees: bank glances at river and thinks "ah, I'm a riverbank." The model repeats this look-around-and-update step dozens of times, and each pass produces a sharper understanding than the last.
That look-at-everything trick is called attention, and it's the beating heart of the architecture — we unpack it fully in how attention works. Everything else in a transformer is well-engineered plumbing wrapped around that one idea.
Why it matters
Before 2017, language AI was built on recurrent networks (RNNs and LSTMs). They read text strictly one token at a time, which created two killer problems. First, memory: information about early words degraded as the sequence got longer, so long-range connections like bank → river got lost. Second, speed: because step 500 depended on step 499, training couldn't be parallelized across the positions of a sequence. You couldn't just throw more GPUs at the problem.
The transformer fixed both at once. Since every token looks at every other token directly, nothing has to survive a long relay race through memory. And since all tokens are processed at the same time, training maps beautifully onto GPUs, which are parallel machines to their core. That second fix is the quiet one that changed history: suddenly you could train on vastly more text just by buying more hardware. Scale the data, scale the model, and quality kept climbing — that's the story that leads directly from a 2017 translation paper to ChatGPT.
And it didn't stop at text. The same blueprint, almost unchanged, now powers image models (Vision Transformers), speech recognition (Whisper), code generation, and protein-structure prediction (AlphaFold leans heavily on attention). Learn this one architecture and you understand the skeleton of essentially all modern AI.
If you build with LLMs rather than building them, this still matters practically. The transformer's mechanics explain why context windows have hard limits, why long prompts cost more and run slower, why model specs talk about "layers" and "parameters," and why your API bill is measured in tokens. Every API call you make is renting time on a transformer.
How it works
A transformer is a pipeline with four stages. The pipeline is identical whether the model has 100 million parameters or a few trillion: bigger models just make each stage wider and repeat the middle stage more times.
Step 1: text becomes vectors
First the tokenizer chops your text into tokens — chunks of roughly a word or part of a word. Each token ID is swapped for an embedding: a long list of numbers (often a few thousand) that acts as the token's working representation, its "meaning so far." Position information gets mixed in too, because attention by itself is order-blind — without it, "dog bites man" and "man bites dog" would look identical.
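To make the order-blindness point concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the three-word vocabulary, the toy size `d_model = 8`, the random tables (real models learn these values), and the choice of a learned additive position table, which is only one of several schemes in use.

```python
import random

rng = random.Random(0)
d_model = 8                                   # toy size; real models use far more
vocab = {"dog": 0, "bites": 1, "man": 2}      # hypothetical three-word vocabulary
max_len = 3

def random_table(rows, cols):
    # Stand-in for learned weights: real tables are trained, not random.
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

token_table = random_table(len(vocab), d_model)   # one vector per token ID
position_table = random_table(max_len, d_model)   # one vector per position

def embed(words, use_positions=True):
    vectors = []
    for pos, word in enumerate(words):
        vec = token_table[vocab[word]]            # swap token ID for its embedding
        if use_positions:
            # Mix in position by adding that position's vector.
            vec = [t + p for t, p in zip(vec, position_table[pos])]
        vectors.append(vec)
    return vectors

a = "dog bites man".split()
b = "man bites dog".split()

# Sorting the vectors throws away order, leaving just the set of vectors.
same_without_positions = sorted(embed(a, False)) == sorted(embed(b, False))
same_with_positions = sorted(embed(a, True)) == sorted(embed(b, True))
print(same_without_positions, same_with_positions)
```

Without position information the two sentences are the same bag of vectors; once positions are added, "dog" at position 0 and "dog" at position 2 become different vectors, so the sentences can be told apart.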
Step 2: the transformer block — attention, then MLP
The middle of the pipeline is one block repeated over and over, and each block has just two working parts. Self-attention is the communication step: every token looks at every other token and pulls in whatever context is relevant to it. Then a feed-forward network (an MLP) is the thinking step: each token, on its own, processes what it just gathered — no cross-token chatter. Communicate, then compute. That two-beat rhythm is the whole architecture.
One more detail that makes deep stacks possible: residual connections. Each sub-layer adds its output to the incoming representation instead of replacing it. Think of it as each layer scribbling notes in the margin of a shared document rather than rewriting the document from scratch. Nothing important gets accidentally erased, and the network stays trainable even when it's a hundred layers deep.
Step 3: stack it deep
Because each block outputs vectors with exactly the same shape it received, blocks stack like LEGO. Small models use around a dozen blocks; frontier models use many dozens. Roughly speaking, early layers settle local things like grammar and word identity, while later layers handle increasingly abstract relationships — though the truth is messier than that tidy story.
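The LEGO property is easy to check. The sketch below uses PyTorch's built-in `nn.TransformerEncoderLayer` as a stand-in for a transformer block (an assumption for brevity, not necessarily what any particular model uses), with toy sizes chosen only so it runs quickly.

```python
import torch
import torch.nn as nn

d_model, n_heads, n_blocks = 64, 4, 6   # toy sizes, chosen for illustration

# Six identical blocks chained one after another.
stack = nn.Sequential(*[
    nn.TransformerEncoderLayer(
        d_model=d_model,
        nhead=n_heads,
        dim_feedforward=4 * d_model,    # MLP hidden width, 4x the model dimension
        activation="gelu",
        batch_first=True,               # inputs are (batch, tokens, d_model)
        norm_first=True,                # normalize before each sub-layer
    )
    for _ in range(n_blocks)
])

x = torch.randn(1, 10, d_model)         # 1 sequence, 10 tokens
out = x
for block in stack:
    out = block(out)
    assert out.shape == x.shape         # every block preserves the shape

print(out.shape)
```

Because the shape never changes, `n_blocks` can be raised without touching anything else in the pipeline.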
Step 4: predict the next token
After the final block, the vector at the last position is compared against every token in the vocabulary, producing one score per token (the logits). A softmax turns scores into probabilities, and a sampler picks one — that's your next token. Append it, run the pipeline again, and repeat until done. That loop, covered in how LLMs actually work, is text generation.
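The last two moves, softmax and sampling, fit in a few lines of plain Python. This is a sketch under stated assumptions: the five-word vocabulary and the logit values are made up, and the sampler is the simplest possible one (draw in proportion to probability), whereas real systems usually add controls such as temperature.

```python
import math
import random

def softmax(logits):
    peak = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next_token(logits, rng):
    probs = softmax(logits)
    # Draw one token ID, weighted by its probability.
    token_id = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return token_id, probs

vocab = ["the", "river", "bank", "was", "muddy"]   # hypothetical vocabulary
logits = [1.0, 0.5, 3.0, 0.2, 2.0]                 # made-up scores, one per token

rng = random.Random(0)
token_id, probs = sample_next_token(logits, rng)
print(vocab[token_id], [round(p, 3) for p in probs])
```

Generation is this step in a loop: append the sampled token to the input, run the model again to get fresh logits, and sample again.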
A transformer block in code
Here's the part nobody tells beginners: the core of this world-changing architecture fits on one screen. This is a working transformer block in PyTorch — attention, MLP, residual connections, normalization. It's simplified (no causal mask, no dropout) but the skeleton is the real thing.
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # the MLP is wider in the middle, typically 4x the model dimension
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        # communicate: every token gathers context from every other token
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out  # residual: add, don't replace
        # compute: each token processes what it gathered, independently
        x = x + self.mlp(self.norm2(x))  # residual again
        return x

x = torch.randn(1, 10, 512)  # 1 sequence, 10 tokens, 512 numbers per token
block = TransformerBlock()
print(block(x).shape)  # torch.Size([1, 10, 512]): same shape out
Notice the last line: same shape out as in. That's why stacking works — you can chain 12 or 96 of these and the data flows through unchanged in format, enriched in content. A production model differs in scale, not in kind: a bigger d_model, more heads, more blocks, a causal mask so tokens can't peek at the future, and an embedding layer at the bottom plus a vocabulary projection at the top. Andrej Karpathy's nanoGPT implements a full GPT-2-class model in roughly 300 lines, and reading it is the single best follow-up to this article.
What it replaced
It's easier to appreciate the transformer when you see the old world side by side. RNNs and LSTMs weren't bad engineering — they were the state of the art for years — but they had a structural ceiling the transformer simply doesn't.
| RNNs and LSTMs | Transformers |
| --- | --- |
| Reads one token at a time | Sees all tokens at once |
| Memory fades over distance | Direct line to any token |
| Sequential training, can't parallelize | Trains in parallel on GPUs |
| Hit a scaling wall | Kept improving with scale |
The right column's last row is the one that mattered. The transformer wasn't just better at translation benchmarks — it was the first architecture where spending more money reliably bought more intelligence. That property created the entire modern AI industry.
Going deeper
Causal masking and the three families. The original 2017 transformer had two halves: an encoder that read the input and a decoder that wrote the output. Modern LLMs like GPT and Claude are decoder-only: a mask inside attention stops each token from seeing tokens to its right, so the model can be trained to predict every next token in a document simultaneously — one of the great efficiency tricks in ML. Encoder-only models (BERT) drop the mask and read bidirectionally, which suits search and classification. The trade-offs are mapped out in encoder vs decoder models.
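A causal mask is simple to sketch in plain Python. The example below is illustrative only: the 4x4 grid of attention scores is made up (all zeros, meaning every token looks equally relevant), and real implementations do this with tensor operations rather than loops.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax over scores[i][j], with future positions (j > i) masked out."""
    n = len(scores)
    weights = []
    for i in range(n):
        visible = scores[i][: i + 1]            # token i sees positions 0..i only
        peak = max(visible)                     # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        # Masked (future) positions get a weight of exactly 0.
        weights.append([e / total for e in exps] + [0.0] * (n - i - 1))
    return weights

scores = [[0.0] * 4 for _ in range(4)]          # hypothetical scores for 4 tokens
weights = causal_attention_weights(scores)
for row in weights:
    print([round(w, 2) for w in row])
# [1.0, 0.0, 0.0, 0.0]
# [0.5, 0.5, 0.0, 0.0]
# [0.33, 0.33, 0.33, 0.0]
# [0.25, 0.25, 0.25, 0.25]
```

Each row is one token's view of the sequence. Because row i never depends on later tokens, all four rows are valid next-token training examples computed in one pass.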
The residual stream view. Interpretability researchers flip the mental model: instead of layers transforming data, picture one shared "stream" of information flowing upward, with every attention head and MLP reading from and writing into it. Attention heads become routing devices that copy information between token positions; MLPs become lookup-and-compute units. This framing, developed in Anthropic's circuits research, is the foundation of most current work on understanding what's happening inside these models.
Where the parameters actually live. Attention gets the fame, but in a standard block the MLP holds roughly two-thirds of the weights. A useful caricature: attention moves information around; MLPs store and apply the knowledge. Mixture-of-experts models push this further by replacing each dense MLP with many parallel "experts" and a router that activates only a few per token — most parameters sit idle on any given forward pass, which is how frontier models grow huge without proportionally growing compute.
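The two-thirds figure can be checked with arithmetic. This sketch assumes a standard dense block: four `d_model` x `d_model` attention projections (query, key, value, output) with biases, an MLP with 4x expansion, and normalization parameters left out because they are negligible. The value `d_model = 512` is just an example.

```python
def attention_params(d_model):
    # Query, key, value, and output projections: four square matrices plus biases.
    return 4 * (d_model * d_model + d_model)

def mlp_params(d_model, expansion=4):
    hidden = expansion * d_model
    up = d_model * hidden + hidden        # first linear layer (weights + biases)
    down = hidden * d_model + d_model     # second linear layer (weights + biases)
    return up + down

d_model = 512
attn = attention_params(d_model)
mlp = mlp_params(d_model)
mlp_share = mlp / (attn + mlp)
print(attn, mlp, mlp_share)
```

The ratio stays close to two-thirds at any width, because the attention weights scale as 4 x d_model squared and the MLP weights as 8 x d_model squared.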
The quadratic tax. Every token attending to every other token means attention cost grows with the square of sequence length — double the context, quadruple the attention work. This is the deep reason long context is expensive. Production inference leans on the KV cache to avoid recomputing attention over the whole history for each new token (trading GPU memory for speed), and on FlashAttention, which reorders the computation around GPU memory hierarchy to get exact attention much faster. Positional encoding schemes like RoPE, rather than the original sinusoidal signals, are a big part of how models stretch to very long contexts.
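A rough count shows both effects. This is a deliberately simplified cost model, not a benchmark: it counts only query-key score computations, ignores heads, layers, and everything else, and assumes naive generation reruns full attention over the whole prefix at every step.

```python
def full_attention_scores(n_tokens):
    # Every token scores every token: an n x n grid.
    return n_tokens * n_tokens

def generation_scores_without_cache(n_tokens):
    # Naive generation: recompute full attention over the prefix at each step.
    return sum(full_attention_scores(t) for t in range(1, n_tokens + 1))

def generation_scores_with_cache(n_tokens):
    # With cached keys and values, only the newest token's query is scored
    # against the t tokens seen so far.
    return sum(t for t in range(1, n_tokens + 1))

for n in (1_000, 2_000, 4_000):
    print(
        n,
        full_attention_scores(n),
        generation_scores_without_cache(n),
        generation_scores_with_cache(n),
    )
```

Doubling `n` quadruples the first count, which is the quadratic tax. The gap between the last two counts is what the KV cache buys, paid for in the memory needed to store the cached keys and values.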
Open problems. The quadratic tax has inspired a wave of challengers — state-space models like Mamba, linear-attention variants, and hybrid stacks that mix attention layers with cheaper ones. None has dethroned the transformer yet; the ecosystem's tooling, hardware optimizations, and accumulated training know-how all assume this architecture. Nine years after Attention Is All You Need, the most interesting question in the field is still open: is the transformer a stepping stone, or the final word on sequence modeling?
FAQ
Why are LLMs called transformers?
Because they're all built on the transformer architecture from the 2017 paper Attention Is All You Need. The "T" in GPT literally stands for Transformer (Generative Pre-trained Transformer). Claude, Gemini, and Llama are variants of the same blueprint — different sizes and training recipes, same core design.
What's the difference between a transformer and an LLM?
The transformer is the architecture — the blueprint. An LLM is one thing you can build with it: a very large transformer trained on enormous amounts of text. The same blueprint also builds image models (Vision Transformers) and speech models (Whisper), which are transformers but not LLMs.
Why did transformers replace RNNs and LSTMs?
Two reasons. RNNs read one token at a time, so distant context faded from memory and training couldn't be parallelized. Transformers let every token attend to every other token directly and process the whole sequence at once, which made GPU-scale training practical. Once scaling worked, RNNs couldn't compete.
Do I need to understand transformer math to build with LLMs?
No. The working mental model — text becomes tokens, tokens exchange context through attention, a deep stack of identical blocks refines them, the top predicts the next token — is enough to reason correctly about context limits, latency, cost, and most weird model behavior. The matrix calculus is only required if you're training or modifying models.
How many layers does a transformer model have?
It varies with scale. GPT-2's smallest version used 12 transformer blocks; GPT-3 used 96. When a model card says "layers," it almost always means how many times the attention-plus-MLP block is repeated. More blocks (and wider ones) is the main way models grow.
Is attention the only thing that matters in a transformer?
No, despite the paper's title. Attention routes information between tokens, but the feed-forward MLP layers hold roughly two-thirds of the parameters and do much of the actual "knowledge" work. Residual connections and normalization are also load-bearing — remove them and deep stacks fail to train.