What Is a Transformer Model? The Architecture Behind LLMs

Get a working mental model of the transformer, the architecture behind nearly every LLM you'll call, without the math degree.

BEGINNER | 10 MIN READ | UPDATED 2026-06-11

In plain English

A transformer is the neural-network design inside basically every AI model you've heard of — GPT, Claude, Gemini, Llama, all of them. It was introduced in a 2017 Google research paper called Attention Is All You Need, and it's not a product or a model. It's a blueprint: a repeatable recipe for building a machine that reads a sequence of tokens and predicts what comes next.

Here's the everyday version. Take the sentence "The bank by the river was muddy." To understand the word bank, you need to notice river — otherwise you might picture a building with ATMs. Older language models read like one person with a flashlight, one word at a time, left to right, trying to hold everything in short-term memory. By the time the flashlight reaches muddy, the memory of bank has gone fuzzy.

A transformer works more like a conference table. Every word in the sentence sits down at once. Then each word looks around the table at every other word simultaneously and updates its own meaning based on what it sees: bank glances at river and thinks "ah, I'm a riverbank." The model repeats this look-around-and-update step dozens of times, and each pass produces a sharper understanding than the last.

That look-at-everything trick is called attention, and it's the beating heart of the architecture — we unpack it fully in how attention works. Everything else in a transformer is well-engineered plumbing wrapped around that one idea.

Why it matters

Before 2017, language AI was built on recurrent networks (RNNs and LSTMs). They read text strictly one token at a time, which created two killer problems. First, memory: information about early words degraded as the sequence got longer, so long-range connections like bank → river got lost. Second, speed: because step 500 depended on step 499, training couldn't be parallelized. You couldn't just throw more GPUs at the problem.

The transformer fixed both at once. Since every token looks at every other token directly, nothing has to survive a long relay race through memory. And since all tokens are processed at the same time, training maps beautifully onto GPUs, which are parallel machines to their core. That second fix is the quiet one that changed history: suddenly you could train on vastly more text just by buying more hardware. Scale the data, scale the model, and quality kept climbing — that's the story that leads directly from a 2017 translation paper to ChatGPT.
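To make the speed argument concrete, here is a minimal sketch contrasting the two styles. The toy sizes and the use of `nn.RNNCell` are illustrative assumptions, not a reconstruction of any particular pre-2017 system, and the attention half skips the learned projections a real model would have.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, d = 6, 16                    # toy sizes, chosen only for illustration
x = torch.randn(seq_len, d)           # one sequence of 6 token vectors

# RNN style: step t needs the hidden state produced by step t-1.
cell = nn.RNNCell(d, d)
h = torch.zeros(1, d)
rnn_states = []
for t in range(seq_len):              # inherently sequential across t
    h = cell(x[t].unsqueeze(0), h)
    rnn_states.append(h)
rnn_out = torch.cat(rnn_states)       # (seq_len, d)

# Attention style: every pair of tokens is scored in one matrix multiply.
scores = x @ x.T / d ** 0.5           # (seq_len, seq_len)
weights = scores.softmax(dim=-1)      # each row sums to 1
attn_out = weights @ x                # (seq_len, d)

print(rnn_out.shape, attn_out.shape)
```

The loop is the relay race: nothing at step 5 can start until step 4 finishes. The matrix multiply has no such dependency, which is the property GPUs exploit.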

And it didn't stop at text. The same blueprint, almost unchanged, now powers image models (Vision Transformers), speech recognition (Whisper), code generation, and protein-structure prediction (AlphaFold leans heavily on attention). Learn this one architecture and you understand the skeleton of essentially all modern AI.

If you build with LLMs rather than building them, this still matters practically. The transformer's mechanics explain why context windows have hard limits, why long prompts cost more and run slower, why model specs talk about "layers" and "parameters," and why your API bill is measured in tokens. Every API call you make is renting time on a transformer.

How it works

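A transformer is a pipeline with five stages: tokenizer, embeddings, transformer blocks, logits, and next-token sampling. The pipeline is identical whether the model has 100 million parameters or a few trillion; bigger models just make each stage wider and repeat the middle stage more times. The four steps below walk through it, with the tokenizer and embeddings covered together in Step 1 and the logits and sampling together in Step 4.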

// The transformer pipeline, end to end

Prompt text ("The bank by the river…")
→ Tokenizer (text → token IDs)
→ Embeddings (IDs → vectors + position)
→ Transformer blocks (attention + MLP, ×N)
→ Logits (a score for every token)
→ Next token (sample, append, repeat)

Step 1: text becomes vectors

First the tokenizer chops your text into tokens — chunks of roughly a word or part of a word. Each token ID is swapped for an embedding: a long list of numbers (often a few thousand) that acts as the token's working representation, its "meaning so far." Position information gets mixed in too, because attention by itself is order-blind — without it, "dog bites man" and "man bites dog" would look identical.
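Here is a rough sketch of Step 1. The three-word vocabulary and whitespace `tokenize` helper are hypothetical stand-ins (real tokenizers learn subword pieces from data), and learned position embeddings are just one of several ways to mix in position.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy vocabulary, for illustration only.
vocab = {"dog": 0, "bites": 1, "man": 2}

def tokenize(text):
    """Toy whitespace tokenizer: text -> list of token IDs."""
    return [vocab[word] for word in text.split()]

d_model, max_len = 8, 16
tok_emb = nn.Embedding(len(vocab), d_model)   # one vector per token ID
pos_emb = nn.Embedding(max_len, d_model)      # one learned vector per position

def embed(text, use_position=True):
    ids = torch.tensor(tokenize(text))
    x = tok_emb(ids)                           # (n_tokens, d_model)
    if use_position:
        x = x + pos_emb(torch.arange(len(ids)))
    return x

a = embed("dog bites man", use_position=False)
b = embed("man bites dog", use_position=False)
# Without position, "dog" has the identical vector wherever it appears.
same_dog = torch.equal(a[0], b[2])

a_pos = embed("dog bites man")
b_pos = embed("man bites dog")
# With position mixed in, "dog" at slot 0 differs from "dog" at slot 2.
dog_differs = not torch.equal(a_pos[0], b_pos[2])

print(same_dog, dog_differs)
```

Both flags should come out true: without position the two sentences are the same three vectors in a different order, and with it the model can tell who bit whom.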

Step 2: the transformer block — attention, then MLP

The middle of the pipeline is one block repeated over and over, and each block has just two working parts. Self-attention is the communication step: every token looks at every other token and pulls in whatever context is relevant to it. Then a feed-forward network (an MLP) is the thinking step: each token, on its own, processes what it just gathered — no cross-token chatter. Communicate, then compute. That two-beat rhythm is the whole architecture.

// Inside one transformer block

Token vectors in (one vector per token)
→ Self-attention (tokens exchange context)
→ Feed-forward MLP (each token thinks alone)
→ Updated vectors (same shape as the input)
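For the curious, the communication step can be sketched in a few lines. This is single-head scaled dot-product attention in the style of the 2017 paper, with assumptions worth flagging: the projection matrices `W_q`, `W_k`, and `W_v` are random stand-ins for learned weights, and there is no mask and no multi-head split.

```python
import torch

torch.manual_seed(0)
n_tokens, d_model = 5, 16
x = torch.randn(n_tokens, d_model)            # one vector per token

# Random projections standing in for learned weights (an assumption).
W_q = torch.randn(d_model, d_model) / d_model ** 0.5
W_k = torch.randn(d_model, d_model) / d_model ** 0.5
W_v = torch.randn(d_model, d_model) / d_model ** 0.5

def self_attention(x):
    q, k, v = x @ W_q, x @ W_k, x @ W_v       # queries, keys, values
    scores = q @ k.T / d_model ** 0.5         # (n, n): relevance of token j to token i
    weights = scores.softmax(dim=-1)          # each row sums to 1
    return weights @ v, weights               # each token gets a weighted mix of values

out, weights = self_attention(x)
print(out.shape)       # same shape as the input
print(weights.shape)   # one weight per pair of tokens
```

Each row of `weights` is one token's look around the table: how much of every other token's information it pulls in.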

One more detail that makes deep stacks possible: residual connections. Each sub-layer adds its output to the incoming representation instead of replacing it. Think of it as each layer scribbling notes in the margin of a shared document rather than rewriting the document from scratch. Nothing important gets accidentally erased, and the network stays trainable even when it's a hundred layers deep.

Step 3: stack it deep

Because each block outputs vectors with exactly the same shape it received, blocks stack like LEGO. Small models use around a dozen blocks; frontier models use many dozens. Roughly speaking, early layers settle local things like grammar and word identity, while later layers handle increasingly abstract relationships — though the truth is messier than that tidy story.
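The LEGO property is easy to check. The sketch below stacks PyTorch's built-in `nn.TransformerEncoderLayer` (an attention-plus-MLP block) six times; the sizes are toy values picked for illustration, not taken from any real model.

```python
import torch
import torch.nn as nn

d_model, n_heads, n_blocks = 64, 4, 6    # toy sizes (assumed)

# norm_first=True selects the pre-normalization layout.
stack = nn.Sequential(*[
    nn.TransformerEncoderLayer(
        d_model, n_heads, dim_feedforward=4 * d_model,
        batch_first=True, norm_first=True,
    )
    for _ in range(n_blocks)
]).eval()                                 # eval() switches off dropout

x = torch.randn(1, 10, d_model)           # 1 sequence, 10 tokens
with torch.no_grad():
    y = stack(x)

print(y.shape == x.shape)                 # shape is preserved through every block
```

Changing `n_blocks` to 12 or 96 requires no other edits, which is the whole point: depth is a dial, not a redesign.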

Step 4: predict the next token

After the final block, the vector at the last position is compared against every token in the vocabulary, producing one score per token (the logits). A softmax turns scores into probabilities, and a sampler picks one — that's your next token. Append it, run the pipeline again, and repeat until done. That loop, covered in how LLMs actually work, is text generation.
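The generation loop itself is short. In this sketch `toy_model` is a hypothetical, untrained stand-in for a real transformer (just an embedding followed by a vocabulary projection), so the output is random token IDs; the point is the shape of the loop, not the quality of the text.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, d_model = 50, 32

# Untrained stand-in for a transformer: token IDs in, one logit per vocab entry out.
toy_model = nn.Sequential(
    nn.Embedding(vocab_size, d_model),
    nn.Linear(d_model, vocab_size),
)

@torch.no_grad()
def generate(ids, n_new, temperature=1.0):
    for _ in range(n_new):
        logits = toy_model(ids)[-1]                         # scores at the last position
        probs = torch.softmax(logits / temperature, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)   # sample one token
        ids = torch.cat([ids, next_id])                     # append, then repeat
    return ids

prompt = torch.tensor([3, 14, 15])
out = generate(prompt, n_new=5)
print(out.tolist())
```

Lowering `temperature` sharpens the probabilities toward the top-scoring token; raising it flattens them and makes the samples more varied.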

A transformer block in code

Here's the part nobody tells beginners: the core of this world-changing architecture fits on one screen. This is a working transformer block in PyTorch — attention, MLP, residual connections, normalization. It's simplified (no causal mask, no dropout) but the skeleton is the real thing.

transformer_block.py (Python)

import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # the MLP is wider in the middle — typically 4x the model dimension
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        # communicate: every token gathers context from every other token
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out                  # residual: add, don't replace
        # compute: each token processes what it gathered, independently
        x = x + self.mlp(self.norm2(x))   # residual again
        return x

x = torch.randn(1, 10, 512)  # 1 sequence, 10 tokens, 512 numbers per token
block = TransformerBlock()
print(block(x).shape)        # torch.Size([1, 10, 512]) — same shape out

Notice the last line: same shape out as in. That's why stacking works: you can chain 12 or 96 of these and the data flows through unchanged in format, enriched in content. A production model differs in scale, not in kind: a bigger d_model, more heads, more blocks, a causal mask so tokens can't peek at the future, and an embedding layer at the bottom plus a vocabulary projection at the top. Andrej Karpathy's nanoGPT defines a full GPT-2-class model in a model file of roughly 300 lines (the training loop is a separate file of similar size), and reading it is the single best follow-up to this article.

What it replaced

It's easier to appreciate the transformer when you see the old world side by side. RNNs and LSTMs weren't bad engineering — they were the state of the art for years — but they had a structural ceiling the transformer simply doesn't.

// RNN era vs transformer era

RNN / LSTM (pre-2017)                     | Transformer
Reads one token at a time                 | Sees all tokens at once
Memory fades over distance                | Direct line to any token
Sequential training, can't parallelize    | Trains in parallel on GPUs
Hit a scaling wall                        | Kept improving with scale

The right column's last row is the one that mattered. The transformer wasn't just better at translation benchmarks; it was the first architecture where spending more on data and compute reliably bought a better model. That property underpins the entire modern AI industry.

Going deeper

Causal masking and the three families. The original 2017 transformer had two halves: an encoder that read the input and a decoder that wrote the output. Modern LLMs like GPT and Claude are decoder-only: a mask inside attention stops each token from seeing tokens to its right, so the model can be trained to predict every next token in a document simultaneously — one of the great efficiency tricks in ML. Encoder-only models (BERT) drop the mask and read bidirectionally, which suits search and classification. The trade-offs are mapped out in encoder vs decoder models.
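A quick way to see the causal mask working is to change the last token and check that nothing earlier moves. The sketch below uses `nn.MultiheadAttention` with a boolean `attn_mask`, where `True` means "not allowed to attend"; the sizes and random inputs are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_tokens, d_model = 6, 32
attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True).eval()

# True above the diagonal: position i may NOT look at any position j > i.
causal_mask = torch.triu(
    torch.ones(n_tokens, n_tokens, dtype=torch.bool), diagonal=1
)

x = torch.randn(1, n_tokens, d_model)
x_changed = x.clone()
x_changed[0, -1] = torch.randn(d_model)       # alter only the final token

with torch.no_grad():
    out, _ = attn(x, x, x, attn_mask=causal_mask, need_weights=False)
    out_changed, _ = attn(x_changed, x_changed, x_changed,
                          attn_mask=causal_mask, need_weights=False)

# Earlier positions never saw the final token, so their outputs are unchanged.
earlier_unchanged = torch.allclose(out[0, :-1], out_changed[0, :-1], atol=1e-5)
# The final position attends to itself, so its output does change.
last_changed = not torch.allclose(out[0, -1], out_changed[0, -1], atol=1e-5)
print(earlier_unchanged, last_changed)
```

Because no position can see its own future, every position in a training document is a valid "predict the next token" exercise at once, which is the efficiency trick described above.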

The residual stream view. Interpretability researchers flip the mental model: instead of layers transforming data, picture one shared "stream" of information flowing upward, with every attention head and MLP reading from and writing into it. Attention heads become routing devices that copy information between token positions; MLPs become lookup-and-compute units. This framing, developed in Anthropic's circuits research, is the foundation of most current work on understanding what's happening inside these models.

Where the parameters actually live. Attention gets the fame, but in a standard block the MLP holds roughly two-thirds of the weights. A useful caricature: attention moves information around; MLPs store and apply the knowledge. Mixture-of-experts models push this further by replacing each dense MLP with many parallel "experts" and a router that activates only a few per token — most parameters sit idle on any given forward pass, which is how frontier models grow huge without proportionally growing compute.
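The two-thirds figure can be checked by counting weights. This sketch assumes the same layout as the block shown earlier in the article (standard multi-head attention plus an MLP that widens to 4x `d_model`) and ignores embeddings and normalization layers.

```python
import torch.nn as nn

def n_params(module):
    """Total number of trainable numbers in a module."""
    return sum(p.numel() for p in module.parameters())

d_model, n_heads = 512, 8
attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
mlp = nn.Sequential(
    nn.Linear(d_model, 4 * d_model),
    nn.GELU(),
    nn.Linear(4 * d_model, d_model),
)

attn_params = n_params(attn)   # about 4 * d_model^2: Q, K, V and output projections
mlp_params = n_params(mlp)     # about 8 * d_model^2: two d_model-by-4*d_model matrices
mlp_share = mlp_params / (attn_params + mlp_params)

print(attn_params, mlp_params, round(mlp_share, 3))
```

The ratio comes straight from the 4x widening: roughly 8 parts MLP to 4 parts attention. A model with a different expansion factor would land on a different split.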

The quadratic tax. Every token attending to every other token means attention cost grows with the square of sequence length: double the context, quadruple the attention work. This is the deep reason long context is expensive. Production inference leans on the KV cache, which stores each earlier token's keys and values so they aren't recomputed for every new token (trading GPU memory for speed), and on FlashAttention, which reorders the computation around the GPU memory hierarchy to get exact attention much faster. Positional encoding schemes like RoPE, rather than the original sinusoidal signals, are a big part of how models stretch to very long contexts.
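A minimal sketch of the KV cache idea, under simplifying assumptions: a single attention head, one layer, and random matrices `W_q`, `W_k`, `W_v` standing in for learned weights. Real inference stacks keep one such cache per layer and per head.

```python
import torch

torch.manual_seed(0)
d = 16
W_q, W_k, W_v = (torch.randn(d, d) / d ** 0.5 for _ in range(3))

def attend(q, K, V):
    """One query vector attends over all available keys and values."""
    weights = (q @ K.T / d ** 0.5).softmax(dim=-1)
    return weights @ V

tokens = torch.randn(8, d)                 # 8 token vectors arriving one at a time

# Without a cache: recompute keys and values for the whole history at every step.
no_cache = []
for t in range(1, len(tokens) + 1):
    history = tokens[:t]
    K, V = history @ W_k, history @ W_v    # t rows of work, repeated every step
    no_cache.append(attend(history[-1] @ W_q, K, V))

# With a cache: compute keys and values for the new token only, then append.
K_cache, V_cache = torch.empty(0, d), torch.empty(0, d)
with_cache = []
for x in tokens:
    K_cache = torch.cat([K_cache, (x @ W_k).unsqueeze(0)])
    V_cache = torch.cat([V_cache, (x @ W_v).unsqueeze(0)])
    with_cache.append(attend(x @ W_q, K_cache, V_cache))

same = torch.allclose(torch.stack(no_cache), torch.stack(with_cache), atol=1e-4)
print(same)

# Pairwise scores in one full pass: doubling the length quadruples the count.
print(1_000 ** 2, 2_000 ** 2)
```

Both loops produce the same outputs; the cached version simply stops redoing old work, at the price of keeping `K_cache` and `V_cache` in memory for the whole conversation.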

Open problems. The quadratic tax has inspired a wave of challengers — state-space models like Mamba, linear-attention variants, and hybrid stacks that mix attention layers with cheaper ones. None has dethroned the transformer yet; the ecosystem's tooling, hardware optimizations, and accumulated training know-how all assume this architecture. Nine years after Attention Is All You Need, the most interesting question in the field is still open: is the transformer a stepping stone, or the final word on sequence modeling?

FAQ

Why are LLMs called transformers?

Because they're all built on the transformer architecture from the 2017 paper Attention Is All You Need. The "T" in GPT literally stands for Transformer (Generative Pre-trained Transformer). Claude, Gemini, and Llama are variants of the same blueprint — different sizes and training recipes, same core design.

What's the difference between a transformer and an LLM?

The transformer is the architecture — the blueprint. An LLM is one thing you can build with it: a very large transformer trained on enormous amounts of text. The same blueprint also builds image models (Vision Transformers) and speech models (Whisper), which are transformers but not LLMs.

Why did transformers replace RNNs and LSTMs?

Two reasons. RNNs read one token at a time, so distant context faded from memory and training couldn't be parallelized. Transformers let every token attend to every other token directly and process the whole sequence at once, which made GPU-scale training practical. Once scaling worked, RNNs couldn't compete.

Do I need to understand transformer math to build with LLMs?

No. The working mental model — text becomes tokens, tokens exchange context through attention, a deep stack of identical blocks refines them, the top predicts the next token — is enough to reason correctly about context limits, latency, cost, and most weird model behavior. The matrix calculus is only required if you're training or modifying models.

How many layers does a transformer model have?

It varies with scale. GPT-2's smallest version used 12 transformer blocks; GPT-3 used 96. When a model card says "layers," it almost always means how many times the attention-plus-MLP block is repeated. More blocks (and wider ones) is the main way models grow.

Is attention the only thing that matters in a transformer?

No, despite the paper's title. Attention routes information between tokens, but the feed-forward MLP layers hold roughly two-thirds of the parameters and do much of the actual "knowledge" work. Residual connections and normalization are also load-bearing — remove them and deep stacks fail to train.
