AI/TLDR

Self-Attention vs Cross-Attention: What's the Difference?

Clearly separate self-attention (a sequence looking at itself) from cross-attention (one sequence looking at another), and see where each appears in real architectures.

Advanced · 9 min read · Updated 2026-06-13

In plain English

Both self-attention and cross-attention are the same machine — the attention mechanism that lets one position in a sequence pull information from other positions. The only thing that changes between them is where the information comes from. Self-attention lets a sequence look at itself. Cross-attention lets one sequence look at a different one.

[Figure: Self vs Cross-Attention illustration. Source: astconsulting.in]

Picture a person editing a single paragraph. To decide what the word "it" refers to, they glance back and forth at the other words in that same paragraph. That re-reading of one text against itself is self-attention: every word is allowed to weigh every other word in the same text.

Now picture a translator writing a French sentence while a finished English sentence sits open on the desk beside them. For each French word they produce, they glance over at the relevant English words to stay faithful to the source. The text they are writing is looking at a separate text they are reading. That is cross-attention — one sequence (the output being built) attends to another sequence (the source).

Why it matters

If you only ever use a modern chat LLM, you might think attention is one thing. But the original transformer had both kinds, doing two different jobs, and the distinction explains a lot about how real architectures are wired.

  • Self-attention builds understanding of a single sequence. It is how a model resolves "it" to the right noun, links a verb to its subject, or notices that "bank" means a riverbank because "river" appeared earlier. Every position gets a context-aware representation built from its neighbours.
  • Cross-attention connects two different things. It is how a translation model keeps its French output tied to the English input, how an image-captioning model lets each generated word look at the relevant image regions, and how multimodal models ground text in images, audio, or other modalities.
  • Knowing which is which tells you how a model is shaped. When you read that a model is "decoder-only" or "encoder-decoder," you are really being told which attention blocks it contains. Decoder-only models (most chat LLMs) use self-attention only; encoder-decoder models add a cross-attention bridge between the two stacks.

For a builder, the practical payoff is reading architecture diagrams and papers without getting lost. The moment you can tell whether an attention block is pulling from the same stream or a different one, the whole picture of how information flows through a model snaps into focus.

How it works

Attention works by turning each token into three vectors: a query (what am I looking for?), a key (what do I offer?), and a value (what information do I carry?). Every query is compared against every key to produce attention weights, and those weights are used to mix the values into a new representation. (For the full mechanism, see how attention works.)
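The query/key/value step can be sketched in a few lines. This is a minimal illustration, not a production layer: the embedding size, the token count, and the names `to_q`, `to_k`, and `to_v` are assumptions chosen for the example, and the projections are randomly initialised rather than trained.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

d_model = 8                      # illustrative embedding size (assumption)
x = torch.randn(4, d_model)      # stand-in for 4 token embeddings

# Three learned projections (randomly initialised here) produce Q, K, V.
to_q = torch.nn.Linear(d_model, d_model, bias=False)
to_k = torch.nn.Linear(d_model, d_model, bias=False)
to_v = torch.nn.Linear(d_model, d_model, bias=False)

q, k, v = to_q(x), to_k(x), to_v(x)

# Compare every query with every key, then normalise each row into weights.
scores = q @ k.transpose(-2, -1) / (d_model ** 0.5)   # shape (4, 4)
weights = F.softmax(scores, dim=-1)                    # each row sums to 1

# Mix the values using those weights: one new vector per token.
out = weights @ v                                      # shape (4, d_model)
```

Each row of `weights` says how much one token draws from every other token, and `out` is the resulting context-aware representation.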

The single knob that separates self- from cross-attention is which sequence the Q, K, and V come from.

Self-attention: Q, K, V all from one sequence

In self-attention, the queries, keys, and values are all computed from the same input sequence. Token 5 forms a query and compares it against the keys of tokens 1 through N (the whole sequence, including itself), then blends in their values. The sequence is, in effect, talking to itself.

Cross-attention: Q from one sequence, K and V from another

In cross-attention, the queries come from one sequence (call it the target — the text being generated) while the keys and values come from a different sequence (the source — e.g. the encoder's output). Each target token asks, "which parts of the source are relevant to me?" and pulls information across the gap. The Q stream and the K/V stream are two separate things.

That is the entire difference. The math — scaled dot-product attention — is identical in both cases. Only the wiring of where Q, K, and V are read from changes. Below, the same function is called two ways.

Same op, different inputs (Python):

```python
import torch
import torch.nn.functional as F

def attention(q, k, v):
    # q: (target_len, d), k/v: (source_len, d) -- identical math either way
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    weights = F.softmax(scores, dim=-1)
    return weights @ v

# SELF-attention: q, k, v all come from the SAME sequence x
x = torch.randn(10, 64)          # one sequence, 10 tokens
self_out = attention(x, x, x)    # x looks at x

# CROSS-attention: queries from the target, keys/values from the source
target = torch.randn(7, 64)      # sequence being generated (7 tokens)
source = torch.randn(10, 64)     # a DIFFERENT sequence (encoder output)
cross_out = attention(target, source, source)  # target looks at source
```

Notice the shapes. In self-attention the query and key sequences have the same length (10 and 10). In cross-attention they can differ (7 queries, 10 keys/values) — the target and source are independent sequences that needn't be the same size.

Side by side

The contrast laid out across the dimensions that actually differ:

| Aspect | Self-attention | Cross-attention |
| --- | --- | --- |
| Query (Q) comes from | the sequence itself | the target sequence |
| Key/Value (K/V) come from | the same sequence | a different (source) sequence |
| Number of sequences | one | two |
| Q length vs K/V length | always equal | can differ |
| Typical job | build context within a sequence | fuse one sequence into another |
| Where it lives | encoder blocks, decoder self-attn blocks | the encoder→decoder bridge, multimodal fusion |
| The math itself | scaled dot-product | scaled dot-product (same) |

Where each shows up in real models

Both kinds appear together in the classic encoder-decoder transformer used for machine translation. Walking through one translation makes the roles concrete.

  • Encoder self-attention. The English source sentence attends to itself, so each English word gets a representation that knows its full sentence context.
  • Decoder self-attention. The French words generated so far attend to themselves (with masking, so a token can't peek at future tokens). This keeps the output fluent and grammatical on its own terms.
  • Encoder-decoder cross-attention. Here is the bridge. Each French position forms a query and attends over the English encoder's keys and values, so every word it produces stays faithful to the source. Remove this block and the decoder would write fluent French with no idea what it was supposed to be translating.
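The three blocks above can be sketched with one shared attention function. This is a simplified illustration under stated assumptions: random tensors stand in for embedded English and French tokens, and the learned projections, residual connections, layer norms, and feed-forward layers of a real transformer are left out.

```python
import torch
import torch.nn.functional as F

def attention(q, k, v, mask=None):
    # Scaled dot-product attention; mask is True where attending is allowed.
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    if mask is not None:
        scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

torch.manual_seed(0)
english = torch.randn(10, 64)   # stand-in for 10 embedded source tokens
french = torch.randn(7, 64)     # stand-in for 7 target tokens so far

# 1. Encoder self-attention: the source attends to itself, unmasked.
memory = attention(english, english, english)

# 2. Decoder self-attention: the target attends to itself, causally masked.
causal = torch.tril(torch.ones(7, 7, dtype=torch.bool))
dec = attention(french, french, french, mask=causal)

# 3. Cross-attention: target queries read the encoder's output.
bridged = attention(dec, memory, memory)
```

Only step 3 takes its keys and values from a different sequence than its queries; steps 1 and 2 differ from each other only in the mask.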

Decoder-only LLMs: self-attention only

Most modern chat models (the GPT and Claude families) are decoder-only. They have no separate encoder and no cross-attention at all — just stacks of masked self-attention. The prompt and the response live in one sequence, so the model only ever needs to look at itself. This is partly why decoder-only models became dominant: dropping the cross-attention bridge makes the architecture simpler and easier to scale. See encoder vs decoder models for the full comparison.
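A small sketch of the "one sequence" idea, assuming random tensors as stand-ins for embedded prompt and response tokens and omitting learned projections. The helper name `causal_self_attention` is made up for this example.

```python
import torch
import torch.nn.functional as F

def causal_self_attention(x):
    # One sequence attends to itself; position i sees positions 0..i only.
    n = x.size(0)
    scores = x @ x.transpose(-2, -1) / (x.size(-1) ** 0.5)
    allowed = torch.tril(torch.ones(n, n, dtype=torch.bool))
    scores = scores.masked_fill(~allowed, float("-inf"))
    return F.softmax(scores, dim=-1) @ x

torch.manual_seed(0)
prompt = torch.randn(5, 32)      # stand-in for 5 embedded prompt tokens
response = torch.randn(3, 32)    # stand-in for 3 generated tokens

# Prompt and response are concatenated: there is no second sequence.
sequence = torch.cat([prompt, response], dim=0)   # shape (8, 32)
out = causal_self_attention(sequence)

# Changing the response cannot affect the prompt positions, because the
# mask stops earlier tokens from reading later ones.
altered = torch.cat([prompt, torch.randn(3, 32)], dim=0)
out_altered = causal_self_attention(altered)
prompt_unchanged = torch.allclose(out[:5], out_altered[:5])
```

The response positions still read the prompt, but they do so through the same masked self-attention block rather than a separate cross-attention bridge.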

Multimodal models: cross-attention as the bridge between senses

Cross-attention is the natural way to connect two different kinds of data. In many vision-language models, image features (from a vision encoder) become the keys and values, while the text tokens supply the queries — so each word can attend to the relevant regions of the picture. The same trick grounds text in audio, video, or other modalities: the text stream queries the other stream. Whenever you see information flow from one modality into another, cross-attention is usually doing the work.
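As a rough sketch of this pattern, the example below lets text states query image-patch features. The sizes (64-dimensional text states, 768-dimensional patch features, a 7x7 grid of patches) and the projection names are assumptions for illustration, not taken from any particular model.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_text, d_image = 64, 768                 # illustrative sizes (assumptions)
text = torch.randn(12, d_text)            # stand-in for 12 text-token states
patches = torch.randn(49, d_image)        # stand-in for 49 image-patch features

# Queries are projected from text; keys and values from image features.
# The K/V projections also map image features into the text dimension.
to_q = torch.nn.Linear(d_text, d_text, bias=False)
to_k = torch.nn.Linear(d_image, d_text, bias=False)
to_v = torch.nn.Linear(d_image, d_text, bias=False)

q, k, v = to_q(text), to_k(patches), to_v(patches)
weights = F.softmax(q @ k.transpose(-2, -1) / (d_text ** 0.5), dim=-1)
fused = weights @ v                       # shape (12, 64)

# For each text token, the index of the patch it weights most heavily.
top_patch = weights.argmax(dim=-1)        # shape (12,)
```

The two streams can have different lengths and different feature sizes; the key and value projections are what bring the image features into a space the text queries can be compared against.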

Common confusions

A handful of mix-ups trip up nearly everyone learning this distinction.

  • "Masked self-attention is a third type." It isn't. The causal mask used in decoders is still self-attention — one sequence looking at itself, just forbidden from looking forward in time. The Q, K, and V still come from the same sequence.
  • "Cross-attention means the model is multimodal." No. Plain text-to-text translation uses cross-attention between two text sequences. Cross-attention is about two sequences, not two modalities — multimodality is one common use, not the definition.
  • "Multi-head attention is different from self/cross." Different axis entirely. Multi-head describes splitting attention into several parallel heads; self/cross describes where Q, K, V come from. A block can be multi-head self-attention or multi-head cross-attention.
  • "Decoder-only models do cross-attention on the prompt." They don't. The prompt and generated tokens are one sequence, so the model uses self-attention over the combined sequence — there is no second sequence to cross to.
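Several of these points can be checked with PyTorch's built-in `torch.nn.MultiheadAttention`. The sketch below reuses a single module for all three calls purely for brevity; a real model would normally give its self-attention and cross-attention blocks separate weights. The tensor sizes are illustrative.

```python
import torch

torch.manual_seed(0)
mha = torch.nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)

x = torch.randn(1, 10, 64)        # (batch, tokens, features)
target = torch.randn(1, 7, 64)
source = torch.randn(1, 10, 64)

# Multi-head SELF-attention: query, key, and value are the same tensor.
self_out, self_w = mha(x, x, x)

# Masked self-attention: still self-attention, plus a causal mask.
# For a boolean attn_mask, True marks positions that may NOT be attended to.
causal = torch.triu(torch.ones(10, 10, dtype=torch.bool), diagonal=1)
masked_out, masked_w = mha(x, x, x, attn_mask=causal)

# Multi-head CROSS-attention: queries from target, keys/values from source.
cross_out, cross_w = mha(target, source, source)
```

The number of heads and the mask are both settings on the block, while self versus cross is decided entirely by which tensors are passed as query, key, and value.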

Going deeper

Once the core distinction clicks, a few subtler points are worth knowing.

The separate source stream has real consequences. Because the keys and values in cross-attention depend only on the source, not on anything the decoder has produced, they do not change from one decoding step to the next and can be cached and reused. In encoder-decoder translation, the encoder runs once over the source; each decoder layer projects that output to keys and values once, and every decoder step then attends to the same cached tensors. This is a meaningful efficiency win: you don't re-encode the source for each generated token.
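The caching idea can be sketched as follows. This is a single-layer, single-head illustration with randomly initialised projections; `cross_attend`, `cached_k`, and `cached_v` are hypothetical names, and random tensors stand in for real encoder output and decoder states.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d = 64
to_q = torch.nn.Linear(d, d, bias=False)
to_k = torch.nn.Linear(d, d, bias=False)
to_v = torch.nn.Linear(d, d, bias=False)

encoder_out = torch.randn(10, d)   # source encoded once (stand-in values)

# Project the source to keys and values ONCE, before decoding starts.
cached_k = to_k(encoder_out)
cached_v = to_v(encoder_out)

def cross_attend(decoder_state):
    # Only the query is new at each step; K and V come from the cache.
    q = to_q(decoder_state)
    scores = q @ cached_k.transpose(-2, -1) / (d ** 0.5)
    return F.softmax(scores, dim=-1) @ cached_v

# Each decoding step supplies one new target position as the query.
steps = [cross_attend(torch.randn(1, d)) for _ in range(5)]
```

Five decoding steps run here, but the source is projected to keys and values only once.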

Cross-attention is how adapters bolt new modalities onto frozen LLMs. A popular pattern keeps a pretrained, frozen language model and inserts new cross-attention layers that let text queries attend to image features. Only the new cross-attention layers (and the vision encoder's projection) are trained. The language model's self-attention stays untouched, so you graft on vision without retraining the whole model — a clean example of cross-attention as a connector.
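A toy version of this pattern is sketched below. Everything here is a hypothetical stand-in: `lm_block` plays the role of a pretrained self-attention layer, `new_cross_attn` is the inserted layer, the image features are assumed to be already projected to the text dimension, and the residual connection around the new layer is an assumption of this sketch rather than a detail from the text.

```python
import torch

torch.manual_seed(0)

# Hypothetical stand-ins for a "pretrained" block and a newly inserted layer.
lm_block = torch.nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
new_cross_attn = torch.nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)

# Freeze the language model: its self-attention weights stay untouched.
for p in lm_block.parameters():
    p.requires_grad = False

text = torch.randn(1, 12, 64)           # text hidden states
image = torch.randn(1, 49, 64)          # image features, already projected

h, _ = lm_block(text, text, text)       # frozen self-attention
v, _ = new_cross_attn(h, image, image)  # trainable: text queries, image K/V
out = h + v                             # residual around the new layer (assumption)

# Backpropagate a dummy loss and see which parameters received gradients.
out.sum().backward()
frozen_has_grad = any(p.grad is not None for p in lm_block.parameters())
new_has_grad = all(p.grad is not None for p in new_cross_attn.parameters())
```

After the backward pass, only the new cross-attention layer has gradients, so an optimizer would update it and leave the frozen block alone.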

Both kinds share attention's cost profile. Self-attention's cost grows with the square of the sequence length, which is why long contexts are expensive and why kernels like FlashAttention exist. Cross-attention's cost scales with target length times source length: quadratic when the two lengths grow together, but only linear in one of them when the other is held fixed. The same optimized kernels apply. The where-does-Q-K-V-come-from distinction doesn't change the underlying compute story.
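A back-of-the-envelope way to see the scaling is to count the entries in the score matrix, one per query-key pair. The helper `score_entries` and the sequence lengths below are illustrative assumptions; the count is per head and per layer and ignores the feature dimension.

```python
def score_entries(query_len, key_len):
    # One attention score per (query, key) pair.
    return query_len * key_len

# Self-attention: queries and keys share one length n, so cost grows as n^2.
self_1k = score_entries(1_000, 1_000)     # 1,000,000
self_2k = score_entries(2_000, 2_000)     # 4,000,000: doubling n quadruples it

# Cross-attention: target length times source length.
cross = score_entries(200, 2_000)         # 400,000
```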

Where to go next. To see how these blocks stack into a full model, read what is a transformer. To compare whole architectures rather than individual blocks, read encoder vs decoder models. And to ground all of it in how these models are built, see how LLMs are trained.

FAQ

What is the difference between self-attention and cross-attention?

In self-attention, the queries, keys, and values all come from the same sequence, so a sequence relates its own tokens to each other. In cross-attention, the queries come from one sequence (the target) while the keys and values come from a different sequence (the source), so one sequence attends to another. The underlying scaled dot-product math is identical; only the source of Q, K, and V changes.

What is cross-attention used for?

Cross-attention connects two different sequences. The classic use is the encoder-decoder bridge in machine translation, where the generated output attends to the encoded source sentence. It is also how multimodal models fuse modalities: for example, letting text tokens attend to image features so each word can look at the relevant image regions.

Do decoder-only LLMs like GPT use cross-attention?

No. Decoder-only models such as the GPT and Claude families use only self-attention. The prompt and the response live in a single sequence, so the model only ever looks at itself (with a causal mask). There is no separate source sequence to cross-attend to.

Is masked attention the same as cross-attention?

No. Masked (causal) attention is still self-attention — one sequence looking at itself, just prevented from peeking at future tokens. Cross-attention is defined by having a separate source sequence for the keys and values, which is a different thing entirely. A block can be masked self-attention or unmasked self-attention; neither is cross-attention.

Can self-attention and cross-attention appear in the same model?

Yes. The original encoder-decoder transformer uses both: self-attention in the encoder, masked self-attention in the decoder, and cross-attention as the bridge where the decoder attends to the encoder's output. Many translation and multimodal models combine all three of these blocks, which are still only two kinds of attention: the masked decoder block is self-attention with a causal mask.

Further reading