AI/TLDR

What Is the 'Lost in the Middle' Problem in Long-Context Models?

Learn why models skim the middle of long prompts and where to position critical information so it actually gets used.

INTERMEDIATE · 9 MIN READ · UPDATED 2026-06-12

In plain English

"Lost in the middle" is the tendency of a language model to recall information near the start and end of a long prompt much better than information buried in the middle. Two facts that are equally important to your question get treated very unequally: the one near an edge gets used, the one in the dead center often gets skimmed past — even when the model technically "read" both.

Here is the everyday version. You hand a coworker a 40-page report and ask them to find one detail. If that detail is on page 1 or page 40, they nail it. If it is on page 20, sandwiched between dozens of similar-looking pages, they get tired, skim, and miss it. The information was there. Their attention just thinned out in the middle. Models do the same thing, and the bigger the document, the wider that soft, easy-to-miss middle gets.

This is not the same as running out of room. Your prompt can fit comfortably inside the context window — well under the limit — and the model can still quietly ignore the middle of it. Fitting is necessary; it is not sufficient.

Why it matters

Modern models advertise enormous context windows — as of mid-2026, frontier models like Claude Opus 4.x, Google's Gemini 3.x Pro, GPT-5.x, and DeepSeek's latest all accept roughly a million tokens or more, with a couple stretching toward two million. The marketing implies you can dump everything in and the model will use all of it equally. Lost in the middle is the reason that assumption quietly breaks.

  • RAG pipelines. If you stuff 20 retrieved chunks into the prompt and the answer lives in chunk 11, the model may never surface it — no matter how good your retriever was.
  • Long documents. Asking a question about the middle of a contract, a transcript, or a codebase is exactly where recall is weakest.
  • Multi-turn chat. Earlier instructions sink into the middle of a growing conversation and get forgotten, so the model 'drifts' from what you told it 30 messages ago.
  • Agents. A long tool-call history pushes the original goal into the murky middle, and the agent loses the plot.

The practical sting: a model can score 99% on a simple recall test and still fail your real workload, because real workloads put critical facts in the worst possible spot. Anyone building on top of LLMs (RAG, agents, coding assistants) has to design around this, not assume it away. Lost in the middle is closely related to what happens when you exceed the context window, but it bites before you ever hit the limit.

How it works

The mechanism starts with attention, the operation that lets each token look at the other tokens in the context (in decoder-only models, the ones that came before it) and decide what to focus on. In principle attention has no built-in preference for any position: a token can attend to a distant position as easily as a nearby one. In practice, trained models develop strong positional biases. Two of them push information out of the middle.

Two biases that hollow out the middle

  • Primacy bias (the beginning). The very first tokens — your system prompt, the top of a document — get disproportionate attention. Many models develop strong 'attention sinks' on early tokens during training, so the start stays vivid.
  • Recency bias (the end). Next-token prediction is trained to lean on what came just before the prediction point. Tokens near the end are freshest and most influential — which is why putting your actual question last works so well.

The middle gets neither boost. It is too far from the start to ride primacy and too far from the end to ride recency, so its attention weight thins out. Plot accuracy against position and you get the now-famous U.

How do researchers measure this? The standard probe is the needle in a haystack test: hide one out-of-place fact (the "needle") at a known position inside a long block of filler (the "haystack"), then ask the model to retrieve it. Sweep the needle across every position and every context length, and you get a heat map of where recall holds and where it collapses.
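
The sweep described above can be sketched in a few lines. This is an illustrative harness, not a standard benchmark implementation: `ask_model` is a hypothetical callable standing in for whatever API you use, length is counted in filler sentences rather than tokens for simplicity, and scoring is a plain substring match.

```python
from typing import Callable


def build_haystack(filler: list[str], needle: str, depth: float) -> str:
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end)."""
    idx = round(depth * len(filler))
    return " ".join(filler[:idx] + [needle] + filler[idx:])


def sweep(
    ask_model: Callable[[str], str],  # hypothetical: prompt in, answer text out
    filler: list[str],
    needle: str,
    question: str,
    expected: str,
    lengths: list[int],
    depths: list[float],
) -> dict[tuple[int, float], bool]:
    """Return {(length, depth): retrieved?} for every combination."""
    results = {}
    for n in lengths:
        for d in depths:
            haystack = build_haystack(filler[:n], needle, d)
            prompt = f"{haystack}\n\nQuestion: {question}"
            answer = ask_model(prompt)
            # Crude scoring: did the expected string appear in the answer?
            results[(n, d)] = expected.lower() in answer.lower()
    return results
```

Plotting the resulting dictionary with length on one axis and depth on the other gives the heat map. Real evaluations usually repeat each cell several times and use stricter scoring than a substring check.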

Effective context vs. the number on the box

The single most useful idea here is the gap between a model's advertised context window and its effective context window. Advertised is how many tokens you are allowed to send. Effective is how many you can send before recall and reasoning quietly degrade. They are not the same number, and the difference is large.

Term              | What it means                                 | Who sets it
Advertised window | The hard token limit the API accepts          | The vendor's spec sheet
Effective window  | Length where multi-fact recall stays reliable | Benchmarks like RULER, in practice
The gap           | Tokens that fit but get skimmed               | Lost in the middle

As of mid-2026, independent long-context benchmarks keep finding the same pattern: most frontier models hold strong recall through only a fraction of their advertised window on multi-needle and multi-hop tasks — often well under the full million-plus tokens they accept. A handful of the very strongest reasoning models hold up notably better across the full window, but no one should assume a model uses 100% of what it accepts. The lesson isn't 'pick the biggest number' — it's 'measure the effective window for your task.'
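
One way to turn "measure the effective window" into a number is sketched below. The 0.9 accuracy threshold is an assumption you should tune for your task, and the sample figures are made up for illustration; they are not measurements of any real model.

```python
def effective_window(accuracy_by_length: dict[int, float], threshold: float = 0.9) -> int:
    """Largest tested length where accuracy is still >= threshold,
    stopping at the first tested length that falls below it."""
    effective = 0
    for length in sorted(accuracy_by_length):
        if accuracy_by_length[length] < threshold:
            break
        effective = length
    return effective


# Hypothetical results from your own multi-needle test (tokens -> accuracy).
measured = {8_000: 0.99, 32_000: 0.97, 128_000: 0.91, 512_000: 0.74, 1_000_000: 0.62}
print(effective_window(measured))  # 128000
```

In this invented example the model accepts a million tokens but only clears the bar through 128K, which is the figure you would design around.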

Where to put the important stuff (with code)

Once you know the curve is U-shaped, the fixes are mostly about placement and length. You want critical content near an edge and the total payload short enough that there is barely a middle to lose.

  1. Put instructions at the top, the question at the bottom. Sandwich the bulk of your material between a clear system/instruction block and the actual user query.
  2. Reorder retrieved chunks to a 'V'. Don't dump 20 chunks in relevance order. Rerank, keep the top few, and place the very best at the start and end, the weakest in the middle.
  3. Retrieve fewer, better chunks. Every extra low-value chunk widens the middle. Aggressive reranking to 3–5 chunks usually beats stuffing 20.
  4. Compress. Summarize or trim filler so the signal-to-noise ratio rises and the prompt shrinks.
  5. Repeat the key instruction at the end. Cheap insurance: restate the most important constraint right before the question.
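
Step 3 (retrieve fewer, better chunks) might look like the sketch below. It assumes your reranker returns `(text, score)` pairs with higher scores meaning more relevant; `keep_top_k`, the cutoff of 0.5, and the sample scores are all hypothetical.

```python
def keep_top_k(scored_chunks: list[tuple[str, float]], k: int = 4, min_score: float = 0.0) -> list[str]:
    """Return up to k chunk texts, best first, dropping any below min_score."""
    ranked = sorted(scored_chunks, key=lambda pair: pair[1], reverse=True)
    return [text for text, score in ranked[:k] if score >= min_score]


# Made-up reranker output for illustration.
scored = [
    ("refund policy", 0.91),
    ("shipping times", 0.42),
    ("returns form", 0.88),
    ("company history", 0.05),
    ("warranty terms", 0.67),
]
print(keep_top_k(scored, k=3, min_score=0.5))
# ['refund policy', 'returns form', 'warranty terms']
```

The output is sorted best-to-worst, so it can be passed straight to a reordering step.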

Here is the reorder-to-a-V trick. Given chunks already sorted best-to-worst by a reranker, we interleave them so the strongest land on the outer edges and the weakest sit in the middle:

reorder.py (Python)
def reorder_for_edges(chunks_best_to_worst):
    """Place strongest chunks at the start/end, weakest in the middle.
    Input is sorted most-relevant first (e.g. from a reranker)."""
    front, back = [], []
    for i, chunk in enumerate(chunks_best_to_worst):
        # alternate: rank 0 -> front, rank 1 -> back, rank 2 -> front...
        (front if i % 2 == 0 else back).append(chunk)
    # back was filled in increasing-rank order; flip so the
    # 2nd-best ends up at the very end, weakest stays central.
    return front + back[::-1]


ranked = ["best", "2nd", "3rd", "4th", "5th"]  # from your reranker
print(reorder_for_edges(ranked))
# ['best', '3rd', '5th', '4th', '2nd']
# strongest at both ends, weakest ('5th'/'4th') buried in the middle

Then assemble the final prompt with the question last, so it rides the recency boost:

prompt.py (Python)
def build_prompt(instructions, ordered_chunks, question):
    context = "\n\n".join(f"[doc {i}] {c}" for i, c in enumerate(ordered_chunks))
    return (
        f"{instructions}\n\n"      # top: primacy
        f"Context:\n{context}\n\n"  # middle: where weak chunks live
        f"Question: {question}"     # bottom: recency
    )
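
Step 5 from the list (repeat the key instruction at the end) is a small variation on the same idea. The sketch below is one possible layout, with `build_prompt_with_reminder` and the "Reminder:" label chosen for illustration rather than taken from any library:

```python
def build_prompt_with_reminder(
    instructions: str,
    ordered_chunks: list[str],
    question: str,
    key_constraint: str,
) -> str:
    """Same layout as a top/middle/bottom prompt, with the most important
    constraint restated just before the question."""
    context = "\n\n".join(f"[doc {i}] {c}" for i, c in enumerate(ordered_chunks))
    return (
        f"{instructions}\n\n"            # top: full instructions
        f"Context:\n{context}\n\n"       # middle: supporting material
        f"Reminder: {key_constraint}\n"  # near the end: restated constraint
        f"Question: {question}"          # bottom: the actual query
    )
```

Restating only the single most important constraint keeps the tail of the prompt short, so the question itself still sits at the very end.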

Going deeper

Why does the U-shape persist even in models designed for long context? Part of the answer is how positional information is encoded. Most current LLMs use rotary position embeddings (RoPE), which encode where a token sits by rotating its query/key vectors by a position-dependent angle. When a model trained on, say, 8K tokens is stretched to 128K or 1M (via techniques like position interpolation or YaRN), positions are rescaled so that far more tokens are packed into the range the model saw in training, and neighboring positions become harder to tell apart. Long-range positions become blurry, and that blur tends to land hardest on the middle, which is far from both the strongly anchored start and the recency-favored end.
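
To make the "squeezing" concrete, here is a small sketch of the RoPE angle formula and plain linear position interpolation. It uses the commonly cited base of 10000 and an 8K-to-128K stretch as assumptions, and it does not model YaRN, which rescales different frequencies differently.

```python
def rope_angles(position: float, head_dim: int, base: float = 10000.0) -> list[float]:
    """Rotation angle (radians) for each 2-D pair of a query/key vector."""
    return [position * base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


def interpolated_position(position: int, trained_len: int, target_len: int) -> float:
    """Linear position interpolation: rescale a position into the trained range."""
    return position * trained_len / target_len


trained_len, target_len = 8_192, 131_072  # assumed 8K -> 128K stretch

# Two adjacent tokens deep inside the long context.
a = interpolated_position(60_000, trained_len, target_len)
b = interpolated_position(60_001, trained_len, target_len)
print(b - a)  # 0.0625

# In the fastest-rotating pair, neighbors used to differ by 1 radian;
# after interpolation they differ by 1/16 of that.
print(rope_angles(b, 64)[0] - rope_angles(a, 64)[0])  # 0.0625
```

The sketch only shows that adjacent positions end up much closer together after stretching. Whether that translates into weaker recall at a given depth is an empirical question for the model in front of you.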

There is also the attention sink phenomenon: models learn to dump a large, near-constant share of attention onto the first few tokens (often the start-of-sequence token) as a kind of no-op pressure valve. That entrenches primacy. Combined with the autoregressive training objective that rewards leaning on recent tokens, you get strong anchors at both ends and a starved middle by construction — not by accident.

Where the field is heading

The deeper takeaway for builders: treat 'fits in the window' and 'will actually be used' as two different questions. Run your own mini needle-in-a-haystack on representative data — plant a fact you know the answer to at several positions and lengths, and watch where recall falls off. That single experiment tells you the effective budget for your use case, which is the number you should actually design around. It also connects to why models hallucinate: when the real answer is lost in the middle, the model doesn't say 'I missed it' — it confidently fills the gap with a plausible guess.

FAQ

Do LLMs actually read the whole prompt?

Technically yes — every token is processed by attention. But 'processed' isn't 'used equally.' Models attend much more strongly to the start and end of a long prompt, so facts in the middle can be effectively skimmed and ignored even though the model 'saw' them. Reading and reliably using are two different things.

Where should I put the most important information in a prompt?

Near an edge. Put your core instructions at the very top and the actual question at the very bottom, with supporting material in between. If you have several key facts, place the strongest at both the start and the end of the context block and let weaker material sit in the middle, where recall is naturally weakest.

What is the needle in a haystack test?

It's a benchmark for long-context recall. You hide one specific fact (the 'needle') at a known position inside a long block of unrelated filler (the 'haystack'), then ask the model to retrieve it. By sweeping the needle across positions and context lengths, you can map exactly where a model's recall holds up and where it collapses.

Does a bigger context window fix the lost-in-the-middle problem?

No — it often makes it worse in practice. A larger window just gives you a wider middle for information to get lost in. The advertised window (tokens you can send) is usually much larger than the effective window (tokens where recall stays reliable). As of mid-2026, most frontier models hold strong multi-fact recall through only a fraction of their advertised length.

How do I fix lost in the middle in a RAG pipeline?

Retrieve fewer, higher-quality chunks; use a reranker to keep only the top 3–5; reorder them so the most relevant land at the start and end of the context; and compress filler. Don't dump 20 chunks in raw relevance order — that buries the best evidence in the dead center where the model is least likely to use it.
