AI/TLDR

What Is Contextual Retrieval? Adding Context to Chunks

You'll learn how prepending document-level context to each chunk fixes the 'orphaned chunk' problem and lifts retrieval accuracy.

INTERMEDIATE · 11 MIN READ · UPDATED 2026-06-13

In plain English

A normal RAG pipeline slices your documents into small passages called chunks, turns each one into an embedding, and stores it for search. The problem: a chunk pulled out of a long report often forgets where it came from. A passage might say "The company grew revenue 3% over the previous quarter" — but which company? Which quarter? The surrounding pages had the answer; the chunk on its own does not.

Contextual retrieval fixes this by giving every chunk a short note explaining its situation before you embed it. You prepend one or two sentences like "This chunk is from ACME Corp's Q2 2024 earnings filing; it discusses quarterly revenue." Now the chunk carries its own context. When someone searches for "ACME Q2 revenue growth," this passage finally matches — because the words it was missing are now attached to it.

Think of a box of old photographs with no labels. A single photo of a beach tells you almost nothing — whose holiday, what year, which country? Someone who writes a caption on the back of each photo ("Maria's trip to Portugal, summer 2019") makes the whole box searchable. Contextual retrieval is that caption-writer, run automatically over every chunk in your knowledge base.

Why it matters

Retrieval is the part of RAG that most often fails quietly. The model can only answer from chunks the retriever actually finds: if the right passage never surfaces, no amount of clever prompting saves the answer. Contextual retrieval targets one of the most common causes of missed chunks: passages that lost the words a searcher would use to find them.

The orphaned-chunk problem

When you split a document, references that pointed to earlier text get stranded. These orphaned references show up constantly:

  • Pronouns and back-references: "it," "this method," "the company," and "as described above" point to something that lives in a different chunk.
  • Implicit subjects: a paragraph deep in a contract about "the tenant" never repeats whose contract it is.
  • Shared headings: a row in a financial table means nothing once the column headers and the year are three chunks away.
  • Section context: a troubleshooting step assumes you read the section title that named the product.

Each of these makes a chunk semantically poorer than it should be. Its embedding ends up pointing at a vague region of meaning, so a real user query that names the company or the quarter sails right past it. The chunk is in your database; it just never wins the similarity contest.

A builder cares because this is recall you are leaving on the table. The fix doesn't require a fancier embedding model, a bigger vector store, or a smarter retriever. It just makes each chunk describe itself honestly before indexing, and that lifts the share of questions where the correct passage actually shows up in the top results.

How it works

Contextual retrieval is an ingestion-time technique. It changes nothing about how you query — it only changes what you store. You add one extra step between chunking and embedding: for each chunk, ask a cheap LLM to write a short summary that situates the chunk inside its full document, then prepend that summary to the chunk text.

Generating the situating summary

For each chunk you send the LLM two things: the whole document (or a large slice of it) and the one chunk you want to describe. You ask it for a 1–2 sentence summary whose only job is to place that chunk in context — never to answer a question or add facts. The prompt looks roughly like this:

The contextualizing prompt:

```text
Here is the full document:
<document>
{{WHOLE_DOCUMENT}}
</document>

Here is one chunk taken from it:
<chunk>
{{CHUNK_TEXT}}
</chunk>

Write a short (1-2 sentence) context that situates this chunk
within the overall document, so it can be retrieved on its own.
State what document and section it comes from and what it covers.
Answer with the context only, nothing else.
```

The model returns something like "From ACME Corp's Q2 2024 10-Q, Financial Results section: this passage reports quarterly revenue and its change versus Q1." You then store context + "\n\n" + original_chunk as the text you embed and index. The original chunk stays intact; you only prepended a header.

Why it lifts recall

The combined text now contains the named entities, dates, and section labels a real query is likely to use. Its embedding shifts toward the right region of meaning, and — just as important — the added words also help keyword search. That is why contextual retrieval pairs naturally with hybrid search: you index the contextualized chunk in both a vector index and a keyword (BM25) index, so an exact term like "10-Q" or "ACME" can match even when the embedding is fuzzy. The two methods reinforce each other on the same enriched text.
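The paragraph above does not say how the vector results and the BM25 results get merged into one list. One common choice is reciprocal rank fusion (RRF), and the sketch below assumes it. The function name `rrf_fuse`, the chunk IDs, the two example rankings, and the constant `k=60` are illustrative assumptions, not something this article prescribes.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk IDs with reciprocal rank fusion.

    Each list contributes 1 / (k + rank) per chunk (rank starts at 1),
    so a chunk ranked well by both retrievers rises toward the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID so output is stable.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical results for "ACME Q2 revenue growth" from the two indexes,
# both built over the same contextualized chunk text.
vector_hits = ["c7", "c2", "c9"]  # embedding-similarity order
bm25_hits = ["c2", "c4", "c7"]    # keyword (BM25) order

fused = rrf_fuse([vector_hits, bm25_hits])
print(fused)
```

In this toy run the chunks found by both retrievers (`c2` and `c7`) end up ahead of the chunks found by only one, which is the reinforcement effect described above.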

A worked example

Take this raw chunk pulled from the middle of a 40-page employee handbook. On its own it is nearly unsearchable:

Raw chunk (orphaned):

```text
Employees may carry over up to 5 unused days into the next year.
Anything beyond that is forfeited unless approved by a manager.
```

What is this about? A human reading the section heading knows it is the vacation policy, but the chunk never says the word "vacation," "leave," or "PTO." Embed it as-is and a query like "how many vacation days can I roll over?" may never retrieve it. Now here is the contextualized version that gets stored instead:

Contextualized chunk (indexed):

```text
From the ACME 2024 Employee Handbook, "Paid Time Off" section:
this passage explains the annual carry-over rule for unused
vacation (PTO) days.

Employees may carry over up to 5 unused days into the next year.
Anything beyond that is forfeited unless approved by a manager.
```
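One rough way to see what the added context buys is to compare which query words each version of the chunk shares with the question. The sketch below uses a deliberately crude tokenizer (lowercased alphanumeric runs) as a stand-in for a real keyword index; the helper name `terms` is hypothetical, and a production BM25 tokenizer would behave differently.

```python
import re


def terms(text: str) -> set[str]:
    """Lowercased word tokens; a crude stand-in for a keyword tokenizer."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


query = "how many vacation days can I roll over?"

raw_chunk = (
    "Employees may carry over up to 5 unused days into the next year. "
    "Anything beyond that is forfeited unless approved by a manager."
)
context = (
    'From the ACME 2024 Employee Handbook, "Paid Time Off" section: '
    "this passage explains the annual carry-over rule for unused "
    "vacation (PTO) days."
)
contextualized = f"{context}\n\n{raw_chunk}"

# Query words that also appear in each version of the chunk.
shared_raw = terms(query) & terms(raw_chunk)
shared_ctx = terms(query) & terms(contextualized)

print("raw:", sorted(shared_raw))
print("contextualized:", sorted(shared_ctx))
```

The raw chunk shares only generic words with the question, while the contextualized version also shares "vacation", the one term that actually identifies the topic.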

The chunk now contains vacation, PTO, carry-over, and handbook — the exact words a real question would use. Here is the small ingestion loop that produces it. The retrieval code afterward is unchanged from ordinary RAG.

contextualize.py

```python
from anthropic import Anthropic

# Reads the API key from the ANTHROPIC_API_KEY environment variable.
client = Anthropic()

CONTEXT_PROMPT = (
    "Here is the full document:\n<document>\n{doc}\n</document>\n\n"
    "Here is a chunk from it:\n<chunk>\n{chunk}\n</chunk>\n\n"
    "Write 1-2 sentences situating this chunk in the document so it\n"
    "can be retrieved on its own. Answer with the context only."
)

def contextualize(document: str, chunk: str) -> str:
    """Return the chunk with an LLM-written situating context prepended."""
    msg = client.messages.create(
        model="claude-haiku-4-5",   # cheap, fast model is enough
        max_tokens=120,
        messages=[{
            "role": "user",
            "content": CONTEXT_PROMPT.format(doc=document, chunk=chunk),
        }],
    )
    context = msg.content[0].text.strip()
    return f"{context}\n\n{chunk}"   # prepend, keep original intact

# INGESTION: build the enriched text you embed AND keyword-index.
# `full_doc` (str) and `chunks` (list[str]) come from your own loader and splitter.
enriched = [contextualize(full_doc, c) for c in chunks]
# index `enriched` in both a vector store and a BM25 store, then
# query exactly as in any other RAG pipeline.
```
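The loop above resends the whole document with every chunk. Since the article later recommends prompt caching, here is a hedged sketch of how the request could be shaped for it using the `cache_control` field of the Anthropic Messages API. The helper name `build_request` and the split into two content blocks are assumptions made for illustration; minimum cacheable prompt length and cache lifetime vary by model, so check the provider's documentation before relying on it.

```python
def build_request(document: str, chunk: str) -> dict:
    """Build keyword arguments for client.messages.create(**kwargs).

    The document sits in its own content block marked as cacheable, so
    calls for later chunks of the same document can reuse that prefix.
    The chunk-specific instructions follow in a second, uncached block.
    """
    return {
        "model": "claude-haiku-4-5",
        "max_tokens": 120,
        "messages": [{
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": (
                        "Here is the full document:\n"
                        f"<document>\n{document}\n</document>"
                    ),
                    # Everything up to and including this block is cached.
                    "cache_control": {"type": "ephemeral"},
                },
                {
                    "type": "text",
                    "text": (
                        f"Here is a chunk from it:\n<chunk>\n{chunk}\n</chunk>\n\n"
                        "Write 1-2 sentences situating this chunk in the "
                        "document so it can be retrieved on its own. "
                        "Answer with the context only."
                    ),
                },
            ],
        }],
    }


# Placeholder inputs; in real use these are the full document and one chunk.
request = build_request("(full handbook text)", "(one chunk of the handbook)")
# With a configured client: msg = client.messages.create(**request)
```

Keeping the document first and byte-for-byte identical across calls matters here, because a cache hit depends on the prefix matching exactly.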

Contextual retrieval vs semantic chunking

These two are easy to confuse because both aim to make chunks retrieve better — but they work on different axes. Semantic chunking is about where to cut: it tries to draw chunk boundaries along topic shifts so each chunk is internally coherent. Contextual retrieval is about what to do with the chunk after you cut it: enrich it with situating context. They are complementary, not rivals — you can split smartly and contextualize.

| Aspect | Semantic chunking | Contextual retrieval |
| --- | --- | --- |
| Question it answers | Where should the chunk boundaries go? | What context is each chunk missing? |
| What it changes | The split points between chunks | The text stored for each chunk |
| Main goal | Keep each chunk on one topic | Restore lost document-level context |
| Extra cost | Embedding/compute at split time | One LLM call per chunk at ingest |
| Helps keyword search? | Indirectly | Directly: adds searchable terms |
| Can combine with the other? | Yes | Yes |

If your chunks are already a sensible size, contextual retrieval usually buys you more than re-tuning split points. For the deeper tradeoffs on how to slice in the first place, see "Chunking strategies compared" and "Chunk size and overlap."

The cost tradeoff and when to use it

Nothing here is free. The whole technique trades more ingestion work for better retrieval, and you should be clear-eyed about that bill before turning it on for a million documents.

  • One LLM call per chunk. A corpus of 100,000 chunks means 100,000 generation calls during ingestion. With a small model and prompt caching this is cheap per call, but it is a real pre-processing cost, paid once per version of each document, and it adds latency to indexing.
  • Slightly larger index. Each stored chunk grows by a sentence or two, so embedding and storage costs tick up a little.
  • Re-runs on updates. Change a document and you must re-contextualize its chunks — the context summary depends on the whole document, so an edit can shift it.
  • Zero added cost at query time. This is the upside: retrieval and generation are exactly as fast and cheap as before. You paid once, up front.
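The bullets above can be turned into rough arithmetic. The sketch below estimates the input tokens billed for contextualizing one document, with and without prompt caching. Every number is hypothetical: the document size, the chunk size, and especially `cached_read_factor=0.1`, which is a placeholder discount rather than a quoted price. It also ignores output tokens and any surcharge a provider applies when writing to the cache.

```python
def ingestion_input_tokens(
    n_chunks: int,
    doc_tokens: int,
    chunk_tokens: int,
    cached_read_factor: float = 1.0,
) -> float:
    """Estimate billed-equivalent input tokens for one document.

    The first call pays full price for the document. Each later call pays
    `cached_read_factor` times the document's tokens (1.0 means no caching).
    Every call also pays for the chunk itself.
    """
    if n_chunks <= 0:
        return 0.0
    first_call = doc_tokens + chunk_tokens
    later_calls = (n_chunks - 1) * (cached_read_factor * doc_tokens + chunk_tokens)
    return float(first_call + later_calls)


# Hypothetical document: 20,000 tokens split into 50 chunks of 400 tokens.
no_cache = ingestion_input_tokens(50, 20_000, 400)
with_cache = ingestion_input_tokens(50, 20_000, 400, cached_read_factor=0.1)

print(f"without caching: {no_cache:,.0f} input tokens")
print(f"with caching:    {with_cache:,.0f} input tokens")
```

Under these assumptions the uncached run bills about 1,020,000 input tokens and the cached run about 138,000, which shows why resending the full document for every chunk dominates the cost unless it is cached.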

Use contextual retrieval when chunks are full of back-references and shared context — long reports, contracts, financial filings, technical manuals, knowledge bases with deep sections. Skip it when each chunk is already self-contained, like a FAQ where every entry repeats its own subject, or short standalone product blurbs. As always, measure before and after: run a real query set and check whether the correct chunk shows up in your top results more often once contextualized.

Going deeper

Contextual retrieval is one improvement in a stack of them, and it composes well with the rest of the retrieval pipeline. A few directions once the basics click.

Stack it with reranking. Contextualizing improves which chunks get recalled into the candidate set; a reranker then improves which of those candidates rise to the top. Retrieve broadly over contextualized, hybrid-indexed chunks, then rerank narrowly with a cross-encoder. The two stages fix different failure modes — recall vs precision — so using both compounds the gains rather than overlapping.

Index the context, but show the model the chunk. A subtle design choice: you embed and keyword-index the contextualized text so it is findable, but at generation time you can feed the model the original chunk (or the chunk plus its short context). The situating summary's job is retrieval, not necessarily the final prompt — though including a one-line context often helps the model understand the snippet too.
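One way to keep that choice open is to store the context and the original chunk as separate fields and derive the indexed text from them. The record below is a minimal sketch of that idea; the class name `StoredChunk`, its fields, and the chunk ID are hypothetical, not part of any particular vector store's schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredChunk:
    chunk_id: str
    context: str   # LLM-written situating sentence(s)
    original: str  # chunk text exactly as split from the document

    @property
    def indexed_text(self) -> str:
        """Text sent to the embedding model and the keyword index."""
        return f"{self.context}\n\n{self.original}"

    def prompt_text(self, include_context: bool = False) -> str:
        """Text shown to the generating model after retrieval."""
        return self.indexed_text if include_context else self.original


record = StoredChunk(
    chunk_id="handbook-0042",
    context=(
        'From the ACME 2024 Employee Handbook, "Paid Time Off" section: '
        "this passage explains the annual carry-over rule for unused "
        "vacation (PTO) days."
    ),
    original="Employees may carry over up to 5 unused days into the next year.",
)

print(record.indexed_text)   # what gets embedded and keyword-indexed
print(record.prompt_text())  # what the model sees by default
```

Because the two fields are stored separately, switching between "chunk only" and "chunk plus context" at generation time is a flag rather than a re-index.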

Document-level vs section-level context. Sending the entire document for every chunk is wasteful on very long files and can dilute the summary. A common refinement is to situate a chunk within its immediate section or a sliding window of neighbors rather than the whole document — cheaper, and often sharper, when documents are huge.
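The sliding-window variant can be sketched in a few lines. The helper below is an assumption about how such a window might be built (the name `neighbor_window` and the default radius of 2 are illustrative); its output would take the place of the whole document in the contextualizing prompt.

```python
def neighbor_window(chunks: list[str], index: int, radius: int = 2) -> str:
    """Join the chunk at `index` with up to `radius` neighbors on each side."""
    if not 0 <= index < len(chunks):
        raise IndexError(f"chunk index {index} out of range")
    start = max(0, index - radius)
    end = min(len(chunks), index + radius + 1)
    return "\n\n".join(chunks[start:end])


# Toy chunks standing in for a long document's splits.
chunks = [f"chunk {i}" for i in range(10)]

window = neighbor_window(chunks, index=5, radius=2)
print(window)  # chunks 3 through 7, separated by blank lines
```

A window this narrow loses document-level facts such as the title or the filing period, so it can make sense to send the title or heading path alongside it.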

Related ideas worth knowing. Contextual chunk headers (deterministically prepending the document title and heading path, no LLM needed) are a cheaper cousin that catches the easy cases. Sentence-window and parent-document retrieval attack the same orphaning problem differently: they retrieve on small units but return a larger surrounding span to the model. And contextualized chunks combined with hybrid search are a common configuration in production systems.
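The contextual chunk headers mentioned above need no model call at all. Here is a minimal sketch, assuming your splitter already tracks each chunk's document title and heading path; the function names, the "Document:" and "Section:" labels, and the "Benefits" parent heading are all hypothetical.

```python
def chunk_header(title: str, heading_path: list[str]) -> str:
    """Build a deterministic header from the title and heading path."""
    trail = " > ".join(h.strip() for h in heading_path if h.strip())
    header = f"Document: {title.strip()}"
    return f"{header}\nSection: {trail}" if trail else header


def with_header(title: str, heading_path: list[str], chunk: str) -> str:
    """Prepend the header to the chunk, leaving the chunk text unchanged."""
    return f"{chunk_header(title, heading_path)}\n\n{chunk}"


text = with_header(
    "ACME 2024 Employee Handbook",
    ["Benefits", "Paid Time Off"],  # hypothetical heading path
    "Employees may carry over up to 5 unused days into the next year.",
)
print(text)
```

This catches cases where the missing words live in a title or heading, but it cannot resolve a pronoun or describe what a passage covers, which is where the LLM-written context still earns its cost.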

The durable lesson is the same one that governs all of RAG: your system is only as good as what the retriever surfaces. Contextual retrieval is valuable precisely because it improves that surfacing at the cheapest possible layer, the chunk's own text, without touching the generation model, the embedding model, or the query path.

FAQ

What is contextual retrieval in RAG?

It is an ingestion-time technique where you use an LLM to write a short summary situating each chunk inside its full document, then prepend that summary to the chunk before embedding and indexing it. This restores context the chunk lost when it was split out, so retrieval finds it more reliably. It changes only what you store, not how you query.

How is contextual retrieval different from semantic chunking?

Semantic chunking decides where to cut a document so each chunk stays on one topic. Contextual retrieval decides what to add to a chunk after cutting — a situating context line. One changes boundaries, the other enriches text. They are complementary and can be used together.

Does contextual retrieval actually improve accuracy?

It improves recall — the chance the correct chunk appears in your retrieved results — by adding the named entities, dates, and section labels that real queries use. It helps most on documents full of back-references like reports, contracts, and manuals. You should still measure it on your own query set, and it stacks well with hybrid search and reranking.

What does contextual retrieval cost?

Mainly one extra LLM call per chunk during ingestion, plus a slightly larger index. Using a small, fast model and prompt caching keeps the per-call cost low, but it is a real one-time pre-processing expense and adds indexing latency. There is no added cost at query time — retrieval and generation run exactly as before.

Do I embed the context plus the chunk, or just the context?

You prepend the short context to the original chunk and embed (and keyword-index) the combined text. The original chunk stays intact; the context is just a header that makes the whole thing more findable. At generation time you can feed the model the chunk with or without its context line.

Should I use a big or small model to generate the context?

A small, cheap model is usually enough — you are asking it to describe where a chunk sits, not to reason or answer. Keep the prompt strictly descriptive so it never invents facts, and use prompt caching on the full document to drive the per-chunk cost down across a large corpus.

Further reading