AI/TLDR

What Is Chunking in RAG? Why Document Splitting Matters

You'll understand why documents must be split before embedding and how chunk boundaries shape everything downstream in a RAG system.

BEGINNER · 10 MIN READ · UPDATED 2026-06-11

In plain English

Chunking is the act of cutting your documents into smaller pieces — "chunks" — before a RAG system stores them. Instead of indexing a 40-page PDF as one giant blob, you slice it into a few hundred bite-sized passages, each a few sentences or paragraphs long. Those passages are what the system actually searches over and feeds to the model later.

Think of a cookbook. Nobody photocopies the entire book when they want to make pancakes — they tear out the one page with the pancake recipe. Chunking is tearing the book into pages in advance, so that when someone asks "how do I make pancakes?", you can hand them exactly the right page instead of the whole 400-page volume. The recipe is the same either way; what changed is how findable and how portable it is.

There are two reasons you can't skip this step. First, the retriever finds information by turning text into embeddings, lists of numbers (vectors) that capture meaning, and a single embedding for a whole document is a smeared average of everything in it, too blurry to pinpoint one fact. Second, even if you found the right document, you can't stuff all 40 pages into the model's context window on every question. Small, focused chunks fix both problems at once.

Why it matters

Chunking is the quietest, highest-leverage decision in a RAG pipeline. It happens once, up front, during ingestion — and every search and every answer afterward inherits its consequences. Get the boundaries right and a cheap retriever looks brilliant. Get them wrong and no reranker, no bigger model, and no clever prompt can fully recover, because the information the model needed was never sitting together in one retrievable piece.

Here's the failure that bites everyone. Suppose a policy reads: "Refunds are processed within 30 days. This does not apply to digital goods." If your chunker splits between those two sentences, the retriever might fetch the first one for the query "refund time for an ebook" — and the model confidently answers "30 days," which is exactly wrong. The fact wasn't hallucinated; it was amputated at ingestion. That single bad boundary becomes a wrong answer no downstream component knows to question.

Two opposite-sized mistakes drive most of the pain:

  • Chunks too big. A 3,000-word chunk has one embedding that averages many topics, so it matches everything weakly and nothing strongly. It also burns context-window space and money, and buries the one relevant sentence in noise the model has to read past.
  • Chunks too small. A single-sentence chunk loses the context that gave it meaning. "It increased 12% year over year" is unsearchable and unanswerable once it's torn away from the sentence naming what increased.
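
The "too big" failure is easy to see with a toy stand-in for embeddings. The sketch below uses bag-of-words count vectors and cosine similarity, which is an assumption made purely for illustration (a real embedding model behaves differently in detail), and the sample texts are invented. It only shows the direction of the effect: the same recipe scores lower against the query once it is buried in unrelated text.

```python
import math
from collections import Counter


def bow(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts. Not a real embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# A focused chunk, and the same text buried inside a multi-topic chunk.
focused = "pancake recipe mix flour eggs and milk then fry"
oversized = (
    focused
    + " roast chicken with lemon braise beef in wine"
    + " bake bread overnight whisk a vinaigrette"
)

query = bow("pancake recipe")
focused_score = cosine(query, bow(focused))
oversized_score = cosine(query, bow(oversized))

print(f"focused chunk:   {focused_score:.2f}")
print(f"oversized chunk: {oversized_score:.2f}")
```

The relevant words are identical in both chunks; only the amount of surrounding material changed, and that alone pulls the score down.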

Who should care? Anyone building RAG — a docs chatbot, an internal knowledge assistant, a support tool. What did chunking replace? The naive instinct to dump whole files into the model. That worked for one short document and fell apart the moment you had a thousand of them: you can't fit them all, and even if you could, accuracy drops as the window fills. Chunking is what makes a large, messy corpus retrievable one relevant slice at a time.

How it works

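Chunking lives in the ingestion stage, the offline pipeline that runs once when you load your data, long before any user asks a question. The flow is short and mechanical: load the raw documents, clean the text, split it into chunks, embed each chunk, and store the vectors with their metadata in the vector database.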

The chunker reads raw text and emits a list of passages. Two knobs control almost everything: chunk size (how long each piece is) and overlap (how much text neighboring chunks share). Overlap is the cheap insurance against the refund bug above — by repeating the last sentence or two of one chunk at the start of the next, a fact that lands on a boundary still appears whole somewhere.
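
To make the two knobs concrete, here is a minimal sketch that measures chunk size and overlap in sentences rather than tokens (an assumption chosen to keep the example readable). The function name `sentence_chunks` and the sample policy text are made up for illustration. It replays the refund example with and without overlap.

```python
import re


def sentence_chunks(text: str, size: int = 2, overlap: int = 1) -> list[str]:
    """Group sentences into chunks of `size`, repeating `overlap` sentences."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    step = size - overlap
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + size]))
        if start + size >= len(sentences):
            break  # this window already reached the end of the text
    return chunks


policy = (
    "Our store ships worldwide. Refunds are processed within 30 days. "
    "This does not apply to digital goods. Contact support for help."
)

no_overlap = sentence_chunks(policy, size=2, overlap=0)
with_overlap = sentence_chunks(policy, size=2, overlap=1)

for name, chunks in [("no overlap", no_overlap), ("overlap=1", with_overlap)]:
    print(name)
    for c in chunks:
        print("  -", c)
```

Without overlap, the rule and its exception land in different chunks. With one sentence of overlap, one chunk holds both, at the cost of a larger index.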

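Beyond the knobs, the real choice is strategy: the rule that decides where the cuts go. From crudest to smartest, the common options are fixed-size splitting, recursive splitting, structure-aware splitting (Markdown headings, code functions), and semantic chunking.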

The workhorse in practice is recursive character splitting: it tries to split on the biggest natural break first (double newlines = paragraphs), and only if a piece is still too long does it fall back to smaller separators (single newlines, then sentences, then words). This keeps related text together while respecting a size limit. LangChain's RecursiveCharacterTextSplitter implements exactly this, and LlamaIndex's SentenceSplitter follows the same idea with a preference for keeping sentences whole. Both libraries also have format-aware cousins that split Markdown by headings or source code by function.
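
The fallback order described above can be sketched in a few lines. This is an illustration of the idea, not the code of any library: `recursive_split`, the `SEPARATORS` tuple, and the character-based budget are all assumptions, and unlike production splitters it drops the separator at each cut point and adds no overlap.

```python
SEPARATORS = ("\n\n", "\n", ". ", " ")  # paragraphs, lines, sentences, words


def recursive_split(text: str, max_chars: int = 200,
                    separators: tuple[str, ...] = SEPARATORS) -> list[str]:
    """Split on the coarsest separator first; recurse only into oversized pieces."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # Nothing left to split on: hard-cut as a last resort.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = current + sep + part if current else part
        if len(candidate) <= max_chars:
            current = candidate  # still fits: keep merging neighbors
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= max_chars:
            current = part
        else:
            # This piece alone is too long: retry with the next, finer separator.
            chunks.extend(recursive_split(part, max_chars, finer))
    if current:
        chunks.append(current)
    return chunks


doc = (
    "Intro paragraph.\n\n"
    + "A long sentence about refunds. " * 10
    + "\n\nClosing paragraph."
)
pieces = recursive_split(doc, max_chars=80)
for p in pieces:
    print(len(p), repr(p))
```

Short paragraphs survive intact, and only the oversized middle paragraph is broken down to sentence level.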

Chunking in code

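Here's a simplified version with overlap in plain Python, no framework, so you can see there's no magic: just a loop that respects paragraph boundaries and a size budget. It implements only the first level of the recursion (paragraphs), so a paragraph that is too long on its own is kept as one oversized chunk: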

chunk.py (Python)
def chunk_text(text: str, max_tokens: int = 500, overlap: int = 60):
    # Rough token estimate: ~4 chars per English token. In production,
    # use your embedding model's real tokenizer instead.
    def n_tokens(s: str) -> int:
        return max(1, len(s) // 4)

    # Break on paragraph boundaries only. A full recursive splitter would
    # fall back to sentences, then words, for a paragraph that is too long.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]

    chunks, current = [], ""
    for para in paragraphs:
        candidate = (current + "\n\n" + para).strip()
        if n_tokens(candidate) <= max_tokens:
            current = candidate
        else:
            if current:
                chunks.append(current)
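            # Carry the tail of the last chunk forward as overlap so a fact
            # on the boundary still appears whole in the next chunk. Guard
            # overlap == 0: current[-0:] would copy the entire chunk.
            # Note the tail can push the new chunk slightly over max_tokens.
            tail = current[-overlap * 4:] if current and overlap > 0 else ""
            current = (tail + "\n\n" + para).strip()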
    if current:
        chunks.append(current)
    return chunks


# "policy.md" is a placeholder; point this at your own file.
with open("policy.md", encoding="utf-8") as f:
    doc = f.read()
for i, c in enumerate(chunk_text(doc)):
    print(f"--- chunk {i} ({len(c)} chars) ---\n{c}\n")

In a real project you wouldn't hand-roll this — you'd reach for a library. The same idea in LlamaIndex is two lines, and the chunks flow straight into embedding and your vector database:

ingest.py (Python)
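from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter

# chunk_size and chunk_overlap are measured in tokens.
splitter = SentenceSplitter(chunk_size=500, chunk_overlap=60)

# `documents` is a list loaded from your files. The "./data" folder is a
# placeholder; point it at your own directory.
documents = SimpleDirectoryReader("./data").load_data()
nodes = splitter.get_nodes_from_documents(documents)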

print(f"Split {len(documents)} docs into {len(nodes)} chunks")
# Each node carries the chunk text + metadata (source file, position).
# Next steps: embed each node, then upsert into the vector store.

Notice what each chunk carries besides text: metadata — the source filename and position. Keep that. It powers citations ("this came from page 12 of the handbook"), lets you filter searches by document, and is your only way to debug a bad answer back to the exact passage that caused it.
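
If you hand-roll your chunker, you have to attach that metadata yourself. A minimal sketch follows, assuming a made-up `Chunk` dataclass with just a source name and a position; real libraries track more fields, and the file names here are invented.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source: str  # file the chunk came from
    index: int   # position of the chunk within that file


def with_metadata(pieces: list[str], source: str) -> list[Chunk]:
    """Wrap raw chunk strings so each one remembers where it came from."""
    return [Chunk(text=p, source=source, index=i) for i, p in enumerate(pieces)]


def cite(chunk: Chunk) -> str:
    """Build a human-readable citation for an answer or a debug log."""
    return f"{chunk.source}, chunk {chunk.index}"


corpus = (
    with_metadata(["Refunds take 30 days.", "Not for digital goods."], "policy.md")
    + with_metadata(["Ships in 5 business days."], "shipping.md")
)

# Filter a search to one document, then cite the passage that was used.
policy_only = [c for c in corpus if c.source == "policy.md"]
print(cite(policy_only[1]))
```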

Picking a chunk size

There is no universal right size — it depends on your content and your questions. But there are sane starting points, and the trade-off is always the same: bigger chunks carry more context but match less precisely; smaller chunks match precisely but lose context.

| Content type | Starting chunk size | Why |
| --- | --- | --- |
| FAQ / short Q&A | ~150–300 tokens | Each entry is already a self-contained unit |
| Docs / articles / policies | ~400–600 tokens | A paragraph or two keeps an idea whole |
| Dense technical / legal text | ~600–1000 tokens | Meaning spans long, interlocking passages |
| Source code | by function / class | Split on structure, not arbitrary length |

Set overlap to roughly 10–20% of chunk size as a default — enough to catch boundary facts, not so much that you bloat the index with duplicated text. Then stop guessing and test. Build a small set of real questions with known correct answers, and check whether the retriever actually returns the chunk containing each answer. If it misses, your boundaries are the first suspect. This is exactly what RAG evaluation is for — chunking is not a set-and-forget decision, it's a dial you tune against measured retrieval quality.
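
A retrieval check like that can start very small. The sketch below measures a hit rate: the fraction of test questions for which the expected answer text shows up in the top-k retrieved chunks. The keyword-overlap retriever is a toy stand-in for your real one, and the questions, answers, and function names are invented for illustration.

```python
def keyword_retrieve(query: str, chunks: list[str], k: int = 1) -> list[str]:
    """Toy retriever: rank chunks by how many query words they share."""
    q = set(query.lower().split())
    ranked = sorted(chunks, key=lambda c: len(q & set(c.lower().split())), reverse=True)
    return ranked[:k]


def hit_rate(eval_set: list[tuple[str, str]], chunks: list[str], k: int = 1) -> float:
    """Fraction of questions whose expected answer appears in the top-k chunks."""
    hits = sum(
        any(answer in c for c in keyword_retrieve(question, chunks, k))
        for question, answer in eval_set
    )
    return hits / len(eval_set)


chunks = [
    "Refunds are processed within 30 days. This does not apply to digital goods.",
    "Shipping takes 5 business days within the EU.",
]

# Each entry pairs a question with text that must appear in a retrieved chunk.
eval_set = [
    ("refunds processed how fast", "30 days"),
    ("shipping within the eu", "5 business days"),
]

score = hit_rate(eval_set, chunks, k=1)
print(f"hit rate @1: {score:.2f}")
```

Rerun the same evaluation set after every change to chunk size, overlap, or strategy, and keep the setting that scores best on your own questions.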

Common pitfalls

  • Splitting blind on character count. Fixed-size splitting that ignores structure guillotines sentences and tables mid-row. Use a recursive/structural splitter so cuts land on natural breaks.
  • Zero overlap on flowing prose. With no overlap, any fact that straddles a boundary is broken in two and findable in neither half. Add 10–20% overlap unless your chunks are already self-contained (like FAQ entries).
  • Embedding the garbage. Navigation menus, cookie banners, page headers and footers get chunked and embedded right alongside real content, then surface as junk search hits. Clean the text before you chunk.
  • Throwing away structure. Flattening a Markdown doc to plain text drops the heading hierarchy that told you which section a paragraph belongs to. Prepend the section title to each chunk so an orphaned paragraph still knows its topic.
  • One size for everything. A 500-token default applied uniformly to FAQs, contracts, and code will be wrong for at least two of the three. Match the strategy to the content type.
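
The "throwing away structure" fix is simple to sketch. The function below splits Markdown on heading lines and prefixes every paragraph with its section title. The name `chunk_markdown_by_heading` and the bracketed prefix format are illustrative choices, and the sketch ignores edge cases such as `#` characters inside fenced code.

```python
def chunk_markdown_by_heading(md: str) -> list[str]:
    """Split Markdown on headings and prefix each paragraph with its section title."""
    chunks: list[str] = []
    title = ""
    body: list[str] = []

    def flush() -> None:
        # Emit every paragraph collected under the current heading.
        for para in "\n".join(body).split("\n\n"):
            para = para.strip()
            if para:
                chunks.append(f"[{title}] {para}" if title else para)
        body.clear()

    for line in md.splitlines():
        if line.startswith("#"):
            flush()
            title = line.lstrip("#").strip()
        else:
            body.append(line)
    flush()
    return chunks


md = (
    "# Refunds\n\n"
    "Refunds take 30 days.\n\n"
    "Not for digital goods.\n\n"
    "## Shipping\n\n"
    "Ships in 5 days."
)
sections = chunk_markdown_by_heading(md)
for s in sections:
    print(s)
```

The orphaned paragraph "Not for digital goods." now carries "Refunds" with it, so both the embedding and the model can tell what it refers to.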

Going deeper

The frontier of chunking is the realization that the unit you retrieve on and the unit you feed to the model need not be the same. Parent-document retrieval (also called small-to-big or auto-merging) embeds small, precise child chunks for the search step, then — once a child matches — hands the model its larger parent chunk or surrounding window for full context. You get the precision of small chunks and the completeness of big ones, dodging the core trade-off instead of splitting the difference.
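
Here is a minimal sketch of that search-small, return-big pattern. It assumes sentences as the child unit, whole passages as parents, and a toy word-overlap score in place of vector search; all function names are made up for illustration.

```python
def build_index(parents: list[str]) -> list[tuple[str, int]]:
    """Index each sentence (child) with the id of the parent chunk it came from."""
    return [
        (sentence.strip(), parent_id)
        for parent_id, parent in enumerate(parents)
        for sentence in parent.split(". ")
        if sentence.strip()
    ]


def overlap_score(query: str, text: str) -> int:
    """Toy relevance score: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve_parent(query: str, parents: list[str]) -> str:
    """Search over small children, but return the larger parent of the best match."""
    index = build_index(parents)
    _, parent_id = max(index, key=lambda item: overlap_score(query, item[0]))
    return parents[parent_id]


parents = [
    "Refunds are processed within 30 days. This does not apply to digital goods.",
    "Orders ship within 5 business days. Express shipping is available.",
]

context = retrieve_parent("are refunds processed quickly", parents)
print(context)
```

The query matches only the first sentence, yet the model receives the whole parent, exception included.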

Semantic chunking drops fixed sizes entirely: it embeds sentences, walks through the document, and starts a new chunk wherever consecutive sentences' embeddings diverge past a threshold — cutting on meaning shifts rather than character counts. It's more compute up front and not always worth it, but it shines on unstructured prose with no reliable paragraph breaks. A more radical variant, studied in the Dense X Retrieval work, indexes propositions — atomic, self-contained factual statements rewritten by an LLM — as the retrieval unit, trading ingestion cost for sharper retrieval.
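
The boundary rule behind semantic chunking can be sketched as follows. A real system would call a sentence-embedding model; `toy_embed` here is a word-count stand-in, and both it and the 0.2 threshold are assumptions picked only to make the example run without dependencies.

```python
import math
import re
from collections import Counter


def toy_embed(sentence: str) -> Counter:
    """Stand-in for a real sentence-embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic_chunks(sentences: list[str], threshold: float = 0.2,
                    embed=toy_embed) -> list[str]:
    """Start a new chunk where similarity between neighboring sentences drops."""
    if not sentences:
        return []
    chunks, current = [], [sentences[0]]
    for prev, nxt in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(nxt)) < threshold:
            chunks.append(" ".join(current))  # meaning shifted: close the chunk
            current = []
        current.append(nxt)
    chunks.append(" ".join(current))
    return chunks


sentences = [
    "Refunds are processed within thirty days.",
    "Refunds for digital goods are not processed.",
    "Shipping takes five business days.",
    "Express shipping takes two days.",
]
topics = semantic_chunks(sentences)
for t in topics:
    print("-", t)
```

The cut lands between the refund sentences and the shipping sentences, with no size limit involved. The threshold needs tuning for whichever embedding model you use.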

Contextual retrieval attacks the lost-context problem from another angle: before embedding each chunk, an LLM prepends a one-line summary situating it within the whole document ("This chunk is from the Q3 refund policy and concerns digital goods exceptions"). The chunk now carries its own context, so it matches queries even when the original text relied on something three paragraphs up. It costs an LLM call per chunk at ingestion — a price you pay once, offline.
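
Structurally, this is just a prefix step that runs before embedding. In the sketch below, `situate` stands for the LLM call, and `fake_situate` is a placeholder that simply reuses the document's first line, so the example runs without any model. Both names and the sample document are hypothetical.

```python
from typing import Callable


def contextualize(chunks: list[str], document: str,
                  situate: Callable[[str, str], str]) -> list[str]:
    """Prefix each chunk with a short context line before it is embedded."""
    return [f"{situate(document, chunk)}\n\n{chunk}" for chunk in chunks]


def fake_situate(document: str, chunk: str) -> str:
    """Placeholder for the LLM call: reuses the document's first line as context."""
    title = document.strip().splitlines()[0]
    return f"This chunk is from: {title}."


document = (
    "Q3 refund policy\n\n"
    "Refunds are processed within 30 days.\n\n"
    "This does not apply to digital goods."
)
raw_chunks = ["This does not apply to digital goods."]

ready_to_embed = contextualize(raw_chunks, document, fake_situate)
print(ready_to_embed[0])
```

In a real pipeline you would swap `fake_situate` for a function that sends the document and the chunk to your model and returns its one-line summary.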

Two production realities to plan for. First, re-chunking is expensive: changing your strategy means re-embedding and re-indexing the entire corpus, so treat chunk size and overlap as part of your schema and version them. Second, chunking interacts with everything downstream — your retriever, reranking, and even whether the model can cite sources cleanly all depend on chunk shape. The open problem is that the optimal boundary depends on the question, which you don't know at ingestion time. That uncertainty is exactly why agentic RAG — letting the model issue its own follow-up searches and merge results across chunks at query time — is a popular escape hatch from getting boundaries perfect in advance.
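
One lightweight way to version chunking settings is to hash them and store the hash next to every embedded chunk. The `ChunkingConfig` class and its field names below are illustrative assumptions, not a standard.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ChunkingConfig:
    strategy: str = "recursive"
    chunk_size: int = 500
    overlap: int = 60

    def version(self) -> str:
        """Stable short hash of the settings, to store with each chunk."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]


current = ChunkingConfig()
proposed = ChunkingConfig(chunk_size=400, overlap=80)

# A different version means the stored vectors were built with other settings
# and the corpus needs re-chunking and re-embedding.
needs_reindex = current.version() != proposed.version()
print(current.version(), proposed.version(), needs_reindex)
```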

FAQ

What is the difference between chunking and embedding in RAG?

Chunking splits documents into small passages; embedding turns each passage into a vector of numbers that captures its meaning. Chunking comes first and decides what text gets embedded — they're sequential steps in ingestion, not alternatives.

What is a good chunk size for RAG?

A common default for documents and articles is roughly 400–600 tokens with 10–20% overlap, but it depends on your content: short FAQs want smaller chunks, dense legal or technical text wants larger ones, and code should be split by function. Treat it as a dial to tune against measured retrieval quality, not a fixed rule.

What is chunk overlap and why does it matter?

Overlap means neighboring chunks share some text — the end of one chunk is repeated at the start of the next. It matters because a fact that lands exactly on a chunk boundary would otherwise be split in half and findable in neither chunk; overlap ensures it appears whole somewhere.

Do I still need to chunk documents if the model has a huge context window?

Yes. Even with a million-token window you usually have far more data than fits, every token costs money and latency, and answer accuracy tends to degrade as the window fills with noise. Chunking is what lets you find and feed the few relevant passages instead of everything.

Why not just embed the whole document as one chunk?

A single embedding for a long document is an average of every topic in it, so it matches many queries weakly and none strongly — you can't pinpoint one fact. Whole documents also don't fit cleanly into the context window. Small, focused chunks give precise matches and portable context.

Further reading