AI/TLDR

RAG Chunking Strategies Compared

You'll know the real trade-offs between fixed-size, recursive, semantic, and document-level chunking — and how to pick the right one for your corpus and query patterns.

INTERMEDIATE · 13 MIN READ · UPDATED 2026-06-12

In plain English

Every RAG system chunks its documents before embedding them, but how you draw those cut-lines changes retrieval quality more than almost any other single decision. There are four major strategies in common use today, each with a different answer to the question "where should I split?":

  • Fixed-size — cut every N tokens, regardless of content.
  • Recursive / structural — prefer natural language boundaries (paragraphs → sentences → words), falling back to size only when needed.
  • Semantic — embed sentences and cut wherever the meaning measurably shifts.
  • Document-level — treat whole documents as retrieval units, usually with a summary layer on top.

Think of it like portioning a report for a team of researchers. A fixed-size strategy is like a paper guillotine: fast, indiscriminate, cuts mid-sentence. A recursive strategy is like tearing along the perforations — it respects paragraphs and only trims when a section is too long. A semantic strategy reads a draft and marks topic changes with a highlighter before cutting. Document-level is handing each researcher the whole chapter with a one-paragraph summary on the cover.

Why the strategy choice matters

The chunking strategy is the single parameter the retriever can never compensate for. A bad embedding model can be swapped; a bad vector database can be migrated; but bad chunk boundaries are baked in at ingestion time and propagate to every query forever — unless you re-index the entire corpus.

The core tension every strategy navigates is precision vs context:

  • Smaller chunks match queries more precisely because their embedding represents one focused idea, but they strip context — a number without a unit, a claim without its qualifier.
  • Larger chunks carry enough context for the model to reason correctly, but their embedding blurs across multiple topics, so they score weakly against specific queries and may never surface.

The right strategy shifts depending on your content and query patterns. A FAQ corpus of 100-word entries needs a completely different approach than a library of 50-page technical whitepapers or a codebase of 300-line files. Getting this wrong at the outset means poor retrieval that no prompt engineering, reranker, or LLM upgrade will fix — because the evidence the model needed was never sitting together in one retrievable unit.

How each strategy works

The four strategies differ in what information they use to decide where to cut: a token count, structural separators, embedding similarity, or document boundaries.

Fixed-size chunking

Fixed-size chunking splits text every N tokens (or characters), optionally with an overlap window. It needs only a tokenizer — no language model, no structural parsing. That simplicity makes it the fastest option at ingestion, with no unpredictable behavior at scale.

The problem is that natural language does not respect token boundaries. A 512-token window lands on a sentence boundary only by chance, so most cuts fall mid-sentence. The resulting chunks are semantically noisy: the embedding must represent a partial idea at the start and a partial idea at the end, diluting the signal of what the chunk is actually about.

Fixed-size with LangChain (Python)
from langchain_text_splitters import CharacterTextSplitter

splitter = CharacterTextSplitter(
    chunk_size=512,        # in characters (use TokenTextSplitter for tokens)
    chunk_overlap=64,      # ~12% overlap
    separator="",          # no structural preference — pure size
)
chunks = splitter.split_text(document_text)
print(f"{len(chunks)} chunks, avg size: {sum(len(c) for c in chunks)//len(chunks)} chars")
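The sliding-window logic itself is small enough to sketch without a library. The version below is a minimal illustration, not LangChain's implementation: the function name `fixed_size_chunks` is hypothetical, and whitespace-separated words stand in for real tokenizer output.

```python
def fixed_size_chunks(tokens, chunk_size=512, overlap=64):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    stride = chunk_size - overlap  # how far the window advances each step
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


# Assumption: whitespace "tokens" stand in for a real tokenizer.
words = "one two three four five six seven eight nine ten".split()
for chunk in fixed_size_chunks(words, chunk_size=4, overlap=1):
    print(chunk)
```

Each window starts `chunk_size - overlap` tokens after the previous one, which is why the last token of one chunk reappears as the first token of the next in this toy run.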

Recursive / structural chunking

Recursive splitting uses a priority-ordered list of separators: first try to split on "\n\n" (paragraphs); if a piece is still too long, try "\n" (lines); then ". " (sentences); then " " (words). Each piece is recursively split until it fits the size budget. This keeps logically related text together at the natural level of granularity without ever needing to understand what the text means.

LangChain's RecursiveCharacterTextSplitter and LlamaIndex's SentenceSplitter both implement this pattern. Format-aware variants extend it with domain-specific separators: the Markdown splitter adds ## headings to the list; the code splitter uses function/class definitions as the primary boundary.

Recursive splitting with LangChain (Python)
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500,        # characters (≈ 400–500 tokens for English prose)
    chunk_overlap=150,      # ~10% overlap
    separators=["\n\n", "\n", ". ", " ", ""],  # priority list; the library default omits ". "
    length_function=len,    # swap for a real tokenizer in production
)

chunks = splitter.create_documents(
    texts=[document_text],
    metadatas=[{"source": "policy.pdf", "page": 1}],  # carry provenance
)
print(f"Split into {len(chunks)} chunks")
print(chunks[0].page_content[:200])  # inspect first chunk
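To make the fallback order concrete, here is a simplified, dependency-free sketch of the recursive idea. It is an illustration under stated assumptions, not the library's algorithm: `recursive_split` is a hypothetical name, sizes are in characters, overlap is omitted, and separators at chunk boundaries are dropped.

```python
def recursive_split(text, max_len=200, separators=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest separator present, recursing only on oversized pieces."""
    if len(text) <= max_len:
        return [text] if text.strip() else []
    for i, sep in enumerate(separators):
        if sep not in text:
            continue
        finer = separators[i + 1:]  # separators still available for oversized pieces
        chunks, current = [], ""
        for piece in text.split(sep):
            candidate = piece if not current else current + sep + piece
            if len(candidate) <= max_len:
                current = candidate  # keep merging small neighbours
                continue
            if current:
                chunks.append(current)
                current = ""
            if len(piece) <= max_len:
                current = piece
            else:
                chunks.extend(recursive_split(piece, max_len, finer))
        if current:
            chunks.append(current)
        return chunks
    # No separator left: fall back to a hard cut by size.
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]


sample = "aaa bbb\n\nccc ddd eee fff"
print(recursive_split(sample, max_len=10))
```

In the sample, the first paragraph fits the budget and is kept whole, while the second is too long and gets split again at word boundaries.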

Semantic chunking

Semantic chunking does not use size or structure as the primary signal — it uses meaning. The algorithm embeds every sentence individually, then walks through the document comparing each consecutive pair's cosine similarity. When the similarity drops past a threshold (a "breakpoint"), a new chunk begins. The result is variable-length chunks that each contain one coherent topic.

The breakpoint threshold is the main tuning knob. A percentile-based approach is more robust than a fixed value: compute the embedding distance between every pair of consecutive sentences, then cut wherever a distance exceeds a chosen percentile of those distances (for example the 90th or 95th). This adapts to documents with many vs few topic changes.

Semantic chunking with LangChain (Python)
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# breakpoint_threshold_type: "percentile" | "standard_deviation" | "interquartile"
splitter = SemanticChunker(
    embeddings=embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=90,  # cut at the top 10% largest gaps
)

chunks = splitter.create_documents([document_text])
print(f"{len(chunks)} semantic chunks")
for c in chunks[:3]:
    # Variable chunk size is expected — that's the point
    print(f"  {len(c.page_content.split())} words: {c.page_content[:80]}...")
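The percentile breakpoint rule can be sketched in plain Python. This is a simplified illustration rather than the SemanticChunker source: the function names are hypothetical, the sentence vectors are hand-written toy values instead of real embeddings, and every vector is assumed to be non-zero.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def percentile(values, pct):
    """Linearly interpolated percentile, with pct in [0, 100]."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def semantic_chunks(sentences, vectors, pct=90):
    """Start a new chunk wherever the consecutive-sentence gap exceeds the percentile."""
    if len(sentences) < 2:
        return [list(sentences)] if sentences else []
    gaps = [cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
    threshold = percentile(gaps, pct)
    chunks, current = [], [sentences[0]]
    for sentence, gap in zip(sentences[1:], gaps):
        if gap > threshold:  # meaning shifted: close the current chunk
            chunks.append(current)
            current = []
        current.append(sentence)
    chunks.append(current)
    return chunks


# Toy vectors: the first two sentences point one way, the last two another.
sents = ["Refunds take 7 days.", "Refunds need a receipt.", "Shipping is free.", "Shipping is tracked."]
vecs = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
print(semantic_chunks(sents, vecs))
```

Because the threshold is derived from the document's own gap distribution, the same `pct` value produces few cuts in a single-topic document and more in one that shifts topic often.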

The trade-off: semantic chunking needs one extra embedding per sentence at ingestion time, roughly 10-30x more compute than recursive splitting. A 500-page corpus might produce 50,000 sentences, which means 50,000 sentence embeddings (usually batched into far fewer API requests) or a non-trivial GPU job. That cost is paid once at build time, but it is real. Some recent benchmarks have found that well-tuned recursive splitting at 512 tokens matches or beats semantic chunking on structured content, while semantic chunking outperforms on dense, topic-shifting prose.

Document-level chunking

Document-level chunking treats each document (or major section like a chapter) as a single retrieval unit. It is the right choice when each document is short, self-contained, and semantically distinct — think a product FAQ, a news brief, or a standalone API reference page. It is wrong for long multi-topic documents where a single "document" chunk blurs together many ideas.

In practice, document-level chunking is rarely used alone. It is most powerful as the parent layer in a hierarchical retrieval scheme: the index contains fine-grained child chunks for search precision, but when a child chunk is matched, the system fetches and sends the full parent document (or a larger parent window) to the model. This is called parent-document retrieval or small-to-big retrieval, and it resolves the precision-vs-context trade-off rather than compromising on it.

Overlap, chunk size, and retrieval quality

Regardless of which strategy you choose, two knobs have the largest effect on end-to-end retrieval quality: chunk size and overlap. Getting these right matters more than which specific strategy you use.

How chunk size affects retrieval

Chunk size governs a precision-vs-context trade-off. Smaller chunks produce embeddings that tightly represent one idea, so a query that asks about that idea scores high. Larger chunks represent many ideas at once, so they tend to score moderately across a broader range of queries but rarely rank at the top for any specific one. The risk profile is asymmetric:

| Chunk size | Retrieval precision | Context quality | Common failure mode |
| --- | --- | --- | --- |
| ~100–200 tokens | High | Low (context stripped) | Answer is correct but the model lacks surrounding context to reason with it |
| ~300–512 tokens | Good balance | Good | Occasional topic bleed at boundaries |
| ~800–1500 tokens | Lower | High | Chunk matches weakly; relevant passage buried in noise the model reads past |
| Full document | Lowest | Maximum | Embedding averages all topics; relevant query rarely surfaces the document |

The practical default for most English-language article-style documents is 400–600 tokens with recursive splitting. For dense technical text (legal contracts, medical literature, specifications) where ideas span multiple paragraphs, 600–1000 tokens is more appropriate. For very short units like FAQ entries or code docstrings, 100–300 tokens avoids padding chunks with unrelated neighbours.

How overlap affects retrieval

Overlap copies the tail of one chunk onto the head of the next. It is cheap insurance against the most common chunking failure: a fact that lands on a boundary gets split in two and appears complete in neither half. With an overlap at least as long as the fact, the fact appears whole in at least one of the two chunks, so the model can still get the full text when that chunk is retrieved.

The right overlap size is relative to chunk size, not absolute. 10–15% overlap is a reliable default: on a 512-token chunk, that is 50–75 tokens — enough to capture a multi-sentence spanning thought without bloating the index. Going above 20% has diminishing returns and increases index size and retrieval cost proportionally.
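The index-size cost of overlap follows from simple arithmetic: each chunk advances by `chunk_size - overlap` tokens, so the chunk count grows by roughly a factor of 1 / (1 - overlap fraction). The helper below is an illustrative sketch with a hypothetical name, and the one-million-token corpus is a made-up figure.

```python
import math


def chunk_count(total_tokens, chunk_size, overlap_pct):
    """Number of sliding-window chunks needed to cover total_tokens."""
    overlap = round(chunk_size * overlap_pct / 100)
    stride = chunk_size - overlap  # tokens of new text contributed by each chunk
    if stride <= 0:
        raise ValueError("overlap must be smaller than the chunk size")
    if total_tokens <= chunk_size:
        return 1
    return 1 + math.ceil((total_tokens - chunk_size) / stride)


# Hypothetical corpus size, used only to compare overlap settings.
baseline = chunk_count(1_000_000, 512, 0)
for pct in (0, 10, 20, 40):
    n = chunk_count(1_000_000, 512, pct)
    print(f"{pct:>2}% overlap -> {n} chunks ({n / baseline:.2f}x the index)")
```

Running this for a few overlap values shows why the growth is gentle at 10% and noticeably steeper beyond 20%.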

| Overlap as % of chunk size | When to use it | Caveat |
| --- | --- | --- |
| 0% | Self-contained units (FAQ entries, product listings) | Any fact on a boundary is lost in two halves |
| 5–10% | Structured docs with clear paragraph breaks | Minimal safety net; fine if paragraphs are reliable boundaries |
| 10–20% | Most prose documents (the general default) | Good balance of boundary safety and index efficiency |
| >20% | Dense technical text where multi-sentence spans are common | Index grows; de-duplicate before reranking if overlap causes near-duplicate hits |

The chunk size sweet spot is query-dependent

One underappreciated insight from retrieval research: the optimal chunk size for precision retrieval is often smaller than the chunk size needed for complete answers. A query like "what is the refund window for digital goods?" is answered by one sentence — a 100-token chunk would retrieve it perfectly. But the full answer the model needs to give a safe response might span three paragraphs.

This is why parent-document retrieval has become the dominant production pattern: index small child chunks (128–256 tokens) for search precision, but when a child chunk scores in the top-k, return its parent context (512–1500 tokens) to the model. LangChain implements this as ParentDocumentRetriever under langchain.retrievers; LlamaIndex offers a comparable pattern through AutoMergingRetriever used with HierarchicalNodeParser. You get the precision of small chunks and the context completeness of large ones, at the cost of a slightly more complex ingestion pipeline.
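The small-to-big mechanism can be shown with a toy index. The sketch below is not either library's implementation: all function names are hypothetical, child size is measured in words, and a shared-word count stands in for vector similarity.

```python
def build_index(parents, child_size=8):
    """Return (child_text, parent_id) pairs; child_size is in words in this sketch."""
    index = []
    for parent_id, text in enumerate(parents):
        words = text.split()
        for start in range(0, len(words), child_size):
            index.append((" ".join(words[start:start + child_size]), parent_id))
    return index


def score(query, text):
    """Toy relevance: number of shared lowercase words (stand-in for embeddings)."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve_parents(query, parents, index, k=2):
    """Search over children, but return each matched child's parent once."""
    ranked = sorted(index, key=lambda item: score(query, item[0]), reverse=True)
    seen, results = set(), []
    for _, parent_id in ranked[:k]:
        if parent_id not in seen:  # several children may share one parent
            seen.add(parent_id)
            results.append(parents[parent_id])
    return results


docs = [
    "Refunds for digital goods are accepted within seven days of purchase. "
    "Physical goods follow the standard thirty day policy.",
    "Shipping takes three to five business days. Express shipping is available at checkout.",
]
index = build_index(docs)
print(retrieve_parents("refund window digital goods", docs, index))
```

The de-duplication step matters in practice: when several top-k children belong to the same parent, sending that parent once keeps the prompt from filling up with repeated text.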

Decision guide: which strategy fits your use case

The right strategy is determined by two things: the structure of your source documents and the shape of your queries. Here is a decision guide grounded in common production patterns:

| Scenario | Recommended strategy | Key settings |
| --- | --- | --- |
| Prototyping / quick PoC | Fixed-size | 512 tokens, 10% overlap |
| Documentation site / knowledge base | Recursive (Markdown-aware splitter) | 400–600 tokens, 15% overlap |
| News articles / blog posts / unstructured prose | Semantic chunking | Percentile breakpoint, 90th |
| Legal / financial contracts | Recursive, larger chunks | 800–1000 tokens, 15% overlap |
| Source code | Code-aware splitter (by function) | Per-function, no arbitrary size |
| Short FAQ or product catalogue | Document-level or minimal fixed | Each entry as-is |
| Mixed corpus needing precision + context | Hierarchical parent-document | Child 200 tokens, parent 1000 tokens |

One rule holds across all scenarios: measure before you commit. Build a set of 20–50 representative questions with known correct answers from your corpus, run the retriever, and check whether the right chunk appears in the top-3 results. A RAG evaluation harness like RAGAS makes this straightforward and gives you a retrieval recall number you can track across strategy changes.
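The top-3 check described above reduces to a hit-rate calculation. The following sketch assumes a hypothetical `retrieve` callable that returns ranked chunk ids for a question; the evaluation pairs and the fake retriever are invented purely for illustration.

```python
def hit_rate_at_k(eval_set, retrieve, k=3):
    """Fraction of questions whose expected chunk id appears in the top-k results."""
    if not eval_set:
        return 0.0
    hits = 0
    for question, expected_id in eval_set:
        top_ids = retrieve(question)[:k]
        hits += expected_id in top_ids  # True counts as 1
    return hits / len(eval_set)


# Hypothetical evaluation set: (question, id of the chunk that answers it).
eval_set = [
    ("What is the refund window?", "policy-04"),
    ("How long does shipping take?", "shipping-01"),
]

# Stand-in retriever: a real one would query the vector index.
fake_results = {
    "What is the refund window?": ["policy-02", "policy-04", "faq-11"],
    "How long does shipping take?": ["faq-03", "faq-07", "policy-01", "shipping-01"],
}
print(hit_rate_at_k(eval_set, lambda q: fake_results[q], k=3))
```

Tracking this one number before and after a chunking change is usually enough to tell whether the change helped retrieval, independent of how the LLM phrases its answer.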

Going deeper

The current frontier moves past the four strategies above toward approaches that blur the line between chunking and retrieval itself.

Proposition-based retrieval (Dense X Retrieval, arXiv 2312.06648) rewrites documents into atomic factual statements, such as "The refund window for digital goods is seven days", using an LLM at ingestion time, then indexes those propositions as the retrieval unit. Each retrieval unit is a single verifiable claim with no context bleed. This dramatically improves precision for fact-checking and question-answering but requires a generation pass over every passage in the corpus, which makes ingestion far more expensive than plain splitting.

Contextual retrieval (Anthropic, 2024) prepends a short situating context to each chunk before embedding it: "This chunk describes the digital-goods exception to the standard refund policy in section 4.2." The chunk now carries its own context, so it matches queries that depend on surrounding sentences the chunk itself does not contain. Anthropic reported a 35% reduction in retrieval failures from contextual embeddings alone and 49% when combined with contextual BM25. The cost is one LLM call per chunk at ingestion, paid once, offline.
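Mechanically, the technique is a string prefix applied before embedding. The sketch below illustrates only that step and is not Anthropic's code: `contextualize` and `fake_situate` are hypothetical names, and the placeholder function stands in for the LLM call that would normally read the whole document.

```python
def contextualize(chunks, doc_title, situate):
    """Prefix each chunk with a short context string before it is embedded."""
    out = []
    for chunk in chunks:
        context = situate(doc_title, chunk).strip()
        # Fall back to the bare chunk if no context was produced.
        out.append(f"{context}\n\n{chunk}" if context else chunk)
    return out


def fake_situate(doc_title, chunk):
    """Placeholder for an LLM call (assumption: a real one sees the full document)."""
    return f"From '{doc_title}':"


chunks = ["Digital goods can be refunded within seven days."]
for text in contextualize(chunks, "Refund policy", fake_situate):
    print(text)
```

Only the text sent to the embedding model changes; the original chunk can still be stored separately and returned to the LLM unmodified if that suits the pipeline better.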

Late chunking flips the order: embed the full document first (preserving cross-sentence attention), then pool the token-level embeddings into chunk-sized windows. Because the embeddings were computed with full document context, each chunk embedding already "knows" what came before it. This eliminates the context-stripping problem without requiring LLM preprocessing, but requires an embedding model that exposes token-level output — not all do.
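The pooling step can be illustrated with plain lists. This sketch assumes token-level vectors are already available from a long-context embedding model (here they are hand-written toy values), uses mean pooling as one reasonable choice, and uses hypothetical function names throughout.

```python
def mean_pool(vectors):
    """Average a non-empty list of equal-length vectors, dimension by dimension."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def late_chunk(token_vectors, spans):
    """Pool document-level token embeddings into one vector per (start, end) span."""
    return [mean_pool(token_vectors[start:end]) for start, end in spans]


# Toy token embeddings for a four-token "document" (assumption: produced in one
# forward pass over the whole document, so each already reflects its neighbours).
token_vectors = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
spans = [(0, 2), (2, 4)]  # chunk boundaries expressed as token index ranges
print(late_chunk(token_vectors, spans))
```

The chunk boundaries themselves can come from any of the strategies above; late chunking changes when the boundaries are applied, not how they are chosen.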

For teams building at scale, the most important shift is treating chunking as a measurable, versioned pipeline stage rather than a one-time decision. Use a retrieval evaluation harness to track retrieval recall across strategy changes. Version your chunk schema alongside your embedding model, because changing one without the other invalidates the index. And consider agentic RAG for queries that span multiple chunks — letting the model issue follow-up searches to fill gaps is often a better investment than trying to perfect chunk boundaries for every possible query in advance.

FAQ

Which RAG chunking strategy is best for most use cases?

Recursive character splitting at 400–600 tokens with 10–15% overlap is the practical default for most document-based RAG applications. It respects natural language boundaries, adds near-zero compute overhead, and outperforms fixed-size splitting in retrieval quality without the ingestion cost of semantic chunking. Start here, measure retrieval recall, then switch strategies if results are poor.

When does semantic chunking actually outperform recursive chunking?

Semantic chunking outperforms on unstructured, topic-shifting prose that lacks reliable paragraph or sentence structure — think journalistic long-form, transcripts, or poorly formatted documents where \n\n boundaries are missing or inconsistent. On well-structured Markdown, documentation, or technical writing with clear paragraphs, recent benchmarks show recursive splitting at 512 tokens matches or beats semantic chunking, at a fraction of the ingestion cost.

What is parent-document retrieval and when should I use it?

Parent-document retrieval indexes small child chunks (e.g. 128–256 tokens) for search precision, but returns a larger parent chunk (e.g. 800–1500 tokens) to the LLM when a child matches. Use it whenever your queries are specific (favouring small chunks) but the correct answer requires surrounding context (requiring large chunks). It is the dominant pattern for production RAG systems because it sidesteps the precision-vs-context trade-off rather than compromising on it.

How much chunk overlap should I use?

Set overlap to 10–15% of your chunk size as a default — on a 512-token chunk that is 50–75 tokens. Zero overlap is fine for self-contained units like FAQ entries. Going above 20% has diminishing returns, bloats the index, and can cause near-duplicate hits in the top-k results. Reduce overlap if your reranker is surfacing nearly identical chunks.

Does chunk size affect the quality of LLM answers, or just retrieval?

Both. Chunk size affects retrieval (precision of the embedding match) and answer quality separately. Small chunks retrieve precisely but may strip the context the model needs to reason correctly. Large chunks provide full context but may retrieve weakly or bury the relevant sentence in noise. The optimal size depends on both query specificity and answer complexity — which is why hierarchical parent-document retrieval is so popular.

Should I use different chunking strategies for different document types in the same RAG system?

Yes, and it is a good practice for heterogeneous corpora. Use a document-type classifier or filename heuristic to route: Markdown and HTML through a recursive/format-aware splitter, dense prose through semantic chunking, code files through a code-aware splitter. LlamaIndex and LangChain both support per-document splitter selection during ingestion. Recent research (Adaptive Chunking, 2026) confirms that strategy routing outperforms any single fixed strategy on mixed corpora.
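A filename heuristic is the simplest form of routing. The sketch below is illustrative only: the extension table, the strategy labels, and the choice to send `.txt` files to semantic chunking are all assumptions, and a real pipeline would map each label to an actual splitter object.

```python
from pathlib import Path

# Hypothetical routing table: file extension -> chunking strategy label.
ROUTES = {
    ".md": "format_aware",
    ".html": "format_aware",
    ".py": "code_aware",
    ".txt": "semantic",  # assumption: plain-text files hold dense prose
}


def pick_strategy(filename, default="recursive"):
    """Choose a strategy from the file extension, falling back to the default."""
    return ROUTES.get(Path(filename).suffix.lower(), default)


for name in ("README.md", "handler.py", "transcript.txt", "report.docx"):
    print(f"{name} -> {pick_strategy(name)}")
```

Keeping the table as data rather than branching logic makes it easy to version alongside the chunk schema and to extend when a new document type joins the corpus.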

Further reading