In plain English
Before a RAG system can search your documents, it has to cut them into smaller pieces called chunks. The simplest way is to count characters: take 500 characters, cut, take the next 500, cut again. Fast and easy — but blind. It will happily slice a sentence in half, split a definition from the example that explains it, or merge the end of one idea with the start of an unrelated one.

Semantic chunking cuts at meaning boundaries instead of character counts. It reads the text sentence by sentence, notices when the topic shifts, and makes the cut there — at the seam between two ideas — rather than at an arbitrary position. Each chunk ends up being about one coherent thing.
Think of a long article with no section headings. A fixed-size splitter is like tearing the printed pages every 12 lines, no matter what's on them — paragraphs get ripped down the middle. Semantic chunking is like a careful editor reading the whole thing, then drawing a line only where one section genuinely ends and the next begins. The cuts land where you'd naturally pause, so every piece reads as a complete thought.
Why it matters
In RAG, retrieval is only as good as your chunks. The retriever can only return whole chunks, so a chunk is the smallest unit of knowledge your system can hand to the model. If that unit is malformed, every answer built on it suffers.
- Better retrieval precision. A chunk that covers exactly one topic embeds cleanly — its vector points in one clear direction. A chunk that straddles two topics embeds as a muddy average of both, so it matches neither query well and gets buried in the ranking.
- Fewer broken thoughts. Fixed cuts routinely separate a claim from its evidence, a term from its definition, or a question from its answer. Retrieve such a half-chunk and the model is reading half a thought — a common, quiet source of wrong answers.
- Less noise in the prompt. When each chunk is one tight idea, the top results you paste into the prompt carry signal, not filler. The model wastes fewer tokens and is less likely to be distracted by an irrelevant half-topic that rode along in a sloppy chunk.
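The "muddy average" point can be made concrete with toy geometry. The sketch below assumes, purely for illustration, that each topic is a single direction in a two-dimensional space and that a chunk straddling two topics embeds as the mean of them. Real embedding models are not that simple, and the vectors here are invented.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Hypothetical topic directions (made up for illustration).
refunds = [1.0, 0.0]
shipping = [0.0, 1.0]

# Assumption: a chunk that straddles both topics embeds as their mean.
mixed = [(r + s) / 2 for r, s in zip(refunds, shipping)]

query = refunds  # a query that is purely about refunds

pure_score = cosine(query, refunds)  # single-topic chunk
mixed_score = cosine(query, mixed)   # two-topic chunk scores lower
print(pure_score, round(mixed_score, 3))
```

Under these assumptions the single-topic chunk scores 1.0 against the refund query while the mixed chunk scores about 0.71, which is the gap that lets cleaner chunks outrank it.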
Who should reach for this? Builders whose documents have uneven structure — long-form articles, transcripts, research papers, mixed reports — where topics start and stop at irregular intervals and a fixed window keeps cutting across them. If your retrieval feels slightly off and you suspect the chunks are to blame, semantic chunking is one of the first knobs worth trying.
It's important to see where this sits among the other chunking strategies. Fixed-size chunking is the baseline; tuning chunk size and overlap makes that baseline better without changing the method. Semantic chunking is a different method altogether — it lets the content, not a number you picked, decide where the cuts go.
How it works
The core idea is simple: walk through the document in order, and cut wherever the meaning changes sharply. You measure "change in meaning" by comparing the embedding of one piece of text to the next. When two neighbours are similar, they belong together. When similarity drops off a cliff, you've found a topic boundary — cut there.
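To see what "similarity drops off a cliff" looks like in numbers, here is a minimal sketch of the neighbour comparison. The three-dimensional vectors are made up to stand in for real sentence embeddings, and `neighbour_similarities` is a hypothetical helper name, not part of any library.

```python
import math

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def neighbour_similarities(vectors):
    """Similarity of each vector to the one that follows it."""
    return [cosine(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]

# Made-up "embeddings": three sentences on one topic, then two on another.
vectors = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.9, 0.0, 0.1],
    [0.1, 0.1, 0.9],
    [0.0, 0.2, 0.9],
]

sims = neighbour_similarities(vectors)
# Index of the sentence that starts the new topic (the one after the deepest dip).
boundary = sims.index(min(sims)) + 1
print([round(s, 2) for s in sims], boundary)
```

The similarity stays high inside each group and falls sharply between the third and fourth vectors, so the boundary lands before the fourth sentence.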
Step by step
- Split into sentences. First break the document into sentences (or short windows of a few sentences). These are the atoms you'll regroup — you never cut inside one.
- Embed each sentence. Turn every sentence into a vector with an embedding model. Sentences about the same topic land near each other in vector space; sentences about different topics land far apart.
- Measure the gap between neighbours. For each adjacent pair, compute how similar their vectors are (usually cosine similarity). A high score means "same topic, keep going"; a low score means "the subject just shifted."
- Decide where to cut. Pick a threshold. Wherever the similarity between two neighbours falls below it — a clear dip in the curve — place a boundary. A common trick is to set the threshold from the data itself, for example cutting at the points in the lowest 5–10% of all similarity scores, so the document's own structure sets the bar.
- Group between boundaries. Everything between two cuts becomes one chunk. Each chunk is now a run of consecutive sentences that hang together by meaning, with a size set by the content rather than a fixed character count.
Here's the heart of it in a few lines. Notice the chunk sizes are not fixed — a long stretch on one topic stays whole, and a quick topic change makes a short chunk.
import numpy as np

def semantic_chunks(sentences, embed, percentile=5):
    # Nothing to compare with fewer than two sentences.
    if len(sentences) < 2:
        return [" ".join(sentences)] if sentences else []
    # 1) Embed every sentence, then L2-normalize so a dot product
    #    equals cosine similarity.
    vecs = np.asarray(embed(sentences), dtype=float)  # shape (n, dim)
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    # 2) Cosine similarity between each neighbouring pair.
    sims = [float(vecs[i] @ vecs[i + 1])
            for i in range(len(sentences) - 1)]
    # 3) Cut where similarity drops into the lowest 'percentile' band.
    cutoff = np.percentile(sims, percentile)
    breaks = [i + 1 for i, s in enumerate(sims) if s < cutoff]
    # 4) Group sentences between successive break points.
    chunks, start = [], 0
    for b in breaks + [len(sentences)]:
        chunks.append(" ".join(sentences[start:b]))
        start = b
    return chunks
A worked example
Take this short passage, which clearly contains two topics jammed together:
Our refund policy allows returns within 30 days of purchase.
Refunds are issued to the original payment method.
A receipt or order number is required for any return.
Our warehouse ships orders Monday through Friday.
Standard delivery takes three to five business days.
Expedited shipping is available at checkout for an extra fee.
A fixed-size splitter set to roughly 150 characters would cut by length and might produce something like "...required for any return. Our warehouse ships orders", gluing the end of the refunds topic onto the start of the shipping topic. That chunk now matches a refund question and a shipping question only weakly.
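You can reproduce the problem in a few lines. This is a minimal sketch of a character-count splitter, assuming the six sentences are joined with single spaces and no overlap is used. `fixed_size_chunks` is a hypothetical helper, and the exact cut positions will shift if the text is joined differently.

```python
def fixed_size_chunks(text, size):
    """Slice text every `size` characters, ignoring sentence boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]

passage = " ".join([
    "Our refund policy allows returns within 30 days of purchase.",
    "Refunds are issued to the original payment method.",
    "A receipt or order number is required for any return.",
    "Our warehouse ships orders Monday through Friday.",
    "Standard delivery takes three to five business days.",
    "Expedited shipping is available at checkout for an extra fee.",
])

chunks = fixed_size_chunks(passage, 150)
for n, chunk in enumerate(chunks, start=1):
    print(n, repr(chunk))

# Chunks that carry refund text and shipping text at the same time.
mixed = [c for c in chunks if "return" in c and "warehouse" in c]
print(len(mixed), "chunk(s) mix both topics")
```

With these inputs the first cut falls inside the receipt sentence, so the following chunk carries the tail of the refunds topic together with the shipping sentences.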
Semantic chunking embeds each sentence and looks at the neighbour-to-neighbour similarity. Within the refund sentences, similarity stays high. Between "...required for any return" and "Our warehouse ships orders...", similarity drops sharply — that's the seam. The cut lands exactly there:
| Chunk | Content | Topic |
|---|---|---|
| 1 | Refund policy + original payment method + receipt required | Refunds |
| 2 | Ships Mon–Fri + 3–5 day delivery + expedited option | Shipping |
Now a user asking "how do I get my money back?" retrieves a clean, all-refunds chunk, and "when will my order arrive?" retrieves a clean, all-shipping chunk. Same six sentences, but the boundary landed where the meaning actually changed — and retrieval gets sharper for both questions.
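The same passage can be run through a semantic splitter end to end. The sketch below is pure Python and swaps the embedding model for `toy_embed`, a hypothetical keyword counter invented for this example. A real system would call an actual embedding model, and real similarity scores would be far less clean than these.

```python
import math

# Assumption: hand-picked keyword stems standing in for learned topics.
REFUND_WORDS = ("refund", "return", "receipt", "payment")
SHIPPING_WORDS = ("warehouse", "ship", "delivery", "checkout")

def toy_embed(sentence):
    """Hypothetical stand-in for an embedding model: a unit-length 2-d vector."""
    s = sentence.lower()
    vec = [sum(s.count(w) for w in REFUND_WORDS),
           sum(s.count(w) for w in SHIPPING_WORDS)]
    norm = math.hypot(*vec) or 1.0
    return [v / norm for v in vec]

def percentile(values, q):
    """Linear-interpolated percentile (q in 0..100) of a non-empty list."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def semantic_chunks(sentences, embed, pct=5):
    """Cut between neighbours whose similarity falls in the lowest pct band."""
    if len(sentences) < 2:
        return list(sentences)
    vecs = [embed(s) for s in sentences]
    sims = [sum(a * b for a, b in zip(vecs[i], vecs[i + 1]))
            for i in range(len(vecs) - 1)]
    cutoff = percentile(sims, pct)
    breaks = [i + 1 for i, s in enumerate(sims) if s < cutoff]
    chunks, start = [], 0
    for b in breaks + [len(sentences)]:
        chunks.append(" ".join(sentences[start:b]))
        start = b
    return chunks

sentences = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Refunds are issued to the original payment method.",
    "A receipt or order number is required for any return.",
    "Our warehouse ships orders Monday through Friday.",
    "Standard delivery takes three to five business days.",
    "Expedited shipping is available at checkout for an extra fee.",
]

chunks = semantic_chunks(sentences, toy_embed)
for n, chunk in enumerate(chunks, start=1):
    print(n, chunk)
```

With the toy embedder, the only similarity dip sits between the receipt sentence and the warehouse sentence, so the passage splits into one refunds chunk and one shipping chunk.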
Semantic vs fixed-size chunking
Semantic chunking is more accurate at boundaries but costs more to run. Fixed-size chunking is cheap, predictable, and — for well-structured text — often good enough. The honest answer is that neither wins everywhere.
Semantic chunking
- Cuts at real topic shifts
- Variable chunk sizes
- Cleaner, single-topic chunks
- Needs an embedding call per sentence
- Best for messy, unstructured prose
Fixed-size chunking
- Cuts at a character/token count
- Predictable chunk sizes
- May split a thought in two
- Near-zero ingestion cost
- Best for uniform or short text
| Factor | Semantic | Fixed-size |
|---|---|---|
| Ingestion cost | Higher — embeds the text twice (once to chunk, once to index) | Lowest — pure string slicing |
| Speed | Slower per document | Instant |
| Boundary quality | Follows the content | Ignores the content |
| Tuning | Threshold / percentile | Size and overlap |
| Predictability | Chunk sizes vary a lot | Every chunk a known size |
When to use it (and when not to)
Semantic chunking earns its extra cost on some documents and wastes it on others. A quick rule: the more irregular the topic structure, the more it helps.
Reach for semantic chunking when
- Your sources are long, free-flowing prose with no headings — articles, transcripts, interviews, research papers — where topics start and stop at unpredictable points.
- Fixed-size chunking is visibly hurting retrieval: you can see chunks that mash two subjects together, and answers suffer for it.
- Each document is large enough that the one-time embedding cost is small next to the quality you gain.
Stick with fixed-size when
- The text already has clean structure you can split on — Markdown headings, FAQ entries, code functions, table rows. Splitting on that explicit structure beats inferring boundaries, and it's free. See chunking code, tables, and Markdown.
- You're ingesting at huge scale and every embedding call counts — semantic chunking roughly doubles your embedding bill at ingestion time.
- Your documents are short or each one is already about a single topic, so there's no boundary to find.
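The "roughly doubles" figure is simple arithmetic, sketched below. The function and its parameters are hypothetical, the corpus size is invented, and the estimate ignores chunk overlap and per-request overhead. It also assumes that embedding each sentence together with its neighbours sends that sentence's tokens once per window it appears in.

```python
def embedded_tokens(doc_tokens, semantic=False, window_radius=0):
    """Approximate tokens sent to the embedding model during ingestion."""
    index_pass = doc_tokens  # every strategy embeds the final chunks once
    if not semantic:
        return index_pass
    # Chunking pass: each sentence is embedded once on its own, or about
    # (2 * radius + 1) times when it is embedded together with neighbours.
    chunking_pass = doc_tokens * (2 * window_radius + 1)
    return index_pass + chunking_pass

corpus = 1_000_000  # made-up corpus size, in tokens

fixed = embedded_tokens(corpus)
semantic_plain = embedded_tokens(corpus, semantic=True)
semantic_windowed = embedded_tokens(corpus, semantic=True, window_radius=1)
print(fixed, semantic_plain, semantic_windowed)
```

Under these assumptions the plain semantic pass embeds twice as many tokens as fixed-size chunking, and a one-neighbour window on each side pushes that to about four times, which is worth knowing before you enable windowing at scale.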
Going deeper
The neighbour-similarity method above is the classic recipe, but it has real edge cases and several refinements worth knowing once the basics click.
The threshold is brittle. Set it too sensitive and you over-cut, shredding the document into tiny one-sentence fragments. Set it too loose and you under-cut, sliding back toward giant fixed-size-like blobs. A fixed cosine threshold rarely transfers between document types, which is why data-relative thresholds (cut at the lowest Nth percentile of gaps within each document) are more robust than a hard-coded number. Either way, this is a parameter you must evaluate, not guess.
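A small sketch shows why a hard-coded threshold transfers badly. The two similarity lists are invented to mimic documents whose scores sit in different ranges, and `fixed_breaks` and `percentile_breaks` are hypothetical helpers written for this comparison.

```python
def percentile(values, q):
    """Linear-interpolated percentile (q in 0..100) of a non-empty list."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def fixed_breaks(sims, threshold):
    """Cut wherever neighbour similarity is below a hard-coded value."""
    return [i + 1 for i, s in enumerate(sims) if s < threshold]

def percentile_breaks(sims, pct):
    """Cut wherever similarity is below this document's own pct percentile."""
    cutoff = percentile(sims, pct)
    return [i + 1 for i, s in enumerate(sims) if s < cutoff]

# Made-up neighbour similarities for two kinds of document.
dense_doc = [0.82, 0.85, 0.61, 0.84, 0.80]  # scores cluster high
loose_doc = [0.35, 0.12, 0.38, 0.33, 0.36]  # scores cluster low

print(fixed_breaks(dense_doc, 0.5), fixed_breaks(loose_doc, 0.5))
print(percentile_breaks(dense_doc, 10), percentile_breaks(loose_doc, 10))
```

With a fixed threshold of 0.5, the first document gets no cuts at all and the second is cut after every sentence. The percentile version finds the single deepest dip in each, because the bar moves with the document.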
Single sentences embed poorly. A three-word sentence carries little meaning, so its vector is noisy and the similarity curve gets jumpy. The standard fix is a sliding window: embed each sentence together with its one or two neighbours, so every point in the curve reflects a small, stable span of text rather than one fragile line.
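The windowing step itself is tiny. This is a minimal sketch, with `windowed` and its `radius` parameter as hypothetical names: it builds the text you would embed for each position, while the cuts are still placed between the original sentences.

```python
def windowed(sentences, radius=1):
    """Join each sentence with up to `radius` neighbours on either side."""
    return [" ".join(sentences[max(0, i - radius): i + radius + 1])
            for i in range(len(sentences))]

sentences = ["Returns take 30 days.", "Yes.", "Bring a receipt.", "We ship on weekdays."]
spans = windowed(sentences, radius=1)
for original, span in zip(sentences, spans):
    print(repr(original), "is embedded as", repr(span))
```

You would embed `spans` instead of `sentences`, compare neighbouring vectors as before, and use the resulting break indices to group the original sentences. The short "Yes." no longer gets a vector of its own.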
It only sees adjacent text. Comparing neighbour to neighbour catches local topic shifts but misses global structure — it can't tell that paragraph 2 and paragraph 9 are the same theme returning. For documents with that kind of long-range structure, clustering approaches or structure-aware (hierarchical) chunking go further, at more complexity.
Always measure, don't assume. Semantic chunking is intuitively appealing, but on benchmarks it doesn't always beat a well-tuned fixed-size splitter — and it's never free. The only way to know if it helps your corpus is to evaluate retrieval quality both ways. The durable lesson is the one that runs through all of chunking: there is no universally best strategy, only the one that fits your documents and your budget, proven by measurement. From here, compare the full menu of options in chunking strategies compared.
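Measuring "both ways" can start very small. The sketch below computes a top-1 hit rate: for each labelled query, does the best-matching chunk contain the text that answers it? The bag-of-words scorer stands in for your real embedding model, and the chunks and queries are toy data rigged to show the mechanics, not to predict results on a real corpus.

```python
import math
from collections import Counter

def bow(text):
    """Hypothetical stand-in for an embedding: lowercase word counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def hit_rate(chunks, labelled_queries, embed=bow):
    """Share of queries whose top-ranked chunk contains the expected text."""
    hits = 0
    for query, must_contain in labelled_queries:
        q = embed(query)
        best = max(chunks, key=lambda c: cosine(q, embed(c)))
        hits += must_contain in best
    return hits / len(labelled_queries)

# Two chunkings of the same text (toy data).
semantic = ["refunds are issued to the original payment method",
            "orders ship monday through friday"]
fixed = ["refunds are issued to the original",
         "payment method orders ship monday",
         "through friday"]

# Each query is paired with text the retrieved chunk must contain.
labelled = [("refunds payment", "original payment method"),
            ("orders ship", "monday through friday")]

print("semantic:", hit_rate(semantic, labelled))
print("fixed:", hit_rate(fixed, labelled))
```

On a real corpus you would swap in your own embedder, a few dozen labelled queries, and a top-k check instead of top-1, then compare the two numbers before committing to either strategy.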
FAQ
What is semantic chunking in RAG?
Semantic chunking is a way of splitting documents for RAG that cuts at topic boundaries instead of at a fixed character or token count. It embeds the text, measures how similar each piece is to the next, and places a cut wherever similarity drops sharply — so each chunk covers roughly one coherent idea.
How is semantic chunking different from fixed-size chunking?
Fixed-size chunking cuts every N characters or tokens regardless of content, which is fast but can split a thought in half. Semantic chunking lets the meaning decide where cuts go, producing variable-size, single-topic chunks at the cost of an extra embedding pass during ingestion.
Does semantic chunking actually improve retrieval?
It often does for long, unstructured prose where topics start and stop unpredictably, because cleaner single-topic chunks embed and match more precisely. But it doesn't always beat a well-tuned fixed-size splitter, and it costs more — so you should measure retrieval quality both ways on your own data.
How does semantic chunking decide where to cut?
It embeds each sentence (or small window of sentences), computes the similarity between neighbouring pairs, and cuts wherever that similarity falls below a threshold. A robust approach sets the threshold from the document itself — for example, cutting at the points in the lowest 5–10% of similarity scores.
Is semantic chunking worth the extra cost?
It roughly doubles your embedding cost at ingestion time, since the text is embedded once to find boundaries and again to index. It's worth it when documents are large and unstructured and retrieval is suffering from messy chunks. For short, structured, or already single-topic text, fixed-size chunking is usually good enough.