AI/TLDR

How the RAG Pipeline Works, Step by Step

After reading, you will understand exactly what happens at every stage of a RAG pipeline and why each decision — chunk size, embedding model, index type, top-k — shapes answer quality.

Beginner · 13 min read · Updated 2026-06-12

In plain English

A RAG pipeline is an assembly line for knowledge. Raw documents go in one end; grounded, cited answers come out the other. Between those two ends sit seven distinct stages, each with its own job. Skip one, do one badly, and the whole line produces defective output — usually in the form of a confident answer that contradicts the document you fed it ten minutes ago.

If you have already read What Is RAG?, you know the high-level loop: retrieve relevant passages, paste them into the prompt, generate a grounded reply. This article goes one level deeper — it opens each stage of that loop and explains what is actually happening, what choices you face, and what goes wrong if you get it wrong.

Why each stage matters

Most RAG failures are not model failures — they are pipeline failures. The language model is remarkably good at writing a fluent answer from whatever text you hand it. The problem is that you handed it the wrong text, or no text at all, because some upstream stage broke down.

  • Ingestion failure — a file format the parser cannot read, or a scanned PDF with no text layer, means entire documents go missing from the knowledge base silently.
  • Chunking failure — a chunk that splits a sentence, a table, or a code block in half loses the meaning that made it retrievable in the first place.
  • Embedding failure — using an embedding model trained on general web text to index legal contracts or medical notes produces vectors that do not cluster by the domain's actual meaning.
  • Indexing failure — an approximate nearest-neighbor index tuned for 100,000 vectors may return garbage when you scale to 10 million without retuning its parameters.
  • Retrieval failure — fetching the top-3 chunks when the answer requires synthesising five passages means the generation stage never sees the full picture.
  • Generation failure — prompting the model without clear boundaries between context and question lets it blend retrieved facts with hallucinated ones.

Understanding each stage lets you diagnose where a system is breaking rather than blindly tweaking the prompt or swapping models — the two things beginners try first.

The full pipeline, stage by stage

Stage 1 — Document ingestion

Ingestion is the work of reading your source material and turning it into plain text the rest of the pipeline can process. Sources vary enormously: PDFs, Word documents, web pages, Notion pages, Confluence wikis, database rows, Slack exports, code repositories. Each format needs its own parser.

The main pitfall at this stage is silent data loss. A PDF that is really a scanned image has no text layer, so a naive reader returns an empty string and you never notice. HTML scraped straight from a web page includes navigation menus, cookie banners, and footer links that dilute your chunks with noise. Tables and structured layouts often fall apart when converted to a flat string. A good practice is to log a character count per document so you catch empty or suspiciously short parses before they propagate downstream.
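The character-count check can be sketched in a few lines. This is a minimal illustration, not a framework feature: the function name `flag_suspicious_parses` and the 200-character threshold are assumptions chosen for the example.

```python
def flag_suspicious_parses(parsed_docs, min_chars=200):
    """Return (name, char_count) for documents whose parsed text is empty or very short.

    parsed_docs maps a document name to the text a parser extracted from it.
    min_chars is an assumed threshold; tune it for your own corpus.
    """
    flagged = []
    for name, text in parsed_docs.items():
        n_chars = len(text.strip())
        if n_chars < min_chars:
            flagged.append((name, n_chars))
    return flagged


docs = {
    "scanned-invoice.pdf": "",  # scanned image: the parser found no text layer
    "returns.md": "Refunds on physical items are accepted within 30 days. " * 10,
}
print(flag_suspicious_parses(docs))
```

Running a check like this right after parsing turns a silent failure into a visible list of files that need OCR or a different parser.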

Stage 2 — Chunking

After ingestion you have long strings of text — potentially thousands of words per document. You cannot embed the whole document as one unit and retrieve it sensibly, because a 5,000-word article produces one vector that averages over every sentence in it. When a user asks about one specific paragraph, that averaged vector may not rank near the top.

Chunking splits each document into shorter passages that each cover one coherent idea. The two main settings are chunk size (how many tokens per chunk, commonly 256 to 512) and chunk overlap (how many tokens are shared between consecutive chunks, commonly 10–20% of chunk size). Overlap does not stop a sentence from being cut at a boundary, but it means a sentence that straddles the boundary usually appears whole in at least one of the two neighbouring chunks.

  • Fixed-size chunking — cut every N tokens regardless of sentence boundaries. Fast, simple, and brittle: it regularly splits sentences or tables.
  • Sentence or paragraph chunking — split on natural boundaries like periods, newlines, or heading tags. More coherent chunks at the cost of variable length.
  • Recursive chunking — try splitting on paragraphs first; if a chunk is still too long, split on sentences; if still too long, fall back to fixed tokens. This is the default strategy in most frameworks.
  • Semantic chunking — embed sliding windows of sentences and cut where the embedding similarity drops, so each chunk corresponds to one semantic topic. Higher quality, significantly more expensive to compute.

The right chunk size depends on your documents and your embedding model's token limit. Longer chunks give the model more context per retrieved passage; shorter chunks improve retrieval precision. You almost always need to experiment. For more detail, see What Is Chunking in RAG?.
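To make chunk size and overlap concrete, here is a minimal sketch of fixed-size chunking with overlap. It assumes whitespace-separated words as a stand-in for tokens; a real pipeline would count tokens with the embedding model's own tokenizer. The function name `chunk_with_overlap` is illustrative.

```python
def chunk_with_overlap(text, chunk_size=256, overlap=32):
    """Split text into chunks of at most chunk_size words, sharing `overlap` words.

    Words stand in for tokens here (an assumption made to keep the sketch
    dependency-free).
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_with_overlap(sample, chunk_size=4, overlap=1):
    print(chunk)
```

Each chunk repeats the last word of the previous one, which is the overlap at work. Recursive and semantic chunking keep the same size and overlap ideas but choose the cut points more carefully.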

Stage 3 — Embedding

Each chunk is passed through an embedding model — a neural network that outputs a fixed-length list of numbers (typically 768 to 3,072 floats) called a vector. This vector encodes the chunk's meaning. Chunks about similar topics produce vectors that are numerically close to each other; chunks about unrelated topics produce vectors that are far apart.

Three practical choices at this stage:

  • Which model to use. General-purpose embedding models (OpenAI text-embedding-3-large, Voyage AI, Cohere Embed) work well for most domains. Domain-specific models (e.g. fine-tuned on medical or legal text) can outperform them on narrow corpora, so it is worth comparing both on a sample of your own queries.
  • Symmetric vs asymmetric embedding. Some models are trained so that a query (a short question) and a passage (a longer paragraph) are projected into the same space meaningfully. Models not designed for this produce poor query-to-chunk similarity. Always use a model explicitly designed for retrieval, not just similarity.
  • Batch size and cost. Embedding is cheap compared to LLM inference, but on millions of documents it adds up. Embed in batches, cache the results, and only re-embed chunks when their source document changes.
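The batching and caching advice might be implemented along these lines. This is a sketch under stated assumptions: `embed_batch` is a hypothetical callable standing in for whatever embedding API you use, and the cache is a plain dictionary keyed by a hash of the chunk text.

```python
import hashlib


def embed_with_cache(chunks, embed_batch, cache, batch_size=64):
    """Return one vector per chunk, calling embed_batch only for unseen text.

    embed_batch is a hypothetical callable (list[str] -> list[list[float]]).
    cache maps the SHA-256 hash of a chunk's text to its vector.
    """
    keys = [hashlib.sha256(c.encode("utf-8")).hexdigest() for c in chunks]
    # Collect each unseen text once, preserving order.
    missing = {}
    for key, chunk in zip(keys, chunks):
        if key not in cache and key not in missing:
            missing[key] = chunk
    items = list(missing.items())
    for i in range(0, len(items), batch_size):
        batch = items[i:i + batch_size]
        vectors = embed_batch([text for _, text in batch])
        for (key, _), vector in zip(batch, vectors):
            cache[key] = vector
    return [cache[key] for key in keys]


calls = []


def fake_embed_batch(texts):
    # Toy stand-in so the sketch runs offline: two numbers per text.
    calls.append(len(texts))
    return [[float(len(t)), float(t.count(" "))] for t in texts]


cache = {}
embed_with_cache(["refund policy", "shipping times"], fake_embed_batch, cache)
embed_with_cache(["refund policy", "new chunk"], fake_embed_batch, cache)
print(calls)  # sizes of the batches actually sent to the embedder
```

The second call sends only the unseen chunk to the embedder. Because the key is a hash of the text, an edited chunk gets a new key and is re-embedded automatically.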

Stage 4 — Indexing

You now have a vector for each chunk. Storing them in a plain list and scanning every vector on every query (exact nearest-neighbor search) is often fast enough for small collections; roughly 100,000 vectors is a common rule of thumb, though the real limit depends on vector dimension, hardware, and your latency budget. Beyond that, a full scan becomes too slow. A vector index organises the vectors so you can find the closest ones to a query vector in milliseconds, not seconds, by only examining a small fraction of the total.

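The most widely used index algorithm is HNSW (Hierarchical Navigable Small World), a layered graph in which each node links to nearby vectors: sparse upper layers allow long jumps across the space, and the dense bottom layer refines the search. It gives high recall (it finds the true nearest neighbours most of the time) at query latencies that often stay in the single-digit milliseconds, even at millions of vectors, provided the graph fits in memory and its parameters are tuned. FAISS, Qdrant, Weaviate, and pgvector all offer HNSW indexes; managed services such as Pinecone generally do not expose the underlying index algorithm.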

Alongside the vector, you store each chunk's metadata: the source document filename, page number, section heading, creation date, and any other field you might want to filter on later. This lets you do pre-filtering, for example "only retrieve from documents tagged as Q4 2025", before running vector search. Pre-filtering is usually more reliable than post-filtering, because filtering after the search can discard most of the top-k results and leave too few passages.
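A small exact-search sketch shows how a metadata filter combines with vector ranking. The entry layout (`id`, `vector`, `meta`) and the function name `search` are assumptions made for this example, and the scan is brute force rather than HNSW, so it only illustrates the filtering logic.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def search(entries, query_vector, k=2, where=None):
    """Exact top-k search over entries, applying the metadata filter first.

    Each entry is a dict with "id", "vector", and "meta" keys (a made-up
    layout for this sketch). Vectors are assumed to be unit length, so the
    dot product ranks them the same way cosine similarity would.
    """
    where = where or {}
    candidates = [
        e for e in entries
        if all(e["meta"].get(field) == value for field, value in where.items())
    ]
    ranked = sorted(candidates, key=lambda e: dot(query_vector, e["vector"]), reverse=True)
    return [e["id"] for e in ranked[:k]]


entries = [
    {"id": "c1", "vector": [1.0, 0.0], "meta": {"quarter": "Q4 2025"}},
    {"id": "c2", "vector": [0.8, 0.6], "meta": {"quarter": "Q3 2025"}},
    {"id": "c3", "vector": [0.6, 0.8], "meta": {"quarter": "Q4 2025"}},
    {"id": "c4", "vector": [0.0, 1.0], "meta": {"quarter": "Q4 2025"}},
]
print(search(entries, [1.0, 0.0], k=2))
print(search(entries, [1.0, 0.0], k=2, where={"quarter": "Q4 2025"}))
```

Without the filter the top two are c1 and c2. Filtering those two afterwards for Q4 2025 would leave only c1, whereas filtering first still returns two results (c1 and c3).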

Stage 5 — Query preprocessing

The online half of the pipeline begins when a user types a question. Before embedding and searching, it is often worth transforming that raw question into something that retrieves better. This stage is optional in a minimal pipeline but has a large payoff in production.

  • Query rewriting. An LLM rewrites a vague or conversational question ("what was that thing about refunds?") into a precise retrieval query ("refund policy for physical items"). Especially valuable when the query is a follow-up in a multi-turn conversation.
  • Hypothetical Document Embeddings (HyDE). Instead of searching with the question, you ask an LLM to write a hypothetical answer, then embed that answer and search with it. Because the hypothetical answer looks like a document, it often matches real document chunks better than a short question does.
  • Multi-query expansion. Generate two or three paraphrased versions of the question, run each as a separate search, then merge and deduplicate the results. Useful when a single phrasing might miss synonyms or related terms.
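The merge-and-deduplicate step of multi-query expansion can be sketched as follows. The paraphrases would normally come from an LLM; here they are hard-coded, and `search_fn` is a hypothetical callable that returns `(chunk_id, score)` pairs for one query.

```python
def multi_query_search(queries, search_fn, k=5):
    """Run every query variant and merge the hits, keeping each chunk's best score."""
    best = {}
    for query in queries:
        for chunk_id, score in search_fn(query):
            if chunk_id not in best or score > best[chunk_id]:
                best[chunk_id] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return [chunk_id for chunk_id, _ in ranked[:k]]


# Canned results standing in for a real vector search (illustrative values).
fake_results = {
    "refund policy for physical items": [("returns#1", 0.82), ("exceptions#1", 0.64)],
    "how do I return a product": [("returns#1", 0.77), ("shipping#3", 0.58)],
}
print(multi_query_search(list(fake_results), fake_results.get, k=3))
```

Keeping the best score per chunk is one simple merge rule; rank-based fusion is a common alternative when scores from different searches are not directly comparable.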

Stage 6 — Retrieval

The preprocessed query is embedded with the same model used at ingestion, and the vector index returns the top-k chunks whose vectors are closest to the query vector. "Closest" is usually measured by cosine similarity (equivalently, dot product on normalised vectors): a score between -1 and 1, where 1 means the vectors point in the same direction in the high-dimensional space, 0 means they are orthogonal, and negative values mean they point in opposing directions. In practice, scores between text embeddings tend to be positive. For more on how this works, see What Is Semantic Search?.
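Cosine similarity is short enough to write out in full. The sketch below uses plain Python lists so it runs without dependencies; production code would normally rely on a numerical library or the vector store itself.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def normalise(v):
    """Scale a non-zero vector to unit length."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]


print(cosine_similarity([3.0, 4.0], [6.0, 8.0]))    # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))    # orthogonal
print(cosine_similarity([3.0, 4.0], [-3.0, -4.0]))  # opposite direction
```

The three calls cover the extremes: vectors pointing the same way score 1, orthogonal vectors score 0, and opposite vectors score -1. Once vectors are passed through `normalise`, a plain dot product gives the same ranking, which is why many stores normalise at insert time.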

The choice of k (how many chunks to retrieve) is a meaningful tradeoff. Small k (3–5) keeps the prompt tight, cheap, and focused; large k (10–20) reduces the risk of missing the relevant passage but adds noise and cost. Most production systems retrieve a larger candidate set (20–50) and then apply a reranker — a cross-encoder model that reads the query and each chunk together and re-scores them with higher accuracy than a vector distance alone can manage. The top 3–5 reranked chunks go into the prompt.

A common enhancement is hybrid search: running both a vector search and a keyword search (BM25) in parallel, then merging the two ranked lists with a method called Reciprocal Rank Fusion. This catches exact matches — product codes, names, rare technical terms — that semantic search can miss.
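Reciprocal Rank Fusion is simple to sketch: every list contributes 1 / (k + rank) to each item it contains, and items are sorted by their summed score. The constant k=60 is the value commonly used with this method, treated here as a tunable assumption; the chunk ids are made up.

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge ranked lists of ids; each list adds 1 / (k + rank) to an id's score."""
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


vector_hits = ["c7", "c2", "c9"]   # from semantic search
keyword_hits = ["c2", "c4", "c7"]  # from BM25
print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
```

Chunks that appear in both lists (c2 and c7) rise to the top, and because only ranks are used, the raw BM25 and cosine scores never need to be put on a common scale.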

Stage 7 — Generation

The final stage assembles the retrieved chunks and the user's question into a single prompt, then calls the language model. The structure of this augmented prompt matters more than most beginners expect.

Typical augmented prompt structure
You are a helpful assistant. Answer the question using ONLY the
context passages below. If the answer is not in the context,
respond with "I don't have information about that."
Do not add information from outside the provided context.

--- CONTEXT ---
[Passage 1 — source: help-center/returns.md, section: Physical goods]
Refunds on physical items are accepted within 30 days of purchase.
Items must be in their original packaging and unused.

[Passage 2 — source: help-center/returns.md, section: Digital goods]
Digital products are non-refundable once downloaded or activated.

[Passage 3 — source: policy/exceptions.md]
Exceptions may be granted for defective products. Contact support
with a description of the defect within 7 days of delivery.
--- END CONTEXT ---

Question: Can I return a download I already activated?

Four practices make generation more reliable:

  • Label each passage with its source. "source: help-center/returns.md" lets the model attribute its answer and lets you display citations in the UI.
  • Instruct the model to stay in the context. Explicitly telling it "do not use knowledge outside the context" reduces hallucination when the retrieved passages happen to be incomplete.
  • Ask for 'I don't know' when unsure. Models default to guessing when asked something they cannot answer from the context. Explicitly permit — even encourage — an honest non-answer.
  • Keep the context order intentional. Models recall information near the beginning and end of the context better than information buried in the middle. Put the most relevant chunk first.
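A prompt like the one above can be assembled with a small helper. This is a sketch, not a prescribed template: the passage layout (`source` and `text` keys) and the function name `build_prompt` are assumptions, and the passages are expected to arrive already sorted from most to least relevant.

```python
def build_prompt(question, passages):
    """Assemble an augmented prompt from a question and retrieved passages.

    passages is a list of dicts with "source" and "text" keys (an assumed
    layout), ordered from most to least relevant.
    """
    instructions = (
        "You are a helpful assistant. Answer the question using ONLY the\n"
        "context passages below. If the answer is not in the context,\n"
        "respond with \"I don't have information about that.\"\n"
        "Do not add information from outside the provided context."
    )
    blocks = [
        f"[Passage {i} | source: {p['source']}]\n{p['text']}"
        for i, p in enumerate(passages, start=1)
    ]
    context = "\n\n".join(blocks)
    return (
        f"{instructions}\n\n--- CONTEXT ---\n{context}\n--- END CONTEXT ---\n\n"
        f"Question: {question}"
    )


passages = [
    {"source": "help-center/returns.md",
     "text": "Digital products are non-refundable once downloaded or activated."},
    {"source": "policy/exceptions.md",
     "text": "Exceptions may be granted for defective products."},
]
prompt = build_prompt("Can I return a download I already activated?", passages)
print(prompt)
```

Keeping the instructions, the delimited context, and the question in fixed positions makes the prompt easy to log and compare across requests.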

Offline vs online — the two clocks

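A useful mental model is to split the pipeline on its timing boundary:

  • Offline (stages 1–4: ingestion, chunking, embedding, indexing). Runs ahead of time, and again whenever source documents are added or changed.
  • Online (stages 5–7: query preprocessing, retrieval, generation). Runs on every user request.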

This split explains a common beginner confusion: if you update a document, why doesn't the answer change immediately? Because the offline half has not re-run yet. You need to re-chunk, re-embed, and re-index the changed document before the new content becomes retrievable. Production systems solve this with an incremental ingestion job — a background process that watches for document changes and updates only the affected chunks.
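One simple way an incremental ingestion job can decide what to reprocess is to compare content hashes. The sketch below assumes you stored a hash per document at the last indexing run; the names `plan_reindex` and `digest` are illustrative.

```python
import hashlib


def digest(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(current_docs, indexed_hashes):
    """Compare current document text against hashes stored at the last indexing run.

    Returns (to_index, to_delete): ids whose content is new or changed, and
    ids that were indexed before but no longer exist.
    """
    to_index = [
        doc_id for doc_id, text in current_docs.items()
        if indexed_hashes.get(doc_id) != digest(text)
    ]
    to_delete = [doc_id for doc_id in indexed_hashes if doc_id not in current_docs]
    return to_index, to_delete


indexed_hashes = {
    "returns.md": digest("Refunds within 30 days."),
    "faq.md": digest("Old FAQ text."),
    "retired.md": digest("No longer published."),
}
current_docs = {
    "returns.md": "Refunds within 30 days.",  # unchanged
    "faq.md": "Updated FAQ text.",            # changed since last run
}
print(plan_reindex(current_docs, indexed_hashes))
```

Only faq.md needs re-chunking and re-embedding. Its stale vectors, like those of the deleted retired.md, have to be removed from the index before the new ones go in.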

Going deeper

Once you have a working pipeline that passes retrieval and answer-faithfulness evaluations, the common next steps are about quality and scale.

Parent-child chunking. You index small, precise child chunks for retrieval, but when a child is retrieved you pass the larger parent passage (or the whole section) to the generation model. This gives you retrieval precision and generation context — the model sees enough surrounding text to understand what the snippet means.
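The lookup step of parent-child chunking can be sketched with two dictionaries. The ids and the function name `expand_to_parents` are hypothetical; the point is only that retrieval works on child ids while generation receives parent text.

```python
def expand_to_parents(child_hits, child_to_parent, parents):
    """Swap retrieved child chunk ids for their parent passages.

    Parents are returned in the rank order of their best child, and each
    parent appears once even if several of its children were retrieved.
    """
    seen = set()
    passages = []
    for child_id in child_hits:
        parent_id = child_to_parent[child_id]
        if parent_id not in seen:
            seen.add(parent_id)
            passages.append(parents[parent_id])
    return passages


parents = {
    "p1": "Full text of the 'Physical goods' section...",
    "p2": "Full text of the 'Digital goods' section...",
}
child_to_parent = {"c1": "p1", "c3": "p2", "c4": "p2"}
print(expand_to_parents(["c3", "c1", "c4"], child_to_parent, parents))
```

Deduplicating matters here: two children from the same section would otherwise paste the same parent passage into the prompt twice.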

Contextual retrieval. Anthropic's 2024 technique prepends a short AI-generated description of where each chunk sits in its source document ("This passage is from the returns policy section of the help centre") before embedding it. This additional context improves retrieval accuracy because the vector captures not just the chunk's words but its role in the document.

Agentic RAG. Instead of a fixed one-shot retrieve-then-generate pass, you give a language model retrieval as a callable tool and let it decide when to search, what to search for, and whether one retrieval is enough or it needs to refine. This turns the pipeline into a loop (see What Is Agentic RAG?) and is an increasingly common architecture for production AI assistants.

Observability. In production you need to log every stage: which chunks were retrieved, what the cosine scores were, what the assembled prompt looked like, and whether the user accepted the answer. Without this trace you are debugging blind. Tools in the LLMOps ecosystem (LangSmith, Arize, Helicone) attach to the pipeline and give you that trace with little extra code.

FAQ

What are the steps in a RAG pipeline?

The seven stages are: (1) document ingestion: parse raw files into text; (2) chunking: split text into short passages; (3) embedding: convert each chunk to a vector; (4) indexing: store vectors in a searchable index; (5) query preprocessing: clean or rewrite the user's question; (6) retrieval: find the top-k most relevant chunks; (7) generation: assemble a prompt with those chunks and call the LLM. The first four run offline, ahead of time and again whenever documents are added or changed; the last three run online on every request.

How do I choose chunk size for RAG?

Start with 256–512 tokens per chunk with 10–20% overlap, which works for most prose documents. Shorter chunks (128 tokens) improve retrieval precision but reduce the context the model gets per passage; longer chunks (1,024 tokens) give more context but reduce precision and may exceed some embedding models' limits. Run an offline evaluation with a sample of real queries and measure retrieval recall at different sizes — that data beats any rule of thumb.
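The offline evaluation mentioned above usually boils down to a recall@k calculation per chunk size. Here is a minimal sketch, assuming you have, for each test query, the ranked chunk ids the retriever returned and a hand-labelled set of relevant ids; all names and ids are illustrative.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant chunk ids that appear in the top-k retrieved ids."""
    if not relevant:
        raise ValueError("relevant must not be empty")
    relevant_set = set(relevant)
    return len(set(retrieved[:k]) & relevant_set) / len(relevant_set)


def mean_recall_at_k(runs, k):
    """Average recall@k over (retrieved, relevant) pairs, one pair per test query."""
    return sum(recall_at_k(retrieved, relevant, k) for retrieved, relevant in runs) / len(runs)


runs = [
    (["c1", "c5", "c3"], ["c1", "c3"]),  # both relevant chunks are in the top 3
    (["c9", "c2", "c7"], ["c4"]),        # the relevant chunk was missed
]
print(mean_recall_at_k(runs, k=3))  # 0.5
```

Re-running the same query set after re-chunking at 128, 256, 512, and 1,024 tokens gives one comparable number per setting.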

What is the difference between a vector index and a vector database?

A vector index is just the data structure and algorithm for fast nearest-neighbour search: HNSW is one such algorithm, and FAISS is a library that implements several. A vector database (Pinecone, Qdrant, Weaviate, or PostgreSQL with the pgvector extension) adds everything around that index: persistent storage, metadata filtering, CRUD operations, authentication, and in most cases horizontal scaling. For a prototype you can use a bare FAISS index in memory; for production you usually want a database that handles persistence and updates for you, whether managed or self-hosted.

Why does my RAG answer not update after I change a document?

Because the offline half of the pipeline — chunking, embedding, and indexing — has not re-run for that document. Your vector index still holds the old chunks. You need to delete the stale vectors for that document from the index, re-chunk and re-embed the updated document, and insert the new vectors. Production systems automate this with an incremental ingestion job triggered by document changes.

What is a reranker and do I need one?

A reranker is a cross-encoder model that takes a query and a candidate chunk together and scores how well the chunk answers the query. It is usually more accurate than cosine similarity alone because it reads both texts jointly rather than comparing pre-computed vectors, at the cost of one model call per candidate. The standard pattern is: retrieve top-20 with fast ANN search, rerank to top-5, pass top-5 to the LLM. You do not need one in a prototype, but it is often one of the highest-value additions in production, and it can improve answer accuracy more than prompt changes do when the weak point is retrieval ranking.

What is HyDE and when should I use it?

HyDE (Hypothetical Document Embeddings) asks an LLM to write a hypothetical answer to the question, then embeds that hypothetical answer and uses it as the search query instead of the original question. Because the hypothetical answer looks like a document, it often matches real document chunks better than a short user question does. Use it when your queries are short or vague and your documents are long-form prose. It adds one extra LLM call per request, so it is not free.

Further reading