AI/TLDR

What Is Agentic RAG? When the LLM Decides What to Search

You'll understand how agentic RAG turns retrieval into a tool the LLM calls iteratively, and when that beats a fixed pipeline.

INTERMEDIATE · 12 MIN READ · UPDATED 2026-06-11

In plain English

Picture two researchers handed the same question. The first one runs a single library search, grabs the top five results no matter what, and writes an answer from whatever came back — even if half of it is irrelevant. The second one searches, skims the results, realizes the question has two parts, runs a second search for the part that's still missing, notices one source is out of date, and only then writes the answer. The second researcher is doing agentic RAG.

Ordinary Retrieval-Augmented Generation is the first researcher. It's a fixed pipeline: take the user's question, embed it, fetch the top-k chunks from a vector store, stuff them into the prompt, generate. One search, every time, in a straight line. Agentic RAG turns that retrieval step into a tool the model can choose to call — and the model decides when to call it, what to search for, whether the results are good enough, and when to stop.
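
For contrast, here is roughly what that straight line looks like in code. Treat it as a toy sketch under loud assumptions: `embed` is a bag-of-words counter standing in for a real embedding model, `CHUNKS` is a made-up three-line corpus rather than a vector store, and `generate` is a stub where the LLM call would go.

```python
import math
from collections import Counter

# Hypothetical toy corpus; a real system would keep chunks in a vector store.
CHUNKS = [
    "Refunds are allowed within 30 days of purchase.",
    "Enterprise plans get a 60 day refund window.",
    "Our office is closed on public holidays.",
]


def embed(text: str) -> Counter:
    """Stand-in for an embedding model: lowercase bag-of-words counts."""
    return Counter(text.lower().replace(".", "").replace("?", "").split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, k: int = 2) -> list[str]:
    """Exactly one search, using the user's own words as the query."""
    q = embed(question)
    ranked = sorted(CHUNKS, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def generate(prompt: str) -> str:
    """Stub for the LLM call: it only reports how many chunks it was handed."""
    return f"[answer generated from {prompt.count('- ')} chunks]"


def fixed_rag(question: str, k: int = 2) -> str:
    """Embed, fetch top-k, stuff into the prompt, generate. No branches."""
    chunks = retrieve(question, k)
    context = "\n".join(f"- {c}" for c in chunks)
    return generate(f"Context:\n{context}\n\nQuestion: {question}")
```

Note that `fixed_rag` always passes `k` chunks to the model, even when most of them score zero against the question. Nothing in the pipeline looks at the results before generating.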

The shift is small to describe and large in consequence: retrieval stops being a hard-coded step in your code and becomes a decision the LLM makes at runtime. The model is no longer a passenger that gets handed documents; it's the agent driving the search.

Why it matters

A fixed RAG pipeline rests on two assumptions that quietly break all the time: that one search is enough, and that the user's exact words make a good search query. Real questions don't cooperate.

  • Multi-part questions. "How does our refund policy compare to our competitor's, and which one is more generous?" needs at least two retrievals. A single top-k search blends both topics and nails neither.
  • Bad phrasing. A user types three keywords or a vague pronoun-heavy sentence. The literal query embeds poorly. A model can rewrite the query into something a retriever actually matches.
  • Empty or weak results. Fixed RAG forges ahead even when the chunks it got back are garbage, then hallucinates to fill the gap. An agent can look at the results, see they're irrelevant, and search again with different terms.
  • Questions that don't need retrieval at all. "Translate this to French" doesn't need your knowledge base. Fixed RAG retrieves anyway, wasting tokens and polluting the prompt. An agent can skip the search.

Who should care: anyone whose RAG demo worked great on simple lookups but falls apart on the real, messy questions users actually ask. If your evaluation shows good retrieval on single-hop questions and a cliff on anything multi-step, agentic RAG is the standard next move.

What it replaced: not RAG itself, but the rigidity of RAG. The earlier fix for hard questions was to bolt more fixed stages onto the pipeline — a query-rewrite step here, a reranker there, a fallback branch in your code. Agentic RAG hands that orchestration to the model instead of hard-coding every branch yourself. You trade predictable control for adaptive behavior.

How it works

Mechanically, agentic RAG is the standard agent loop with search wired in as a tool. You give the model a search tool (and maybe more than one), describe what each does, and let it run. The model reasons about the question, decides whether to call a tool, reads what comes back, and repeats until it's confident — then writes the final answer. This is tool use and function calling pointed at your knowledge base.

The loop is what makes it agentic. After each search the model faces a fork: I have what I need — write the answer, or I'm still missing something — search again with a better query. That second branch is impossible in fixed RAG, where retrieval happens exactly once before generation and never again.

Inside a single turn, a few distinct skills show up — often as separate reasoning steps the model performs on its own:

  1. Routing — does this question even need retrieval, and from which source? An agent with three tools (docs search, ticket search, web search) picks the right one.
  2. Query rewriting — turn the user's messy words into one or more clean search queries. "What about pricing for the enterprise tier" becomes "enterprise tier pricing plan".
  3. Decomposition — split a multi-hop question into sub-questions, search each, then combine. This is how agentic RAG answers things one search physically cannot.
  4. Grading & retrying — judge whether the retrieved chunks are relevant. If not, reformulate and search again instead of answering from junk.
  5. Stopping — recognize "I now have enough" and write the answer, rather than looping forever.
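
Routing, the first skill in that list, is largely a matter of which tools you expose and how you describe them. A minimal sketch, assuming three hypothetical retrievers (`search_docs`, `search_tickets`, `search_web`) that are stubbed here so the example runs offline. The tool definitions use the same name, description, and input-schema shape as the `search_kb` tool later in this article; the descriptions themselves are illustrative.

```python
from typing import Callable


# Hypothetical retrievers, stubbed so the example runs without any backend.
def search_docs(query: str) -> str:
    return f"[docs] results for {query!r}"


def search_tickets(query: str) -> str:
    return f"[tickets] results for {query!r}"


def search_web(query: str) -> str:
    return f"[web] results for {query!r}"


def make_tool(name: str, description: str) -> dict:
    """Build a tool definition that takes a single `query` string."""
    return {
        "name": name,
        "description": description,
        "input_schema": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    }


# The descriptions are what the model routes on, so say when to use each tool.
TOOLS = [
    make_tool("search_docs", "Search product documentation and policies."),
    make_tool("search_tickets", "Search past support tickets for similar issues."),
    make_tool("search_web", "Search the public web. Use only for external facts."),
]

HANDLERS: dict[str, Callable[[str], str]] = {
    "search_docs": search_docs,
    "search_tickets": search_tickets,
    "search_web": search_web,
}


def run_tool(name: str, tool_input: dict) -> str:
    """Dispatch a tool call chosen by the model to the matching retriever."""
    handler = HANDLERS.get(name)
    if handler is None:
        return f"Unknown tool: {name}"
    return handler(tool_input["query"])
```

Returning an error string for an unknown tool, rather than raising, is a design choice: the model sees the message as a tool result and can pick a different tool on its next turn.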

Two implementation styles dominate. The single-agent version is one LLM with a search tool in a loop — simple, the most common starting point. The multi-agent version splits the work: a planner agent breaks the question down, retriever agents fetch from different sources in parallel, and a synthesizer combines the findings. Multi-agent buys you parallelism and specialization at the cost of complexity, and is the realm of multi-agent systems.
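
The multi-agent shape can be sketched without any LLM at all, which makes the control flow easier to see. Everything below is a stand-in: `plan` splits on the word "and" where a real planner would ask a model to decompose the question, `retrieve` returns a canned string, and `synthesize` only counts its inputs. The part that carries over is the structure: plan, fan out in parallel, combine.

```python
from concurrent.futures import ThreadPoolExecutor


def plan(question: str) -> list[str]:
    """Stub planner: split on ' and '. A real planner would be an LLM call."""
    parts = [p.strip(" ?") for p in question.split(" and ")]
    return [p for p in parts if p]


def retrieve(sub_question: str) -> str:
    """Stub retriever agent: a real one would query its own source."""
    return f"findings for '{sub_question}'"


def synthesize(question: str, findings: list[str]) -> str:
    """Stub synthesizer: a real one would be an LLM call over the findings."""
    return f"{len(findings)} findings combined for: {question}"


def answer(question: str) -> str:
    sub_questions = plan(question)
    # Retrievals are independent, so they can run in parallel.
    # pool.map returns results in the same order as the inputs.
    with ThreadPoolExecutor(max_workers=4) as pool:
        findings = list(pool.map(retrieve, sub_questions))
    return synthesize(question, findings)
```

Threads fit here because retrieval is typically I/O-bound (network calls to a vector store or search API); that is an assumption about your retrievers, not a requirement of the pattern.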

A minimal agentic RAG loop

Here's the whole idea in one runnable file: expose a search function as a tool, then let the model call it as many times as it needs. The retriever here is a stub returning fake hits — swap in your real vector database query. The loop, not the retriever, is the point.

agentic_rag.py (Python)
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY ("sk-...")


def search_kb(query: str) -> str:
    """Your real retriever goes here: embed `query`, hit the vector DB,
    return the top chunks. Stubbed for the example."""
    print(f"  [tool] searching: {query!r}")
    return "Refunds are allowed within 30 days. Enterprise plans get 60 days."


TOOLS = [
    {
        "name": "search_kb",
        "description": "Search the company knowledge base. Call it as many "
        "times as needed, rewriting the query if results are weak.",
        "input_schema": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    }
]

messages = [{"role": "user", "content": "How long do I have to refund an enterprise plan?"}]

# The agentic loop: keep going while the model wants to use tools.
while True:
    resp = client.messages.create(
        model="claude-opus-4-8",
        max_tokens=1024,
        tools=TOOLS,
        messages=messages,
    )
    messages.append({"role": "assistant", "content": resp.content})

    if resp.stop_reason != "tool_use":
        # No more tool calls: print the text blocks of the final answer.
        print("".join(b.text for b in resp.content if b.type == "text"))
        break

    # Run each requested search and feed the results back in.
    results = []
    for block in resp.content:
        if block.type == "tool_use":
            output = search_kb(block.input["query"])
            results.append(
                {"type": "tool_result", "tool_use_id": block.id, "content": output}
            )
    messages.append({"role": "user", "content": results})

Notice what you didn't write: no "if the results are empty, retry" branch, no query-rewrite step, no logic for multi-part questions. The model handles all of that inside the loop. Your job shrinks to defining the tool well and deciding when to stop. That tool description is doing real work — it's prompt engineering, and a vague description gives you a lazy agent that searches once and quits.

Agentic RAG vs traditional RAG: when each wins

This is not a "newer is better" situation. The two approaches occupy different points on a tradeoff curve, and picking wrong costs you either accuracy or money.

| Dimension | Traditional RAG | Agentic RAG |
|---|---|---|
| Best question type | Single-hop, well-phrased | Multi-hop, vague, multi-source |
| Latency | One LLM call (fast) | Several calls (slower) |
| Cost per query | Low, flat | Higher, variable |
| Failure mode | Answers from bad chunks | Can loop or over-search |
| Debuggability | Easy (one fixed path) | Harder (path changes per query) |

A useful rule: start with traditional RAG, measure, and go agentic only where the data says you need it. If your eval suite shows strong scores on simple questions and a sharp drop on multi-part ones, that gap is exactly what agentic retrieval closes. If your traffic is overwhelmingly simple lookups, agentic RAG just adds latency and a bill.

A common hybrid: a cheap router out front classifies each question as simple or complex. Simple ones take the fast fixed path; complex ones get the full agentic loop. You get fixed-RAG speed on the easy 80% and agentic power on the hard 20%.
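
A sketch of that hybrid, with a deliberately crude router. The keyword list and the 25-word threshold are made-up heuristics for illustration, and `fixed_rag` and `agentic_rag` are stubs standing in for your two real paths. A small trained classifier could fill the same slot as `classify`.

```python
import re

# Hypothetical hints that a question is multi-part or comparative.
COMPLEX_HINTS = re.compile(
    r"\b(compare|versus|vs|difference|and|both|each)\b", re.IGNORECASE
)


def classify(question: str) -> str:
    """Cheap router: long or comparative questions count as complex."""
    if len(question.split()) > 25 or COMPLEX_HINTS.search(question):
        return "complex"
    return "simple"


def fixed_rag(question: str) -> str:
    """Stub for the fast single-search pipeline."""
    return f"fixed:{question}"


def agentic_rag(question: str) -> str:
    """Stub for the full agentic loop."""
    return f"agentic:{question}"


def answer(question: str) -> str:
    """Send each question down the path the router picks."""
    if classify(question) == "complex":
        return agentic_rag(question)
    return fixed_rag(question)
```

A heuristic this blunt will misroute some questions. Erring toward "complex" costs latency and money; erring toward "simple" costs accuracy, so log the router's decisions and check them against your evals.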

Common pitfalls

Handing the model the wheel creates failure modes a fixed pipeline never has. Watch for these:

  • Runaway loops. Without a cap, an agent can search, search, search and never decide it's done — burning tokens and time. Always set a hard ceiling on tool-call rounds (e.g. stop after 5) as a safety net.
  • Over-retrieval. The model searches when it didn't need to, dragging irrelevant chunks into the prompt and degrading the answer. Good tool descriptions and a routing step reduce this.
  • Lazy single search. The opposite problem: the model searches once, gets weak results, and answers anyway instead of retrying. Usually a sign your tool description doesn't invite re-querying.
  • Cost blowups. Variable cost is hard to forecast. One pathological question can trigger ten searches. Budget by capping rounds and monitoring per-query tool-call counts.
  • Latency surprises. Each loop is a serial round-trip. At roughly three seconds per round, five sequential searches can mean a 15-second wait. Stream intermediate progress so users see it working instead of staring at a spinner.
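
The hard ceiling from the first bullet can be sketched as a small wrapper around the loop. To keep it runnable offline, the model is abstracted behind a hypothetical `step(messages, allow_tools)` callable that returns either `{"query": ...}` to request a search or `{"answer": ...}` to finish. How you actually switch tools off for the last call depends on your LLM API.

```python
from typing import Callable

MAX_ROUNDS = 5  # assumed ceiling; tune it against your latency and cost budget


def run_capped(
    step: Callable[[list[dict], bool], dict],
    search: Callable[[str], str],
    question: str,
    max_rounds: int = MAX_ROUNDS,
) -> tuple[str, int]:
    """Run an agent loop with a hard ceiling on search rounds.

    Returns the final answer and the number of searches performed.
    """
    messages = [{"role": "user", "content": question}]
    searches = 0
    while True:
        # Once the budget is spent, the model must answer without tools.
        allow_tools = searches < max_rounds
        reply = step(messages, allow_tools)
        if "answer" in reply or not allow_tools:
            fallback = "I could not find enough information."
            return reply.get("answer", fallback), searches
        searches += 1
        messages.append({"role": "assistant", "content": f"search: {reply['query']}"})
        messages.append({"role": "user", "content": search(reply["query"])})


def greedy_step(messages: list[dict], allow_tools: bool) -> dict:
    """Stub model that never feels done: it searches whenever it is allowed."""
    if allow_tools:
        return {"query": f"attempt {len(messages) // 2 + 1}"}
    return {"answer": "Best-effort answer from what was found."}


answer, used = run_capped(greedy_step, lambda q: "no results", "any question")
print(answer, used)
```

Returning the search count alongside the answer gives you the per-query number to monitor, so questions that keep hitting the ceiling show up in your metrics.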

Going deeper

The intellectual roots are ReAct. The 2022 paper ReAct: Synergizing Reasoning and Acting in Language Models (Yao et al.) formalized the interleaving of "reason about what to do, take an action, observe the result, reason again." Agentic RAG is ReAct with retrieval as the primary action. Most agent frameworks' default loop is a ReAct variant under the hood, which is why understanding ReAct demystifies almost every agentic-RAG library.

Self-correcting variants are the active frontier. A well-known family — sometimes implemented as self-RAG or corrective RAG style loops — adds explicit grading steps: after retrieval, a model (or a cheap classifier) scores whether each chunk is relevant and whether the draft answer is actually supported by the sources. Fail the relevance check, and the agent rewrites the query and retries; fail the support check, and it searches for evidence before committing. These loops trade more LLM calls for far fewer hallucinations, and they pair naturally with LLM-as-a-judge grading.
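
A minimal sketch of the relevance half of such a loop: retrieve, grade each chunk, and rewrite the query if nothing passes. It is not the published Self-RAG or Corrective RAG method, only the general retry shape. The retriever, grader, and rewriter are all hypothetical stubs (a real grader and rewriter would be model or classifier calls), and the support check on the draft answer is left out.

```python
from typing import Callable

# Made-up knowledge base for the example.
KB = [
    "Enterprise plans get a 60 day refund window.",
    "Standard plans get a 30 day refund window.",
]


def retrieve(query: str) -> list[str]:
    """Stub retriever: return chunks sharing at least one word with the query."""
    terms = set(query.lower().split())
    return [c for c in KB if terms & set(c.lower().rstrip(".").split())]


def grade(question: str, chunk: str) -> bool:
    """Stub grader: pretends to know the question is about the enterprise tier."""
    return "enterprise" in chunk.lower()


def rewrite(question: str, last_query: str) -> str:
    """Stub rewriter: a real one would ask a model for better search terms."""
    return "enterprise refund window"


def corrective_retrieve(
    question: str,
    retrieve: Callable[[str], list[str]],
    grade: Callable[[str, str], bool],
    rewrite: Callable[[str, str], str],
    max_attempts: int = 3,
) -> tuple[list[str], int]:
    """Retry retrieval until some chunk passes the relevance grade.

    Returns the relevant chunks and the attempt number that produced them.
    """
    query = question
    for attempt in range(1, max_attempts + 1):
        relevant = [c for c in retrieve(query) if grade(question, c)]
        if relevant:
            return relevant, attempt
        query = rewrite(question, query)
    return [], max_attempts


chunks, attempts = corrective_retrieve(
    "How long for big-company money back?", retrieve, grade, rewrite
)
print(chunks, attempts)
```

Here the user's wording shares no terms with the knowledge base, so the first search comes back empty and the rewritten query succeeds on the second attempt. An empty list after `max_attempts` is the signal to say "I don't know" rather than answer from junk.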

Context-window pressure is the silent killer. Every loop appends the previous results, reasoning, and tool calls to the conversation. Five rounds of retrieval can balloon the context window past the point where the model attends to early content well. Production agentic RAG needs a context strategy: summarize older results, drop chunks that didn't help, or keep a running scratchpad instead of the raw transcript. This is context engineering, and it's what separates a demo from a system.
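
One of those strategies, compacting older results, might look like the sketch below. The record shape (`query` and `text` keys) is an assumption for the example, and plain truncation stands in for a real summarizer, which would more likely be another model call.

```python
def compact(
    results: list[dict], keep_last: int = 2, summary_chars: int = 60
) -> list[dict]:
    """Keep the newest `keep_last` results verbatim and shorten older ones.

    Returns a new list; the input records are not modified.
    """
    cutoff = max(len(results) - keep_last, 0)
    compacted = []
    for i, result in enumerate(results):
        text = result["text"]
        if i < cutoff and len(text) > summary_chars:
            # Truncation is a placeholder for a real summary of the result.
            text = text[:summary_chars].rstrip() + " [truncated]"
        compacted.append({"query": result["query"], "text": text})
    return compacted


# Four rounds of retrieval, 200 characters of results each.
history = [{"query": f"q{i}", "text": "x" * 200} for i in range(4)]
for item in compact(history):
    print(item["query"], len(item["text"]))
```

Keeping every query while shrinking old result text is deliberate: the model can still see what it already searched for, which helps it avoid repeating a query.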

Production concerns multiply. Non-deterministic paths make caching harder — the same question can take a different route twice, so semantic caching has to key on more than the raw query. Observability becomes essential: you need to trace every tool call, its query, and its results to debug why an agent looped or answered from the wrong source. And evaluation gets harder — you're no longer scoring one retrieval but a whole trajectory, which means measuring search quality, decision quality, and final-answer quality separately.
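
On the observability side, even a small trace record goes a long way. The sketch below is one possible shape, not a standard: the `ToolCall` and `Trace` names and their fields are invented for this example, and a production system would more likely send these records to a tracing backend than hold them in memory.

```python
import json
import time
from dataclasses import asdict, dataclass, field


@dataclass
class ToolCall:
    tool: str
    query: str
    n_results: int
    latency_s: float


@dataclass
class Trace:
    question: str
    calls: list[ToolCall] = field(default_factory=list)
    answer: str = ""

    def record(self, tool: str, query: str, results: list, started: float) -> None:
        """Log one tool call; `started` is a time.perf_counter() timestamp."""
        latency = round(time.perf_counter() - started, 4)
        self.calls.append(ToolCall(tool, query, len(results), latency))

    def summary(self) -> dict:
        """Per-query numbers worth monitoring across traffic."""
        return {
            "tool_calls": len(self.calls),
            "empty_searches": sum(1 for c in self.calls if c.n_results == 0),
            "queries": [c.query for c in self.calls],
        }


trace = Trace("How long do I have to refund an enterprise plan?")
t0 = time.perf_counter()
trace.record("search_kb", "refund enterprize", [], t0)  # misspelled query, no hits
t0 = time.perf_counter()
trace.record("search_kb", "enterprise refund window", ["chunk"], t0)
trace.answer = "60 days"
print(json.dumps(asdict(trace), indent=2))
print(trace.summary())
```

The sequence of queries is the trajectory. Storing it separately from the final answer lets you score search quality and answer quality independently.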

The standardization story is MCP. As agents gain more retrieval sources — internal docs, ticketing systems, web search, databases — wiring each one in by hand doesn't scale. The Model Context Protocol is an emerging open standard for exposing tools and data sources to LLMs through a common interface, so an agent can pick up a new retrieval source without bespoke glue code. Where agentic RAG is heading is fewer hand-wired tools and more pluggable, standardized ones — the model orchestrating a marketplace of retrievers rather than the two or three you happened to hard-code.

FAQ

What is the difference between agentic RAG and traditional RAG?

Traditional RAG runs retrieval as a fixed step: it always searches once, uses the user's query as-is, and generates from whatever top-k chunks come back. Agentic RAG turns retrieval into a tool the LLM calls in a loop — the model decides whether to search, rewrites the query, grades the results, retries if they're weak, and may search several times before answering. The mechanics are the agent loop applied to search.

When should I use agentic RAG instead of a fixed pipeline?

Use it when your questions are multi-part, vaguely phrased, span multiple sources, or need several hops to answer — the cases where a single top-k search fails. Stick with traditional RAG for simple, well-phrased, single-hop lookups, where the agentic loop just adds latency and cost. A common compromise is a router that sends easy questions down the fast fixed path and only hard ones into the agentic loop.

Is agentic RAG slower and more expensive than normal RAG?

Usually yes. Each loop is another LLM round-trip: a query that triggers four sequential searches needs about five model calls (four tool-use turns plus the final answer), against one for a fixed pipeline, and each call re-sends a longer transcript. The cost is also variable and harder to forecast. It pays off when those extra searches convert wrong or incomplete answers into right ones, not on simple lookups where one search already suffices.
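
A back-of-the-envelope version of that arithmetic, assuming the searches run one after another so that n searches need n + 1 model calls, and that every call re-sends the whole transcript. The token figures (500 for the base prompt, 800 added per search round) are invented for illustration, and the estimate ignores output tokens and any prompt caching.

```python
def agentic_cost(
    n_searches: int, base_tokens: int = 500, tokens_per_round: int = 800
) -> tuple[int, int]:
    """Return (model calls, total input tokens) for n sequential searches."""
    calls = n_searches + 1  # n tool-use turns plus the final answer
    # Call i re-sends the base prompt plus the i rounds accumulated so far.
    input_tokens = sum(base_tokens + i * tokens_per_round for i in range(calls))
    return calls, input_tokens


def fixed_cost(base_tokens: int = 500, tokens_per_round: int = 800) -> tuple[int, int]:
    """One call: the base prompt plus a single set of retrieved chunks."""
    return 1, base_tokens + tokens_per_round


print(agentic_cost(4))  # (5, 10500)
print(fixed_cost())     # (1, 1300)
```

Under these assumptions, four searches mean five calls but about eight times the input tokens of the fixed path, because the transcript grows each round. Input tokens grow roughly with the square of the number of rounds, which is one more reason to cap them.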

Do I need a framework like LangChain or LlamaIndex to build agentic RAG?

No. The core loop is short: expose a search function as a tool and keep calling the model while it requests tool use, as in the code example above. Frameworks like LangGraph (from the LangChain team) and LlamaIndex add prebuilt patterns for retries, routing, grading, and parallel retrieval, which save time once your needs grow. Building the loop by hand first is the best way to understand what those frameworks are doing for you.

How do I stop an agentic RAG agent from looping forever?

Set a hard cap on the number of tool-call rounds — for example, force a final answer after five searches — as a safety net independent of the model's own judgment. Pair that with a clear tool description that tells the model when results are good enough to stop, and monitor the per-query tool-call count so you can catch questions that consistently hit the ceiling.

Further reading