In plain English
ReAct is a prompting pattern that tells an LLM to alternate between two modes: thinking and doing. Before calling a tool the model writes a short reasoning trace ("Thought"). After the tool returns a result ("Observation"), the model thinks again. The cycle repeats until the model has enough information to write a final answer.
The name is a portmanteau: Reasoning + Acting. It was introduced in a 2022 paper (arXiv:2210.03629) by Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao of Princeton and Google Brain. Since then it has become the default pattern that almost every agent framework implements under the hood.
A useful analogy: imagine a detective who, before every interview, writes a short note in their casebook about what they expect to learn and why. After the interview they write down what actually happened and update their theory. Compare that to a detective who just acts on gut instinct, never writes anything down, and forgets earlier clues by the time a new one arrives. ReAct is the casebook habit applied to LLMs.
Why it matters
Before ReAct, two approaches existed for making LLMs more capable. Chain-of-thought (CoT) prompting asked the model to reason step by step before answering — better accuracy, but all reasoning happened inside the model's head with no ability to check the outside world. Tool use without explicit reasoning let models call APIs and functions, but without narrated thoughts the model often called the wrong tool or misread returned data.
ReAct is the combination that made both better. The reasoning trace tells the model why it is calling a tool, which reduces wrong tool choices. The tool call fetches real data, which grounds the next reasoning step in facts rather than hallucination. The original paper reported results on four benchmarks: multi-hop question answering (HotpotQA), fact verification (FEVER), text-based game navigation (ALFWorld), and online shopping (WebShop). ReAct beat the action-only baseline on all four. Against chain-of-thought the picture was more mixed: ReAct led on FEVER, trailed slightly on HotpotQA, and the best question-answering scores came from combining ReAct with chain-of-thought self-consistency.
Why builders care about it specifically
- Debuggability — every step of reasoning is written down, so when a run fails you can read the transcript and pinpoint exactly where the model went wrong.
- Grounding — answers are built from tool output, not from the model's training-time memory, which is stale and hallucination-prone for factual questions.
- Flexibility — the same loop works with any tool set: web search, code execution, database queries, REST APIs. Swap the tools; the pattern is identical.
- Composability — ReAct nodes can be nested inside larger orchestration graphs (see LangGraph) or delegated to sub-agents without changing the core loop.
- Framework convergence — because every major framework uses it, learning ReAct once transfers directly to LangChain, LlamaIndex, LangGraph, the Anthropic Agent SDK, and OpenAI Agents SDK.
How it works
A ReAct run has three repeating units. The model produces the first two, a Thought and then an Action; your harness code runs the tool and appends the third, the Observation. The cycle continues until the model stops requesting tools.
- Thought: a free-text reasoning trace written by the model. It is not an action; it is the model planning the next action. Example: "I need the current price of gold. I'll search for it now."
- Action: a structured tool call, e.g. web_search("gold price today"). The format depends on the framework; modern SDKs use a JSON tool-use block rather than text labels.
- Observation: the result returned by the tool, injected into the context by your harness code. The model reads it on the next turn and uses it to update its Thought.
When the model has enough information it omits the Action and writes a Final Answer instead. Your harness detects the absence of a tool call and returns the answer to the caller.
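The loop described above can be sketched without any SDK. This is a minimal illustration, not a reference implementation: `scripted_model`, `TOOLS`, and `run_react` are hypothetical names, and the "model" is a hard-coded stand-in that requests one search and then answers, so the control flow can run offline.

```python
from typing import Callable, Optional

# Hypothetical tool registry: tool name -> plain Python function.
TOOLS: dict[str, Callable[[str], str]] = {
    "web_search": lambda query: f"[stub result for: {query}]",
}

def scripted_model(transcript: list[dict]) -> dict:
    """Stand-in for an LLM call: request one search, then answer."""
    observations = [s for s in transcript if s["kind"] == "observation"]
    if not observations:
        return {
            "thought": "I need the current price of gold. I'll search for it now.",
            "action": {"tool": "web_search", "input": "gold price today"},
        }
    return {
        "thought": "The search result has what I need.",
        "final_answer": f"Based on the search: {observations[-1]['text']}",
    }

def run_react(task: str, model=scripted_model, max_steps: int = 10) -> tuple[Optional[str], list[dict]]:
    transcript = [{"kind": "task", "text": task}]
    for _ in range(max_steps):
        step = model(transcript)
        transcript.append({"kind": "thought", "text": step["thought"]})
        action = step.get("action")
        if action is None:
            # No Action requested: treat this turn as the Final Answer.
            return step["final_answer"], transcript
        transcript.append({"kind": "action", "text": f"{action['tool']}({action['input']!r})"})
        # The harness, not the model, runs the tool and writes the Observation.
        result = TOOLS[action["tool"]](action["input"])
        transcript.append({"kind": "observation", "text": result})
    return None, transcript  # step cap reached without a Final Answer

answer, transcript = run_react("What is the price of gold today?")
```

Swapping `scripted_model` for a real LLM call leaves the loop unchanged; only the function that produces each step differs.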
A minimal annotated trace
This is what the raw text version of a ReAct trace looks like (the original paper used text labels; modern SDKs use structured blocks, but the logical structure is identical):
Task: What company made the AI model that scored highest on MMLU as of early 2024?
Thought: I need to find which AI model scored highest on MMLU in early 2024 and who made it. I'll search for that.
Action: web_search["highest MMLU score AI model 2024"]
Observation: GPT-4 scored around 86.4%, Gemini Ultra scored 90.0%, Claude 3 Opus scored 86.8% ...
Thought: Gemini Ultra's 90.0% is the highest listed. Gemini Ultra is made by Google DeepMind.
Final Answer: Google DeepMind, with Gemini Ultra scoring 90.0% on MMLU.
Notice that the Thought before the Action is doing real planning work: it decides which query to run. The Thought after the Observation is doing real synthesis work: it identifies the answer from what was returned. Neither step is decorative.
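A harness that works with the text-label format has to parse those labels back out of the model's output. The sketch below shows one way to do that with regular expressions. The function names and the exact patterns are assumptions made for illustration, and the action syntax is taken from the trace above rather than from any particular framework.

```python
import re

# Matches one labelled line such as 'Action: web_search["..."]'.
LINE_RE = re.compile(r"^(Thought|Action|Observation|Final Answer):\s*(.*)$")
# Matches the action syntax tool_name["argument"]; the quotes are optional.
ACTION_RE = re.compile(r'^(\w+)\[\s*"?(.*?)"?\s*\]$')

def parse_trace(text: str) -> list[tuple[str, str]]:
    """Return (label, content) pairs in order; unlabelled lines are skipped."""
    steps = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            steps.append((match.group(1), match.group(2)))
    return steps

def parse_action(content: str) -> tuple[str, str]:
    """Split 'web_search["query"]' into ('web_search', 'query')."""
    match = ACTION_RE.match(content.strip())
    if match is None:
        raise ValueError(f"unrecognised action syntax: {content!r}")
    return match.group(1), match.group(2)

trace = """Thought: I need to find which AI model scored highest on MMLU in early 2024.
Action: web_search["highest MMLU score AI model 2024"]
Observation: GPT-4 scored around 86.4%, Gemini Ultra scored 90.0% ...
Thought: Gemini Ultra's 90.0% is the highest listed.
Final Answer: Google DeepMind, with Gemini Ultra scoring 90.0% on MMLU."""

steps = parse_trace(trace)
tool, argument = parse_action(steps[1][1])
```

Text parsing like this is brittle when the model drifts from the expected format, which is one reason structured tool-use blocks replaced it in modern SDKs.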
How modern SDKs implement this
In a framework like LangChain or the Anthropic Agent SDK you do not write "Thought:" into your prompt manually. Instead, the SDK formats the system prompt, presents the tool list as structured schemas, and interprets the model's tool_use stop reason as the Action. The Observation is a tool_result message. The Thought is embedded in the model's text output before the tool call. The logical structure is ReAct; the wire format is JSON.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

tools = [
    {
        "name": "web_search",
        "description": "Search the web for current facts.",
        "input_schema": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"]
        }
    }
]

def run_tool(name, args):
    # Replace with a real search call in production
    return f"[stub result for: {args['query']}]"

messages = [{"role": "user", "content": "Who made the highest-scoring MMLU model in 2024?"}]

for _ in range(10):  # step cap prevents runaway loops
    resp = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        tools=tools,
        messages=messages
    )
    # Keep the assistant turn (Thought text plus any tool_use blocks) in context
    messages.append({"role": "assistant", "content": resp.content})

    if resp.stop_reason != "tool_use":
        # No tool requested: the text blocks are the Final Answer
        for block in resp.content:
            if hasattr(block, "text"):
                print("Answer:", block.text)
        break

    # Run every requested tool and send the results back as Observations
    tool_results = []
    for block in resp.content:
        if block.type == "tool_use":
            result = run_tool(block.name, block.input)
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": result
            })
    messages.append({"role": "user", "content": tool_results})

Why ReAct became the blueprint
Dozens of agent patterns existed before and after ReAct. Chain-of-thought, scratchpad prompting, tool-augmented language models, MRKL systems — all predated it. What made ReAct the one that stuck?
It solved the right problem at the right time
The paper appeared in late 2022, just as GPT-3.5 and the first wave of tool-use experiments were showing that LLMs could call external functions. Builders immediately hit the same failure mode: the model would call the wrong tool, ignore the result, or repeat the same bad call. ReAct's Thought step addressed all three: stating intent before acting reduced wrong tool choices, and processing the Observation before acting again reduced both ignored results and repeated calls.
It maps cleanly onto how APIs work
Function calling in OpenAI's API (launched in June 2023) and tool use in Claude's API both follow a request-response model: the model requests a tool, the harness runs it, the result comes back. That is exactly the Action-Observation half of ReAct. Grafting the Thought step onto the existing API mechanics required almost no new infrastructure — just a system prompt instruction and a loop in the caller's code.
It was simple enough to re-implement from scratch
ReAct has two moving parts: a system prompt that instructs the model to narrate its reasoning, and a loop that routes tool calls to real functions. That's it. A developer who has never read the paper can implement it in under 50 lines. That low barrier to entry meant it spread quickly through blog posts, tutorials, and framework code, often reaching developers who had never read the paper.
| Approach | Reasoning? | External tools? | Adapts mid-task? |
|---|---|---|---|
| Direct prompting | No | No | No |
| Chain-of-thought | Yes | No | No |
| Tool use without CoT | No | Yes | Partially |
| ReAct | Yes | Yes | Yes |
| Plan-and-Execute | Yes (upfront) | Yes | Weakly |
| Reflexion | Yes + self-eval | Yes | Yes (across trials) |
ReAct vs function calling agents
A common point of confusion: is a "function calling agent" different from a ReAct agent? The short answer is that they are the same loop, expressed at different layers.
Function calling is the API mechanism: the model returns a structured tool_use block instead of plain text, your code runs the function, and the result comes back as a tool_result message. This handles the Action-Observation pair mechanically and robustly.
ReAct adds the Thought layer on top: the model's text output before the tool call acts as the narrated reasoning trace. In a pure function-calling agent without explicit ReAct framing, the reasoning is either absent (the model jumps straight to the tool call) or implicit (the model adds a text preamble but it isn't structurally required).
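The Action-Observation half of the loop comes down to a dispatcher: take a tool_use block, call the matching function, and wrap the outcome as a tool_result. The sketch below uses plain dictionaries that mirror the block shapes described above. `get_weather`, `REGISTRY`, and `dispatch` are hypothetical names, and reporting failures through an `is_error` flag is one reasonable convention rather than a requirement.

```python
import json

def get_weather(city: str) -> str:
    """Hypothetical tool used only for this illustration."""
    return f"Sunny in {city}"

# Hypothetical registry mapping tool names to callables.
REGISTRY = {"get_weather": get_weather}

def dispatch(tool_use: dict) -> dict:
    """Turn one tool_use block into the matching tool_result block."""
    result = {"type": "tool_result", "tool_use_id": tool_use["id"]}
    func = REGISTRY.get(tool_use["name"])
    if func is None:
        # Report the problem to the model instead of raising in the harness.
        result["content"] = json.dumps({"error": f"unknown tool: {tool_use['name']}"})
        result["is_error"] = True
        return result
    try:
        result["content"] = func(**tool_use["input"])
    except Exception as exc:  # broad on purpose: any tool failure becomes an Observation
        result["content"] = json.dumps({"error": str(exc)})
        result["is_error"] = True
    return result

ok = dispatch({"type": "tool_use", "id": "call_1", "name": "get_weather", "input": {"city": "Oslo"}})
bad = dispatch({"type": "tool_use", "id": "call_2", "name": "missing_tool", "input": {}})
```

Nothing here involves a Thought; the dispatcher behaves the same whether or not the model narrated its reasoning first.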
When does the distinction matter?
- Simple, well-defined tasks — if every step is predictable ("look up record, transform it, write output"), explicit Thought traces add tokens and latency without much benefit. Pure function calling is faster.
- Complex, multi-hop tasks — when the model needs to synthesize results from several tools, or when the right tool depends on what a previous tool returned, narrated Thoughts dramatically reduce wrong-turn errors.
- Debugging and auditability — if you need a human-readable record of why the agent did what it did, explicit Thoughts are invaluable. A pure function-calling log tells you what happened; a ReAct trace tells you why.
- Models with extended thinking — reasoning models (like Claude with extended thinking, or OpenAI o1/o3) generate internal reasoning traces automatically. For these models, prompting for explicit Thought labels is redundant; the function-calling loop alone is sufficient.
Going deeper
Once the basic ReAct loop is working, the engineering challenges shift from correctness to reliability, cost, and scale. Here are the threads most worth pulling.
Reflexion: ReAct plus self-critique
Reflexion (Shinn et al., 2023) extends ReAct by adding a self-evaluation step after each complete task attempt. The agent reflects on what went wrong, stores a verbal summary of the lesson in memory, and tries again. On benchmarks that permit multiple attempts, Reflexion significantly outperforms plain ReAct. The tradeoff is complexity: you now need persistent memory across trials, a dedicated reflection prompt, and at least two LLM calls per attempt.
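The control flow of that retry-with-reflection idea can be sketched as follows. This is an illustrative outline, not the paper's implementation: every function name is hypothetical, and the three stubs stand in for what would be a full ReAct run, an evaluator, and a reflection prompt.

```python
from typing import Callable, Optional

def reflexion_loop(
    attempt: Callable[[str, list], str],
    evaluate: Callable[[str], bool],
    reflect: Callable[[str, str], str],
    task: str,
    max_trials: int = 3,
) -> tuple[Optional[str], list]:
    """Retry a task, carrying verbal lessons from failed trials forward."""
    lessons: list = []  # memory that persists across trials
    for _ in range(max_trials):
        answer = attempt(task, lessons)        # one full ReAct run (stubbed below)
        if evaluate(answer):                   # success check, e.g. tests or a grader
            return answer, lessons
        lessons.append(reflect(task, answer))  # extra LLM call: what went wrong?
    return None, lessons

# Stubs standing in for LLM calls so the control flow can run offline.
def stub_attempt(task, lessons):
    return "42" if lessons else "41"

def stub_evaluate(answer):
    return answer == "42"

def stub_reflect(task, answer):
    return f"Answer {answer} was wrong; re-check the arithmetic."

result, memory = reflexion_loop(stub_attempt, stub_evaluate, stub_reflect, "What is 6 * 7?")
```

The `lessons` list is the persistent memory mentioned above: it is the only state that survives from one trial to the next.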
Plan-and-Execute: when you know the shape of the task
In Plan-and-Execute (sometimes called Plan-and-Solve), a planner LLM call writes out all steps upfront before execution begins. This works well for structured tasks with predictable step sequences and makes the plan auditable before any actions are taken. The limitation: if step 3 uncovers unexpected data that invalidates step 4, replanning requires a second planner call. ReAct's implicit per-step replanning is more adaptive for exploratory tasks.
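A small sketch makes the replanning cost concrete. The planner and executor below are hypothetical stubs (a real planner would be an LLM call), and the step names are invented for the example. The point is that a failed step triggers a second planner call rather than a local adjustment.

```python
def plan(task: str, failed_step=None) -> list:
    """Stub planner: a real one would be an LLM call returning ordered steps."""
    if failed_step is None:
        return ["fetch_record", "transform", "write_output"]
    # Replanning: a second planner call that routes around the failed step.
    return ["fetch_backup_record", "transform", "write_output"]

def execute(step: str) -> bool:
    """Stub executor: pretend the primary fetch fails and everything else works."""
    return step != "fetch_record"

def plan_and_execute(task: str, max_replans: int = 1):
    planner_calls = 1
    steps = plan(task)  # the full plan exists before anything runs
    done = []
    while steps:
        step = steps.pop(0)
        if execute(step):
            done.append(step)
            continue
        if planner_calls > max_replans:
            raise RuntimeError(f"step {step!r} failed and the replan budget is spent")
        planner_calls += 1
        steps = plan(task, failed_step=step)  # the remaining plan is rewritten
    return done, planner_calls

completed, planner_calls = plan_and_execute("sync customer record")
```

In a ReAct loop the same recovery would happen inside the next Thought, with no separate planner call to count.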
ReAct inside multi-agent graphs
In frameworks like LangGraph, individual agents are nodes in a directed graph. Each node can itself be a ReAct loop with its own tool set. An orchestrator node routes tasks to specialist sub-agents, each running their own Thought-Action-Observation cycles. The outer graph handles flow control (branching, parallelism, retry logic) while each inner node handles its portion of the work with standard ReAct mechanics. This composability is why ReAct remains relevant even as orchestration frameworks become much more complex.
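The nesting can be illustrated in plain Python without any graph framework. This sketch deliberately does not use LangGraph's API: the agent functions, `NODES`, `route`, and `orchestrate` are all hypothetical, and each "agent" is a one-line stand-in for what would be a complete ReAct loop.

```python
from typing import Callable

def research_agent(task: str) -> str:
    """Stand-in for a ReAct loop that owns search tools."""
    return f"research notes for: {task}"

def coding_agent(task: str) -> str:
    """Stand-in for a ReAct loop that owns code-execution tools."""
    return f"patch for: {task}"

# Graph nodes: each value could be a full Thought-Action-Observation loop.
NODES: dict[str, Callable[[str], str]] = {
    "research": research_agent,
    "coding": coding_agent,
}

def route(task: str) -> str:
    """Stub router: a real orchestrator would usually ask an LLM to choose."""
    return "coding" if "bug" in task.lower() else "research"

def orchestrate(tasks: list[str]) -> list[tuple[str, str]]:
    """Outer graph: pick a node per task and collect (node, output) pairs."""
    results = []
    for task in tasks:
        node = route(task)
        results.append((node, NODES[node](task)))
    return results

outputs = orchestrate(["Fix the login bug", "Summarise ReAct benchmarks"])
```

The orchestrator only sees each node's input and output, which is what lets the inner loops change without touching the outer flow control.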
Reliability math and the step-cap discipline
Error rates compound across steps. If each step of a ReAct run is 95% reliable (a generous estimate for production agents) and steps fail independently, a 10-step task completes correctly only about 60% of the time (0.95^10 ≈ 0.599). That means: keep tasks short, validate tool results immediately, design tools that return structured errors rather than ambiguous text, and set a per-tool retry limit in addition to the global step cap. The most robust production agents also add a dedicated "check" step: an LLM call that evaluates whether the last Observation makes sense before proceeding.
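The arithmetic is easy to reproduce, and inverting it gives a rough step budget. The helper names below are made up for this example, and the model assumes every step succeeds or fails independently with the same probability, which real agents only approximate.

```python
import math

def run_success_rate(step_reliability: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return step_reliability ** steps

def max_steps_for(target: float, step_reliability: float) -> int:
    """Largest step count whose end-to-end success rate stays at or above target."""
    return math.floor(math.log(target) / math.log(step_reliability))

ten_step = run_success_rate(0.95, 10)  # about 0.599
budget = max_steps_for(0.80, 0.95)     # 4 steps keeps the run at or above 80%
```

At 95% per step, an 80% end-to-end target leaves room for only four steps, which is the quantitative case for keeping tasks short.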
Observability tools
Every serious ReAct deployment needs trace logging. Each turn should record: the full Thought text, the Action (tool name + raw inputs), and the Observation (truncated if large). LangSmith (from LangChain), Langfuse, and Weave (from Weights & Biases) are purpose-built for this. When a run fails, the first debugging move is to read the trace and find the first Thought that contains a false assumption; that is very often the root cause.
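If you are not using one of those tools yet, a JSON-lines record per turn covers the three fields listed above. The schema, the field names, and the 500-character cut-off in this sketch are assumptions chosen for illustration, not a format that any of the named tools expects.

```python
import json
import time

MAX_OBSERVATION_CHARS = 500  # assumed cut-off; tune for your log store

def trace_record(run_id: str, step: int, thought: str, tool: str,
                 tool_input: dict, observation: str) -> str:
    """Serialize one ReAct turn as a single JSON line."""
    record = {
        "run_id": run_id,
        "step": step,
        "ts": time.time(),
        "thought": thought,                             # full text, never truncated
        "action": {"tool": tool, "input": tool_input},  # raw inputs as sent
        "observation": observation[:MAX_OBSERVATION_CHARS],
        "observation_truncated": len(observation) > MAX_OBSERVATION_CHARS,
    }
    return json.dumps(record)

line = trace_record("run-1", 0, "I need the gold price.", "web_search",
                    {"query": "gold price today"}, "x" * 2000)
parsed = json.loads(line)
```

Keeping the Thought untruncated matters most, since that is the field you read first when tracing a failure back to its cause.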
When to skip ReAct entirely
ReAct adds overhead: extra tokens, extra latency, extra cost. If the task has a fixed, predictable sequence of steps, a hard-coded pipeline is faster and cheaper. If only a single tool call is needed, the Thought step is pure overhead. If the model has all the required information in its training data (rare for factual questions, common for transformation tasks like summarization or code formatting), a single well-crafted prompt with no tools beats ReAct on every metric. The pattern earns its cost only when the steps genuinely depend on what the tools return — when you cannot know the path until you start walking it.
FAQ
What does ReAct stand for in AI?
ReAct stands for Reasoning and Acting. It is a prompting pattern introduced in the 2022 paper arXiv:2210.03629 by Yao et al. The model emits a Thought (reasoning) before each Action (tool call), then reads the tool's Observation before reasoning again. The cycle repeats until the model produces a Final Answer.
How is ReAct different from chain-of-thought prompting?
Chain-of-thought prompting adds reasoning steps within a single prompt-answer turn but cannot call external tools or check facts mid-reasoning. ReAct extends chain-of-thought with an Action-Observation loop: after each reasoning step the model can call a real tool, read the actual result, and update its reasoning. The Observation grounds the answer in live data rather than training-time memory.
Do I need to write Thought: and Action: labels in my prompts?
Not in most modern frameworks. LangChain, LangGraph, the Anthropic Agent SDK, and the OpenAI Agents SDK all handle the structural formatting for you. The model's text output before a tool call serves as the Thought, and tool calls are expressed as structured JSON rather than text labels. The labels were used in the original paper's text-only experiments; today they are implementation details hidden by the framework.
What is a ReAct agent's biggest failure mode?
Error compounding across steps. If each step is 95% reliable, a 10-step task succeeds only about 60% of the time. Additionally, hallucinated Thoughts can contaminate later reasoning — the model may invent a false "fact" in a Thought and then treat it as true even after an Observation contradicts it. Always treat Thoughts as plans, not facts; Observations are the only ground truth.
Is ReAct the same as a function calling agent?
They overlap heavily. Function calling is the API mechanism that handles the Action-Observation pair. ReAct adds the explicit Thought layer on top — the narrated reasoning step before each tool call. Most modern agent frameworks combine both: structured function-calling APIs for reliability, and ReAct-style Thought prompting for better planning on complex tasks.
Why do most agent frameworks still use ReAct in 2025 and 2026?
Because it remains the simplest pattern that delivers debuggability, grounding, and adaptive replanning together. Successors like Reflexion and Plan-and-Execute improve on ReAct in specific scenarios but add complexity. ReAct's core loop — Thought, Action, Observation, repeat — is simple enough to implement from scratch in 50 lines, which is why it became the common language of agent engineering.