AI/TLDR

What Is Tool Use in AI Agents? From Function Calls to Actions

Learn how a text-only model gains the ability to search, run code, and call APIs — and what actually happens during a tool call.

BEGINNER · 12 MIN READ · UPDATED 2026-06-11

In plain English

A language model, on its own, only does one thing: it reads text and predicts more text. It can't check today's weather, look up your order, run a calculation it can verify, or save a file. It has no hands and no eyes — just a frozen snapshot of whatever it read during training. Ask it for the current price of a stock and it will confidently make one up, because guessing the next word is the only move it has.

Tool use is how you give that model hands. You describe a set of actions it's allowed to take — search_web, get_weather, run_sql, send_email — and the model, instead of answering directly, can say: "I'd like to call get_weather with the input {"city": "Tokyo"}." Your code runs the real function, gets a real answer, hands it back, and the model continues with facts instead of guesses.

Think of a smart phone-support agent locked in a windowless room. They're sharp and articulate, but they can't see your account — so you give them a phone with a few speed-dial buttons: check balance, reset password, look up order. They don't press the buttons themselves; they tell you which button to press and what to type, you do it, and you read the result back. The agent's brain plus your buttons is a system that can actually get things done. The model is the agent. The tools are the speed-dial buttons. Tool use is the protocol for asking you to press one.

Why it matters

A raw model has three hard limits, and tool use punches through all of them. It's stuck in time — it doesn't know anything that happened after training, so it can't tell you who won last night's game. It's unreliable at exact work — arithmetic, sorting, precise lookups — because it's pattern-matching, not computing. And it's trapped in the chat box — it can read and write text, but it can't touch your database, your filesystem, or any external service.

Give it tools and each limit dissolves. A search tool fixes the knowledge cutoff. A code tool or calculator fixes the math. A database or API tool connects it to live systems. The model stops being a clever parrot and becomes a coordinator that knows which action to take and when, while real, deterministic code does the parts computers are good at.

Who should care:

  • Anyone building a chatbot that does more than chat — booking, ordering, looking things up in your data — needs tools to reach the systems behind those actions.
  • Anyone building RAG is already doing a flavor of this: "retrieve relevant documents" is a tool the agent can call when it needs facts.
  • Anyone using a coding assistant like Claude Code is watching tool use up close — every file it reads, every command it runs, every edit it makes is a tool call.

What did it replace? The fragile workaround era. Before tool use was a first-class feature, people coaxed models into emitting a special string ("ACTION: search(...)") and wrote brittle parsers to catch it. It worked until the model phrased it slightly differently and the parser broke. Tool use standardized the whole dance: the model returns a structured request with a named tool and JSON arguments, which your code can parse far more reliably than free text. This is the same machinery as function calling, the API-level feature that makes it possible.
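To make the contrast concrete, here is a small sketch. The regex, the exact "ACTION:" spelling, and both helper names are illustrative assumptions, not any provider's real format.

```python
import json
import re

# Old style: scrape an action out of free text with a hand-written pattern.
ACTION_RE = re.compile(r'^ACTION: (\w+)\("(.*)"\)$')

def parse_action_string(text: str):
    """Return (tool_name, argument), or None if the text does not match exactly."""
    match = ACTION_RE.match(text.strip())
    return (match.group(1), match.group(2)) if match else None

def parse_structured_call(payload: str):
    """Return (tool_name, arguments) from a JSON tool request."""
    call = json.loads(payload)
    return call["name"], call["input"]

# The pattern works when the model phrases things exactly as expected...
exact = parse_action_string('ACTION: search("tokyo weather")')
# ...and silently fails on a small variation in wording.
drifted = parse_action_string('Action: search(query="tokyo weather")')
# A structured request has named fields, so there is no phrasing to drift.
structured = parse_structured_call(
    '{"name": "search", "input": {"query": "tokyo weather"}}'
)

print(exact)       # ('search', 'tokyo weather')
print(drifted)     # None
print(structured)  # ('search', {'query': 'tokyo weather'})
```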

How it works

Tool use is a structured conversation between your code and the model. You don't hand the model live functions — you hand it descriptions: a name, a plain-English explanation of what each tool does, and a schema for its inputs (usually JSON Schema). The model reads those descriptions like a menu and decides whether to order.

Here's one full round trip. The user asks a question, the model emits a tool call instead of an answer, your code executes the real function, you feed the result back, and the model writes the final reply.

The crucial thing to internalize: the model never runs anything itself. It only ever produces text — but you've taught it to produce a special, structured chunk of text (a tool_use block) that means "please call this function for me." Your application is the one with real powers. The model points; your code pulls the trigger and reports back. That separation is what keeps tool use safe and debuggable — you can inspect, validate, or refuse any call before it runs.
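That separation can be sketched without any API at all. In the snippet below, the request dictionaries stand in for what a model might emit, and TOOL_REGISTRY and handle_tool_request are hypothetical names chosen for this illustration.

```python
# Hypothetical registry mapping tool names to real Python functions.
def get_weather(city: str) -> str:
    return {"Tokyo": "18\u00b0C, clear"}.get(city, "unknown")

TOOL_REGISTRY = {"get_weather": get_weather}

def handle_tool_request(request: dict) -> dict:
    """Inspect a model's tool request, then either run it or refuse it."""
    name = request.get("name")
    if name not in TOOL_REGISTRY:
        # Refuse: the model asked for something we never offered.
        return {"ok": False, "content": f"unknown tool: {name}"}
    # Only application code ever reaches this line and calls the function.
    result = TOOL_REGISTRY[name](**request.get("input", {}))
    return {"ok": True, "content": result}

allowed = handle_tool_request({"name": "get_weather", "input": {"city": "Tokyo"}})
refused = handle_tool_request({"name": "delete_everything", "input": {}})
print(allowed)
print(refused)
```

Because every request passes through one function, that function is a natural place to add logging or stricter checks.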

A few mechanical details that trip up beginners:

  • Tools are passed on every request. The model has no memory between calls, so the list of available tools rides along in each API request, the same as the conversation history.
  • The description is the interface. The model decides which tool to use purely from the name and description you wrote. Vague descriptions cause wrong or missing calls, so tool design is real engineering, not a label.
  • One turn can request several tools. Modern models can emit multiple tool calls at once ("check weather and look up flights") for you to run in parallel.
  • tool_choice controls the pressure. You can let the model decide (auto), force it to use some tool (any), force one specific tool, or forbid tools entirely. Forcing a specific tool is useful when you must get structured output.
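When one turn requests several tools, your code is free to run them concurrently. The following is a minimal sketch using Python's standard thread pool; the two tools, the call ids, and the shape of requested_calls are assumptions made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in tools; real ones would call external services.
def check_weather(city: str) -> str:
    return f"weather for {city}: clear"

def look_up_flights(destination: str) -> str:
    return f"3 flights found to {destination}"

TOOLS = {"check_weather": check_weather, "look_up_flights": look_up_flights}

# Two calls, as a model might request them in a single turn.
requested_calls = [
    {"id": "call_1", "name": "check_weather", "input": {"city": "Tokyo"}},
    {"id": "call_2", "name": "look_up_flights", "input": {"destination": "Tokyo"}},
]

def run_call(call: dict) -> dict:
    """Execute one requested call and tag the result with the call's id."""
    output = TOOLS[call["name"]](**call["input"])
    return {"tool_use_id": call["id"], "content": output}

# executor.map preserves input order, so results line up with the requests.
with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(run_call, requested_calls))

for result in results:
    print(result)
```

Threads suit tools that mostly wait on the network. Keeping the id on each result is what lets the model match answers to its requests.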

Tool use and the agent loop

One round trip is rarely enough for a real task. "Book me a table near the office on Friday" might need three tools — look up your office address, search restaurants nearby, then make a reservation — and the model can't know the second step's input until it sees the first step's output. So tool use almost always runs inside a loop.

The loop is the heart of every agent. The model calls a tool, sees the result, and decides what to do next, which might be another tool call or the final answer. It keeps going until it's done.

When the model decides it has enough to answer, it stops calling tools and returns plain text — that's the loop's exit. Everything sophisticated agents do — research across many sources, debug code by running tests, fix a failing build — is this same cycle running more times with better tools. Deciding which tool to reach for and in what order is agent planning, and when one agent's tools include "ask another agent," you've built a multi-agent system.
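The cycle can be sketched offline with a scripted stand-in for the model. Everything here is hypothetical: the three tools, their return values, the message shapes, and scripted_model, which simply picks the next step from how many tool results it has seen. A real model would make that choice itself.

```python
def find_office_address() -> str:
    return "12 Example Street"

def search_restaurants(near: str) -> str:
    return "Luigi's"

def make_reservation(restaurant: str, day: str) -> str:
    return f"Booked {restaurant} for {day}"

TOOLS = {
    "find_office_address": find_office_address,
    "search_restaurants": search_restaurants,
    "make_reservation": make_reservation,
}

def scripted_model(history: list) -> dict:
    """Stand-in for the model: each step depends on the previous results."""
    results = [m["content"] for m in history if m["role"] == "tool"]
    if len(results) == 0:
        return {"type": "tool_call", "name": "find_office_address", "input": {}}
    if len(results) == 1:
        return {"type": "tool_call", "name": "search_restaurants",
                "input": {"near": results[0]}}
    if len(results) == 2:
        return {"type": "tool_call", "name": "make_reservation",
                "input": {"restaurant": results[1], "day": "Friday"}}
    return {"type": "final", "text": f"Done: {results[2]}."}

def run_agent(user_request: str, max_steps: int = 10) -> str:
    history = [{"role": "user", "content": user_request}]
    for _ in range(max_steps):              # cap the loop so it cannot run forever
        step = scripted_model(history)
        if step["type"] == "final":         # plain text is the loop's exit
            return step["text"]
        output = TOOLS[step["name"]](**step["input"])
        history.append({"role": "tool", "content": output})
    return "Stopped: step limit reached."

answer = run_agent("Book me a table near the office on Friday")
print(answer)  # Done: Booked Luigi's for Friday.
```

Note that the second and third calls take their inputs from earlier outputs, which is why the steps cannot be planned in one shot.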

A hands-on example

Here's the whole pattern in real Python against the Claude API. We define one tool, send the user's question with the tool description, and handle the model's tool_use response by running our function and sending the result back. Read the comments: this five-step dance is all tool use ever is.

tool_use_demo.py
import anthropic

client = anthropic.Anthropic(api_key="sk-...")  # your key here

# 1. Describe the tool. The description is the interface — be precise.
tools = [{
    "name": "get_weather",
    "description": "Get the current temperature for a city, in Celsius.",
    "input_schema": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name, e.g. Tokyo"}
        },
        "required": ["city"],
    },
}]

# 2. The real function. The model never runs this — your code does.
def get_weather(city: str) -> str:
    fake_db = {"Tokyo": "18\u00b0C, clear", "Paris": "12\u00b0C, rain"}
    return fake_db.get(city, "unknown")

messages = [{"role": "user", "content": "What's the weather in Tokyo?"}]

# 3. First call: the model sees the tool and decides to use it.
resp = client.messages.create(
    model="claude-opus-4-8",
    max_tokens=1024,
    tools=tools,
    messages=messages,
)

# 4. If it asked for a tool, run it and feed the result back.
if resp.stop_reason == "tool_use":
    messages.append({"role": "assistant", "content": resp.content})
    tool_results = []
    for block in resp.content:
        if block.type == "tool_use":
            result = get_weather(**block.input)        # run the real function
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": block.id,               # link result to the call
                "content": result,
            })
    # All results for this turn go back together in one user message.
    messages.append({"role": "user", "content": tool_results})
    # 5. Second call: now the model answers using the real data.
    final = client.messages.create(
        model="claude-opus-4-8", max_tokens=1024, tools=tools, messages=messages,
    )
    print(final.content[0].text)  # e.g. "It's currently 18°C and clear in Tokyo."

Notice there are two API calls. The first returns a tool request, not an answer. You execute, append a tool_result (linked to the call by its id), and call again. A production agent wraps steps 3–5 in a while loop and keeps going until stop_reason is no longer "tool_use". That loop, plus a handful of well-described tools, is a working agent.

Tool use vs function calling vs MCP

Three terms cause endless confusion, so let's pin them down. They're not competitors — they're layers.

  • Function calling is the low-level API feature: the provider trained the model to output a structured request matching your schema. "Tool calling" and "function calling" are used interchangeably; some providers prefer "tool use" because tools aren't always plain functions.
  • Tool use is the concept those calls enable: letting a model act — searching, running code, hitting APIs, editing files. Function calling is the how; tool use is the what.
  • MCP (Model Context Protocol) is a standard for packaging and sharing tools so any agent can plug into any tool server — think USB-C for tools. You still use function calling under the hood; MCP just standardizes where the tools come from.

Rule of thumb: if you're writing the schema and parsing the response, you're doing function calling. If you're reasoning about what the agent can do, you're talking tool use. If you're connecting to someone else's pre-built tools, you're probably using MCP.

Common pitfalls

Most tool-use failures are predictable. The big ones:

  • Vague descriptions. "Gets data" tells the model nothing. If two tools sound similar, the model guesses — and guesses wrong. Describe exactly when each tool applies and what every parameter means.
  • Too many tools. Twenty near-identical tools confuse the model the way a 200-button remote confuses a person. Fewer, sharper tools beat a sprawling pile. Group or retrieve tools when the catalog gets large.
  • Trusting tool output blindly. Whatever a tool returns lands in the model's context as if it were truth. A tool that fetches a web page can return text containing instructions aimed at your model — that's prompt injection. Treat tool output as untrusted data, not commands.
  • No error handling. Real APIs time out, rate-limit, and return garbage. Return a clear error string back to the model ("city not found, ask the user to clarify") instead of crashing — a good agent recovers from a failed call.
  • Letting it run wild. An unbounded loop with a delete_file tool is a footgun. Cap the number of iterations, and require confirmation for anything destructive or irreversible.
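Two of these pitfalls, missing error handling and unguarded destructive tools, can be handled in one wrapper. This is a sketch under stated assumptions: safe_execute, DESTRUCTIVE_TOOLS, and the confirm callback are hypothetical names, and delete_file here only pretends to delete.

```python
DESTRUCTIVE_TOOLS = {"delete_file"}

def get_weather(city: str) -> str:
    known = {"Tokyo": "18\u00b0C, clear"}
    if city not in known:
        raise ValueError("city not found, ask the user to clarify")
    return known[city]

def delete_file(path: str) -> str:
    return f"deleted {path}"  # pretend only; nothing is actually removed

TOOLS = {"get_weather": get_weather, "delete_file": delete_file}

def safe_execute(name: str, args: dict, confirm=None) -> str:
    """Run a tool, turning failures into text the model can react to."""
    if name not in TOOLS:
        return f"error: unknown tool {name}"
    # Destructive tools run only if a confirm callback explicitly approves.
    if name in DESTRUCTIVE_TOOLS and not (confirm and confirm(name, args)):
        return f"error: {name} requires confirmation and none was given"
    try:
        return TOOLS[name](**args)
    except Exception as exc:  # bad arguments, timeouts, upstream failures
        return f"error: {exc}"

ok_result = safe_execute("get_weather", {"city": "Tokyo"})
failed = safe_execute("get_weather", {"city": "Atlantis"})
blocked = safe_execute("delete_file", {"path": "notes.txt"})
approved = safe_execute("delete_file", {"path": "notes.txt"},
                        confirm=lambda name, args: True)
print(ok_result, failed, blocked, approved, sep="\n")
```

In a real application the confirm callback would ask a person. Defaulting to refusal means a forgotten callback fails safe.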

Going deeper

How does a model learn to call tools at all? Providers fine-tune models specifically for it, and there's a clean research lineage here. The Toolformer paper showed a model can teach itself when to call an API — like a calculator or search engine — by inserting and testing tool calls in its own training data, then keeping only the calls that improved its predictions. Production models are now trained on large amounts of tool-use data so they reliably emit well-formed calls and chain them sensibly. Tool use isn't bolted on after training; it's baked in.

Reliability is the hard part. Models can hallucinate arguments, call a tool that doesn't exist, or emit JSON that doesn't match your schema. Two defenses help. Strict / constrained decoding (offered by major providers) forces output to conform to your schema as it's generated, so schema-violating calls become impossible rather than rare. Validation on your side (reject and re-prompt on a bad call) catches the rest. Even with both, expect the occasional weird call and design for it.
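Validation on your side might look like the sketch below. It hand-checks only a small subset of JSON Schema (required fields and a few primitive types), and validate_input and build_retry_message are hypothetical helpers. A real project would more likely reach for a full validator such as the jsonschema package.

```python
# Maps a few JSON Schema type names to Python types (a deliberate subset).
TYPE_MAP = {"string": str, "integer": int, "number": (int, float), "boolean": bool}

def validate_input(schema: dict, args: dict) -> list:
    """Return a list of problems; an empty list means the call looks valid."""
    problems = []
    properties = schema.get("properties", {})
    for field in schema.get("required", []):
        if field not in args:
            problems.append(f"missing required field '{field}'")
    for field, value in args.items():
        if field not in properties:
            # Stricter than JSON Schema's default, which allows extra fields.
            problems.append(f"unexpected field '{field}'")
            continue
        type_name = properties[field].get("type")
        expected = TYPE_MAP.get(type_name)
        if expected and not isinstance(value, expected):
            problems.append(f"field '{field}' should be {type_name}")
    return problems

def build_retry_message(problems: list) -> str:
    """Text to send back so the model can correct its call."""
    return "Tool call rejected: " + "; ".join(problems)

weather_schema = {
    "type": "object",
    "properties": {"city": {"type": "string"}},
    "required": ["city"],
}

good = validate_input(weather_schema, {"city": "Tokyo"})
bad = validate_input(weather_schema, {"town": 42})
wrong_type = validate_input(weather_schema, {"city": 42})
print(good)
print(build_retry_message(bad))
print(wrong_type)
```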

Token cost is real and easy to miss. Tool definitions are sent on every request, so a fat catalog of tools quietly eats your context window and your bill before the model says a word. Long tool results pile up across a loop too. Production patterns include retrieving only the relevant tools per request instead of sending all of them, summarizing or clearing old tool results, and caching the stable prefix of the prompt so repeated tool definitions don't get re-billed each turn.
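One of those patterns, clearing old tool results, can be sketched in a few lines. The function name, the placeholder text, and the simplified history below (assistant turns omitted) are assumptions for illustration; the tool_result block shape follows the hands-on example in this article.

```python
def clear_old_tool_results(messages: list, keep_last: int = 2,
                           placeholder: str = "[old tool result cleared]") -> list:
    """Return a copy of messages with all but the newest tool results blanked."""
    # Locate every tool_result block as (message index, block index).
    positions = [
        (mi, bi)
        for mi, msg in enumerate(messages)
        if isinstance(msg["content"], list)
        for bi, block in enumerate(msg["content"])
        if isinstance(block, dict) and block.get("type") == "tool_result"
    ]
    stale = set(positions[:-keep_last]) if keep_last > 0 else set(positions)
    trimmed = []
    for mi, msg in enumerate(messages):
        if not isinstance(msg["content"], list):
            trimmed.append(dict(msg))
            continue
        # Copy each stale block with its content swapped for the placeholder.
        blocks = [
            {**block, "content": placeholder} if (mi, bi) in stale else block
            for bi, block in enumerate(msg["content"])
        ]
        trimmed.append({**msg, "content": blocks})
    return trimmed

history = [
    {"role": "user", "content": "Compare the weather in Tokyo, Paris, and Lima."},
    {"role": "user", "content": [
        {"type": "tool_result", "tool_use_id": "call_1", "content": "Tokyo: 18C, clear"}]},
    {"role": "user", "content": [
        {"type": "tool_result", "tool_use_id": "call_2", "content": "Paris: 12C, rain"}]},
    {"role": "user", "content": [
        {"type": "tool_result", "tool_use_id": "call_3", "content": "Lima: 20C, fog"}]},
]

trimmed = clear_old_tool_results(history, keep_last=1)
for msg in trimmed:
    print(msg["content"])
```

The original list is left untouched, so the full results remain available for logging even after the copy sent to the model has been trimmed.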

Where tools run is a design choice. Some tools execute in your code ("client tools") — you handle the function call. Others run on the provider's infrastructure ("server tools") — you ask for a web search or code execution and just receive the result, no plumbing. Server tools are convenient; client tools give you full control over what actually happens. Most real agents mix both, and increasingly pull external tools in over MCP.

The open problems are honest. There's no fully reliable way to make a model choose the right tool every time as catalogs grow into the hundreds. Multi-step tool chains compound errors — a wrong argument in step two derails everything after it. Evaluating tool-using agents is genuinely hard, because correctness depends on a whole trajectory of decisions, not one answer. And security keeps getting harder as agents gain more powerful tools: every tool is a new way for things to go wrong, accidentally or adversarially. Tool use is the single biggest capability jump you can give a model — and exactly because of that, it's where the most care is required.

FAQ

What is the difference between tool use and function calling?

They describe the same machinery at different levels. Function calling is the API feature that lets a model emit a structured request matching a schema you defined; tool use is the broader concept of a model acting on the world (searching, running code, calling APIs) that function calling makes possible. Many providers use the terms interchangeably.

Does the AI model actually run the tools itself?

No. The model only ever produces text — including a structured tool_use block that means "please call this function." Your application code executes the real function and feeds the result back. That separation is what makes tool use safe: you can validate, refuse, or log any call before it runs.

What are some examples of tools an AI agent can use?

Common ones are web search (for current info), a calculator or code interpreter (for exact computation), database or SQL queries, file read/write, sending email or messages, and calling external APIs like weather, maps, or a CRM. In a coding assistant, every file read, command run, and edit is a tool call.

How does an agent decide which tool to use?

It reads the name and description you wrote for each tool, like a menu, and picks based on the user's request and what's already in context. This is why tool descriptions matter so much — if the model picks the wrong tool or skips one, the fix is almost always a clearer description, not a longer prompt.

Is MCP the same thing as tool use?

No. Tool use is the general capability of a model calling tools. MCP (Model Context Protocol) is a standard for packaging and sharing those tools so any agent can plug into any tool server — like USB-C for AI tools. You still rely on function calling under the hood; MCP just standardizes where the tools come from.

Further reading