In plain English
Function calling is how a language model asks your program to do something it can't do on its own. The model doesn't run code, search the web, or read your database. What it can do is produce a small, structured message that says: "please run the function get_weather with the input {"city": "Tokyo"}." Your code runs that function, gets a real answer, and hands it back to the model. The model then writes its reply using that real data.
The everyday analogy is a smart but housebound assistant on the phone. They're brilliant at language and reasoning, but they can't see your calendar or check the weather. So they say, "Can you look up tomorrow's forecast for Tokyo and read it back to me?" You do the lookup, read out the numbers, and they fold that into a clear answer. The assistant never touched your tools — they just told you which tool to use and with what inputs. That round trip is function calling.
The single most important thing to internalize on day one: the model never executes anything. It only emits a request — a blob of JSON naming a function and its arguments. Your application is what actually calls the function. If you don't write code to catch that request and run something, nothing happens at all. People constantly imagine the model is reaching out and hitting an API itself. It isn't. It's passing you a note.
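To make the "note" idea concrete, here is a minimal sketch of what catching the request can look like. The JSON layout, the `TOOLS` registry, and `handle_note` are illustrative assumptions, not any provider's actual wire format; real APIs wrap the same two pieces of information (a function name and its arguments) in their own response structure.

```python
import json

# Hypothetical local function. The model never runs this; your code does.
def get_weather(city: str) -> str:
    return f"(pretend forecast for {city})"

# Registry mapping tool names to the Python callables that implement them.
TOOLS = {"get_weather": get_weather}

# An illustrative "note" from the model: a function name plus JSON arguments.
note = '{"name": "get_weather", "arguments": {"city": "Tokyo"}}'

def handle_note(raw: str) -> str:
    request = json.loads(raw)            # parse the model's structured request
    func = TOOLS[request["name"]]        # look up the real function by name
    return func(**request["arguments"])  # your application performs the call

result = handle_note(note)
print(result)
```

If `handle_note` is never written or never called, the note is just a string sitting in a response, which is exactly the "nothing happens at all" case described above.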
Why it matters
A raw language model is frozen and sealed. Its knowledge stops at its training cutoff, it can't see anything private to you, and it can't take actions in the world. Ask it for today's stock price and it will either refuse or — worse — invent a confident, wrong number. Function calling is the bridge out of that box: it lets a model that's great at language borrow capabilities it doesn't have, like fresh data and real actions.
Before this mechanism existed, people tried to bolt tools onto models with brittle hacks: prompting the model to write out a fake command in a special format, then scraping that text with regex and hoping the format held. It broke constantly. Function calling replaced the duct tape with a first-class API feature: the provider trains the model to emit a JSON object that follows the schema you supply, and the API returns it in a dedicated, structured field. You get a reliable structured request instead of fishing keywords out of prose.
Who should care, and why:
- App developers: this is how you connect an LLM API to your own backend: order lookups, account changes, sending an email, querying a SQL database, hitting a payment provider.
- Anyone building agents: function calling is the literal engine of an AI agent. An agent is mostly a loop that calls tools, reads results, and decides what to call next.
- RAG builders: instead of always stuffing documents into the prompt, you can give the model a search_docs function and let it decide when it actually needs to look things up. That's agentic RAG.
Put simply: prompting gets a model to say things. Function calling gets a model to do things — safely, through code you control.
How it works
Function calling is a multi-step conversation, not a single API call. You send the model the user's question plus a list of tools it's allowed to use. Each tool is described by three things: a name, a plain-English description of what it does, and an input schema (a JSON Schema listing the arguments and their types). The model reads those descriptions the same way it reads any other text, and uses them to decide whether — and how — to call a tool.
Here's the full loop. The hand-off in the middle is the part beginners miss: the model pauses, you do the work, then you call the API again with the result.
Walk it slowly with a weather example. The user asks "What should I wear in Tokyo tomorrow?" You send that message and a get_weather tool definition. The model can't know tomorrow's weather, so instead of guessing it returns a tool call: get_weather({"city": "Tokyo"}). Critically, the API response now stops with a special signal — OpenAI sets finish_reason to tool_calls, Claude sets stop_reason to tool_use — meaning "I'm waiting on you."
Your code sees that signal, pulls out the function name and arguments, and runs your real get_weather function — which hits a weather API and gets back, say, 18°C and rain. You then make a second API call: the original conversation, plus the tool call, plus a tool-result message carrying 18°C and rain. Now the model has real data and writes the actual answer: "It'll be cool and rainy — bring a jacket and an umbrella."
A worked example in code
Here's the whole loop in Python against the Claude API. Read it top to bottom: the four phases (define, ask, run, return) map directly onto the loop described above. The same pattern works on every major provider; only the field names change.
```python
import anthropic

client = anthropic.Anthropic(api_key="sk-...")

# 1. DEFINE the tool: name, description, and an input schema.
tools = [{
    "name": "get_weather",
    "description": "Get the current weather for a given city.",
    "input_schema": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name, e.g. Tokyo"}
        },
        "required": ["city"],
    },
}]

# This is YOUR real function. The model never runs it; you do.
def get_weather(city: str) -> str:
    # In real life: call a weather API here. Hard-coded for the demo.
    return "18°C, light rain"

messages = [{"role": "user", "content": "What should I wear in Tokyo tomorrow?"}]

# 2. ASK the model, handing it the tools.
resp = client.messages.create(
    model="claude-opus-4-8",
    max_tokens=1024,
    tools=tools,
    messages=messages,
)

# 3. The model paused to call a tool. RUN it.
if resp.stop_reason == "tool_use":
    tool_call = next(b for b in resp.content if b.type == "tool_use")
    result = get_weather(**tool_call.input)  # -> "18°C, light rain"

    # 4. RETURN the result and let the model finish.
    messages.append({"role": "assistant", "content": resp.content})
    messages.append({
        "role": "user",
        "content": [{
            "type": "tool_result",
            "tool_use_id": tool_call.id,
            "content": result,
        }],
    })
    final = client.messages.create(
        model="claude-opus-4-8", max_tokens=1024, tools=tools, messages=messages
    )
    print(final.content[0].text)
    # -> "It'll be cool and rainy in Tokyo. Bring a jacket and an umbrella."
```

Notice there are two messages.create calls. The first ends in a tool request; the second delivers the result and gets the prose answer. That second call is non-optional: skip it and the model never sees the weather data, so it can't use it. This back-and-forth is also why function calling costs more tokens. Every tool definition, tool call, and tool result is text the model re-reads on each turn, which matters for your API bill.
How the providers differ
The concept is identical everywhere; the spelling differs. If you've seen one provider's version, you can read all of them. Here's the translation table for the three you'll meet most:
| Concept | OpenAI (GPT) | Anthropic (Claude) | Google (Gemini) |
|---|---|---|---|
| What they call it | Function calling | Tool use | Function calling |
| Schema field | parameters | input_schema | parameters |
| "Stop and call a tool" signal | finish_reason: tool_calls | stop_reason: tool_use | A functionCall part |
| You send results back as | A tool role message | A tool_result block | A functionResponse part |
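The "Schema field" row can be made concrete with a small translation helper. This is a sketch that assumes OpenAI's Chat Completions tool layout (a `type: "function"` wrapper around a `function` object); the helper name `anthropic_to_openai` is hypothetical, and field layouts should be checked against each provider's current documentation before relying on them.

```python
def anthropic_to_openai(tool: dict) -> dict:
    """Re-wrap an Anthropic-style tool definition in an OpenAI-style one."""
    return {
        "type": "function",
        "function": {
            "name": tool["name"],
            "description": tool["description"],
            # Same JSON Schema, just stored under a different key.
            "parameters": tool["input_schema"],
        },
    }

claude_tool = {
    "name": "get_weather",
    "description": "Get the current weather for a given city.",
    "input_schema": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

openai_tool = anthropic_to_openai(claude_tool)
print(openai_tool["function"]["parameters"]["required"])
```

The JSON Schema itself passes through untouched, which is the practical meaning of "the spelling differs": only the envelope around the schema changes.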
Two features you'll want early, available on all major providers in some form. Parallel tool calls: when a user asks for the weather in three cities at once, the model can return all three calls in a single turn so you can run them together. Forced tool use (tool_choice): you can require the model to call a specific tool, or any tool, instead of letting it decide — handy when you want guaranteed structured output in a known shape rather than free prose.
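Handling parallel tool calls might look like the sketch below. The tool-call blocks are written as plain dicts shaped like Claude's `tool_use` blocks (the SDK returns objects with the same fields), the IDs and weather data are made up, and `run_call` is a hypothetical helper. The key point is that every call gets its own result, matched by ID, and all results go back in a single message.

```python
from concurrent.futures import ThreadPoolExecutor

def get_weather(city: str) -> str:
    # Stand-in data; a real implementation would call a weather API.
    fake = {"Tokyo": "18°C, light rain", "Paris": "22°C, sunny", "Cairo": "35°C, clear"}
    return fake[city]

# Three tool calls returned by the model in one turn (illustrative values).
tool_calls = [
    {"type": "tool_use", "id": "call_1", "name": "get_weather", "input": {"city": "Tokyo"}},
    {"type": "tool_use", "id": "call_2", "name": "get_weather", "input": {"city": "Paris"}},
    {"type": "tool_use", "id": "call_3", "name": "get_weather", "input": {"city": "Cairo"}},
]

def run_call(call: dict) -> dict:
    output = get_weather(**call["input"])
    # Each result carries the ID of the call it answers.
    return {"type": "tool_result", "tool_use_id": call["id"], "content": output}

# Run the calls concurrently; pool.map returns results in input order.
with ThreadPoolExecutor() as pool:
    tool_results = list(pool.map(run_call, tool_calls))

# All results travel back to the model together in one message.
reply = {"role": "user", "content": tool_results}
print(len(reply["content"]))
```

For forced tool use, the Claude API accepts a `tool_choice` argument such as `{"type": "tool", "name": "get_weather"}` (a specific tool) or `{"type": "any"}` (any tool); other providers expose the same idea under their own spelling.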
Common pitfalls
Most function-calling bugs are the same handful, and they're all avoidable once you've seen them once.
- Expecting the model to run the function. It won't, ever. If you forget to write the code that catches the tool call and executes it, the model just stops with a tool request and nothing happens. The model proposes; your code disposes.
- Forgetting the second API call. Running the function but never sending the result back leaves the model blind. It needs that tool-result message to write the final answer.
- Vague tool descriptions. "Does stuff with users" gives the model nothing to reason about, so it calls the wrong tool or none at all. Describe exactly what the tool does and when to use it.
- Trusting the arguments blindly. The model generates the arguments, and it can hallucinate a value or be steered by a malicious user. Validate inputs and never let a tool call run a destructive action (delete, pay, email) without checks — see prompt injection.
- No error handling. When your function fails, return the error as the tool result ("Error: city not found"). The model can often recover (re-ask, try another tool) if you tell it what went wrong instead of crashing.
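The last two pitfalls can be handled in one small wrapper. The sketch below assumes a Claude-style `tool_result` block, which accepts an `is_error` flag; the `KNOWN_CITIES` data and the `safe_tool_result` helper are hypothetical. It validates the model-generated arguments first, then converts any failure into a result the model can read.

```python
KNOWN_CITIES = {"Tokyo": "18°C, light rain"}  # stand-in for a real weather API

def get_weather(city: str) -> str:
    if city not in KNOWN_CITIES:
        raise LookupError("city not found")
    return KNOWN_CITIES[city]

def safe_tool_result(tool_use_id: str, args: dict) -> dict:
    """Validate model-generated arguments, run the tool, and never raise."""
    city = args.get("city")
    if not isinstance(city, str) or not city.strip():
        # The model omitted the argument or produced the wrong type.
        content, is_error = "Error: 'city' must be a non-empty string", True
    else:
        try:
            content, is_error = get_weather(city), False
        except LookupError as exc:
            # Report the failure to the model instead of crashing the app.
            content, is_error = f"Error: {exc}", True
    return {
        "type": "tool_result",
        "tool_use_id": tool_use_id,
        "content": content,
        "is_error": is_error,
    }

ok = safe_tool_result("t1", {"city": "Tokyo"})
bad = safe_tool_result("t2", {"city": "Atlantis"})
print(ok["content"])
print(bad["content"])
```

A read-only lookup like this only needs input validation. For destructive tools (delete, pay, email), the same wrapper is a natural place to add a permission check or a human confirmation step before the function runs.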
Going deeper
How does the model learn to emit clean JSON on demand? It's trained for it. Providers fine-tune models on huge numbers of tool-use examples, so "produce a valid call matching this schema" becomes a learned behavior rather than a prompting trick. Some providers go further with constrained decoding (often called strict mode): at generation time the API restricts the model's token choices so the output is guaranteed to match your schema — no missing fields, no wrong types. That turns a usually-valid call into an always-valid one.
The single tool round trip is the atom of something much bigger. Repeat the loop (call a tool, read the result, decide the next call) and you've built an agent. Picture a customer-support flow where one user message kicks off a chain of dependent calls, each one needing the result of the one before.
That chaining is where the hard production problems live. Tool-selection accuracy degrades as you add tools — a model juggling 40 vaguely-named tools picks wrong far more often than one choosing among 5 sharp ones, so people group tools, load them dynamically, or route with a smaller model first. Latency stacks up: each round trip is a full model call, so a four-tool chain is four sequential calls plus four function executions — streaming the final answer helps the wait feel shorter. Loop safety matters: agents can get stuck calling the same tool forever, so you cap the number of iterations. And observability becomes essential — when a chain misbehaves, you need a trace of every tool call and result to debug it, which is a core job of LLM observability.
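A minimal sketch of such a loop, with an iteration cap and a trace, is shown here. To stay runnable without network access it uses `fake_model`, a scripted stand-in for a real LLM call; the tools, the order data, and the message shapes are all hypothetical and simplified compared with any real provider's format.

```python
def lookup_order(order_id: str) -> str:
    return "shipped via ACME, tracking ZX1"  # stand-in for a database query

def track_package(tracking: str) -> str:
    return "out for delivery"  # stand-in for a carrier API call

TOOLS = {"lookup_order": lookup_order, "track_package": track_package}

def fake_model(history: list) -> dict:
    """Scripted stand-in for an LLM: picks the next step from the results seen so far."""
    seen = sum(1 for m in history if m["role"] == "tool")
    if seen == 0:
        return {"tool": "lookup_order", "args": {"order_id": "A17"}}
    if seen == 1:
        return {"tool": "track_package", "args": {"tracking": "ZX1"}}
    return {"text": "Your order is out for delivery."}

def run_agent(question: str, max_steps: int = 5):
    history = [{"role": "user", "content": question}]
    trace = []  # every tool call and result, kept for debugging
    for _ in range(max_steps):  # hard cap so the agent cannot loop forever
        step = fake_model(history)
        if "text" in step:  # the model answered in prose: done
            return step["text"], trace
        result = TOOLS[step["tool"]](**step["args"])
        trace.append((step["tool"], step["args"], result))
        history.append({"role": "tool", "content": result})
    return "Stopped: step limit reached.", trace

answer, trace = run_agent("Where is my order A17?")
print(answer)
for name, args, result in trace:
    print(name, args, "->", result)
```

The `max_steps` cap addresses loop safety and the `trace` list is the simplest possible form of observability; a production system would typically send the same records to a logging or tracing backend.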
An open frontier worth watching: standardizing the tool ecosystem. Today every app re-implements its own tools; the Model Context Protocol is the leading attempt to make tools portable across apps and models, the way HTTP made web servers portable across browsers. If you've understood this loop — model asks, your code runs, results go back — you understand the mechanism that nearly every AI agent, coding assistant, and tool-augmented chatbot is built on. The next step is to see it run inside a full loop in tool use.
FAQ
Does the LLM actually run my function during function calling?
No. The model only outputs a structured request naming a function and its arguments. Your application code is what catches that request and runs the real function, then sends the result back to the model. If you don't write that execution code, nothing runs.
What is the difference between function calling and tool use?
They're the same mechanism with different names. OpenAI originally called it 'function calling'; Anthropic calls it 'tool use' for Claude. Both now expose it through a 'tools' parameter where you describe functions the model can ask you to run. The terms are interchangeable in practice.
Why do I have to call the API twice for function calling?
The first call returns the model's tool request and stops. You then run the function yourself and make a second call that includes the function's result. The model needs that result to write its final answer — skipping the second call means it never sees the data it asked for.
How does the model decide which function to call?
It reads the name, description, and input schema you provide for each tool, just like reading any other text, and matches them against the user's request. Clear, specific descriptions are the biggest factor — a vague description makes the model pick the wrong tool or none at all.
Is function calling the same as an AI agent?
Function calling is the building block; an agent is the loop built from it. A single function call is one round trip. An agent repeats it — call a tool, read the result, decide the next call — until the task is done. Every agent relies on function calling under the hood.
Can the model call multiple functions at once?
Yes. Major providers support parallel tool calls, where the model returns several tool requests in one turn so your code can run them together — useful when a user asks for, say, the weather in three cities at once. It can also chain calls sequentially when one tool's output feeds the next.