Tool Calling Best Practices for LLMs

Learn how to design schemas, write descriptions, handle errors, and know when to skip tools entirely so your LLM agent calls functions reliably.

INTERMEDIATE10 MIN READUPDATED 2026-06-12

In plain English

You already know what tool calling is — the model asks your code to run a function and returns the result (see What Is Function Calling?). The harder question is: how do you build it so the model chooses the right tool, fills in the right arguments, and keeps working when something goes wrong?

That is what this article is about. Tool calling reliability lives almost entirely in three places: the JSON schema you give the model, the plain-English description attached to each tool and each parameter, and the error-handling logic in your agent loop. Get those three right and a well-tuned model will pick the right function, supply valid arguments, and recover gracefully when the external call fails.

Why it matters

Research on tool-use accuracy identifies two failure modes: selection errors (the model chooses the wrong tool, calls an extra one, or skips one entirely) and invocation errors (correct tool, wrong arguments). Studies find that argument correctness has a larger impact on end-to-end success than tool selection alone — which means a polished schema beats a clever prompt nearly every time.

There is also a catalog-size effect: giving a model 50 tools when it only needs 5 measurably degrades selection accuracy and consumes input tokens that would otherwise be available for reasoning. The engineering implication is clear — keep toolsets small and scoped, and load tools dynamically when the catalog grows large.

Reliability — well-described tools cut hallucinated or malformed arguments
Latency — parallel execution collapses independent calls into one round-trip
Safety — explicit scope and human-in-the-loop gates stop runaway write operations
Cost — concise schemas consume fewer input tokens per call
Debuggability — structured error returns let the model self-correct without a retry from scratch

How the decision loop works

Before diving into individual practices, it helps to see the full agent loop in one place. Understanding where each best practice slots in makes the guidance easier to remember and apply.

// Agent tool-calling loop with best-practice checkpoints

User message arrivesSystem prompt includes scoped tool listModel reasons over toolsNames + descriptions guide selectionTool call(s) emittedParallel if independent; sequential if dependentYour code validates argsSchema + runtime checks before executionTool executesLeast-privilege; confirm before irreversible writesResult returned to modelStructured error string on failureModel produces final replyOr calls another tool if needed

Each step in the loop has at least one best practice attached. The sections below walk through them in the order a developer encounters them: schema design → naming → descriptions → parallel calling → error handling → when not to use tools.

Practical best practices

1. Schema design — be explicit, be minimal

The JSON Schema block you pass to the API is the most influential thing you control. A few concrete rules:

Mark every required field as required. Leaving required parameters in the optional array is the single most common cause of invocation errors — the model omits them without error.
Use enum for constrained strings. If a parameter can only be "celsius" or "fahrenheit", say so. The model will never hallucinate a third option.
Add format hints for structured strings. "format": "date" (ISO 8601) or "format": "uri" gives the model a machine-readable contract, not just a hint.
Keep schemas as flat as possible. Deeply nested objects increase the chance of structural errors. One level of nesting is usually enough.
Omit optional parameters you rarely use. Every parameter in the schema is a token the model has to reason about.

jsonjson

{
  "name": "get_weather",
  "description": "Return current weather for a city. Use this when the user asks about current conditions, not forecasts.",
  "parameters": {
    "type": "object",
    "properties": {
      "city": {
        "type": "string",
        "description": "City name, e.g. 'Tokyo' or 'New York'"
      },
      "unit": {
        "type": "string",
        "enum": ["celsius", "fahrenheit"],
        "description": "Temperature unit. Default to celsius unless the user specifies otherwise."
      }
    },
    "required": ["city"]
  }
}

2. Naming — verb_noun, snake_case, under 64 characters

Function names directly influence the model's selection decision — they are the first thing the model reads. Prefer verb_noun patterns: search_products, create_calendar_event, cancel_order. Most providers (OpenAI, Anthropic, Google) restrict names to alphanumerics, underscores, and hyphens with a 64-character maximum. Stay inside that limit and you are compatible across all major APIs.

Do: get_order_status, send_email, list_calendar_events
Avoid: doStuff, tool1, process — vague names hurt selection accuracy
Avoid: Names longer than 40 characters — readability matters for the model too
Separate read from write: get_ vs update_ vs delete_ makes the scope of an action obvious

3. Descriptions — tell the model when AND when not to call

The description field is your primary lever for improving selection accuracy. Most developers write what the function does; the better move is to write when to use it and when not to. The model resolves ambiguity between similar tools by reading descriptions, not names.

jsonjson

{
  "name": "search_knowledge_base",
  "description": "Search the internal knowledge base for factual answers. Use this for product documentation, policies, and FAQs. Do NOT use this for real-time data like prices, stock levels, or order status — use get_product or get_order_status instead."
}

Parameter descriptions deserve the same care. Include the expected format, valid range, and a concrete example. "The ISO 8601 start date, e.g. '2026-03-15'" is better than "The start date".

4. Parallel calling — collapse independent calls

When a model needs to call two or more tools whose outputs do not depend on each other, modern APIs let it emit all the calls in a single response. Your code executes them concurrently and returns all results in one batch. Total latency drops from the sum of every call to the slowest single call.

The key mental model: parallel when independent, sequential when dependent.

pythonpython

# Sequential (slow) — second call waits for first
weather = get_weather(city="Tokyo")
time_zone = get_timezone(city="Tokyo")

# Parallel — model emits both calls at once;
# you run them concurrently and return both results
# [get_weather(city="Tokyo"), get_timezone(city="Tokyo")]  <- one model turn
import asyncio
weather, timezone = await asyncio.gather(
    async_get_weather(city="Tokyo"),
    async_get_timezone(city="Tokyo")
)

Use parallel when: outputs are independent, calls are I/O-bound (database, API), order does not matter
Use sequential when: output of call A is input to call B, calls modify shared state, you need deterministic ordering
Watch for conflicts: if two parallel results contradict each other (e.g. two price sources disagree), define a resolution strategy before the model synthesizes them

5. Error handling — return structured failures, not exceptions

When a tool fails, do not raise an unhandled exception. Return a structured error string as the tool result. The model can then decide whether to retry with different arguments, call a fallback tool, or tell the user what went wrong — instead of crashing the loop.

pythonpython

def get_order_status(order_id: str) -> str:
    try:
        order = db.get_order(order_id)
        if order is None:
            # Return structured error — model can reason over this
            return '{"error": "ORDER_NOT_FOUND", "message": "No order with id ' + order_id + '", "suggestion": "Check the order id and try again."}'
        return order.to_json()
    except Exception as e:
        return '{"error": "SERVICE_UNAVAILABLE", "message": "Order service is down. Try again in 30 seconds."}'

Include an error code (ORDER_NOT_FOUND) — makes it machine-readable for logging and retry logic
Include a human message — the model can surface this directly to the user
Include a suggestion when the fix is obvious — the model can self-correct without a new user prompt
Set a maximum call count (10–15 is a common limit) in your agent loop and terminate gracefully if exceeded
Deduplicate calls — if the model calls the same function with the same args twice consecutively, surface the loop to the user rather than executing indefinitely

6. Safety and scope — least privilege, confirm before irreversible actions

Read tools are inherently safer than write tools. Split your surface into distinct functions: get_order (read-only) and cancel_order (write) should be separate tools, each with a description that names the side-effect explicitly. For irreversible writes — deleting data, sending emails, making payments — add a human-in-the-loop confirmation step rather than having the model trigger the action directly.

When not to use tools

Tool calling is powerful but not free. Every tool in the schema costs input tokens and adds a round-trip. Recognize the cases where tools add complexity without value:

The answer is in the model's training data. Basic math, well-known facts, and common unit conversions do not need a calculator tool. The model already knows 2 + 2 = 4.
The data can be injected into the system prompt. If the context is small and static (a user's name, their account tier), pass it as text — not as a database lookup tool.
The task is purely generative. Drafting an email, summarising text, or translating a paragraph requires no external data. Adding tools here just increases latency.
You need deterministic output. Tools introduce network failures, schema mismatches, and latency variability. If reliability and exact reproducibility matter more than real-time data, consider a non-agentic pipeline.
The tool catalog is too large to load at once. More than 20–30 tools in a single context degrades selection accuracy. Use dynamic tool loading (retrieve a relevant subset via vector search on descriptions) rather than dumping the full catalog.

Going deeper

Once you have reliable single-agent tool calling, the natural next step is multi-agent systems where specialized agents expose their capabilities as tools to an orchestrator. The same schema, naming, and description rules apply — but now your "tool" is another LLM rather than a Python function. The Model Context Protocol (MCP) standardises how tools are advertised and invoked across agent boundaries, so toolsets built to MCP specs work across different frameworks without rewiring.

For evaluation, treat tool-calling accuracy as a first-class metric. Log every tool call (function name, arguments, result, latency) and track: selection accuracy (did the model call the right tool?), argument correctness (were required fields present and valid?), and end-to-end success rate. These signals tell you whether a description change helped or hurt before you ship it to production.

Dynamic tool loading — embed tool descriptions, retrieve the top-K relevant ones per query via cosine similarity, inject only those into the context
Structured output mode — some APIs (OpenAI, Anthropic) support a strict mode where the model is constrained to emit only valid schema-conforming JSON — eliminates nearly all invocation errors
Plan-then-act — ask the model to reason about which tools it will need before calling any of them (tool_choice="none" in the planning turn), then execute the plan; reduces redundant calls
Tool versioning — version your tool schemas (get_order_v2) so you can run A/B tests on description or schema changes without breaking existing callers

FAQ

How many tools can I give an LLM at once?

Most APIs support 64–128 tools in a single context, but accuracy degrades well before that limit. In practice, keep active toolsets to 10–20 tools. For larger catalogs, use dynamic tool loading: embed all tool descriptions, retrieve the 5–10 most relevant ones per query, and inject only those.

What is the difference between parallel and sequential tool calling?

Parallel calling lets the model emit multiple tool calls in one response so your code can run them concurrently — ideal when the calls are independent. Sequential calling means each tool call waits for the previous result — necessary when output of one call is input to the next. Most modern APIs (OpenAI, Anthropic, Google Gemini) support both.

Why does my model keep calling the wrong tool?

Selection errors are almost always a description problem. Rewrite the description to explicitly state when the tool should be used AND when it should not, naming the competing tools by name. Also check that tool names are meaningfully different — vague or similar names increase confusion.

How should I handle a tool that fails or returns an error?

Return a structured error JSON string as the tool result rather than raising an exception. Include an error code, a plain-English message, and a suggestion for how to fix the problem. The model can then self-correct, try a fallback, or surface the error to the user — all without crashing the agent loop.

Should I use strict/structured output mode for tool calling?

Yes, when your API supports it. Strict mode constrains the model to emit JSON that exactly conforms to your schema, eliminating structural invocation errors. OpenAI calls this strict mode; Anthropic enforces schema conformance by default for tool_use blocks. It costs a small amount of extra processing but is almost always worth it.

When should I not use tool calling at all?

Skip tools when the answer is already in the model's training data (simple math, common facts), the data can be injected as plain text in the system prompt, the task is purely generative (writing, summarising), or you need deterministic and reproducible output. Tools are for live external data and real-world side-effects — not for things the model already knows.

// In plain English

// Why it matters

// How the decision loop works

// Practical best practices

1. Schema design — be explicit, be minimal

2. Naming — verb_noun, snake_case, under 64 characters

3. Descriptions — tell the model when AND when not to call

4. Parallel calling — collapse independent calls

5. Error handling — return structured failures, not exceptions

6. Safety and scope — least privilege, confirm before irreversible actions

// When not to use tools

// Going deeper

// FAQ

// Further reading

// Related

In plain English

Why it matters

How the decision loop works

Practical best practices

When not to use tools

Going deeper

FAQ

Further reading

Related