AI/TLDR

What Are Parallel Tool Calls? Running Agent Tools at Once

Understand how agents fire multiple tool calls in a single turn, when that genuinely speeds things up, and the dependency traps it creates.

INTERMEDIATE · 12 MIN READ · UPDATED 2026-06-13

In plain English

When an AI agent works on a task, it uses tools: small functions such as searching the web, reading a file, or looking up a customer. You might picture it using one tool at a time: ask, wait, get the result, ask again. Parallel tool calls let the model request several tools in the same turn, so the runtime can run them all at once instead of one after another.

Picture a chef plating one dish. The slow way is strictly serial: boil the pasta, then (only after) start the sauce, then toast the bread. The fast way is parallel: the moment the order comes in, water goes on for the pasta, a pan starts the sauce, and the bread goes under the grill — all at the same time, because none of them needs the others to finish first. Parallel tool calling is the model acting like that chef: it looks at the question, sees three independent things it needs, and fires off all three requests together.

The key word is independent. The chef can only parallelise steps that don't depend on each other — you can't slice the steak before it's cooked. A model can only safely parallelise tool calls whose inputs don't depend on another call's output. That single distinction is the whole subject of this article.

Why it matters

Every time an agent calls a tool, the slow part is usually waiting — for a network request, a database, an API on the other side of the world. If three lookups each take 800 milliseconds and you run them one by one, the user waits roughly 2.4 seconds plus the model's thinking time between each. Run the same three at once and the wait collapses to about 800 milliseconds: the time of the slowest call, not the sum of all of them.
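
The arithmetic above can be checked with a toy simulation. This is a minimal sketch, not a benchmark: fake_lookup is a hypothetical stand-in for a network-bound tool, and the 800 ms wait is scaled down to 80 ms so the demo finishes quickly.

```python
import asyncio
import time

async def fake_lookup(name: str, seconds: float) -> str:
    # Hypothetical stand-in for a network-bound tool; sleeping simulates the wait.
    await asyncio.sleep(seconds)
    return f"{name} done"

async def run_sequential(delay: float) -> float:
    start = time.perf_counter()
    for name in ("a", "b", "c"):
        await fake_lookup(name, delay)   # each call waits for the previous one
    return time.perf_counter() - start

async def run_concurrent(delay: float) -> float:
    start = time.perf_counter()
    # All three waits overlap, so the total is close to one delay.
    await asyncio.gather(*(fake_lookup(n, delay) for n in ("a", "b", "c")))
    return time.perf_counter() - start

sequential_s = asyncio.run(run_sequential(0.08))
concurrent_s = asyncio.run(run_concurrent(0.08))
print(f"sequential: {sequential_s:.2f}s, concurrent: {concurrent_s:.2f}s")
```

The sequential figure should land near three delays and the concurrent one near a single delay; exact numbers vary by machine.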

That speed-up compounds because of how the agent loop works. Each "think → call tools → read results" cycle costs one full model round-trip. Sequential calling forces a separate round-trip per tool: think, call A, read A, think, call B, read B. Parallel calling lets the model resolve a whole batch of independent work in a single round-trip — fewer model invocations, lower latency, and often lower cost, because you pay to re-send the conversation history on every turn.

  • Latency. Independent I/O runs concurrently, so total wait is the slowest call, not the sum.
  • Fewer turns. One round-trip can dispatch a batch of calls instead of one, cutting the number of times you re-invoke the model.
  • Lower token cost. Every extra turn re-sends the growing conversation history; collapsing turns trims that repeated input.
  • Better UX. A research or comparison answer that gathers five sources feels instant instead of crawling source by source.
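
The token-cost point can be made concrete with a rough model. The sketch below assumes illustrative numbers (a 1,000-token prompt and 200 tokens per tool result) and deliberately ignores output tokens, the tokens of the tool requests themselves, and any provider-side caching.

```python
def total_input_tokens(base_tokens: int, tokens_per_result: int, batches: list[int]) -> int:
    """Input tokens sent across all model turns for a given batching plan.

    batches[i] is the number of tool results returned after turn i.
    Every later turn re-sends the prompt plus all results so far.
    """
    history = base_tokens
    total = history                                   # first turn: prompt only
    for batch_size in batches:
        history += batch_size * tokens_per_result     # results join the history
        total += history                              # next turn re-sends it all
    return total

sequential = total_input_tokens(1000, 200, [1, 1, 1])  # one call per turn
parallel = total_input_tokens(1000, 200, [3])          # one batch of three
print(sequential, parallel)  # 5200 2600
```

Under these assumptions, batching three calls into one turn halves the input tokens, because the prompt is re-sent twice instead of four times.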

Builders care most when a task is naturally a fan-out: comparing several products, gathering weather for three cities, enriching a record from four separate APIs, or grading an answer against many independent checks. Whenever the work splits into parts that don't need each other, parallel tool calls turn a slow chain into one quick burst.

How it works

Parallel tool calling is a contract between two parties: the model, which decides what to call, and your runtime (your code or framework), which decides how to run it. The model never executes anything itself — it only emits requests. The concurrency is entirely your runtime's job.

Step 1 — the model emits a batch of calls

You give the model a list of tools and a question. Instead of returning one tool request, a capable model can return multiple tool-call requests in the same response — each with its own name and arguments, and each with a unique id so results can be matched back later. The model does this when it judges that the calls are independent and can all be answered from the current information.
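
As a rough illustration of what such a batch might look like once parsed, here is a sketch. The block layout (a "tool_use" type with id, name, and input fields) mirrors the snippet later in this article, but the exact shape depends on your API, and get_weather is a hypothetical tool.

```python
from dataclasses import dataclass

@dataclass
class ToolCall:
    id: str      # unique per call; results are matched on this
    name: str    # which tool the model wants to run
    input: dict  # arguments, all known at request time

# Hypothetical response content: one text block plus three tool requests.
response_content = [
    {"type": "text", "text": "Checking all three cities."},
    {"type": "tool_use", "id": "call_1", "name": "get_weather", "input": {"city": "Tokyo"}},
    {"type": "tool_use", "id": "call_2", "name": "get_weather", "input": {"city": "Paris"}},
    {"type": "tool_use", "id": "call_3", "name": "get_weather", "input": {"city": "Cairo"}},
]

def extract_tool_calls(content: list[dict]) -> list[ToolCall]:
    # Keep only the tool requests; text blocks are not executable.
    return [
        ToolCall(block["id"], block["name"], block["input"])
        for block in content
        if block["type"] == "tool_use"
    ]

tool_calls = extract_tool_calls(response_content)
print([c.id for c in tool_calls])
```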

Step 2 — the runtime executes them concurrently

Your code receives the batch and runs the calls at the same time — with async tasks, threads, or a worker pool. This is ordinary concurrent programming; the model just handed you a to-do list. When every call finishes, you collect the results and send them all back together in the next message, each labelled with the id of the call it answers.

The pattern looks like this in code. The model's response carries a list of tool calls; you gather them concurrently and return one results message:

Running a batch of tool calls concurrently (Python):
import asyncio

async def run_one(call):
    fn = TOOLS[call.name]              # map name -> async function
    result = await fn(**call.input)    # actually execute the tool
    return {
        "type": "tool_result",
        "tool_use_id": call.id,        # match the result to its call
        "content": result,
    }

async def run_batch(tool_calls):
    # Launch every call at once; wait for all to finish.
    return await asyncio.gather(*(run_one(c) for c in tool_calls))

# The model returned several tool_use blocks in one turn.
# `response.tool_calls` stands in for however your SDK exposes them.
results = asyncio.run(run_batch(response.tool_calls))
# Send ALL results back in the next message, then let the model continue.

Two practical rules make this reliable. First, you must return a result for every call the model made, each keyed by its id — skip one and the conversation is malformed. Second, the order you put results into the message doesn't matter, because ids do the matching; but the content of each must clearly correspond to its call.
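
One way to enforce the first rule is a small check before sending the results message. This is a sketch under the assumption that calls carry an "id" field and results carry a "tool_use_id" field; missing_result_ids is a hypothetical helper name.

```python
def missing_result_ids(tool_calls: list[dict], results: list[dict]) -> set[str]:
    """Return the ids of calls that have no matching result yet."""
    expected = {call["id"] for call in tool_calls}
    answered = {result["tool_use_id"] for result in results}
    return expected - answered

tool_calls = [{"id": "call_1"}, {"id": "call_2"}, {"id": "call_3"}]

# Results arrive in a different order than the calls, which is fine:
# matching is by id. One result is missing, which is not fine.
results = [
    {"type": "tool_result", "tool_use_id": "call_3", "content": "Cairo: 31C"},
    {"type": "tool_result", "tool_use_id": "call_1", "content": "Tokyo: 22C"},
]

missing = missing_result_ids(tool_calls, results)
print(missing)
```

If the returned set is not empty, add an error result for each missing id rather than sending an incomplete message.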

Parallel vs sequential calling

Parallel calling is not always the right move. It only helps when the calls are independent. The moment one call needs the output of another, you're back to a sequence — and forcing it into parallel breaks the logic. Here's the contrast at a glance.

            | Parallel                              | Sequential
Best for    | Independent calls (fan-out)           | Calls where B needs A's result
Latency     | Time of the slowest call              | Sum of all calls
Model turns | Often one batch per turn              | One round-trip per call
Example     | Weather for 3 cities at once          | Find a user, then fetch their orders
Risk        | Shared-state writes, partial failures | Slow, but simpler to reason about

The dependency test is simple: can you write down the arguments for call B before call A returns? If yes, the calls are independent and can run in parallel. If B's arguments contain a value you'll only know after A finishes, they must run in order. "Find the user with this email, then get that user's orders" is sequential — you don't have the user id until the lookup returns. "Get the weather in Tokyo, Paris, and Cairo" is parallel — every argument is known up front.
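
The same test shows up directly in code. In this sketch the three tools are hypothetical stubs with made-up return values: the weather arguments are all known up front, so those calls can be gathered, while get_orders cannot even be written until find_user has returned an id.

```python
import asyncio

async def get_weather(city: str) -> str:
    await asyncio.sleep(0.01)            # simulated network wait
    return f"{city}: 20C"

async def find_user(email: str) -> dict:
    await asyncio.sleep(0.01)
    return {"id": 42, "email": email}

async def get_orders(user_id: int) -> list[str]:
    await asyncio.sleep(0.01)
    return [f"order-{user_id}-1"]

async def main():
    # Independent: every argument is known before any call returns.
    weather = await asyncio.gather(
        *(get_weather(city) for city in ("Tokyo", "Paris", "Cairo"))
    )
    # Dependent: the second call's argument comes from the first call's result.
    user = await find_user("ada@example.com")
    orders = await get_orders(user["id"])
    return weather, orders

weather, orders = asyncio.run(main())
print(weather, orders)
```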

A good model usually gets this right on its own: it parallelises the weather lookups and serialises the user-then-orders chain, because it reasons about which arguments it already has. But it can misjudge, especially with vague tool descriptions — which is exactly where the pitfalls below come from.

Common pitfalls

Parallelism is a performance win that quietly introduces the classic hazards of concurrent programming. Four traps catch people most often.

  • Hidden dependencies run in parallel. The model parallelises two calls that looked independent but weren't — e.g. it books a flight and a hotel at once, when the hotel choice should depend on the flight's arrival time. The fix is clearer tool descriptions and, when order truly matters, designing one tool that encodes the dependency instead of two loose ones.
  • Shared-state writes collide. Two parallel calls that both write the same record, increment the same counter, or append to the same file can race and corrupt state. Reads are usually safe to parallelise; writes to shared state often are not. Keep write tools idempotent or serialise them.
  • Partial failures. Three calls go out; one times out or errors while the others succeed. You must return a result for the failed call too — an error payload, not nothing — so the model can decide whether to retry, work around it, or report the gap. Dropping the failed result leaves the conversation malformed.
  • Resource exhaustion. A model that fans out twenty calls at once can blow through rate limits or open too many connections. A concurrency cap (a semaphore or a bounded worker pool) keeps a wide fan-out from overwhelming a downstream API.

Most of these trace back to tool design, not the model. Tools that are independent, idempotent, and clearly described are safe to parallelise; tools that secretly share state or depend on each other are not. See designing tools for LLMs and how to design agent tools for the patterns that make parallel-safe tools.
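
For the shared-state trap, one possible runtime policy is to run reads concurrently and writes one at a time. The sketch below assumes a hand-maintained WRITE_TOOLS set and two hypothetical tools. The increment is a read-modify-write with a wait in the middle, so two of them run at once would lose an update.

```python
import asyncio

store = {"counter": 0}

async def read_counter() -> int:
    await asyncio.sleep(0.01)
    return store["counter"]

async def increment_counter() -> int:
    # Read-modify-write with a wait in the middle: unsafe if two run at once.
    current = store["counter"]
    await asyncio.sleep(0.01)
    store["counter"] = current + 1
    return store["counter"]

TOOLS = {"read_counter": read_counter, "increment_counter": increment_counter}
WRITE_TOOLS = {"increment_counter"}   # hypothetical: tools that mutate shared state

async def run_batch(calls: list[dict]) -> dict[str, int]:
    reads = [c for c in calls if c["name"] not in WRITE_TOOLS]
    writes = [c for c in calls if c["name"] in WRITE_TOOLS]
    results = {}
    # Reads are independent, so they run concurrently.
    outputs = await asyncio.gather(*(TOOLS[c["name"]]() for c in reads))
    for call, output in zip(reads, outputs):
        results[call["id"]] = output
    # Writes run one at a time so no update is lost.
    for call in writes:
        results[call["id"]] = await TOOLS[call["name"]]()
    return results

calls = [
    {"id": "call_1", "name": "increment_counter"},
    {"id": "call_2", "name": "increment_counter"},
    {"id": "call_3", "name": "read_counter"},
]
results = asyncio.run(run_batch(calls))
print(store["counter"], results)
```

Running reads before writes is a design choice made for this sketch; a real runtime has to decide whether reads in the same batch should see the state before or after the writes.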

A worked example

Say a user asks: "Which of these three laptops is cheapest and in stock?" The agent has one tool, lookup_product(sku), that returns price and stock for a single SKU. There are three SKUs and no call needs another's result — a textbook fan-out.

Sequential path: the model calls lookup_product for laptop 1, waits, reads the result, calls it for laptop 2, waits, reads, then laptop 3, and finally answers. That is four model turns (three that each request a lookup, plus one to write the answer) and three serial network waits. If each lookup takes 600 ms and each model turn 700 ms, the user waits about 4.6 seconds (4 × 700 ms + 3 × 600 ms).

Parallel path: in one turn the model emits all three lookup_product calls with the three SKUs. Your runtime runs them concurrently (the gather from earlier), returns all three results in the next message, and the model compares them and answers. The network wait is one 600 ms slice, not three, and the model turns drop from four to two, for a total of about 2.0 seconds (2 × 700 ms + 600 ms).
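
The runtime side of the parallel path might look like the sketch below. The catalog contents and SKU names are invented for illustration. The comparison is done in code here to keep the example self-contained, whereas in a real agent the model would do it after reading the three results.

```python
import asyncio

# Hypothetical catalog standing in for a real product API.
CATALOG = {
    "LAP-1": {"price": 999, "in_stock": True},
    "LAP-2": {"price": 849, "in_stock": False},
    "LAP-3": {"price": 899, "in_stock": True},
}

async def lookup_product(sku: str) -> dict:
    await asyncio.sleep(0.01)                 # simulated network wait
    return {"sku": sku, **CATALOG[sku]}

async def cheapest_in_stock(skus: list[str]):
    # One concurrent fan-out: no lookup needs another lookup's result.
    products = await asyncio.gather(*(lookup_product(s) for s in skus))
    available = [p for p in products if p["in_stock"]]
    return min(available, key=lambda p: p["price"]) if available else None

best = asyncio.run(cheapest_in_stock(["LAP-1", "LAP-2", "LAP-3"]))
print(best)
```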

Now change the question to "How does the cheapest of these compare to last year's model?" Finding the cheapest is still a parallel fan-out, but the follow-up — look up last year's version of whichever won — is sequential, because you don't know which product to compare until the first batch returns. Real agents constantly mix the two: a parallel burst, then a dependent step, then maybe another burst. That mixing is the agentic workflow the model orchestrates turn by turn.

Going deeper

Once the basics click, a few finer points separate a demo from a robust system.

Forcing or disabling parallelism. Most APIs let you influence the model's behavior. You can require it to call a specific tool, limit it to at most one tool call per turn (disabling parallelism), or leave it free to batch. Disabling parallel calls is a useful debugging step when the model keeps parallelising things that should be sequential: it makes the agent slower but easier to reason about while you fix tool descriptions.
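
As an illustration, these are the request options two widely used APIs expose for this, to the best of my knowledge at the time of writing. Treat the parameter names as something to verify against your provider's current documentation. The model name is a placeholder and the fragments make no network call.

```python
# Request fragments only; nothing is sent anywhere.

# OpenAI-style chat request: a top-level flag turns parallel calls off.
openai_style_request = {
    "model": "MODEL_NAME",          # placeholder
    "tools": [],                    # your tool schemas go here
    "parallel_tool_calls": False,   # at most one tool call per turn
}

# Anthropic-style messages request: the switch lives inside tool_choice.
anthropic_style_request = {
    "model": "MODEL_NAME",          # placeholder
    "tools": [],                    # your tool schemas go here
    "tool_choice": {"type": "auto", "disable_parallel_tool_use": True},
}

print(openai_style_request["parallel_tool_calls"])
print(anthropic_style_request["tool_choice"])
```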

Streaming and ordering. When responses stream, each tool call's arguments arrive as partial fragments rather than as one finished object, and depending on the API, fragments belonging to different calls may interleave. Buffer until each tool-call block is complete before dispatching it, and always reconcile by id. The same applies to results: collect them all, then send the batch; don't send results back one at a time.
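
A buffering step along those lines could be sketched as follows. The event format (start, delta, and stop events carrying an index) is a simplified assumption for illustration, not any particular SDK's wire format.

```python
import json

def collect_tool_calls(events):
    """Yield each tool call only once its block is complete."""
    partial = {}
    for event in events:
        index = event["index"]
        if event["kind"] == "start":
            partial[index] = {"id": event["id"], "name": event["name"], "json": ""}
        elif event["kind"] == "delta":
            partial[index]["json"] += event["fragment"]   # arguments arrive in pieces
        elif event["kind"] == "stop":
            block = partial.pop(index)
            # Only now is the JSON guaranteed to be parseable.
            yield {"id": block["id"], "name": block["name"],
                   "input": json.loads(block["json"])}

# Hypothetical stream in which the second call completes before the first.
events = [
    {"kind": "start", "index": 0, "id": "call_1", "name": "get_weather"},
    {"kind": "start", "index": 1, "id": "call_2", "name": "get_weather"},
    {"kind": "delta", "index": 0, "fragment": '{"city": "To'},
    {"kind": "delta", "index": 1, "fragment": '{"city": "Paris"}'},
    {"kind": "stop", "index": 1},
    {"kind": "delta", "index": 0, "fragment": 'kyo"}'},
    {"kind": "stop", "index": 0},
]

complete = list(collect_tool_calls(events))
print([c["id"] for c in complete])
```

Because each yielded call keeps its id, the completion order does not matter when results are sent back.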

Concurrency control. A gather with no limit is fine for three calls and dangerous for thirty. Wrap tool execution in a bounded semaphore so a wide fan-out can't exceed your downstream rate limits or connection pool. This is invisible to the model — it's purely a property of your runtime.
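
A minimal sketch of such a cap with asyncio.Semaphore follows. fake_tool is a hypothetical stand-in that also records how many calls are in flight, and the limit of 5 is an arbitrary example value.

```python
import asyncio

in_flight = 0
peak = 0

async def fake_tool(n: int) -> int:
    # Hypothetical tool that tracks how many copies run at the same time.
    global in_flight, peak
    in_flight += 1
    peak = max(peak, in_flight)
    await asyncio.sleep(0.01)        # simulated downstream call
    in_flight -= 1
    return n * 2

async def run_capped(args, limit: int) -> list[int]:
    sem = asyncio.Semaphore(limit)   # created inside the running event loop

    async def guarded(n: int) -> int:
        async with sem:              # waits here once `limit` calls are active
            return await fake_tool(n)

    # All tasks are scheduled at once, but only `limit` run at a time.
    return await asyncio.gather(*(guarded(n) for n in args))

results = asyncio.run(run_capped(range(30), limit=5))
print(peak, results[:3])
```

gather still returns results in the order the calls were submitted, so the cap changes timing only, not how results line up with calls.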

Error semantics. Decide, per tool, whether a failure should be returned to the model as an error result (so it can adapt) or should abort the whole turn. Most of the time, returning a structured error per failed call is better: the model often recovers gracefully, retrying or routing around the gap, which is one of the strengths of the ReAct loop.
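
Returning a structured error per failed call might look like this sketch. The two tools are hypothetical, and the is_error flag is an assumption about the result format, so check what your API expects for failed tool results.

```python
import asyncio

async def ok_tool(x: int) -> str:
    return f"ok:{x}"

async def flaky_tool(x: int) -> str:
    raise TimeoutError("upstream timed out")   # simulated failure

TOOLS = {"ok_tool": ok_tool, "flaky_tool": flaky_tool}

async def run_one(call: dict) -> dict:
    try:
        content = await TOOLS[call["name"]](**call["input"])
        return {"type": "tool_result", "tool_use_id": call["id"], "content": content}
    except Exception as exc:
        # Answer the call anyway, so every id gets a result.
        return {
            "type": "tool_result",
            "tool_use_id": call["id"],
            "content": f"{type(exc).__name__}: {exc}",
            "is_error": True,                  # assumed flag name
        }

async def run_batch(calls: list[dict]) -> list[dict]:
    return await asyncio.gather(*(run_one(c) for c in calls))

calls = [
    {"id": "call_1", "name": "ok_tool", "input": {"x": 1}},
    {"id": "call_2", "name": "flaky_tool", "input": {"x": 2}},
]
results = asyncio.run(run_batch(calls))
print(results)
```

Catching the exception inside run_one means one failure cannot cancel the rest of the batch, and the model still receives a result for every call it made.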

Where this sits in the bigger picture. Parallel tool calling is one optimisation inside the broader topic of tool use. It pairs naturally with techniques like code execution as a tool, where the model writes code that itself fans out work — sometimes a cleaner alternative to emitting dozens of separate tool calls. The durable lesson: the model decides what to run in parallel based on independence, but safe parallel execution is something you engineer into your tools and runtime, not something you get for free.

FAQ

What are parallel tool calls in an LLM agent?

They are when a model requests several tool (function) calls in a single response, so your runtime can execute them at the same time instead of one after another. The model only emits the requests; your code runs them concurrently and returns all the results together. It is the standard way to speed up tasks made of independent lookups.

When should an agent use parallel tool calls instead of sequential ones?

Use parallel calls when the tools are independent — when you already know every call's arguments and no call needs another's output. Use sequential calls when one call's input comes from another's result (for example, find a user, then fetch that user's orders). The quick test: can you write call B's arguments before call A returns? If yes, parallelise.

Does parallel function calling actually make agents faster?

Yes, when the work is independent I/O. Total wait drops from the sum of all calls to roughly the time of the slowest one, and you also cut the number of model round-trips, which lowers latency and token cost. It does not help — and can cause bugs — when calls depend on each other or write to shared state.

What happens if one of several parallel tool calls fails?

You should still return a result for the failed call, but as a structured error rather than nothing, so the conversation stays well-formed. The model can then decide to retry, route around the gap, or report it. Dropping the failed result entirely usually breaks the next turn because every call id must be answered.

Why must I match parallel tool results by id instead of by order?

Because concurrent calls can finish in any order — a faster call started second may return before a slower one started first. Each tool call carries a unique id, and each result must reference that id so the model pairs them correctly. Matching by position instead of id is a race-condition bug that only appears under load.

Can I turn off parallel tool calling?

Usually yes. Most model APIs let you limit the model to at most one tool call per turn, which disables parallelism. It is a handy debugging mode when the model keeps parallelising calls that should run in order: slower, but easier to reason about while you improve your tool descriptions.

Further reading