AI/TLDR

OpenAI Agents SDK: Handoffs, Guardrails & Tracing

You will understand how the OpenAI Agents SDK coordinates multiple agents through handoffs, enforces guardrails, and traces runs, and how it relates to the earlier Swarm experiment.

INTERMEDIATE11 MIN READUPDATED 2026-06-14

In plain English

The OpenAI Agents SDK is a small, open-source framework (Python and TypeScript) for building apps where several AI agents work together instead of one big agent doing everything. It is OpenAI's production-ready successor to their earlier experimental Swarm project, and it boils the whole problem down to a handful of primitives: agents, handoffs, guardrails, and built-in tracing. This article zooms in on the three that make multi-agent systems actually reliable — handoffs, guardrails, and tracing.

OpenAI Agents SDK Handoffs — illustration
OpenAI Agents SDK Handoffs — firebasestorage.googleapis.com

Picture a hospital front desk. A patient walks in and describes a problem. The receptionist does not try to diagnose anything — they figure out who should handle it and send the patient to the right department: cardiology, dermatology, billing. That redirection is a handoff: control of the conversation passes to a specialist who takes over completely, with the patient's history already in hand.

Now add the rules pinned to the wall — things no department is allowed to do, like prescribe a drug a patient is allergic to. Those are guardrails: automatic checks that run alongside the agents and pull the emergency brake if something crosses a line. And the patient's chart, where every visit, test, and referral is logged so anyone can reconstruct what happened, is tracing. The Agents SDK gives you all three as real objects you configure in code, plus a runtime called the Runner that drives the whole flow.

Why it matters

One agent with one giant prompt and twenty tools sounds simple, but it falls apart in practice. The instructions get long and contradictory, the model picks the wrong tool, and a single context window cannot hold the knowledge for billing and refunds and technical support and legal. The natural fix is to split the work across focused agents — but the moment you do that, three hard problems appear at once.

  • Routing. Something has to decide which agent handles a given message, and pass the conversation along without losing context. Hand-rolling this with if/else logic gets brittle fast. Handoffs make routing a first-class, model-driven decision.
  • Safety. Once agents can call tools and take actions, you need checks that block bad input before it reaches a model and bad output before it reaches a user. Guardrails are that safety layer, built into the runtime rather than scattered through your code.
  • Observability. Multi-agent runs are hard to debug — a single user message can trigger several model calls, tool calls, and handoffs. When something goes wrong, you need a replay. Tracing records every step automatically so you can see exactly what happened.

The value of the SDK is that it makes these three concerns standard instead of bespoke. Before frameworks like this, every team reinvented their own routing glue, their own ad-hoc safety checks, and their own logging. Getting them as named primitives means less code to write, fewer ways to get it subtly wrong, and a shared vocabulary other developers already understand.

How it works

Everything runs inside the Runner. You call Runner.run(agent, input) and the Runner takes over the loop: it sends the conversation to the model, executes any tool the model asks for, follows any handoff the model decides to make, runs the guardrails, and stops when the active agent produces a final answer. You get back a result object holding the full history of what happened.

A handoff is just a special tool call

The key insight: a handoff is not magic — it is a tool. When you list other agents in an agent's handoffs parameter, the SDK quietly exposes each one to the model as a tool named something like transfer_to_billing_agent. The model, reading the conversation, decides whether to call that tool. If it does, the Runner swaps the active agent to the target and replays the entire conversation history into it, so the new agent has full context and continues talking to the user directly. The first agent is now out of the loop.

Because the model makes the decision, routing adapts to messy real-world phrasing far better than keyword rules. The user does not have to say "billing" — they can say "my card got charged twice" and the triage agent infers the right specialist. This is the same loop whether you have two agents or twenty.

Guardrails wrap the run, not the model

A guardrail is a function that inspects either the input to the first agent or the output from the last agent, and returns an object with a tripwire_triggered boolean. When the tripwire is True, the Runner immediately raises an exception (InputGuardrailTripwireTriggered or OutputGuardrailTripwireTriggered) and stops the run. Your code catches it and returns a safe fallback instead of letting a bad message or bad answer through. Crucially, input guardrails can run in parallel with the agent so they add little latency.

Tracing records the whole thing

Every Runner.run automatically emits a trace: a structured record of each model call, tool call, handoff, and guardrail check, tied together by a single trace ID. By default these go to the OpenAI dashboard for a visual replay, but an open processor interface lets you ship traces to other observability tools instead. You do not instrument anything — it is on by default.

A worked handoff example

Here is a triage system in a few lines. A triage agent decides between two specialists; an input guardrail blocks off-topic messages before any specialist runs. Notice how little wiring it takes — you just declare the relationships.

Installbash
pip install openai-agents   # Python 3.10+
export OPENAI_API_KEY=sk-...
triage.pypython
import asyncio
from agents import (
    Agent, Runner, InputGuardrail,
    GuardrailFunctionOutput, RunContextWrapper,
)
from pydantic import BaseModel


# --- An input guardrail (itself a tiny agent) ---
class TopicCheck(BaseModel):
    is_off_topic: bool

guard_agent = Agent(
    name="Topic Guard",
    instructions="Is this message NOT about billing or tech support?",
    output_type=TopicCheck,
)

async def check_topic(ctx: RunContextWrapper, agent, user_input):
    res = await Runner.run(guard_agent, user_input, context=ctx.context)
    return GuardrailFunctionOutput(
        output_info=res.final_output,
        tripwire_triggered=res.final_output.is_off_topic,
    )


# --- Specialists the triage agent can hand off to ---
billing = Agent(name="Billing", instructions="Answer billing questions.")
tech = Agent(name="Tech Support", instructions="Troubleshoot step by step.")

# --- Triage agent: routes + runs the guardrail ---
triage = Agent(
    name="Triage",
    instructions="Route billing issues to Billing, technical issues to Tech Support.",
    handoffs=[billing, tech],
    input_guardrails=[InputGuardrail(guardrail_function=check_topic)],
)


async def main():
    result = await Runner.run(triage, "My card was charged twice this month.")
    print(result.final_output)        # answered by the Billing agent
    print(result.last_agent.name)     # -> "Billing"

asyncio.run(main())

At runtime the triage agent recognises a billing problem, calls the hidden transfer_to_Billing tool, and the Runner switches the active agent. The billing agent answers the user directly with full context. If the user had typed something unrelated, the guardrail's tripwire would fire in parallel and you would catch the exception before the billing agent ever ran.

Handoffs vs agents-as-tools

There are two ways for one agent to use another, and beginners constantly mix them up. They produce very different conversation shapes, so choosing the right one is a real design decision.

QuestionUse a handoffUse agent-as-tool
Who talks to the user next?The new agentThe same orchestrator
Do you need several results combined?NoYes
Should the first agent stay in control?NoYes
Typical shapeCustomer-support triageResearch assistant calling sub-tasks

A rule of thumb: if the specialist should become the assistant the user is talking to, hand off. If you want a coordinator that asks a few experts and writes the final answer itself, expose those experts as tools instead. You can mix both in one system.

Common pitfalls

  • Vague handoff descriptions. The model routes based on each agent's name and handoff description. If two agents sound similar ("Support" and "Help"), the model picks wrong. Give each a clear, distinct description of when to route there.
  • Expecting guardrails on every hop. Input guardrails only run on the first agent in a chain; output guardrails only run on the last. A check you assumed protected a middle agent may never fire. For per-step safety, attach a guardrail to a specific tool instead.
  • Loops between agents. If agent A can hand off to B and B can hand back to A, a confused model can ping-pong. Set a sensible max_turns on the run so a runaway loop stops instead of burning tokens forever.
  • Forgetting context cost. Each handoff replays the full conversation into the next agent. In long chains that history grows, raising token cost and latency. Keep instructions tight and prune history where you can.
  • Trusting routing blindly. The model's routing is good, not perfect. Use tracing to review real runs and check the right specialist actually handled each case before you trust it in production.

Going deeper

Once the three primitives feel natural, a few advanced directions are worth knowing.

Structured handoff inputs

By default a handoff just transfers the conversation. But you can attach an input_type (a Pydantic model) so the handing-off agent is forced to fill in structured fields — for example a reason for an escalation or a ticket_id. The receiving agent then starts with clean, typed data instead of having to re-parse the conversation. This makes multi-agent flows much more predictable.

Filtering what gets passed on

Passing the entire history to every downstream agent is not always what you want — sometimes earlier tool noise just confuses the specialist. A handoff can take an input_filter that trims or reshapes the history before the new agent sees it (for example, dropping prior tool-call clutter). Less noise in means cleaner reasoning out.

Tool-level guardrails

Because input and output guardrails only cover the ends of a chain, the SDK also supports guardrails attached directly to a tool. These fire every time that specific function is called, no matter which agent invoked it — ideal for blocking a database write above a row limit or a payment over a threshold, deep inside a multi-agent flow.

Routing to other ecosystems

The same handoff/guardrail/tracing pattern shows up across provider SDKs, with different trade-offs. If you are comparing, see provider agent SDK comparison, the Claude Agent SDK, and Strands Agents. The OpenAI SDK also speaks the Model Context Protocol, so agents in a run can use tools served by any MCP-compatible server without you writing wrappers.

The honest open challenges remain. Model-driven routing is only as good as the model's judgment, so evaluation and tracing are not optional extras — they are how you find the cases where a handoff went to the wrong place. And every handoff trades a cleaner architecture for more model calls and more passed-around context. The durable lesson: keep each agent narrow, name and describe your handoffs precisely, guard the actions that matter, and watch real traces before you trust the system in production.

FAQ

How do handoffs work in the OpenAI Agents SDK?

You list target agents in an agent's handoffs parameter, and the SDK exposes each as a hidden tool like transfer_to_billing_agent. When the model decides to route, it calls that tool; the Runner then switches the active agent and replays the full conversation history into it. The new agent answers the user directly and the original agent steps out of the loop.

What is the difference between a handoff and using an agent as a tool?

A handoff transfers ownership of the conversation — the receiving agent replies to the user directly. Agent-as-tool keeps an orchestrator in charge: it calls a specialist like any other tool and folds the returned result into its own answer. Use handoffs when a specialist should take over; use agent-as-tool when one coordinator should gather and synthesise several results.

Do guardrails run on every agent in a handoff chain?

No. Input guardrails run only on the first agent in the chain, and output guardrails run only on the last. Agents in the middle are not automatically covered. If you need a safety check on a specific step, attach a guardrail directly to the relevant tool instead, since tool guardrails fire every time that function is called.

Is the OpenAI Agents SDK the same as OpenAI Swarm?

They share the same conceptual model but are different libraries. Swarm was explicitly labelled experimental and educational. The Agents SDK is the production-ready successor with a typed API, guardrail machinery, structured outputs, and built-in tracing, and it is actively maintained. Build on the Agents SDK rather than Swarm.

How do I stop two agents from handing off to each other forever?

Set a max_turns limit on the run so the Runner stops after a fixed number of steps instead of looping indefinitely. It also helps to give each agent precise handoff descriptions so the model is less likely to bounce control back and forth, and to review traces for ping-pong patterns before going to production.

Can I customise what happens during a handoff?

Yes. Wrap the target agent in the handoff() helper to override the tool name, add a description the model reads when deciding, run an on_handoff callback, require structured input_type data, or apply an input_filter that trims the history the receiving agent sees. These let you control routing precisely instead of relying on defaults.

Further reading