Why Tool Outputs Should Be Structured, Not Just Text

Learn that designing what a tool returns matters as much as its inputs — and how trimmed, structured results make agents more accurate and cheaper.

Intermediate · 11 min read · Updated 2026-06-13

In plain English

When an agent calls a tool, the tool runs some code and hands a result back. That result goes straight into the model's next prompt — it becomes part of what the model reads before it decides what to do next. So the output of a tool is not a side detail. It is input to the model's reasoning. Most guides obsess over the tool's name, description, and parameters. Far fewer mention that what a tool returns matters just as much.


Here is the everyday analogy. Imagine you ask an assistant, "Is the conference room free at 3pm?" A good assistant says: "Yes, it's free until 4." A bad assistant photocopies the entire building's booking system — every room, every day, every attendee — and drops the stack on your desk. Both technically contain the answer. Only one is usable. A structured tool output is the one-line answer; a raw dump is the photocopied stack.

"Structured" here means two things at once: the result is shaped (clear fields with names, like a small json object) instead of a wall of text, and it is trimmed to the parts that matter instead of the full firehose. A weather tool shouldn't return the raw 40-kilobyte API response with radar tiles and station metadata. It should return something like {"temp_c": 18, "condition": "rain", "city": "Berlin"}. Same information the agent needs — a fraction of the noise.

Why it matters

Dumping a giant raw response back to the model feels harmless — the data is there, surely the model can find what it needs. In practice it hurts on three fronts at once: cost, speed, and accuracy.

Tokens cost money and fill the context window. Every character a tool returns is re-sent to the model on the next step, and often on every step after that as the conversation grows. A 50KB JSON blob can be 15,000+ tokens. In a multi-step agent loop that runs ten times, you may pay for that blob ten times over. Trimming it to the 200 tokens that matter is a direct, repeated cost saving.
Noise lowers accuracy. Models recall facts buried in long inputs less reliably than facts near the edges — the well-known "lost in the middle" effect. Bury the one number that matters inside 14,000 tokens of irrelevant metadata and the model is genuinely more likely to grab the wrong field or miss it entirely.
Big outputs are slow. More input tokens mean higher latency on the next model call. In an agent that loops many times, that latency compounds and the whole task drags.
Shape removes guessing. Faced with a flat slab of prose, the model has to parse meaning back out before it can act. Hand it labelled fields and the parsing is already done — there is less to misread.
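
To put rough numbers on the cost point, here is a back-of-the-envelope sketch. It assumes the tool result stays in the conversation unchanged and is re-sent as input on each later model call, with no caching or history summarization. The 15,000 and 200 token figures and the ten-step loop are the ones used above; the function name is illustrative.

```python
def resent_tokens(result_tokens: int, later_calls: int) -> int:
    """Input tokens attributable to one tool result that stays in context.

    Assumption: the result is re-sent, unchanged, on each later model call.
    """
    return result_tokens * later_calls


raw_cost = resent_tokens(15_000, 10)      # the untrimmed blob
curated_cost = resent_tokens(200, 10)     # the trimmed result

print(raw_cost, curated_cost, raw_cost - curated_cost)
# prints: 150000 2000 148000
```

Under those assumptions, a single untrimmed result accounts for 148,000 extra input tokens over the loop. Real numbers depend on your tokenizer and on whether your provider caches repeated prompt prefixes.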

There's a reliability angle too. The output of one tool is frequently the input to the next decision: the model reads a search result and decides which page to open; it reads a database row and decides whether to write an update. If that intermediate result is ambiguous or cluttered, every downstream step inherits the confusion. Clean outputs make the whole agent loop steadier, not just the one step.

Who should care? Anyone wrapping an external API, a database, or a file system as a tool for an agent. The temptation is always to return response.json() verbatim because it's the least code. Resisting that one-liner is one of the highest-leverage things you can do for an agent's quality.

How it works

Mechanically, a tool result is just a string (or a small structured value serialized to a string) that your code attaches to the conversation as a tool_result before the next model call. There is no magic — you decide what goes in that string. The design work happens in the gap between "the raw data your function fetched" and "the result you hand back." Think of it as a curation step that sits between fetch and return.

Where output design lives: model calls tool (name + arguments) → your code runs (hits API / DB / file) → raw result (50KB JSON, full rows) → curate (select + label + trim) → tool_result (small, structured) → model reads it (decides next step).
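
The fetch, curate, return path can be sketched in a few lines. This is a minimal illustration, not any particular SDK: `fetch_weather_raw` is a stand-in that returns made-up data, and the message dictionary (`role`, `tool_call_id`, `content`) is a hypothetical shape, since each provider defines its own format for tool results.

```python
import json


def fetch_weather_raw(city: str) -> dict:
    """Stand-in for a real API call; returns a bulky, made-up payload."""
    return {
        "location": {"name": city, "station_id": "X123", "lat": 52.52, "lon": 13.4},
        "current": {"temp_c": 18, "condition": "rain", "radar_tiles": ["..."] * 50},
        "meta": {"request_id": "abc", "took_ms": 12},
    }


def curate_weather(raw: dict) -> dict:
    """Keep only the labelled fields the agent can act on."""
    return {
        "city": raw["location"]["name"],
        "temp_c": raw["current"]["temp_c"],
        "condition": raw["current"]["condition"],
    }


def run_tool_call(messages: list, call_id: str, city: str) -> None:
    raw = fetch_weather_raw(city)       # fetch: the full raw result
    result = curate_weather(raw)        # curate: the step this article is about
    messages.append({                   # return: attach to the conversation
        "role": "tool",                 # hypothetical message shape
        "tool_call_id": call_id,
        "content": json.dumps(result),
    })


messages = []
run_tool_call(messages, "call_1", "Berlin")
print(messages[-1]["content"])
# prints: {"city": "Berlin", "temp_c": 18, "condition": "rain"}
```

The only line that changes what the model reads is the call to `curate_weather`; swapping it for `raw` would send the radar tiles and station metadata along too.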

The four curation moves

Good output design is almost always some combination of these four moves applied to the raw result before you return it.

Select — keep only the fields the agent could plausibly act on; drop IDs it will never use, internal flags, pagination cursors, styling metadata.
Label — give the surviving values clear names so the model doesn't have to infer what a bare number means. temp_c beats 18 floating alone.
Trim — cap list lengths and text lengths. Return the top 5 results, not 500; the first 2,000 characters of a page, not 200,000.
Signal — tell the model what was left out, so it can ask for more. "Showing 5 of 213 matches" is far more useful than silently returning 5 and pretending that's all there is.
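
As a rough illustration, the four moves can be combined in one generic helper. Everything here is hypothetical: the `curate` name, its parameters, and the sample rows are invented for the sketch, and a real tool would usually hard-code its field choices instead of taking them as arguments.

```python
def curate(rows: list, fields: dict, limit: int = 5, max_chars: int = 200) -> dict:
    """Apply select, label, trim, and signal to a list of raw records.

    `fields` maps each raw key to the name the model should see.
    """
    shown = rows[:limit]                                # trim the list
    items = []
    for row in shown:
        item = {}
        for raw_key, label in fields.items():           # select + label
            value = row.get(raw_key)
            if isinstance(value, str) and len(value) > max_chars:
                value = value[:max_chars] + "..."       # trim long text
            item[label] = value
        items.append(item)
    # signal: say how much was shown out of how much exists
    return {"items": items, "showing": len(items), "total": len(rows)}


rows = [{"id": i, "title": f"Doc {i}", "body": "x" * 500, "etag": "zz"}
        for i in range(213)]
out = curate(rows, {"title": "title", "body": "excerpt"}, limit=5, max_chars=50)
print(out["showing"], out["total"], sorted(out["items"][0]))
# prints: 5 213 ['excerpt', 'title']
```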

Choosing a format: JSON, prose, or a handle

Not every output should be JSON. Three shapes cover almost everything, and picking the right one is part of the design.

Format	Best for	Example
Structured (JSON)	Discrete values the model will reason over or pass to another tool	`{"order_id": 4471, "status": "shipped", "eta": "2026-06-15"}`
Prose / text	Content meant to be read and summarized, like an article or doc snippet	The trimmed body text of a web page
Reference / handle	Data too big to inline, that a later tool can fetch by ID	`{"file_id": "rep_88", "rows": 12000, "preview": [...]}`

The handle pattern is the one beginners miss. When a result is genuinely huge — a 12,000-row query, a large file — you don't have to inline it. Return a small summary plus an identifier, and provide a second tool that fetches a slice of it by that identifier (a page, a filtered subset, a single row). The model gets enough to decide what it needs without you paying to stuff the whole thing into context. This is the same instinct behind preferring code execution over yet more tools: let the agent reach for the slice it needs instead of receiving everything up front.
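
Here is one way the handle pattern might look, as a sketch under stated assumptions. The in-memory `_STORE` stands in for whatever you would really use (a cache, a temp table, a file), and the tool names `run_report` and `fetch_rows` are invented for the example.

```python
import uuid

_STORE: dict = {}   # in-memory stand-in for a cache, temp table, or file store


def run_report(rows: list) -> dict:
    """First tool: stash the full result, return a handle plus a small preview."""
    file_id = "rep_" + uuid.uuid4().hex[:8]
    _STORE[file_id] = rows
    return {"file_id": file_id, "rows": len(rows), "preview": rows[:3]}


def fetch_rows(file_id: str, offset: int = 0, limit: int = 20) -> dict:
    """Second tool: return one slice of a stored result by its handle."""
    rows = _STORE.get(file_id)
    if rows is None:
        return {"error": f"Unknown file_id '{file_id}'. Run the report again to get a new one."}
    page = rows[offset:offset + limit]
    return {"file_id": file_id, "offset": offset, "rows": page,
            "showing": len(page), "total": len(rows)}


summary = run_report([{"n": i} for i in range(12000)])
page = fetch_rows(summary["file_id"], offset=100, limit=2)
print(summary["rows"], page["rows"])
# prints: 12000 [{'n': 100}, {'n': 101}]
```

The model sees the row count and a three-row preview up front, then pays only for the slices it asks for. A production version would also need to expire stored results.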

A worked example: raw vs curated

Say the agent calls a search_products tool. The underlying API returns rich JSON. The lazy implementation returns it verbatim; the considered one curates it. Here's the raw response the API gives your code:

Raw API response (one of 200 items shown), JSON:

{
  "meta": { "request_id": "req_9f2a", "took_ms": 41, "page": 1,
            "per_page": 200, "total": 213, "api_version": "2.4.1" },
  "results": [
    {
      "sku": "A-1182-XL", "internal_id": 90431, "name": "Trail Jacket",
      "price_cents": 12900, "currency": "USD", "in_stock": true,
      "warehouse_flags": ["W3", "W7"], "tax_class": "std",
      "image_urls": ["https://cdn.example.com/a1.jpg", "...x6"],
      "description_html": "<p>Lightweight ...</p>", "created_at": "2024-11-02",
      "_score": 8.41, "category_path": ["apparel", "outerwear"]
    }
  ]
}

Returning that for all 200 items is thousands of tokens of request_id, warehouse_flags, tax_class, raw HTML, and image URLs the agent will never act on. Now the curated version — select the fields a shopping agent actually uses, trim to the top matches, and signal what was held back:

Curate before returning, Python:

def search_products(query: str, limit: int = 5) -> dict:
    # `api` is the product-search client your code already has (not shown here)
    raw = api.search(query, per_page=200)        # full firehose
    items = raw["results"][:limit]               # TRIM the list
    return {
        "matches": [
            {                                    # SELECT + LABEL
                "sku": it["sku"],
                "name": it["name"],
                "price_usd": it["price_cents"] / 100,
                "in_stock": it["in_stock"],
            }
            for it in items
        ],
        # SIGNAL what was left out so the model can ask for more
        "showing": len(items),
        "total_found": raw["meta"]["total"],
    }

What the model now receives (first of 5 matches shown), JSON:

{
  "matches": [
    { "sku": "A-1182-XL", "name": "Trail Jacket",
      "price_usd": 129.0, "in_stock": true }
  ],
  "showing": 5,
  "total_found": 213
}

Same answer the agent needs, perhaps a tenth of the tokens, and zero parsing of HTML or guessing what price_cents means. The total_found field is the quiet hero: it tells the model there are 213 matches and only 5 are shown, so if the user asks "are there cheaper ones?" the model knows to search again with a tighter query rather than assuming it has seen everything.

Pagination, truncation, and errors

Two situations force you to leave data out: there's simply too much of it, and something went wrong. Handle both explicitly so the model isn't left guessing.

Truncation and pagination

When you cut a long result short, say so in the result. Silent truncation is dangerous because the model treats whatever it sees as complete. Two reliable patterns:

Truncate with a marker. Return the first N characters plus a note: append ... [truncated, 8200 of 41000 chars]. Now the model knows there's more and can decide whether to fetch the rest.
Paginate with a cursor. For long lists, return a page plus a next_page token, and expose a tool that accepts that token. The model pages through deliberately instead of you dumping all 500 rows at once.
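
Both patterns are short to implement. The sketch below is one possible version: the marker text follows the example above, and the `next_page` token is assumed to be a plain row offset encoded as a string, which is the simplest cursor but not the only option. The function names are illustrative.

```python
from typing import Optional


def truncate_text(text: str, max_chars: int = 2000) -> str:
    """Cut long text and append a marker saying how much was kept."""
    if len(text) <= max_chars:
        return text
    return f"{text[:max_chars]}... [truncated, {max_chars} of {len(text)} chars]"


def list_page(rows: list, page_token: Optional[str] = None, page_size: int = 50) -> dict:
    """Return one page of rows plus a token for the next page, if any."""
    start = int(page_token) if page_token else 0   # assumption: token is an offset
    page = rows[start:start + page_size]
    end = start + len(page)
    return {
        "rows": page,
        "showing": len(page),
        "total": len(rows),
        "next_page": str(end) if end < len(rows) else None,
    }


print(truncate_text("a" * 41000, 8200)[-36:])
# prints: ... [truncated, 8200 of 41000 chars]

first = list_page(list(range(500)))
print(first["showing"], first["total"], first["next_page"])
# prints: 50 500 50
```

A `next_page` of `None` tells the model it has reached the end. This sketch does not validate the token; a real tool should return a readable error if the model passes something that is not a valid token.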

Errors are outputs too

An error is just another result the model reads — so design it the same way. Returning a bare 500 or a raw stack trace tells the model nothing actionable. Return a short, plain-language message that says what failed and what to do next: "No product matched 'xl jakcet' — check spelling or try a broader term." That sentence lets the model self-correct on the next turn, which is the whole point of giving an agent tools. The same principle covers tool inputs — see how to design agent tools for the input side.
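
A wrapper along these lines turns failures into readable results. It is a sketch, not a prescribed format: the `ok` and `error` field names are assumptions, and `search_fn` is a placeholder for whatever function actually queries your product API.

```python
def search_products_safe(query: str, search_fn) -> dict:
    """Run a search and convert failures into short, actionable messages."""
    try:
        matches = search_fn(query)
    except TimeoutError:
        return {"ok": False,
                "error": "The product search timed out. Try again, or narrow the query."}
    except Exception:
        # Log the real exception on your side; the model only needs the next step.
        return {"ok": False,
                "error": "The product search failed unexpectedly. Tell the user it is unavailable."}
    if not matches:
        return {"ok": False,
                "error": f"No product matched '{query}'. Check spelling or try a broader term."}
    return {"ok": True, "matches": matches}


print(search_products_safe("xl jakcet", lambda q: [])["error"])
# prints: No product matched 'xl jakcet'. Check spelling or try a broader term.
```

Each message names what failed and suggests a next action, and no stack trace or status code reaches the model.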

Going deeper

Once the basics click — select, label, trim, signal — a few subtler ideas separate a decent tool from a great one.

Match the output to the agent's actual task, not the API's shape. The right fields depend on what the agent is trying to do. A price-comparison agent needs price and in_stock; a catalog-audit agent might need created_at and category_path. Don't design a single "complete" output and reuse it everywhere — design the output for the job, and it's fine to have two thin tools over the same API instead of one fat one.
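
For instance, two thin views over the same product record might look like the sketch below. The sample record reuses fields from the raw API response in the worked example; the function names and the joined `category` string are illustrative choices.

```python
# One sample record, using fields from the raw API response in the worked example.
ITEM = {
    "sku": "A-1182-XL", "internal_id": 90431, "name": "Trail Jacket",
    "price_cents": 12900, "currency": "USD", "in_stock": True,
    "created_at": "2024-11-02", "category_path": ["apparel", "outerwear"],
}


def price_view(item: dict) -> dict:
    """Output shaped for a price-comparison agent."""
    return {
        "sku": item["sku"],
        "name": item["name"],
        "price_usd": item["price_cents"] / 100,
        "in_stock": item["in_stock"],
    }


def audit_view(item: dict) -> dict:
    """Output shaped for a catalog-audit agent."""
    return {
        "sku": item["sku"],
        "name": item["name"],
        "created_at": item["created_at"],
        "category": " > ".join(item["category_path"]),
    }


print(price_view(ITEM))
print(audit_view(ITEM))
```

Both views share `sku` and `name`, so results from one tool can be matched against the other, but neither agent pays for fields it will not use.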

Progressive disclosure beats one big payload. Rather than returning everything a step might need, return a compact summary and let the model pull detail on demand with a follow-up call. This keeps the common case cheap and only pays for depth when the task genuinely requires it. It's the handle pattern generalized into a design philosophy: defer cost until it's needed.

Be consistent across tools. If one tool returns price_usd and another returns cost_in_dollars, the model has to learn two vocabularies. Shared field names and a shared error format across your toolset reduce the model's cognitive load and cut subtle mistakes — the same discipline good tool design brings to inputs, applied to outputs.

Measure, don't guess. The honest way to tune output design is to run the agent on real tasks and watch where it stumbles. If it keeps re-calling a search because it can't tell whether it saw all the results, add a total_found. If it grabs the wrong field, the labels or selection are off. Token counts and task success rate, tracked over a set of evaluation tasks, turn output design from taste into measurement. There is no universal "correct" output — only the one that makes your agent more accurate and cheaper on your tasks.
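
A tiny harness is enough to start tracking those two numbers. The sketch below makes several assumptions: each run is recorded as a dictionary with a `success` flag and the list of tool results the agent saw, and tokens are estimated at roughly four characters each, which is a crude heuristic. Swap in your model's real tokenizer for anything you report.

```python
import json
from statistics import mean


def approx_tokens(text: str) -> int:
    """Very rough estimate (about 4 characters per token)."""
    return max(1, len(text) // 4)


def summarize_runs(runs: list) -> dict:
    """Each run: {"success": bool, "tool_results": [JSON-serializable results]}."""
    tokens_per_run = [
        sum(approx_tokens(json.dumps(result)) for result in run["tool_results"])
        for run in runs
    ]
    return {
        "runs": len(runs),
        "success_rate": sum(1 for run in runs if run["success"]) / len(runs),
        "mean_tool_tokens": mean(tokens_per_run),
    }


runs = [
    {"success": True,
     "tool_results": [{"matches": [], "showing": 0, "total_found": 0}]},
    {"success": False,
     "tool_results": [{"blob": "x" * 4000}]},
]
report = summarize_runs(runs)
print(report["runs"], report["success_rate"])
# prints: 2 0.5
```

Run the same task set before and after a change to a tool's output and compare the two reports. The function expects at least one run and will raise on an empty list.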

Finally, remember the principle that ties it all together: a tool's output is part of the prompt for the model's next step. Every habit you have for writing a good prompt — be concise, be clear, label things, don't bury the lede — applies just as much to what your tools return. Treat output design as prompt design, because that's exactly what it is.

FAQ

What should a tool return to an AI agent?

Return the smallest, clearest result that lets the agent decide its next step — usually a structured object with a few labelled fields, trimmed to the relevant parts. Drop metadata, internal IDs, and styling the agent will never act on, and include a short signal (like a total count) when you leave data out so the model knows more exists.

Should tool outputs be JSON or plain text?

Use JSON for discrete values the model will reason over or pass to another tool (an order status, a price, a row). Use plain text for content meant to be read and summarized, like a document snippet or article body. The deciding question is whether the model needs to act on specific fields (JSON) or understand prose (text).

Why is dumping a raw API response back to the model a bad idea?

A large raw response wastes tokens — it's re-sent on every following step, so a 50KB blob in a ten-step loop is paid for many times. It also lowers accuracy, because models recall facts buried in long inputs less reliably, and it adds latency. Selecting and labelling the few fields that matter is cheaper, faster, and more reliable.

How do I handle a tool result that's too big to return?

Use a handle or reference pattern: return a small summary plus an identifier, and expose a second tool that fetches a slice of the data by that identifier (a page, a filtered subset, a single row). For long lists, paginate with a cursor. Either way, the agent gets what it needs without you inlining the whole payload.

How should a tool report an error to an agent?

Treat the error as an output the model will read. Instead of a bare 500 or raw stack trace, return a short plain-language message saying what failed and what to do next, such as suggesting a corrected query. That lets the model self-correct on its next turn instead of getting stuck.

Does trimming tool output actually save money?

Yes, and often more than expected, because the result is re-sent to the model on every subsequent step in an agent loop. A blob that's expensive once becomes expensive many times over. Cutting it to the fields that matter reduces input tokens on each of those calls, which directly lowers cost and latency.
