In plain English
Ask one language model a hard question and you get one answer. It might be brilliant. It might be confidently wrong. You have no second opinion, no one to catch the slip. Multi-agent debate fixes that by doing the obvious human thing: ask several models, have them critique each other, and let the strongest answer win.

Picture a panel of experts reviewing a tricky case. Each writes down an answer. Then they swap notes, poke holes in each other's reasoning, and revise. After a round or two, they usually converge on something better than any single first draft — because the obvious mistakes got caught and corrected out loud. That back-and-forth is exactly what multi-agent debate automates.
There is an even simpler cousin called voting (or ensembling): skip the arguing, just ask the same question several times, and keep whatever answer shows up most often. No critique, no rounds — pure wisdom of the crowd. Debate and voting are two points on the same scale: spend more compute on a question, and trade money for accuracy.
Why it matters
A single model answers in one pass. If it takes a wrong turn early in its reasoning, nothing stops it — the rest of the answer is built on the mistake. Debate and voting exist to add the missing safety net: a way to catch and correct errors that one pass can't see in itself.
- Reasoning errors. On multi-step math, logic puzzles, and planning, a single chain of thought can quietly go off the rails. When several independent attempts disagree, the disagreement itself is a signal — and the majority answer is usually the right one.
- Factual mistakes. One model may state a wrong fact fluently. A second model, asked to critique, often spots the claim it can't justify. Cross-examination surfaces shaky assertions that a lone model would never question.
- Overconfidence. A single answer comes with no honest uncertainty. If you run five attempts and they split three different ways, you've learned the question is genuinely hard — useful information you'd never get from one confident reply.
Who should care? Anyone whose task is hard and high-value enough that a wrong answer is expensive: a math or coding agent that must get the final step right, an evaluation pipeline grading other models, a research assistant where one fabricated fact poisons the report. For those, paying several times the tokens (about 3–5× for a small vote, more for a full debate) to cut the error rate is an easy trade.
And who should not? Most everyday tasks. Summarizing an email, drafting a reply, simple lookups — a single careful agent already nails these, and debate just multiplies the cost for no gain. The whole technique only earns its keep when answers are both hard to get right and costly to get wrong.
How it works
All of these methods share one idea — generate multiple answers, then aggregate them — but they differ in how much the answers talk to each other. It helps to see them as a ladder from cheapest to most thorough.
Self-consistency: vote, don't argue
The simplest version doesn't involve multiple agents at all. You ask the same model the same question several times with a non-zero temperature (so each run wanders down a slightly different reasoning path), then take a majority vote over the final answers. This trick is called self-consistency. There's no critique and no second model — just N independent attempts and a tally. It works because correct reasoning tends to converge on the same answer, while wrong reasoning scatters in many directions.
Suppose four runs finish and three of them end on 42. The majority answer is 42, so the lone outlier gets outvoted. The model never sees the other answers; aggregation happens outside the model, by simple counting.
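The tally can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `ask` is a hypothetical stand-in for a sampled model call, and the canned answers (including the outlier value) are made up to mirror the four-run example.

```python
from collections import Counter
from typing import Callable


def self_consistency(ask: Callable[[str], str], question: str, n: int = 5) -> tuple[str, float]:
    """Sample n answers and return (majority answer, share of votes it received)."""
    answers = [ask(question).strip() for _ in range(n)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n


# Stand-in for a model sampled at non-zero temperature: replays four canned
# final answers. The outlier "41" is invented for the illustration.
canned = iter(["42", "42", "41", "42"])
answer, share = self_consistency(lambda q: next(canned), "toy question", n=4)
print(answer, share)  # 42 0.75
```

Returning the vote share alongside the winner is a design choice: a low share is a cheap hint that the question is hard and the answer deserves less trust.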
Multi-agent debate: argue, then converge
Debate adds the missing step: the agents read each other's answers and respond. In round one, each agent answers independently. In round two, every agent sees the others' answers and reasoning and is asked: given what your peers said, do you stand by your answer or revise it? Repeat for a couple of rounds. Because each agent must now justify itself against criticism, weak answers tend to collapse and the group drifts toward a shared, better-argued conclusion.
The agents can be copies of one model or a mix of different models (a mixed panel often disagrees more usefully). After the final round you still need to pick a winner: take a majority vote over the agents' last answers, or hand all the transcripts to a separate judge agent that reads the debate and declares the best-supported answer.
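As a rough sketch of that loop, assume each agent is just a Python callable that takes a prompt and returns a final answer. The `Agent` type, the `debate` function, and the two toy agents are illustrative names, not any library's API, and the toy agents are rigged purely to show the mechanics.

```python
from collections import Counter
from typing import Callable

Agent = Callable[[str], str]  # hypothetical interface: prompt in, final answer out


def debate(agents: list[Agent], question: str, rounds: int = 2) -> str:
    # Round one: every agent answers independently.
    answers = [agent(question) for agent in agents]
    # Later rounds: each agent sees its peers' answers and may revise.
    for _ in range(rounds - 1):
        revised = []
        for i, agent in enumerate(agents):
            peers = "\n".join(a for j, a in enumerate(answers) if j != i)
            prompt = (
                f"Question: {question}\n"
                f"Your previous answer: {answers[i]}\n"
                f"Other agents answered:\n{peers}\n"
                "Stand by your answer or revise it. Reply with the final answer only."
            )
            revised.append(agent(prompt))
        answers = revised
    # Pick a winner by majority vote over the last-round answers.
    return Counter(answers).most_common(1)[0][0]


def stubborn(prompt: str) -> str:
    # Toy agent that always answers 408.
    return "408"


def slipped(prompt: str) -> str:
    # Toy agent: wrong on its own, revises once a peer's 408 appears in the prompt.
    return "408" if "Other agents answered" in prompt and "408" in prompt else "398"


print(debate([slipped, slipped, stubborn], "What is 17 x 24?"))  # 408
```

In this rigged run the first-round majority is wrong (two answers of 398), and the second round flips it. Real agents will not revise this reliably, so treat the outcome as a demonstration of the control flow only.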
Here's the round-two prompt at the heart of debate — it's mostly just string assembly, feeding peers' answers back in:
Here is the question:
What is 17 x 24?
Your previous answer was:
408
Other agents answered:
[Agent B] 408 — multiplied 17 x 24 step by step.
[Agent C] 398 — (looks like an arithmetic slip).
Review the other answers. If one of them exposes a
mistake in yours, revise. Otherwise, defend your answer
and explain why the others are wrong.

Debate vs voting vs reflection
These three are easy to mix up because they all try to improve a single model's first answer. The difference is who does the checking and how much it costs.
| Technique | How it improves the answer | Rough cost | Best for |
|---|---|---|---|
| Reflection (self-critique) | One agent reviews and revises its own answer | ~2× | Cheap quality boost on most tasks |
| Self-consistency (voting) | Sample N answers from one model, majority vote | N× (3–10×) | Tasks with one clear final answer (math, classification) |
| Multi-agent debate | Several agents critique each other over rounds | N × rounds (high) | Hard reasoning where peers catch each other's flaws |
Notice the progression. Reflection — a single agent checking its own work — is the cheapest win and surprisingly effective; reach for it first. Voting helps when answers are short and comparable, so you can tally them. Debate is the heaviest hammer: use it only when a question is hard enough that one model genuinely benefits from being challenged by another.
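Reflection, the first rung, fits in a handful of lines. The sketch below assumes a hypothetical `model` callable (prompt in, text out); the review wording is illustrative, and the toy model is scripted so the example runs without any API.

```python
from typing import Callable


def reflect(model: Callable[[str], str], question: str) -> str:
    """Draft an answer, then ask the same model to check and correct it."""
    draft = model(question)
    review_prompt = (
        f"Question: {question}\n"
        f"Your draft answer: {draft}\n"
        "Check the draft for mistakes, then reply with the corrected final answer only."
    )
    return model(review_prompt)


# Scripted stand-in for a model: slips on the first call, corrects on the second.
calls: list[str] = []


def toy_model(prompt: str) -> str:
    calls.append(prompt)
    return "398" if len(calls) == 1 else "408"


final = reflect(toy_model, "What is 17 x 24?")
print(final, len(calls))  # 408 2
```

The two calls per question are where the roughly 2× cost in the table comes from.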
The cost multiplier
The reason debate isn't the default for everything is simple arithmetic. Every extra answer and every extra round is a full model call you pay for. The total cost roughly equals number of agents × number of rounds, plus a final judging call.
| Setup | Model calls per question | Relative cost |
|---|---|---|
| Single agent | 1 | 1× |
| Self-consistency, 5 samples | 5 | ~5× |
| 3 agents, 2 debate rounds | 6 + 1 judge | ~7× |
| 5 agents, 3 debate rounds | 15 + 1 judge | ~16× |
Latency multiplies too. The independent first-round answers can run in parallel, but each debate round depends on the previous one, so rounds happen in sequence. A two-round debate is at least twice as slow end-to-end as a single answer — a real problem for anything user-facing. And there's a quieter cost: each round stuffs all the peers' answers back into the prompt, so the context (and the per-call token count) grows every round.
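The arithmetic behind the table can be written down directly. The function names below are made up for this sketch, and it counts calls only; it does not model the per-call token growth described above.

```python
def debate_calls(agents: int, rounds: int, judge: bool = True) -> int:
    """Total model calls: one per agent per round, plus an optional judge call."""
    return agents * rounds + (1 if judge else 0)


def sequential_steps(rounds: int, judge: bool = True) -> int:
    """Steps that must run one after another, assuming the calls
    inside a single round are issued in parallel."""
    return rounds + (1 if judge else 0)


for agents, rounds in [(3, 2), (5, 3)]:
    print(agents, rounds, debate_calls(agents, rounds), sequential_steps(rounds))
# 3 2 7 3
# 5 3 16 4
```

Call count drives the bill, while the sequential step count is a lower bound on how many model latencies a user waits through.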
Going deeper
Once the basic loop clicks, the interesting questions are about how to wire it well and when it actually pays off. A few directions worth knowing.
Diversity is the fuel. Debate and voting only help when the answers are genuinely independent. If all your agents are the same model at the same temperature, they tend to make the same mistakes, and the vote just rubber-stamps a shared error. Real gains come from diversity: different models, different prompts, different temperatures, or different tools. The goal is for one agent's blind spot to be another agent's easy catch.
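A back-of-the-envelope calculation shows why independence matters. It rests on a simplified model that is not from the source: every agent is right with the same probability `p`, errors are fully independent, and the vote counts as correct only when more than half the agents are right. The value 0.7 is an illustrative number, not a measured accuracy.

```python
from math import comb


def majority_accuracy(p: float, n: int) -> float:
    """P(more than half of n independent voters are right), each right with probability p."""
    need = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))


p = 0.7  # illustrative per-agent accuracy
print(round(majority_accuracy(p, 5), 3))  # 0.837
print(round(majority_accuracy(p, 1), 3))  # 0.7
```

Under these assumptions five independent voters beat a single one. Five agents that always share the same mistake behave like the single-voter case, so the extra four calls buy nothing.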
Aggregation is its own design choice. Majority vote is the simplest aggregator but it needs answers you can compare cleanly — easy for a number or a label, awkward for a paragraph of free text. For open-ended answers, the common pattern is an LLM judge: a separate model that reads every candidate plus the debate transcript and selects or synthesizes the best one. That judge is now a single point of failure, so its prompt deserves real care.
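One way to wire a selecting judge is sketched below. The `judge` callable, the prompt wording, and the reply format (a bare candidate number) are all assumptions made for the example; a synthesizing judge would need a different contract.

```python
from typing import Callable


def judge_pick(
    judge: Callable[[str], str],  # hypothetical: prompt in, text out
    question: str,
    candidates: list[str],
    transcript: str = "",
) -> str:
    """Ask a judge model to choose one candidate by its number."""
    numbered = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(candidates))
    prompt = (
        f"Question: {question}\n"
        f"Debate transcript:\n{transcript}\n"
        f"Candidate answers:\n{numbered}\n"
        "Reply with only the number of the best-supported answer."
    )
    reply = judge(prompt).strip()
    # The judge is a single point of failure, so reject replies that are
    # not a valid candidate number instead of guessing.
    if reply.isdecimal() and 1 <= int(reply) <= len(candidates):
        return candidates[int(reply) - 1]
    raise ValueError(f"unparseable judge reply: {reply!r}")


candidates = ["398, from a quick estimate", "408, multiplied step by step"]
picked = judge_pick(lambda prompt: "2", "What is 17 x 24?", candidates)
print(picked)  # 408, multiplied step by step
```

Failing loudly on an unparseable reply is a deliberate choice here: silently falling back to the first candidate would hide exactly the judge errors you most need to see.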
Where it fits in a bigger system. Debate is one tactic inside the broader space of multi-agent systems. It's distinct from the orchestrator–worker pattern, where a lead agent splits a task into different sub-tasks for specialists — there the agents divide labor; in debate they all attack the same problem and compare. They can combine: an orchestrator might spin up a debate only for the one sub-task that's hardest to get right.
Know when to stop climbing the ladder. The honest finding across the research is that returns diminish fast. Going from one answer to a few helps a lot; going from five to fifteen helps a little; adding a third or fourth debate round often does not help at all, and sometimes hurts as agents talk themselves into a confident huddle. For most products, the right answer is the cheapest rung that clears your quality bar, which is frequently plain reflection or three-sample voting rather than a full panel. Before reaching for debate, it's worth asking the upstream question: do you even need an agent here, or would one well-prompted call do?
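One possible way to encode "cheapest rung first" is to start with a small vote and escalate only when the samples disagree. This is a sketch under assumptions: `sample` and `run_debate` are hypothetical callables, and using disagreement as the escalation trigger is a heuristic chosen for the example, not a rule from the research.

```python
from collections import Counter
from typing import Callable


def answer_with_escalation(
    sample: Callable[[str], str],      # hypothetical: one sampled answer per call
    run_debate: Callable[[str], str],  # hypothetical: full debate, returns final answer
    question: str,
    n: int = 3,
    min_votes: int = 3,
) -> tuple[str, str]:
    """Try n-sample voting; fall back to debate if the top answer has too few votes."""
    answers = [sample(question) for _ in range(n)]
    winner, votes = Counter(answers).most_common(1)[0]
    if votes >= min_votes:
        return winner, "voting"
    return run_debate(question), "debate"


# Unanimous samples: the cheap rung is enough.
print(answer_with_escalation(lambda q: "408", lambda q: "unused", "What is 17 x 24?"))
# ('408', 'voting')

# Split samples: escalate to the expensive rung.
split = iter(["398", "408", "418"])
print(answer_with_escalation(lambda q: next(split), lambda q: "408", "What is 17 x 24?"))
# ('408', 'debate')
```

The `min_votes` threshold is the knob to tune against your own quality bar: requiring unanimity escalates often, while a bare majority escalates rarely.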
FAQ
What is multi-agent debate in AI?
It's a technique where several language model agents each answer the same question, then read and critique each other's answers over one or more rounds before converging on a final answer. The cross-examination catches mistakes a single model wouldn't question on its own, so the group answer is usually better than any single first draft.
What is the difference between self-consistency and debate?
Self-consistency samples the same model several times and takes a majority vote — the runs never see each other. Debate is heavier: the agents actually read each other's answers and reasoning and revise in response. Self-consistency is cheaper and great for tasks with one clear answer; debate helps more on hard reasoning where agents can expose each other's flaws.
Do multiple agents really give better answers?
Often, but not always. On hard reasoning and factual tasks, having independent attempts vote or debate measurably cuts errors. But the agents must be genuinely diverse — identical models tend to make identical mistakes, so the vote just confirms a shared error. On easy questions it adds cost for no real gain.
How much more expensive is multi-agent debate?
The number of model calls is roughly the number of agents times the number of rounds, plus a final judging call. Three agents over two rounds is seven calls per question versus one; five agents over three rounds is sixteen. So 5×–15× the token cost is typical, and latency grows too because rounds run in sequence. That's why it's reserved for hard, high-value questions.
When should I use a single agent instead of debate?
For most everyday tasks — summaries, drafts, simple lookups — a single careful agent is already accurate, and debate just multiplies cost. Try cheaper steps first: a single agent that reflects on and revises its own answer captures much of the benefit at roughly double the cost, not ten times. Escalate to debate only when those still miss.