Build a Team of Agents, or Do Not: The Multi-Agent Debate

Cognition said do not build multi-agents. A day later, Anthropic described the one it had just shipped. Both were right, and the gap between them is the best design rule of the year.

In June 2025 two of the industry's most prominent agent-building teams published opposite advice within a day of each other. Cognition, the company behind the autonomous coding agent Devin, put out an essay with a flat title: "Don't Build Multi-Agents". The argument was sharp and the conclusion was a single-threaded agent. Roughly twenty-four hours later, Anthropic published a field report on the multi-agent research system it had just shipped inside Claude, complete with an internal eval showing the multi-agent design beat the single-agent one by 90.2 percent.

For a few weeks this looked like a genuine schism. It was not. It was the field arguing its way to a rule, and the rule is worth more than either headline. If you are deciding whether your next system needs a second agent, this is the debate to understand, because the answer is not a preference. It is a property of your task.

Origin: where the swarm idea came from

The dream of many agents working together is older than the current wave. It runs through decades of research on multi-agent systems, the idea that you could decompose a hard problem, hand the pieces to specialised solvers, and recombine the results. When large language models got good enough to follow instructions and call tools, that old dream got a new body. If one LLM in a loop is an agent, the reasoning went, then ten LLMs in loops, each a specialist, must be a team. AutoGPT and BabyAGI in 2023 made the picture concrete and went viral on it: spawn sub-agents, let them talk, watch the work get done.

The picture was seductive and it mostly did not work. The viral demos looped, drifted, and burned tokens without converging. But the framing stuck, and by 2024 every framework had a multi-agent story. CrewAI built its whole identity on role-based crews. Microsoft AutoGen modelled agent collaboration as a structured conversation. The marketing settled into a comfortable and untested claim: more agents means more intelligence.

That claim is what Cognition decided to attack.

Present: two essays, one real disagreement

Cognition's case is built on two principles, and they are easy to state. The first: share context, and share full agent traces, not just the final messages an agent produces. The second: actions carry implicit decisions, and conflicting decisions produce bad results.

The essay makes this land with a small example. Ask a multi-agent system to build a Flappy Bird clone. The orchestrator splits the job: one sub-agent builds the background and pipes, another builds the bird. They run in parallel, and they have never seen each other's work. The first sub-agent renders a Super Mario style background because nothing told it otherwise. The second builds a bird in a visual style that does not match. The orchestrator is now holding two halves that do not fit, and there is no clean way to glue them. Each agent made a reasonable local decision. Together the decisions clashed. Cognition's conclusion: until agents can communicate as richly as people do, a single-threaded agent carrying the whole context beats a swarm that fragments it. Where a sub-agent is genuinely useful, the essay points at Claude Code, where sub-agents answer scoped questions and never write code in parallel.

Then Anthropic published the other side. Its research system is multi-agent on purpose. A lead agent, the orchestrator, reads the query, sets a strategy, and spawns sub-agents that search in parallel. Each sub-agent works in its own context window, pulls what it needs, and hands back a compressed summary. The lead synthesises. On Anthropic's internal research eval, a system with Claude Opus 4 leading and Claude Sonnet 4 sub-agents outperformed single-agent Opus 4 by 90.2 percent.

Read the two pieces together and the contradiction dissolves. Anthropic is explicit about where its design fits: tasks with heavy parallelism, work that exceeds one context window, and breadth-first exploration. It is just as explicit about where it does not fit. Anthropic says plainly that most coding tasks have fewer truly parallel pieces than research does, and that current models are not good at coordinating with each other in real time. That is Cognition's point, stated by the team that shipped the multi-agent system. The disagreement was never about architecture. It was that the two teams build for different work. Cognition builds a coding agent, where step two depends on step one and the agent writes a shared artifact. Anthropic was describing research, where you can chase ten leads at once and nobody is editing the same file.

The deciding factor: read-heavy and independent, or write-heavy and coupled

Once you see the split, the rule is simple. The question is whether your subtasks are genuinely independent and read-heavy, or tightly coupled and write-heavy.

Research is the clean case for many agents. Finding every board member across the S&P 500 companies, gathering sources on a market, mapping a competitive field: these decompose into pieces that do not touch. One sub-agent's findings do not constrain another's. The work is reading, and reading is easy to parallelise. Crucially, each sub-agent gets its own context window, so the system can hold far more information than a single window allows, and each agent stays focused instead of drowning in everything at once.

Coding is the clean case against. Refactoring a module, writing a long document, building a feature: step two depends on the result of step one, and the agents would be writing into a shared state. Parallelise that and you get the Flappy Bird problem. Every write is an implicit decision, and two agents writing without shared context will contradict each other. A single agent carrying continuous context simply does not have that failure mode.

So the test for a second agent is not whether the task feels big. It is three conditions, and you want all three. The subtasks must run independently, with no agent needing another's output mid-flight. They should need materially different tools or permissions, so the separation buys something real. And verification should be more trustworthy done outside the agent that produced the work. Miss any of the three and a single agent with a better prompt usually wins. This is the same instinct behind the orchestrator-workers pattern in the wider pattern catalogue: split work only along seams that are already separate.
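The three-condition test can be written down as a checklist. The following is a minimal sketch, not an established API: `TaskProfile`, its field names, and the two example profiles are hypothetical and exist only to make the rule concrete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskProfile:
    """Hypothetical description of a task; the field names are illustrative."""
    subtasks_independent: bool    # no sub-agent needs another's output mid-flight
    needs_distinct_tools: bool    # sub-agents need materially different tools or permissions
    external_verification: bool   # checking is more trustworthy outside the producing agent


def justifies_second_agent(task: TaskProfile) -> bool:
    """Return True only when all three conditions hold."""
    return (
        task.subtasks_independent
        and task.needs_distinct_tools
        and task.external_verification
    )


# Illustrative profiles, assumed for the example rather than measured.
research = TaskProfile(True, True, True)
refactor = TaskProfile(False, False, True)

print(justifies_second_agent(research))  # True
print(justifies_second_agent(refactor))  # False
```

The function is a plain conjunction on purpose: failing any one condition returns False, which matches the default of staying with a single agent.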

The real costs, stated plainly

Two costs decide whether a multi-agent design is worth building, and both are larger than teams expect.

The first is tokens. Anthropic measured its own systems: agents use about 4 times the tokens of a chat, and multi-agent systems use about 15 times. That multiplier is not a rounding error in a budget. In Anthropic's own analysis, token usage alone explained about 80 percent of the variance in how well the research system performed. Spend buys quality here, which is exactly why the economics only work when the task is valuable enough to justify the bill. One production cost model put a single-agent workload running at 15 dollars a day at roughly 6,750 dollars over a month once it was rebuilt as an orchestrated multi-agent system, against 450 dollars for the single agent. If your task is routine, the 15x is pure waste.
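The arithmetic behind those monthly figures is short enough to check by hand. This sketch assumes a 30-day month and applies the 15x multiplier directly to the single-agent spend, which is what reproduces the quoted numbers. Anthropic's multipliers are stated relative to chat, so the real ratio between a single agent and a multi-agent system may differ.

```python
DAYS_PER_MONTH = 30           # assumption: the quoted figures imply a 30-day month
SINGLE_AGENT_PER_DAY = 15.0   # dollars per day, from the cost model quoted above
MULTI_AGENT_MULTIPLIER = 15   # assumption: the ~15x figure applied directly to spend

single_monthly = SINGLE_AGENT_PER_DAY * DAYS_PER_MONTH
multi_monthly = single_monthly * MULTI_AGENT_MULTIPLIER

print(f"single agent: ${single_monthly:,.0f}/month")   # single agent: $450/month
print(f"multi-agent:  ${multi_monthly:,.0f}/month")    # multi-agent:  $6,750/month
print(f"extra spend:  ${multi_monthly - single_monthly:,.0f}/month")
```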

The second cost is coordination failure, and it is not anecdotal. The MAST study from a team at UC Berkeley, "Why Do Multi-Agent LLM Systems Fail?", did the unglamorous work: it had six expert annotators comb through more than 150 execution traces from five popular open-source multi-agent frameworks, and built a taxonomy of 14 distinct failure modes. The failures sort into three groups. Specification and system design problems account for 41.8 percent. Inter-agent misalignment, which is the polite name for coordination breakdown, accounts for 36.9 percent. Task verification failures make up the remaining 21.3 percent.

Read that middle number again. More than a third of the failures the study catalogued fall into a category that exists only because there is more than one agent. Agents ignore what another agent said. They proceed on an assumption instead of asking. They reset a conversation and lose the thread. A single agent cannot fail in any of those ways, because there is no second agent to misalign with. The MAST authors are blunt about the fix: many of these failures come from system design, not from a weak model, and they are not patched by adding another line to a prompt. In several cases the same model in a single-agent setup beat the multi-agent version outright. The bottleneck was the architecture.

There is a third cost worth naming because it bites in production: a peer swarm with too much autonomy tends to talk in circles, with agents validating each other's mistakes, and every agent that holds tool access widens the surface for a prompt injection to spread.

Future and impact: the pattern the field actually converged on

By 2026 the debate had converged on a working answer, and it is not "swarm" and it is not "always single agent." It is a specific shape: one orchestrator that owns the full context and the plan, delegating to isolated, mostly read-only sub-agents that each work in a scoped context window and return a compressed summary. Not a flat society of peers messaging each other. A hub.
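The hub shape can be sketched in a few lines. This is an illustration under stated assumptions, not either company's implementation: `run_sub_agent` and `orchestrate` are hypothetical names, and the sub-agent fakes its work instead of calling a model or tools. What the sketch does show is the data flow: each sub-agent keeps its working context private and only a compressed summary returns to the orchestrator.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class SubAgentResult:
    task: str
    summary: str  # compressed output; the raw trace never leaves the sub-agent


async def run_sub_agent(task: str) -> SubAgentResult:
    """Stand-in for a sub-agent with its own scoped context window.

    A real one would call a model and read-only tools; here the work is faked.
    """
    private_context = [f"note {i} about {task}" for i in range(3)]  # stays local
    await asyncio.sleep(0)  # placeholder for model and tool latency
    return SubAgentResult(task, f"{task}: {len(private_context)} sources reviewed")


async def orchestrate(query: str, subtasks: list[str]) -> str:
    """The orchestrator owns the plan, fans out, and synthesises the summaries."""
    results = await asyncio.gather(*(run_sub_agent(t) for t in subtasks))
    lines = [f"Report for: {query}"] + [f"- {r.summary}" for r in results]
    return "\n".join(lines)


report = asyncio.run(
    orchestrate("EV battery market", ["suppliers", "pricing", "regulation"])
)
print(report)
```

The sub-agents never message each other. Everything passes through the orchestrator, which is the difference between a hub and a peer swarm.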

The convergence is visible in what the players shipped, not just what they wrote. Anthropic's research system is exactly this hub shape. Cognition, the company that wrote "Don't Build Multi-Agents," shipped a feature it called "Devin Manages Devins" in March 2026: a coordinator session that scopes the work, hands each piece to a managed Devin running in its own isolated VM, and compiles the results. That is the same pattern, and it is consistent with the original essay rather than a reversal of it, because the sub-agents are scoped and isolated rather than peers writing into shared state. Surveys of what survived in production report the same thing across vendors: orchestrator plus isolated sub-agents, summaries flowing back to a single owner of context. The flat peer swarm, the thing the 2023 demos sold, did not make the cut.

Why this shape and not the swarm? It bounds the blast radius. A sub-agent that derails affects its own scoped task, not the whole system, and the orchestrator can re-run it. It keeps context clean, because the orchestrator is not flooded with ten agents' raw traces, only their distilled output. And it respects both of Cognition's principles at once: the orchestrator holds enough context to keep decisions consistent, and the sub-agents are scoped tightly enough that their implicit decisions cannot collide. One refinement follows naturally when the work gets large: rather than have a single orchestrator track a crowd of sub-agents directly, give it a few sub-leads, each spawning its own specialists, so no agent ever coordinates more than a handful of others.
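To see how quickly sub-leads become necessary, it helps to count delegation levels. The sketch below assumes a cap on how many agents any one agent coordinates directly. The function name `levels_needed` and the cap of 5 are hypothetical choices standing in for "a handful".

```python
def levels_needed(specialists: int, max_fanout: int) -> int:
    """Smallest number of delegation levels such that no agent manages more
    than max_fanout direct reports while still reaching every specialist."""
    if specialists < 1 or max_fanout < 2:
        raise ValueError("need at least 1 specialist and a fan-out of 2 or more")
    levels, reach = 1, max_fanout
    while reach < specialists:
        reach *= max_fanout  # each extra level multiplies the reachable leaves
        levels += 1
    return levels


print(levels_needed(4, 5))   # 1: the orchestrator manages all four directly
print(levels_needed(20, 5))  # 2: orchestrator -> sub-leads -> specialists
```

Under that cap, one layer of sub-leads is enough for up to 25 specialists. Each extra level adds another round of summarisation between the specialists and the orchestrator.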

For a practitioner, the takeaway is a decision you can make before writing any code. Default to a single agent. It is cheaper, it is far easier to debug, and a clean single-agent design beats a sloppy multi-agent one at the same token budget almost every time. Reach for a second agent only when the work genuinely splits into independent, read-heavy pieces, when the task is valuable enough to absorb the token multiplier, and when you have a real plan for verification. When you do reach for it, build a hub, not a swarm: an orchestrator with isolated sub-agents, talking to a few delegates rather than many. That is not a compromise between Cognition and Anthropic. It is the design both teams were describing from their different corners of the same problem. The framework you build it on is a separate decision, and the tradeoffs there are covered in the piece on agent orchestration frameworks.

The honest closing note is that the field's answer here is provisional. It rests on a current limitation: today's models are not good at coordinating with each other in real time. If that changes, the case for richer agent-to-agent collaboration reopens, and the standardisation work behind protocols like Agent2Agent is a bet that it will. Until then, the rule holds. A second agent is not a sign of a serious system. It is a cost you take on only when the task leaves you no cheaper way to win.

Council summary

This post argues that the Cognition versus Anthropic clash of June 2025 was never a real contradiction: the two teams were prescribing for different work, and reading both pieces together yields a single decision rule. That rule is the post's payload. A second agent is justified only when subtasks are genuinely independent and read-heavy, when the outcome is valuable enough to absorb a token bill roughly fifteen times that of a chat, and when verification belongs outside the agent that did the work. The post is honest about the costs, grounding them in Anthropic's own measurements and the MAST failure taxonomy, and it lands on the shape the field has converged on in production: an orchestrator that owns context delegating to isolated, scoped sub-agents, never a flat peer swarm. The reader's takeaway is a default they can apply before writing code: start with one agent, and reach for a hub only when the task leaves no cheaper way to win.
