Joe Archondis

June 24, 2026

AI Agents & RAG

Multi-Agent Orchestration: When to Use It and When Not To

The first version of my B2B lead generation system had five agents. Researcher, qualifier, enricher, writer, controller. It worked. Cost four times more to run than the rewrite I shipped three months later. Had failure modes the simpler version didn't. Took twice as long to debug when something broke in production.

I rebuilt it with one orchestrator and two workers. Same output quality. Per-lead processing time dropped from 22 seconds to 8. API cost dropped 60%. The architecture got simpler. The system got more reliable.

That experience left me with a rule I've held since: use multi-agent orchestration only when the specific benefits are real, not when the architecture sounds more impressive. Most tasks don't need it. Here's how to tell the difference.

What "Multi-Agent" Actually Means

A single agent is one model running in a loop inside a single conversation. It has tools, invokes them, gets the results back into its context, and continues until the task is done. Everything lives in one context window, and the agent can call tools dozens of times in a single turn.

A multi-agent system routes work across multiple separate LLM calls, each with its own context window. Agents can run in parallel, in sequence, or a combination of both. The orchestrator typically handles task decomposition and coordination. Workers execute specific subtasks.

The cost that most articles skip: every agent boundary introduces a separate API call, fresh context that needs task instructions re-injected, and a new failure point. You pay these costs every time. The question is whether the benefit outweighs them for your specific problem.

The Four Patterns That Justify It

Multi-agent makes sense in exactly four situations.

Parallel fan-out. You have N independent tasks that a single agent would process sequentially. Multiple agents process them simultaneously. My lead generation pipeline enriches company profiles: 50 companies, one at a time, took ~22 seconds. Ten parallel workers brought that to ~8 seconds. The tasks share no context, produce no shared state between them, and the speedup is roughly linear until you hit API rate limits. This is the cleanest justification for multi-agent. It's also the most commonly cited one, for good reason.
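A quick way to sanity-check the expected speedup before building anything: with N independent tasks of similar latency and W concurrent workers, wall-clock time is roughly ceil(N / W) rounds of one task each. The sketch below assumes uniform per-task latency and no rate limiting, which a real API will not give you. The function name and the 2-second latency are illustrative, not measurements from the pipeline.

```python
import math

def estimated_wall_time(n_tasks: int, workers: int, per_task_seconds: float) -> float:
    """Idealized wall-clock time for independent tasks of equal latency."""
    if n_tasks < 0 or workers < 1:
        raise ValueError("n_tasks must be >= 0 and workers must be >= 1")
    rounds = math.ceil(n_tasks / workers)  # each round runs up to `workers` tasks at once
    return rounds * per_task_seconds

# Hypothetical numbers: 50 tasks at 2 seconds each.
sequential = estimated_wall_time(50, workers=1, per_task_seconds=2.0)
fan_out = estimated_wall_time(50, workers=10, per_task_seconds=2.0)
speedup = sequential / fan_out
```

If the measured speedup falls well short of this estimate, the gap usually points at rate limits or uneven task latency rather than at the worker count.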

Context window overflow. The task requires reasoning over more material than fits in one context window. A document analysis pipeline might have Agent A summarize 30 PDFs separately, then Agent B synthesize across all those summaries. Each agent works within its limit. The orchestrator coordinates across the boundary. This case is rarer than people assume. Claude's 200k token context window is large. Before building a multi-stage pipeline for context reasons, confirm you've actually hit the wall.
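One way to "confirm you've hit the wall" is to estimate the token count first and only fall back to a staged pipeline when the material does not fit. This is a minimal sketch, not a definitive implementation: the 4-characters-per-token ratio is a crude assumption rather than a real tokenizer, the reserve size is arbitrary, and `ask` is a hypothetical callable standing in for whatever function sends a prompt to the model. It also assumes each individual document fits in a window on its own.

```python
from typing import Callable

CONTEXT_LIMIT_TOKENS = 200_000  # the window size quoted above
CHARS_PER_TOKEN = 4             # assumption: rough heuristic, not a tokenizer

def estimate_tokens(text: str) -> int:
    """Rough token estimate from character count."""
    return len(text) // CHARS_PER_TOKEN + 1

def fits_in_one_window(docs: list[str], reserve_tokens: int = 20_000) -> bool:
    """True if all documents plus a reserve for instructions and output fit."""
    total = sum(estimate_tokens(d) for d in docs)
    return total + reserve_tokens <= CONTEXT_LIMIT_TOKENS

def analyze(docs: list[str], ask: Callable[[str], str]) -> str:
    """One call when the material fits; summarize-then-synthesize otherwise."""
    if fits_in_one_window(docs):
        return ask("Analyze these documents:\n\n" + "\n\n---\n\n".join(docs))
    # Stage A: one summary per document, each in its own context.
    summaries = [ask("Summarize this document:\n\n" + d) for d in docs]
    # Stage B: synthesize across the summaries only.
    return ask("Synthesize across these summaries:\n\n" + "\n\n".join(summaries))
```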

Specialization. Different subtasks need genuinely different system prompts, tools, or models. A code-generation agent and a code-review agent have different purposes, different tools, different evaluation criteria. Running them in the same context means the system prompt serves two conflicting goals. The signal that specialization is needed: the agent keeps getting confused about which mode it's in, or you're writing a system prompt with two distinct sections that never interact.

Independent verification. You need a second opinion that hasn't seen the reasoning that produced the first answer. A generator produces output; a completely fresh agent reviews it with no shared context. This only works with genuine isolation. You can't ask one agent to "pretend you haven't seen this before." If the verification needs to be real, it requires a separate call.
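The isolation requirement comes down to what goes into the second call's message list. A minimal sketch, assuming a hypothetical `call(system, messages)` wrapper around the model API: the verifier gets a brand-new message list containing only the task and the final draft, never the generator's conversation. The system prompts here are placeholders.

```python
from typing import Callable

Messages = list[dict[str, str]]

def generate_then_verify(task: str, call: Callable[[str, Messages], str]) -> dict[str, str]:
    """Run a generator, then a reviewer that shares no context with it."""
    # Generator: its own system prompt and its own message list.
    draft = call(
        "You are a careful generator. Produce the requested output.",
        [{"role": "user", "content": task}],
    )
    # Verifier: a fresh message list with only the task and the finished draft.
    review = call(
        "You are an independent reviewer. List concrete problems, or reply PASS.",
        [{"role": "user", "content": f"Task:\n{task}\n\nCandidate output:\n{draft}"}],
    )
    return {"draft": draft, "review": review}
```

Because the verifier's input is built from scratch, nothing from the generator's intermediate reasoning can leak into the review unless it appears in the draft itself.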

If your problem fits one of these four, multi-agent is the right architecture. If none of them apply, a single agent with more tools is almost certainly the correct call.

When One Agent Is the Right Call

The ShawaMama operations bot connects to the restaurant chain's POS system and delivery platform. The owner asks questions in plain language: "What was my top-selling item last Saturday?" or "How did Location 3 compare to Location 1 this month?" The agent fetches the data via tool call, reasons over it, answers in under 10 seconds.

At one point I was asked whether each data source should be its own agent. One for the POS system, one for Deliverect. My answer was no. The tasks are sequential — one question produces one answer. They share context — follow-up questions reference earlier data. Everything fits in one context window with room to spare. A second agent would have added latency and a coordination layer for zero benefit.

The general rule: if tasks share context and run sequentially, keep them in one agent. Add tools, not agents.
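"Add tools, not agents" can be as small as one more entry in a tool list. The sketch below is an assumption-heavy illustration: `get_pos_sales` and `get_delivery_orders` are hypothetical stubs, not the real integrations, and the tool entries follow the name / description / input_schema shape that the Anthropic Messages API accepts in its `tools` parameter. Both data sources are exposed to the same agent, so follow-up questions keep their shared context.

```python
from typing import Any, Callable

# Hypothetical stand-ins; real versions would call the POS and delivery APIs.
def get_pos_sales(location: str, date: str) -> dict[str, Any]:
    return {"source": "pos", "location": location, "date": date, "items": []}

def get_delivery_orders(location: str, date: str) -> dict[str, Any]:
    return {"source": "delivery", "location": location, "date": date, "orders": []}

def _tool_spec(name: str, description: str) -> dict[str, Any]:
    """Tool definition with a shared (location, date) input schema."""
    return {
        "name": name,
        "description": description,
        "input_schema": {
            "type": "object",
            "properties": {
                "location": {"type": "string"},
                "date": {"type": "string"},
            },
            "required": ["location", "date"],
        },
    }

# One agent, two tools: this list is what would be passed as `tools=`.
TOOLS = [
    _tool_spec("get_pos_sales", "In-store sales for a location on a date."),
    _tool_spec("get_delivery_orders", "Delivery orders for a location on a date."),
]

HANDLERS: dict[str, Callable[..., dict[str, Any]]] = {
    "get_pos_sales": get_pos_sales,
    "get_delivery_orders": get_delivery_orders,
}

def run_tool(name: str, tool_input: dict[str, Any]) -> dict[str, Any]:
    """Dispatch a tool call requested by the model to the matching function."""
    handler = HANDLERS.get(name)
    if handler is None:
        raise KeyError(f"unknown tool: {name}")
    return handler(**tool_input)
```

Adding a third data source means one more stub, one more `TOOLS` entry, and one more `HANDLERS` key, with no new coordination layer.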

Debugging is the other factor. When a multi-agent pipeline fails, you're correlating state across multiple API calls, multiple reasoning traces, context rebuilt from scratch at each hop. A single agent has one conversation to inspect. That difference matters more than it sounds when something breaks at 2 AM.

Coordination in Practice

The orchestrator pattern: receive the task, decompose into subtasks, dispatch to workers, collect results, synthesize. With the Anthropic SDK, the simplest version dispatches workers directly from Python:

import anthropic
from concurrent.futures import ThreadPoolExecutor

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
HAIKU  = "claude-haiku-4-5-20251001"  # cheap and fast: the worker default
SONNET = "claude-sonnet-4-6"          # more capable: pass as model= for planning or synthesis

def call_agent(task: str, system: str, model: str = HAIKU) -> str:
    """One agent call: fresh context, one system prompt, one user message."""
    msg = client.messages.create(
        model=model,
        max_tokens=2048,
        system=system,
        messages=[{"role": "user", "content": task}]
    )
    return msg.content[0].text  # assumes the first content block is text

RESEARCH_SYSTEM = "Research analyst. Return JSON: founding_year, industry, headcount, recent_news."
QUALIFY_SYSTEM  = "Sales analyst. Return JSON: score (0-100), fit (good/medium/poor), reasoning."

def enrich(company: str) -> str:
    return call_agent(task="Research: " + company, system=RESEARCH_SYSTEM)

def qualify(profile: str, icp: str) -> str:
    return call_agent(task="Profile:\n" + profile + "\nICP:\n" + icp, system=QUALIFY_SYSTEM)

# Parallel enrichment across 50 companies, then sequential qualification
def run_pipeline(companies: list[str], icp: str) -> list[str]:
    with ThreadPoolExecutor(max_workers=10) as pool:
        # pool.map preserves input order and re-raises the first worker exception
        profiles = list(pool.map(enrich, companies))  # parallel fan-out
    return [qualify(p, icp) for p in profiles]        # sequential

Two decisions in that pattern carry most of the weight.

Model selection per role. The orchestrator handles complex planning and synthesis. Workers do narrow, structured work: extract this field, score this lead, reformat this output. Claude Haiku is fast and cheap for structured worker tasks — most workers complete in under 2 seconds at a fraction of Sonnet's cost. Using Sonnet across a pipeline that runs thousands of times a day is expensive. Know what each agent actually needs.

Scope of agent context. Workers should receive exactly what they need: task instructions, their specific input, their output schema. Not the orchestrator's full reasoning chain, not other workers' outputs unless explicitly required. Lean context means faster calls, lower cost, and fewer opportunities for the agent to be confused by information it doesn't need.
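Both decisions can live in one small per-role config. This is a sketch under stated assumptions: the `Role` dataclass, the `build_worker_task` helper, and the example schema are all hypothetical, and the model ID is the Haiku ID used in the code above. The point it illustrates is that the worker's prompt is built from its own input and output schema only, with nothing carried over from the orchestrator.

```python
import json
from dataclasses import dataclass

@dataclass(frozen=True)
class Role:
    """Everything a worker needs: which model, which instructions, which output shape."""
    model: str
    system: str
    output_schema: dict

# Hypothetical role definition for a narrow, structured worker task.
QUALIFIER = Role(
    model="claude-haiku-4-5-20251001",
    system="Sales analyst. Score the lead against the ICP.",
    output_schema={"score": "int 0-100", "fit": "good|medium|poor", "reasoning": "str"},
)

def build_worker_task(role: Role, item: dict) -> str:
    """Build a lean user message: this worker's input and output schema, nothing else."""
    return (
        "Input:\n" + json.dumps(item, sort_keys=True)
        + "\n\nReturn JSON matching:\n" + json.dumps(role.output_schema, sort_keys=True)
    )

task = build_worker_task(QUALIFIER, {"company": "Acme", "headcount": 120})
```

Keeping the model choice next to the system prompt also makes it obvious, in review, when a cheap worker role has quietly been moved onto the expensive model.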

State Management — The Part Nobody Talks About

This is where most multi-agent implementations quietly break in production.

In-memory passing: Agent A runs, passes output to Agent B in-process. Agent B fails. You re-run Agent A to recover the starting point. At scale, that's expensive and slow. It also means failures are invisible until something downstream breaks.

Persistent state changes the failure model. Each agent stores its output before the next stage starts. Agent B failing means retrying Agent B with Agent A's stored result — not replaying the whole pipeline from scratch.

For the lead generation system, I used PostgreSQL. Each lead is a row. Columns track progress at each stage:

The orchestrator queries WHERE status = 'queued' LIMIT 10 and dispatches that batch. If qualification fails on a lead, that row stays at status = 'researched'. I can see exactly where it stalled. Retry is one query, not a full replay.

This model gives you monitoring for free. One dashboard query showing record counts by status reveals pipeline health in real time. Backed-up stages surface immediately. Stuck records are visible rather than silent.
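The status-column model can be sketched in a few functions. This is an illustration, not the schema from the lead generation system: the column names, the `last_error` field, and the helper names are assumptions, and it uses Python's built-in sqlite3 as a stand-in for PostgreSQL so it runs without a server.

```python
import sqlite3

def init_db(conn: sqlite3.Connection) -> None:
    """Hypothetical leads table: one row per lead, one status column."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS leads (
               id INTEGER PRIMARY KEY,
               company TEXT NOT NULL,
               status TEXT NOT NULL DEFAULT 'queued',
               profile TEXT,
               qualification TEXT,
               last_error TEXT
           )"""
    )

def next_batch(conn: sqlite3.Connection, status: str, limit: int = 10) -> list[tuple]:
    """Rows waiting at a given stage, oldest first."""
    return conn.execute(
        "SELECT id, company, profile FROM leads WHERE status = ? ORDER BY id LIMIT ?",
        (status, limit),
    ).fetchall()

def record_research(conn: sqlite3.Connection, lead_id: int, profile: str) -> None:
    """Persist the stage output before the next stage starts."""
    conn.execute(
        "UPDATE leads SET profile = ?, status = 'researched', last_error = NULL WHERE id = ?",
        (profile, lead_id),
    )
    conn.commit()

def record_failure(conn: sqlite3.Connection, lead_id: int, error: str) -> None:
    """Leave status untouched so the row stays at its last completed stage."""
    conn.execute("UPDATE leads SET last_error = ? WHERE id = ?", (error, lead_id))
    conn.commit()

def counts_by_status(conn: sqlite3.Connection) -> dict[str, int]:
    """The one-query health check: how many rows sit at each stage."""
    return dict(conn.execute("SELECT status, COUNT(*) FROM leads GROUP BY status").fetchall())
```

Retrying a failed qualification is then just `next_batch(conn, "researched")` again. If several orchestrator processes claim batches at the same time, PostgreSQL's SELECT ... FOR UPDATE SKIP LOCKED is one way to avoid dispatching the same row twice; that is outside this sketch.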

Single Agent vs. Multi-Agent

| Concern | Single Agent | Multi-Agent |
|---|---|---|
| Task relationship | Sequential, shared context | Independent or parallelizable |
| Context size | Fits in one window | Exceeds single window |
| Debugging | One trace to read | Multiple traces to correlate |
| Latency | Lower (one call flow) | Higher (multiple calls) |
| API cost | Lower | Higher (each agent is billed separately) |
| Failure surface | Smaller | Larger |
| Parallelism | Sequential only | Concurrent where needed |
| Setup complexity | Low | Medium to high |

Real numbers from the lead generation pipeline (50 leads processed):

The time savings are real. So is the cost increase. At low volume, the difference is noise. At 10,000 daily runs, it determines whether the system is economically viable. Know your volume before committing to the architecture.
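The volume point is easy to check with arithmetic before committing. The sketch below is a generic cost model, not the pipeline's actual figures: every number in it, including the per-million-token prices, is a placeholder to be replaced with your own measurements and your provider's current price list.

```python
def daily_cost_usd(
    runs_per_day: int,
    calls_per_run: int,
    input_tokens_per_call: int,
    output_tokens_per_call: int,
    usd_per_mtok_in: float,
    usd_per_mtok_out: float,
) -> float:
    """Daily API spend for a pipeline, given average token counts per call."""
    tokens_in = runs_per_day * calls_per_run * input_tokens_per_call
    tokens_out = runs_per_day * calls_per_run * output_tokens_per_call
    return tokens_in / 1e6 * usd_per_mtok_in + tokens_out / 1e6 * usd_per_mtok_out

# Placeholder inputs: one larger call per run vs. three smaller calls per run.
single = daily_cost_usd(10_000, 1, 3_000, 500, usd_per_mtok_in=1.0, usd_per_mtok_out=5.0)
multi = daily_cost_usd(10_000, 3, 2_000, 500, usd_per_mtok_in=1.0, usd_per_mtok_out=5.0)
ratio = multi / single
```

Multi-agent input tokens grow faster than the call count suggests, because each extra call re-sends its own instructions; measuring real token counts per call matters more than the exact prices.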

The original five-agent version wasn't wrong because multi-agent is bad. It was wrong because I added agents before I understood the actual bottlenecks. Profiling first would have revealed that parallelism was the only real benefit, and that two worker roles captured it just as well as five agents.

Frequently Asked Questions

How many agents is too many for a multi-agent system?

There's no universal number. Every agent boundary adds latency and debugging complexity. Start with the minimum needed to address the specific problem — parallelism, context overflow, specialization, or verification. Add one agent at a time, with a concrete reason each time. If you're adding agents because the architecture feels more sophisticated, stop.

Do all agents in a multi-agent system need to use the same model?

No. Mixing models by role is often the right call. A capable model for the orchestrator doing complex planning and synthesis. A cheap, fast model for workers doing narrow structured tasks: extracting fields, formatting output, scoring leads. The cost difference between Claude Haiku and Claude Sonnet is substantial at scale — use each where its capabilities are actually needed.

How do you handle errors in a multi-agent pipeline?

Design for idempotency and persistence. Each agent's output gets stored before the next stage starts. Downstream failures retry from the stored upstream result, not from the beginning of the pipeline. Avoid in-memory state passing for anything meant to run reliably in production — a process crash or API timeout will lose that state silently.

Is multi-agent orchestration worth building for a small team?

Only if the specific benefits apply. A single well-designed agent with 10 tools is simpler to build, test, and maintain than a 4-agent system. The operational complexity compounds as the system evolves. If you can get the same output from one agent, that's the better choice — not because multi-agent is harder, but because simpler systems fail less and cost less to fix.

What's the difference between a multi-agent system and a single agent with many tools?

A single agent with many tools runs sequentially in one context window, calling tools as needed. A multi-agent system runs multiple context windows — in parallel or staged — with coordination between them. Use multi-agent when you need parallelism, when tasks exceed context limits, or when specialization requires genuinely different system prompts and tool access. For everything else, more tools is the answer.


Author: Joe Archondis — AI systems engineer and HFT infrastructure builder.

Last updated: 2026-06-24