By AI Tool Briefing Team

Anthropic's Multi-Agent System: 4-Hour AI Dev


I woke up yesterday to two Anthropic announcements that, taken together, tell you exactly where this company is headed. The first: a multi-agent coding system that runs autonomous development sessions for up to four hours. The second: a quiet API change that tripled the token cap for batch processing. Neither announcement was accidental. Both dropped on April 4th, 2026.

Anthropic isn’t just making smarter models anymore. They’re building the scaffolding for those models to work as teams. And after spending the morning reading through the technical details, I think this is the smartest bet any AI lab has placed on agentic development this year.

Quick Verdict

| Aspect | Details |
| --- | --- |
| What | Multi-agent coding system with 3 coordinated AI agents |
| Architecture | GAN-inspired: planner + generator + evaluator |
| Session Length | Up to 4 hours autonomous operation |
| Self-Improvement | 5-15 iteration loops before human review |
| API Change | Message Batches max_tokens raised to 300K (Opus 4.6 & Sonnet 4.6) |
| Competes With | Devin, GitHub Copilot Workspace, Cursor Background Agents |

Bottom line: Anthropic just staked a first-party claim in autonomous AI coding. The three-agent loop is genuinely novel, and the 300K batch token cap suggests they’re building for sessions that eat through context at industrial scale.

What Is the Anthropic Multi-Agent System?

Here’s the short version: three specialized AI agents working in a coordinated loop to plan, write, and critique code autonomously.

The architecture borrows from GANs (generative adversarial networks), where two neural networks improve each other through competition. Anthropic took that adversarial concept and split it into three roles:

  1. The Planner breaks down a task into subtasks, decides the order of operations, and maintains the high-level strategy. Think of it as the architect who doesn’t write code but decides what gets built and in what sequence.

  2. The Generator writes the actual code. It receives instructions from the planner and produces implementations, file by file, function by function. This is the builder.

  3. The Evaluator reviews the generator’s output against the planner’s intent. It catches bugs, flags inconsistencies, identifies missed requirements, and sends feedback back into the loop. This is the code reviewer who never gets tired and never rubber-stamps a pull request.

The three agents run in coordinated loops. The evaluator’s feedback goes back to the planner, which adjusts its strategy. The planner gives updated instructions to the generator. The generator produces revised code. The evaluator checks again. This cycle repeats 5 to 15 times per session before the system pauses for human review.

That’s not a chatbot writing code. That’s a development team in a box.
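Anthropic hasn't published the implementation, but the loop described above can be sketched in a few lines. Everything here is illustrative: the three agent functions are stubs standing in for separate Claude instances, and the names (`run_session`, `Review`, the approval rule) are my own, not Anthropic's interfaces.

```python
from dataclasses import dataclass

@dataclass
class Review:
    approved: bool
    notes: str

# Stub agents standing in for three separate Claude instances.
def planner(task, feedback):
    # Decomposes the task; revises strategy when the evaluator reports problems.
    return f"plan for {task}" + (f" (revised: {feedback.notes})" if feedback else "")

def generator(plan):
    # Implements whatever the planner's current instructions call for.
    return f"code implementing {plan}"

def evaluator(code, plan, iteration):
    # Critiques output against intent; here it approves after a few rounds.
    return Review(approved=iteration >= 4, notes=f"pass {iteration}")

def run_session(task, max_iterations=15):
    feedback = None
    for i in range(max_iterations):
        plan = planner(task, feedback)      # strategy, adjusted by feedback
        code = generator(plan)              # implementation
        review = evaluator(code, plan, i)   # critique against intent
        if review.approved:
            break
        feedback = review                   # evaluator output feeds the planner
    return code, review, i + 1              # system pauses here for human review

code, review, iterations = run_session("add pagination to the API")
```

The structural point is that the evaluator's output is routed to the planner, not straight back to the generator, which is what lets the strategy (and not just the code) improve across iterations.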

Why the GAN-Inspired Design Matters

I’ve been tracking AI coding assistants for a while now, and most of them follow the same pattern: you give the AI a prompt, it writes code, you review it, you iterate manually. The human is always the evaluator.

The Anthropic multi-agent system removes that bottleneck for the iterative middle stage. The evaluator agent does the code review pass that you’d normally do yourself, catches the obvious (and sometimes non-obvious) mistakes, and sends them back for revision before you ever see the output.

This is fundamentally different from what Cursor, Copilot, or even Claude Code have been doing in their single-agent modes. Those tools are built for assistance. This system is designed for autonomy.

The GAN parallel isn’t perfect. In a true GAN, the discriminator and generator are adversarial. Here, the evaluator and generator are cooperative, more like a senior dev and a junior dev pair-programming. But the self-improvement loop is real. Each iteration genuinely produces better code than the last, because the evaluator catches things the generator missed and the planner adjusts its approach based on what’s actually working.

Four-Hour Sessions: What That Actually Looks Like

Most AI coding tools time out after a few minutes of autonomous work. Cursor’s background agents run for maybe 10-15 minutes on a task. Devin can go longer but frequently drifts off-track on complex projects (I’ve written about this — the “autonomy tax” is real).

Anthropic’s system is designed for four-hour autonomous sessions. That’s not a marketing number. It reflects the architecture: with 5-15 self-improvement iterations, each involving planning, generation, and evaluation across a substantial codebase, you burn through a lot of compute and a lot of context.

Which brings us to the second announcement.

The 300K Token Batch Cap Isn’t a Coincidence

On the same day they announced the multi-agent system, Anthropic raised the Message Batches API max_tokens from 128K to 300K for both Opus 4.6 and Sonnet 4.6. That’s not a random infrastructure upgrade. That’s building the highway before opening the factory.

A single iteration of the planner-generator-evaluator loop probably consumes 15-30K tokens of output depending on task complexity. Multiply that by 10 iterations, add the context from the codebase being worked on, and you can see why 128K wasn’t going to cut it.
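Spelled out with the article's own estimates (these are my back-of-envelope figures, not published numbers):

```python
# Back-of-envelope token budget for one autonomous session, using the
# 15-30K-output-tokens-per-iteration estimate from the text.
tokens_per_iteration = (15_000, 30_000)
iterations = 10

low = tokens_per_iteration[0] * iterations    # 150,000 tokens
high = tokens_per_iteration[1] * iterations   # 300,000 tokens

old_cap, new_cap = 128_000, 300_000
# Even the low estimate blows past the old 128K cap before you add any
# codebase context; the high estimate lands exactly at the new ceiling.
fits_old_cap = high <= old_cap   # False
fits_new_cap = high <= new_cap   # True
```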

300K tokens of batch output means the system can run substantial autonomous sessions without hitting the ceiling. The timing tells the story: they built the engine and expanded the fuel tank on the same day.

For developers using the Anthropic API, this also means batch processing jobs that previously required splitting into multiple calls can now run in a single request. That’s relevant even outside the multi-agent context — anyone doing large-scale document processing, code analysis, or data transformation benefits.
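For a concrete sense of what that single request looks like: the payload below follows the Message Batches API's requests format (a list of entries, each with a `custom_id` and `params`), but the model identifier is a guess at the Sonnet 4.6 id and the 300K value assumes the raised cap. Only the JSON body is built here; no API call is made.

```python
import json

# One Message Batches entry under the raised cap. Shape follows the Batches
# API's requests format; "claude-sonnet-4-6" is a hypothetical model id.
request = {
    "custom_id": "refactor-auth-module",
    "params": {
        "model": "claude-sonnet-4-6",   # assumed identifier
        "max_tokens": 300_000,          # previously capped at 128K
        "messages": [
            {"role": "user", "content": "Refactor the auth module ..."}
        ],
    },
}

# Before the cap increase this job would have been split across multiple
# ~100K-token entries; now it fits in a single one.
body = json.dumps({"requests": [request]})
```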

How Does This Compare to the Competition?

The autonomous AI coding space has gotten crowded. Here’s where things stand as of this week:

| System | Approach | Max Session | Self-Correction | Human Checkpoints |
| --- | --- | --- | --- | --- |
| Anthropic Multi-Agent | 3-agent loop (plan/gen/eval) | ~4 hours | 5-15 iterations | After each session |
| Devin (Cognition) | Single agent with tools | Unlimited (but drifts) | Limited | On-demand |
| Copilot Workspace | Plan → implement → review | ~30 min tasks | 1-2 iterations | Before merge |
| Cursor Background Agents | Single agent, background exec | ~15 min | Minimal | After completion |
| Claude Code (standard) | Single agent, interactive | Continuous (manual) | User-driven | Every response |

The key differentiator is the evaluator agent. Devin will keep working indefinitely, but without a built-in critic, it accumulates errors over long sessions. I’ve had Devin sessions where it spent 45 minutes confidently building on a flawed assumption it made in minute three. Nobody was checking its work.

Anthropic’s approach forces self-correction at every iteration. That’s slower per cycle, but the output quality after 10 iterations should be dramatically better than what a single agent produces in a straight line.

What Are the Three Agents in Anthropic’s Multi-Agent System?

For anyone skimming, here’s the precise breakdown:

  1. Planner Agent — Decomposes the task, sequences subtasks, maintains architectural coherence, and adjusts strategy based on evaluator feedback
  2. Generator Agent — Writes code, creates files, implements features according to the planner’s current instructions
  3. Evaluator Agent — Reviews generated code against requirements, runs tests when available, identifies bugs and missed edge cases, and provides structured feedback to the planner

Each agent is a separate Claude instance. They share context about the project but maintain distinct system prompts optimized for their role. The planner thinks strategically. The generator thinks tactically. The evaluator thinks critically. That role separation is what makes the loop productive rather than circular.
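That "shared context, distinct system prompts" setup is easy to picture as three configurations over one project description. The prompt wording below is illustrative, written by me, not taken from Anthropic:

```python
PROJECT_CONTEXT = "Repo: payments-service. Task: migrate REST endpoints to v2."

# Three instances share project context but get role-specific system prompts.
# Wording is illustrative, not Anthropic's actual prompts.
ROLE_PROMPTS = {
    "planner": "Decompose the task into ordered subtasks and revise strategy "
               "when the evaluator reports problems. Never write code.",
    "generator": "Implement exactly the subtask the planner assigns, "
                 "one file or function at a time.",
    "evaluator": "Review generated code against the plan and requirements. "
                 "Report every bug, inconsistency, and missed requirement.",
}

def build_agent_config(role):
    # Each agent call pairs its role prompt with the shared project context.
    return {"system": ROLE_PROMPTS[role], "context": PROJECT_CONTEXT}

configs = [build_agent_config(r) for r in ("planner", "generator", "evaluator")]
```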

Who Should Care About This

Enterprise dev teams running large codebases. Four-hour autonomous sessions with built-in quality checks could handle the kind of refactoring and migration work that nobody wants to do manually. I’m thinking database migrations, API version upgrades, test coverage expansion — the grunt work that’s important but tedious.

Solo developers and small teams who need to punch above their weight. If you’re a two-person startup and you need to ship features at a pace that normally requires five engineers, a system that can autonomously generate and self-review code for hours at a stretch changes the math.

AI-forward consultancies billing for development work. (Yes, this raises questions about billing models. That’s a different article.) The firms that figure out how to supervise autonomous AI dev sessions effectively will have a massive cost advantage.

If you’re still evaluating whether AI agents are ready for real development work, this might be the release that tips the balance.

Where I Think This Falls Short (For Now)

I don’t want to oversell this. A few concerns:

No public access yet. The multi-agent system was announced but isn’t generally available as of today. Details are thin on pricing, API access, and exactly how the human review checkpoints work. Anthropic has a pattern of announcing capabilities before they’re widely accessible.

Four hours of compute isn’t cheap. Running three Claude instances in coordinated loops for four hours will consume serious API credits. I’d estimate a single session could run $50-200+ depending on the model tier and codebase size. That’s fine for high-value tasks, but it’s not “leave it running on everything.”
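For a rough sense of where that estimate comes from, here's the arithmetic under placeholder prices. The per-million-token rates and per-iteration token counts below are assumptions for illustration, not Anthropic's published pricing:

```python
# Placeholder prices in $/1M tokens -- assumptions, not published rates.
PRICE_PER_M_INPUT = 15.0
PRICE_PER_M_OUTPUT = 75.0

def session_cost(iterations, input_tokens_per_iter, output_tokens_per_iter):
    # Total cost of one autonomous session at the assumed rates.
    input_total = iterations * input_tokens_per_iter
    output_total = iterations * output_tokens_per_iter
    return ((input_total / 1e6) * PRICE_PER_M_INPUT
            + (output_total / 1e6) * PRICE_PER_M_OUTPUT)

# 10 iterations, ~200K input tokens per pass (three agents re-reading
# codebase context) and ~30K output tokens per pass:
cost = session_cost(10, 200_000, 30_000)   # roughly $50 at these rates
```

Larger codebases, more iterations, or a pricier model tier push this toward the upper end of the range, which is why "leave it running on everything" isn't the right mental model.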

The evaluator is only as good as its criteria. Code review is subjective. The evaluator can catch bugs, type errors, and logical inconsistencies. It probably can’t catch the kind of architectural mistakes that only become obvious six months later when the codebase has grown. Self-improvement loops don’t fix bad requirements.

Integration with existing workflows is unclear. Does this plug into GitHub? Does it create PRs? Does it run your test suite? The announcement didn’t cover the developer experience in detail, and that matters as much as the architecture.

The Bigger Picture: Anthropic’s Agentic Strategy

Zoom out and you can see the pattern. Opus 4.6 shipped with agent teams in Claude Code. The MCP protocol gave Claude the ability to use external tools. Now a first-party multi-agent coding system with GAN-inspired self-improvement.

Anthropic is building the full stack for autonomous AI development. Not just a smart model. Not just an IDE integration. A complete system where multiple AI agents plan, execute, review, and iterate on software — with humans providing oversight at checkpoints rather than at every keystroke.

That’s a direct challenge to Cognition’s Devin, GitHub’s Copilot Workspace, and Cursor’s background agents. But it’s also a challenge to every AI company that’s treated coding assistance as a single-agent problem. The argument Anthropic is making with this release: one agent writing code is assistance. Three agents collaborating is development.

Whether that argument holds up in practice depends on details we don’t have yet. But the architecture is sound, the self-improvement loop is the right idea, and the 300K batch token cap tells me Anthropic is serious about making this work at scale.

What to Watch Next

A few things I’m tracking:

  • Pricing and availability — When does the multi-agent system go GA, and at what cost per session?
  • Benchmark results — Anthropic will presumably publish SWE-Bench and similar scores for multi-agent vs. single-agent mode. Those numbers will tell us if the evaluator loop actually improves output quality measurably.
  • Third-party integrations — Can you plug this into your existing CI/CD pipeline, or is it a walled garden?
  • Competition response — OpenAI and Google have been quiet on multi-agent coding. That won’t last. Expect announcements within weeks, not months.

The autonomous AI coding race just got its most interesting entry. I’ll be writing a hands-on review the moment I get access.


Last updated: April 5, 2026. Based on Anthropic’s April 4 announcements. Feature details subject to change before general availability.

Related reading: