Anthropic's Three-Agent Coding System Is a Dev Team in a Box
I woke up yesterday to two Anthropic announcements that, taken together, tell you exactly where this company is headed. The first: a multi-agent coding system that runs autonomous development sessions for up to four hours. The second: a quiet API change that more than doubled the token cap for batch processing. Neither announcement was accidental. Both dropped on April 4th, 2026.
Anthropic isn’t just making smarter models anymore. They’re building the scaffolding for those models to work as teams. And after spending the morning reading through the technical details, I think this is the smartest bet any AI lab has placed on agentic development this year.
Quick Verdict
| Aspect | Details |
|---|---|
| What | Multi-agent coding system with 3 coordinated AI agents |
| Architecture | GAN-inspired: planner + generator + evaluator |
| Session Length | Up to 4 hours of autonomous operation |
| Self-Improvement | 5-15 iteration loops before human review |
| API Change | Message Batches max_tokens raised to 300K (Opus 4.6 & Sonnet 4.6) |
| Competes With | Devin, GitHub Copilot Workspace, Cursor Background Agents |

Bottom line: Anthropic just staked a first-party claim in autonomous AI coding. The three-agent loop is genuinely novel, and the 300K batch token cap suggests they’re building for sessions that eat through context at industrial scale.
Here’s the short version: three specialized AI agents working in a coordinated loop to plan, write, and critique code autonomously.
The architecture borrows from GANs (generative adversarial networks), where two neural networks improve each other through competition. Anthropic took that adversarial concept and split it into three roles:
The Planner breaks down a task into subtasks, decides the order of operations, and maintains the high-level strategy. Think of it as the architect who doesn’t write code but decides what gets built and in what sequence.
The Generator writes the actual code. It receives instructions from the planner and produces implementations, file by file, function by function. This is the builder.
The Evaluator reviews the generator’s output against the planner’s intent. It catches bugs, flags inconsistencies, identifies missed requirements, and sends feedback back into the loop. This is the code reviewer who never gets tired and never rubber-stamps a pull request.
The three agents run in coordinated loops. The evaluator’s feedback goes back to the planner, which adjusts its strategy. The planner gives updated instructions to the generator. The generator produces revised code. The evaluator checks again. This cycle repeats 5 to 15 times per session before the system pauses for human review.
That’s not a chatbot writing code. That’s a development team in a box.
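Anthropic hasn’t published orchestration code, but the loop described above is straightforward to sketch. Everything below is my own illustration — the agent stubs, the `run_session` shape, and the stand-in evaluator logic are assumptions, not Anthropic’s API:

```python
# Illustrative planner/generator/evaluator loop. The three "agents"
# here are stub functions; in the real system each would be a separate
# Claude instance with its own role-specific system prompt.

MAX_ITERATIONS = 15  # the announced ceiling before human review
MIN_ITERATIONS = 5

def planner(task, feedback=None):
    """Break the task into an ordered plan, folding in evaluator feedback."""
    plan = [f"step: {task}"]
    if feedback:
        plan.append(f"revise: {feedback}")
    return plan

def generator(plan):
    """Produce an implementation for each step in the plan."""
    return "\n".join(f"# code for {step}" for step in plan)

def evaluator(code, iteration):
    """Critique the output; return None once it passes review."""
    # Stand-in check: pretend issues keep surfacing until iteration 5.
    if iteration < MIN_ITERATIONS:
        return f"issue found on iteration {iteration}"
    return None

def run_session(task):
    """Run the coordinated loop until the evaluator signs off."""
    feedback = None
    code = ""
    for iteration in range(1, MAX_ITERATIONS + 1):
        plan = planner(task, feedback)         # strategy
        code = generator(plan)                 # implementation
        feedback = evaluator(code, iteration)  # critique
        if feedback is None:
            return code, iteration             # pause for human review
    return code, MAX_ITERATIONS

code, iterations = run_session("add pagination to the API")
```

The structure is the point, not the stubs: feedback flows evaluator → planner → generator, and the session only surfaces to a human once the critic stops finding problems or the iteration cap is hit.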
I’ve been tracking AI coding assistants for a while now, and most of them follow the same pattern: you give the AI a prompt, it writes code, you review it, you iterate manually. The human is always the evaluator.
The Anthropic multi-agent system removes that bottleneck for the iterative middle stage. The evaluator agent does the code review pass that you’d normally do yourself, catches the obvious (and sometimes non-obvious) mistakes, and sends them back for revision before you ever see the output.
This is fundamentally different from what Cursor, Copilot, or even Claude Code have been doing in their single-agent modes. Those tools are brilliant at assist. This system is designed for autonomy.
The GAN parallel isn’t perfect. In a true GAN, the discriminator and generator are adversarial. Here, the evaluator and generator are cooperative, more like a senior dev and a junior dev pair-programming. But the self-improvement loop is real. Each iteration genuinely produces better code than the last, because the evaluator catches things the generator missed and the planner adjusts its approach based on what’s actually working.
Most AI coding tools time out after a few minutes of autonomous work. Cursor’s background agents run for maybe 10-15 minutes on a task. Devin can go longer but frequently drifts off-track on complex projects (I’ve written about this — the “autonomy tax” is real).
Anthropic’s system is designed for four-hour autonomous sessions. That’s not a marketing number. It reflects the architecture: with 5-15 self-improvement iterations, each involving planning, generation, and evaluation across a substantial codebase, you burn through a lot of compute and a lot of context.
Which brings us to the second announcement.
On the same day they announced the multi-agent system, Anthropic raised the Message Batches API max_tokens from 128K to 300K for both Opus 4.6 and Sonnet 4.6. That’s not a random infrastructure upgrade. That’s building the highway before opening the factory.
A single iteration of the planner-generator-evaluator loop probably consumes 15-30K tokens of output depending on task complexity. Multiply that by 10 iterations, add the context from the codebase being worked on, and you can see why 128K wasn’t going to cut it.
300K tokens of batch output means the system can run substantial autonomous sessions without hitting the ceiling. The timing tells the story: they built the engine and expanded the fuel tank on the same day.
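The back-of-envelope math holds up. Using the article’s own estimates (15-30K output tokens per iteration, 10 iterations), here’s why 128K was a ceiling and 300K isn’t — every figure is an estimate from the text, not a measured value:

```python
# Rough output-token budget for one autonomous session,
# using the per-iteration estimates above (assumptions, not measurements).
tokens_per_iteration_low = 15_000
tokens_per_iteration_high = 30_000
iterations = 10

output_low = tokens_per_iteration_low * iterations    # 150,000 tokens
output_high = tokens_per_iteration_high * iterations  # 300,000 tokens

old_cap = 128_000
new_cap = 300_000

# Even the conservative estimate blows past the old cap,
# while the new cap covers the high end.
print(output_low > old_cap)    # True
print(output_high <= new_cap)  # True
```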
For developers using the Anthropic API, this also means batch processing jobs that previously required splitting into multiple calls can now run in a single request. That’s relevant even outside the multi-agent context — anyone doing large-scale document processing, code analysis, or data transformation benefits.
The autonomous AI coding space has gotten crowded. Here’s where things stand as of this week:
| System | Approach | Max Session | Self-Correction | Human Checkpoints |
|---|---|---|---|---|
| Anthropic Multi-Agent | 3-agent loop (plan/gen/eval) | ~4 hours | 5-15 iterations | After each session |
| Devin (Cognition) | Single agent with tools | Unlimited (but drifts) | Limited | On-demand |
| Copilot Workspace | Plan → implement → review | ~30 min tasks | 1-2 iterations | Before merge |
| Cursor Background Agents | Single agent, background exec | ~15 min | Minimal | After completion |
| Claude Code (standard) | Single agent, interactive | Continuous (manual) | User-driven | Every response |
The key differentiator is the evaluator agent. Devin will keep working indefinitely, but without a built-in critic, it accumulates errors over long sessions. I’ve had Devin sessions where it spent 45 minutes confidently building on a flawed assumption it made in minute three. Nobody was checking its work.
Anthropic’s approach forces self-correction at every iteration. That’s slower per cycle, but the output quality after 10 iterations should be dramatically better than what a single agent produces in a straight line.
For anyone skimming, here’s the one architectural detail that matters:
Each agent is a separate Claude instance. They share context about the project but maintain distinct system prompts optimized for their role. The planner thinks strategically. The generator thinks tactically. The evaluator thinks critically. That role separation is what makes the loop productive rather than circular.
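At the orchestration layer, that role separation is essentially prompt engineering. A minimal sketch of what three role-specific configurations might look like — the prompt wording and the model identifier are mine, not Anthropic’s:

```python
# Hypothetical role-specific system prompts. In the real system, each
# agent would be a separate Claude instance configured with one of these
# plus shared project context.
ROLES = {
    "planner": (
        "You are the planner. Break the task into ordered subtasks. "
        "Do not write code. Revise the plan when the evaluator reports issues."
    ),
    "generator": (
        "You are the generator. Implement exactly the subtasks the planner "
        "specifies, file by file. Do not redesign the plan."
    ),
    "evaluator": (
        "You are the evaluator. Review the generator's output against the "
        "planner's intent. Report bugs, inconsistencies, and missed "
        "requirements. Never rubber-stamp."
    ),
}

def make_agent_config(role, model="claude-sonnet-4-6"):  # model ID assumed
    """Bundle a role-specific system prompt into an agent configuration."""
    return {"model": model, "system": ROLES[role]}
```

Distinct system prompts with shared context is what keeps the loop productive: the generator can’t quietly redefine the plan, and the evaluator has no incentive to approve its own work.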
Enterprise dev teams running large codebases. Four-hour autonomous sessions with built-in quality checks could handle the kind of refactoring and migration work that nobody wants to do manually. I’m thinking database migrations, API version upgrades, test coverage expansion — the grunt work that’s important but tedious.
Solo developers and small teams who need to punch above their weight. If you’re a two-person startup and you need to ship features at a pace that normally requires five engineers, a system that can autonomously generate and self-review code for hours at a stretch changes the math.
AI-forward consultancies billing for development work. (Yes, this raises questions about billing models. That’s a different article.) The firms that figure out how to supervise autonomous AI dev sessions effectively will have a massive cost advantage.
If you’re still evaluating whether AI agents are ready for real development work, this might be the release that tips the balance.
I don’t want to oversell this. A few concerns:
No public access yet. The multi-agent system was announced but isn’t generally available as of today. Details are thin on pricing, API access, and exactly how the human review checkpoints work. Anthropic has a pattern of announcing capabilities before they’re widely accessible.
Four hours of compute isn’t cheap. Running three Claude instances in coordinated loops for four hours will consume serious API credits. I’d estimate a single session could run $50-200+ depending on the model tier and codebase size. That’s fine for high-value tasks, but it’s not “leave it running on everything.”
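That $50-200 estimate is back-of-envelope, so here’s the arithmetic behind it. Every price and token figure below is an assumption for illustration — check Anthropic’s current pricing before relying on any of this:

```python
# Rough session cost model. All figures are assumptions:
# iteration count, context sizes, and per-million-token prices.
def session_cost(iterations, agents, ctx_tokens_per_agent,
                 out_tokens_per_iter, price_in_per_mtok, price_out_per_mtok):
    """Estimate API cost (USD) for one multi-agent session."""
    total_in = iterations * agents * ctx_tokens_per_agent
    total_out = iterations * out_tokens_per_iter
    return (total_in / 1e6) * price_in_per_mtok \
         + (total_out / 1e6) * price_out_per_mtok

# Premium-tier scenario: 10 iterations, 3 agents each re-reading
# ~200K tokens of codebase context, ~25K output tokens per iteration,
# at assumed rates of $15/MTok input and $75/MTok output.
cost = session_cost(10, 3, 200_000, 25_000, 15.0, 75.0)
# Lands around $109 with these assumptions — inside the $50-200 band.
```

The input side dominates: three agents repeatedly re-reading a large codebase is where the money goes, which is also why prompt caching (if supported for this workflow) would change the economics significantly.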
The evaluator is only as good as its criteria. Code review is subjective. The evaluator can catch bugs, type errors, and logical inconsistencies. It probably can’t catch the kind of architectural mistakes that only become obvious six months later when the codebase has grown. Self-improvement loops don’t fix bad requirements.
Integration with existing workflows is unclear. Does this plug into GitHub? Does it create PRs? Does it run your test suite? The announcement didn’t cover the developer experience in detail, and that matters as much as the architecture.
Zoom out and you can see the pattern. Opus 4.6 shipped with agent teams in Claude Code. The MCP protocol gave Claude the ability to use external tools. Now a first-party multi-agent coding system with GAN-inspired self-improvement.
Anthropic is building the full stack for autonomous AI development. Not just a smart model. Not just an IDE integration. A complete system where multiple AI agents plan, execute, review, and iterate on software — with humans providing oversight at checkpoints rather than at every keystroke.
That’s a direct challenge to Cognition’s Devin, GitHub’s Copilot Workspace, and Cursor’s background agents. But it’s also a challenge to every AI company that’s treated coding assistance as a single-agent problem. The argument Anthropic is making with this release: one agent writing code is assistance. Three agents collaborating is development.
Whether that argument holds up in practice depends on details we don’t have yet. But the architecture is sound, the self-improvement loop is the right idea, and the 300K batch token cap tells me Anthropic is serious about making this work at scale.
A few things I’m tracking: pricing, the general-availability timeline, and how the human review checkpoints actually work in practice. The autonomous AI coding race just got its most interesting entry, and I’ll be writing a hands-on review the moment I get access.
Last updated: April 5, 2026. Based on Anthropic’s April 4 announcements. Feature details subject to change before general availability.