GitHub Copilot Token Billing: What It Costs Now
At Microsoft Build 2026, Satya Nadella spent thirty seconds on Azure roadmap slides and then handed the stage to a model. The model is called Project Polaris. Starting August 2026, it replaces GPT-4 Turbo as GitHub Copilot’s default — for every Individual, Business, and Enterprise seat on the platform.
That’s the headline. The subtext is louder.
For four years, Copilot has been the most visible OpenAI deployment in the world. Every keystroke in VS Code that summoned a ghost-text suggestion was, underneath, a paid call to OpenAI’s API. The nearly 140,000 organizations on Copilot — as disclosed by Satya Nadella on the FY2026 Q3 earnings call — are running OpenAI infrastructure whether they signed an OpenAI contract or not. In August, that stops. Microsoft is pulling its highest-volume developer surface back inside its own stack, and the model doing the pulling is a mixture-of-experts architecture trained on Microsoft’s custom Maia silicon.
This is the same play Microsoft made on the Azure-only enterprise side when it shipped MAI-Image-2 and MAI-Voice-1 in April. Different surface. Same strategy. The OpenAI partnership isn’t ending — it’s getting decoupled, one product line at a time.
Quick Summary: Project Polaris at a Glance
| Detail | Info |
| --- | --- |
| What it is | Microsoft in-house mixture-of-experts coding model |
| What it replaces | GPT-4 Turbo as the default model behind GitHub Copilot |
| Default rollout date | August 2026 (all Copilot tiers) |
| Enterprise fallback | Optional 3-month GPT-4 Turbo rollback window |
| Pro tier context | Up to 100,000 lines of multi-file context |
| Pro tier feature | Autonomous test generation |
| Hardware | Microsoft custom Maia AI accelerators (Azure-only) |
| Reported benchmark wins | Outperforms GPT-4 Turbo on HumanEval and MBPP |
| Specialized sub-modules | Rust, Haskell, and other low-resource languages |
| Primary source | Microsoft Build 2026 live blog |

Bottom line: Microsoft cut the OpenAI cord on Copilot. For the enterprises already paying for it, nothing on the invoice changes in August. Everything underneath does. And the competitive math against Claude Code and Cursor just got harder to model.
On the Build 2026 main stage, Microsoft framed Polaris as “the first model Microsoft trained end-to-end for code.” The factual claims, in order of how much they matter:
Polaris is a mixture-of-experts architecture. Not a dense transformer. The relevant property is that inference activates only the subset of expert sub-networks that the input routing decides are relevant. For coding workloads, that means a Rust query doesn’t pay the compute tax of activating the Python expert weights, and vice versa. The architecture choice is the same family of decisions that produced DeepSeek V3 and Mistral’s Mixtral, but Microsoft trained the experts around language-specific gradients rather than generic capability gradients.
It outperforms GPT-4 Turbo on HumanEval and MBPP. Microsoft stated at the Build 2026 keynote that Polaris outperforms GPT-4 Turbo on both benchmarks, though the company did not release scores on harder benchmarks like SWE-Bench Verified or LiveCodeBench. Those omissions matter — HumanEval has been heavily saturated since 2024 and MBPP isn’t far behind. Beating GPT-4 Turbo on those tests in 2026 is the floor, not the ceiling.
Specialized sub-modules for low-resource languages. Polaris ships dedicated expert weights for Rust, Haskell, OCaml, Elixir, and Zig, the languages where stock GPT-4 Turbo consistently underperforms because training data is thin. Microsoft pointed to internal evaluations showing Rust completion quality improving by what it characterized as “double-digit percentage points” against GPT-4 Turbo, though no specific number was disclosed in the keynote.
Default model replacement in August 2026. Every Copilot seat gets Polaris as the default model. No price change. No new SKU. The model swap is happening underneath the existing contract. Enterprise customers get an optional three-month rollback window — they can pin their tenants to GPT-4 Turbo through November 2026 if they need time to validate the new model against internal codebases.
Pro tier upgrades. Copilot Pro and Pro+ subscribers get two features the free and Business tiers don’t: multi-file context up to 100,000 lines, and autonomous test generation that produces test files alongside implementation code. The 100K-line context is the headline number, but the autonomous test generation is the feature that changes daily workflow.
Maia silicon as the inference fabric. Polaris runs exclusively on Microsoft’s Maia AI accelerators inside Azure. The company has not disclosed inference cost per million tokens, but the operating implication is that the unit economics on Copilot just shifted from “OpenAI’s marginal cost plus Microsoft markup” to “Microsoft’s marginal cost on owned hardware.” That’s a different P&L conversation.
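The routing idea behind the mixture-of-experts claim above can be sketched in a few lines. This is a minimal illustration of generic top-k gating, not Microsoft's implementation: Polaris's router has not been published, and the expert names, the logits, and the `route_top_k` helper are all hypothetical. Real routers also decide per token and per layer rather than per file.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_scores, k=2):
    """Keep the k highest-scoring experts and renormalize their gate weights.

    router_scores: dict of expert name -> raw router logit (hypothetical values).
    Returns a dict of selected expert -> weight; the weights sum to 1.
    """
    names = list(router_scores)
    probs = softmax([router_scores[n] for n in names])
    ranked = sorted(zip(names, probs), key=lambda pair: pair[1], reverse=True)[:k]
    kept = sum(p for _, p in ranked)
    return {name: p / kept for name, p in ranked}


# Hypothetical router logits for a token taken from a Rust source file.
rust_token_logits = {"rust": 4.0, "systems": 2.5, "python": 0.1, "haskell": -1.0}
selected = route_top_k(rust_token_logits, k=2)

# Only the selected experts would run; the others are skipped entirely.
print(sorted(selected))
```

With these made-up logits, only the two highest-scoring experts receive the token. That is the property that keeps the Python expert's weights out of a Rust query's compute bill.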
The financial logic is the easy part. Copilot is the highest-volume coding inference workload on the public internet. Every default suggestion, every tab-completion, every chat turn was a paid API call to a partner that just raised at an $852B post-money valuation and is increasingly building enterprise products that compete with Microsoft directly.
Running that workload on first-party silicon, with a first-party model, recaptures the entire margin stack. Microsoft has been telegraphing this for the better part of a year, first with the MAI enterprise model launch in April, then with the gradual shift of Copilot Chat’s underlying model to “Microsoft-optimized” GPT variants. Polaris is the part where the strategy stops being subtle.
The harder question is why the timing is August 2026 rather than sooner or later.
A few plausible reads, each with evidence:
Enterprise renewal cycles. Most Copilot enterprise contracts come up for renewal between September and January. Putting Polaris in production by August means every renewal conversation in Q4 happens against the new product. Microsoft can sell “you’re already on it, the swap was painless” instead of “you should consider switching.”
Anthropic’s October IPO window. Anthropic is targeting an October 2026 IPO at the $900B Series H valuation. Every month between now and that window is a month Anthropic can use to land Claude Code seats inside enterprises Microsoft has spent four years acquiring. Shipping a credible first-party coding model before the Anthropic IPO is the strategic answer to that pressure. The window matters because the comp set the public market uses to price Anthropic includes Microsoft’s developer-tools revenue.
The Maia capacity story. Microsoft has been building out Maia-based inference clusters since late 2024. Until the silicon could actually serve a workload of Copilot’s scale, the model strategy was theoretical. Build 2026 is the first window where the chips, the model, and the GTM motion arrive at the same time.
None of these are mutually exclusive. They probably all contributed.
For most Copilot seats, August 1 will look like a Tuesday with a slightly different tab-completion experience.
That’s the practical answer.
The strategic answer is that Microsoft just changed the model risk profile of every enterprise that standardized on Copilot. Procurement teams that audited Copilot in 2024 or 2025 documented an OpenAI dependency in the vendor-risk worksheet. That dependency is now a Microsoft-on-Microsoft dependency, which most enterprises will read as less risky, not more. It’s also a competitive bind that Anthropic and Cursor have to attack from a worse position than they did a quarter ago.
A more concrete read of what changes:
Code completion quality. On the mainstream languages (Python, JavaScript, TypeScript, Java, C#), early benchmark numbers suggest Polaris is at parity with GPT-4 Turbo, maybe slightly ahead. For most enterprise engineering teams, the day-one experience won’t be noticeably different. On Rust and Haskell, the experience should be materially better; on the long tail of less common languages, results will vary by how aggressively Microsoft trained the relevant expert weights.
Multi-file context for Pro. This is the feature that closes the most visible gap against Cursor and Claude Code. A 100,000-line context window puts Copilot Pro in the same range as Cursor’s repo-aware indexing and Claude Code’s full-repo agentic mode. The implementation details matter (Cursor’s chunked retrieval versus Polaris’s full-context attention will produce different failure modes), but on the marketing slide, the gap closes.
Autonomous test generation. The Pro-tier feature where Polaris generates test files alongside implementation code is the workflow change that most engineers will actually feel. Test scaffolding is the highest-friction, lowest-creativity part of writing new functions. Autonomous test generation done well saves real time. Done poorly, it produces tests that pass without actually validating behavior — the failure mode every existing coding assistant has shipped at some point.
Latency. Maia silicon should reduce inference latency for Copilot calls served from Azure regions where Maia is deployed. Microsoft has not published latency numbers, but the operating advantage of running inference on owned hardware in the same region as the customer’s Azure tenant is real. For enterprise customers already on Azure, the inference path gets shorter.
The fallback window. The optional 3-month rollback to GPT-4 Turbo through November is the kind of detail that matters more than it reads. Most enterprise tenants will not use it. The teams that do — regulated industries, financial services, healthcare engineering — are the ones who will validate Polaris against internal code review benchmarks before flipping their seats over. Microsoft is signaling that it expects pushback from those segments and is willing to absorb the cost of running two inference fabrics in parallel for a quarter to keep them on the platform.
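For teams sizing a repository against the 100,000-line figure mentioned above, a rough line count is the first check. The sketch below assumes the limit is counted in raw source lines, which Microsoft has not confirmed. It also shows a greedy fallback that packs whole files into batches, a simple stand-in for chunked approaches. `count_lines` and `pack_files` are hypothetical helper names, not part of any Copilot API.

```python
CONTEXT_LINE_LIMIT = 100_000  # the Pro-tier figure quoted in the article


def count_lines(files):
    """files: dict of path -> source text. Returns the total line count."""
    return sum(len(text.splitlines()) for text in files.values())


def pack_files(files, limit=CONTEXT_LINE_LIMIT):
    """Greedily group whole files into batches that stay within `limit` lines.

    A single file larger than `limit` still gets a batch of its own,
    so callers should treat oversized batches as a signal to split further.
    """
    batches, current, used = [], [], 0
    for path, text in files.items():
        n = len(text.splitlines())
        if current and used + n > limit:
            batches.append(current)
            current, used = [], 0
        current.append(path)
        used += n
    if current:
        batches.append(current)
    return batches


# Tiny synthetic repo: 60 + 50 + 30 = 140 lines.
repo = {
    "a.rs": "fn a() {}\n" * 60,
    "b.rs": "fn b() {}\n" * 50,
    "c.rs": "fn c() {}\n" * 30,
}
print(count_lines(repo))            # 140
print(pack_files(repo, limit=100))  # [['a.rs'], ['b.rs', 'c.rs']]
```

A repository that fits in one batch is a candidate for the full-context path. One that needs several batches is where the chunked-versus-full-context failure modes start to diverge.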
Three months ago, the Copilot-versus-Claude-Code-versus-Cursor frame looked like this: Copilot is the incumbent with the best IDE integration; Claude Code is the agentic frontier with the strongest model; Cursor is the IDE-first product that pulled ahead on multi-file workflows. The Cursor vs Claude Code vs Copilot comparison walked through where each one wins and where each one struggles.
Polaris doesn’t reset that frame. It tightens it.
Against Cursor. Cursor’s structural advantage has been the IDE itself, a fork of VS Code rebuilt around agentic workflows. Polaris doesn’t change that. What it changes is the model that Cursor was beating in head-to-head completion tests. The new comparison is Cursor Composer 2 with Kimi K2.5 and frontier model routing versus Polaris running inside the user’s existing VS Code install. For developers who don’t want to leave their IDE, Polaris is now a credible enough default that the Cursor switching cost looks higher.
Against Claude Code. Claude Code’s structural advantage is the model — Opus 4.8 outperforms GPT-4 Turbo by wide margins on every coding benchmark that wasn’t already saturated. Polaris narrows that gap on HumanEval and MBPP but does not close it on harder evaluations like SWE-Bench. Where Polaris attacks Claude Code is on distribution. Anthropic has to convince an engineering org to install a new product, request a budget approval, and onboard developers. Microsoft just defaults Polaris into the seat the developer already has. That’s a different sale.
Against GitHub Copilot itself. This is the underrated angle. Copilot’s reputation through 2024 and 2025 was “the most reliable, least exciting coding assistant.” Polaris is Microsoft’s attempt to keep the reliability while finally being interesting. The 100K-line context and autonomous test generation are not features Copilot has had. Whether the execution lands is a Q4 evaluation question, but the product surface in August will be wider than the product surface in May.
The honest read is that Polaris doesn’t necessarily win any of the head-to-head fights. It makes losing them more expensive for the challengers, because Microsoft can absorb a few percentage points of benchmark loss with default-distribution economics that no challenger has access to.
A few things to watch.
Hard benchmarks are missing. Microsoft showcased HumanEval and MBPP wins. It did not show SWE-Bench Verified, SWE-Bench Pro, LiveCodeBench, or any of the multi-step coding evaluations that frontier model labs are competing on. The omission is loud. If Polaris matched or beat Claude Opus 4.8 on SWE-Bench, those numbers would be on the keynote slide. They weren’t.
The low-resource language claim is unverified. “Double-digit percentage points” on Rust against GPT-4 Turbo is a strong claim with no concrete number attached. The community will run independent benchmarks within weeks. Until the numbers land, treat the Rust/Haskell story as marketing rather than capability.
Multi-file context at 100K lines is hard. Long-context coding workflows fail in ways that are hard to detect — the model returns a confident answer that ignores a critical file the user didn’t realize was relevant. Cursor and Claude Code have both shipped long-context features and both have rolled back portions of them after enterprise customers reported regressions. Polaris is shipping its first 100K-line implementation into production seats. Expect issues.
Autonomous test generation is the riskiest new feature. Generated tests that pass without exercising the actual behavior are worse than no tests. They produce false confidence and slow down incident response when the production bug ships. The Pro-tier feature will be valuable to teams with mature test-review culture and a liability for teams without it.
The Azure-only inference fabric. Polaris runs only on Maia silicon inside Azure. For enterprise customers who run multi-cloud, that constrains where Copilot calls can be served. The latency wins disappear if your tenant is in a region where Maia hasn’t been deployed yet.
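The test-generation risk flagged above is easier to see side by side. The example below is a hand-written illustration, not output from Polaris or any other assistant: `apply_discount`, its buggy twin, and both tests are hypothetical. It shows a test that passes for a broken implementation next to one that pins the expected value.

```python
def apply_discount(price, percent):
    """Return price reduced by percent (20 means 20% off)."""
    return price * (1 - percent / 100)


def buggy_apply_discount(price, percent):
    """Same signature, wrong behavior: forgets to divide percent by 100."""
    return price * (1 - percent)


def vacuous_test(fn):
    """Checks only the return type, so almost any implementation passes."""
    result = fn(100.0, 20)
    assert isinstance(result, float)


def behavioral_test(fn):
    """Pins expected values, so the buggy implementation fails."""
    assert abs(fn(100.0, 20) - 80.0) < 1e-9
    assert abs(fn(50.0, 0) - 50.0) < 1e-9


def passes(test, fn):
    """Run a test against an implementation and report pass or fail."""
    try:
        test(fn)
        return True
    except AssertionError:
        return False


print(passes(vacuous_test, buggy_apply_discount))     # True: the bug slips through
print(passes(behavioral_test, buggy_apply_discount))  # False: the bug is caught
print(passes(behavioral_test, apply_discount))        # True
```

A review checklist for generated tests can start from this distinction: every test should fail against at least one plausible wrong implementation.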
For Anthropic, Polaris is a defensive move by Microsoft against Anthropic’s strongest growth engine. Every enterprise that defaults to Polaris on August 1 is a seat that didn’t move to Claude Code. The good news for Anthropic is that the developers who care most about model quality are the segment most likely to install Claude Code regardless of what Copilot ships. The bad news is that segment is much smaller than the segment of developers who use whatever their IT department defaults them to.
The strategic question for Anthropic between now and the October IPO is whether to respond with pricing, with capability, or with both. The Claude Opus 4.8 release and its Fast Mode pricing cut suggest the answer is “both.” The next data point is whether Sonnet 4.8 ships in time to absorb the developers Polaris does not convince.
For OpenAI, the read is more uncomfortable. The Copilot integration was OpenAI’s largest single enterprise deployment by inference volume. Losing the default seat there is a revenue hit and a strategic signal — every other Microsoft surface that runs on OpenAI is now visibly on the table for the same treatment. The Office Copilot and Bing Chat surfaces are the obvious next candidates, and the Microsoft Agent 365 GA earlier this quarter already shipped first-party orchestration that doesn’t require GPT.
The relationship isn’t ending. The exclusivity is. Microsoft is operating Copilot as a multi-model platform with a first-party default, and the partnership terms written with OpenAI in 2023 don’t survive that architecture intact.
The Polaris announcement is the most consequential Microsoft AI decision since the original OpenAI partnership. Not because the model is necessarily better than what it replaces — the benchmarks Microsoft published are incomplete, the harder evaluations are missing, and the autonomous test generation feature is going to ship some embarrassing failures before it stabilizes. The decision is consequential because of what it says about distribution economics in AI.
The lesson Microsoft just demonstrated, and that every AI company will internalize over the next twelve months, is that owning the default seat beats owning the better model. Claude Opus 4.8 outperforms GPT-4 Turbo on most coding benchmarks. Polaris probably doesn’t outperform Opus 4.8 on most of them. None of that matters at the seat level, because the developer who opens VS Code tomorrow gets the model their employer’s IT team approved, not the model that won the benchmark contest.
For the conversation about the best AI coding assistants in 2026, that means the answer increasingly depends on who’s asking. For an individual developer choosing their own tools, Claude Code and Cursor remain the better products. For an engineering org choosing what to deploy across 5,000 seats, the gravitational pull toward whatever ships as a default just got stronger.
For Anthropic specifically, the August 2026 timing is a problem. The October IPO needs the Claude Code curve to stay vertical through Q3. Every enterprise that defaults to Polaris in August is one fewer enterprise that signs a new Claude Code contract in September. The math doesn’t have to break — Anthropic’s customer base skews toward engineering orgs that care about model quality more than distribution convenience — but it gets harder.
The part of this story to watch over the next ninety days isn’t the model benchmarks. It’s whether Microsoft’s rollout actually lands without a quality regression that lets the Anthropic and Cursor sales teams say “we told you so.” The 3-month enterprise fallback window exists precisely because Microsoft is bracing for that possibility. If the fallback gets used heavily, Polaris is a marketing event. If it doesn’t, Polaris is the start of a market structure shift.
Project Polaris is Microsoft’s in-house mixture-of-experts coding model. It is replacing GPT-4 Turbo as the default model behind GitHub Copilot starting August 2026. The architecture activates only the relevant expert sub-networks per query, with specialized experts trained for low-resource languages like Rust and Haskell.
The default switch happens in August 2026. The replacement covers every Copilot tier: Individual, Business, and Enterprise. Enterprise customers have an optional 3-month fallback window that pins their tenant to GPT-4 Turbo through November 2026 if they need time to validate the new model.
Copilot pricing does not change. Microsoft has not announced any pricing changes tied to the Polaris rollout. The current Copilot pricing ($10 per month for Individual, $19 per month for Business, $39 per month for Enterprise) stays in place. The Pro tier features (100K-line context, autonomous test generation) are part of the existing Pro plan, not a new SKU.
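For budgeting, the per-seat prices quoted above translate into annual spend with simple arithmetic. The sketch below uses the article's list prices and the 5,000-seat org size used earlier as an example. It assumes flat list pricing with no volume discounts, which real enterprise contracts may not follow, and `annual_cost` is a hypothetical helper.

```python
# Monthly list prices per seat in USD, as quoted in the article.
MONTHLY_PRICE_USD = {"Individual": 10, "Business": 19, "Enterprise": 39}


def annual_cost(tier, seats):
    """Annual list-price spend for `seats` seats on `tier` (no discounts assumed)."""
    return MONTHLY_PRICE_USD[tier] * seats * 12


print(annual_cost("Business", 5000))    # 1140000
print(annual_cost("Enterprise", 5000))  # 2340000
```

Because the Polaris swap carries no price change, these figures are the same before and after August. What changes is the model the spend buys.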
On published benchmarks (HumanEval, MBPP), Polaris beats GPT-4 Turbo. Microsoft did not release scores on harder evaluations like SWE-Bench, which is where Claude Opus 4.8 currently leads. The competitive read is that Polaris narrows the model-quality gap but doesn’t close it; its primary advantage is distribution — every Copilot seat defaults to Polaris in August without any user action required.
Polaris runs exclusively on Microsoft’s Maia AI accelerators inside Azure. That’s Microsoft’s first-party AI silicon, designed for inference workloads at scale. Running Polaris on Maia recaptures the margin Microsoft was previously paying to OpenAI for GPT-4 Turbo inference.
Copilot Pro and Pro+ subscribers get multi-file context up to 100,000 lines of code per request, which puts Polaris in the same range as Cursor’s repo-aware indexing and Claude Code’s full-repo agentic mode. The feature lets the model reason across an entire mid-sized codebase rather than just the open file.
Most enterprises won’t need the fallback window. It matters most for regulated industries (financial services, healthcare, defense), where any model swap requires internal validation against existing code review processes. For typical enterprise engineering teams, the August default switch will work without intervention.
The OpenAI partnership is not ending; it continues. What changes is the exclusivity. Microsoft is now operating a multi-model strategy on its developer surfaces, with first-party Polaris as the Copilot default. OpenAI models remain available on Azure, and other Microsoft surfaces (Office Copilot, Bing Chat) still rely heavily on GPT. Treat this as a decoupling of one product line, not a divorce.
The vendor-risk profile shifts from “Microsoft platform with OpenAI model dependency” to “Microsoft platform with Microsoft model dependency.” Most procurement teams will read that as a net reduction in risk because the supply chain is shorter. The remaining risk is that Polaris has not been deployed at Copilot scale yet, and any model swap of this magnitude carries some chance of quality regression in the first quarter post-launch.
Last updated: June 2, 2026. Sources: Microsoft Build 2026 live blog · GitHub Copilot product page · Microsoft Maia chip announcement · HumanEval benchmark · MBPP benchmark.
Related reading: Microsoft MAI Models: The OpenAI Bet Hedge · Cursor vs Claude Code vs Copilot · Claude Code Routines Enterprise Guide · GitHub Copilot X Review · Claude Opus 4.8 Review · Anthropic Tops $900B Valuation · Microsoft Agent 365 GA · Cursor Composer 2 Kimi K2.5 Disclosure · Best AI Coding Assistants 2026