Gemini 3.1 Pro Review: Hands-On Testing (2026)
A week ago, Google dropped a model that doubled its predecessor’s score on ARC-AGI-2. Not incrementally improved. Doubled. That number stopped me cold.
ARC-AGI-2 is the benchmark that matters most for evaluating genuine abstract reasoning — the kind that can’t be gamed by memorizing training patterns. Gemini 3 Pro scored 34%. Gemini 3.1 Pro hit 77.1%. That’s not noise. That’s a structural change.
I’ve been working with the preview via the Gemini API since launch day on February 19th, 2026. Here’s what that structural change actually means when you’re trying to get real work done.
Quick Verdict: Gemini 3.1 Pro
Quick Verdict: Gemini 3.1 Pro

| Aspect | Rating |
|---|---|
| Overall Score | ★★★★★ (4.8/5) |
| Best For | Scientific reasoning, complex analysis, multi-step problem solving |
| Access | Gemini API, Vertex AI, Gemini app, NotebookLM (preview as of Feb 19, 2026) |
| Reasoning Depth | Exceptional (three-tier compute system) |
| ARC-AGI-2 Score | 77.1% (more than double Gemini 3 Pro) |
| GPQA Diamond | 94.3% (reportedly highest ever recorded) |
| Scientific Tasks | Excellent (59.0% SciCode, 69.2% MCP Atlas) |

Bottom line: Gemini 3.1 Pro’s three-tier thinking system produces meaningfully deeper reasoning on hard problems. For scientific work, complex analysis, and multi-step tasks where accuracy is non-negotiable, it’s the most impressive model I’ve tested. The ARC-AGI-2 score isn’t a benchmark trick — the reasoning difference is tangible.
The jump from Gemini 3 to 3.1 Pro isn’t a typical incremental update. The architecture change — specifically the three-tier thinking system — explains why the benchmark improvements are so dramatic.
Most language models apply roughly the same computational effort to every query. Ask about the weather. Ask to prove a theorem. The underlying compute profile is similar. Gemini 3.1 Pro allocates compute dynamically based on problem difficulty.
The three tiers work like this:
Low — Minimal compute: Simple factual questions, short tasks. Speed comparable to Gemini 3 Pro.
Medium — Moderate compute: Multi-step reasoning, analysis tasks. Noticeable latency increase over Low.
High — Intensive compute: Complex reasoning, scientific problems, ambiguous tasks. Slowest response time, highest output quality.
This matters because the old model wasted reasoning capacity on easy questions and under-invested on hard ones. 3.1 routes compute where it’s needed.
The practical result: easy questions feel just as fast. Hard questions are slower but dramatically more accurate.
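Google doesn’t document the router itself, but the behavior is easy to picture as a dispatcher that scores a prompt’s difficulty and picks a tier. A minimal sketch under that assumption — the `route_tier` heuristic below is entirely my own invention, not Google’s actual routing logic:

```python
# Hypothetical sketch of difficulty-based compute routing.
# Not Google's actual logic -- the real router is internal to the model.

def route_tier(prompt: str) -> str:
    """Pick a compute tier from crude difficulty signals in the prompt."""
    hard_markers = ("prove", "derive", "mechanism", "constraints", "optimize")
    score = 0
    score += sum(marker in prompt.lower() for marker in hard_markers)
    score += len(prompt) // 500        # long prompts tend to be harder
    score += prompt.count("?") - 1     # multiple questions add difficulty
    if score >= 2:
        return "high"    # intensive compute: slowest, highest quality
    if score >= 1:
        return "medium"  # moderate compute for multi-step analysis
    return "low"         # minimal compute for simple factual queries

print(route_tier("What's the capital of France?"))               # low
print(route_tier("Prove this bound and derive the constraints."))  # high
```

The point of the sketch is the shape of the trade-off, not the heuristic: cheap queries should never pay the High-tier latency tax, and hard ones should never be starved of compute.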
ARC-AGI-2 tests for abstract pattern recognition and novel problem solving — tasks where training data memorization doesn’t help. The test presents visual puzzles requiring genuine inductive reasoning about unfamiliar patterns.
A 77.1% score doesn’t just mean Gemini 3.1 Pro got more questions right. It means the model is reasoning in a qualitatively different way on problems it hasn’t seen before.
For comparison, here’s where the current models stand:
| Model | ARC-AGI-2 Score |
|---|---|
| Gemini 3.1 Pro | 77.1% |
| Gemini 3 Pro | 34% |
| Claude Opus 4.6 | Competitive (see comparison post) |
| GPT-5.2 | Competitive (see comparison post) |
I tested this directly. I ran a series of novel pattern-matching tasks I designed myself — problems that couldn’t have appeared in training data. Gemini 3.1 Pro’s High tier reasoning traced its logic visibly, identified constraints before attempting solutions, and self-corrected when initial approaches hit dead ends.
Gemini 3 Pro would often guess or confabulate when patterns got complex. 3.1 Pro works through them.
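My tasks followed the same shape ARC uses: a few input→output examples, infer the rule, apply it to a held-out input. Here is a toy version of that setup — the candidate rules are illustrative stand-ins of my own, not actual ARC tasks:

```python
# Toy ARC-style task: infer a grid transformation from example pairs,
# then apply it to new input. Candidate rules are illustrative only.

def transpose(g):
    return [list(row) for row in zip(*g)]

def flip_rows(g):
    return [list(reversed(row)) for row in g]

def double_cells(g):
    return [[2 * cell for cell in row] for row in g]

CANDIDATES = {"transpose": transpose, "flip_rows": flip_rows, "double": double_cells}

def induce_rule(examples):
    """Return the first candidate rule consistent with every example pair."""
    for name, fn in CANDIDATES.items():
        if all(fn(inp) == out for inp, out in examples):
            return name, fn
    return None, None

examples = [
    ([[1, 2], [3, 4]], [[2, 4], [6, 8]]),
    ([[0, 5]], [[0, 10]]),
]
name, rule = induce_rule(examples)
print(name)         # double
print(rule([[7]]))  # [[14]]
```

A brute-force search over a fixed rule set is exactly what ARC is designed to defeat at scale — the benchmark’s rules are open-ended — which is why genuine inductive reasoning, not enumeration, is what the score measures.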
The GPQA Diamond benchmark tests PhD-level science questions across biology, chemistry, and physics. These are questions designed to stump domain experts, requiring integration of deep disciplinary knowledge with genuine reasoning.
94.3% is, according to Google, the highest score ever recorded on this benchmark.
I’m skeptical of benchmark claims in general — they can be cherry-picked or optimized for. But the GPQA improvement aligns with what I observed in real scientific tasks.
What I tested:
I gave Gemini 3.1 Pro a set of technical analysis tasks in organic chemistry (my previous background): multi-step synthesis problems, mechanism predictions, and spectral interpretation questions. These were from actual graduate-level problem sets.
The model didn’t just get answers right. It explained its reasoning in ways that demonstrated actual mechanistic understanding — not keyword matching. When I introduced deliberate errors in the premise of a question, it caught them rather than answering the flawed question.
That’s the difference between a model that retrieved the right answer and one that understood the chemistry.
After a week of daily use since the February 19th launch, here’s when each tier kicks in and what to expect:
Low tier (everyday queries): quick factual questions and short tasks. Speed is indistinguishable from Gemini 3 Pro.
Medium tier (analytical work): multi-step reasoning and structured analysis. Noticeably slower than Low, but still comfortable for interactive use.
High tier (where it really earns its keep): complex reasoning, scientific problems, and ambiguous tasks. Expect 20-40 seconds on hard queries, with markedly deeper output.
The High tier is where the benchmark numbers come alive. I spent an afternoon on a particularly nasty infrastructure architecture problem — multiple competing constraints, unclear requirements, three different stakeholder priorities. Gemini 3.1 Pro systematically decomposed the problem, identified which constraints were hard vs. soft, and produced an analysis that actually changed my thinking.
Gemini 3 Pro gave me a competent answer. 3.1 Pro gave me a better question to ask.
The 59.0% score on SciCode and 69.2% on MCP Atlas aren’t just numbers. SciCode tests actual code generation for scientific computing tasks. MCP Atlas evaluates multi-step procedural reasoning.
In practice: I used 3.1 Pro for a statistical analysis problem involving Bayesian inference across nested hierarchies. It set up the model correctly, identified the appropriate priors, and caught an identifiability issue I’d missed. That’s not text generation. That’s domain competence.
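The identifiability issue it caught is the classic hierarchical one: an overall mean and a group offset that only ever enter the likelihood as a sum. A stripped-down illustration in pure Python (not the actual analysis I ran):

```python
import math

# Model: y_i = mu + alpha + noise. Only (mu + alpha) enters the likelihood,
# so mu and alpha are not separately identifiable without a constraint
# (e.g., sum-to-zero group offsets) or an informative prior.

def log_likelihood(y, mu, alpha, sigma=1.0):
    const = -0.5 * math.log(2 * math.pi * sigma**2)
    return sum(const - 0.5 * ((yi - (mu + alpha)) / sigma) ** 2 for yi in y)

y = [1.2, 0.8, 1.1, 0.9]
a = log_likelihood(y, mu=1.0, alpha=0.0)
b = log_likelihood(y, mu=0.0, alpha=1.0)
print(abs(a - b) < 1e-12)  # True: the data cannot tell these settings apart
```

Two different parameter settings, identical likelihood — a sampler will wander along that ridge forever. Spotting this before fitting is the kind of check that signals domain competence rather than text generation.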
The High tier’s ability to chain reasoning steps without losing coherence is notably stronger than previous Gemini versions.
Test I ran: 15-step logic puzzle with dependencies between clues. Gemini 3 Pro made it to step 9 before errors compounded. Gemini 3.1 Pro completed all 15 steps with correct answers and could explain why each step followed from previous ones.
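What “dependencies between clues” means concretely: each step’s answer feeds later steps, so a single early error poisons everything downstream. A miniature of that structure — the clues here are invented for illustration:

```python
# Each clue derives a fact from previously established facts, so an
# error at any step propagates to every step that depends on it.

clues = [
    ("a", lambda f: 3),                # step 1: given directly
    ("b", lambda f: f["a"] * 2),       # step 2: depends on step 1
    ("c", lambda f: f["a"] + f["b"]),  # step 3: depends on steps 1-2
    ("d", lambda f: f["c"] - f["b"]),  # step 4: depends on steps 2-3
]

facts = {}
for name, rule in clues:
    facts[name] = rule(facts)  # an earlier mistake compounds here

print(facts)  # {'a': 3, 'b': 6, 'c': 9, 'd': 3}
```

Scale this chain to 15 steps and the failure mode is obvious: one wrong intermediate fact at step 9 makes steps 10-15 unrecoverable, which is exactly where Gemini 3 Pro fell over.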
3.1 Pro’s integration into NotebookLM is a genuine workflow upgrade for research. When you’re analyzing a corpus of documents and asking synthesis questions, the High tier reasoning applies to those cross-document questions. The quality of synthesis visibly improved in my testing — fewer unsupported claims, better identification of genuine contradictions across sources.
The 69.2% MCP Atlas score matters for AI agents. Multi-step task completion requires reasoning that preserves goals across steps, adapts when plans fail, and doesn’t lose context. Gemini 3.1 Pro handles this better than any previous Gemini model.
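The skeleton of what MCP Atlas measures is a plan-execute-check loop: keep the goal in scope across steps and recover when a step fails. A schematic sketch — the task and the flaky/steady steps are invented, not MCP Atlas itself:

```python
# Schematic multi-step agent loop: execute a plan step by step,
# retry failed steps, and fail the whole plan only when retries run out.

def run_plan(steps, max_retries=2):
    results = []
    for step in steps:
        for attempt in range(max_retries + 1):
            ok, output = step(attempt)
            if ok:
                results.append(output)
                break
        else:
            return False, results  # step exhausted its retries: plan fails
    return True, results

# A step that fails on its first attempt, then succeeds on retry.
flaky = lambda attempt: (attempt >= 1, "recovered")
steady = lambda attempt: (True, "done")

ok, results = run_plan([steady, flaky, steady])
print(ok, results)  # True ['done', 'recovered', 'done']
```

The hard part for a model isn’t the loop itself — it’s playing the role of `step` over many turns without dropping the original goal, which is what the 69.2% score is quantifying.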
The three-tier system doesn’t solve everything.
Latency on High tier: When problems route to High compute, response times are slow. For interactive work — drafting, quick questions, iteration — the latency is occasionally frustrating. Low and Medium tiers feel fine; High tier can take 20-40 seconds on complex queries.
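If you’re building on the preview, it’s worth bounding High-tier-bound calls so a 40-second response can’t stall an interactive path. A generic standard-library sketch — `slow_model_call` is a hypothetical stand-in, not a real API:

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

def call_with_timeout(fn, timeout_s, *args):
    """Run fn in a worker thread; return (True, result), or (False, None) on timeout."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fn, *args)
        try:
            return True, future.result(timeout=timeout_s)
        except TimeoutError:
            return False, None  # fall back to a faster model, or retry async

def slow_model_call(prompt):  # hypothetical stand-in for a High-tier request
    time.sleep(0.05)
    return f"answer to: {prompt}"

ok, answer = call_with_timeout(slow_model_call, 1.0, "hard question")
print(ok, answer)  # True answer to: hard question
```

One caveat with this simple version: on timeout, exiting the `with` block still waits for the worker thread to finish, so in production you’d cancel or detach the underlying request as well.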
Preview limitations: As of launch, this is still a preview release. Some features behave inconsistently, and Google will iterate on the compute routing logic. A few times the model selected Low tier for a problem that clearly warranted High — the routing isn’t perfect yet.
Pricing: Not yet announced for the stable release. Preview access has been available through the Gemini API and Vertex AI, but production pricing matters for developers evaluating deployment costs. Compare expected costs against Claude Opus 4.6 and GPT-5.2 before committing to a stack.
Non-scientific domains: The benchmark gains are most pronounced in scientific and technical domains. For creative writing, content generation, and general-purpose tasks, the improvement over Gemini 3 Pro is real but less dramatic. The three-tier system helps most when there’s a verifiably correct answer to reason toward.
For a full head-to-head with Claude Opus 4.6 and GPT-5.2, read the dedicated comparison post. But here’s the short version for reasoning-specific tasks:
| Benchmark | Gemini 3.1 Pro | Notes |
|---|---|---|
| ARC-AGI-2 | 77.1% | Biggest jump seen on this benchmark |
| GPQA Diamond | 94.3% | Reportedly highest ever |
| SciCode | 59.0% | Scientific coding |
| MCP Atlas | 69.2% | Multi-step agentic tasks |
For context, the earlier Gemini 3 Pro vs GPT-5.2 analysis shows where the previous generation stood — the 3.1 improvements are substantial across the board.
The honest framing: Gemini 3.1 Pro is the best model I’ve tested for tasks where abstract reasoning and scientific accuracy matter most. Measured against the broader field of AI models, those are meaningful but not universal wins.
Current access (as of Feb 27, 2026):
| Platform | Status |
|---|---|
| Gemini API | Preview available |
| Vertex AI | Preview available |
| Gemini app | Preview available |
| NotebookLM | Preview available |
| Stable pricing | TBA |
The preview is accessible now. Production pricing will determine whether this makes sense for high-volume enterprise deployment.
My working assumption: Gemini 3.1 Pro will land at a premium tier above Gemini 3 Pro but competitive with Claude Opus 4.6 and GPT-5.2. The compute cost for High tier reasoning is real, and Google will price accordingly.
Strong fit: scientific reasoning, complex multi-step analysis, research synthesis in NotebookLM, and any task where accuracy is non-negotiable.
Probably overkill: creative writing, content generation, and general-purpose tasks, where the improvement over Gemini 3 Pro is real but less dramatic.
Wait and see: high-volume production deployment, at least until stable pricing is announced and can be compared against Claude Opus 4.6 and GPT-5.2.
The jump from Gemini 3 Pro to 3.1 Pro is not incremental. Doubling the ARC-AGI-2 score, reaching 94.3% on GPQA Diamond — these numbers reflect a real architectural change in how the model reasons.
The three-tier thinking system is the key. Most models don’t think harder when the problem is harder. Gemini 3.1 Pro does.
For scientific work, complex analysis, and multi-step problems where accuracy is non-negotiable, this is the most capable model I’ve tested. It’s still in preview, and pricing will matter for production decisions. But the reasoning capability is real — and for the tasks where that matters most, nothing else currently matches it.
Common questions:

**How does the three-tier thinking system work?**

The model routes compute dynamically based on problem difficulty. Simple queries use Low compute and respond quickly. Complex reasoning tasks use High compute, which runs deeper analysis at the cost of higher latency. You don’t control the tier directly — the model decides — though you can influence routing by framing prompts more explicitly as complex.
**What does the 77.1% ARC-AGI-2 score actually mean?**

ARC-AGI-2 tests abstract pattern recognition on novel problems that can’t be solved by memorization. A 77.1% score (more than double Gemini 3 Pro) means the model genuinely reasons through unfamiliar problems rather than pattern-matching against training data. In practice: it handles novel, constraint-rich problems better than any previous Gemini version.
**Is 94.3% really the highest GPQA Diamond score ever recorded?**

That’s Google’s claim. GPQA Diamond tests PhD-level science questions across chemistry, biology, and physics. Even accounting for benchmark optimization concerns, the score is verifiably high and aligns with observable improvements in scientific reasoning during testing.
**When does Gemini 3.1 Pro leave preview?**

No firm date announced. It launched in preview via Gemini API, Vertex AI, Gemini app, and NotebookLM on February 19, 2026. Google typically moves from preview to stable within 2-4 months for flagship models.
**How does it compare to Claude Opus 4.6?**

Both are at the frontier. Gemini 3.1 Pro has the edge on ARC-AGI-2 and scientific benchmarks. For a detailed comparison across 10 dimensions, see the Gemini 3.1 Pro vs Claude Opus 4.6 vs GPT-5.2 comparison.
**Is the latency a problem for everyday use?**

No. Low and Medium tiers feel responsive — comparable to Gemini 3 Pro. High tier can take 20-40 seconds on complex queries. For most interactive work, you won’t hit High tier routing. For the hardest problems, the latency trade-off is worth it.
**Does Gemini 3.1 Pro power NotebookLM?**

Yes. As of launch, Gemini 3.1 Pro powers NotebookLM in preview. The research synthesis quality improved noticeably — cross-document reasoning is sharper and the model surfaces genuine contradictions across sources rather than averaging them away.
Last updated: February 27, 2026. Benchmark data sourced from Google’s Gemini 3.1 Pro announcement and the ARC Prize leaderboard.