Qwen 3.5 Review: Hands-On Testing (2026)
Alibaba dropped Qwen3.5-397B-A17B on February 16, 2026. Within a week it was scoring higher than GPT-5.2 on instruction-following benchmarks. Within two weeks, 130,000 derivative models had appeared on HuggingFace. And the whole thing is Apache 2.0: free to download, fine-tune, and deploy commercially.
The honest question isn't whether Qwen 3.5 is impressive. It clearly is. The question is whether it's impressive enough to replace what you're already using, or worth running locally instead of paying for API access.
Quick Verdict: Qwen3.5-397B-A17B
| Aspect | Rating |
|---|---|
| Overall Score | ★★★★☆ (4.2/5) |
| Best For | Instruction following, multilingual tasks, cost-sensitive API workloads, local self-hosting |
| API Pricing | ~$0.40/1M input, ~$1.20/1M output (Qwen3.5-Plus via Alibaba API) |
| License | Apache 2.0 (fully open-weight) |
| Context Window | 256K native, 1M via API (Qwen3.5-Plus) |
| IFBench (Instruction Following) | 76.5, best of any current model |
| AIME 2026 (Math) | 91.3, strong but behind GPT-5.2 (96.7) |
| SWE-bench Verified (Coding) | 76.4, behind Claude Opus 4.6 (80.9) |

Bottom line: Qwen 3.5 doesn't win every benchmark, but it wins the one that matters most for real-world tasks, and does it at a cost that makes API-reliant projects suddenly viable. If you're paying Claude Opus 4.6 rates for instruction-heavy workloads, you're likely overpaying by 12x.
Most AI models are dense transformers: every parameter activates on every token, every time. Qwen3.5-397B-A17B uses sparse Mixture-of-Experts (MoE) instead. The model has 397 billion total parameters, but only 17 billion activate per forward pass.
That's not a marketing trick. It means the compute per token is closer to a 17B model than a 400B one. The practical results: inference costs 60% less than the equivalent Qwen3-Max, and decoding runs 8.6x faster at 32K context and 19x faster at 256K context.
The other architectural choice worth knowing: Gated Delta Networks replace standard attention in 75% of layers. Combined with early-fusion multimodal training (text, images, video processed together from the start), the model handles vision tasks without the limitations you see in models that added vision as an afterthought.
This isn't the first MoE frontier model: Mixtral pioneered the approach, and DeepSeek V4 pushed it further (see our DeepSeek V4 review for the architecture comparison). Qwen 3.5 refines the recipe and delivers it under a fully permissive license.
The marketing copy says Qwen 3.5 "beats GPT-5.2 and Claude Opus 4.5 on 80% of categories." That framing is technically accurate and somewhat misleading. Here's the honest picture:
Instruction following is the genuine standout. The IFBench score of 76.5 is the highest of any current model; GPT-5.2 sits at 75.4, and Claude benchmarks significantly lower at 58.0. MultiChallenge (multi-turn instruction fidelity) follows the same pattern: 67.6 for Qwen 3.5 vs GPT-5.2's 57.9.
This matters for real-world use more than most benchmark discussions acknowledge. Complex, multi-step prompts (generate a table in this format, using these constraints, following these rules) are exactly what breaks models in production. Qwen 3.5 handles them with less drift than anything else currently available.
Math reasoning is strong. The AIME 2026 score of 91.3 puts it in the top tier, behind GPT-5.2 (96.7) and Claude Opus 4.6 (93.3) but ahead of every open-weight predecessor. Its HMMT Feb 2025 score of 94.8 is closer to the leaders.
Multilingual capability is a genuine differentiator. Support for 201 languages and dialects, up from 119 in the previous generation. For teams doing international deployments or localization work, this isn't a footnote.
SWE-bench Verified (real-world coding) tells a different story. Qwen 3.5 scores 76.4 against Claude Opus 4.6's 80.9 and GPT-5.2's 80.0. That's a meaningful gap if software engineering agents are your primary use case. The best AI coding assistants guide covers this breakdown in full.
Math at the extreme end still favors closed models. The AIME gap between Qwen (91.3) and GPT-5.2 (96.7) is 5.4 points: not huge, but consistent with a pattern where open-weight models still trail at the hardest reasoning tasks.
Computer use isn't its primary focus. Unlike GPT-5.4's 75% OSWorld score, Qwen 3.5 benchmarks don't prominently feature GUI navigation. For agentic workflows requiring real desktop automation, check the AI models compared 2026 overview.
Here's the part that matters most if you're running AI at any volume.
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Multiple vs Qwen 3.5 |
|---|---|---|---|
| Qwen3.5-Plus | $0.40 | $1.20 | 1x (baseline) |
| GPT-5.2 | ~$2.50 | ~$15.00 | ~6x input / ~12x output |
| Claude Opus 4.6 | $5.00 | $25.00 | 12.5x input / ~21x output |
| Gemini 3.1 Pro | ~$4.00 | ~$12.00 | ~10x input / ~10x output |
At $0.40 per 1M input tokens, Qwen3.5-Plus is 12.5x cheaper than Claude Opus 4.6 on input and roughly 21x cheaper on output. A 1M-token input request that costs $0.40 via the Qwen API costs $5.00 via Opus 4.6.
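Those multiples are simple ratios over the published per-token rates. A quick sketch, with the rates hard-coded from the comparison table above (treat them as a snapshot, not live pricing):

```python
# USD per 1M tokens (input, output), copied from the comparison table above.
PRICES = {
    "qwen3.5-plus": (0.40, 1.20),
    "gpt-5.2": (2.50, 15.00),
    "claude-opus-4.6": (5.00, 25.00),
    "gemini-3.1-pro": (4.00, 12.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """USD cost of one request at the listed per-million rates."""
    in_rate, out_rate = PRICES[model]
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

# A 1M-token input request with a short 2K-token answer:
qwen = request_cost("qwen3.5-plus", 1_000_000, 2_000)
opus = request_cost("claude-opus-4.6", 1_000_000, 2_000)
print(f"Qwen: ${qwen:.2f}  Opus: ${opus:.2f}  ({opus / qwen:.1f}x)")
# Qwen: $0.40  Opus: $5.05  (12.5x)
```

The output-token rate dominates for generation-heavy workloads, which is why the effective multiple on real traffic often lands above the input-only ratio.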
For applications where instruction following is the bottleneck (structured data extraction, document processing with complex output requirements, classification with detailed criteria), Qwen 3.5 delivers the best benchmark numbers at the lowest cost. That's not a typical combination.
The self-hosted calculation is even more extreme. Under Apache 2.0, you download the weights, run them on your own hardware, and pay zero per-token costs. One-time infrastructure investment, unlimited usage.
This is where Qwen 3.5's MoE architecture works in favor of individual users. Because only 17B parameters activate per forward pass, the actual compute requirement is far lower than the 397B parameter count suggests.
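A back-of-envelope way to see why, using the common ~2 FLOPs-per-parameter-per-token rule of thumb for a forward pass (an approximation that ignores attention and router overhead):

```python
# Back-of-envelope: per-token forward-pass compute scales with the *active*
# parameter count, at roughly 2 FLOPs per parameter per token (multiply + add).
TOTAL_PARAMS = 397e9   # Qwen3.5-397B-A17B: total parameters
ACTIVE_PARAMS = 17e9   # parameters activated per forward pass (MoE routing)

flops_dense_style = 2 * TOTAL_PARAMS   # if every parameter fired, dense-style
flops_moe = 2 * ACTIVE_PARAMS          # what the sparse MoE actually spends

print(f"active fraction: {ACTIVE_PARAMS / TOTAL_PARAMS:.1%}")   # 4.3%
print(f"compute saving:  {flops_dense_style / flops_moe:.1f}x per token")
```

Memory is the caveat: all 397B parameters still have to live somewhere (RAM, VRAM, or disk via offloading), even though only ~4% of them do work on any given token.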
The full 397B model: 807GB on disk. Quantized versions from Unsloth range from 94GB (1-bit) to 462GB (Q8). You need either a multi-GPU server setup or a machine with large unified memory. Reported performance: 25+ tokens/second on a single 24GB GPU with 256GB system RAM using MoE offloading.
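A rough way to reason about which build fits a given machine, using only the disk sizes quoted above. The 1.2x headroom factor is an illustrative guess for KV cache and runtime buffers, and MoE offloading (as in the 24GB GPU + 256GB RAM setup mentioned) changes the math by spilling experts to system RAM:

```python
# On-disk sizes (GB) quoted above for the 397B weights; Unsloth publishes
# intermediate quants between these endpoints, omitted here for brevity.
SIZES_GB = {"1-bit quant": 94, "Q8 quant": 462, "full precision": 807}

def builds_that_fit(budget_gb: float, headroom: float = 1.2) -> list[str]:
    """Builds whose size, padded by a headroom factor for KV cache and
    runtime buffers (1.2 is an illustrative guess), fit the memory budget."""
    return [name for name, gb in SIZES_GB.items() if gb * headroom <= budget_gb]

print(builds_that_fit(256))   # ['1-bit quant'] on a 256GB-RAM box
```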
The practical option for individuals: Qwen3.5-9B via Ollama. Requires 16GB RAM minimum, runs well on Apple Silicon (M1/M2/M3) and NVIDIA cards with 8GB+ VRAM. Install and run with a couple of commands:
```shell
# Install Ollama (v0.17+)
curl -fsSL https://ollama.ai/install.sh | sh

# Pull and run Qwen 3.5
ollama run qwen3.5:9b
```
The 9B model handles most everyday tasks (drafting, summarization, light coding, Q&A) with noticeably better instruction following than older open-weight models at this size. On an M4 Mac mini with 16GB unified memory, expect 80-120 tokens per second.
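Beyond the CLI, Ollama also serves a local HTTP API on port 11434, which is how you would wire the model into scripts. A minimal standard-library sketch, assuming `ollama serve` is running and the model has been pulled:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

def build_request(model: str, prompt: str) -> dict:
    """Payload for Ollama's /api/generate endpoint, non-streaming."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model: str, prompt: str) -> str:
    """POST the prompt to the local server and return the completion text."""
    data = json.dumps(build_request(model, prompt)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Requires the local server to be running:
# print(generate("qwen3.5:9b", "Summarize sparse MoE in one sentence."))
```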
For the 35B-A3B variant (the sweet spot of capability vs. local hardware), llama.cpp with GGUF quantization is the current best path. Download a Q4_K_M GGUF from HuggingFace and run with llama-server. Full setup details are in the local LLMs guide 2026.
Important caveat: Multimodal (vision) capability requires separate mmproj files that currently don't work with Ollama's standard setup. For vision tasks locally, use llama.cpp directly.
I ran Qwen3.5-Plus (the API-hosted 397B variant) across a set of structured extraction tasks I regularly use to evaluate instruction adherence. The results aligned with the IFBench numbers.
What held up: Complex table generation with multiple formatting constraints, multi-step document analysis where each step builds on the previous output, JSON extraction with nested schemas. These are tasks where other frontier models occasionally skip constraints or restructure output in ways that break downstream parsing. Qwen 3.5 followed the spec more consistently.
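The failure mode described here (a model silently dropping a constraint or restructuring output) is easy to test for mechanically. A minimal sketch of the kind of schema-adherence check involved; both the schema and the sample output are illustrative, not from the actual test set:

```python
import json

# Required top-level sections and the nested keys each must contain
# (illustrative schema, not the one used in testing).
REQUIRED = {"invoice": {"id", "total", "currency"}, "vendor": {"name", "country"}}

def follows_schema(raw: str) -> bool:
    """True if raw parses as JSON and every required nested key is present."""
    try:
        doc = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return all(
        section in doc and REQUIRED[section] <= set(doc[section])
        for section in REQUIRED
    )

good = ('{"invoice": {"id": "A-17", "total": 42.0, "currency": "EUR"},'
        ' "vendor": {"name": "Acme", "country": "DE"}}')
print(follows_schema(good))               # True
print(follows_schema('{"invoice": {}}'))  # False: keys and vendor missing
```

Running a check like this over a batch of extraction outputs gives a constraint-adherence rate per model, which is the practical face of the IFBench gap.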
What I noticed: On open-ended creative tasks, the output felt slightly more mechanical than Claude Opus 4.6. Not worse in quality, just less stylistically varied. For structured work this doesn't matter. For long-form writing where voice consistency across outputs is important, Opus 4.6 still feels more natural.
Hallucination behavior: Comparable to other frontier models on factual questions. The model hedges on uncertainty reasonably well. I didn't notice systematic overconfidence issues, but I also didn't run a rigorous factual accuracy battery.
Latency via API: The 8.6x faster decoding claim translates into noticeably snappier API responses compared to Qwen3-Max. At 1M context lengths, the speed difference is more dramatic.
| Option | Cost | Best For |
|---|---|---|
| Qwen3.5-Plus API | $0.40/$1.20 per 1M tokens | High-volume structured tasks, teams that don't need GPT-5.2's math edge |
| Self-hosted 9B | Hardware cost only | Individual use, privacy-sensitive tasks, unlimited volume |
| Self-hosted 35B-A3B | Hardware cost only | Better quality at home, needs 32GB+ RAM |
| Self-hosted 397B | Multi-GPU or 256GB+ RAM server | Enterprise self-hosting, maximum quality, data sovereignty |
| NVIDIA NIM hosted | Check build.nvidia.com | Enterprise deployment without self-managed infra |
The self-hosted path is uniquely practical here because Apache 2.0 means no restrictions on commercial use. Teams with data sovereignty requirements (healthcare, legal, financial) can run the full model on-premises without licensing headaches.
Strong fit:
Teams doing high-volume structured extraction: processing large document sets, running classification pipelines, generating structured outputs at scale. The IFBench leadership translates directly to fewer broken outputs and lower error-handling overhead.
Cost-sensitive API users currently paying Claude Opus 4.6 rates for tasks that aren't leveraging Opus's specific reasoning advantages. Do the math: if you're spending $500/month on Opus 4.6 for instruction-following workloads, switching to Qwen3.5-Plus likely drops that to $40-50.
Multilingual deployments: 201-language support with strong benchmark performance in non-English settings. Qwen's training data has better global language distribution than most Western-origin models.
Teams exploring local deployment who previously found open-weight models too limited. The 9B variant is good enough for most everyday tasks; the larger quantized models bring it close to frontier quality.
Weaker fit:

Software engineering agents where SWE-bench scores are the decision criterion. GPT-5.2 (80.0) and Claude Opus 4.6 (80.9) both outperform Qwen 3.5 (76.4) on real-world coding benchmarks. The gap isn't enormous, but it's consistent. See the best AI tools for developers 2026 guide for a full breakdown.
Users where Claude's reasoning style specifically matters. Claude Opus 4.6 still leads on the Artificial Analysis Intelligence Index (53 vs Qwen's competitive score) and on complex multi-step reasoning tasks where output nuance matters as much as accuracy. The Claude Opus 4.6 review is worth reading if you're evaluating both seriously.
Anyone needing GUI/desktop automation agents. Qwen 3.5's multimodal training doesn't currently translate to competitive computer-use benchmarks. GPT-5.4 at 75% OSWorld remains the leader for agentic desktop work.
Simple use cases where free tiers suffice. If you're running a handful of queries daily for personal use, the best free AI tools 2026 guide covers better starting points than self-hosting a 400B-parameter model.
Via API: `qwen3.5-plus` for the 397B-equivalent hosted model.

Via Ollama (local, small models):

```shell
ollama run qwen3.5:9b   # 9B: practical for most home setups
ollama run qwen3.5:4b   # 4B: runs on 8GB RAM
```

Via llama.cpp (local, larger models): `llama-server` with the downloaded GGUF.

Via NVIDIA NIM: Available at build.nvidia.com for enterprise deployment without managing your own inference stack.
Qwen 3.5 does something the AI market rarely delivers: it offers a genuine tradeoff rather than a clear winner or loser. On instruction following, it's the best available model. On coding, it's third. On math, it's strong but not leading. And it's free to download and deploy commercially.
For teams currently spending significant budget on Claude Opus 4.6 for structured, instruction-heavy workloads, the cost argument is hard to ignore: 12x cheaper on input tokens with better instruction-following benchmarks. That's a real finding, not marketing.
The honest caveat: "best instruction-following model" is not the same as "best model." If your workload is primarily software engineering tasks, GPT-5.2 and Claude Opus 4.6 still have a real edge. If you're doing frontier math, the gap to GPT-5.2 matters.
But 130,000 derivative models on HuggingFace in less than four weeks isn't noise. The open-weight AI ecosystem is building on Qwen 3.5 faster than almost anything before it, which means fine-tunes, specialized variants, and deployment tooling will continue improving for months.
That trajectory matters as much as todayâs benchmarks.
The model weights are released under Apache 2.0, which means free to download, use, and deploy commercially. You can run Qwen3.5-9B on a standard laptop at no cost. Larger variants require significant hardware. The API (Qwen3.5-Plus) is a paid service at ~$0.40/1M input tokens, but that's separate from the open weights.
Qwen 3.5 leads on instruction following (IFBench 76.5 vs 75.4) and is competitive on math. GPT-5.2 leads on coding (SWE-bench 80.0 vs 76.4) and extreme math (AIME 96.7 vs 91.3). Cost favors Qwen 3.5 by roughly 6x on API input pricing. The right choice depends on your specific workload.
The 9B model runs on a standard 16GB-RAM laptop (an Apple Silicon Mac or a recent Windows machine). Expect 80+ tokens/second on an M4 Mac mini. The 4B model runs on 8GB RAM. The flagship 397B requires 256GB+ RAM or a multi-GPU setup.
Qwen3.5-Plus is the API-hosted version of the 397B-A17B model with additional production features: default 1M context window, built-in tool use, multimodal support via API. The raw model weights (Qwen3.5-397B-A17B) are self-hosted and require managing your own inference stack.
Yes: the model uses early-fusion training for text, images, and video. Via the Qwen3.5-Plus API, multimodal inputs are supported natively. For local deployment, multimodal requires llama.cpp with separate mmproj files rather than Ollama (which currently doesn't support the vision components).
Sparse MoE means only 17 billion of the 397 billion parameters activate per forward pass. For users, this means: faster inference (8.6x faster than Qwen3-Max at 32K context), lower cost per token, and surprisingly accessible local hardware requirements given the total parameter count. The tradeoff is larger model files: the full model is 807GB.
The Apache 2.0 license imposes no commercial restrictions. For data-sensitive work, self-hosting means your data never leaves your infrastructure, a meaningful advantage over API-only models. Standard enterprise concerns (fine-tuning stability, output consistency, audit logging) are addressed through the standard deployment stack (SGLang, vLLM). The model itself has standard safety training; enterprise-specific fine-tuning for compliance is fully permitted under the license.
Last updated: March 14, 2026. Benchmarks sourced from Artificial Analysis, model documentation at Hugging Face, and Qwen team technical reports. API pricing verified against pricepertoken.com. Claude Opus 4.6 pricing from Anthropic.