By AI Tool Briefing Team

Qwen 3.5 Review: The 397B Open-Weight Model That Rivals GPT-5 for Free


Alibaba dropped Qwen3.5-397B-A17B on February 16, 2026. Within a week it was scoring higher than GPT-5.2 on instruction-following benchmarks. Within two weeks, 130,000 derivative models had appeared on HuggingFace. And the whole thing is Apache 2.0 — free to download, fine-tune, and deploy commercially.

The honest question isn’t whether Qwen 3.5 is impressive. It clearly is. The question is whether it’s impressive enough to replace what you’re already using — or worth running locally instead of paying for API access.

Quick Verdict: Qwen3.5-397B-A17B

| Aspect | Rating |
|---|---|
| Overall Score | ★★★★☆ (4.2/5) |
| Best For | Instruction following, multilingual tasks, cost-sensitive API workloads, local self-hosting |
| API Pricing | ~$0.40/1M input, ~$1.20/1M output (Qwen3.5-Plus via Alibaba API) |
| License | Apache 2.0 (fully open-weight) |
| Context Window | 256K native, 1M via API (Qwen3.5-Plus) |
| IFBench (Instruction Following) | 76.5 — best of any current model |
| AIME 2026 (Math) | 91.3 — strong, behind GPT-5.2 (96.7) |
| SWE-bench Verified (Coding) | 76.4 — behind Claude Opus 4.6 (80.9) |

Bottom line: Qwen 3.5 doesn’t win every benchmark, but it wins the one that matters most for real-world tasks — and does it at a cost that makes API-reliant projects suddenly viable. If you’re paying Claude Opus 4.6 rates for instruction-heavy workloads, you’re likely overpaying by 12x.

Try Qwen3.5-Plus API | Download on HuggingFace

What Actually Makes This Architecture Different

Most AI models are dense transformers: every parameter activates on every token, every time. Qwen3.5-397B-A17B uses sparse Mixture-of-Experts (MoE) instead. The model has 397 billion total parameters, but only 17 billion activate per forward pass.

That’s not a marketing trick. It means the compute per token is closer to a 17B model’s than a 400B one’s. The practical results: inference costs 60% less than the equivalent Qwen3-Max, and decoding runs 8.6x faster at 32K context and 19x faster at 256K context.
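The routing idea behind sparse MoE can be sketched in a few lines. This is an illustrative toy in NumPy, not Qwen's actual implementation; the expert count, the top-k value, and the softmax-over-selected-experts gating are all assumptions chosen for demonstration.

```python
import numpy as np

def moe_forward(x, gate_w, experts, k=2):
    """Toy sparse-MoE layer: route one token to its top-k experts.

    x:       (d,) token hidden state
    gate_w:  (n_experts, d) router weights
    experts: list of (d, d) expert weight matrices
    Only k experts run per token, so compute scales with k, not n_experts.
    """
    logits = gate_w @ x                       # one router score per expert
    top = np.argsort(logits)[-k:]             # indices of the k best-scoring experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                  # softmax over just the selected experts
    return sum(w * (experts[i] @ x) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
d, n_experts = 8, 16
x = rng.standard_normal(d)
gate_w = rng.standard_normal((n_experts, d))
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]

y = moe_forward(x, gate_w, experts, k=2)
print(y.shape)  # (8,) — same shape as the input, but only 2 of 16 experts ran
```

Scale the same idea up and you get the headline numbers: 397B parameters stored, roughly 17B touched per token.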

The other architectural choice worth knowing: Gated Delta Networks replace standard attention in 75% of layers. Combined with early-fusion multimodal training (text, images, video processed together from the start), the model handles vision tasks without the limitations you see in models that added vision as an afterthought.

This isn’t the first MoE frontier model — Mixtral pioneered the approach, DeepSeek V4 pushed it further (see our DeepSeek V4 review for the architecture comparison). Qwen 3.5 refines the recipe again and delivers it under a fully permissive license.

Benchmark Reality Check

The marketing copy says Qwen 3.5 “beats GPT-5.2 and Claude Opus 4.5 on 80% of categories.” That framing is technically accurate but somewhat misleading. Here’s the honest picture:

Where Qwen 3.5 Actually Leads

Instruction following is the genuine standout. IFBench score of 76.5 is the highest of any current model — GPT-5.2 sits at 75.4, Claude benchmarks significantly lower at 58.0. MultiChallenge (multi-turn instruction fidelity) follows the same pattern: 67.6 for Qwen 3.5 vs GPT-5.2’s 57.9.

This matters for real-world use more than most benchmark discussions acknowledge. Complex, multi-step prompts — generate a table in this format, using these constraints, following these rules — are exactly what breaks models in production. Qwen 3.5 handles them with less drift than anything else currently available.

Math reasoning is strong. AIME 2026 score of 91.3 puts it in the top tier, behind GPT-5.2 (96.7) and Claude Opus 4.6 (93.3) but ahead of every open-weight predecessor. HMMT Feb 2025 score of 94.8 is more competitive.

Multilingual capability is a genuine differentiator. Support for 201 languages and dialects, up from 119 in the previous generation. For teams doing international deployments or localization work, this isn’t a footnote.

Where Qwen 3.5 Trails

SWE-bench Verified (real-world coding) tells a different story. Qwen 3.5 scores 76.4 against Claude Opus 4.6’s 80.9 and GPT-5.2’s 80.0. That’s a meaningful gap if software engineering agents are your primary use case. The best AI coding assistants guide covers this breakdown in full.

Math at the extreme end still favors closed models. The AIME gap between Qwen (91.3) and GPT-5.2 (96.7) is 5.4 points — not huge, but consistent with a pattern where open-weight models still trail at the hardest reasoning tasks.

Computer use isn’t its primary focus. GPT-5.4 posts a 75% OSWorld score; Qwen 3.5’s published benchmarks don’t prominently feature GUI navigation at all. For agentic workflows requiring real desktop automation, check the AI models compared 2026 overview.

The Cost Math That Changes Everything

Here’s the part that matters most if you’re running AI at any volume.

| Model | Input (per 1M tokens) | Output (per 1M tokens) | Multiple vs Qwen 3.5 |
|---|---|---|---|
| Qwen3.5-Plus | $0.40 | $1.20 | 1x (baseline) |
| GPT-5.2 | ~$2.50 | ~$15.00 | ~6x input / ~12x output |
| Claude Opus 4.6 | $5.00 | $25.00 | 12.5x input / ~21x output |
| Gemini 3.1 Pro | ~$4.00 | ~$12.00 | ~10x input / ~10x output |

At $0.40/M input tokens, Qwen3.5-Plus is 12.5x cheaper than Claude Opus 4.6 on input and roughly 21x cheaper on output. A 1M-token input request that costs $0.40 via the Qwen API costs $5.00 via Opus 4.6.
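The per-request math is easy to sanity-check against the list prices in the table above. A minimal sketch, with prices hardcoded from the table and a hypothetical workload shape:

```python
# Per-1M-token list prices from the comparison table (USD)
PRICES = {
    "qwen3.5-plus":    {"input": 0.40, "output": 1.20},
    "claude-opus-4.6": {"input": 5.00, "output": 25.00},
}

def request_cost(model, input_tokens, output_tokens):
    """USD cost of one request at the listed per-1M-token rates."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Hypothetical request: 1M tokens of input context, 2K tokens of output
qwen = request_cost("qwen3.5-plus", 1_000_000, 2_000)
opus = request_cost("claude-opus-4.6", 1_000_000, 2_000)
print(f"Qwen: ${qwen:.2f}  Opus: ${opus:.2f}  ratio: {opus / qwen:.1f}x")
```

Output-heavy workloads tilt the ratio even further, since the output-price gap (~21x) is wider than the input-price gap (12.5x).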

For applications where instruction following is the bottleneck — structured data extraction, document processing with complex output requirements, classification with detailed criteria — Qwen 3.5 delivers the best benchmark numbers at the lowest cost. That’s not a typical combination.

The self-hosted calculation is even more extreme. Under Apache 2.0, you download the weights, run them on your own hardware, and pay zero per-token costs. One-time infrastructure investment, unlimited usage.

Running It Locally: What You Actually Need

This is where Qwen 3.5’s MoE architecture works in favor of individual users. Because only 17B parameters activate per forward pass, the actual compute requirement is far lower than the 397B parameter count suggests.

The full 397B model: 807GB on disk. Quantized versions from Unsloth range from 94GB (1-bit) to 462GB (Q8). You need either a multi-GPU server setup or a machine with large unified memory. Reported performance: 25+ tokens/second on a single 24GB GPU with 256GB system RAM using MoE offloading.

The practical option for individuals: Qwen3.5-9B via Ollama. Requires 16GB RAM minimum, runs well on Apple Silicon (M1/M2/M3) and NVIDIA cards with 8GB+ VRAM. Install with three commands:

# Install Ollama (v0.17+)
curl -fsSL https://ollama.ai/install.sh | sh

# Pull and run Qwen 3.5
ollama run qwen3.5:9b

The 9B model handles most everyday tasks — drafting, summarization, light coding, Q&A — with noticeably better instruction following than older open-weight models at this size. On an M4 Mac mini with 16GB unified memory, expect 80-120 tokens per second.

For the 35B-A3B variant (the sweet spot of capability vs. local hardware), llama.cpp with GGUF quantization is the current best path. Download a Q4_K_M GGUF from HuggingFace and run with llama-server. Full setup details are in the local LLMs guide 2026.

Important caveat: Multimodal (vision) capability requires separate mmproj files that currently don’t work with Ollama’s standard setup. For vision tasks locally, use llama.cpp directly.

My Testing Notes

I ran Qwen3.5-Plus (the API-hosted 397B variant) across a set of structured extraction tasks I regularly use to evaluate instruction adherence. The results aligned with the IFBench numbers.

What held up: Complex table generation with multiple formatting constraints, multi-step document analysis where each step builds on the previous output, JSON extraction with nested schemas. These are tasks where other frontier models occasionally skip constraints or restructure output in ways that break downstream parsing. Qwen 3.5 followed the spec more consistently.
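One way to make “followed the spec” concrete is to validate every response against the expected shape before it reaches downstream code. A minimal sketch using only the standard library; the invoice schema here is a hypothetical example, not from my actual test set:

```python
import json

def validate_invoice(raw: str) -> dict:
    """Parse a model response and check it against a hypothetical nested schema.

    Raises ValueError on any drift from the spec — silent shape changes
    are exactly the failures that break downstream parsing in production.
    """
    data = json.loads(raw)
    if not isinstance(data.get("invoice_id"), str):
        raise ValueError("invoice_id must be a string")
    items = data.get("line_items")
    if not isinstance(items, list) or not items:
        raise ValueError("line_items must be a non-empty list")
    for item in items:
        if not isinstance(item.get("description"), str):
            raise ValueError("each line item needs a string description")
        if not isinstance(item.get("amount"), (int, float)):
            raise ValueError("each line item needs a numeric amount")
    return data

good = '{"invoice_id": "INV-7", "line_items": [{"description": "API usage", "amount": 12.5}]}'
print(validate_invoice(good)["invoice_id"])  # INV-7
```

The practical difference between models shows up in how often this kind of validator raises: in my runs, Qwen 3.5 tripped it less often than the alternatives.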

What I noticed: On open-ended creative tasks, the output felt slightly more mechanical than Claude Opus 4.6. Not worse in quality — just less stylistically varied. For structured work this doesn’t matter. For long-form writing where voice consistency across outputs is important, Opus 4.6 still feels more natural.

Hallucination behavior: Comparable to other frontier models on factual questions. The model hedges on uncertainty reasonably well. I didn’t notice systematic overconfidence issues, but I also didn’t run a rigorous factual accuracy battery.

Latency via API: The 8.6x faster decoding claim translates into noticeably snappier API responses compared to Qwen3-Max. At 1M context lengths, the speed difference is more dramatic.

Pricing Tiers: What You’re Actually Choosing Between

| Option | Cost | Best For |
|---|---|---|
| Qwen3.5-Plus API | $0.40/$1.20 per 1M tokens | High-volume structured tasks, teams that don’t need GPT-5.2’s math edge |
| Self-hosted 9B | Hardware cost only | Individual use, privacy-sensitive tasks, unlimited volume |
| Self-hosted 35B-A3B | Hardware cost only | Better quality at home; needs 32GB+ RAM |
| Self-hosted 397B | Multi-GPU or 256GB+ RAM server | Enterprise self-hosting, maximum quality, data sovereignty |
| NVIDIA NIM hosted | Check build.nvidia.com | Enterprise deployment without self-managed infra |

The self-hosted path is uniquely practical here because Apache 2.0 means no restrictions on commercial use. Teams with data sovereignty requirements — healthcare, legal, financial — can run the full model on-premises without licensing headaches.

Who Should Switch to Qwen 3.5

Strong fit:

Teams doing high-volume structured extraction — processing large document sets, running classification pipelines, generating structured outputs at scale. The IFBench leadership translates directly to fewer broken outputs and lower error-handling overhead.

Cost-sensitive API users currently paying Claude Opus 4.6 rates for tasks that aren’t leveraging Opus’s specific reasoning advantages. Do the math: if you’re spending $500/month on Opus 4.6 for instruction-following workloads, switching to Qwen3.5-Plus likely drops that to $40-50.

Multilingual deployments — 201 language support with strong benchmark performance in non-English settings. Qwen’s training data has better global language distribution than most Western-origin models.

Teams exploring local deployment who previously found open-weight models too limited. The 9B variant is good enough for most everyday tasks; the larger quantized models bring it close to frontier quality.

Who Should Look Elsewhere

Software engineering agents where SWE-bench scores are the decision criterion. GPT-5.2 (80.0) and Claude Opus 4.6 (80.9) both outperform Qwen 3.5 (76.4) on real-world coding benchmarks. The gap isn’t enormous, but it’s consistent. See the best AI tools for developers 2026 guide for a full breakdown.

Users where Claude’s reasoning style specifically matters. Claude Opus 4.6 still leads on the Artificial Analysis Intelligence Index (53, with Qwen 3.5 close behind) and on complex multi-step reasoning tasks where output nuance matters as much as accuracy. The Claude Opus 4.6 review is worth reading if you’re evaluating both seriously.

Anyone needing GUI/desktop automation agents. Qwen 3.5’s multimodal training doesn’t currently translate to competitive computer-use benchmarks. GPT-5.4 at 75% OSWorld remains the leader for agentic desktop work.

Simple use cases where free tiers suffice. If you’re running a handful of queries daily for personal use, the best free AI tools 2026 guide covers better starting points than self-hosting a 400B parameter model.

How to Get Started

Via API:

  1. Create an account at qwen.ai or access via Alibaba Cloud DashScope
  2. Select qwen3.5-plus for the 397B-equivalent hosted model
  3. Standard OpenAI-compatible API format — drop-in compatible with most existing integrations

Via Ollama (local, small models):

ollama run qwen3.5:9b   # 9B — practical for most home setups
ollama run qwen3.5:4b   # 4B — runs on 8GB RAM

Via llama.cpp (local, larger models):

  1. Download GGUF from Unsloth’s HuggingFace (recommend Q4_K_M for quality/size balance)
  2. Build llama.cpp with GPU support for your hardware
  3. Run llama-server with the downloaded GGUF

Via NVIDIA NIM: Available at build.nvidia.com for enterprise deployment without managing your own inference stack.

The Bottom Line

Qwen 3.5 does something the AI market rarely delivers: it offers a genuine tradeoff rather than a clear winner or loser. On instruction following, it’s the best available model. On coding, it’s third. On math, it’s strong but not leading. And it’s free to download and deploy commercially.

For teams currently spending significant budget on Claude Opus 4.6 for structured, instruction-heavy workloads, the cost argument is hard to ignore — 12x cheaper on input tokens with better instruction-following benchmarks. That’s a real finding, not marketing.

The honest caveat: “best instruction-following model” is not the same as “best model.” If your workload is primarily software engineering tasks, GPT-5.2 and Claude Opus 4.6 still have a real edge. If you’re doing frontier math, the gap to GPT-5.2 matters.

But 130,000 derivative models on HuggingFace in less than four weeks isn’t noise. The open-weight AI ecosystem is building on Qwen 3.5 faster than almost anything before it, which means fine-tunes, specialized variants, and deployment tooling will continue improving for months.

That trajectory matters as much as today’s benchmarks.


Frequently Asked Questions

Is Qwen 3.5 actually free to use?

The model weights are released under Apache 2.0, which means free to download, use, and deploy commercially. You can run Qwen3.5-9B on a standard laptop at no cost. Larger variants require significant hardware. The API (Qwen3.5-Plus) is a paid service at ~$0.40/1M input tokens, but that’s separate from the open weights.

How does Qwen 3.5 compare to GPT-5.2?

Qwen 3.5 leads on instruction following (IFBench 76.5 vs 75.4) and is competitive on math. GPT-5.2 leads on coding (SWE-bench 80.0 vs 76.4) and extreme math (AIME 96.7 vs 91.3). Cost favors Qwen 3.5 by roughly 6x on API input pricing. The right choice depends on your specific workload.

Can I run Qwen 3.5 on a consumer laptop?

The 9B model runs on a 16GB RAM laptop — an Apple Silicon Mac or recent Windows machine with 16GB RAM will work. Expect 80+ tokens/second on M4 Mac mini. The 4B model runs on 8GB RAM. The flagship 397B requires 256GB+ RAM or a multi-GPU setup.

What’s the difference between Qwen3.5-397B-A17B and Qwen3.5-Plus?

Qwen3.5-Plus is the API-hosted version of the 397B-A17B model with additional production features: default 1M context window, built-in tool use, multimodal support via API. The raw model weights (Qwen3.5-397B-A17B) are self-hosted and require managing your own inference stack.

Does Qwen 3.5 support multimodal inputs?

Yes — the model uses early fusion training for text, images, and video. Via the Qwen3.5-Plus API, multimodal inputs are supported natively. For local deployment, multimodal requires llama.cpp with separate mmproj files rather than Ollama (which currently doesn’t support the vision components).

How does the MoE architecture affect practical performance?

Sparse MoE means only 17 billion of the 397 billion parameters activate per forward pass. For users, this means: faster inference (8.6x faster than Qwen3-Max at 32K context), lower cost per token, and surprisingly accessible local hardware requirements given the total parameter count. The tradeoff is larger model files — the full model is 807GB.

Is Qwen 3.5 safe to use for enterprise work?

The Apache 2.0 license imposes no commercial restrictions. For data-sensitive work, self-hosting means your data never leaves your infrastructure — a meaningful advantage over API-only models. Standard enterprise concerns (fine-tuning stability, output consistency, audit logging) are addressed through the standard deployment stack (SGLang, vLLM). The model itself has standard safety training; enterprise-specific fine-tuning for compliance is fully permitted under the license.


Last updated: March 14, 2026. Benchmarks sourced from Artificial Analysis, model documentation at Hugging Face, and Qwen team technical reports. API pricing verified against pricepertoken.com. Claude Opus 4.6 pricing from Anthropic.