DeepSeek's hosted R1 endpoint charges about $2.19 per million output tokens. OpenAI's o1, the proprietary reasoning model it chased, launched at $60. That 27x price gap, opened by DeepSeek-R1's January 2025 release under an MIT license, is the single number that explains the 2026 AI model landscape.
Open-source reasoning models are open-weight LLMs post-trained specifically for multi-step reasoning, typically with reinforcement learning on chain-of-thought traces. In 2026 they sit within roughly 5 to 17 points of the proprietary frontier on the hardest benchmarks while costing one to two orders of magnitude less per token.
TL;DR: Between January 2025 and mid-2026, DeepSeek-R1, Meta's Llama 4, and Alibaba's Qwen 3 turned frontier reasoning into commodity infrastructure. The capability gap to OpenAI's o3 and Claude 4 is real on the newest benchmarks but small enough that multi-model strategies are now the norm. For most production reasoning workloads, an open-weight default with a proprietary fallback is the rational architecture.
Key takeaways:
- DeepSeek-R1-0528 jumped from 70% to 87.5% on AIME 2025 in four months, per OpenRouter's tracked benchmarks. The open frontier is improving faster than the closed one.
- On SWE-bench Verified, the most commercially relevant benchmark, open models trail the proprietary leaders by 10 to 15 points.
- Licensing matters as much as capability: Llama 4 excludes EU residents, so Qwen 3 (Apache 2.0) and DeepSeek-R1 (MIT) dominate European regulated deployments.
- Self-hosting beats managed APIs above roughly 30% sustained GPU utilization. Below that, use a hosted endpoint.
- The 2023 cohort (OpenLLaMA, Cerebras-GPT) is history; RWKV survives as a niche architecture for constant-memory long-context inference.
What changed: RL post-training became a public recipe
The defining shift is that the recipe for reasoning is no longer a trade secret. Every leading open model now follows the same template: a large mixture-of-experts base, then reinforcement learning on chain-of-thought traces with verifiable rewards.
DeepSeek-R1 proved the recipe in public. It's a 671-billion-parameter MoE with 37 billion active parameters per token, and its R1-Zero variant reached 71% on AIME 2024 through pure RL with no supervised fine-tuning seed, showing emergent self-verification and backtracking along the way, per the Hugging Face analysis of the R1 paper.
By September 2025 that paper had been published in Nature, making R1 the first open-weight reasoning model to pass peer review.
Meta's Llama 4 family, released April 5, 2025, brought MoE and native multimodality to the Llama line. Scout packs 109 billion total parameters with 17 billion active and a 10-million-token context window, the longest in the open-weight segment at release, according to Meta's announcement.
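The active-versus-total split is what drives per-token compute, so it can help to put the two models' figures side by side. This is a minimal sketch using only the parameter counts quoted above; the `active_fraction` helper is illustrative and does not come from either vendor.

```python
def active_fraction(total_params_b: float, active_params_b: float) -> float:
    """Share of a MoE model's parameters used for each token."""
    if total_params_b <= 0 or not 0 < active_params_b <= total_params_b:
        raise ValueError("expected 0 < active <= total")
    return active_params_b / total_params_b


# Parameter counts in billions, as quoted in the text.
models = {
    "DeepSeek-R1": (671, 37),
    "Llama 4 Scout": (109, 17),
}

for name, (total, active) in models.items():
    share = active_fraction(total, active)
    print(f"{name}: {share:.1%} of weights active per token")
```

That works out to roughly 5.5% for R1 and 15.6% for Scout. Per-token compute tracks the active count, but the full set of weights still has to be stored and served.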
Alibaba's Qwen 3 added the most practical feature: a hybrid thinking mode. One model can emit long chain-of-thought tokens for hard problems or answer directly for easy ones, so you pay for reasoning only when you need it.
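A rough way to see what that saves is to compare output-token spend when every request thinks against spend when only the hard ones do. The sketch below is back-of-envelope arithmetic, not Qwen's billing model: the request mix, token counts, and the `blended_output_cost` helper are all assumptions, and the price is a placeholder borrowed from the hosted R1 figure quoted earlier because the text gives no Qwen 3 price.

```python
def blended_output_cost(
    requests: int,
    hard_fraction: float,
    answer_tokens: int,
    thinking_tokens: int,
    usd_per_million_output: float,
) -> float:
    """Output-token cost when only hard requests emit chain-of-thought."""
    if not 0.0 <= hard_fraction <= 1.0:
        raise ValueError("hard_fraction must be between 0 and 1")
    hard_requests = requests * hard_fraction
    total_tokens = requests * answer_tokens + hard_requests * thinking_tokens
    return total_tokens * usd_per_million_output / 1_000_000


# Assumed workload: 100k requests, 300-token answers, 4,000 thinking tokens.
PRICE = 2.19  # placeholder USD per million output tokens
always_think = blended_output_cost(100_000, 1.0, 300, 4_000, PRICE)
hybrid = blended_output_cost(100_000, 0.2, 300, 4_000, PRICE)
print(f"always thinking: ${always_think:,.2f}  hybrid (20% hard): ${hybrid:,.2f}")
```

Under these assumed numbers the hybrid mix costs about a quarter of the always-thinking bill; the real ratio depends entirely on how many requests actually need the long trace.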
Alibaba reports roughly 90,000 enterprise clients on the Qwen Cloud API, and Qwen is the most-downloaded open-weight family on Hugging Face as of mid-2026.
How close are open-source reasoning models to the proprietary frontier?
Close on most benchmarks, with a measurable gap on the newest ones. The numbers below are lab-reported, so treat them as the issuing vendor's claim rather than an independent reproduction.
| Model | GPQA Diamond | AIME 2025 | SWE-bench Verified | License |
|---|---|---|---|---|
| DeepSeek-R1-0528 | 71.5 (orig. R1) | 87.5 | 57.6 | MIT |
| Qwen3-235B-A22B | 81.3 | 81.3 | 51.6 (Qwen3-Coder) | Apache 2.0 |
| Llama 4 Maverick | 73.7 | n/a | n/a | Llama 4 Community |
| OpenAI o3 | 87.7 | 98.4 | 69.1 | Proprietary |
| Claude 4 Sonnet/Opus | 83+ | n/a | 72.7 | Proprietary |
Three patterns matter. On GPQA Diamond, Qwen3-235B-A22B's 81.3 sits within a few points of Claude 4 and Gemini 2.5 Pro. On AIME 2025, the most contamination-resistant math test, o3's 98.4 still clearly leads R1-0528's 87.5.
And on SWE-bench Verified, which measures resolving real GitHub issues end to end, the open cohort trails by 10 to 15 points.
That last gap is the one to watch. Software engineering is the highest-leverage commercial reasoning workload, and the rate of open-model improvement there is the strongest argument that open weights will be competitive for production coding by 2027.
One honest caveat on third-party validation: the Stanford HELM capabilities leaderboard, the most authoritative independent reference, hasn't been refreshed since March 2025. Several major 2026 releases simply aren't represented in any rigorous third-party ranking yet.
Why is the cost gap so large?
Because MoE architectures activate a fraction of their parameters per token, and because open weights let you ride rented GPU economics instead of API margins.
Self-hosting pushes the cost lower still than hosted open-weight pricing. An H100 rents for roughly $2 to $4 per GPU-hour, which works out to about $0.0005 per thousand output tokens (roughly $0.50 per million) for a quantized 70B model at 50% utilization. The break-even against managed APIs lands around 30% sustained utilization, per OpenRouter's pricing data and standard cloud GPU rates.
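The per-token figure above follows from three inputs: the hourly GPU price, the aggregate throughput, and the share of each hour spent serving traffic. Here is a minimal sketch of that arithmetic. The throughput of 3,300 tokens per second is an assumption chosen for illustration (the text does not state one), and both helper names are hypothetical.

```python
def self_hosted_usd_per_million(
    gpu_usd_per_hour: float,
    tokens_per_second: float,
    utilization: float,
) -> float:
    """Cost of one million output tokens on a rented GPU."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    useful_tokens_per_hour = tokens_per_second * 3600 * utilization
    return gpu_usd_per_hour / useful_tokens_per_hour * 1_000_000


def breakeven_utilization(
    gpu_usd_per_hour: float,
    tokens_per_second: float,
    api_usd_per_million: float,
) -> float:
    """Utilization at which self-hosting matches a given API price."""
    full_load_cost = self_hosted_usd_per_million(
        gpu_usd_per_hour, tokens_per_second, 1.0
    )
    return full_load_cost / api_usd_per_million


# Midpoint of the quoted $2 to $4 range; throughput is an assumed value.
ASSUMED_TOKENS_PER_SECOND = 3_300
cost = self_hosted_usd_per_million(3.0, ASSUMED_TOKENS_PER_SECOND, 0.5)
print(f"~${cost:.2f} per million output tokens at 50% utilization")
```

With those inputs the result is about $0.50 per million tokens, in line with the figure quoted above. The break-even helper is very sensitive to the assumed throughput and API price, and it ignores engineering time, storage, and networking, so its output is best read as a lower bound on the real break-even point.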
The serving stack is mature enough that this is no longer a research project. vLLM dominates with PagedAttention, prefix caching, and speculative decoding (which can roughly double tokens per second on long chain-of-thought outputs). SGLang is the strong second option for structured generation, and llama.cpp remains the standard for running reasoning models locally on consumer GPUs. There, INT4 quantization fits models up to roughly 32B on a single RTX 4090-class card at a cost of 1 to 3 MMLU points; 70B-class models need a second card or partial CPU offload.
Where do proprietary models still win?
Three places, and they're consistent across evaluations.
First, the newest benchmarks. On ARC-AGI and FrontierMath-2026, the gap to o3 widens to 15 to 25 points, where late-stage RL recipes and an extra year of compute still pay off. Symbolic manipulation (counting, modular arithmetic, formal proofs) is the open cohort's weakest area.
Second, long-horizon agentic work. Multi-hour software engineering and browser automation depend on tool-use infrastructure that OpenAI, Anthropic, and Google have invested in heavily and that doesn't ship in a weights file.
Third, adversarial safety. Open-weight models are measurably more susceptible to jailbreaks, per academic vulnerability surveys, and an attacker with the weights can strip guardrails entirely. The practical mitigation is architectural: put the open model behind a gateway that handles input filtering, output moderation, and PII redaction, which is exactly the pattern banks running self-hosted R1 deployments use.
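The gateway pattern can be sketched as a thin wrapper around the model call. This is a toy illustration under stated assumptions, not a production filter: the blocklist, the email-only PII regex, the `moderate` hook, and the function names are all hypothetical, and a real deployment would use dedicated classifiers for each stage.

```python
import re
from typing import Callable

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
BLOCKED_PHRASES = ("ignore previous instructions",)  # placeholder input filter


def redact_pii(text: str) -> str:
    """Replace email addresses with a fixed token (illustrative only)."""
    return EMAIL.sub("[REDACTED_EMAIL]", text)


def guarded_call(
    prompt: str,
    model: Callable[[str], str],
    moderate: Callable[[str], bool] = lambda reply: True,
) -> str:
    """Input filter -> model -> output moderation -> PII redaction."""
    lowered = prompt.lower()
    if any(phrase in lowered for phrase in BLOCKED_PHRASES):
        return "Request blocked by input filter."
    # Redact before the prompt reaches the model, and again on the way out.
    reply = model(redact_pii(prompt))
    if not moderate(reply):
        return "Response withheld by output moderation."
    return redact_pii(reply)


# Usage with a stand-in model that simply echoes its input.
print(guarded_call("Contact jane.doe@example.com about the audit", lambda p: p))
```

Keeping the checks outside the model means they still apply if the weights are swapped or fine-tuned, which is the point of doing this at the gateway.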
Which model should you actually deploy?
Start with the license, then the size, then the benchmark.
For EU or strictly regulated deployments, Qwen 3 and DeepSeek-R1 are the defaults. Llama 4's community license excludes EU residents from accepting its terms and requires any entity with more than 700 million monthly active users to request a separate license from Meta, which makes it effectively closed in Europe. The EU AI Act's transparency obligations for general-purpose AI also favor models whose weights you can actually inspect. Mistral's Apache 2.0 releases round out the compliant European options.
For maximum reasoning per dollar, run DeepSeek-R1-0528 or Qwen3-235B-A22B on a vLLM cluster. The common production pattern is 8 to 64 H100s or H200s behind an internal API gateway.
For edge and on-device, the small tier got serious. Microsoft's Phi-4-Mini-Reasoning fits a single consumer GPU at INT4, and Google's Gemma 3n runs multimodal workloads on smartphone-class hardware.
And a note on the 2023 generation many of us learned on: OpenLLaMA and Cerebras-GPT now serve as historical references for permissive reproduction and scaling research. RWKV is the survivor worth knowing, since RWKV-7 "Goose" keeps improving a linear-attention RNN that runs with constant memory per token regardless of context length. For resource-constrained long-context inference, that property is genuinely useful even if the model trails the reasoning frontier.
What this means for you
If you're running high-volume reasoning workloads (code review, document analysis, customer-facing assistants), price an open-weight deployment now. The 10 to 30x cost differential is durable, and the McKinsey State of AI 2025 survey confirms cost and data sovereignty are the dominant drivers pushing enterprises toward open or hybrid stacks.
Adopt the multi-model pattern: a primary open-weight model behind a common API, with a proprietary fallback for the hardest cases and long-horizon agent tasks. This is now the norm in production, and it also hedges the geopolitical and deprecation risks on both sides.
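In code, the pattern reduces to a small router that tries the open-weight model first and escalates when the call fails or the answer does not pass a check. The sketch below uses plain callables as stand-ins for the two model clients; `answer_with_fallback` and the `is_acceptable` check are hypothetical names, and how you judge an answer acceptable is the real design decision.

```python
from typing import Callable, Optional

Model = Callable[[str], Optional[str]]


def answer_with_fallback(
    prompt: str,
    primary: Model,
    fallback: Model,
    is_acceptable: Callable[[str], bool],
) -> tuple[str, str]:
    """Try the primary model first; escalate on error or a rejected answer."""
    try:
        reply = primary(prompt)
    except Exception:
        reply = None  # treat any primary failure as a reason to escalate
    if reply is not None and is_acceptable(reply):
        return "primary", reply
    fallback_reply = fallback(prompt)
    if fallback_reply is None:
        raise RuntimeError("both models failed to answer")
    return "fallback", fallback_reply


# Stand-in models: the primary gives up on prompts it considers too long.
open_model = lambda p: None if len(p) > 40 else f"open: {p}"
closed_model = lambda p: f"closed: {p}"

route, text = answer_with_fallback(
    "Summarize this diff", open_model, closed_model, lambda r: len(r) > 0
)
print(route, "->", text)
```

Returning the route alongside the reply makes it easy to log how often the fallback fires, which is the number that determines the blended cost.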
Budget honestly for fine-tuning. A full-parameter run on a 70B model costs $50,000 to $500,000 in cloud GPU time, and "we fine-tuned and it got worse" is a common failure mode. QLoRA plus GRPO on verifiable rewards is the cheaper, safer path for most teams.
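GRPO scores each sampled answer relative to the other samples drawn for the same prompt, rather than against a learned value model. Below is a minimal sketch of just that normalization step, paired with a toy exact-match reward. It omits the policy-gradient update, clipping, and KL penalty, and the function names are illustrative, not taken from any training library.

```python
from statistics import mean, pstdev


def exact_match_reward(completion: str, reference: str) -> float:
    """Verifiable reward: 1.0 if the final line equals the reference answer."""
    lines = completion.strip().splitlines()
    final = lines[-1].strip() if lines else ""
    return 1.0 if final == reference.strip() else 0.0


def group_relative_advantages(
    rewards: list[float], eps: float = 1e-8
) -> list[float]:
    """Normalize each completion's reward against its own sample group."""
    if len(rewards) < 2:
        raise ValueError("need at least two completions per prompt")
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every sample gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled completions for one prompt whose reference answer is "42".
samples = ["6 * 7\n42", "41", "Let me check.\n42", "I am not sure"]
rewards = [exact_match_reward(s, "42") for s in samples]
print(rewards, group_relative_advantages(rewards))
```

Correct samples get a positive advantage and incorrect ones a negative advantage of equal size here; when every sample in a group scores the same, all advantages are zero and that prompt contributes no learning signal.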
The quotable version of 2026: open versus proprietary reasoning is now a deployment choice more than a capability one. The frontier still belongs to a few well-funded labs, but the recipes are public, the gap is down to single or low double digits on most benchmarks, and the price of admission has collapsed.
Sources
- Understanding the DeepSeek R1 Paper, Hugging Face
- Meta releases Llama 4, TechCrunch
- The Llama 4 herd, Meta AI
- Llama 4 Community License, GitHub
- DeepSeek R1 pricing and benchmarks, OpenRouter
- Alibaba unveils new Qwen3 models
- Qwen on Alibaba Cloud
- Introducing Claude 4, Anthropic
- Gemini 2.5 Pro, Google DeepMind
- HELM capabilities leaderboard, Stanford CRFM
- Phi-4 reasoning training details, DeepLearning.AI
- Introducing Gemma 3n, Google Developers
- RWKV-7 "Goose", arXiv
- Jailbreaking and mitigation of LLM vulnerabilities, arXiv
- EU AI Act, European Commission
- The State of AI: Global Survey 2025, McKinsey
