The self-host-versus-API decision most teams settled two years ago is wrong now. Cheap open-weight flagships and collapsing inference prices moved the break-even line, and the number that matters is not one number.
Small open-weight models (7B to 34B parameters) break even against API pricing near 1 million requests per month. Large models (70B and up) need roughly 10 million requests per month before dedicated hardware pays off, according to production TCO calculators.
That is an order-of-magnitude gap, and it changes the answer for any team that treats self-hosting as a single yes-or-no decision.
The catch: even the 10M-request figure is optimistic. Once you add ops headcount, GPUs that sit idle 40 to 60% of the time, and compliance, the realistic enterprise break-even for a 70B model lands at $50,000 to $80,000 per month in inference spend.
TL;DR
Self-hosting an open-weight LLM beats API pricing only above real scale: about 1M requests/month for 7B models, about 10M/month for 70B models. Naive spreadsheets undercount by 1.5 to 2x because production GPUs sit 40 to 60% utilized, per industry inference estimates.
The dominant 2026 pattern is hybrid: self-host the cheap 80%, route the hard 20% to a frontier API.
Key takeaways
- Two break-evens, not one. 7B models cross over near 1M requests/month; 70B models near 10M requests/month, or $50K to $80K/month in real cost.
- The idle tax is real. Production GPU utilization averages 40 to 60%, so true self-host cost runs 1.5 to 2x the hardware-and-power spreadsheet.
- Hybrid wins the middle. Below $25K/month go API-first; $25K to $80K/month favors self-host-plus-route; above $80K/month evaluate dedicated infrastructure.
- Open-weight API pricing is absurdly cheap. DeepSeek V4-Flash launched at $0.09 input / $0.18 output per 1M tokens, which is hard to undercut by owning GPUs at low volume.
- Timing matters. SK hynix's June 29, 2026 KRW 1,100T investment won't ease HBM supply until 2028 to 2029; B200 delivery still slips to 2028+.
When does self-hosting an LLM actually break even?
Direct answer: for a 7B model serving simple extraction and generation, self-hosting wins past roughly 1 million requests per month; for a 70B model handling harder reasoning, past roughly 10 million requests per month on paper, and higher once real costs land. Below those volumes, API pricing is cheaper and you avoid the ops burden entirely.
The math is starkest at the small end. A 7B model runs on a single H100 or two RTX 4090s, with a fully-loaded monthly cost of $3,000 to $8,000. That fixed cost only beats cheap open-weight API pricing once your token volume is large enough to amortize the box.
The crossover is easiest to see in request terms.
At 100K requests, API costs $50 against $3,000 to self-host. That is not close. Even at 5M requests, small-model API pricing stays so cheap that owning a box rarely wins on raw dollars. For 7B models, the real reasons to self-host are latency control and data residency, not cost savings.
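As a rough check, the figures in this section can be turned into a break-even volume. This is a minimal sketch, assuming the self-host cost stays flat as volume grows (one box absorbs the load) and the API price per request stays constant at the quoted $50 per 100K requests. The function name is illustrative, not taken from any calculator.

```python
def break_even_requests(fixed_monthly_cost: float, api_cost_per_request: float) -> float:
    """Monthly request volume at which API spend equals a flat self-host cost."""
    if api_cost_per_request <= 0:
        raise ValueError("api_cost_per_request must be positive")
    return fixed_monthly_cost / api_cost_per_request


# $50 per 100K requests, as quoted above.
API_COST_PER_REQUEST = 50 / 100_000

# Fully loaded 7B self-host range quoted above: $3,000 to $8,000 per month.
low = break_even_requests(3_000, API_COST_PER_REQUEST)
high = break_even_requests(8_000, API_COST_PER_REQUEST)

print(f"7B break-even: {low:,.0f} to {high:,.0f} requests/month")
```

Under those assumptions the crossover sits at roughly 6M to 16M requests per month, which is in line with the API still winning at 5M requests.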
The large-model picture flips, because Claude Sonnet 5 pricing at $3 input / $15 output per 1M tokens is expensive enough that owning hardware can genuinely pay at high volume.
| Monthly requests | 70B API cost | Self-host (naive) | Self-host (realistic) | Winner |
|---|---|---|---|---|
| 1,000,000 | $8,000 | $50,000 | $50,000 | API |
| 5,000,000 | $40,000 | $55,000 | $55,000 | API |
| 10,000,000 | $80,000 | $60,000 | $85,000 | Depends |
Notice the last row. The naive calculation says self-hosting wins at 10M requests ($60K vs $80K). The realistic calculation, with idle waste and ops loaded in, says it loses ($85K vs $80K). That single gap is where most self-hosting budgets go wrong.
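The comparison in the table can be replayed in a few lines. This sketch only encodes the three rows above and picks the cheaper option under each self-host estimate; the `Row` type and function names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    requests: int
    api: int        # 70B API cost, $/month
    naive: int      # self-host, naive estimate, $/month
    realistic: int  # self-host with idle waste and ops loaded in, $/month


# Values copied from the table above.
ROWS = [
    Row(1_000_000, 8_000, 50_000, 50_000),
    Row(5_000_000, 40_000, 55_000, 55_000),
    Row(10_000_000, 80_000, 60_000, 85_000),
]


def winner(api_cost: int, self_host_cost: int) -> str:
    """Pick the cheaper option on raw dollars only."""
    return "self-host" if self_host_cost < api_cost else "API"


def verdicts(row: Row) -> tuple[str, str]:
    """Return (naive verdict, realistic verdict) for one row."""
    return winner(row.api, row.naive), winner(row.api, row.realistic)


for row in ROWS:
    naive_v, real_v = verdicts(row)
    flag = "  <- verdict flips" if naive_v != real_v else ""
    print(f"{row.requests:>10,}  naive: {naive_v:<9}  realistic: {real_v:<9}{flag}")
```

Only the 10M row flips, and the margin is $5,000 a month, so small changes in either estimate move the answer. That is why the table calls it "Depends".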
Why do naive break-even calculators undercount by 2x?
Direct answer: because production GPUs sit idle 40 to 60% of the time, and personnel costs stay fixed regardless of load. Most calculators price GPUs at full utilization and forget the humans. Real utilization tells the story.
A dedicated production cluster averages 40 to 60% utilization, per industry inference estimates. Capacity reserved for bursts can drop to 20 to 30% utilization, because you provision for the peak and pay for the trough. Human request traffic swings 3 to 10x between peak and off-hours, and holidays can pull utilization down to 10 to 20%.
The cost-per-GPU-hour moves violently with that idle rate. Take an 8x H100 cluster at $240,000 hardware plus $15,000/month operating:
- At 70% utilization: $1.58 per GPU-hour.
- At 40% utilization: $2.76 per GPU-hour, a 75% jump.
- At 20% utilization: $5.52 per GPU-hour, 3.5x the baseline.
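The scaling behind those three figures is simple: a fixed monthly cost divided by fewer useful GPU-hours. The sketch below reproduces the ratios, not the dollar amounts, because the amortization period behind the figures above is not stated. `MONTHLY_COST` is a hypothetical placeholder and `HOURS_PER_MONTH` is an assumption.

```python
HOURS_PER_MONTH = 730  # assumption: 8,760 hours per year / 12


def cost_per_utilized_gpu_hour(monthly_cost: float, gpu_count: int, utilization: float) -> float:
    """Spread a fixed monthly cost over only the GPU-hours that do useful work."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return monthly_cost / (gpu_count * HOURS_PER_MONTH * utilization)


MONTHLY_COST = 10_000  # hypothetical placeholder, not a figure from the article
BASELINE = cost_per_utilized_gpu_hour(MONTHLY_COST, 8, 0.70)

for utilization in (0.70, 0.40, 0.20):
    cost = cost_per_utilized_gpu_hour(MONTHLY_COST, 8, utilization)
    print(f"{utilization:.0%} utilization: {cost / BASELINE:.2f}x the 70% baseline")
```

The multiplier depends only on the utilization ratio, so it holds whatever the cluster actually costs: 0.70 / 0.40 = 1.75 and 0.70 / 0.20 = 3.5.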
Then add the people. MLOps for a production cluster runs 1.0 to 2.0 senior platform engineers at $200,000 to $360,000 loaded, plus 24/7 on-call, plus fine-tuning cycles that eat 2 to 4 weeks per major model version, plus compliance audits that cost $50,000 to $200,000 a year in regulated industries.
None of that appears on a GPU price sheet.
What does GPU hardware cost in mid-2026?
Direct answer: a new H100 SXM runs $25,000 to $35,000, an H200 $35,000 to $45,000, and a B200 $45,000 to $55,000, per July 2026 pricing surveys. Renting is often smarter than buying while HBM supply stays tight.
Cloud spot pricing changes the calculus. H100 spot bottomed near $1.79/hour on Shadeform in July 2026, and B200 spot ran $2.12 to $2.35/hour. On hyperscalers the same H100 costs $3.50 to $14.24/hour depending on instance shape. A fully configured 8-GPU H100 server lands at $300,000 to $400,000.
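One way to read these prices is as a payback period: how many months of rental the purchase price would cover. This is a deliberately crude sketch. It assumes the spot rate holds steady and the GPUs are rented around the clock, and it ignores power, hosting, staff, spot interruptions, and resale value. The function names are illustrative.

```python
HOURS_PER_MONTH = 730  # assumption: 8,760 hours per year / 12


def monthly_rental(gpu_count: int, hourly_rate: float, hours_used_fraction: float = 1.0) -> float:
    """Rental bill for a month, paying only for the fraction of hours actually rented."""
    return gpu_count * hourly_rate * HOURS_PER_MONTH * hours_used_fraction


def payback_months(purchase_price: float, monthly_rent: float) -> float:
    """Months of rental spend needed to equal the purchase price."""
    return purchase_price / monthly_rent


# 8x H100 at the $1.79/hour spot low quoted above, rented 24/7.
spot_rent = monthly_rental(8, 1.79)

for server_price in (300_000, 400_000):
    months = payback_months(server_price, spot_rent)
    print(f"${server_price:,} server vs spot: {months:.1f} months to pay back")
```

At the spot low, an 8-GPU server takes roughly 29 to 38 months of continuous rental to pay for itself on hardware alone. Renting only part of the time (a lower `hours_used_fraction`) stretches that further, which is the case for renting while supply stays tight.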
Supply is the reason to rent. The B200 backlog sits near 3.6 million units, pushing delivery to 2028 to 2029. H100 and H200 still ship on 8 to 16 week lead times, and a used H100 market is emerging as enterprises jump to Blackwell.
The binding constraint underneath all of this is HBM. SK hynix held 71.8% of the HBM market in Q1 2026, and current HBM output supports roughly 3 million AI GPUs a year against demand projected past 10 million by 2027.
Does the SK hynix mega-investment change when I should buy?
Direct answer: not before 2028. SK hynix committed KRW 1,100 trillion (about $710 billion) on June 29, 2026, pulling its Yongin fourth-fab completion forward by 12 years to 2033, but first equipment install targets February 2027 and meaningful HBM output arrives 2028 to 2029.
Samsung answered the same week with a KRW 2,450 trillion total commitment, including KRW 56T earmarked specifically for HBM fabs. Combined with the private sector, South Korea's national AI-chip push tops $880 billion.
For a practitioner the read is simple. If you need capacity in 2026 to 2027, plan around tight supply and premium pricing. If you have a 12 to 24 month runway, eased HBM constraints in 2029 to 2030 may let you buy Blackwell-class hardware cheaper, which argues for renting now and committing capital later.
The open-weight pricing that reset the math
The reason API-first works so well below scale is that open-weight models are now available through APIs at almost throwaway prices.
| Model | Params | Input $/1M | Output $/1M | Released |
|---|---|---|---|---|
| DeepSeek V4-Flash | 13B active | $0.09 | $0.18 | Apr 24, 2026 |
| Qwen3.6-Plus | 32B+ | $0.325 | $1.95 | Apr 2, 2026 |
| Mistral Large 3 | 123B MoE | $0.25 | $0.70 | Dec 2025 |
| Claude Sonnet 5 | frontier | $3.00 | $15.00 | Jun 2026 |
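Per-token prices are easier to compare once they are converted to a per-request cost. The sketch below uses the table's prices with an assumed request shape of 1,000 input and 500 output tokens. That shape is a hypothetical, so substitute your own token counts.

```python
# (input $/1M tokens, output $/1M tokens), copied from the table above.
PRICES = {
    "DeepSeek V4-Flash": (0.09, 0.18),
    "Qwen3.6-Plus": (0.325, 1.95),
    "Mistral Large 3": (0.25, 0.70),
    "Claude Sonnet 5": (3.00, 15.00),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request at the listed per-million-token prices."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def monthly_cost(model: str, requests: int, input_tokens: int = 1_000, output_tokens: int = 500) -> float:
    """Monthly API bill; the default request shape is an assumption, not source data."""
    return requests * request_cost(model, input_tokens, output_tokens)


for model in PRICES:
    print(f"{model:<18} ${monthly_cost(model, 1_000_000):>10,.2f} per 1M requests")
```

With that request shape, a million requests cost about $180 on DeepSeek V4-Flash and about $10,500 on Claude Sonnet 5, a gap of nearly 60x that the routing pattern in the next section is built to exploit.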
One caveat on freshness: DeepSeek's 75% V4-Pro promotion expired May 31, 2026, and the full V4 release set for mid-July 2026 is expected to double peak-hour pricing. DeepSeek is also shifting to Huawei Ascend silicon under US export controls, which matters if you serve the Chinese market and less if you don't.
Meta's Llama 4 Scout and Maverick shipped in March 2026; Llama 5 has not landed as of July 4, 2026.
The hybrid architecture that owns the middle
Most 2026 production coverage still treats this as self-host-or-API. The teams actually saving money run both.
The pattern: a small intent classifier (often a 7B fine-tune) routes about 80% of requests to self-hosted open-weight models and escalates the hard 20% to a frontier API.
Request → Router → Intent Classifier
├─ Simple (80%) → self-hosted Llama 4 / Qwen 3.6 / DeepSeek V4
└─ Complex (20%) → frontier API (Claude Sonnet 5)
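The routing flow above can be sketched in a few lines. Everything here is a stand-in: `classify` is a keyword-and-length heuristic in place of a real fine-tuned intent classifier, and the two handlers are hypothetical stubs where calls to your self-hosted server and your frontier API client would go.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Route:
    name: str
    handler: Callable[[str], str]


def classify(prompt: str) -> str:
    """Hypothetical stand-in for the intent classifier (a 7B fine-tune in the pattern above)."""
    hard_markers = ("prove", "step by step", "analyze", "why")
    text = prompt.lower()
    if len(prompt.split()) > 200 or any(marker in text for marker in hard_markers):
        return "complex"
    return "simple"


def call_self_hosted(prompt: str) -> str:
    # Stub: a real implementation would call the self-hosted inference server here.
    return f"[self-hosted] {prompt[:40]}"


def call_frontier_api(prompt: str) -> str:
    # Stub: a real implementation would call the frontier API client here.
    return f"[frontier] {prompt[:40]}"


ROUTES = {
    "simple": Route("self-hosted", call_self_hosted),
    "complex": Route("frontier", call_frontier_api),
}


def route(prompt: str) -> tuple[str, str]:
    """Return (backend name, response) for one request."""
    chosen = ROUTES[classify(prompt)]
    return chosen.name, chosen.handler(prompt)


print(route("Extract the invoice number from this email"))
print(route("Prove that the scheduler is deadlock-free"))
```

Keeping the backends behind a lookup table is the point of the design: swapping in a newer open-weight model means changing one handler, not the router.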
One widely-cited analysis of open-source LLM ROI reports RouteLLM-style routing holding roughly 95% of GPT-4-class quality at about 26% of the cost. Treat the exact figures as directional: the underlying benchmark was not independently reproducible at publication, and results depend heavily on which frontier model you compare against.
The structural benefit is more durable than the specific percentage. Hybrid also fixes the idle problem, lifting average utilization to 60 to 75% because the API absorbs bursts instead of forcing you to over-provision GPUs for peak.
The stack to run it is mature. vLLM dominates production serving with continuous batching and PagedAttention, which targets the 60 to 80% of KV-cache memory that earlier serving systems lost to fragmentation and over-reservation. TensorRT-LLM and the newer SGLang cover optimized and reasoning-heavy paths. Kubernetes handles autoscaling, and Prometheus plus Grafana cover GPU observability.
What this means for you
Match your strategy to your monthly inference spend, not to a blog's blanket verdict.
| Monthly inference spend | Strategy |
|---|---|
| < $3,000 | API-only. Self-hosting overhead exceeds any savings. |
| $3,000 to $25,000 | Self-host small models (7B to 13B) for cost-sensitive tasks; start routing. |
| $25,000 to $80,000 | Hybrid. Self-host 70B-class open-weight, escalate to frontier API. |
| > $80,000 | Evaluate dedicated infrastructure with full TCO accounting. |
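The table reads naturally as a lookup function. In this sketch the tier labels are shortened from the table, and the handling of the shared $25,000 boundary (assigned to the lower tier) is an assumption, since the table lists it in both rows.

```python
def strategy_for(monthly_spend_usd: float) -> str:
    """Map monthly inference spend to the strategy tiers in the table above."""
    if monthly_spend_usd < 0:
        raise ValueError("spend cannot be negative")
    if monthly_spend_usd < 3_000:
        return "API-only"
    if monthly_spend_usd <= 25_000:  # assumption: boundary goes to the lower tier
        return "Self-host small models (7B to 13B); start routing"
    if monthly_spend_usd <= 80_000:
        return "Hybrid: self-host 70B-class, escalate to frontier API"
    return "Evaluate dedicated infrastructure with full TCO accounting"


for spend in (1_500, 12_000, 50_000, 120_000):
    print(f"${spend:>7,}/month -> {strategy_for(spend)}")
```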
Before you commit capital, pressure-test the budget against the hidden costs. Multiply your hardware line by 1.5 to 2.0x for idle utilization. Budget 1.0 to 2.0 MLOps FTEs.
Add compliance audits if you are in healthcare, finance, or legal. Provision 2 to 3x average capacity for bursts, or offload the burst to an API and skip the over-provisioning.
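That checklist can be run as a quick pressure test. The sketch below applies the idle multiplier to a hardware line, then adds staff and compliance as annual figures spread over twelve months. The $20,000 hardware line is hypothetical, and the staff figures reuse the $200,000 to $360,000 loaded range quoted earlier as a total annual staff cost, which is one reading of that range.

```python
def realistic_monthly_cost(
    hardware_monthly: float,
    idle_multiplier: float = 1.5,
    staff_annual: float = 200_000,
    compliance_annual: float = 0.0,
) -> float:
    """Hardware line scaled for idle capacity, plus people and audits per month."""
    if idle_multiplier < 1:
        raise ValueError("idle_multiplier below 1 would mean better than full utilization")
    return hardware_monthly * idle_multiplier + (staff_annual + compliance_annual) / 12


HARDWARE_LINE = 20_000  # hypothetical monthly hardware-and-power figure

optimistic = realistic_monthly_cost(HARDWARE_LINE, 1.5, 200_000, 0)
pessimistic = realistic_monthly_cost(HARDWARE_LINE, 2.0, 360_000, 200_000)

print(f"Spreadsheet says:  ${HARDWARE_LINE:,.0f}/month")
print(f"Pressure-tested:   ${optimistic:,.0f} to ${pessimistic:,.0f}/month")
```

Even the optimistic case more than doubles the spreadsheet number, mostly because staff cost does not shrink with the size of the cluster.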
The durable move is to build the router first. Whatever the newest cheap open-weight model is when you read this, a routing layer lets you swap it in without re-architecting, and it keeps a frontier escape hatch for the reasoning tasks where open-weight models still trail by 5 to 15%.
Rent GPUs until your volume clears 10M requests/month on a 70B model or 1M on a 7B, then revisit owning hardware when HBM supply loosens in 2029.
Sources
- SK hynix accelerates Yongin fab timeline as HBM strains (DigiTimes)
- SK hynix board approves Yongin Semiconductor Cluster (TweakTown)
- DeepSeek V4-Flash API pricing and benchmarks (OpenRouter)
- Qwen3.6 instruction-tuned LLM overview (Shanghai NYU)
- Mistral Large 3 2512 API pricing (OpenRouter)
- Claude Sonnet 5 intelligence, performance and price (Artificial Analysis)
- NVIDIA B200 GPU specs, pricing, cloud availability 2026 (Inworld)
- GPU inference cost calculator for self-hosted LLMs (AI Economy Hub)
- Open-source vs proprietary LLM ROI and RouteLLM analysis (Agile Leadership Day India)
- SK hynix HBM market share 2026 (Korea Invest Insights)
- DeepSeek V4 full release with mid-July peak-hour pricing (UncensoredHub)
- Samsung and SK clash over investment figures (Chosun)
- South Korea tech giants join $880B AI initiative (LinkedIn)
- GPU and CPU on-device LLM inference economics (arXiv)
- DeepSeek V4 delay shows shift to China chips (Bloomberg)
