When should I switch from a per-token API to self-hosted inference?

The break-even for 70B-class models sits around 30M-50M sustained tokens per day, which maps to roughly 40-60% utilization of a self-hosted H100 node. Below that, per-token providers like Together or Groq win on total cost and engineering attention. Above it, a vLLM deployment on a GPU cloud is usually cheaper, but budget 1-3 full-time engineers to run it.

Which Inference-as-a-Service provider is fastest in 2026?

Groq holds the public latency crown for 70B-class open models. Independent measurement by Artificial Analysis puts its Llama 3.3 70B output speed at roughly 276 tokens per second with time-to-first-token under 0.3 seconds, ahead of Fireworks (~120-150 tok/s) and Together (~95-120 tok/s). For sub-200ms p95 TTFT targets, Groq is effectively the only public managed option.

How much does it cost to serve a 70B model per million tokens?

Independent providers now serve Llama 70B-class models at $0.59-$0.90 per million output tokens (Groq, Together, Fireworks list prices). Equivalent capacity on hyperscaler on-demand GPUs works out to roughly $7.80 per million output tokens before reserved discounts, a 4-10x premium. Frontier proprietary APIs run $10-$75 per million output tokens.

Does quantization really cut inference costs?

Yes, and it is the single biggest per-model lever for self-hosters. Moving a 70B model from FP16 to INT4 cuts memory to a quarter and lifts throughput 1.8-2.5x, letting one H100 outperform two FP16 H100s at about 40% of the cost. The trade-off is a 1-4% quality drop on benchmarks like MMLU, so validate against your own eval suite first.

Why do MoE models like DeepSeek V3 cost so much less to serve?

DeepSeek V3 has 671B total parameters but activates only ~37B per token, so compute per token resembles a 37B dense model while only memory scales with full size. That breaks the linear link between parameter count and cost, which is why DeepSeek's API lists output at $1.10 per million tokens despite frontier-scale capacity.

Inference-as-a-Service in 2026: Cost, Speed, and Scale

The price spread between the cheapest and most expensive way to serve a language model is now 30-80x on output tokens, and it widened through 2025, not narrowed. Groq lists Llama 3.3 70B at $0.79 per million output tokens.

Frontier proprietary APIs charge up to $75 for the same million tokens. And hyperscaler on-demand GPU capacity for an equivalent open model works out to roughly $7.80 per million before reserved discounts.

That spread is the whole story of Inference-as-a-Service in 2026. Picking the wrong tier for your traffic shape is no longer a rounding error. It is the difference between a $35,000 monthly bill and a $1M+ one at enterprise scale.

TL;DR:

70B-class open models now cost $0.59-$0.90 per million output tokens on independent providers (Groq, Together, Fireworks), 4-10x below equivalent hyperscaler on-demand GPU capacity.
Groq holds the public latency record: ~276 output tokens/second on Llama 3.3 70B per Artificial Analysis, with TTFT under 0.3s.
The managed-vs-self-hosted break-even sits around 30M-50M sustained tokens/day for 70B models, roughly 40-60% GPU utilization.
Quantization beats buying GPUs: a 70B model at INT4 on one H100 outperforms FP16 on two H100s at ~40% of the cost.
MoE architecture is the dominant cost lever: DeepSeek V3 activates ~37B of 671B parameters per token, pricing output at $1.10/M.

What is Inference-as-a-Service?

Inference-as-a-Service is the delivery of model predictions as a metered API or managed endpoint, so teams pay per token, per request, or per GPU-hour instead of owning and operating the serving stack themselves. In 2026 it spans three tiers: hyperscaler managed platforms, pure-play model APIs, and raw GPU-cloud primitives.

The market has stratified cleanly. Amazon SageMaker, Google Vertex AI, and Azure Machine Learning sell managed endpoints with deep cloud integration and compliance coverage. Together AI, Fireworks AI, Groq, Replicate, and Baseten sell per-token or per-second access to optimized serving stacks. CoreWeave, Lambda Labs, RunPod, and Modal sell the GPUs and leave the stack to you.

Key takeaways

Default to per-token managed APIs until sustained traffic proves you wrong. Below ~30M tokens/day, self-hosting loses on both cost and attention.
Latency-critical products (p95 TTFT under 200ms) have one public managed option: Groq's LPU.
Context length is the most underestimated cost driver. Each doubling roughly halves the concurrency a fixed GPU can support.
Keep your model portable. A model-on-vLLM contract survives provider switches; a model-on-one-API contract does not.

How much does AI inference actually cost in 2026?

A 70B-class open model is now reachable at $0.50-$0.90 per million output tokens on independent providers, while frontier proprietary models still command $10-$75. For a typical chat workload with a 1:3 input-to-output ratio, that is a 68x blended spread between the cheapest open-weight tier and the most expensive closed tier.

The current list prices for Llama 70B-class serving:

Provider	Model	Input $/M	Output $/M
Groq	Llama 3.3 70B	$0.59	$0.79
Together AI	Llama 3.1 70B	$0.88	$0.88
Fireworks AI	Llama 3.1 70B	$0.90	$0.90
DeepSeek API	DeepSeek V3 (MoE)	$0.27 (cache miss)	$1.10
SageMaker (8xH100, on-demand)	70B self-served	n/a	~$7.80 effective
Frontier proprietary	GPT-4.1 / Claude class	$2.50-$15.00	$10.00-$75.00

The SageMaker figure assumes an ml.p5.48xlarge at roughly $98/hour on-demand running at 80% utilization, per CloudZero's 2026 SageMaker pricing analysis. Reserved capacity and Savings Plans cut 30-60% off that, but the gap to per-token providers remains real. Vertex AI's Llama 4 MaaS tier lands in between, at $0.72/M input and $2.40/M output for a Scout-class model.

Llama 70B-class output token price by provider (mid-2026 list)

The hidden costs that actually dominate the invoice

Pricing-page rates rarely equal the bill. Idle-infrastructure analysis identifies SageMaker, Azure ML, and Vertex as the three largest sources of idle waste in MLOps spend, because per-instance billing runs 24/7 regardless of traffic.

Add egress and cross-AZ transfer (5-15% of total spend for multi-region setups), warm-pool provisioned concurrency that costs about the same as always-on capacity, and token-level observability that can add an unexpected 10-20% line item. Budget for all four.

Who actually wins on inference performance benchmarks?

Groq's LPU is the fastest public option for 70B-class models, and the serving stack now matters more than raw GPU FLOPS. Artificial Analysis, the most-cited independent cross-provider measurement, clocks Groq's Llama 3.3 70B at ~276 output tokens/second. Fireworks lands at 120-150 tok/s, Together at 95-120, and frontier proprietary APIs at 50-90.

Groq's edge comes from architecture, not just tuning. The LPU keeps the model preloaded in SRAM, which is also why its cold starts run under 5 seconds while serverless tiers on Hugging Face and SageMaker sit at 30-60 seconds.

Groq's own 6x speed-up benchmark for Llama on GroqCloud is vendor-stated, but the independent numbers back the latency claim.

At the hardware layer, the MLPerf Inference v5.0 round (April 2025) showed NVIDIA's GB200 NVL72 delivering roughly 30x the throughput of an H200 NVL8 on Llama 3.1 405B at fixed latency, the largest inter-generation jump in MLPerf history. AMD's MI300X submissions matched H200 tokens/sec in some 8-GPU configurations, and its 192 GB of HBM3e makes it a genuine option for 400B+ dense models that won't fit on 80 GB H100s.

For self-hosters, vLLM 0.8.1's V1 engine delivers 2.2-2.8x the throughput of the 0.7.x line, serving Llama 70B at roughly 3,000-3,500 output tokens/second on a single H100 with FP8 weights. SGLang leads on structured-output and multi-turn agentic traffic via RadixAttention, and TensorRT-LLM squeezes out another 10-25% at the cost of model-compile complexity.

Managed API or self-hosted: where is the break-even?

The break-even between per-token APIs and self-hosted inference for a 70B-class model sits at roughly 30M-50M sustained tokens per day, which is about 40-60% utilization of the equivalent GPU capacity. Below that line, self-hosting loses on cost and on engineering attention. Above it, the savings compound fast.

Run the math on a sustained 1B-tokens/day workload. At Together-class rates, the monthly bill is around $35,000. The equivalent self-hosted capacity is 4-6 H100 nodes, and Lambda's published on-demand pricing puts an 8xH100 node near $2.99 per GPU-hour, with RunPod and Vast.ai typically 20-40% cheaper at the cost of more variable capacity, per CloudZero's cloud GPU comparison.

But the dollar figures hide the real constraint: headcount. A production-grade self-hosted deployment runs 1-3 platform engineers full-time depending on the SLA target. If you don't have that capacity to spare, the per-token premium is cheap insurance.

At the top end the calculus flips hard. Enterprise workloads above 100B tokens/day save more than $1M per month self-hosting versus per-token APIs at sustained utilization. Full on-prem ownership breaks even at 60-80% sustained utilization over a 3-year depreciation window, per infrastructure TCO analysis, and the case strengthens with data-sovereignty requirements or cheap power.

There's a middle path. Baseten's Google Cloud case study reports a vendor-stated 225% cost-performance improvement over baseline configurations through stack-level tuning, for teams that want custom inference engineering without the ops burden. Modal offers serverless GPUs with 5-15 second cold starts for bursty workloads.

Quantization and MoE change the math more than your provider does

Before adding GPUs or switching platforms, quantize. Moving a 70B model from FP16 to FP8 halves memory and lifts throughput 1.4-1.8x with negligible quality loss on most evals. INT4 quarters memory and adds 1.8-2.5x throughput, with a 1-4% MMLU drop you must validate against your own workload.

The practical consequence: a 70B model at INT4 on a single H100 beats the same model at FP16 across two H100s in most batched-serving scenarios, at 40% of the cost. vLLM shipped FP8 support back in 2024, and NVIDIA's NVFP4 format pushes Blackwell hardware to 2-3x throughput with NVIDIA-claimed quality parity with FP8.

Mixture-of-Experts breaks the parameter-count-to-cost relationship entirely. DeepSeek V3 carries 671B total parameters but activates only ~37B per token, which is why its API lists output at $1.10/M while a 405B dense model costs $2-$3/M on Together or Fireworks.

One more constraint that bites teams late: context length. A 70B model at 32K context needs roughly 2.8 GB of KV cache per request at FP16, so a single H100 supports only about 10 concurrent requests.

Every doubling of context roughly halves your concurrency on fixed hardware. Validate against your real prompt distribution, not your test suite.

What this means for you

Prototyping or under 1M tokens/day: per-token APIs, full stop. Groq for latency, Together for price, Fireworks for function calling.

Growth stage (1M-1B tokens/day): stay on per-token with reserved-capacity contracts, or run one GPU-cloud node on vLLM once you cross ~30M sustained tokens/day. vLLM is the default production stack for good reason: largest community, FP8/INT4/NVFP4 support, and throughput leadership on H100/H200.

Platform teams (1B-10B tokens/day): hyperscaler reserved capacity for baseline, per-token APIs for spikes and evals, KServe on Kubernetes as the serving layer. AWS claims its latest SageMaker features cut deployment costs by 50% on average; treat that as vendor-stated and benchmark your own workload.

Regulated workloads: on-prem or VPC-isolated self-hosting is a requirement, not an optimization. VLLM with FP8 quant on owned H100/H200 hardware is the proven recipe.

And one strategic note: the 2024-2026 price war has reset per-token rates downward 3-5x, and there's no sign it has stopped. If your self-hosted cluster payback runs 12-24 months, model the scenario where managed-API prices keep falling underneath you.

The most durable position is a portable one: your model on vLLM, deployable anywhere, with per-token APIs as the overflow valve.

Sources

New AI Inference Speed Benchmark for Llama 3.3 70B (Groq), vendor benchmark context for Groq's LPU latency claims.
Performance boosts in vLLM 0.8.1: the V1 engine (Red Hat), 2.2-2.8x throughput gains in vLLM's V1 engine.
MLPerf results coverage (TechPowerUp), GB200 NVL72 generational throughput jump.
DeepSeek API pricing, official MoE per-token rates including cache-hit pricing.
Llama 4 as MaaS on Vertex AI (Google), Vertex per-token MaaS pricing.
SageMaker Pricing Guide 2026 (CloudZero), hyperscaler GPU-hour cost breakdown.
Cloud GPU Pricing Comparison (CloudZero), AWS vs. Azure vs. GCP GPU rates.
Serving Llama 3.1 on Lambda Cloud (Lambda docs), GPU-cloud node pricing reference.
How Baseten achieves 225% better cost-performance (Google Cloud), vendor-stated stack-tuning gains.
Introducing NVFP4 (NVIDIA), Blackwell-era 4-bit quantization format.
vLLM brings FP8 inference to open source (Red Hat), FP8 quantization support in vLLM.
The Hidden Cost of Idle AI/ML Infrastructure, idle-capacity waste across SageMaker, AML, Vertex.
Cloud vs. On-Prem GPUs: The 3-Year Cost Dilemma, on-prem break-even utilization analysis.
Combining KServe and llm-d (Red Hat), Kubernetes-native serving for generative inference.
TensorRT-LLM (NVIDIA GitHub), highest single-batch decode throughput, at compile-time cost.

The Rise of Inference-as-a-Service: Cost, Performance, and Scalability in 2026