A 120 kW rack can be cheap or ruinously expensive depending on how many billable tokens it pushes before it hits a latency wall. The short answer: as of June 20, 2026, AI inference TCO is a tokens-per-dollar and tokens-per-megawatt problem, because power, memory, networking, and serving software decide usable throughput.
TL;DR: Blackwell has the cleanest published inference economics today, especially where NVIDIA’s FP4 and disaggregated serving stack applies. AMD MI400 has the most interesting memory-capacity story but no AMD-published $/M-token benchmark yet. Trainium is credible for AWS-native fleets, but its economics are easiest to defend through Bedrock and Neuron-supported models rather than broad third-party benchmarks.
AI infrastructure buyers should stop ranking chips by peak FLOPS. The useful question is simpler: at your latency target, under your power cap, what is the cost per million tokens after software, utilization, and cooling are included?
AI Inference TCO 2026 Starts With Tokens Per Dollar
A practical inference TCO model has four moving parts: token throughput, serving latency, power budget, and amortized platform cost.
Peak FP4 or FP8 numbers matter only when the software can keep the accelerator fed. Long-context prefill, decode-heavy agent loops, KV cache movement, and MoE routing can all turn impressive math into idle silicon.
The cleanest unit for comparing platforms is cost per million generated or processed tokens at a stated model, precision, and latency target. The second-best unit is tokens per megawatt, because power availability is now a procurement constraint, not a facilities footnote.
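To make "cost per million tokens at a stated latency target" concrete, here is a minimal Python sketch. It assumes you already have a few measured operating points for one model and precision. The `BenchmarkPoint` type, the sample measurements, and the hourly cost are all hypothetical placeholders, not vendor data.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class BenchmarkPoint:
    """One measured operating point for a fixed model and precision (hypothetical)."""
    concurrency: int           # simultaneous requests during the run
    p95_latency_ms: float      # measured latency at this concurrency
    tokens_per_second: float   # sustained throughput at this concurrency


def cost_per_million_tokens(hourly_cost_usd: float, tokens_per_second: float) -> float:
    """Convert an hourly platform cost and sustained throughput into $/M tokens."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_cost_usd / (tokens_per_hour / 1_000_000)


def best_point_under_slo(points: Sequence[BenchmarkPoint],
                         latency_slo_ms: float) -> Optional[BenchmarkPoint]:
    """Return the highest-throughput point that still meets the latency target."""
    eligible = [p for p in points if p.p95_latency_ms <= latency_slo_ms]
    if not eligible:
        return None
    return max(eligible, key=lambda p: p.tokens_per_second)


# Placeholder measurements, chosen only to illustrate the shape of the trade-off.
points = [
    BenchmarkPoint(concurrency=8, p95_latency_ms=40.0, tokens_per_second=5_000.0),
    BenchmarkPoint(concurrency=32, p95_latency_ms=90.0, tokens_per_second=15_000.0),
    BenchmarkPoint(concurrency=128, p95_latency_ms=250.0, tokens_per_second=30_000.0),
]
HOURLY_COST_USD = 54.0  # placeholder all-in hourly cost for the system under test

for slo_ms in (50.0, 100.0, 300.0):
    point = best_point_under_slo(points, slo_ms)
    if point is None:
        print(f"SLO {slo_ms:.0f} ms: no operating point qualifies")
        continue
    cost = cost_per_million_tokens(HOURLY_COST_USD, point.tokens_per_second)
    print(f"SLO {slo_ms:.0f} ms: concurrency {point.concurrency}, ${cost:.2f}/M tokens")
```

The same hardware yields a different $/M-token figure at each latency target, which is why a cost claim without a stated SLO is hard to compare.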
NVIDIA has leaned hard into this framing. Its April 2026 enterprise TCO page says B200 reaches $0.02 per million tokens on GPT-OSS-120B, while GB300 NVL72 reaches $0.04 per million tokens on DeepSeek-R1, according to NVIDIA’s Blackwell inference TCO analysis, which cites SemiAnalysis InferenceX results.
That does not automatically make Blackwell cheapest for every endpoint. It does make NVIDIA the only vendor in this comparison with several current, model-specific, vendor-published $/M-token claims.
The Numbers That Matter
| Platform | Status as of June 20, 2026 | Published inference economics | Power or scale signal | Main caveat |
|---|---|---|---|---|
| NVIDIA B200 / GB200 / GB300 | Shipping Blackwell and Blackwell Ultra systems | B200: $0.02/M tokens on GPT-OSS-120B; GB300: $0.04/M tokens on DeepSeek-R1, per NVIDIA citing SemiAnalysis | GB200 NVL72: 120 kW rack estimate; GB300 NVL72: 72 B300 GPUs | NVIDIA does not publish list pricing for GPUs or full racks |
| AMD MI400 / MI450 Helios | Announced; first deployments scheduled H2 2026 | No AMD-published $/M-token figure for MI400 as of June 20, 2026 | Helios: 72 GPUs, about 31 TB HBM4, 2.9 FP4 exaFLOPS from AMD disclosures reported by third parties | TCO is forward-looking until production benchmarks ship |
| AWS Trainium2 / Trainium3 | Trn2 GA Dec. 2024; Trn3 GA Dec. 2025 | Trn2: AWS claims 30-40% better price-performance than P5e/P5en; Bedrock Claude 3.5 Haiku latency-optimized at $1/M input and $5/M output | Trn3: AWS claims 5x more output tokens/MW than Trn2 on Bedrock | Best evidence is AWS-published and model-support dependent |
These numbers are reference points, not a universal leaderboard. GPT-OSS-120B, DeepSeek-R1, and Bedrock Claude 3.5 Haiku exercise different serving paths.
But they show the shape of the market: the chip race is being re-priced around model-specific throughput and power efficiency.
What Changed In Blackwell vs MI400 Inference Cost?
NVIDIA’s strongest case is software compounding on installed hardware.
NVIDIA says TensorRT-LLM and Dynamo delivered a 5x reduction in cost per token on B200 within two months of GPT-OSS-120B launch with no hardware change, according to its enterprise accelerator TCO page. That is the metric operators should watch. A rack that gets cheaper every few releases changes the depreciation curve.
The Blackwell stack also has the most complete current rack-scale inference story. GB200 NVL72 combines 72 B200 GPUs and 36 Grace CPUs in one NVLink domain, while NVIDIA’s GB300 NVL72 page positions the Blackwell Ultra (B300) generation as a fully liquid-cooled rack-scale system.
NVIDIA’s Dynamo announcement matters because modern inference is not just batching. Disaggregated serving, KV-aware routing, and transfer libraries decide whether prefill and decode scale independently or block each other.
The pricing caveat is real. NVIDIA does not publish a Blackwell list price, and third-party estimates put GB200 NVL72 racks in the multimillion-dollar range, with no clean public SKU pricing to anchor them. Treat rack capex as a range unless you have a quote.
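One way to carry a capex range through a TCO model is to amortize both bounds into an hourly cost. The sketch below assumes straight-line amortization. The capex bounds, service lives, and 70% utilization are placeholders for illustration, not estimates of any real rack price.

```python
HOURS_PER_YEAR = 8760


def amortized_hourly_cost(capex_usd: float, years: float, utilization: float = 1.0) -> float:
    """Straight-line capex per productive hour.

    utilization is the fraction of hours the rack does billable work, so
    lower utilization raises the cost carried by each productive hour.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return capex_usd / (years * HOURS_PER_YEAR * utilization)


# Placeholder capex bounds; replace with a real quote when available.
CAPEX_LOW_USD = 3_000_000.0
CAPEX_HIGH_USD = 4_000_000.0

for years in (3, 5):
    low = amortized_hourly_cost(CAPEX_LOW_USD, years, utilization=0.7)
    high = amortized_hourly_cost(CAPEX_HIGH_USD, years, utilization=0.7)
    print(f"{years}-year life: ${low:,.2f} to ${high:,.2f} per productive hour")
```

Carrying both bounds forward keeps the final $/M-token figure honest about how much of it rests on an unquoted price.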
Where AMD MI400 Changes The Model
AMD MI400 is the credible challenger because it attacks the memory side of inference.
AMD’s OpenAI agreement covers a 6 GW partnership, with the first 1 GW deployment of AMD Instinct MI450 GPUs starting in the second half of 2026, according to the AMD and OpenAI press release. Meta has a similar reported 6 GW MI450-architecture commitment, with first deployment also scheduled for H2 2026, according to CNBC TV18 coverage.
The hardware pitch is straightforward. MI400-class GPUs are reported at 432 GB HBM4 and roughly 20 TB/s memory bandwidth, with Helios racks combining 72 MI400-series GPUs and 18 EPYC Venice CPUs, according to ServeTheHome’s Helios coverage and AMD disclosure summaries.
That memory footprint can matter more than peak math for large models, high concurrency, and long-context serving. If you can keep more weights and KV state close to the accelerator, you may reduce sharding overhead and network pressure.
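A rough capacity estimate shows why HBM size can dominate sharding decisions. The sketch below uses the standard KV cache sizing for multi-head or grouped-query attention (two tensors per layer, one key and one value). The model dimensions, the concurrency, and the 192 GB comparison point are hypothetical. Only the 432 GB figure comes from the reporting above. It counts capacity only and ignores activations, fragmentation, and bandwidth.

```python
import math

GB = 1e9  # decimal gigabytes, matching how HBM capacity is usually quoted


def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_value: float) -> float:
    """KV cache bytes per token for standard multi-head or grouped-query attention."""
    # Factor of 2: each layer stores one key and one value vector per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def min_gpus_for_memory(total_bytes: float, hbm_bytes_per_gpu: float,
                        usable_fraction: float = 0.9) -> int:
    """Smallest GPU count whose usable HBM holds total_bytes (capacity only)."""
    return math.ceil(total_bytes / (hbm_bytes_per_gpu * usable_fraction))


# Hypothetical dense model; none of these values describe a real product.
weights = 400e9 * 1.0                    # 400B parameters at 1 byte each
per_token = kv_cache_bytes_per_token(100, 8, 128, 1.0)
live_tokens = 64 * 131_072               # 64 concurrent 128K-token contexts
total = weights + per_token * live_tokens

for hbm_gb in (192, 432):                # 192 GB is an arbitrary comparison point
    gpus = min_gpus_for_memory(total, hbm_gb * GB)
    print(f"{hbm_gb} GB HBM per GPU: at least {gpus} GPUs for {total / GB:,.0f} GB")
```

Fewer GPUs per model replica generally means less tensor-parallel traffic, though the real saving depends on the interconnect and the serving stack.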
The missing piece is also straightforward: AMD has not published an MI400 cost-per-million-token number as of June 20, 2026. ROCm 7 points in the right direction, with support for SGLang, vLLM, llm-d, FlashInfer, and AMD Quark, according to AMD’s ROCm 7 announcement, but MI400-specific production inference evidence has to arrive with the systems.
How Should You Model Trainium2 Inference TCO?
Trainium belongs in a different column from NVIDIA and AMD because AWS controls the silicon, cloud environment, and much of the serving surface.
Trainium2 became generally available on December 3, 2024, according to the AWS News Blog launch post. AWS says Trn2 delivers 30-40% better price-performance than P5e and P5en for supported AI workloads.
Trainium3 UltraServers became generally available in December 2025. AWS says Trn3 offers 4.4x more compute performance, 4x greater energy efficiency, and 4x more memory bandwidth than Trainium2 UltraServers, according to its Trn3 GA notice.
The most useful Trainium inference claim is power-normalized. AWS says Bedrock on Trainium3 is 3x faster than Trainium2 and delivers 5x more output tokens per megawatt at similar per-user latency, in the same Trn3 GA notice.
The practical constraint is model support. The current Neuron SDK is Neuron 2.30.0, released May 2026, according to AWS Neuron documentation. NxD Inference supports Llama 4 tutorials and a beta tutorial for GPT-OSS 120B on Trn3, but the primary AWS documentation reviewed for this article did not confirm DeepSeek-R1 production inference support.
For buyers, Trainium2 inference TCO is easiest to justify when the workload is already inside AWS and the model path is blessed by Neuron or Bedrock. The clean per-token reference is Bedrock Claude 3.5 Haiku latency-optimized on Trainium2 at $1 per million input tokens and $5 per million output tokens, from the Anthropic/AWS announcement.
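Because input and output tokens are priced differently, the effective per-token cost depends on the prompt-to-completion ratio. The sketch below applies the two published rates above to a hypothetical request shape. The token counts are placeholders.

```python
INPUT_USD_PER_M = 1.0    # $ per million input tokens, from the figure above
OUTPUT_USD_PER_M = 5.0   # $ per million output tokens, from the figure above


def request_cost_usd(input_tokens: int, output_tokens: int,
                     input_rate: float = INPUT_USD_PER_M,
                     output_rate: float = OUTPUT_USD_PER_M) -> float:
    """Cost of one request under separate input and output per-million rates."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


def blended_usd_per_million(input_tokens: int, output_tokens: int) -> float:
    """Effective $/M tokens across input and output for a given request shape."""
    total_tokens = input_tokens + output_tokens
    return request_cost_usd(input_tokens, output_tokens) / total_tokens * 1_000_000


# Hypothetical request shapes: a long-prompt lookup and a long-answer generation.
for name, tokens_in, tokens_out in (("prompt-heavy", 4_000, 500),
                                    ("output-heavy", 500, 4_000)):
    blended = blended_usd_per_million(tokens_in, tokens_out)
    print(f"{name}: ${request_cost_usd(tokens_in, tokens_out):.4f} per request, "
          f"${blended:.2f}/M blended")
```

A blended figure like this is the number to line up against self-hosted $/M-token estimates, since those rarely separate input from output.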
What About PUE And The Data Center Inference Power Budget?
PUE is a multiplier on every token.
If your accelerator rack draws 120 kW and your facility runs at 1.2 PUE, the site-level draw is 144 kW before you argue about utilization. If the same rack runs in a 1.35 PUE environment, the site-level draw becomes 162 kW.
That difference compounds across thousands of racks. It also makes tokens per megawatt a better boardroom metric than FLOPS per chip.
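The PUE arithmetic scales linearly, which a few lines of Python make easy to check at fleet level. The 120 kW rack and the two PUE values come from the example above. The fleet size and electricity price are placeholders, and the sketch assumes racks draw full power all year.

```python
HOURS_PER_YEAR = 8760


def site_power_kw(rack_it_power_kw: float, pue: float) -> float:
    """Facility-level draw for one rack: IT load multiplied by PUE."""
    return rack_it_power_kw * pue


def annual_energy_cost_usd(site_kw: float, usd_per_kwh: float,
                           hours: float = HOURS_PER_YEAR) -> float:
    """Energy cost for a constant draw over the given number of hours."""
    return site_kw * hours * usd_per_kwh


RACK_IT_KW = 120.0    # rack figure used in the example above
RACKS = 1_000         # hypothetical fleet size
USD_PER_KWH = 0.08    # placeholder electricity price

for pue in (1.2, 1.35):
    per_rack_kw = site_power_kw(RACK_IT_KW, pue)
    fleet_mw = per_rack_kw * RACKS / 1000
    cost = annual_energy_cost_usd(per_rack_kw * RACKS, USD_PER_KWH)
    print(f"PUE {pue}: {per_rack_kw:.0f} kW per rack, "
          f"{fleet_mw:.0f} MW fleet, ${cost:,.0f} per year")
```

At this hypothetical fleet size the gap between the two PUE values is 18 MW of site power that produces no additional tokens.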
Blackwell’s GB200 rack power estimate of 120 kW appears in third-party deployment coverage, including Introl’s GB200 NVL72 deployment analysis. Trainium3 is also moving into high-density liquid cooling territory, with DataCenterDynamics reporting an AWS executive estimate of roughly 1 kW per Trainium3 chip and cold-plate cooling plans in its Trainium3 power report.
The buyer question is no longer “can I fit the racks?” It is “can I keep enough of the racks busy at my latency target to justify the megawatts?”
Best Choice If...
| Best choice if... | Likely platform | Why |
|---|---|---|
| You need the strongest published $/M-token evidence today | Blackwell / GB300 | NVIDIA has model-specific cost-per-token claims and a mature inference software stack |
| You are building a memory-heavy frontier inference fleet for H2 2026 and can co-engineer software | AMD MI400 / Helios | 432 GB HBM4 per GPU and open rack strategy are compelling, but benchmarks are pending |
| You are AWS-native and can stay inside Bedrock or Neuron-supported paths | Trainium2 / Trainium3 | AWS controls pricing, capacity, and serving integration |
| You are power-constrained before you are capex-constrained | Compare tokens/MW first | A cheap accelerator that underutilizes power is expensive in production |
| You serve many model families with frequent architecture changes | Blackwell today | CUDA, TensorRT-LLM, Dynamo, and NIM reduce porting risk |
Risks And Caveats
The Blackwell numbers are strong, but many are NVIDIA-published and cite SemiAnalysis. They are more useful than vague FLOPS claims, but buyers should still reproduce the workload with their prompt mix, context length, batch policy, and latency SLO.
AMD MI400 is real enough for OpenAI, Meta, and Oracle to commit capacity, but it is not generally shipping at scale as of June 20, 2026. Any Blackwell vs MI400 inference cost model that assigns AMD a precise $/M-token value is making an assumption.
Trainium’s risk is portability. If your model is supported by Neuron and your organization is comfortable with AWS capacity planning, the economics can work. If you need broad framework compatibility, CUDA remains the lower-friction path.
What This Means For You
Build your TCO spreadsheet around tokens, watts, and utilization.
Use this minimum model:
effective_cost_per_M_tokens =
    (accelerator_hourly_cost + power_cost + cooling_cost + networking_storage_overhead)
    / (sustained_tokens_per_hour / 1,000,000)

site_power_kW =
    rack_IT_power_kW * facility_PUE

tokens_per_MW =
    sustained_tokens_per_second / (site_power_kW / 1000)
Then run it three times: prefill-heavy, decode-heavy, and mixed agent workload. The winning chip can change when the workload shifts from bulk throughput to interactive reasoning.
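The minimum model can be sketched in Python as follows. This is one possible reading of the formulas, not a definitive implementation. All costs are per hour, cooling is approximated as the PUE overhead above IT load (a simplification, since PUE also includes other facility losses), and every number in the example is a placeholder. The `Platform` and `Scenario` types are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Platform:
    accelerator_hourly_cost: float       # USD per hour, amortized or rented
    networking_storage_overhead: float   # USD per hour
    rack_it_power_kw: float
    facility_pue: float
    usd_per_kwh: float


@dataclass(frozen=True)
class Scenario:
    name: str
    sustained_tokens_per_second: float


def site_power_kw(p: Platform) -> float:
    return p.rack_it_power_kw * p.facility_pue


def effective_cost_per_m_tokens(p: Platform, s: Scenario) -> float:
    power_cost = p.rack_it_power_kw * p.usd_per_kwh
    # Assumption: everything above IT load is treated as cooling cost.
    cooling_cost = p.rack_it_power_kw * (p.facility_pue - 1.0) * p.usd_per_kwh
    hourly = (p.accelerator_hourly_cost + power_cost + cooling_cost
              + p.networking_storage_overhead)
    tokens_per_hour = s.sustained_tokens_per_second * 3600
    return hourly / (tokens_per_hour / 1_000_000)


def tokens_per_second_per_mw(p: Platform, s: Scenario) -> float:
    return s.sustained_tokens_per_second / (site_power_kw(p) / 1000)


# Placeholder inputs; substitute measured throughput and quoted costs.
rack = Platform(accelerator_hourly_cost=300.0, networking_storage_overhead=20.0,
                rack_it_power_kw=120.0, facility_pue=1.2, usd_per_kwh=0.08)
scenarios = [
    Scenario("prefill-heavy", 1_000_000.0),
    Scenario("decode-heavy", 200_000.0),
    Scenario("mixed agent", 400_000.0),
]

for s in scenarios:
    cost = effective_cost_per_m_tokens(rack, s)
    density = tokens_per_second_per_mw(rack, s)
    print(f"{s.name}: ${cost:.3f}/M tokens, {density:,.0f} tokens/s per MW")
```

Running the same `Platform` against each `Scenario` makes the workload sensitivity visible before any vendor comparison starts.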
For June 2026 procurement, treat Blackwell as the default benchmark, MI400 as the H2 2026 challenger to watch, and Trainium as the AWS-native control case.
Bottom Line
In 2026, the AI inference TCO race is being won by platforms that turn scarce power into low-latency tokens, not by the largest peak FLOPS number. Blackwell has the best current public $/M-token evidence; MI400 has the strongest pending memory-capacity challenge; Trainium is credible where AWS controls the software path.
The next number to watch is not another PFLOPS figure. Watch independently reproduced cost per million tokens at a named model, named latency target, and named power envelope.
Sources
- NVIDIA Perspectives: Enterprise TCO Inference Cost
- NVIDIA Perspectives: Inference Cost Curve FAQ
- NVIDIA Blog: Blackwell InferenceMAX Benchmark Results
- NVIDIA Developer Blog: Introducing Dynamo
- AMD and OpenAI 6 GW Partnership
- ServeTheHome: AMD Helios Rack-Scale Systems
- AWS News Blog: Trn2 Instances and UltraServers GA
- AWS Trainium Landing Page
- AWS Trn3 UltraServers GA Notice
- AWS Neuron Documentation
- Introl: GB200 NVL72 Deployment
- DataCenterDynamics: Trainium3 Power and Cold-Plate Cooling
