Question 1

What is a GPU compute capacity ledger?

Accepted Answer

A GPU compute capacity ledger is a sourcing reference that maps where buyers can procure GPU capacity — hyperscalers, GPU neoclouds, bare-metal providers, and hosted appliances — against the capability surfaces that matter for procurement: deployment model, capacity type (reserved, spot, dedicated), reservation programs, regional footprint, and interconnect. It is a sourcing matrix, not a live inventory feed; current availability and pricing must be verified with each provider.

Question 2

How do I secure guaranteed GPU capacity for training?

Accepted Answer

Use a reserved-capacity program rather than on-demand or spot. AWS Capacity Blocks for ML and Google Cloud Future Reservations let you reserve whole-instance blocks for a future window. Bare-metal or dedicated hosts (OCI, CoreWeave committed use) give predictable capacity at higher commit cost. Spot and preemptible capacity is cheaper but not guaranteed and should not backstop a fixed training deadline.

Question 3

Cloud-managed GPU vs. bare-metal GPU — when does each win?

Accepted Answer

Cloud-managed instances win for elasticity, integration with managed storage and networking, and teams without data-center operations. Bare-metal or dedicated capacity wins when you need full host control, predictable isolation from other tenants, custom networking, or you are amortizing a long-running cluster and want to avoid per-hour cloud markups. OCI, CoreWeave, and self-hosted DGX sit on the bare-metal side.

Question 4

Are GPU neoclouds (CoreWeave, Lambda) cheaper than hyperscalers?

Accepted Answer

They often quote lower per-GPU-hour rates for H100/H200 capacity, but total cost depends on reservation terms, egress, storage, and regional footprint. Neoclouds typically have fewer regions and smaller managed-service surfaces than AWS, GCP, or Azure. We do not publish pricing comparisons unless we have collected and verified the data — request current pricing directly and model it against your workload, not a sticker rate.
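To make "model it against your workload" concrete, here is a minimal sketch of a total-cost comparison. The provider names, rates, and workload figures are all hypothetical placeholders, not collected pricing; replace them with quotes you have verified yourself. The cost model (compute plus egress plus storage) is also an assumption and ignores reservation terms and discounts.

```python
from dataclasses import dataclass


@dataclass
class Quote:
    """One provider quote. All figures are placeholders, not real prices."""
    provider: str
    gpu_hour_rate: float         # USD per GPU-hour (the sticker rate)
    egress_per_gb: float         # USD per GB transferred out
    storage_per_gb_month: float  # USD per GB-month


def total_cost(q: Quote, gpus: int, hours: float, egress_gb: float,
               storage_gb: float, months: float) -> float:
    """Sum compute, egress, and storage cost for one workload."""
    compute = q.gpu_hour_rate * gpus * hours
    egress = q.egress_per_gb * egress_gb
    storage = q.storage_per_gb_month * storage_gb * months
    return compute + egress + storage


# Hypothetical quotes: provider_a has the lower sticker rate.
quotes = [
    Quote("provider_a", gpu_hour_rate=2.00, egress_per_gb=0.10, storage_per_gb_month=0.05),
    Quote("provider_b", gpu_hour_rate=2.50, egress_per_gb=0.00, storage_per_gb_month=0.02),
]

# Hypothetical workload: 8 GPUs for 100 hours, 20 TB egress, 10 TB stored for a month.
workload = dict(gpus=8, hours=100, egress_gb=20_000, storage_gb=10_000, months=1)

ranked = sorted(quotes, key=lambda q: total_cost(q, **workload))
for q in ranked:
    print(f"{q.provider}: ${total_cost(q, **workload):,.2f}")
```

With these placeholder numbers, the provider with the higher per-GPU-hour rate comes out cheaper once egress and storage are included, which is the reason to compare on total cost rather than sticker rate.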

Question 5

What about non-NVIDIA silicon (Trainium2, TPU, MI300)?

Accepted Answer

AWS Trainium2 and Google TPU v5e/v5p are hyperscaler-native accelerators that bypass NVIDIA per-GPU pricing but lock you to that cloud and require code changes. AMD MI300 is available on some clouds and bare-metal. These are real capacity options for cost-constrained training or inference, but they trade portability for unit economics — evaluate against your framework support and migration cost, not headline FLOPS.

Question 6

How do I compare regional GPU availability across providers?

Accepted Answer

Regional availability can change from week to week and is not published as a stable dataset. The practical method is to query each provider's capacity or reservation API for your target instance type and region, then rank the results by the earliest guaranteed window. This ledger marks multi-region breadth as a capability surface (yes/partial/no), not a live availability count; verify current regional coverage with each provider before committing to an architecture.
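The ranking step described above could look roughly like the sketch below. It assumes you have already collected each provider's response into a common record; the `Offer` type, provider names, regions, instance labels, and dates are all hypothetical, and no real provider API is called here.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Offer:
    """Normalized result of one capacity query (hypothetical shape)."""
    provider: str
    region: str
    instance_type: str
    earliest_start: Optional[datetime]  # None means no guaranteed window was offered


def rank_by_earliest_window(offers: List[Offer], instance_type: str) -> List[Offer]:
    """Keep offers for the instance type that have a guaranteed window, earliest first."""
    usable = [o for o in offers
              if o.instance_type == instance_type and o.earliest_start is not None]
    return sorted(usable, key=lambda o: o.earliest_start)


# Hypothetical query results; in practice each entry comes from a provider query.
offers = [
    Offer("provider_a", "region-1", "h100x8", datetime(2025, 3, 10)),
    Offer("provider_b", "region-2", "h100x8", datetime(2025, 3, 3)),
    Offer("provider_c", "region-1", "h100x8", None),
    Offer("provider_a", "region-3", "h200x8", datetime(2025, 3, 1)),
]

for o in rank_by_earliest_window(offers, "h100x8"):
    print(o.provider, o.region, o.earliest_start.date())
```

Offers with no guaranteed window are dropped rather than ranked last, on the assumption that only guaranteed capacity should back a committed architecture.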

Provider	Cloud-managed	Bare-metal / dedicated	Spot / preemptible	Reserved capacity program	Multi-region	NVLink / NVSwitch fabric	Notes
AWS (EC2 P5 / Trainium2)	✓	~	✓	✓	✓	✓	P5 H100/H200 instances on NVSwitch; Capacity Blocks for ML reserve whole-instance blocks. Trainium2 is AWS-native silicon.
Google Cloud (A3/A4, TPU)	✓	~	✓	✓	✓	✓	A3 Mega/A4 VMs on NVSwitch; Future Reservations + capacity reservations. TPU v5e/v5p is Google-native.
Microsoft Azure (NDv5)	✓	~	✓	~	✓	✓	ND H100 v5 / H200 on NVSwitch; standard reserved-instance pricing, not capacity blocks.
Oracle Cloud (OCI)	✓	✓	~	~	~	✓	BM.GPU.H100.8 bare-metal H100; fewer AI regions than the top three.
CoreWeave	✓	~	~	✓	✓	✓	GPU neocloud; committed-use discounts; US + EU regions.
Lambda	✓	~	~	~	~	✓	GPU neocloud; on-demand + reserved H100/H200; smaller regional footprint.
NVIDIA DGX Cloud	✓	—	—	~	~	✓	Hosted DGX on partner clouds (AWS/GCP/Azure/OCI); capacity via partner programs.
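
One way to use the matrix is to encode it as data and shortlist providers by the surfaces you require. The sketch below transcribes the rows above, reading the three marks as yes, partial, and no (an interpretation based on the yes/partial/no wording in Question 6). The column keys and helper names are illustrative, not part of the ledger.

```python
COLUMNS = ("cloud_managed", "bare_metal", "spot", "reserved", "multi_region", "nvlink")

# Rows transcribed from the table above, one letter per column in COLUMNS order.
# Y = check mark (yes), P = tilde (partial), N = dash (no).
LEDGER = {
    "AWS": "YPYYYY",
    "Google Cloud": "YPYYYY",
    "Microsoft Azure": "YPYPYY",
    "Oracle Cloud (OCI)": "YYPPPY",
    "CoreWeave": "YPPYYY",
    "Lambda": "YPPPPY",
    "NVIDIA DGX Cloud": "YNNPPY",
}


def supports(provider: str, column: str, allow_partial: bool = False) -> bool:
    """True if the provider is marked yes (or partial, if allowed) for the column."""
    mark = LEDGER[provider][COLUMNS.index(column)]
    return mark == "Y" or (allow_partial and mark == "P")


def shortlist(required, allow_partial: bool = False):
    """Providers that satisfy every required column, in ledger order."""
    return [p for p in LEDGER
            if all(supports(p, c, allow_partial) for c in required)]


print(shortlist(["reserved", "multi_region"]))
print(shortlist(["bare_metal"]))
```

A shortlist produced this way is only a starting point for outreach; as with the table itself, current availability and terms still need to be verified with each provider.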

The GPU compute capacity ledger
