What is the KV cache and why does it make long-context inference expensive?

The KV cache stores key/value tensors for every prior token so attention can avoid recomputation. It grows linearly with context length and batch size, so a 100k-token window can require ~16 GB of cache reads per decoded token across ~80 layers, on top of the constant weight load. Under long context and batching, the cache, not the weights, becomes the dominant bandwidth consumer.

How much can quantization and speculative decoding close the memory bandwidth gap?

FP8/INT8 quantization gives roughly 1.5-2.6x bandwidth reduction, speculative decoding gives 2-15x throughput gains for single-stream decode but little benefit under batching, and pruning/distillation cuts traffic 40-50% but needs retraining. Stacked optimistically these yield 3-5x relief against a ~49x gap, so software alone cannot close the wall.

When does memory bandwidth dominate inference cost in practice?

Bandwidth dominates for autoregressive generation at small batch, context over ~4,096 tokens, models above ~30B parameters, or latency-sensitive serving. If GPU utilization sits at 20-40% under load, you are bandwidth-bound; above ~70% you are compute-bound and more bandwidth buys little.

Which 2026 hardware targets the inference memory wall?

NVIDIA's Groq 3 LPU (March 2026) ships 500 MB of on-die SRAM at 150 TB/s, Fractile raised $220M in May 2026 for compute-in-memory silicon, and XCENA raised $135M the same month for compute-near-memory over CXL. Cerebras, SambaNova, and Intel's LPDDR5X Crescent Island take distinct bets on SRAM, dataflow, and commodity memory respectively.

Why Memory Bandwidth, Not Compute, Is the LLM Inference Bottleneck

Q: Why is LLM inference memory-bandwidth-bound rather than compute-bound?

Autoregressive decode loads the full model weights for every single token, and the arithmetic intensity of a 70B FP16 model is ~1 FLOP/byte versus the ~156 FLOP/byte needed to be compute-optimal. That two-order-of-magnitude gap means the GPU spends most of the decode phase waiting on memory, so adding FLOPS barely moves per-token latency.

GPU compute has grown roughly 80x in a decade. Memory bandwidth has grown only about 17x. For autoregressive LLM inference, that gap is now the single biggest thing setting your per-token cost.

The mechanism is concrete: every decoded token reloads the entire model weights from memory, and the KV cache adds another load that scales with context length. Adding FLOPS to a memory-bound workload barely moves latency.

The arithmetic intensity of a 70B FP16 model is about 1 FLOP per byte transferred, versus the ~156 FLOP/byte needed to be compute-optimal. That two-order-of-magnitude gap is the inference memory wall.

TL;DR

Compute scaled ~80x in a decade; DRAM bandwidth scaled ~17x, opening a [~49x bandwidth gap](https://dev.to/plasmon_imp/the-memory-bandwidth-gap-is-49x-and-growing-why-local-llms-hit-a-ceiling-4cca) that keeps widening.
Decode loads the full model weights per token, so per-token cost is set by bandwidth, not FLOPS.
The KV cache grows linearly with context, making long-context serving the most bandwidth-hostile workload.
May 2026 brought $355M into memory-centric silicon: Fractile ($220M, compute-in-memory) and XCENA ($135M, compute-near-memory).
Quantization, speculative decoding, and pruning give 3-5x relief combined against a ~49x gap. Software alone cannot close it.

Key takeaways

Bandwidth is the bottleneck, not FLOPS. Decode-phase GPU utilization on a 70B model typically sits at 30% or below because the chip is starved for data.
Context length compounds the problem. KV cache bandwidth scales linearly with sequence length; a 100k-token window can demand ~16 GB of cache reads per decoded token across ~80 layers.
Hardware bets are diverging. SRAM (Groq), wafer-scale SRAM (Cerebras), dataflow hierarchy (SambaNova), compute-in-memory (Fractile), and CXL-attached DRAM (XCENA) all attack the wall differently.
Measure before you buy. GPU utilization between 20-40% under load means you are bandwidth-bound; above 70% means compute-bound.

Why is LLM inference memory-bandwidth-bound?

Transformer inference runs in two phases with opposite profiles. The prefill phase processes the whole prompt in parallel, populates the KV cache, and routinely exceeds 90% GPU utilization on a 70B model on H100. It is compute-bound and the GPU loves it.

The decode phase generates one token at a time. Each step reloads the full model weights to do the next matmul. For a 70B model in FP16 that is roughly 140 GB of weights per token. Utilization collapses to 30% or less because the chip is waiting on memory, not computing.

The arithmetic intensity makes the diagnosis unambiguous. A 70B FP16 model does about 1 FLOP per byte transferred. Compute-optimal operation needs about 156 FLOP per byte. You are running roughly 150x below the roofline, which means doubling FLOPS gives you almost nothing. Doubling bandwidth gives you almost everything.

This is why a hypothetical system with half the FLOPS but twice the bandwidth beats the faster-compute box on per-token latency. The bottleneck is delivery, not calculation.
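A roofline-style estimate makes the comparison concrete. The sketch below is illustrative only: the three accelerators are hypothetical (not real products), it assumes about 2 FLOPs per parameter per decoded token, single-stream decode, and it ignores KV cache traffic entirely.

```python
def decode_step_seconds(weight_bytes, flops_per_token, peak_flops, bandwidth):
    """Lower bound on one decode step: the slower of compute and memory wins."""
    compute_s = flops_per_token / peak_flops
    memory_s = weight_bytes / bandwidth
    return max(compute_s, memory_s)


PARAMS = 70e9
WEIGHT_BYTES = PARAMS * 2      # FP16: 2 bytes per parameter, ~140 GB
FLOPS_PER_TOKEN = 2 * PARAMS   # assumption: ~2 FLOPs per parameter per token

# Hypothetical accelerators: (peak FLOP/s, memory bandwidth in bytes/s)
SYSTEMS = {
    "baseline": (1000e12, 4e12),
    "half FLOPS, double bandwidth": (500e12, 8e12),
    "double FLOPS, same bandwidth": (2000e12, 4e12),
}

# FLOPs performed per byte moved during decode
intensity = FLOPS_PER_TOKEN / WEIGHT_BYTES
print(f"arithmetic intensity: {intensity:.1f} FLOP/byte")

for name, (peak_flops, bandwidth) in SYSTEMS.items():
    ms = decode_step_seconds(WEIGHT_BYTES, FLOPS_PER_TOKEN, peak_flops, bandwidth) * 1e3
    print(f"{name}: {ms:.1f} ms/token")
```

With these made-up numbers the memory term is always the larger one, so doubling FLOPS leaves the step time unchanged while doubling bandwidth halves it.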

How the KV cache drives inference cost

The KV cache stores key and value tensors for every prior position so attention can avoid recomputation. For a model with 8 key/value heads and 128-dimensional keys and values, each cached token costs about 2 KB per layer at one byte per element (an FP16 cache doubles that). That sounds small until you multiply.

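For a 1,000-token context across ~80 layers, each decoded token must load about 160 MB of KV data. Push context to 100,000 tokens and that becomes roughly 16 GB per decoded token across all layers, or about 200 MB per layer. Because the weights are read once per step and shared across the batch while every stream reads its own cache, a handful of concurrent 100k-token streams is enough for the KV cache, not the weights, to dominate bandwidth consumption.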

Two effects compound. First, weight loading is constant per token, so short sequences are weight-bound. Second, KV loading grows with sequence length, so long sequences are cache-bound. The crossover depends on model size and architecture, but for any modern long-context serving workload the cache is the cost center.
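The sizing arithmetic above can be sketched in a few lines. This is a minimal illustration, not a capacity planner: the function name and defaults are hypothetical, and it assumes the figures quoted in this section (8 key/value heads, 128-dimensional heads, ~80 layers, one byte per cached element) plus a 140 GB FP16 weight load that is read once per step and shared by the whole batch.

```python
def kv_bytes_per_decoded_token(context_tokens, layers=80, kv_heads=8,
                               head_dim=128, bytes_per_elem=1):
    """Bytes of KV cache one stream must read to decode one token."""
    # Keys and values are both cached, hence the factor of 2.
    per_token_per_layer = 2 * kv_heads * head_dim * bytes_per_elem
    return context_tokens * layers * per_token_per_layer


WEIGHT_BYTES = 140_000_000_000  # 70B parameters at 2 bytes each (FP16)

short_ctx = kv_bytes_per_decoded_token(1_000)
long_ctx = kv_bytes_per_decoded_token(100_000)

# Smallest batch at which total KV reads strictly exceed the shared weight read
# (assumption: every stream in the batch holds a 100k-token context).
crossover_batch = WEIGHT_BYTES // long_ctx + 1

print(f"1k context:   {short_ctx / 1e6:.0f} MB of KV per decoded token")
print(f"100k context: {long_ctx / 1e9:.1f} GB of KV per decoded token")
print(f"KV traffic passes weight traffic at batch size {crossover_batch}")
```

Under these assumptions a single 100k-token stream is still weight-dominated; the cache takes over once several long streams share one weight load, which is the usual serving situation.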

This is also why "just increase batch size" stops helping past a point. More concurrent streams means more KV cache resident, more bandwidth pressure per step, and diminishing returns on the FLOPS you already paid for. The memory wall punishes the obvious scaling lever.

How bad is the compute-vs-bandwidth gap?

The growth asymmetry is documented across roughly two decades. Gholami et al. found GPU FLOPS growing about 3.0x every two years while DRAM bandwidth grew only about 1.6x over the same period. The industry shorthand for the past decade is roughly 80x compute versus 17x bandwidth, alongside a documented ~49x gap between what current hardware delivers and what compute-optimal inference would need.

Current specs show the gap in absolute terms. The NVIDIA H200 ships 141 GB of HBM3e at 4.8 TB/s alongside ~1,979 TFLOPS of dense FP8. The B200 pushes ~8 TB/s of HBM3e against 20 PFLOPS of FP4. Bandwidth improves generation over generation, but compute improves faster.
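One way to read these specs is as a ridge point: peak FLOP/s divided by bytes/s gives the arithmetic intensity a workload needs before the chip stops waiting on memory. The sketch below uses only the figures quoted above; the function name is hypothetical, and because the two compute figures are quoted at different precisions the comparison is directional rather than exact.

```python
def ridge_point(peak_flops, bandwidth_bytes_per_s):
    """FLOPs per byte a workload must perform to keep the compute units busy."""
    return peak_flops / bandwidth_bytes_per_s


# Figures as quoted in the text; precisions differ between the two chips.
h200 = ridge_point(1979e12, 4.8e12)   # ~1,979 TFLOPS, 4.8 TB/s
b200 = ridge_point(20e15, 8e12)       # 20 PFLOPS FP4, ~8 TB/s

print(f"H200 ridge point: ~{h200:.0f} FLOP/byte")
print(f"B200 ridge point: ~{b200:.0f} FLOP/byte")
```

A decode workload at about 1 FLOP per byte sits far below either ridge point, and the ridge moves further away as compute outpaces bandwidth.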

[Figure: decade-scale growth of compute vs. memory bandwidth]

Roadmaps make the trend worse, not better. NVIDIA's Rubin platform, slated for 2026, promises about 2.4x compute over Blackwell. Bandwidth improves too, but not at the same rate. Buying the next GPU generation and expecting proportional inference speedup is the classic mistake.

HBM vs LPDDR vs SRAM: which memory wins for inference?

There is no universal winner. The right memory depends on whether your workload is capacity-bound, bandwidth-bound, or latency-bound.

| Memory | Bandwidth | Capacity | Cost | Best for |
| --- | --- | --- | --- | --- |
| HBM3e | 4.8-8 TB/s | 141-192 GB | High | Large models, broad ecosystem |
| LPDDR5X | ~100-200 GB/s per chip | 6-8x HBM per chip | Low | Capacity-bound, small-batch serving |
| On-die SRAM | 80-150 TB/s | 230-500 MB | Very high | Latency-critical, partitioned models |

HBM is the incumbent. It stacks DRAM dies on an interposer and gives you multiple TB/s with enough capacity to hold a 70B model in FP16. The catch is that real decode workloads realize only 10-30% of theoretical HBM bandwidth because autoregressive access patterns are not the streaming bursts HBM was designed for.

LPDDR flips the trade. Intel's Crescent Island pairs 160 GB of LPDDR5X with inference-optimized compute and targets air-cooled enterprise servers. Per-chip bandwidth is far lower than HBM, but capacity per dollar is dramatically better, and purpose-built streaming paths can hit 90%+ utilization versus HBM's 10-30%.

For high-volume, short-context, batch-friendly workloads, LPDDR often wins on cost-per-inference.

SRAM is the latency play. On-die SRAM hits 80-150 TB/s, an order of magnitude more bandwidth than HBM, with single-digit-nanosecond latency. The Groq 3 LPU, announced at NVIDIA GTC on March 16, 2026, packs 500 MB of SRAM at 150 TB/s per chip.

A full LPX rack integrates 256 LPUs for 315 PFLOPS of FP8 and 40 PB/s of aggregate SRAM bandwidth. The constraint is capacity: 500 MB holds roughly a 500M-parameter model at INT8, or about 250M parameters at FP16, so large models must be partitioned across many chips.
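The partitioning cost follows directly from the capacity figure. As a rough sketch (the helper name is hypothetical, and it counts weights only, with no room for KV cache, activations, or replication), the number of 500 MB chips needed to hold a model is:

```python
import math


def chips_needed(params, bytes_per_param, sram_bytes_per_chip=500_000_000):
    """Minimum chips whose combined SRAM holds the weights (weights only)."""
    return math.ceil(params * bytes_per_param / sram_bytes_per_chip)


PARAMS_70B = 70_000_000_000

fp16_chips = chips_needed(PARAMS_70B, 2)   # 140 GB of weights
int8_chips = chips_needed(PARAMS_70B, 1)   # 70 GB of weights

print(f"70B at FP16: {fp16_chips} chips")
print(f"70B at INT8: {int8_chips} chips")
```

By this estimate a 70B model at FP16 would not fit its weights in a single 256-LPU rack, while an INT8 version would, before any cache or activation memory is accounted for.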

What shipped in 2026 to attack the memory wall?

The first half of 2026 saw capital flow directly at the bandwidth problem. Two rounds closed in May 2026 alone, totaling $355M.

Fractile raised a $220M Series B on May 13-14, 2026 at roughly a $1B post-money valuation, co-led by Accel, Factorial Funds, and Founders Fund, with former Intel CEO Pat Gelsinger as an angel. Fractile's bet is compute-in-memory: do the matmuls inside SRAM cells sitting next to the logic, so weights never move.

The company claims 100x faster inference at 90% lower cost, but those are targets, not verified benchmarks, and first silicon is not expected until 2027.

XCENA closed a $135M Series B on May 29, 2026 at a $570M valuation, co-led by Altinum and IMM Investment. The MX1 chip does compute-near-memory over CXL-attached DDR5 and SSDs, explicitly targeting the KV cache bottleneck.

The founding team came from Samsung and SK Hynix, which is why the approach leverages commodity memory instead of inventing a new memory category.

NVIDIA's move is the most deployed. The reported $20B Groq deal folds Groq's LPU into the Vera Rubin rack as an inference accelerator alongside traditional GPUs. NVIDIA claims the Groq 3 LPX delivers 35x more throughput per megawatt than Blackwell for trillion-parameter models.

That number awaits independent verification, but the efficiency logic is sound: eliminating external memory access eliminates the associated energy and latency.

Cerebras and SambaNova take the other two established architectural positions. Cerebras puts 44 GB of SRAM on a wafer-scale chip so models that fit never touch external memory. SambaNova's SN40L uses a dataflow architecture with a three-tier memory hierarchy that auto-tiers data between memory levels, handling models larger than single-chip capacity without manual management.

Can quantization and speculative decoding close the gap?

Software helps, but the math is unforgiving. The mitigations collectively give 3-5x relief against a ~49x gap.

Quantization cuts bytes per transfer. FP8 gives 1.81-2.66x throughput over FP16 on TensorRT-LLM with near-FP16 quality. INT8 with quantization-aware training can match FP32 on many tasks. INT4 is the problem child: research documents a "59% long-context cliff" where INT4 quality collapses on long-context workloads, limiting it to short-context use cases.
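The byte savings are easy to tabulate. This sketch assumes a 70B-parameter model and nominal storage widths per format (scales, zero points, and any layers kept at higher precision are ignored), so measured throughput gains can differ from the raw byte ratio.

```python
# Nominal bytes per parameter for each weight format (metadata overhead ignored).
BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "INT8": 1.0, "INT4": 0.5}


def weight_gb_per_token(params, fmt):
    """Gigabytes of weights streamed per decoded token at batch size 1."""
    return params * BYTES_PER_PARAM[fmt] / 1e9


PARAMS = 70e9
for fmt in BYTES_PER_PARAM:
    gb = weight_gb_per_token(PARAMS, fmt)
    ratio = weight_gb_per_token(PARAMS, "FP16") / gb
    print(f"{fmt}: {gb:.0f} GB per token ({ratio:.0f}x fewer bytes than FP16)")
```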

Speculative decoding trades compute for bandwidth by having a small draft model propose tokens that the large model verifies in parallel. NVIDIA reports 3x throughput on Llama 3.3 70B, and Medusa heads give up to 1.9x on decode throughput. The catch: gains concentrate in single-stream decode. Under batched serving, every stream hits the same bandwidth wall, so the headline 2-15x rarely shows up in production multi-tenant systems.
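To see why the payoff depends so heavily on how often drafts are accepted, it helps to estimate how many tokens one verification pass of the large model yields. The sketch below uses the standard simplification from the speculative decoding literature that each drafted token is accepted independently with probability `alpha`; both `alpha` and `draft_len` are illustrative parameters, not measurements from any system named above.

```python
def expected_tokens_per_pass(alpha, draft_len):
    """Expected tokens emitted per large-model pass.

    Assumes each of the draft_len proposed tokens is accepted independently
    with probability alpha; the pass always yields at least one token.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        return float(draft_len + 1)
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^draft_len
    return (1.0 - alpha ** (draft_len + 1)) / (1.0 - alpha)


for alpha in (0.5, 0.8, 0.95):
    tokens = expected_tokens_per_pass(alpha, draft_len=4)
    print(f"acceptance {alpha:.2f}: ~{tokens:.2f} tokens per weight load")
```

Each pass still streams the full weights once, so this figure is roughly the factor by which weight traffic per generated token shrinks, before subtracting the draft model's own cost.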

Pruning and distillation shrink the model. NVIDIA's Minitron compresses Llama-3.1 8B to a 4B variant with about 50% inference speedup, and the technique generalizes across languages. The cost is retraining or fine-tuning, plus empirical validation per workload because compression ratios vary by architecture.

Stack the three and you get maybe 3-5x in optimistic scenarios. The wall is 49x. Software buys time; it does not buy a solution.
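The residual is simple division, shown here using the article's own figures (a ~49x gap and 3-5x combined relief); the function name is only illustrative.

```python
def residual_gap(gap, software_relief):
    """Bandwidth gap left over after software mitigations are applied."""
    return gap / software_relief


GAP = 49.0
for relief in (3.0, 5.0):
    print(f"{relief:.0f}x relief leaves a ~{residual_gap(GAP, relief):.1f}x gap")
```

Even the optimistic end leaves roughly a 10x shortfall for hardware to cover.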

What this means for you

Run the utilization test before you spec hardware. If GPU utilization during inference sits at 20-40% under load, you are bandwidth-bound and more FLOPS will not help. If it sits above 70%, you are compute-bound and bandwidth upgrades buy little. This single number tells you which axis to spend on.
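The rule of thumb can be written down as a small helper. This is a sketch of the thresholds stated above and nothing more: the function name is hypothetical, it expects a utilization percentage you have already measured under sustained load, and readings outside the two stated bands are reported as inconclusive rather than guessed at.

```python
def classify_bottleneck(gpu_utilization_pct):
    """Map a sustained GPU utilization reading to the likely bottleneck."""
    if not 0 <= gpu_utilization_pct <= 100:
        raise ValueError("utilization must be a percentage between 0 and 100")
    if 20 <= gpu_utilization_pct <= 40:
        return "bandwidth-bound"   # more FLOPS will not help
    if gpu_utilization_pct > 70:
        return "compute-bound"     # bandwidth upgrades buy little
    return "inconclusive"          # under-loaded or mixed; measure further


for reading in (30, 55, 85):
    print(f"{reading}% utilization: {classify_bottleneck(reading)}")
```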

Match memory to workload. Short-context, high-volume, batch-friendly serving favors LPDDR economics like Crescent Island. Long-context, latency-sensitive work like code generation and document reasoning favors SRAM latency like Groq LPX. Diverse workloads with broad model support still favor HBM GPUs from NVIDIA or AMD, and the Rubin-with-Groq rack is NVIDIA's hedge that both will coexist.

Budget for the cache, not just the weights. KV cache bandwidth scales with context length and batch size, so a "100k context" feature can quietly 10x your per-token memory traffic. Model the cache into your capacity planning from day one, or you will discover the wall in production.

Date every version-specific claim you rely on. The Groq 3 numbers, the Fractile and XCENA rounds, and the Rubin roadmap are all 2026 facts that will be stale by the next generation. Treat any inference benchmark older than a quarter as directional, not definitive.

The memory wall is the defining constraint of LLM serving in 2026. The teams that measure it, model it, and pick the memory architecture that fits their workload will spend less and serve faster. The teams that keep buying FLOPS will keep waiting on data.
