Evaluating AI Models and Agents

LPU vs GPU Inference: What Groq's Numbers Actually Settle

The bifurcation debate is over on paper and messy in production; here is the practitioner's read on cost, latency, and routing.

June 27, 2026 · 12 min read
Tags: LPU vs GPU inference, Groq LPU benchmark, AI inference hardware comparison

NVIDIA spent a decade telling the industry one chip could do everything. Then on December 24, 2025 it licensed Groq's LPU and folded it into Vera Rubin. The LPU vs GPU inference debate did not get resolved. It got re-architected.

At GTC 2026 in March, NVIDIA showed a Vera Rubin rack with a Groq 3 LPX slice cutting AI response latency by 70% versus a GPU-only baseline. GroqCloud was already selling Llama 3.1 8B inference at roughly a tenth of comparable GPU per-token pricing.

The numbers are real. The conclusion most people draw from them is wrong.

TL;DR. LPU and GPU are converging into a heterogeneous inference stack, not a winner-take-all fight. LPUs win the decode phase on latency, consistency, and tokens-per-watt; GPUs win prefill, training, long context, and anything needing a custom kernel. NVIDIA Dynamo now routes between them. For most teams in June 2026, the right question is not "LPU or GPU" but "what ratio, on what workload, behind which neocloud."

Key takeaways

  • Groq 3 LPX inside Vera Rubin cut response latency 70% in NVIDIA's own demo by isolating decode on deterministic SRAM silicon.
  • GroqCloud lists Llama 3.1 8B at $0.05/$0.08 per million tokens, about 10x cheaper than comparable GPU endpoints for small, latency-sensitive workloads.
  • The December 2025 NVIDIA-Groq deal was a licensing-plus-acquihire, not an acquisition; the $20B figure is investor-reported and unconfirmed.
  • Vera Rubin NVL72 with Groq 3 LPX ships in H2 2026; GB300 NVL72 is the current shipping top-of-line GPU platform.
  • Training, long-context prefill, multimodal, and custom-CUDA workloads should stay on GPU. LPU is an inference accelerator, full stop.

What is LPU vs GPU inference, in one sentence?

LPU vs GPU inference is the choice between a deterministic, SRAM-based systolic array (Groq's Language Processing Unit) optimized for the sequential decode phase of transformer generation, and a parallel, HBM-backed tensor-core processor (NVIDIA's GPU) optimized for prefill, training, and throughput. The 2026 answer is to run both, routed by NVIDIA Dynamo.

The silicon, as of June 2026

Groq's third-generation LPU, the Groq 3 (also labeled GroqChip2 / LP30), is fabricated on Samsung's 4nm process, with Samsung reportedly hitting around 80% yield. Each chip carries 500 MB of on-chip SRAM and 150 TB/s of SRAM bandwidth, up from 230 MB and 80 TB/s on the first generation.

The rack-scale LPX system aggregates 256 LPUs across 32 liquid-cooled trays for 40 PB/s of aggregate SRAM bandwidth, 640 TB/s of scale-up interconnect, and 315 petaFLOPS of FP8 at up to 160 kW. That is a power density that demands liquid cooling, comparable to a dense GPU rack.
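The rack figures can be cross-checked against the per-chip figures with a few lines of arithmetic. This is a back-of-envelope sketch using only the numbers quoted above; the per-LPU power figure is simply the rack budget divided by the chip count, which is an assumption that ignores networking, cooling, and host overhead.

```python
# Sanity-check the LPX rack figures against the per-chip Groq 3 figures.
LPUS_PER_RACK = 256
TRAYS_PER_RACK = 32
SRAM_MB_PER_LPU = 500
SRAM_TBPS_PER_LPU = 150
RACK_POWER_KW = 160

lpus_per_tray = LPUS_PER_RACK // TRAYS_PER_RACK
rack_sram_gb = LPUS_PER_RACK * SRAM_MB_PER_LPU / 1000
rack_sram_pbps = LPUS_PER_RACK * SRAM_TBPS_PER_LPU / 1000
# Assumption: the whole rack budget is attributed to the LPUs.
rack_watts_per_lpu = RACK_POWER_KW * 1000 / LPUS_PER_RACK

print(f"LPUs per tray:           {lpus_per_tray}")
print(f"Total on-chip SRAM:      {rack_sram_gb:.0f} GB")
print(f"Aggregate SRAM bandwidth: {rack_sram_pbps:.1f} PB/s")
print(f"Rack power per LPU:      {rack_watts_per_lpu:.0f} W")
```

The aggregate comes out at 38.4 PB/s, which rounds to the quoted 40 PB/s. The more telling number is total SRAM: 128 GB across the entire rack, less than half the 288 GB of HBM4 on a single Rubin R100.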

On the GPU side, NVIDIA's Blackwell Ultra GB300 NVL72 is the shipping top-of-line inference platform, in production since the second half of 2025. Vera Rubin chips are in full production, but the NVL72 racks ship in H2 2026. The Rubin R100 GPU delivers 50 petaFLOPS at NVFP4 with 288 GB of HBM4 at 22 TB/s, and NVLink 6 pushes 3.6 TB/s bidirectional between GPUs.

Memory bandwidth per unit (TB/s)

  • Groq 3 LPU SRAM (per chip): 150 TB/s
  • Rubin R100 HBM4 (per GPU): 22 TB/s
  • LPX rack aggregate SRAM: 40,000 TB/s
  • LPX scale-up interconnect: 640 TB/s

The bandwidth numbers are not directly comparable: SRAM on the LPU is on-chip and tiny, while HBM4 on the GPU is off-chip and large. That asymmetry is the whole argument.

How does LPU compare to GPU on inference latency?

LPUs win the decode phase because decode is sequential and memory-latency-bound, exactly the workload a deterministic SRAM datapath is built for. GPUs win prefill because prefill is embarrassingly parallel and bandwidth-hungry on a large working set, exactly what HBM-fed tensor cores are built for.

The Groq 3 LPU completes every operation in a predictable number of cycles. There is no scheduler variance under concurrent load, which is where GPU inference latency spikes come from. For an interactive chatbot or coding assistant, that consistency is often more valuable than peak throughput.
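A small sketch shows why consistency and average speed are different metrics. The two latency samples below are invented for illustration, not measurements of any real chip: one backend is perfectly steady, the other is faster at the median but spikes 5% of the time.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical per-token decode latencies in milliseconds (illustrative only).
steady = [10.0] * 100                  # deterministic: every token takes 10 ms
jittery = [8.0] * 95 + [60.0] * 5      # faster median, occasional scheduler stall

for name, samples in (("steady", steady), ("jittery", jittery)):
    mean = sum(samples) / len(samples)
    print(f"{name:8s} mean={mean:5.1f} ms  "
          f"p50={percentile(samples, 50):5.1f} ms  "
          f"p99={percentile(samples, 99):5.1f} ms")
```

The jittery backend wins on median and nearly ties on mean, yet its p99 is six times worse. For interactive traffic, p99 is what users feel, so compare backends on percentiles rather than averages.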

NVIDIA's own Groq 3 LPX announcement positions the LPX as "the low-latency inference accelerator for the NVIDIA Vera Rubin platform," and the GTC 2026 demo showed a 70% latency reduction for response generation versus GPU-only serving. Treat that 70% as a vendor demo on a friendly workload until independent benchmarks land, but the directionality matches the architecture.

GPUs keep a real edge in prefill. Processing a 100K-token context means reading the whole sequence and computing attention across it. The 22 TB/s of HBM4 on Rubin is what makes that tractable.

An LPU with 500 MB of SRAM per chip cannot hold the KV cache for very long contexts, so prefill stays on GPU.
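To put a number on that, here is a rough KV cache estimate for Llama 3.1 8B at a 100K-token context. The layer count, KV head count, and head dimension come from the published model configuration; the 16-bit cache precision is an assumption, and the count covers KV cache only, ignoring weights and activations.

```python
import math


def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for `tokens` positions."""
    # Factor of 2: one tensor for keys, one for values, in every layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Llama 3.1 8B uses grouped-query attention with 8 KV heads.
LLAMA_3_1_8B = {"layers": 32, "kv_heads": 8, "head_dim": 128}
SRAM_BYTES_PER_LPU = 500 * 10**6   # 500 MB per Groq 3 chip, as quoted above

per_token = kv_cache_bytes(1, **LLAMA_3_1_8B)
context_100k = kv_cache_bytes(100_000, **LLAMA_3_1_8B)
chips_for_kv = math.ceil(context_100k / SRAM_BYTES_PER_LPU)

print(f"KV cache per token:     {per_token / 1024:.0f} KiB")
print(f"KV cache at 100K tokens: {context_100k / 10**9:.1f} GB")
print(f"LPUs needed for KV only: {chips_for_kv}")
```

Even for a small 8B model, a single 100K-token request needs about 13 GB of KV cache, or the SRAM of 27 chips before any weights are loaded. Larger models and concurrent requests multiply that, while one Rubin GPU holds it comfortably in HBM4.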

The December 2025 NVIDIA-Groq partnership, plainly

On December 24, 2025, NVIDIA and Groq announced a non-exclusive inference technology licensing deal plus an acquihire of Groq founder Jonathan Ross, president Sunny Madra, and key engineers. Jensen Huang explicitly said NVIDIA was "not acquiring Groq as a company."

The $20 billion figure everyone quotes traces back to a single CNBC report citing one Groq investor, Alex Davis of Disruptive. Neither NVIDIA nor Groq has confirmed the terms. Treat it as investor-reported, not verified.

Groq the company has since pivoted. It raised $650 million in June 2026 to fund a neocloud strategy, selling LPU inference as a service through data-center partners rather than competing head-on with NVIDIA silicon. The competitive landscape is now NVIDIA-plus-Groq-LPU on one side and Groq-as-neocloud on the other, with the same underlying technology.

There is regulatory noise. Senators Warren and Blumenthal sent an antitrust inquiry letter on March 19, 2026, and the deal skipped Hart-Scott-Rodino premerger review. If you are betting a deployment on the integrated LPX architecture, keep a GPU fallback plan.

AI neocloud cost comparison: who sells what

The neocloud market has bifurcated along the same silicon line as the hardware.

Provider | Silicon | Headline pricing (June 2026) | Throughput profile
GroqCloud | Groq 3 LPU | Llama 3.1 8B: $0.05 in / $0.08 out per M tokens | 800-1,000+ tok/sec, TTFT under 100 ms
CoreWeave | GPU (H100/H200/GB300, early Rubin) | Enterprise agreements, not published per-token | Rack-scale NVL72, batch-optimized
Together AI | GPU | Competitive per-token on open models | Balanced cost/performance
Fireworks AI | GPU | Via Azure Foundry | Specialized optimization

GroqCloud's price advantage is most pronounced for small-batch, latency-sensitive inference on the models it supports. For large-batch offline throughput, GPU neoclouds can win on aggregate cost even at higher per-token rates, because utilization is what drives GPU economics.
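The utilization point is easy to see with a toy cost model. The hourly price and peak throughput below are hypothetical placeholders, not quotes from any provider; substitute your own contract numbers.

```python
def gpu_cost_per_million_tokens(hourly_cost_usd, peak_tokens_per_sec, utilization):
    """Effective USD per million tokens for a rented GPU at a given utilization (0-1]."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = peak_tokens_per_sec * utilization * 3600
    return hourly_cost_usd / tokens_per_hour * 1_000_000


# Hypothetical GPU: $3.00/hour, 20,000 tokens/sec aggregate when fully batched.
HOURLY_USD = 3.00
PEAK_TPS = 20_000

for utilization in (0.90, 0.50, 0.05):
    cost = gpu_cost_per_million_tokens(HOURLY_USD, PEAK_TPS, utilization)
    print(f"utilization {utilization:4.0%}: ${cost:.3f} per million tokens")
```

Under these assumed numbers, a GPU kept 90% busy by offline batches lands below GroqCloud's $0.08 output price, while the same GPU idling at 5% on spiky interactive traffic costs roughly ten times more per token. The hardware is identical; only the utilization changed.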

The Meta-Groq collaboration on the official Llama API is worth noting: it gives GroqCloud a credible first-party model story, not just a discount story.

Why inference architecture bifurcation actually works

The mechanism behind the bifurcation is two-phase transformer inference. Prefill processes the input context, computing KV tensors across the whole sequence in a big parallel matmul. Decode generates tokens one at a time, attending to the cached KV tensors in a small, sequential, latency-bound loop.

GPUs are overprovisioned for decode. You are paying for thousands of tensor cores and a giant HBM stack to do a job that fits in on-chip SRAM and wants deterministic scheduling. LPUs are underprovisioned for prefill. You cannot fit a long context in 500 MB of SRAM per chip.

NVIDIA Dynamo, introduced at GTC 2025 and extended through 2026, is the serving framework that makes the split operational. It routes prefill to Rubin GPUs and decode to Groq LPUs, and stitches the two halves of a request together by transferring the KV cache from the prefill side to the decode side.

The GTC 2026 session "Split and Win: Dynamo's Prefill-Decode Disaggregation" is the canonical reference.

This is why the question has shifted. The choice is no longer "LPU or GPU." It is "what GPU-to-LPU ratio for my workload mix, and who routes it."
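A first-pass ratio estimate falls out of the prefill/decode split: input tokens drive prefill demand, output tokens drive decode demand. The sketch below is a sizing heuristic, not how Dynamo plans capacity, and the per-unit throughput figures are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Workload:
    requests_per_sec: float
    avg_input_tokens: int
    avg_output_tokens: int


def units_needed(workload, prefill_tps_per_gpu, decode_tps_per_lpu):
    """Return (gpus, lpus) needed to keep up with prefill and decode token demand."""
    prefill_demand = workload.requests_per_sec * workload.avg_input_tokens
    decode_demand = workload.requests_per_sec * workload.avg_output_tokens
    gpus = math.ceil(prefill_demand / prefill_tps_per_gpu)
    lpus = math.ceil(decode_demand / decode_tps_per_lpu)
    return gpus, lpus


# Hypothetical capacities: tokens/sec one unit sustains in its own phase.
PREFILL_TPS_PER_GPU = 25_000
DECODE_TPS_PER_LPU = 1_000

chat = Workload(requests_per_sec=50, avg_input_tokens=2_000, avg_output_tokens=500)
rag = Workload(requests_per_sec=50, avg_input_tokens=20_000, avg_output_tokens=200)

for name, workload in (("chat", chat), ("rag", rag)):
    gpus, lpus = units_needed(workload, PREFILL_TPS_PER_GPU, DECODE_TPS_PER_LPU)
    print(f"{name}: {gpus} GPUs for prefill, {lpus} LPUs for decode")
```

With these assumed capacities, the chat workload wants 4 GPUs and 25 LPUs, while the retrieval-heavy workload at the same request rate wants 40 GPUs and 10 LPUs. The ratio is a property of your traffic, not of the silicon.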

Where LPU wins decisively

  • Latency-sensitive interactive apps. Chatbots, coding assistants, real-time translation. Deterministic decode kills the tail-latency spikes GPU schedulers produce under load.
  • Decode-heavy workloads. Long output relative to input: code generation, agentic reasoning traces, long-form drafting.
  • Power-constrained sites. Higher decode tokens-per-watt than a GPU delivering equivalent throughput, which matters when the rack is the power budget.
  • Small-batch real-time inference. Per-request latency dominates aggregate throughput, so the LPU's consistency wins.

Where GPU still wins

  • Training and fine-tuning. LPUs are inference accelerators. Anything with backpropagation stays on GPU.
  • Long-context prefill. Contexts approaching or exceeding 100K tokens need HBM capacity and bandwidth.
  • Multimodal models. Vision-language computational patterns map poorly to fixed systolic arrays.
  • Batch-optimized offline throughput. When latency is irrelevant, GPU utilization wins on aggregate cost.
  • Custom CUDA kernels and novel architectures. Programmable tensor cores absorb new operators without new silicon.
  • Mature MLOps. vLLM, SGLang, Nsight, TensorRT, and a decade of community knowledge are a real TCO line item.
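The two lists above collapse into a coarse decision rule. This is one way to encode them, assuming you must pick a single silicon type per job; the field names are hypothetical, and the 100K-token threshold is the rough figure used in this article rather than a hard limit.

```python
from dataclasses import dataclass

LONG_CONTEXT_TOKENS = 100_000  # rough threshold from the long-context bullet


@dataclass
class Job:
    kind: str = "inference"        # "inference", "training", or "fine-tuning"
    input_tokens: int = 0
    output_tokens: int = 0
    latency_sensitive: bool = False
    offline_batch: bool = False
    multimodal: bool = False
    custom_kernels: bool = False


def preferred_silicon(job: Job) -> str:
    """Return "gpu" or "lpu" following the win/lose lists, checking GPU-only cases first."""
    if job.kind != "inference":
        return "gpu"               # anything with backpropagation
    if job.multimodal or job.custom_kernels:
        return "gpu"               # needs programmable tensor cores
    if job.input_tokens >= LONG_CONTEXT_TOKENS:
        return "gpu"               # prefill needs HBM capacity and bandwidth
    if job.offline_batch:
        return "gpu"               # utilization beats latency
    if job.latency_sensitive or job.output_tokens > job.input_tokens:
        return "lpu"               # interactive or decode-heavy
    return "gpu"                   # default to the mature stack


print(preferred_silicon(Job(input_tokens=500, output_tokens=2_000, latency_sensitive=True)))
print(preferred_silicon(Job(input_tokens=150_000, latency_sensitive=True)))
```

The ordering matters: GPU-only constraints are checked before latency preferences, so a latency-sensitive request with a 150K-token prompt still lands on GPU.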

The software-ecosystem counterargument

The honest pushback on LPU is not about bandwidth. It is about the software stack. CUDA has millions of developers, mature debuggers, and a deprecation policy that gives enterprises years of notice. vLLM has been battle-tested across thousands of production deployments with continuous batching, paged attention, and speculative decoding.

Groq's inference engine is the serving layer for LPU, and the community around it is much smaller. Teams adopting LPU should budget for a learning curve, a thinner troubleshooting base, integration work against existing Kubernetes and MLOps tooling, and the risk of hitting an unsupported operator.

For a small team with no LPU expertise, those integration costs can eat the per-token savings.

NVIDIA's integration of LPU into Vera Rubin is partly a response to this. By selling heterogeneous racks through its own channel with Dynamo doing the routing, NVIDIA offers the LPU efficiency win without a separate procurement and integration project. That may be the single most consequential thing in the whole report for buyers.

What this means for you

For latency-critical interactive apps. Start with GroqCloud on a supported model. The cost and consistency wins for decode-heavy traffic are large enough to justify the integration work, and the free tier lets you measure before you commit.
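Measuring before you commit means capturing time to first token (TTFT) and decode rate on your own prompts. The helper below is a provider-agnostic sketch: it times any iterable of streamed tokens, and `simulated_stream` is a stand-in for whatever streaming client you actually use, not a real GroqCloud API.

```python
import time


def measure_stream(tokens, clock=time.perf_counter):
    """Time a token stream; returns TTFT, token count, and decode tokens/sec."""
    start = clock()
    first = last = None
    count = 0
    for _ in tokens:
        last = clock()             # arrival time of this token
        if first is None:
            first = last           # arrival time of the first token
        count += 1
    if count == 0:
        raise ValueError("stream produced no tokens")
    decode_time = last - first
    # Decode rate counts only the tokens that arrived after the first one.
    rate = (count - 1) / decode_time if decode_time > 0 else 0.0
    return {"ttft_s": first - start, "tokens": count, "decode_tokens_per_sec": rate}


def simulated_stream(n_tokens=5, delay_s=0.01):
    """Stand-in for a real streaming response: yields a token every delay_s seconds."""
    for i in range(n_tokens):
        time.sleep(delay_s)
        yield f"tok{i}"


stats = measure_stream(simulated_stream())
print(stats)
```

Separating TTFT from decode rate matters because they stress different phases: TTFT is dominated by queueing and prefill, decode rate by the token loop. Run the same prompts against each candidate endpoint and compare percentiles, not single runs.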

For large-batch offline inference. Stay on GPU. CoreWeave's GB300 NVL72 (and soon Vera Rubin) gives you the aggregate throughput and the software maturity. Per-token price looks worse but utilization looks better.

For mixed workloads. Plan around Vera Rubin NVL72 with Groq 3 LPX when it ships in H2 2026. The heterogeneous rack with Dynamo routing is the architecture that matches a real production traffic mix, and it comes through NVIDIA's enterprise channel with the support structure you already know.

For risk management. Keep a GPU fallback. The Warren-Blumenthal inquiry and the skipped HSR review mean the integrated LPX offering could be restructured. Do not let a single silicon path be your only path.
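In code, a fallback plan is an ordered list of backends tried in turn. This is a minimal sketch with hypothetical stand-in backends; a production version would catch specific timeout and transport errors, add retries with backoff, and emit metrics on every failover.

```python
from typing import Callable, Sequence, Tuple


class AllBackendsFailed(RuntimeError):
    """Raised when every configured backend has failed for a request."""


def generate_with_fallback(
    prompt: str,
    backends: Sequence[Tuple[str, Callable[[str], str]]],
) -> Tuple[str, str]:
    """Try each (name, call) pair in order; return (backend_name, completion)."""
    errors = []
    for name, call in backends:
        try:
            return name, call(prompt)
        except Exception as exc:   # broad on purpose for the sketch
            errors.append(f"{name}: {exc}")
    raise AllBackendsFailed("; ".join(errors))


# Hypothetical backends standing in for real LPU and GPU endpoints.
def flaky_lpu(prompt: str) -> str:
    raise ConnectionError("LPU endpoint unavailable")


def stable_gpu(prompt: str) -> str:
    return f"gpu says: {prompt}"


served_by, text = generate_with_fallback("hi", [("lpu", flaky_lpu), ("gpu", stable_gpu)])
print(served_by, "->", text)
```

Keeping the GPU path exercised in production, even at a small traffic share, is what makes the fallback real rather than theoretical.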

For capacity planning. Treat 2026 as the last year of homogeneous GPU inference at the high end. Roadmaps for 2027 should assume heterogeneous Vera Rubin systems and budget for Dynamo-based serving as a core competency, not a bolt-on.

The bifurcation is real, the numbers are real, and the routing layer exists. The mistake is treating it as a silicon war. It is a workload-routing decision, and the teams that learn to route will pay less per token and serve lower latency than the teams that pick one chip and defend it.

Frequently asked questions

Is LPU faster than GPU for inference?

LPUs are faster and more consistent for the decode phase of transformer inference, where NVIDIA's GTC 2026 demo showed a 70% latency reduction versus GPU-only serving. That figure is a vendor demo, not an independent benchmark. GPUs remain faster for prefill, long-context processing, and training, so the best 2026 deployments route each phase to the right silicon via NVIDIA Dynamo.

What is the cost per token for Groq LPU inference?

As of June 2026, GroqCloud lists Llama 3.1 8B at $0.05 per million input tokens and $0.08 per million output tokens, roughly a 10x cost advantage over comparable GPU endpoints for small, latency-sensitive workloads. GPU-based neoclouds like CoreWeave typically negotiate rack-scale NVL72 pricing through enterprise agreements rather than publishing per-token rates.
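Per-million-token prices are easier to reason about as a per-request or monthly bill. The sketch below uses the list prices quoted above; the request shape and monthly volume are hypothetical.

```python
INPUT_USD_PER_M = 0.05    # GroqCloud Llama 3.1 8B input price quoted above
OUTPUT_USD_PER_M = 0.08   # GroqCloud Llama 3.1 8B output price quoted above


def request_cost_usd(input_tokens, output_tokens,
                     in_price=INPUT_USD_PER_M, out_price=OUTPUT_USD_PER_M):
    """Cost of one request given per-million-token input and output prices."""
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


# Hypothetical traffic: 2,000 input and 500 output tokens, one million requests a month.
per_request = request_cost_usd(2_000, 500)
monthly = per_request * 1_000_000

print(f"per request: ${per_request:.5f}")
print(f"per month:   ${monthly:,.2f}")
```

At that scale the absolute bill is small, which is the practical point: for small models, integration and engineering time usually dominate the decision, not the token price itself.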

Did NVIDIA acquire Groq?

No. The December 24, 2025 deal was a non-exclusive technology licensing agreement plus an acquihire of Groq's top leadership and key engineers, not a corporate acquisition. The widely cited $20 billion figure comes from a single investor report and has not been confirmed by either company.

Should I move all inference workloads to LPU?

No. LPU wins for decode-heavy, latency-sensitive, small-batch workloads. Keep training, long-context prefill, multimodal models, custom-CUDA-kernel models, and large-batch offline throughput on GPU. The mature pattern in 2026 is heterogeneous routing, not wholesale migration.

When do Vera Rubin NVL72 racks with Groq 3 LPX ship?

Vera Rubin chips are in full production as of early 2026, but rack-scale NVL72 systems with integrated Groq 3 LPX are scheduled for customer shipments in the second half of 2026, with most sources pointing to a Q3 2026 timeline. Until then, GB300 NVL72 is the shipping top-of-line GPU inference platform.