cluster

DiffusionGemma 26B-A4B explained: can diffusion beat autoregression?

DeepMind's new open-weights model generates 256 tokens in parallel on a single RTX card, and it's the strongest test yet of whether diffusion can challenge next-token prediction.

June 12, 20269 min read
diffusiongemma 26b-a4bdiffusion vs autoregressive text generationdiffusion language model
DiffusionGemma 26B-A4B explained: can diffusion beat autoregression?

On June 10, 2026, Google DeepMind and NVIDIA released DiffusionGemma 26B-A4B, an open-weights diffusion language model that generates 256 tokens in parallel rather than one at a time. It carries 25.2B total parameters, activates only 3.8B per token, and runs in 18 GB of VRAM when quantized.

That makes it the first serious attempt to put diffusion text generation on consumer RTX hardware.

A diffusion language model starts from a fully masked sequence and reveals tokens in parallel across a series of denoising steps. An autoregressive model emits one token per forward pass, left to right. DiffusionGemma 26B-A4B is the strongest production test yet of whether that first approach can compete.

TL;DR: DiffusionGemma is fast (700+ tokens/sec on an RTX 5090), cheap to run locally, and structurally better at editing and infilling than any autoregressive model with a fill-in-the-middle patch. Google itself says to use standard Gemma 4 when you need maximum quality, and no first-party benchmark table exists yet. Treat it as a second paradigm for edit-heavy local workloads, and watch the next six months of independent benchmarks before believing anything bigger.

Key takeaways:

  • 25.2B total parameters, 3.8B active per token, 256K context, Apache 2.0 license, per the NVIDIA NIM model card.
  • Generates 256 tokens per denoising step with bidirectional attention, following the Block Diffusion design pattern.
  • Fits in 18 GB of VRAM at 4-bit per Unsloth's documentation, with day-one support in vLLM, Transformers, llama.cpp, and MLX.
  • Google's launch post recommends standard Gemma 4 "for applications that demand maximum quality." That caveat is the most important sentence in the release.
  • The honest framing for practitioners: diffusion turns text generation from typing into sculpting.

What is DiffusionGemma 26B-A4B?

DiffusionGemma is a sparse activation model built on the Gemma 4 26B A4B mixture-of-experts skeleton, retrained as a discrete diffusion generator. The NVIDIA Developer Blog lists 25.2B total parameters and 3.8B active per token; the "A4B" suffix is a rounded label.

The architecture is encoder-decoder with bidirectional attention. Each denoising step generates a 256-token block in parallel, and the model supports text, image, and video inputs, a 256K context window, native function calling, and a configurable reasoning mode per the NIM model card.

Weights are on Hugging Face under Apache 2.0 plus the Gemma Terms of Use. One detail to flag: the MoE expert count and routing policy circulating online (128 experts, top-8) appear only in third-party sources, so treat those numbers as unconfirmed.

How does diffusion differ from autoregressive text generation?

The difference is the probability factorization. An autoregressive model factorizes a sequence as a chain of next-token predictions, so producing 256 tokens requires 256 sequential forward passes, and causal attention means each position only sees what came before it.

A diffusion language model treats generation as denoising. The model sees a partially masked sequence and predicts every masked position at once, with attention flowing in both directions.

Producing 256 tokens might take 32 denoising steps instead of 256 sequential passes. The Hugging Face overview of diffusion language models traces this lineage from D3PM (Austin et al., 2021) through LLaDA and Dream 7B.

Each denoising forward pass costs more than an autoregressive one, roughly 2x for the same window because of bidirectional attention. The win comes from doing far fewer of them.

Two 2025 advances made this practical: Block Diffusion (Arriola et al., ICLR 2025) added prefix caching at block boundaries, and FreeCache/FlashDLM recovered AR-style KV caching by reusing projections for positions that aren't being remasked, reporting 2 to 10x speedups. The independent dKV-Cache work reached similar results.

DiffusionGemma's 256-token block design sits directly on this research.

The second structural advantage is editing. Autoregressive models handle infilling through fill-in-the-middle training, a reordering trick bolted onto a left-to-right factorization that doesn't generalize to multiple non-contiguous spans. A diffusion model edits natively: remask the span you want changed, denoise, done. The Google Developers Blog guide builds its entire walkthrough around this infill-and-revise loop.

How fast is DiffusionGemma on local RTX hardware?

Fast enough to change what local inference feels like. First-party figures from NVIDIA and Google put peak throughput at 1,000 tokens/sec on a single H100 in BF16, 2,000 on DGX Station, and 700+ on a GeForce RTX 5090.

DiffusionGemma peak throughput by hardware (first-party figures)DGX Station2000tok/sH100 (BF16)1000tok/sRTX 5090 (4-bit)700tok/sDGX Spark150tok/s
DiffusionGemma peak throughput by hardware (first-party figures)

At 4-bit quantization the model fits in 18 GB of VRAM, which puts a 25.2B-parameter model inside a single consumer card. First-week RTX 5090 owners on r/LocalLLaMA report 50 to 100 tokens/sec sustained on practical prompts, well below the advertised peak but still strong for the class.

The tooling story matters as much as the numbers. DiffusionGemma is the first diffusion LLM natively supported in vLLM, with simultaneous Transformers, Unsloth GGUF, llama.cpp, and MLX support. Earlier diffusion models like LLaDA 8B spent most of 2025 in the "interesting paper, can't run it" bucket. This one runs on day one.

Can diffusion beat autoregression on code?

On editing, plausibly already. On correctness, no published evidence says yes, and Google has published no HumanEval, MBPP, or LiveCodeBench numbers for DiffusionGemma at all. That absence is the loudest signal in the launch.

The structural case for diffusion code generation is real. Code is full of bidirectional dependencies: function bodies depend on signatures, tests depend on the functions they exercise. A bidirectional prior fits that structure, and the ability to revise a function body in place without regenerating the file is something developers actually want.

One first-week Hacker News commenter put it well: "For the first time I have a local 4-bit model that I can ask to rewrite this function and it actually respects the rest of the file."

The structural case against is equally real. Code completion is prefix-anchored, and a model that can re-attend to and revise your prefix can drift away from what you typed. Several first-week reports flag exactly this failure mode, along with inconsistent variable naming.

The published track record of diffusion coders supports caution. Dream 7B reports around 21.4% pass@1 on LiveCodeBench, below same-sized autoregressive baselines. Apple's DiffuCoder showed code-RL post-training adds about 4.4 points on EvalPlus.

The standout is Mercury Coder from Inception Labs, with first-party HumanEval numbers in the 88 to 90 range, though those figures remain unreproduced independently. The pattern across all of them: diffusion coders are reaching the scores autoregressive models hit 12 to 18 months earlier.

Why hasn't diffusion caught up at scale?

Three gaps, each independently sufficient. First, training-recipe immaturity: autoregressive models have had five-plus years of post-training refinement (SFT, RLHF, DPO), and standard RLHF doesn't directly apply to a generator that isn't left-to-right.

Second, evaluation infrastructure built around left-to-right assumptions. Third, inference kernels: vLLM and TensorRT-LLM represent years of CUDA work tuned for the autoregressive attention pattern, and diffusion scheduling support is only now arriving.

Scale is the other elephant. DiffusionGemma's 3.8B active parameters sit 5.8x below Qwen3-235B-A22B and 9.7x below DeepSeek V3 on an active-parameter basis. The interesting open question isn't whether diffusion works at 3.8B active.

It's whether diffusion works at 22B+ active, where it would compete with the open-weights frontier, and the literature has almost no evidence there.

It's also worth naming the launch dynamics. The release was tightly co-marketed across Google's blog, NVIDIA's developer blog, NIM, vLLM, and Unsloth on the same day, with no benchmark table and an explicit quality caveat. VentureBeat's coverage cited a 5 to 15 point reasoning gap versus Gemma 4 26B, a figure we could not locate verbatim in first-party material, so hold it loosely.

The shape of the launch reads as a productized inference story.

What this means for you

If you run local models on a 24 GB card, DiffusionGemma is worth installing this week for one specific job: in-place editing and infilling. Pull the Unsloth GGUF, load the 4-bit quant in 18 GB, and try the remask-and-revise workflow on a real file.

That's the use case where its structural advantage shows up immediately.

For correctness-critical code generation, keep your current autoregressive coder. The sensible local stack right now pairs a strong AR model for generation with DiffusionGemma for revision passes, since the diffusion model can rewrite a span without disturbing its surroundings.

For builders, the day-one vLLM support means you can serve it behind an OpenAI-compatible endpoint today and A/B it against Gemma 4 on your own evals. Run your own numbers; nobody else has published any.

What to watch over the next 6 to 12 months

The platform-shift question resolves on a few concrete signals. A frontier-lab diffusion model at 20B+ active parameters reaching benchmark parity with its autoregressive sibling would confirm the thesis. So would a diffusion model placing top-3 on SWE-bench Verified, or a major coding tool adopting one as its default local backend.

Independent confirmation of a persistent 5 to 15 point gap on HumanEval and MMLU would falsify it, as would silence from DeepMind. If no DiffusionGemma 2 appears by December 2026, this launch was a one-off experiment rather than the start of a program.

My read: the strong thesis fails and the weak one holds. Diffusion becomes a structurally important second paradigm for editing, infilling, and speed-critical local inference, while autoregression keeps the frontier.

That's still a meaningful shift for anyone running a local LLM on RTX hardware, and DiffusionGemma is the first model that lets you test it yourself.

Sources

Frequently asked questions

What is DiffusionGemma 26B-A4B?

DiffusionGemma 26B-A4B is an open-weights diffusion language model released by Google DeepMind and NVIDIA on June 10, 2026. It has 25.2B total parameters with 3.8B active per token via mixture-of-experts routing, and it generates 256 tokens in parallel per denoising step instead of one token at a time. It ships under Apache 2.0 with the Gemma Terms of Use.

How is diffusion text generation different from autoregressive generation?

Autoregressive models produce one token per forward pass, left to right, with causal attention. Diffusion language models start from a masked sequence and fill in all masked positions in parallel over a series of denoising steps, using bidirectional attention. That makes diffusion faster for long outputs and structurally better at editing and infilling, while autoregressive models still lead on peak quality.

How much VRAM does DiffusionGemma need to run locally?

Quantized to 4-bit (Q4_K_M or NVFP4), DiffusionGemma fits in about 18 GB of VRAM per Unsloth's documentation, which puts it within reach of an RTX 5090 and other 24 GB consumer cards. Full BF16 precision needs roughly 50 GB.

Is DiffusionGemma better than standard Gemma 4 for coding?

Probably for editing and infilling, but for raw code correctness Google itself recommends standard Gemma 4 when maximum quality matters. First-week practitioner reports praise its in-place editing while flagging inconsistent variable naming and prefix drift on completion tasks. No first-party HumanEval or LiveCodeBench numbers have been published yet.

Will diffusion models replace autoregressive LLMs?

The evidence so far points to coexistence rather than replacement. Diffusion models have historically trailed autoregressive models by 5 to 15 points at matched scale, and DiffusionGemma's 3.8B active parameters sit roughly an order of magnitude below the open-weights frontier. Watch for a 20B+ active-parameter diffusion model with independent benchmark parity before calling it a platform shift.