DiffusionGemma

Name: DiffusionGemma
Author: Google

Google's first open-weight text-diffusion model: a 26B Gemma 4 MoE that denoises text in parallel for up to 4x faster generation.

Overview

DiffusionGemma is an experimental open-weight large language model from Google, released on June 10, 2026 as part of the Gemini Diffusion / DiffusionGemma line. It is the first openly released text-diffusion model in the Gemma family: instead of predicting one token at a time like a normal autoregressive LLM, it starts from a canvas of placeholder tokens and iteratively denoises a 256-token block in parallel, the same idea behind AI image generators applied to text.

Under the hood DiffusionGemma reuses the Gemma 4 26B-A4B Mixture-of-Experts backbone (26B total parameters, about 3.8B active, 8 of 128 experts) with a new diffusion head and bidirectional attention. The instruction-tuned weights ship as google/diffusiongemma-26B-A4B-it under the Apache 2.0 license on Hugging Face, Kaggle, and Vertex AI Model Garden, support a 256K-token context window, and accept text, image, and video input (text output only; no audio).

The headline benefit is speed: DiffusionGemma reaches more than 1,000 tokens per second on a single NVIDIA H100 and 700+ tokens per second on an RTX 5090, up to roughly 4x faster than comparable autoregressive Gemma models, and fits in about 18GB of VRAM once quantized. The trade-off is quality: Google states its overall output quality is below standard Gemma 4, so it positions DiffusionGemma for speed-critical, interactive, local workflows rather than maximum-quality production use.

Released	2026-06-10
License	Apache-2.0
Weights	Open weights
Parameters	26B total (25.2B) MoE · 3.8B active
Context	256K
Max output	256K
Architecture	Mixture-of-Experts text diffusion (Gemma 4 26B-A4B backbone)
Knowledge cutoff	Jan 2025
Modalities	Text, Vision, Video
Status	Experimental open model

Benchmarks

DiffusionGemma 26B A4B vs Gemma 4 26B A4B benchmark scores (from the official model card)

Benchmark	DiffusionGemma 26B A4B	Gemma 4 26B A4B
MMLU Pro	77.6%	82.6%
AIME 2026 (no tools)	69.1%	88.3%
LiveCodeBench v6	69.1%	77.1%
Codeforces ELO	1429 ELO	1718 ELO
GPQA Diamond	73.2%	82.3%
Tau2 (average over 3)	56.2%	68.2%
HLE (no tools)	11%	8.7%
HLE (with search)	11.9%	17.2%
BigBench Extra Hard	47.6%	64.8%
MMMLU	81.5%	86.3%
MMMU Pro	54.3%	73.8%
OmniDocBench 1.5 (average edit distance, lower is better)	0.319 edit distance	0.149 edit distance
MATH-Vision	70.5%	82.4%
MedXPertQA MM	49%	58.1%
MRCR v2 8 needle 128k (average)	32%	44.1%

Comparison source ↗

This model's scores

Scores on a 0–100 scale (25-point gridlines); higher is better. Each benchmark links to its published source.

Pricing

Input	Free (open weights) / 1M tokens

DiffusionGemma is released as open weights under Apache 2.0 — there is no per-token list price from Google; you self-host (Hugging Face / Kaggle / Vertex AI) or use a hosted endpoint such as NVIDIA's free NIM cloud API. Hosting and GPU costs depend on the provider.

Pricing source ↗

Strengths

First open-weight (Apache 2.0) text-diffusion model in the Gemma family, downloadable from Hugging Face, Kaggle, and Vertex AI
Very high throughput — 1,000+ tokens/sec on an H100 and 700+ on an RTX 5090, up to ~4x faster than autoregressive Gemma
Efficient 26B Mixture-of-Experts design with only ~3.8B active parameters; fits in ~18GB VRAM quantized for single-GPU local use
256K-token context window with text, image, and video input
Day-one tooling support across vLLM (first diffusion LLM in the framework), Transformers, MLX, SGLang, and NVIDIA NIM

Best for

Reach for it when generation latency matters more than top-tier quality — interactive local chat, autocomplete, or streaming UIs.
Reach for it to run a fast text model on a single consumer or data-center GPU after quantization.
Reach for it to experiment with text-diffusion decoding (parallel block denoising) on open Apache 2.0 weights.
Use standard Gemma 4 instead when you need maximum output quality for production-grade tasks.

How to access

Provider	Model ID
Hugging Face ↗	`google/diffusiongemma-26B-A4B-it`
NVIDIA NIM ↗	`google/diffusiongemma-26b-a4b-it`
Vertex AI Model Garden ↗	—

Gemini Diffusion / DiffusionGemma — every version

The full lineage of the Gemini Diffusion / DiffusionGemma line, newest first. Every version has its own page — click any to compare specs, benchmarks and pricing.

Version	Released	Context	License
DiffusionGemmacurrent	2026-06-10	—	Apache-2.0
Gemini Diffusion	2025-05-20	—	Proprietary

FAQ

What is DiffusionGemma?

DiffusionGemma is Google's first open-weight text-diffusion language model, released June 10, 2026. It is built on the Gemma 4 26B-A4B Mixture-of-Experts backbone (26B total parameters, ~3.8B active) and generates text by denoising a 256-token block in parallel rather than one token at a time, which makes generation up to 4x faster than comparable autoregressive Gemma models.

How fast is DiffusionGemma?

Google reports more than 1,000 tokens per second on a single NVIDIA H100 and 700+ tokens per second on an RTX 5090 — up to roughly 4x the throughput of comparable autoregressive Gemma models. The model fits in about 18GB of VRAM after quantization, so it can run on a single high-end GPU.

Is DiffusionGemma open source and free?

Yes. The instruction-tuned weights (google/diffusiongemma-26B-A4B-it) are released under the Apache 2.0 license and can be downloaded from Hugging Face, Kaggle, and Vertex AI Model Garden. There is no per-token price from Google; you self-host or use a hosted endpoint such as NVIDIA's free NIM cloud API, paying only for the underlying compute.

Is DiffusionGemma better than Gemma 4?

Not in quality. Google states DiffusionGemma's overall output quality is lower than standard Gemma 4 and recommends it for speed-critical, interactive, local workflows. For maximum quality, Google still recommends the autoregressive Gemma 4 models.

// Overview

// Benchmarks

This model's scores

// Pricing

// Strengths

// Best for

// How to access

// Gemini Diffusion / DiffusionGemma — every version

// FAQ