Overview
DiffusionGemma is an experimental open-weight large language model from Google, released on June 10, 2026 as part of the Gemini Diffusion / DiffusionGemma line. It is the first openly released text-diffusion model in the Gemma family: instead of predicting one token at a time like a normal autoregressive LLM, it starts from a canvas of placeholder tokens and iteratively denoises a 256-token block in parallel, the same idea behind AI image generators applied to text.
Under the hood DiffusionGemma reuses the Gemma 4 26B-A4B Mixture-of-Experts backbone (26B total parameters, about 3.8B active, 8 of 128 experts) with a new diffusion head and bidirectional attention. The instruction-tuned weights ship as google/diffusiongemma-26B-A4B-it under the Apache 2.0 license on Hugging Face, Kaggle, and Vertex AI Model Garden, support a 256K-token context window, and accept text, image, and video input (text output only; no audio).
The headline benefit is speed: DiffusionGemma reaches more than 1,000 tokens per second on a single NVIDIA H100 and 700+ tokens per second on an RTX 5090, up to roughly 4x faster than comparable autoregressive Gemma models, and fits in about 18GB of VRAM once quantized. The trade-off is quality: Google states its overall output quality is below standard Gemma 4, so it positions DiffusionGemma for speed-critical, interactive, local workflows rather than maximum-quality production use.
| Released | 2026-06-10 |
|---|---|
| License | Apache-2.0 |
| Weights | Open weights |
| Parameters | 26B total (25.2B) MoE · 3.8B active |
| Context | 256K |
| Max output | 256K |
| Architecture | Mixture-of-Experts text diffusion (Gemma 4 26B-A4B backbone) |
| Knowledge cutoff | Jan 2025 |
| Modalities | Text, Vision, Video |
| Status | Experimental open model |
Benchmarks
DiffusionGemma 26B A4B vs Gemma 4 26B A4B benchmark scores (from the official model card)
| Benchmark | DiffusionGemma 26B A4B | Gemma 4 26B A4B |
|---|---|---|
| MMLU Pro | 77.6% | 82.6% |
| AIME 2026 (no tools) | 69.1% | 88.3% |
| LiveCodeBench v6 | 69.1% | 77.1% |
| Codeforces ELO | 1429 ELO | 1718 ELO |
| GPQA Diamond | 73.2% | 82.3% |
| Tau2 (average over 3) | 56.2% | 68.2% |
| HLE (no tools) | 11% | 8.7% |
| HLE (with search) | 11.9% | 17.2% |
| BigBench Extra Hard | 47.6% | 64.8% |
| MMMLU | 81.5% | 86.3% |
| MMMU Pro | 54.3% | 73.8% |
| OmniDocBench 1.5 (average edit distance, lower is better) | 0.319 edit distance | 0.149 edit distance |
| MATH-Vision | 70.5% | 82.4% |
| MedXPertQA MM | 49% | 58.1% |
| MRCR v2 8 needle 128k (average) | 32% | 44.1% |
This model's scores
- MMLU-Pro77.6%
- GPQA Diamond73.2%
- AIME 2026 (no tools)69.1%
- LiveCodeBench v669.1%
- MATH-Vision70.5%
- MMMU Pro54.3%
Scores on a 0–100 scale (25-point gridlines); higher is better. Each benchmark links to its published source.
Pricing
| Input | Free (open weights) / 1M tokens |
|---|
DiffusionGemma is released as open weights under Apache 2.0 — there is no per-token list price from Google; you self-host (Hugging Face / Kaggle / Vertex AI) or use a hosted endpoint such as NVIDIA's free NIM cloud API. Hosting and GPU costs depend on the provider.
Strengths
- First open-weight (Apache 2.0) text-diffusion model in the Gemma family, downloadable from Hugging Face, Kaggle, and Vertex AI
- Very high throughput — 1,000+ tokens/sec on an H100 and 700+ on an RTX 5090, up to ~4x faster than autoregressive Gemma
- Efficient 26B Mixture-of-Experts design with only ~3.8B active parameters; fits in ~18GB VRAM quantized for single-GPU local use
- 256K-token context window with text, image, and video input
- Day-one tooling support across vLLM (first diffusion LLM in the framework), Transformers, MLX, SGLang, and NVIDIA NIM
Best for
- Reach for it when generation latency matters more than top-tier quality — interactive local chat, autocomplete, or streaming UIs.
- Reach for it to run a fast text model on a single consumer or data-center GPU after quantization.
- Reach for it to experiment with text-diffusion decoding (parallel block denoising) on open Apache 2.0 weights.
- Use standard Gemma 4 instead when you need maximum output quality for production-grade tasks.
How to access
| Provider | Model ID |
|---|---|
| Hugging Face ↗ | google/diffusiongemma-26B-A4B-it |
| NVIDIA NIM ↗ | google/diffusiongemma-26b-a4b-it |
| Vertex AI Model Garden ↗ | — |
Gemini Diffusion / DiffusionGemma — every version
The full lineage of the Gemini Diffusion / DiffusionGemma line, newest first. Every version has its own page — click any to compare specs, benchmarks and pricing.
| Version | Released | Context | License |
|---|---|---|---|
| DiffusionGemmacurrent | 2026-06-10 | — | Apache-2.0 |
| Gemini Diffusion | 2025-05-20 | — | Proprietary |
FAQ
What is DiffusionGemma?
DiffusionGemma is Google's first open-weight text-diffusion language model, released June 10, 2026. It is built on the Gemma 4 26B-A4B Mixture-of-Experts backbone (26B total parameters, ~3.8B active) and generates text by denoising a 256-token block in parallel rather than one token at a time, which makes generation up to 4x faster than comparable autoregressive Gemma models.
How fast is DiffusionGemma?
Google reports more than 1,000 tokens per second on a single NVIDIA H100 and 700+ tokens per second on an RTX 5090 — up to roughly 4x the throughput of comparable autoregressive Gemma models. The model fits in about 18GB of VRAM after quantization, so it can run on a single high-end GPU.
Is DiffusionGemma open source and free?
Yes. The instruction-tuned weights (google/diffusiongemma-26B-A4B-it) are released under the Apache 2.0 license and can be downloaded from Hugging Face, Kaggle, and Vertex AI Model Garden. There is no per-token price from Google; you self-host or use a hosted endpoint such as NVIDIA's free NIM cloud API, paying only for the underlying compute.
Is DiffusionGemma better than Gemma 4?
Not in quality. Google states DiffusionGemma's overall output quality is lower than standard Gemma 4 and recommends it for speed-critical, interactive, local workflows. For maximum quality, Google still recommends the autoregressive Gemma 4 models.