Google · 2026-05-05 · major

Gemma 4 MTP Drafters — Open-Weight Speculative Decoding Boosts Inference Up To 3x

Rating: 4
Author: AI/TLDR

Google released Apache-2.0 Multi-Token Prediction (MTP) drafter models for every Gemma 4 size: small 78M–0.5B companions that pair with the main model to deliver up to 3x faster inference with identical output.

[Figure: Google blog post hero image for the Gemma 4 multi-token prediction drafters. Caption: Tiny companion models pair with Gemma 4 to deliver up to 3x faster inference at identical output quality.]

Key specs

License	Apache 2.0
Speedup	up to 3x
RTX PRO 6000 (26B)	~half the wait time at batch 1
Apple Silicon (batch 4–8)	~2.2x
Drafter sizes	78M / 78.8M / 0.4B / 0.5B

What is it?

Open-weight Multi-Token Prediction (MTP) drafter models for the Gemma 4 family. They are small companion checkpoints (about 78M for E2B and E4B, 0.4B for the 26B-A4B MoE, 0.5B for the 31B) that run alongside the larger target model and propose several tokens per step.

How does it work?

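The drafter shares the target's input embedding and consumes the target's last-layer activations, down-projecting them to its own dimension to predict the next few tokens. The target then verifies all proposed tokens in a single forward pass: it accepts drafted tokens as long as they match its own predictions and, at the first mismatch, falls back to its own prediction. Because every emitted token is one the target would have chosen anyway, the output is identical to decoding with the target alone. The E2B and E4B drafters add an embedder that clusters similar tokens so that most of the vocabulary computation can be skipped.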
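The draft-then-verify loop can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Google's implementation: `toy_target` and `toy_drafter` are hypothetical stand-ins that return one greedy next token, the single batched verification pass is simulated with a plain loop, and the extra "bonus" token after a fully accepted draft is a common speculative-decoding convention that the announcement does not confirm for Gemma 4.

```python
from typing import Callable, List, Sequence

# Hypothetical stand-in for a model: maps a token context to one greedy next token.
NextToken = Callable[[Sequence[int]], int]


def speculative_step(target: NextToken, drafter: NextToken,
                     context: List[int], k: int = 4) -> List[int]:
    """Run one draft-then-verify step and return the tokens to append."""
    # 1. The drafter proposes k tokens autoregressively (cheap).
    draft: List[int] = []
    for _ in range(k):
        draft.append(drafter(context + draft))

    # 2. The target predicts the token at every draft position.
    #    A real system does this in ONE batched forward pass; the list
    #    comprehension below only simulates that result.
    target_preds = [target(context + draft[:i]) for i in range(k + 1)]

    # 3. Accept the longest prefix where the draft matches the target.
    accepted: List[int] = []
    for i in range(k):
        if draft[i] == target_preds[i]:
            accepted.append(draft[i])
        else:
            # First mismatch: fall back to the target's own token and stop.
            accepted.append(target_preds[i])
            return accepted
    # Assumption: when all k drafts match, the target's next token comes for free.
    accepted.append(target_preds[k])
    return accepted


def speculative_generate(target: NextToken, drafter: NextToken,
                         prompt: List[int], max_new_tokens: int, k: int = 4) -> List[int]:
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(target, drafter, out, k))
    return out[:len(prompt) + max_new_tokens]


def greedy_generate(target: NextToken, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Baseline: decode with the target alone, one token per call."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        out.append(target(out))
    return out


# Toy models (hypothetical): the drafter agrees with the target except
# when the context length is a multiple of 5.
def toy_target(ctx: Sequence[int]) -> int:
    return (sum(ctx) * 31 + len(ctx)) % 50


def toy_drafter(ctx: Sequence[int]) -> int:
    guess = toy_target(ctx)
    return (guess + 1) % 50 if len(ctx) % 5 == 0 else guess


if __name__ == "__main__":
    prompt = [1, 2, 3]
    fast = speculative_generate(toy_target, toy_drafter, prompt, 20)
    slow = greedy_generate(toy_target, prompt, 20)
    print(fast == slow)  # the two decoders produce the same sequence
```

Every appended token is either a draft the target agreed with or the target's own choice, so the result cannot differ from plain greedy decoding. Only the number of target passes changes.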

Why does it matter?

Inference cost dominates long-running LLM workloads. The drafters deliver up to 3x faster inference with no change in output quality: the reported figures are roughly half the wait time for the 26B model on an RTX PRO 6000 at batch 1 and about 2.2x on Apple Silicon at batch 4–8. Those savings compound across agent loops, chat, and on-device assistants. The Apache 2.0 license means anyone can ship them.
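To relate the speedup figures to wait time, a speedup of s means the same output arrives in 1/s of the baseline time. The short sketch below only does that arithmetic for the figures quoted in this item; `wait_fraction` is a helper name introduced here for illustration.

```python
def wait_fraction(speedup: float) -> float:
    """Fraction of the baseline wait time that remains after a given speedup."""
    return 1.0 / speedup


if __name__ == "__main__":
    for s in (2.0, 2.2, 3.0):
        print(f"{s}x speedup -> {wait_fraction(s):.0%} of baseline wait time")
    # 2.0x speedup -> 50% of baseline wait time
    # 2.2x speedup -> 45% of baseline wait time
    # 3.0x speedup -> 33% of baseline wait time
```

So "half the wait time" corresponds to a 2x speedup, while the 3x upper bound would cut the wait to about a third.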

Who is it for?

Gemma 4 deployers, on-device app builders, inference platform teams.

Try it

huggingface.co/google/gemma-4-31B-it-assistant