Google DeepMind · 2026-04-23 · major

Decoupled DiLoCo: 236× Less Bandwidth for Distributed LLM Training

Google DeepMind's Decoupled DiLoCo cuts cross-datacenter training bandwidth 236× (198 Gbps → 0.84 Gbps) while hitting 88% goodput under hardware failures vs 27% for standard data-parallel. Validated on a 12B model across four US regions.

[Figure: Decoupled DiLoCo's resilient distributed training with slice-granularity elasticity, compared with standard data-parallel training.]

Google DeepMind cuts inter-datacenter training bandwidth 236× while maintaining 88% goodput when chips fail — validated at 12B scale.

Key specs

Bandwidth reduction	236× (198 Gbps → 0.84 Gbps)
Goodput under failure	88% vs 27% (data-parallel)
Model size validated	12B parameters
Training speed	20× faster than synchronous baseline

What is it?

Decoupled DiLoCo is a distributed LLM pre-training method from Google DeepMind that separates local gradient updates within each datacenter from periodic cross-datacenter synchronization. Where conventional data-parallel training synchronizes gradients on every step and needs a sustained 198 Gbps between datacenters, Decoupled DiLoCo communicates asynchronously at only 0.84 Gbps by replacing per-step gradient synchronization with periodic parameter averaging ('DiLoCo steps').
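As a quick sanity check on the headline figure, the short Python sketch below recomputes the reduction factor from the two quoted bandwidths and converts each to data volume per hour. The function names are illustrative only, and the conversion assumes decimal units (1 Gbps = 10^9 bits per second).

```python
# Bandwidth figures quoted in the article, in gigabits per second.
SYNC_GBPS = 198.0     # standard data-parallel
DILOCO_GBPS = 0.84    # Decoupled DiLoCo


def reduction_factor(baseline_gbps, reduced_gbps):
    """Ratio of baseline bandwidth to reduced bandwidth."""
    return baseline_gbps / reduced_gbps


def gigabytes_per_hour(gbps):
    """Convert a sustained rate in Gbps to gigabytes transferred per hour."""
    # 8 bits per byte, 3600 seconds per hour.
    return gbps / 8 * 3600


if __name__ == "__main__":
    factor = reduction_factor(SYNC_GBPS, DILOCO_GBPS)
    print(f"reduction: {factor:.1f}x")
    print(f"data-parallel: {gigabytes_per_hour(SYNC_GBPS):.0f} GB/hour")
    print(f"Decoupled DiLoCo: {gigabytes_per_hour(DILOCO_GBPS):.0f} GB/hour")
```

The ratio comes out just under 236, which matches the rounded figure in the headline.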

How does it work?

Training follows an inner-outer loop: each compute island runs local SGD steps independently (the inner loop), then periodically averages model parameters with the other islands (the outer loop). These cross-datacenter averaging steps are fully decoupled: they don't block local training when a remote site is slow or failing. When a hardware failure takes out a slice of TPUs, the rest of the cluster continues at full speed. Under simulated high-failure conditions with 1.2 million chips, this yields 88% goodput versus 27% for standard data-parallel training. The method was validated on a 12B-parameter model across four US regions over multi-day runs.
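The inner-outer loop described above can be illustrated with a toy sketch. This is not DeepMind's implementation: it uses plain SGD on a made-up quadratic loss, plain parameter averaging as the outer step, and runs all islands sequentially in one process, so it does not model true asynchrony. The names `local_sgd`, `average`, `train`, and the `down` schedule are hypothetical, introduced only for this example.

```python
def local_sgd(params, target, steps, lr):
    """Inner loop: plain SGD on the toy loss 0.5 * ||params - target||^2."""
    w = list(params)
    for _ in range(steps):
        # The gradient of the toy loss with respect to w is (w - target).
        w = [wi - lr * (wi - ti) for wi, ti in zip(w, target)]
    return w


def average(param_sets):
    """Element-wise mean of several parameter vectors."""
    n = len(param_sets)
    return [sum(vals) / n for vals in zip(*param_sets)]


def train(targets, rounds, inner_steps, lr, down=None):
    """Toy inner-outer loop over several compute islands.

    targets: one target vector per island (a stand-in for its local data).
    down:    optional dict mapping a round index to the set of island
             indices that are unavailable in that round (assumed schedule).
    """
    down = down or {}
    global_params = [0.0] * len(targets[0])
    for r in range(rounds):
        live = [k for k in range(len(targets)) if k not in down.get(r, set())]
        if not live:
            continue  # no island available; keep the last averaged parameters
        # Each live island trains independently from the shared parameters.
        local = [local_sgd(global_params, targets[k], inner_steps, lr)
                 for k in live]
        # Outer step: one parameter exchange per round, not one per SGD step.
        # Unavailable islands are skipped instead of stalling the others.
        global_params = average(local)
    return global_params


if __name__ == "__main__":
    targets = [[1.0, 0.0], [3.0, 0.0], [5.0, 4.0], [7.0, 4.0]]
    print(train(targets, rounds=20, inner_steps=10, lr=0.1))
    # Same run, but island 3 is unavailable during the first round.
    print(train(targets, rounds=20, inner_steps=10, lr=0.1, down={0: {3}}))
```

In this toy setting both runs approach the mean of the island targets, since a missed round only delays one island's contribution. With 10 inner steps per round, the islands exchange parameters 20 times instead of 200, which is the source of the bandwidth saving in spirit, though the real ratio depends on details the article does not give.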

Why does it matter?

Training frontier models currently requires co-located supercomputers because cross-datacenter bandwidth is too expensive for synchronous training. Decoupled DiLoCo makes geographically distributed training viable at 1/236th of the bandwidth, which opens the door to pooling compute across multiple datacenters and to tolerating chip failures without cluster-wide downtime. The method comes with a production-scale validation, not just a proof of concept.

Who is it for?

ML infrastructure engineers training large models across distributed compute

Try it

arxiv.org/abs/2604.21428