Google DeepMind · 2026-04-23 · major
Decoupled DiLoCo: 236× Less Bandwidth for Distributed LLM Training
Google DeepMind's Decoupled DiLoCo cuts cross-datacenter training bandwidth 236× (198 Gbps → 0.84 Gbps) while hitting 88% goodput under hardware failures vs 27% for standard data-parallel. Validated on a 12B model across four US regions.

Google DeepMind cuts inter-datacenter training bandwidth 236× while maintaining 88% goodput when chips fail — validated at 12B scale.
Key specs
| Metric | Value |
|---|---|
| Bandwidth reduction | 236× (198 Gbps → 0.84 Gbps) |
| Goodput under failure | 88% vs 27% (data-parallel) |
| Model size validated | 12B parameters |
| Training speed | 20× faster than synchronous baseline |
What is it?
Decoupled DiLoCo is a distributed LLM pre-training method from Google DeepMind that separates local gradient updates within each datacenter from periodic cross-datacenter synchronization. Where conventional data-parallel training synchronizes gradients at every step and needs 198 Gbps of cross-datacenter bandwidth, Decoupled DiLoCo communicates asynchronously and needs only 0.84 Gbps, because it replaces per-step gradient synchronization with periodic parameter averaging ('DiLoCo steps').
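As a quick sanity check, the headline reduction factor follows directly from the two bandwidth figures quoted above. The snippet below is only arithmetic on those numbers; the constant and function names are illustrative and do not come from the paper.

```python
# Bandwidth figures as quoted in the text (Gbps).
BASELINE_GBPS = 198.0    # conventional data-parallel training
DECOUPLED_GBPS = 0.84    # Decoupled DiLoCo


def reduction_factor(baseline: float, reduced: float) -> float:
    """Return how many times smaller `reduced` is than `baseline`."""
    return baseline / reduced


factor = reduction_factor(BASELINE_GBPS, DECOUPLED_GBPS)
print(f"{factor:.1f}x")  # prints 235.7x, which rounds to the quoted 236x
```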
How does it work?
Each compute island runs local SGD steps independently, then periodically exchanges averaged model parameters via an inner-outer loop. These cross-datacenter averaging steps are fully decoupled: they don't block local training when a remote site is slow or failing. When a hardware failure takes out a slice of TPUs, the rest of the cluster continues at full speed. In simulations of high-failure conditions with 1.2 million chips, this yields 88% goodput versus 27% for standard data-parallel training. The method was validated on a 12B-parameter model across four US regions over multi-day runs.
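The inner-outer loop described above can be illustrated with a toy sketch. This is not DeepMind's implementation: it assumes a single scalar parameter per island, a quadratic loss, plain parameter averaging as the outer step, and a random per-round failure model. It also runs sequentially in one process, whereas the real system communicates asynchronously. The names `local_sgd_steps`, `train`, `inner_steps`, and `fail_prob` are hypothetical.

```python
import random


def local_sgd_steps(w, target, steps, lr=0.1):
    """Inner loop: plain SGD on the toy loss 0.5 * (w - target) ** 2."""
    for _ in range(steps):
        w -= lr * (w - target)  # gradient of the toy loss is (w - target)
    return w


def train(targets, rounds=50, inner_steps=10, fail_prob=0.0, seed=0):
    """Toy local-SGD training with periodic averaging across islands.

    Each entry of `targets` stands in for one island's local data.
    An island that is down in a round (assumed failure model) makes no
    progress and skips the sync; the remaining islands do not wait for it.
    """
    rng = random.Random(seed)
    weights = [0.0] * len(targets)
    for _ in range(rounds):
        up = [i for i in range(len(targets)) if rng.random() >= fail_prob]
        # Inner loop: every reachable island trains with no communication.
        for i in up:
            weights[i] = local_sgd_steps(weights[i], targets[i], inner_steps)
        # Outer step: reachable islands average their parameters.
        if len(up) >= 2:
            avg = sum(weights[i] for i in up) / len(up)
            for i in up:
                weights[i] = avg
    return weights


print(train([1.0, 2.0, 3.0, 4.0]))                 # no failures
print(train([1.0, 2.0, 3.0, 4.0], fail_prob=0.5))  # frequent failures
```

Communication happens once per `inner_steps` local updates rather than once per step, which is where the bandwidth saving in this style of training comes from. With `fail_prob=0.0` all islands settle on the mean of the targets; with failures the run still completes because no round blocks on a missing island.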
Why does it matter?
Training frontier models currently requires co-located supercomputers because cross-datacenter bandwidth is too expensive for synchronous training. Decoupled DiLoCo makes geographically distributed training viable at 1/236th the bandwidth cost, opening the door to pooling compute across multiple data centers and tolerating chip failures without cluster-wide downtime. The method comes with a production-scale validation, not just a proof-of-concept.
Who is it for?
ML infrastructure engineers training large models across distributed compute
Try it
arxiv.org/abs/2604.21428