Google · 2026-04-07 · major

TorchTPU: Google Makes PyTorch Native on TPUs, Replacing PyTorch/XLA

Item: TorchTPU: Google Makes PyTorch Native on TPUs, Replacing PyTorch/XLA
Rating: 4
Author: AI/TLDR

193 HN points today. Google's TorchTPU natively integrates PyTorch with TPUs via PrivateUse1 — no subclasses, no wrappers. Change device init to 'tpu', keep the rest of your code. Fused Eager mode delivers 50–100%+ speedup. Will replace PyTorch/XLA once the public GitHub repo ships.

PyTorch/XLA GitHub repository — being superseded by Google TorchTPU for native TPU support

PyTorch code now runs directly on Google TPUs with a one-line device change — Fused Eager mode squeezes 50–100%+ more throughput from the hardware.

Key specs

Hn points (today)	193
Fused eager speedup	50–100%+

What is it?

TorchTPU is Google's new engineering stack that makes PyTorch a first-class citizen on Tensor Processing Units (TPUs). It uses PyTorch's PrivateUse1 interface to provide ordinary PyTorch tensors on TPU hardware — no custom tensor subclasses, no wrapper types. Developers migrate an existing training script by changing the device initialization to 'tpu'; everything else stays the same. TorchTPU is in production at Google and will replace the older PyTorch/XLA library once the public GitHub repository launches.

How does it work?

TorchTPU offers three eager execution modes: Debug Eager (CPU sync after each op, for finding bugs), Strict Eager (async, mirrors standard PyTorch behavior), and Fused Eager (automatic operation fusion for 50–100%+ throughput gain). For compiled workloads, it integrates with torch.compile via XLA — PyTorch Dynamo captures the computation graph, lowers it to StableHLO intermediate representation, and compiles it into optimized TPU binaries. Distributed training supports DDP, FSDPv2, and DTensor with special handling for divergent code execution across ranks, fixing a known limitation of PyTorch/XLA.

Why does it matter?

NVIDIA's CUDA dominates ML training because the software ecosystem was built around it. TorchTPU removes the biggest friction — port-to-TPU effort — by allowing unmodified PyTorch code to run on Google TPU pods. As Google scales TPU capacity through its $40B Anthropic partnership and own AI workloads, a no-friction PyTorch path matters for any team that trains at scale and wants a cost-competitive alternative to NVIDIA hardware.

Who is it for?

ML engineers training large models on cloud infrastructure; teams evaluating non-NVIDIA accelerators

Try it

Change device init to 'tpu' in your PyTorch training script when TorchTPU's public GitHub launches