Google · 2026-04-22 · major
Google TPU 8t and TPU 8i — 8th-Gen AI Chips Built for Training and Agentic Inference
Google unveiled TPU 8t (training) and TPU 8i (inference), its 8th-gen AI chips. TPU 8t delivers roughly 3× the compute per pod of the previous generation, reaching 121 ExaFlops of FP4; TPU 8i delivers 80% better inference performance per dollar and carries 3× the on-chip SRAM. General availability is expected later in 2026.

Google's new AI chips separate training and inference into dedicated silicon for the first time — each optimized for the agentic-era workload it serves.
Key specs
| Spec | Value |
|---|---|
| TPU 8t compute per pod | ~3× previous gen |
| TPU 8t FP4 compute | 121 ExaFlops |
| TPU 8t chips per superpod | 9,600 |
| TPU 8t shared HBM | 2 petabytes |
| TPU 8t goodput | >97% |
| TPU 8i inference perf/dollar | +80% vs previous gen |
| TPU 8i on-chip SRAM | 384 MB (3× previous gen) |
| TPU 8i HBM per chip | 288 GB |
| TPU 8i ICI bandwidth | 19.2 Tb/s (2×) |
| Both chips perf/watt | 2× previous gen |
| HN points | 149 |
What is it?
At Google Cloud Next 2026, Google announced its 8th-generation TPU family with two distinct chips: TPU 8t for model training and TPU 8i for inference. Unlike previous TPU generations, which ran both workloads on the same die, the 8th gen splits responsibilities. TPU 8t is designed for massive training runs — a single superpod connects 9,600 chips with two petabytes of shared high-bandwidth memory and delivers 121 ExaFlops of FP4 compute with near-linear scaling up to one million chips. TPU 8i is designed for AI agent serving — 288 GB HBM per chip, 3× more on-chip SRAM than the previous generation, 19.2 Tb/s ICI bandwidth, and 80% better performance-per-dollar than the previous inference chip.
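The headline figures above can be turned into per-chip averages and a cost comparison with a little arithmetic. The Python sketch below is illustrative only: it assumes decimal units (1 PB = 10^15 bytes), assumes compute and memory divide evenly across the 9,600 chips, and uses function names invented for this example. None of the derived numbers are published by Google.

```python
# Back-of-the-envelope figures derived from the announced TPU 8t / 8i specs.
# Assumptions (not from the announcement): decimal units, even split per chip.

SUPERPOD_FP4_FLOPS = 121e18   # 121 ExaFlops of FP4 compute per TPU 8t superpod
SUPERPOD_CHIPS = 9_600        # chips per TPU 8t superpod
SUPERPOD_HBM_BYTES = 2e15     # 2 petabytes of shared HBM per superpod
PERF_PER_DOLLAR_GAIN = 0.80   # TPU 8i: +80% performance per dollar


def per_chip_pflops(total_flops: float, chips: int) -> float:
    """Average FP4 throughput per chip, in PetaFlops."""
    return total_flops / chips / 1e15


def per_chip_hbm_gb(total_bytes: float, chips: int) -> float:
    """Average share of the pooled HBM per chip, in gigabytes."""
    return total_bytes / chips / 1e9


def cost_reduction(perf_per_dollar_gain: float) -> float:
    """Fractional drop in cost for the same amount of work.

    A gain of g in performance per dollar means the same work
    costs 1 / (1 + g) of what it did before.
    """
    return 1.0 - 1.0 / (1.0 + perf_per_dollar_gain)


print(f"{per_chip_pflops(SUPERPOD_FP4_FLOPS, SUPERPOD_CHIPS):.1f} PFLOPS per chip")
print(f"{per_chip_hbm_gb(SUPERPOD_HBM_BYTES, SUPERPOD_CHIPS):.0f} GB HBM per chip")
print(f"{cost_reduction(PERF_PER_DOLLAR_GAIN):.0%} lower cost for the same work")
```

Under these assumptions, a TPU 8t chip averages about 12.6 PFLOPS of FP4 and about 208 GB of the shared HBM pool. An 80% gain in performance per dollar works out to roughly 44% lower cost for the same work, which is a smaller saving than an 80% cost cut would be.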
How does it work?
TPU 8t achieves near-linear scaling at million-chip scale by doubling inter-chip interconnect (ICI) bandwidth versus the previous generation. Goodput above 97% means nearly all chip time goes to productive computation rather than synchronization overhead. TPU 8i targets the inference bottleneck in multi-step agent workloads, which need low-latency, high-throughput serving of iterative token generation across many concurrent sessions. Tripling on-chip SRAM reduces the fraction of inference cycles spent stalled on HBM, which improves throughput for the short-burst, high-frequency request pattern that distinguishes agent traffic from batch inference.
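To make the goodput figure concrete, the Python sketch below converts a goodput fraction into hours of lost time on a long training run. It treats ">97%" as a lower bound of 0.97. The 30-day run length and the 90% comparison value are hypothetical numbers chosen for illustration, not figures from the announcement.

```python
# Goodput here is the fraction of wall-clock time spent on productive work.
# The run length and the 0.90 comparison value below are hypothetical.

def lost_hours(wall_clock_hours: float, goodput: float) -> float:
    """Hours of a run that do not go to productive computation."""
    if not 0.0 < goodput <= 1.0:
        raise ValueError("goodput must be in (0, 1]")
    return wall_clock_hours * (1.0 - goodput)


def wall_clock_needed(productive_hours: float, goodput: float) -> float:
    """Wall-clock hours required to obtain a given amount of productive time."""
    if not 0.0 < goodput <= 1.0:
        raise ValueError("goodput must be in (0, 1]")
    return productive_hours / goodput


RUN_HOURS = 30 * 24  # hypothetical 30-day training run

for goodput in (0.97, 0.90):
    print(f"goodput {goodput:.0%}: {lost_hours(RUN_HOURS, goodput):.1f} hours lost")
```

On this hypothetical 720-hour run, 97% goodput leaves about 21.6 hours of overhead, against 72 hours at 90%. At superpod scale every one of those hours is multiplied across 9,600 chips.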
Why does it matter?
Agentic workloads — where a model plans, calls tools, and iterates over multiple turns per user session — have a fundamentally different compute profile than single-turn inference. The TPU 8i's SRAM and bandwidth improvements directly address the bottleneck Google sees at scale. For training, the TPU 8t's near-linear million-chip scaling removes the ceiling that has forced large training runs to split across multiple logical clusters with expensive cross-cluster communication.
Who is it for?
ML teams training large models on Google Cloud; teams running AI agent workloads on Vertex AI at scale.
Try it
cloud.google.com/resources/tpu-interest: sign up to request access ahead of general availability later in 2026.