Google · 2026-04-22 · major

Google TPU 8t and TPU 8i — 8th-Gen AI Chips Built for Training and Agentic Inference

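Google unveiled TPU 8t (training) and TPU 8i (inference), its 8th-generation AI chips. TPU 8t delivers roughly 3× the compute per pod of the previous generation, at 121 ExaFlops of FP4; TPU 8i delivers 80% better inference performance per dollar and carries 3× the on-chip SRAM. Both are expected to be generally available later in 2026.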

Google's new AI chips separate training and inference into dedicated silicon for the first time — each optimized for the agentic-era workload it serves.

Key specs

TPU 8t compute per pod	~3× previous gen
TPU 8t FP4 compute per superpod	121 ExaFlops
TPU 8t chips per superpod	9,600
TPU 8t shared HBM per superpod	2 petabytes
TPU 8t goodput	>97%
TPU 8i inference perf/dollar	+80% vs previous gen
TPU 8i on-chip SRAM	384 MB (3× previous gen)
TPU 8i HBM per chip	288 GB
TPU 8i ICI bandwidth	19.2 Tb/s (2× previous gen)
Both chips perf/watt	2× previous gen
Hacker News points	149

What is it?

At Google Cloud Next 2026, Google announced its 8th-generation TPU family with two distinct chips: TPU 8t for model training and TPU 8i for inference. Unlike previous TPU generations, which ran both workloads on the same die, the 8th generation splits the responsibilities. TPU 8t is designed for massive training runs: a single superpod connects 9,600 chips with two petabytes of shared high-bandwidth memory (HBM) and delivers 121 ExaFlops of FP4 compute, with near-linear scaling up to one million chips. TPU 8i is designed for serving AI agents: it has 288 GB of HBM per chip, 384 MB of on-chip SRAM (3× the previous generation), 19.2 Tb/s of inter-chip interconnect (ICI) bandwidth, and 80% better performance per dollar than the previous generation.
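The headline figures can be broken down with some simple arithmetic. The sketch below is a back-of-the-envelope calculation, not vendor data: it assumes decimal units (1 PB = 10^15 bytes, 1 ExaFlop = 10^18 FLOP/s), an even split of superpod compute and HBM across chips, and that "80% better performance per dollar" means 1.8× the work per dollar. The function names are illustrative only.

```python
# Back-of-the-envelope figures derived from the announced TPU 8t / 8i numbers.
# Assumptions (not confirmed by the announcement): decimal units and an even
# split of superpod totals across all chips.

CHIPS_PER_SUPERPOD = 9_600
SUPERPOD_FP4_EXAFLOPS = 121
SUPERPOD_HBM_PETABYTES = 2        # "two petabytes" is itself a rounded figure
PERF_PER_DOLLAR_GAIN = 0.80       # "80% better performance per dollar"


def per_chip_petaflops(exaflops: float, chips: int) -> float:
    """Average FP4 PetaFlops per chip (1 ExaFlop = 1,000 PetaFlops)."""
    return exaflops * 1_000 / chips


def per_chip_hbm_gb(petabytes: float, chips: int) -> float:
    """Average HBM per chip in GB (1 PB = 1,000,000 GB)."""
    return petabytes * 1_000_000 / chips


def relative_cost(perf_per_dollar_gain: float) -> float:
    """Cost of a fixed amount of work relative to the previous generation."""
    return 1 / (1 + perf_per_dollar_gain)


pflops = per_chip_petaflops(SUPERPOD_FP4_EXAFLOPS, CHIPS_PER_SUPERPOD)
hbm_gb = per_chip_hbm_gb(SUPERPOD_HBM_PETABYTES, CHIPS_PER_SUPERPOD)
cost = relative_cost(PERF_PER_DOLLAR_GAIN)

print(f"TPU 8t FP4 compute per chip: {pflops:.1f} PFLOPS")
print(f"TPU 8t HBM per chip:         {hbm_gb:.0f} GB")
print(f"TPU 8i cost for same work:   {cost:.0%} of previous gen")
```

Under these assumptions, the superpod works out to roughly 12.6 PetaFlops of FP4 and about 208 GB of HBM per TPU 8t chip. An 80% gain in performance per dollar corresponds to paying about 56% of the previous cost for the same work, which is a reduction of about 44%, not 80%.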

How does it work?

TPU 8t achieves near-linear scaling at million-chip scale by doubling inter-chip interconnect (ICI) bandwidth versus the previous generation. Goodput above 97% means almost all chip cycles go to productive computation rather than synchronization overhead. TPU 8i addresses the inference bottleneck for multi-step agent workloads: agents require low-latency, high-throughput serving of iterative token generation across many concurrent sessions. The 3× SRAM increase reduces the fraction of inference cycles stalled waiting on HBM, improving throughput for the short-burst, high-frequency request pattern that distinguishes agent traffic from batch inference.
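A toy model can make the goodput and SRAM arguments more concrete. The sketch below treats goodput as a simple multiplier on peak compute, which is an idealization, and uses an Amdahl-style formula for the effect of removing HBM stalls. The stall fractions passed in are hypothetical illustration values; the announcement does not publish them.

```python
# Toy model of the goodput and SRAM claims. The stall numbers used at the
# bottom are hypothetical, chosen only to illustrate the shape of the effect.


def effective_exaflops(peak_exaflops: float, goodput: float) -> float:
    """Peak compute scaled by goodput (idealized: assumes peak is reachable)."""
    if not 0.0 <= goodput <= 1.0:
        raise ValueError("goodput must be between 0 and 1")
    return peak_exaflops * goodput


def stall_speedup(stall_fraction: float, stalls_removed: float) -> float:
    """Amdahl-style speedup when a share of HBM stall cycles is eliminated.

    stall_fraction: share of baseline cycles spent waiting on HBM.
    stalls_removed: share of those stall cycles avoided by the larger SRAM.
    """
    if not 0.0 <= stall_fraction < 1.0:
        raise ValueError("stall_fraction must be in [0, 1)")
    if not 0.0 <= stalls_removed <= 1.0:
        raise ValueError("stalls_removed must be between 0 and 1")
    remaining_time = 1.0 - stall_fraction * stalls_removed
    return 1.0 / remaining_time


# Announced figures: 121 ExaFlops peak, goodput taken at the 97% floor.
print(f"Productive compute: {effective_exaflops(121, 0.97):.1f} ExaFlops")

# Hypothetical: half of cycles stall on HBM, and larger SRAM avoids 40% of them.
print(f"Throughput speedup: {stall_speedup(0.5, 0.4):.2f}x")
```

With the announced numbers, 97% goodput leaves about 117.4 of the 121 ExaFlops doing productive work. With the hypothetical stall values, throughput improves by 1.25×; the real gain depends on how memory-bound a given agent workload is.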

Why does it matter?

Agentic workloads, in which a model plans, calls tools, and iterates over multiple turns per user session, have a fundamentally different compute profile from single-turn inference. The TPU 8i's SRAM and bandwidth improvements target the bottleneck Google reports at scale: inference cycles stalled waiting on HBM. For training, the TPU 8t's near-linear million-chip scaling removes the ceiling that has forced large training runs to split across multiple logical clusters with expensive cross-cluster communication.

Who is it for?

ML teams training large models on Google Cloud; teams running AI agent workloads on Vertex AI at scale.

Try it

cloud.google.com/resources/tpu-interest: sign up to request access ahead of general availability later in 2026