OpenAI · 2026-05-04 · major

OpenAI Engineering: Inside the WebRTC Stack Rebuild That Keeps Voice AI Low-Latency at Scale

OpenAI engineers Yi Zhang and William McDonald describe rearchitecting their WebRTC stack to address three constraints that became problems at scale: one-port-per-session media termination, stateful ICE/DTLS sessions, and global routing that keeps first-hop latency low.

OpenAI 'Delivering low-latency voice AI at scale' post hero card — OpenAI

OpenAI's engineering write-up on the WebRTC stack rebuild that keeps Realtime API voice traffic within conversational latency at scale.

What is it?

An OpenAI engineering post by Yi Zhang and William McDonald, the team responsible for real-time AI interactions. It explains the architecture changes they made so voice AI sessions can be terminated efficiently across pods and regions while preserving the encrypted negotiation state that WebRTC requires.

How does it work?

Three constraints had to be solved together. First, WebRTC deployments traditionally dedicate one UDP port on a single server to each session (one-port-per-session media termination). Second, ICE and DTLS sessions are stateful, and that state is hard to migrate once established. Third, routing has to keep first-hop latency low even when the user is far from the GPU pool. The post details how the team rebuilt these layers and where they had to depart from off-the-shelf WebRTC stacks.
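This summary does not spell out the mechanism OpenAI chose, so the following is only a minimal sketch of one widely used way around the one-port-per-session constraint: sharing a single UDP port across sessions and demultiplexing packets in software. It assumes the first packet from a client is an ICE connectivity check whose STUN USERNAME begins with the server-side username fragment, and that later DTLS and RTP packets can be routed by source address. `SinglePortDemux`, `register`, `route`, and `build_binding_request` are hypothetical names chosen for illustration, not names from the post.

```python
import struct
from typing import Dict, Optional, Tuple

Addr = Tuple[str, int]
STUN_MAGIC_COOKIE = 0x2112A442
STUN_ATTR_USERNAME = 0x0006


def classify(packet: bytes) -> str:
    """Classify a packet by its first byte, using the RFC 7983 ranges."""
    if not packet:
        return "unknown"
    first = packet[0]
    if first <= 3:
        return "stun"
    if 20 <= first <= 63:
        return "dtls"
    if 128 <= first <= 191:
        return "rtp"
    return "unknown"


def stun_username(packet: bytes) -> Optional[str]:
    """Return the USERNAME attribute of a STUN message, or None if absent."""
    if len(packet) < 20:
        return None
    _msg_type, length, cookie = struct.unpack_from("!HHI", packet, 0)
    if cookie != STUN_MAGIC_COOKIE:
        return None
    offset, end = 20, min(20 + length, len(packet))
    while offset + 4 <= end:
        attr_type, attr_len = struct.unpack_from("!HH", packet, offset)
        value = packet[offset + 4 : offset + 4 + attr_len]
        if attr_type == STUN_ATTR_USERNAME:
            return value.decode("utf-8", errors="replace")
        offset += 4 + attr_len + (-attr_len % 4)  # values are padded to 4 bytes
    return None


class SinglePortDemux:
    """Hypothetical demultiplexer: many sessions share one UDP port."""

    def __init__(self) -> None:
        self.by_ufrag: Dict[str, str] = {}  # server ufrag -> session id
        self.by_addr: Dict[Addr, str] = {}  # client (ip, port) -> session id

    def register(self, server_ufrag: str, session_id: str) -> None:
        self.by_ufrag[server_ufrag] = session_id

    def route(self, packet: bytes, addr: Addr) -> Optional[str]:
        if classify(packet) == "stun":
            username = stun_username(packet)
            if username:
                # ICE USERNAME is "<receiver ufrag>:<sender ufrag>".
                session_id = self.by_ufrag.get(username.split(":", 1)[0])
                if session_id is not None:
                    # A real server would verify MESSAGE-INTEGRITY before this.
                    self.by_addr[addr] = session_id
                    return session_id
        # DTLS and RTP carry no ufrag, so fall back to the learned address.
        return self.by_addr.get(addr)


def build_binding_request(username: str) -> bytes:
    """Build a bare STUN Binding Request carrying only USERNAME (demo only)."""
    value = username.encode()
    padded = value + b"\x00" * (-len(value) % 4)
    attr = struct.pack("!HH", STUN_ATTR_USERNAME, len(value)) + padded
    header = struct.pack("!HHI", 0x0001, len(attr), STUN_MAGIC_COOKIE)
    return header + bytes(12) + attr  # 12 zero bytes stand in for a transaction ID


demux = SinglePortDemux()
demux.register("srvA", "session-1")
client = ("203.0.113.5", 40000)
first = demux.route(build_binding_request("srvA:cli1"), client)
later = demux.route(b"\x16\xfe\xfd" + bytes(10), client)  # DTLS-looking packet
```

In this sketch the lookup tables are in-process dictionaries. Sharing that mapping across pods is exactly where the stateful ICE/DTLS constraint bites, and the sketch does not attempt to solve it.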

Why does it matter?

Voice AI only feels conversational when audio comes back within a couple hundred milliseconds. Under that threshold, users hear a real conversation; over it, they hear a turn-based bot. Production WebRTC at OpenAI's scale is rarely written about in this depth, and the patterns generalise to anyone building real-time voice or telephony agents.

Who is it for?

Voice and real-time engineers, and infra teams running WebRTC at scale.

