How OpenAI delivers low-latency voice AI at scale
TL;DR Highlight
OpenAI redesigned its WebRTC stack to serve real-time voice AI to over 900 million users; the post details the design decisions and trade-offs behind its relay + transceiver split architecture.
Who Should Read
Backend/infrastructure developers aiming to add real-time voice/audio features to apps, or developers struggling with port management or routing issues while operating WebRTC in a Kubernetes environment.
Core Mechanics
- OpenAI chose WebRTC because it’s a standardized protocol already implemented in browsers, mobile devices, and servers, eliminating the need to implement low-level processing like ICE (NAT traversal), DTLS/SRTP (encrypted transmission), codec negotiation, RTCP (quality control), and echo cancellation/jitter buffering.
- The most crucial characteristic of voice AI is that audio arrives as a continuous stream, allowing the model to simultaneously transcribe, infer, call tools, and generate speech while the user is speaking – creating the difference between a ‘conversational’ and a ‘push-to-talk’ feel.
- The traditional WebRTC server approach, the SFU (Selective Forwarding Unit), opens a separate port for each session. At OpenAI’s scale, this ‘one port per session’ model collided head-on with Kubernetes: stateful ICE/DTLS sessions had to be pinned to specific nodes, making horizontal scaling difficult.
- To solve this, OpenAI designed a relay + transceiver split architecture, placing relays at the global edge to minimize first-hop latency to clients, while transceivers handle actual media processing and model connections within the internal infrastructure.
- Clients experience standard WebRTC behavior while the underlying packet routing is completely different: relays use ICE credentials to steer each packet to the correct transceiver, which holds the stateful session.
- Combining global relays with geo-steering (automatic routing based on user location) ensures that connections land on the nearest relay worldwide, which is critical for maintaining low latency at a scale of 900 million users.
- The implementation leveraged the open-source Go WebRTC library Pion (https://github.com/pion/webrtc), and Pion’s creator, Sean DuBois, has since joined OpenAI.
- Currently, the Realtime API’s voice models are limited to the GPT-4o family, meaning the model’s capabilities aren’t at the level of the latest frontier models despite the architectural improvements.
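The relay + transceiver split above hinges on relays routing packets to the right stateful session without holding session state themselves. Below is a minimal Go sketch of the idea, assuming (the post does not publish implementation details) that relays key routing off the server-side ICE ufrag carried in the STUN USERNAME attribute, which per RFC 8445 is `serverUfrag:clientUfrag` as seen by the server; all names here (`relayRouter`, `transceiver-7.internal`) are hypothetical:

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// relayRouter is a hypothetical sketch of the stateless-relay side:
// it maps the ICE username fragment (ufrag) a session was created with
// to the internal transceiver that holds the stateful DTLS/SRTP session,
// so any edge relay can forward packets to the right place.
type relayRouter struct {
	mu    sync.RWMutex
	route map[string]string // ufrag -> transceiver address
}

func newRelayRouter() *relayRouter {
	return &relayRouter{route: make(map[string]string)}
}

// Register is called when a transceiver allocates a session and
// publishes its ICE ufrag to the routing layer.
func (r *relayRouter) Register(ufrag, transceiverAddr string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.route[ufrag] = transceiverAddr
}

// Lookup extracts the server-side ufrag (the part before the colon)
// from a STUN USERNAME value and returns the owning transceiver.
func (r *relayRouter) Lookup(stunUsername string) (string, error) {
	local, _, ok := strings.Cut(stunUsername, ":")
	if !ok {
		return "", fmt.Errorf("malformed STUN username %q", stunUsername)
	}
	r.mu.RLock()
	defer r.mu.RUnlock()
	addr, found := r.route[local]
	if !found {
		return "", fmt.Errorf("no transceiver for ufrag %q", local)
	}
	return addr, nil
}

func main() {
	router := newRelayRouter()
	router.Register("a1b2", "transceiver-7.internal:9000")

	addr, err := router.Lookup("a1b2:clientUfrag")
	if err != nil {
		panic(err)
	}
	fmt.Println(addr) // transceiver-7.internal:9000
}
```

Because the routing key travels in every STUN packet, the relay needs only a shared lookup table, not per-connection state, which is what makes the edge tier horizontally scalable.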
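The geo-steering described above can be illustrated with a nearest-relay pick. This is a hypothetical sketch using great-circle distance only; production steering would also weigh relay load, anycast routes, and link health:

```go
package main

import (
	"fmt"
	"math"
)

// relay is a hypothetical edge relay with its location in degrees.
type relay struct {
	name     string
	lat, lon float64
}

// haversineKm returns the great-circle distance between two points.
func haversineKm(lat1, lon1, lat2, lon2 float64) float64 {
	const earthRadiusKm = 6371.0
	rad := func(d float64) float64 { return d * math.Pi / 180 }
	dLat := rad(lat2 - lat1)
	dLon := rad(lon2 - lon1)
	a := math.Sin(dLat/2)*math.Sin(dLat/2) +
		math.Cos(rad(lat1))*math.Cos(rad(lat2))*math.Sin(dLon/2)*math.Sin(dLon/2)
	return 2 * earthRadiusKm * math.Asin(math.Sqrt(a))
}

// nearestRelay picks the relay closest to the client, minimizing the
// first hop that dominates perceived latency in a relayed architecture.
func nearestRelay(relays []relay, clientLat, clientLon float64) relay {
	best := relays[0]
	bestDist := haversineKm(clientLat, clientLon, best.lat, best.lon)
	for _, r := range relays[1:] {
		if d := haversineKm(clientLat, clientLon, r.lat, r.lon); d < bestDist {
			best, bestDist = r, d
		}
	}
	return best
}

func main() {
	relays := []relay{
		{"us-east", 39.0, -77.5},
		{"eu-west", 53.3, -6.3},
		{"ap-ne", 35.7, 139.7},
	}
	// A client in Berlin is steered to the Irish relay.
	fmt.Println(nearestRelay(relays, 52.5, 13.4).name) // eu-west
}
```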
Evidence
- Pion library developers thanked OpenAI for publicly acknowledging its use and recommended 'WebRTC for the Curious' (webrtcforthecurious.com) as an introductory resource.
- A veteran of a WebRTC + Kubernetes game-streaming product strongly disagreed, arguing that the problems OpenAI described were mostly issues with the libwebrtc implementation, and that proper feature flag configuration could reduce latency without paid network workarounds.
- Users shared experiences where low latency itself created UX problems, with the system incorrectly interpreting pauses as turn endings.
- OpenAI mentioned its open-source voice AI pipeline framework pipecat (https://github.com/pipecat-ai/pipecat), which commenters recommended as a good starting point.
- Questions arose about whether OpenAI replaced LiveKit with a custom WebRTC stack; the architecture explanation itself implied a custom build.
How to Apply
- If you’re running WebRTC servers in Kubernetes and facing scale-out limitations due to the one-port-per-session problem, consider redesigning your architecture with a relay (edge, stateless) and transceiver (internal, stateful) split, routing based on ICE credentials.
- To quickly prototype real-time voice AI services, explore pipecat (https://github.com/pipecat-ai/pipecat) or Pion (https://github.com/pion/webrtc) before implementing a WebRTC stack from scratch; both spare you the low-level protocol work.
- When implementing ‘end-of-turn detection’ logic for Voice AI, avoid relying solely on silence timers, as they can prematurely cut off users pausing to find a word; instead, make the silence threshold user-adjustable or design separate logic to distinguish mid-utterance pauses from turn endings.
- If you’re operating WebRTC based on libwebrtc, consider checking feature flag settings, as latency issues may be solvable through configuration before resorting to paid network solutions or complex infrastructure changes.
Terminology
Related Papers
Training an LLM in Swift, Part 1: Taking matrix mult from Gflop/s to Tflop/s
A detailed walkthrough of implementing matrix multiplication kernels in Swift on Apple Silicon, optimizing step by step through CPU, SIMD, AMX, and GPU (Metal) to push performance from Gflop/s to Tflop/s. A rare resource for developers who want to implement the core operations of LLM training from scratch, without frameworks, and feel out the performance limits of Apple Silicon.
Removing fsync from our local storage engine
FractalBits shares the design of an SSD-only KV storage engine built without fsync, achieving roughly 65% higher write performance under identical conditions. The core is a journal structure combining preallocation, O_DIRECT, and alignment to the SSD's atomic write unit to avoid fsync's metadata overhead.
Google Chrome silently installs a 4 GB AI model on your device without consent
Google Chrome was found to automatically download the 4 GB Gemini Nano model file without user consent, and to re-download it even after deletion. Possible GDPR violations and the environmental cost of rolling this out to billions of devices are being raised.
Efficient Test-Time Inference via Deterministic Exploration of Truncated Decoding Trees
Deterministic Leaf Enumeration (DLE) cuts self-consistency’s redundant sampling by deterministically exploring a tree of possible sequences, simultaneously improving math/code reasoning performance and speed.
Show HN: GoModel – an open-source AI gateway in Go
GoModel unifies access to OpenAI, Anthropic, Gemini, and other AI providers through a single, OpenAI-compatible API, offering a compiled-language alternative to LiteLLM.
Claude Token Counter, now with model comparisons