How OpenAI delivers low-latency voice AI at scale
TL;DR Highlight
OpenAI redesigned its WebRTC stack to serve real-time voice AI to over 900 million users; the post details the design decisions and trade-offs behind its relay + transceiver split architecture.
Who Should Read
Backend/infrastructure developers aiming to add real-time voice/audio features to apps, or developers struggling with port management or routing issues while operating WebRTC in a Kubernetes environment.
Core Mechanics
- OpenAI chose WebRTC because it’s a standardized protocol already implemented in browsers, mobile devices, and servers, eliminating the need to implement low-level processing like ICE (NAT traversal), DTLS/SRTP (encrypted transmission), codec negotiation, RTCP (quality control), and echo cancellation/jitter buffering.
- The most crucial characteristic of voice AI is that audio arrives as a continuous stream, allowing the model to simultaneously transcribe, infer, call tools, and generate speech while the user is speaking – creating the difference between a ‘conversational’ and a ‘push-to-talk’ feel.
- The traditional WebRTC server approach, the SFU (Selective Forwarding Unit), opens a separate port for each session. At OpenAI’s scale, this ‘one port per session’ model collided head-on with Kubernetes: stateful ICE/DTLS sessions had to be pinned to specific nodes, making horizontal scaling difficult.
- To solve this, OpenAI designed a relay + transceiver split architecture, placing relays at the global edge to minimize first-hop latency to clients, while transceivers handle actual media processing and model connections within the internal infrastructure.
- Clients experience standard WebRTC behavior while the underlying packet routing is completely different: relays use ICE credentials to steer each packet to the correct transceiver, which holds the stateful session.
- Combining global relays with geo-steering (automatic routing based on user location) ensures that connections land on the nearest relay worldwide, which is critical for maintaining low latency at a scale of 900 million users.
- The implementation leveraged the open-source Go WebRTC library Pion (https://github.com/pion/webrtc), and Pion’s creator, Sean DuBois, has since joined OpenAI.
- Currently, the Realtime API’s voice models are limited to the GPT-4o family, meaning the model’s capabilities aren’t at the level of the latest frontier models despite the architectural improvements.
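The relay + transceiver split above hinges on relays routing packets to the right stateful session without holding session state themselves. Below is a minimal Go sketch of the idea, assuming (the post does not publish implementation details) that relays key routing off the server-side ICE ufrag carried in the STUN USERNAME attribute, which per RFC 8445 is `serverUfrag:clientUfrag` as seen by the server; all names here (`relayRouter`, `transceiver-7.internal`) are hypothetical:

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// relayRouter is a hypothetical sketch of the stateless-relay side:
// it maps the ICE username fragment (ufrag) a session was created with
// to the internal transceiver that holds the stateful DTLS/SRTP session,
// so any edge relay can forward packets to the right place.
type relayRouter struct {
	mu    sync.RWMutex
	route map[string]string // ufrag -> transceiver address
}

func newRelayRouter() *relayRouter {
	return &relayRouter{route: make(map[string]string)}
}

// Register is called when a transceiver allocates a session and
// publishes its ICE ufrag to the routing layer.
func (r *relayRouter) Register(ufrag, transceiverAddr string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.route[ufrag] = transceiverAddr
}

// Lookup extracts the server-side ufrag (the part before the colon)
// from a STUN USERNAME value and returns the owning transceiver.
func (r *relayRouter) Lookup(stunUsername string) (string, error) {
	local, _, ok := strings.Cut(stunUsername, ":")
	if !ok {
		return "", fmt.Errorf("malformed STUN username %q", stunUsername)
	}
	r.mu.RLock()
	defer r.mu.RUnlock()
	addr, found := r.route[local]
	if !found {
		return "", fmt.Errorf("no transceiver for ufrag %q", local)
	}
	return addr, nil
}

func main() {
	router := newRelayRouter()
	router.Register("a1b2", "transceiver-7.internal:9000")

	addr, err := router.Lookup("a1b2:clientUfrag")
	if err != nil {
		panic(err)
	}
	fmt.Println(addr) // transceiver-7.internal:9000
}
```

Because the routing key travels in every STUN packet, the relay needs only a shared lookup table, not per-connection state, which is what makes the edge tier horizontally scalable.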
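The geo-steering described above can be illustrated with a nearest-relay pick. This is a hypothetical sketch using great-circle distance only; production steering would also weigh relay load, anycast routes, and link health:

```go
package main

import (
	"fmt"
	"math"
)

// relay is a hypothetical edge relay with its location in degrees.
type relay struct {
	name     string
	lat, lon float64
}

// haversineKm returns the great-circle distance between two points.
func haversineKm(lat1, lon1, lat2, lon2 float64) float64 {
	const earthRadiusKm = 6371.0
	rad := func(d float64) float64 { return d * math.Pi / 180 }
	dLat := rad(lat2 - lat1)
	dLon := rad(lon2 - lon1)
	a := math.Sin(dLat/2)*math.Sin(dLat/2) +
		math.Cos(rad(lat1))*math.Cos(rad(lat2))*math.Sin(dLon/2)*math.Sin(dLon/2)
	return 2 * earthRadiusKm * math.Asin(math.Sqrt(a))
}

// nearestRelay picks the relay closest to the client, minimizing the
// first hop that dominates perceived latency in a relayed architecture.
func nearestRelay(relays []relay, clientLat, clientLon float64) relay {
	best := relays[0]
	bestDist := haversineKm(clientLat, clientLon, best.lat, best.lon)
	for _, r := range relays[1:] {
		if d := haversineKm(clientLat, clientLon, r.lat, r.lon); d < bestDist {
			best, bestDist = r, d
		}
	}
	return best
}

func main() {
	relays := []relay{
		{"us-east", 39.0, -77.5},
		{"eu-west", 53.3, -6.3},
		{"ap-ne", 35.7, 139.7},
	}
	// A client in Berlin is steered to the Irish relay.
	fmt.Println(nearestRelay(relays, 52.5, 13.4).name) // eu-west
}
```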
Evidence
- Pion library developers thanked OpenAI for publicly acknowledging its use and recommended 'WebRTC for the Curious' (webrtcforthecurious.com) as an introductory resource.
- A veteran of a WebRTC + Kubernetes game-streaming product strongly disagreed, arguing that the problems OpenAI described were mostly issues with the libwebrtc implementation, and that proper feature flag configuration could reduce latency without paid network workarounds.
- Users shared experiences where low latency itself created UX problems, with the system incorrectly interpreting pauses as turn endings.
- OpenAI mentioned its open-source voice AI pipeline framework pipecat (https://github.com/pipecat-ai/pipecat), which commenters recommended as a good starting point.
- Questions arose about whether OpenAI replaced LiveKit with a custom WebRTC stack; the architecture explanation itself implied a custom build.
How to Apply
- If you’re running WebRTC servers in Kubernetes and facing scale-out limitations due to the one-port-per-session problem, consider redesigning your architecture with a relay (edge, stateless) and transceiver (internal, stateful) split, routing based on ICE credentials.
- To quickly prototype real-time voice AI services, explore pipecat (https://github.com/pipecat-ai/pipecat) or Pion (https://github.com/pion/webrtc) before implementing a WebRTC stack from scratch; both spare you the low-level protocol work.
- When implementing ‘end-of-turn detection’ logic for Voice AI, avoid relying solely on silence timers, as they can prematurely cut off users pausing to find a word; instead, make the silence threshold user-adjustable or design separate logic to distinguish mid-utterance pauses from turn endings.
- If you’re operating WebRTC based on libwebrtc, consider checking feature flag settings, as latency issues may be solvable through configuration before resorting to paid network solutions or complex infrastructure changes.
Terminology
Related Papers
Training an LLM in Swift, Part 1: Taking matrix mult from Gflop/s to Tflop/s
A detailed walkthrough of implementing matrix multiplication kernels in Swift on Apple Silicon, optimizing step by step through CPU, SIMD, AMX, and GPU (Metal) to push performance from Gflop/s to Tflop/s. A rare resource for developers who want to implement the core operations of LLM training from scratch, without frameworks, and feel out the performance limits of Apple Silicon.
Removing fsync from our local storage engine
FractalBits shares the design of an SSD-only KV storage engine built without fsync, achieving roughly 65% higher write performance under identical conditions. The core is a journal structure combining preallocation, O_DIRECT, and alignment to the SSD's atomic write unit to avoid fsync's metadata overhead.
Google Chrome silently installs a 4 GB AI model on your device without consent
Google Chrome was found to automatically download the 4 GB Gemini Nano model file without user consent, and to re-download it even after deletion. Possible GDPR violations and the environmental cost of rolling this out to billions of devices are being raised.
Efficient Test-Time Inference via Deterministic Exploration of Truncated Decoding Trees
Deterministic Leaf Enumeration (DLE) cuts self-consistency’s redundant sampling by deterministically exploring a tree of possible sequences, simultaneously improving math/code reasoning performance and speed.
Show HN: GoModel – an open-source AI gateway in Go
GoModel unifies access to OpenAI, Anthropic, Gemini, and other AI providers through a single, OpenAI-compatible API, offering a compiled-language alternative to LiteLLM.
Claude Token Counter, now with model comparisons