Show HN: I built a sub-500ms latency voice agent from scratch
TL;DR Highlight
Building your own STT→LLM→TTS voice pipeline from scratch can achieve roughly half the latency of all-in-one platforms like Vapi — here's how.
Who Should Read
Developers building real-time voice AI applications who are hitting latency walls with managed platforms and are ready to run the pipeline themselves.
Core Mechanics
- The DIY pipeline architecture: Deepgram (STT) → LLM streaming API → ElevenLabs/Cartesia (TTS), with careful attention to streaming handoffs between each step.
- The key latency win comes from streaming: start TTS synthesis on the first few words of the LLM output rather than waiting for the full response.
- Managed platforms like Vapi add latency through abstraction layers and round-trip overhead — building directly against the APIs eliminates this.
- The author measured ~800ms end-to-end latency on the DIY pipeline vs. ~1600ms on Vapi for comparable quality settings.
- Tradeoffs: you now own reliability, error handling, voice activity detection (VAD), and turn-taking logic — things managed platforms handle for you.
- WebSockets throughout the pipeline (not HTTP) are essential for minimizing latency — avoid any HTTP request/response roundtrips in the hot path.
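The streaming handoff described above can be sketched with a sentence-boundary chunker: buffer LLM tokens as they arrive and hand each complete sentence to TTS immediately, instead of waiting for the full response. This is an illustrative sketch, not the author's code — the token list and the regex for sentence boundaries are assumptions.

```python
import re

# Matches sentence-ending punctuation (optionally followed by a closing
# quote/bracket) plus trailing whitespace — an assumed, simplistic
# boundary heuristic for illustration.
SENTENCE_END = re.compile(r'[.!?]["\')\]]?\s')

def sentence_chunks(token_stream):
    """Yield text chunks at natural sentence boundaries as tokens stream in."""
    buffer = ""
    for token in token_stream:
        buffer += token
        match = SENTENCE_END.search(buffer)
        while match:
            # A full sentence is available: emit it so TTS can start now.
            yield buffer[:match.end()].strip()
            buffer = buffer[match.end():]
            match = SENTENCE_END.search(buffer)
    if buffer.strip():  # flush any trailing partial sentence at stream end
        yield buffer.strip()

# Simulated LLM token stream (hypothetical); each yielded chunk would be
# forwarded to the TTS WebSocket as soon as it is produced.
tokens = ["Hello ", "there. ", "How can ", "I help ", "you today?"]
print(list(sentence_chunks(tokens)))
# → ['Hello there.', 'How can I help you today?']
```

In a real pipeline the chunker sits between the LLM stream consumer and the TTS client, so audio for the first sentence plays while later sentences are still being generated.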
Evidence
- The author shared benchmark measurements comparing DIY pipeline latency against Vapi with the same STT/LLM/TTS components.
- HN commenters with voice AI experience corroborated the latency numbers, noting that streaming chunk handoffs are the biggest optimization lever.
- Some pointed out that Vapi and similar platforms have been improving their latency, so the gap may narrow — but the DIY approach still wins for the most latency-sensitive use cases.
- Others noted that the '2x faster' claim depends heavily on network conditions and component choices — results vary.
How to Apply
- Start TTS synthesis as soon as you have a natural sentence boundary in the LLM stream — don't wait for the full response. This alone can cut perceived latency by 40–50%.
- Use WebSockets for all pipeline components — Deepgram, your LLM endpoint, and TTS. Avoid HTTP polling in the real-time path.
- Implement voice activity detection (VAD) locally in the browser/client rather than on the server to reduce turn-detection latency.
- Profile each stage separately: STT latency, LLM first-token latency, TTS first-audio latency. The bottleneck shifts by use case and you need data to optimize intelligently.
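The per-stage profiling advice can be implemented with a small timer that records elapsed time at each pipeline milestone. A minimal sketch follows — the stage names and `time.sleep` calls are stand-ins for real STT/LLM/TTS events, not actual API calls.

```python
import time

class StageTimer:
    """Record elapsed milliseconds since pipeline start for named stages."""

    def __init__(self):
        self.marks = {}
        self.start = time.perf_counter()

    def mark(self, name):
        # Milliseconds elapsed since the timer was created.
        self.marks[name] = (time.perf_counter() - self.start) * 1000.0

    def report(self):
        return {name: round(ms, 1) for name, ms in self.marks.items()}

timer = StageTimer()
time.sleep(0.05)              # stand-in for: STT emits final transcript
timer.mark("stt_final")
time.sleep(0.10)              # stand-in for: LLM yields first token
timer.mark("llm_first_token")
time.sleep(0.08)              # stand-in for: TTS returns first audio chunk
timer.mark("tts_first_audio")
print(timer.report())
```

Calling `mark()` at each real milestone (final transcript, first LLM token, first TTS audio byte) gives the per-stage breakdown needed to find the bottleneck for a given use case.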
Terminology
STT: Speech-to-Text — transcribing spoken audio to text. Deepgram is a popular streaming STT API.
TTS: Text-to-Speech — synthesizing spoken audio from text. ElevenLabs and Cartesia are popular options.
VAD: Voice Activity Detection — determining when a user is speaking vs. silent, used for turn-taking in voice AI.
Vapi: A managed platform for building voice AI agents that abstracts the STT/LLM/TTS pipeline but adds latency overhead.
Streaming: Sending and processing data in chunks as it's generated rather than waiting for the full output — critical for voice AI latency.