Show HN: I built a sub-500ms latency voice agent from scratch

I built a voice agent from scratch that averages ~400ms end-to-end latency (phone stop → first syllable). That’s with full STT → LLM → TTS in the loop, clean barge-ins, and no precomputed responses.

What moved the needle:

Voice is a turn-taking problem, not a transcription problem. VAD alone fails; you need semantic end-of-turn detection.

The system reduces to one loop: speaking vs listening. The two transitions - cancel instantly on barge-in, respond instantly on end-of-turn - define the experience.

STT → LLM → TTS must stream. Sequential pipelines are dead on arrival for natural conversation.

TTFT dominates everything. In voice, the first token is the critical path. Groq’s ~80ms TTFT was the single biggest win.

Geography matters more than prompts. Colocate everything or you lose before you start.

Comments URL: https://news.ycombinator.com/item?id=47224295
Points: 11
# Comments: 3
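The speaking vs listening loop described above can be sketched as a two-state machine. This is a minimal illustration, not the code from the post: the names `State`, `Event`, `TurnLoop`, and the action strings `start_response` / `cancel_response` are all hypothetical, and the events are assumed to come from an end-of-turn detector and a barge-in detector that are not shown here.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class State(Enum):
    LISTENING = auto()
    SPEAKING = auto()


class Event(Enum):
    END_OF_TURN = auto()    # assumed: semantic end-of-turn detector fired
    BARGE_IN = auto()       # assumed: user started talking over the agent
    PLAYBACK_DONE = auto()  # assumed: agent finished speaking its reply


@dataclass
class TurnLoop:
    """Hypothetical turn-taking loop: two states, two transitions that matter."""
    state: State = State.LISTENING
    actions: list = field(default_factory=list)  # records what would be triggered

    def handle(self, event: Event) -> State:
        if self.state is State.LISTENING and event is Event.END_OF_TURN:
            # Respond immediately: kick off the streaming STT -> LLM -> TTS reply.
            self.actions.append("start_response")
            self.state = State.SPEAKING
        elif self.state is State.SPEAKING and event is Event.BARGE_IN:
            # Cancel immediately: stop generation and playback, go back to listening.
            self.actions.append("cancel_response")
            self.state = State.LISTENING
        elif self.state is State.SPEAKING and event is Event.PLAYBACK_DONE:
            self.state = State.LISTENING
        # Any other (state, event) pair is ignored.
        return self.state


loop = TurnLoop()
loop.handle(Event.END_OF_TURN)  # user finished -> agent starts speaking
loop.handle(Event.BARGE_IN)     # user interrupts -> agent cancels and listens
print(loop.state.name, loop.actions)
```

In a real agent the two recorded actions would start and cancel the streaming pipeline; keeping them as plain strings here makes the transitions easy to inspect in isolation.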
