I Built a Voice AI with Sub-500ms Latency. Here's the Echo Cancellation Problem Nobody Talks About

via Dev.to · Konstantin

When I started building GoNoGo.team, a platform where AI agents interview founders by voice to validate startup ideas, I thought the hard part would be the AI reasoning. The multi-agent orchestration. The 40+ function-calling tools.

I was wrong. The hard part was echo. Specifically: how do you stop an AI agent from hearing itself talk, freaking out, and interrupting its own sentence?

After 500+ voice sessions and too many late nights staring at RMS waveforms, here's what I actually learned.

The Setup: Speech-to-Speech, Not STT → LLM → TTS

GoNoGo runs on the Gemini 2.5 Flash Live API, a true speech-to-speech pipeline. There's no intermediate transcription step, no text-to-speech synthesis layer bolted on afterward. Audio goes in, audio comes out. Direct.

This is important because it changes everything about how you handle audio on the client. You're not working with text buffers. You're working with raw PCM: 16 kHz input from the browser mic, 24 kHz output from the agent voice. Base64-enco
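To make the raw-PCM handling above concrete, here is a minimal sketch in TypeScript. It is not the author's code. It assumes 16-bit signed samples and a little-endian host, and every function name (`floatToPcm16`, `pcm16ToBase64`, `base64ToPcm16`, `rms`) is hypothetical. It converts the Float32 samples a browser audio graph produces into PCM16, Base64-encodes them for transport, decodes them again, and computes the RMS level of a chunk, which is the quantity you end up watching when chasing echo.

```typescript
// Sketch only: assumes 16-bit signed PCM and a little-endian host.
// All function names here are hypothetical, not from the article.

// Convert Float32 samples in [-1, 1] to 16-bit signed PCM.
function floatToPcm16(samples: Float32Array): Int16Array {
  const out = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    // Clamp first so out-of-range input cannot wrap around.
    const s = Math.max(-1, Math.min(1, samples[i]));
    out[i] = s < 0 ? Math.round(s * 0x8000) : Math.round(s * 0x7fff);
  }
  return out;
}

// Base64-encode the raw bytes of a PCM16 chunk.
function pcm16ToBase64(pcm: Int16Array): string {
  const bytes = new Uint8Array(pcm.buffer, pcm.byteOffset, pcm.byteLength);
  let binary = "";
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return btoa(binary);
}

// Decode a Base64 string back into PCM16 samples.
function base64ToPcm16(b64: string): Int16Array {
  const binary = atob(b64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  // A trailing odd byte, if any, is ignored.
  return new Int16Array(bytes.buffer, 0, Math.floor(bytes.length / 2));
}

// Root-mean-square level of a PCM16 chunk, normalized to the range [0, 1].
function rms(pcm: Int16Array): number {
  if (pcm.length === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < pcm.length; i++) {
    const x = pcm[i] / 32768;
    sumSquares += x * x;
  }
  return Math.sqrt(sumSquares / pcm.length);
}
```

In a browser you would typically call `floatToPcm16` on each block delivered by the microphone capture path, send the Base64 string over the session's socket, and compare `rms` of the mic chunk against a threshold while the agent is speaking. The threshold and the capture mechanism are left out here because the truncated source does not specify them.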
