I Built a Voice AI with Sub-500ms Latency. Here's the Echo Cancellation Problem Nobody Talks About

When I started building GoNoGo.team — a platform where AI agents interview founders by voice to validate startup ideas — I thought the hard part would be the AI reasoning. The multi-agent orchestration. The 40+ function-calling tools. I was wrong. The hard part was echo. Specifically: how do you stop an AI agent from hearing itself talk, freaking out, and interrupting its own sentence? After 500+ voice sessions and too many late nights staring at RMS waveforms, here's what I actually learned. The Setup: Speech-to-Speech, Not STT → LLM → TTS GoNoGo runs on Gemini 2.5 Flash Live API — a true speech-to-speech pipeline. There's no intermediate transcription step, no text-to-speech synthesis layer bolted on afterward. Audio goes in, audio comes out. Direct. This is important because it changes everything about how you handle audio on the client. You're not working with text buffers. You're working with raw PCM, 16kHz input from the browser mic, 24kHz output from the agent voice. Base64-enco

I Built a Voice AI with Sub-500ms Latency. Here's the Echo Cancellation Problem Nobody Talks About

Related Articles

Building DNS query tool from scratch using C

How to build .NET obfuscator - Part I

How to Use Traceroute and MTR to Diagnose Network Issues

apt-key Deprecation: Add Repositories with GPG on Ubuntu

How To Use Variadic Functions in Go

Related Articles

How-To
Building DNS query tool from scratch using C
Reddit Programming • 2d ago

How-To
How to build .NET obfuscator - Part I
Reddit Programming • 2d ago

How-To
How to Use Traceroute and MTR to Diagnose Network Issues
DigitalOcean Tutorials • 1w ago

How-To
apt-key Deprecation: Add Repositories with GPG on Ubuntu
DigitalOcean Tutorials • 1w ago

How-To
How To Use Variadic Functions in Go
DigitalOcean Tutorials • 2w ago