
# I Built a Voice Interface for My AI Agent in 2 Hours (Flask + Web Speech API + TTS)
I had a free Saturday afternoon and a clear goal: talk to my AI agent out loud and hear it talk back. Two hours later, Atlas had a voice. Here's exactly how I built it — Flask backend, Web Speech API for input, Mistral's Voxtral TTS for output, and a canvas animation that makes the avatar's eyes glow in sync with the audio.

## The Stack

- **Flask** — tiny backend, two endpoints
- **Web Speech API** — browser-native speech-to-text (Chrome only, push-to-talk)
- **Mistral Voxtral TTS** — `voxtral-mini-tts-2603`, returns base64 MP3
- **macOS `say` command** — fallback when Voxtral is unavailable
- **Web Audio API `AnalyserNode`** — drives the canvas glow animation

## Architecture in 30 Seconds

The flow is simple:

1. User holds Space → Chrome's `SpeechRecognition` runs locally
2. On final result, the transcript POSTs to `/api/chat`
3. Flask calls the Mistral chat API (`mistral-large-latest`) → gets a text response
4. Flask calls Voxtral TTS → returns base64 MP3
5. The browser decodes the MP3 and plays it through an `AnalyserNode`
6. The canvas reads frequency data every frame
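The Flask → browser handoff in steps 4–5 is just base64 over JSON. Here's a minimal sketch of both sides, written in Python for illustration (in the real app the decode side is the browser, using `atob` and `decodeAudioData`); the payload key `audio_b64` is my own naming, not necessarily the article's:

```python
import base64
import json


def audio_payload(mp3_bytes: bytes) -> str:
    # Flask side: wrap the raw MP3 bytes from TTS as base64 inside JSON,
    # so the response stays plain text and easy to fetch().
    return json.dumps({"audio_b64": base64.b64encode(mp3_bytes).decode("ascii")})


def decode_payload(payload: str) -> bytes:
    # Browser side (shown here in Python): unwrap JSON, decode base64 back
    # to the original MP3 bytes before handing them to the audio pipeline.
    return base64.b64decode(json.loads(payload)["audio_b64"])


# Round-trip with stand-in bytes (not a real MP3) to show the handoff is lossless.
fake_mp3 = b"\xff\xfb\x90\x00" + b"\x00" * 16
assert decode_payload(audio_payload(fake_mp3)) == fake_mp3
```

Base64 inflates the audio by about a third, but for short spoken replies that overhead is negligible and it avoids dealing with binary responses in the browser.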
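For step 6, the canvas loop reads the analyser's byte frequency data (one value per bin, 0–255) and turns it into a glow level for the eyes. The averaging formula below is my guess at a reasonable mapping rather than the article's exact code, and it's sketched in Python instead of the browser's JavaScript:

```python
def glow_intensity(freq_bytes: list[int]) -> float:
    """Map 0-255 frequency bins (as from getByteFrequencyData) to a 0.0-1.0 glow."""
    if not freq_bytes:
        return 0.0  # no audio data yet -> eyes dark
    # Average energy across all bins, normalized to [0.0, 1.0].
    return sum(freq_bytes) / (len(freq_bytes) * 255)


assert glow_intensity([0, 0, 0]) == 0.0      # silence
assert glow_intensity([255] * 8) == 1.0      # full-scale signal
```

In the real page this runs inside a `requestAnimationFrame` loop, so the glow tracks the audio at display refresh rate without any extra timers.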
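The macOS fallback from the stack list can be a one-line shell-out. A sketch with a helper name of my own invention: `say -o` writes the synthesized speech to an audio file, and the command only actually runs when `say` exists on the PATH (i.e. on a Mac):

```python
import shutil
import subprocess


def speak_fallback(text: str, out_path: str = "reply.aiff") -> list[str]:
    """Synthesize `text` with macOS `say` when available; return the command used."""
    cmd = ["say", "-o", out_path, text]
    if shutil.which("say") is not None:  # `say` ships only with macOS
        subprocess.run(cmd, check=True)
    return cmd


cmd = speak_fallback("Voxtral is down, but I can still talk.")
assert cmd[:2] == ["say", "-o"]
```

Returning the command list keeps the helper testable on non-Mac machines, which is also why the subprocess call is guarded rather than unconditional.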
Continue reading on Dev.to
