Building a Continuous Voice Interface with the OpenAI Realtime API
How-To · Tools

By Deola Adediran, via Dev.to

A technical walkthrough of how the ABD Assistant voice command system works end-to-end, from raw microphone bytes to tool execution.

The Core Architecture

The system has three moving parts: a browser Web Audio capture layer, an Express WebSocket relay, and OpenAI's Realtime API as the voice brain. The browser streams PCM audio directly to OpenAI over a WebSocket that stays open for the entire session. OpenAI performs server-side voice activity detection (VAD), transcribes speech incrementally, runs its LLM over the conversation history, and streams back audio tokens as they're generated. This means no client-side silence detection, no turn-management logic, and no separate transcription step — one pipeline, fully server-driven.

Audio Capture: The Hard Part

Capturing audio correctly is where most implementations fall apart. The key constraint: OpenAI's Realtime API expects mono PCM at 24kHz as 16-bit signed integers. The browser's MediaRecorder produces audio/webm or audio/opus — a completely di…
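As a rough sketch of the format constraint described above: Web Audio hands the application Float32 samples at the device's capture rate (commonly 48kHz), which must be converted to 16-bit signed mono PCM at 24kHz before streaming. The function names and the naive nearest-sample decimation below are illustrative assumptions, not code from the article; a production pipeline would low-pass filter before downsampling to avoid aliasing.

```typescript
// Clamp each Float32 sample in [-1, 1] and scale it to a signed 16-bit integer.
function floatTo16BitPCM(input: Float32Array): Int16Array {
  const out = new Int16Array(input.length);
  for (let i = 0; i < input.length; i++) {
    const s = Math.max(-1, Math.min(1, input[i]));
    // Negative range is one step wider (-32768) than positive (32767).
    out[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return out;
}

// Naive decimation from the capture rate down to the target rate
// (e.g. 48000 -> 24000). Picks the nearest source sample; no filtering.
function downsample(
  input: Float32Array,
  fromRate: number,
  toRate: number
): Float32Array {
  const ratio = fromRate / toRate;
  const out = new Float32Array(Math.floor(input.length / ratio));
  for (let i = 0; i < out.length; i++) {
    out[i] = input[Math.floor(i * ratio)];
  }
  return out;
}
```

The resulting Int16Array's underlying buffer is what would be base64-encoded or sent as binary frames over the session WebSocket.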

Continue reading on Dev.to
