
We analyzed 10,000 voice AI calls. The LLM was rarely the problem.
We built Dograh OSS, an open-source voice AI platform. When we started, we assumed most failures would come from the LLM: bad answers, missed intent, prompt edge cases. So we spent a lot of early effort there.

Then we looked at the data. We ran automated QA in which an LLM reviews every turn of every call and tags what went right and wrong, and we spent hours listening to calls ourselves. Across roughly 10,000 calls spanning customer support, appointment booking, and lead qualification, the failure picture looked nothing like what we expected. The problems that showed up again and again were about the phone call as a medium: timing, audio physics, and infrastructure designed decades before LLMs existed.

Here is what we found, roughly ranked by frequency:

| Failure area | Share | Primary driver |
| --- | --- | --- |
| STT / word error rate | ~38% | Low-quality telephony audio and accent variation |
| First-8-second chaos | ~34% | Greeting latency, barge-in, variable user behavior |
| Interruption handling | ~28% | Filler words breaking |
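The automated QA pass described above can be sketched roughly like this: each turn of a call transcript is sent to a judge LLM along with the preceding context and a fixed tag taxonomy, and the tags are aggregated per call. This is a minimal sketch, not Dograh's actual pipeline; `call_llm` is a hypothetical stand-in for whatever model client you use, and the tag names are illustrative.

```python
from dataclasses import dataclass

# Illustrative tag taxonomy -- not Dograh's actual labels.
TAGS = ["stt_error", "greeting_latency", "barge_in", "interruption", "ok"]

@dataclass
class Turn:
    speaker: str  # "agent" or "user"
    text: str

def call_llm(prompt: str) -> str:
    """Hypothetical model client; swap in your provider's SDK.
    Stubbed here so the sketch runs end to end."""
    return "ok"

def tag_turn(turn: Turn, context: list[Turn]) -> str:
    """Ask the judge LLM to label one turn with a single failure tag."""
    transcript = "\n".join(f"{t.speaker}: {t.text}" for t in context)
    prompt = (
        f"Conversation so far:\n{transcript}\n\n"
        f"Current turn ({turn.speaker}): {turn.text}\n"
        f"Label this turn with exactly one of: {', '.join(TAGS)}"
    )
    label = call_llm(prompt).strip()
    return label if label in TAGS else "ok"  # fall back on unparseable output

def tag_call(turns: list[Turn]) -> dict[str, int]:
    """Tag every turn and return tag frequencies for the whole call."""
    counts: dict[str, int] = {}
    for i, turn in enumerate(turns):
        label = tag_turn(turn, turns[:i])
        counts[label] = counts.get(label, 0) + 1
    return counts
```

Run over thousands of calls, the per-call tag counts roll up into the frequency ranking shown in the table: the restriction to a closed tag set is what makes the LLM's judgments aggregatable.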



