
Agents Don't Fail at AI — They Fail at DevOps
When people imagine AI agents failing, they picture the wrong things. They imagine hallucinations, confused responses, bad reasoning. Those happen, but they are not where production systems actually break. Production agents fail at DevOps. Orphaned processes nobody is watching. Context windows that hit the ceiling mid-task. Auth tokens that expire silently and cause agents to fail for hours before anyone notices. Logs that fill disks. Services that restart into broken state and stay there. I have been running 23 agents in production across five businesses for six months. The model has almost never been the problem. The ops layer has been the problem, repeatedly, in ways that were entirely preventable. Here is what broke and how I fixed it. Failure 1: Orphaned Processes The first major incident was an agent that got stuck in a loop. The session appeared active, was consuming API credits, but was not producing useful output. Nobody noticed for almost four hours because there was no alert
Continue reading on Dev.to
Opens in a new tab

