
# Your AI Agent Stopped Responding 2 Hours Ago. Nobody Noticed.
Your agent is deployed. The pod is running. The container passes liveness probes. Grafana shows a flat green line. Everything looks fine.

Except the agent stopped processing work 2 hours ago. It's alive, the process is there, but it's stuck. Deadlocked on a thread. Blocked on a full queue. Spinning in a retry loop that will never succeed. Silently swallowing exceptions in a `while True`.

Nobody knows until a customer reports it. Or until someone opens a dashboard at 5 PM and wonders why the task queue has been growing all afternoon.

## Why Container Health Checks Don't Work for Agents

Kubernetes liveness probes check one thing: is the process responding to HTTP? If your agent serves a `/healthz` endpoint, the probe passes. The agent is "healthy."

But responding to `/healthz` and processing work are two different things. An agent can:

- Deadlock on an internal lock while still serving HTTP
- OOM-kill its worker thread while the main thread stays alive
- Enter an infinite retry loop on a broken downstream API
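To make the failure mode concrete, here's a minimal sketch (not from the article; `process_next_task` is a hypothetical stand-in for the agent's real work). The main thread answers `/healthz` with 200 forever, so the liveness probe passes, while the worker thread spins in exactly the kind of `while True` described above, swallowing every exception:

```python
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only proves the process can serve HTTP, not that work is happening.
        if self.path == "/healthz":
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok")
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, format, *args):
        pass  # silence request logging for the demo


def process_next_task():
    # Hypothetical work function; simulate a permanently broken downstream.
    raise RuntimeError("downstream unavailable")


def worker():
    while True:
        try:
            process_next_task()
        except Exception:
            # Silently swallowed: the thread stays alive and keeps spinning,
            # but no signal ever reaches the liveness probe.
            time.sleep(1)


threading.Thread(target=worker, daemon=True).start()
# The main thread serves /healthz forever, so Kubernetes sees "healthy".
HTTPServer(("", 8080), HealthHandler).serve_forever()
```

Run it and `curl localhost:8080/healthz` returns `ok` indefinitely, even though not a single task will ever complete. Kubernetes has no reason to restart the pod.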

