
When Your AI Agent Has an Incident, Your Runbook Isn't Ready
Your on-call engineer gets paged at 2am. The alert says your customer-facing AI agent is misbehaving: producing garbled outputs, possibly taking unintended actions, burning through tokens at ten times the expected rate. They open the runbook. The runbook says: check the error rate, examine the trace, identify the failing component, roll back or patch. None of that applies.

The error rate is fine; the agent is executing successfully, it's just doing the wrong thing. The "trace" is a wall of LLM completions with no clear causal structure. There's no component to isolate, because the failure is in reasoning, not in code. And rolling back the agent deployment doesn't roll back whatever the agent already did.

This is the gap most engineering teams discover at the worst possible time. According to LangChain's 2026 State of Agent Engineering report, 57% of organizations now have agents running in production, yet the same research found quality, cited by 32% of respondents, to be their top blocker.
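To make the gap concrete, here is a minimal sketch of why the runbook's first step misses this class of incident: a check keyed to error rate stays green while a token-burn check fires. Everything here (the AgentWindow shape, the classify helper, the thresholds) is hypothetical illustration, not the API of any particular monitoring tool.

```python
from dataclasses import dataclass

@dataclass
class AgentWindow:
    """Metrics for one agent over a sliding window (hypothetical shape)."""
    completions: int        # LLM calls that returned without error
    errors: int             # LLM calls that raised or timed out
    tokens_consumed: int    # total tokens across the window
    window_minutes: float

def classify(window: AgentWindow, expected_tokens_per_min: float,
             burn_multiplier: float = 10.0, error_threshold: float = 0.05) -> str:
    """Return a coarse incident label for the window.

    The point: a pure error-rate alert stays green while the agent
    "successfully" burns tokens doing the wrong thing.
    """
    total = window.completions + window.errors
    error_rate = window.errors / total if total else 0.0
    tokens_per_min = window.tokens_consumed / window.window_minutes

    if error_rate > error_threshold:
        return "hard-failure"      # classic runbook territory
    if tokens_per_min > burn_multiplier * expected_tokens_per_min:
        return "runaway-agent"     # error rate fine, behavior is not
    return "nominal"

# The 2am scenario: under 1% errors, but 12x the expected token rate.
window = AgentWindow(completions=480, errors=3,
                     tokens_consumed=2_400_000, window_minutes=10)
print(classify(window, expected_tokens_per_min=20_000))  # -> "runaway-agent"
```

In that window the error rate is about 0.6%, so a conventional error-rate alert never fires; the only signal left is consumption, which is exactly the dimension most runbooks don't tell the on-call engineer to check.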