
v0.3: Your AI Agent's Decisions Now Get Graded Automatically
Agent Forensics started as a simple decision logger: record what the agent did, generate a report, figure out what went wrong. That's no longer enough. After publishing the v0.2 update, a Reddit commenter dropped this:

"A decision log helps you find these after the fact, but the harder problem is preventing them. The most useful insight isn't 'what went wrong' — it's 'where did the model encounter ambiguity and pick one interpretation without flagging it.'"

Another commenter laid out three concrete features they wanted:

"Deterministic replay: store model name, temperature, seed so you can rerun the exact trace. Guardrail checkpoints: log pre and post tool-call intent plus an allow/deny reason. Eval hooks: auto-label common failure modes so you can aggregate across sessions."

So I built all three.

What's New in v0.3

1. Guardrail Checkpoints — "Was This Action Approved?"

The single most common agent failure mode: the agent does something the user didn't approve. A shopping agent buys a s
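To make the guardrail-checkpoint idea concrete, here is a minimal sketch of what logging pre-tool-call intent plus an allow/deny reason could look like. This is an illustration, not the Agent Forensics API: the `Checkpoint` dataclass, the `guardrail` function, and the allow-list policy are all hypothetical names I'm using for the example.

```python
# Hypothetical sketch of a guardrail checkpoint: before a tool call runs,
# record what the agent intends to do and whether policy allows it.
from dataclasses import dataclass, field
import time


@dataclass
class Checkpoint:
    tool: str        # which tool the agent is about to invoke
    intent: str      # the agent's stated reason for the call
    allowed: bool    # the allow/deny decision
    reason: str      # why it was allowed or denied
    timestamp: float = field(default_factory=time.time)


def guardrail(tool: str, intent: str,
              policy: dict[str, bool],
              log: list[Checkpoint]) -> bool:
    """Log an allow/deny checkpoint and return whether the call may proceed."""
    allowed = policy.get(tool, False)  # deny anything not explicitly allow-listed
    reason = "tool allow-listed" if allowed else "tool not on allow-list"
    log.append(Checkpoint(tool, intent, allowed, reason))
    return allowed


# Usage: a purchase attempt is denied because only "search" is allow-listed.
log: list[Checkpoint] = []
ok = guardrail("purchase", "buy the item in the cart", {"search": True}, log)
print(ok, "-", log[0].reason)  # False - tool not on allow-list
```

The point of the checkpoint is that even the denied attempt lands in the log with its intent and reason, so a later report can show not just what the agent did, but what it tried to do and why it was stopped.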
Continue reading on Dev.to Webdev



