
v0.3: Your AI Agent's Decisions Now Get Graded Automatically
Agent Forensics started as a simple decision logger: record what the agent did, generate a report, figure out what went wrong. That's no longer enough. After publishing the v0.2 update, a Reddit commenter dropped this:

"A decision log helps you find these after the fact, but the harder problem is preventing them. The most useful insight isn't 'what went wrong' — it's 'where did the model encounter ambiguity and pick one interpretation without flagging it.'"

Another commenter laid out three concrete features they wanted:

"Deterministic replay: store model name, temperature, seed so you can rerun the exact trace. Guardrail checkpoints: log pre and post tool-call intent plus an allow/deny reason. Eval hooks: auto-label common failure modes so you can aggregate across sessions."

So I built all three.

What's New in v0.3

1. Guardrail Checkpoints — "Was This Action Approved?"

The single most common agent failure mode: the agent does something the user didn't approve. A shopping agent buys a s
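To make the guardrail-checkpoint idea concrete, here is a minimal sketch of what logging pre-tool-call intent plus an allow/deny reason could look like. This is an illustration, not the Agent Forensics API: the `Checkpoint` dataclass, the `guardrail` function, and the allow-list policy are all hypothetical names I'm using for the example.

```python
# Hypothetical sketch of a guardrail checkpoint: before a tool call runs,
# record what the agent intends to do and whether policy allows it.
from dataclasses import dataclass, field
import time


@dataclass
class Checkpoint:
    tool: str        # which tool the agent is about to invoke
    intent: str      # the agent's stated reason for the call
    allowed: bool    # the allow/deny decision
    reason: str      # why it was allowed or denied
    timestamp: float = field(default_factory=time.time)


def guardrail(tool: str, intent: str,
              policy: dict[str, bool],
              log: list[Checkpoint]) -> bool:
    """Log an allow/deny checkpoint and return whether the call may proceed."""
    allowed = policy.get(tool, False)  # deny anything not explicitly allow-listed
    reason = "tool allow-listed" if allowed else "tool not on allow-list"
    log.append(Checkpoint(tool, intent, allowed, reason))
    return allowed


# Usage: a purchase attempt is denied because only "search" is allow-listed.
log: list[Checkpoint] = []
ok = guardrail("purchase", "buy the item in the cart", {"search": True}, log)
print(ok, "-", log[0].reason)  # False - tool not on allow-list
```

The point of the checkpoint is that even the denied attempt lands in the log with its intent and reason, so a later report can show not just what the agent did, but what it tried to do and why it was stopped.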
Continue reading on Dev.to Webdev



