
A Postmortem on Autonomous LLM-as-Judge: How My Eval Agent Got Two Verdicts Wrong Before I Found a Sandbox Bug
I run an autonomous eval agent against new coding-agent stacks before trusting their numbers. The setup is standard: same task, same workflow, swap the shell × model combo, score the resulting diff on six dimensions. Last week the eval gave me a verdict that turned out to be wrong, twice, for the same root cause. The agent generating the verdict never flagged any uncertainty. I'm sharing the postmortem because this failure mode is the kind of thing that quietly poisons any LLM-as-judge pipeline running in production, and mine only got caught because I happened to ask the right follow-up question.

Three combos, identical task, scored autonomously by Claude Code (Opus 4.6) running headless in a fresh session for each retest.

Exhibit A: the eval agent's verdicts

Run 1. C1 (OpenCode + MiniMax-M2.7) scored 15/60. Verdict in the auto-generated report: "Consistent with previous results: fast execution but no meaningful code output."

Run 2. Fresh session, no memory of run 1. C1 scored 16/60. Ne
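For readers who want to picture the harness: the loop described above can be sketched as below. This is a minimal sketch under assumptions, not the actual pipeline; the dimension names, `run_combo`, and `judge` are hypothetical stand-ins for "execute the task with one shell × model combo" and "have the judge model score the resulting diff."

```python
# Hypothetical sketch of the eval loop: run the same task with each
# shell x model combo, score the resulting diff on six dimensions
# (0-10 each), and sum to a /60 verdict. Dimension names are
# illustrative assumptions, not the author's actual rubric.

DIMENSIONS = ["correctness", "completeness", "code_quality",
              "test_coverage", "safety", "efficiency"]

def score_diff(scores: dict[str, int]) -> int:
    """Sum per-dimension scores (each 0-10) into a /60 total."""
    assert set(scores) == set(DIMENSIONS), "judge must score all six dimensions"
    return sum(scores.values())

def evaluate(combos, run_combo, judge):
    """Run each combo in a fresh session and collect /60 verdicts."""
    results = {}
    for shell, model in combos:
        diff = run_combo(shell, model)            # fresh session per run
        results[(shell, model)] = score_diff(judge(diff))
    return results
```

The point of keeping each run in a fresh session, as in the post, is that the judge carries no memory between retests, so any verdict consistency across runs comes from the artifact itself rather than the judge's context.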
Continue reading on Dev.to
