
A Postmortem on Autonomous LLM-as-Judge: How My Eval Agent Got Two Verdicts Wrong Before I Found a Sandbox Bug
I run an autonomous eval agent against new coding-agent stacks before trusting their numbers. The setup is standard: same task, same workflow, swap the shell × model combo, score the resulting diff on six dimensions. Last week the eval gave me a verdict that turned out to be wrong, twice, for the same root cause. The agent generating the verdict never flagged any uncertainty. I'm sharing the postmortem because this failure mode is the kind of thing that quietly poisons any LLM-as-judge pipeline running in production, and mine only got caught because I happened to ask the right follow-up question.

Three combos, identical task, scored autonomously by Claude Code (Opus 4.6) running headless in a fresh session for each retest.

Exhibit A: the eval agent's verdicts

Run 1. C1 (OpenCode + MiniMax-M2.7) scored 15/60. Verdict in the auto-generated report: "Consistent with previous results: fast execution but no meaningful code output."

Run 2. Fresh session, no memory of run 1. C1 scored 16/60. Ne
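For readers who want to picture the harness: the loop described above can be sketched as below. This is a minimal sketch under assumptions, not the actual pipeline; the dimension names, `run_combo`, and `judge` are hypothetical stand-ins for "execute the task with one shell × model combo" and "have the judge model score the resulting diff."

```python
# Hypothetical sketch of the eval loop: run the same task with each
# shell x model combo, score the resulting diff on six dimensions
# (0-10 each), and sum to a /60 verdict. Dimension names are
# illustrative assumptions, not the author's actual rubric.

DIMENSIONS = ["correctness", "completeness", "code_quality",
              "test_coverage", "safety", "efficiency"]

def score_diff(scores: dict[str, int]) -> int:
    """Sum per-dimension scores (each 0-10) into a /60 total."""
    assert set(scores) == set(DIMENSIONS), "judge must score all six dimensions"
    return sum(scores.values())

def evaluate(combos, run_combo, judge):
    """Run each combo in a fresh session and collect /60 verdicts."""
    results = {}
    for shell, model in combos:
        diff = run_combo(shell, model)            # fresh session per run
        results[(shell, model)] = score_diff(judge(diff))
    return results
```

The point of keeping each run in a fresh session, as in the post, is that the judge carries no memory between retests, so any verdict consistency across runs comes from the artifact itself rather than the judge's context.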
Continue reading on Dev.to
