LLM-as-Judge: using Claude to review a Gemini agent


via Dev.to, by ThomasP

In the previous article, I compared 7 models from 4 providers on the same agentic task. Gemini 3 Flash won on the balance of accuracy, cost, and latency. But winning the benchmark doesn't mean the agent is good. 74.5% accuracy means 1 in 4 products gets the wrong answer. And some of those wrong answers come with high confidence.

The benchmark tells you what fails. It doesn't tell you why. For that, I needed something that could look at the agent's reasoning step by step and tell me where the logic broke down. So I built a judge.

The idea

The production agent runs on Gemini 3 Flash. It's fast and cheap, which is why it's in production. But it makes mistakes. Some of those mistakes share patterns that, if I could identify them, would tell me exactly what to fix in the prompt or the pipeline.

Manually reviewing agent traces is possible but painful. Each trace has 3-6 tool calls, each with a search query, results, page content, and a reasoning step. Reviewing one product takes 10-15 minutes.
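The excerpt describes traces of 3-6 tool calls, each with a query, results, page content, and a reasoning step, which a judge model then reviews step by step. As a minimal sketch of what that might look like, here is a hypothetical prompt builder; the trace shape, the `build_judge_prompt` name, and the verdict JSON format are all assumptions, not the article's actual implementation, and the call to the judge model itself is omitted.

```python
import json

# Hypothetical trace shape, inferred from the description above:
# a trace holds several tool calls, each with a search query,
# results, page content, and the agent's reasoning step.
example_trace = {
    "product": "example-product",
    "final_answer": "example answer",
    "tool_calls": [
        {
            "query": "example search query",
            "results": ["result snippet 1", "result snippet 2"],
            "page_content": "fetched page text...",
            "reasoning": "why the agent made this call",
        }
    ],
}

def build_judge_prompt(trace: dict) -> str:
    """Format an agent trace into a step-by-step review prompt
    for a judge model (names and format are illustrative)."""
    steps = []
    for i, call in enumerate(trace["tool_calls"], start=1):
        steps.append(
            f"Step {i}\n"
            f"  Query: {call['query']}\n"
            f"  Results: {json.dumps(call['results'])}\n"
            f"  Reasoning: {call['reasoning']}"
        )
    return (
        "You are reviewing an agent's trace step by step.\n"
        f"Product: {trace['product']}\n"
        f"Final answer: {trace['final_answer']}\n\n"
        + "\n\n".join(steps)
        + "\n\nIdentify the first step where the logic broke down, "
        "or say NONE if the trace is sound. Reply as JSON: "
        '{"verdict": "pass|fail", "failing_step": int|null, "why": "..."}'
    )
```

In practice, the resulting string would be sent to the judge model (Claude, in the article's setup) and the JSON verdicts aggregated to surface recurring failure patterns.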

Continue reading on Dev.to

