LLM-as-Judge: using Claude to review a Gemini agent


via Dev.to, by ThomasP

In the previous article, I compared 7 models from 4 providers on the same agentic task. Gemini 3 Flash won on the balance of accuracy, cost, and latency. But winning the benchmark doesn't mean the agent is good. 74.5% accuracy means 1 in 4 products gets the wrong answer. And some of those wrong answers come with high confidence.

The benchmark tells you what fails. It doesn't tell you why. For that, I needed something that could look at the agent's reasoning step by step and tell me where the logic broke down. So I built a judge.

The idea

The production agent runs on Gemini 3 Flash. It's fast and cheap, which is why it's in production. But it makes mistakes. Some of those mistakes share patterns that, if I could identify them, would tell me exactly what to fix in the prompt or the pipeline.

Manually reviewing agent traces is possible but painful. Each trace has 3-6 tool calls, each with a search query, results, page content, and a reasoning step. Reviewing one product takes 10-15 minutes.
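The excerpt describes traces of 3-6 tool calls, each with a query, results, page content, and a reasoning step, which a judge model then reviews step by step. As a minimal sketch of what that might look like, here is a hypothetical prompt builder; the trace shape, the `build_judge_prompt` name, and the verdict JSON format are all assumptions, not the article's actual implementation, and the call to the judge model itself is omitted.

```python
import json

# Hypothetical trace shape, inferred from the description above:
# a trace holds several tool calls, each with a search query,
# results, page content, and the agent's reasoning step.
example_trace = {
    "product": "example-product",
    "final_answer": "example answer",
    "tool_calls": [
        {
            "query": "example search query",
            "results": ["result snippet 1", "result snippet 2"],
            "page_content": "fetched page text...",
            "reasoning": "why the agent made this call",
        }
    ],
}

def build_judge_prompt(trace: dict) -> str:
    """Format an agent trace into a step-by-step review prompt
    for a judge model (names and format are illustrative)."""
    steps = []
    for i, call in enumerate(trace["tool_calls"], start=1):
        steps.append(
            f"Step {i}\n"
            f"  Query: {call['query']}\n"
            f"  Results: {json.dumps(call['results'])}\n"
            f"  Reasoning: {call['reasoning']}"
        )
    return (
        "You are reviewing an agent's trace step by step.\n"
        f"Product: {trace['product']}\n"
        f"Final answer: {trace['final_answer']}\n\n"
        + "\n\n".join(steps)
        + "\n\nIdentify the first step where the logic broke down, "
        "or say NONE if the trace is sound. Reply as JSON: "
        '{"verdict": "pass|fail", "failing_step": int|null, "why": "..."}'
    )
```

In practice, the resulting string would be sent to the judge model (Claude, in the article's setup) and the JSON verdicts aggregated to surface recurring failure patterns.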

Continue reading on Dev.to

