
28 Real Tasks Reveal What AI Leaderboards Miss
Originally published on MakerPulse.

4.61 versus 4.55. That's the gap between the top two models in our first AgentPulse benchmark run: GPT-5.2 and Gemini 3.1 Pro, separated by six hundredths of a point on task quality, as scored by three independent AI evaluators across 28 real-world prompts. One costs $0.74 to run the full suite. The other costs $1.61.

A third model, Claude Opus 4.6, sits at 4.30 but finishes in about two-thirds the time, at less than half the latency of the most expensive option. And a speed-tier model from xAI that nobody is talking about costs two cents for the entire run while scoring within striking distance of models that cost 30-80x more.

These aren't the numbers you'll find on any company's marketing page. They come from AgentPulse, a benchmark we built specifically because no existing evaluation answers the question practitioners actually ask: which model should I use for the work I'm doing right now?

Why We Built This

Every major AI lab publishes benchmark scores.
