
What should an agent capability bench test?
We have SWE-bench for coding and GAIA for reasoning. We have BFCL for function calling and LoCoMo for long-term memory. But ask a simple question — can the agent remember its own name after context compaction? — and no benchmark has an answer. The benchmarks we have test impressive things: resolving real GitHub issues, navigating websites, reasoning across documents. What they don't test is whether an agent can do the mundane things that actually matter in daily use: remembering your preferences, recovering gracefully from a failed tool call, staying within its permissions, or knowing when to ask for help instead of guessing. This post surveys the benchmark landscape, identifies what's missing, and proposes 120+ concrete questions that a practical agent capability bench should answer. The benchmark landscape The agent evaluation ecosystem has exploded. Here's what exists today, organized by what each benchmark family actually tests. [Interactive chart — see original post] Memory Benchm