
What should an agent capability bench test?
We have SWE-bench for coding and GAIA for reasoning. We have BFCL for function calling and LoCoMo for long-term memory. But ask a simple question — can the agent remember its own name after context compaction? — and no benchmark has an answer. The benchmarks we have test impressive things: resolving real GitHub issues, navigating websites, reasoning across documents. What they don't test is whether an agent can do the mundane things that actually matter in daily use: remembering your preferences, recovering gracefully from a failed tool call, staying within its permissions, or knowing when to ask for help instead of guessing. This post surveys the benchmark landscape, identifies what's missing, and proposes 120+ concrete questions that a practical agent capability bench should answer. The benchmark landscape The agent evaluation ecosystem has exploded. Here's what exists today, organized by what each benchmark family actually tests. [Interactive chart — see original post] Memory Benchm