
Why I Wouldn't Act on SkillsBench
I came across SkillsBench (paper, Feb 2026) while watching Theo, and was genuinely excited. It asks two critical questions: do curated procedural documents — "Skills" — actually help coding agents, and which coding agent uses them best? The headline number — +16.2pp from curated Skills — felt immediately actionable. Then I started pulling at the methodology, and things unraveled.

Setup

SkillsBench is ambitious in scope — 84 tasks, 11 domains, 7 coding agents, 7,308 trajectories. It evaluates each task under three conditions: no Skills, curated (expert-written) Skills, and self-generated Skills. Each task ships with a fixed Skill package — markdown instructions, sometimes with scripts or templates — provided to the agent alongside the task.

The leaderboard

In every benchmark the central outcome is the leaderboard. Here, that's Finding 2 (§4.1.1), which crowns Gemini CLI + Flash for best raw performance (48.7%) and Claude Code + Opus 4.5 for largest uplift (+23.3pp). This is a legitimate…
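To make the three-condition protocol and the uplift arithmetic concrete, here is a minimal sketch. Everything in it is illustrative: the agent names, task IDs, and pass/fail outcomes are hypothetical, and this is my reading of the setup, not the paper's actual harness.

```python
# Hypothetical sketch of the SkillsBench evaluation grid; the data
# below is made up and only the shape of the computation matters.
from collections import defaultdict

CONDITIONS = ("no_skills", "curated_skills", "self_generated_skills")

# Each trajectory: (agent, task, condition, passed). The real
# benchmark has 7,308 of these across 84 tasks and 7 agents.
trajectories = [
    ("claude-code/opus-4.5", "task-01", "no_skills", False),
    ("claude-code/opus-4.5", "task-01", "curated_skills", True),
    ("gemini-cli/flash", "task-01", "no_skills", True),
    ("gemini-cli/flash", "task-01", "curated_skills", True),
]

def pass_rates(trajs):
    """Per-(agent, condition) pass rate over all tasks."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for agent, _task, condition, passed in trajs:
        totals[(agent, condition)] += 1
        passes[(agent, condition)] += passed
    return {key: passes[key] / totals[key] for key in totals}

rates = pass_rates(trajectories)

for agent in sorted({a for a, _ in rates}):
    baseline = rates.get((agent, "no_skills"), 0.0)
    curated = rates.get((agent, "curated_skills"), 0.0)
    # Uplift is a difference of pass rates in percentage points.
    uplift_pp = (curated - baseline) * 100
    print(f"{agent}: {curated:.1%} curated, {uplift_pp:+.1f}pp uplift")
```

Note that uplift is just curated minus baseline pass rate, so a large +pp number can sit on top of a weak baseline, which is why raw performance and uplift crown different agents.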
Continue reading on Dev.to

