
Why I Wouldn't Act on SkillsBench
I came across SkillsBench (paper, Feb 2026) while watching Theo, and was genuinely excited. It asks two critical questions: do curated procedural documents — "Skills" — actually help coding agents, and which coding agent uses them best? The headline number — +16.2pp from curated Skills — felt immediately actionable. Then I started pulling at the methodology, and things unraveled.

Setup

SkillsBench is ambitious in scope — 84 tasks, 11 domains, 7 coding agents, 7,308 trajectories. It evaluates each task under three conditions: no Skills, curated (expert-written) Skills, and self-generated Skills. Each task ships with a fixed Skill package — markdown instructions, sometimes with scripts or templates — provided to the agent alongside the task.

The leaderboard

In every benchmark the central outcome is the leaderboard. Here, that's Finding 2 (§4.1.1), which crowns Gemini CLI + Flash for best raw performance (48.7%) and Claude Code + Opus 4.5 for largest uplift (+23.3pp). This is a legitimate…
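To make the three-condition protocol and the uplift arithmetic concrete, here is a minimal sketch. Everything in it is illustrative: the agent names, task IDs, and pass/fail outcomes are hypothetical, and this is my reading of the setup, not the paper's actual harness.

```python
# Hypothetical sketch of the SkillsBench evaluation grid; the data
# below is made up and only the shape of the computation matters.
from collections import defaultdict

CONDITIONS = ("no_skills", "curated_skills", "self_generated_skills")

# Each trajectory: (agent, task, condition, passed). The real
# benchmark has 7,308 of these across 84 tasks and 7 agents.
trajectories = [
    ("claude-code/opus-4.5", "task-01", "no_skills", False),
    ("claude-code/opus-4.5", "task-01", "curated_skills", True),
    ("gemini-cli/flash", "task-01", "no_skills", True),
    ("gemini-cli/flash", "task-01", "curated_skills", True),
]

def pass_rates(trajs):
    """Per-(agent, condition) pass rate over all tasks."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for agent, _task, condition, passed in trajs:
        totals[(agent, condition)] += 1
        passes[(agent, condition)] += passed
    return {key: passes[key] / totals[key] for key in totals}

rates = pass_rates(trajectories)

for agent in sorted({a for a, _ in rates}):
    baseline = rates.get((agent, "no_skills"), 0.0)
    curated = rates.get((agent, "curated_skills"), 0.0)
    # Uplift is a difference of pass rates in percentage points.
    uplift_pp = (curated - baseline) * 100
    print(f"{agent}: {curated:.1%} curated, {uplift_pp:+.1f}pp uplift")
```

Note that uplift is just curated minus baseline pass rate, so a large +pp number can sit on top of a weak baseline, which is why raw performance and uplift crown different agents.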
Continue reading on Dev.to

