
AI Writes Your Tests. Here's What It Systematically Misses.
We ran a tool called Optinum against 16 real bugs from SWE-bench Verified, a dataset of production OSS issues with human-verified patches. In 62.5% of cases (10 of 16), the AI-written tests that accompanied each fix missed the exact failure class the bug belonged to. Not random misses. The same categories, over and over.

We also took one instance, synthesized a test, and proved it in Docker: the test fails on the bug commit and passes on the fix commit. No spreadsheets, no hand-waving.

```
$ optinum benchmark --verify sympy__sympy-18199

Optinum E2E Verify — sympy__sympy-18199
Pattern: cascade-change (cascade-blindness catalog)

Test code:
def test_nthroot_mod_cubic_composite():

test_fails_on_bug: true
test_passes_on_fix: true
execution_verified: true
```

That's the headline. Here's the full story.

The Problem Is Structural, Not a Quality Issue

When an AI coding tool fixes a bug, it typically generates a test alongside the code. The test covers t
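The fail-on-bug / pass-on-fix check itself is simple to sketch. Below is a minimal, hypothetical Python illustration of the idea, not Optinum's actual harness: the names (`verify_test`, `roots_buggy`, `roots_fixed`) and the toy bug are ours, loosely modeled on a modular-root finder that forgets a root, not on sympy's real `nthroot_mod` code.

```python
def verify_test(test_fn, impl):
    """Run test_fn against impl; True if the test passes, False if it fails."""
    try:
        test_fn(impl)
        return True
    except AssertionError:
        return False

# Toy stand-in for a bug/fix commit pair: a brute-force modular root
# finder. The "buggy" version starts the search at 1 and so misses
# the root 0 whenever a is divisible by the modulus.
def roots_buggy(a, n, p):
    return [x for x in range(1, p) if pow(x, n, p) == a % p]

def roots_fixed(a, n, p):
    return [x for x in range(p) if pow(x, n, p) == a % p]

def synthesized_test(roots_fn):
    # 289 = 17*17, so a % 17 == 0 and x = 0 is a valid cube root mod 17.
    assert 0 in roots_fn(17 * 17, 3, 17)

fails_on_bug = not verify_test(synthesized_test, roots_buggy)
passes_on_fix = verify_test(synthesized_test, roots_fixed)
print(f"test_fails_on_bug: {fails_on_bug}")    # True
print(f"test_passes_on_fix: {passes_on_fix}")  # True
```

A test only counts as verified when both flags are true: failing on the bug proves the test actually exercises the defect, and passing on the fix proves it isn't flaky or asserting the wrong behavior.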
Continue reading on Dev.to
