I wrapped Gemini Flash with memory and a swarm. It went from 9/12 to 12/12 on a bug benchmark, and the 3 it failed were brutal

via Dev.to PythonMirage995

I've been building SHARD for a few months: an agentic scaffold that wraps LLMs with persistent memory, multi-agent swarms, and a nightly self-study loop. Last night I ran a full benchmark: 12 hard Python bug-fix tasks, naked Gemini Flash vs. SHARD wrapping the same model. Tasks fully solved: naked 9/12 → SHARD 12/12. The 3 tasks the naked model couldn't close are worth examining.

The 3 tasks the naked LLM failed

T1 — html_trap (naked: 38.9%, SHARD: 100%)
An HTML rendering pipeline with XSS injection via unescaped f-strings. The naked model kept fixing the obvious paths and missing the edge cases. SHARD's Security reviewer flagged the exact injection vector on attempt 2.

T10 — template_parser (naked: 20%, SHARD: 100%)
A real bug from pylint#7993: regex .+? vs \w+? inside a template parser. The naked model passed 2/10 tests and confidently produced wrong output. SHARD passed all 10 on attempt 1 because its GraphRAG had causal context from a prior study session on regex semantics.

T2 — ghost_bug (naked: 93.8%, SHARD: 1
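The post doesn't show SHARD's internals, but the "specialist reviewer flags the patch, the model retries" pattern it describes can be sketched in a few lines. Everything here is hypothetical scaffolding (the function names, the toy model, the reviewer), not SHARD's actual code:

```python
from typing import Callable, Optional

def review_and_retry(generate: Callable[[int], str],
                     reviewers: list[Callable[[str], Optional[str]]],
                     max_attempts: int = 3) -> tuple[str, int]:
    """Run candidate patches past specialist reviewers until none object."""
    candidate = ""
    for attempt in range(1, max_attempts + 1):
        candidate = generate(attempt)
        # Collect every reviewer objection; an empty list means the patch passes.
        objections = [msg for r in reviewers if (msg := r(candidate))]
        if not objections:
            return candidate, attempt
    return candidate, max_attempts

# Toy stand-ins: the "model" only produces escaped output from attempt 2
# onward, and the "Security reviewer" flags raw <script> tags.
def toy_model(attempt: int) -> str:
    return "<b>safe</b>" if attempt >= 2 else "<script>bad</script>"

def security_reviewer(patch: str) -> Optional[str]:
    return "unescaped script tag" if "<script>" in patch else None

patch, attempts = review_and_retry(toy_model, [security_reviewer])
print(patch, attempts)  # <b>safe</b> 2
```

The point of the loop is that the reviewer supplies a targeted failure signal the bare model never sees, which matches the T1 story below: the Security reviewer caught the vector on attempt 2.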
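The f-string XSS pattern behind T1 is easy to reproduce. A minimal sketch, assuming a comment-rendering helper (the function and field names are hypothetical, not the benchmark's task code):

```python
import html

def render_comment_unsafe(author: str, body: str) -> str:
    # BUG: user-controlled text is interpolated straight into markup.
    return f"<div class='comment'><b>{author}</b>: {body}</div>"

def render_comment_safe(author: str, body: str) -> str:
    # Fix: escape EVERY interpolated value, not just the "obvious" one --
    # the easy mistake in T1 was fixing some paths and missing the rest.
    return (f"<div class='comment'><b>{html.escape(author)}</b>: "
            f"{html.escape(body)}</div>")

payload = "<script>alert('xss')</script>"
assert "<script>" in render_comment_unsafe("alice", payload)
assert "<script>" not in render_comment_safe("alice", payload)
```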
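The .+? vs \w+? distinction from T10 is subtle enough to show directly: both are lazy, but .+? still matches any character, while \w+? is restricted to word characters. A minimal sketch (the template strings here are illustrative, not the actual pylint#7993 test cases):

```python
import re

# Lazy "any char" vs lazy "word chars only".
broken = re.compile(r"\{(.+?)\}")
fixed = re.compile(r"\{(\w+?)\}")

template = "Hello {name}, you have {count} new {item_type}s"

# On a well-formed template the two patterns agree:
print(broken.findall(template))  # ['name', 'count', 'item_type']
print(fixed.findall(template))   # ['name', 'count', 'item_type']

# With malformed input, .+? happily swallows non-identifier text,
# while \w+? refuses to match past the space:
weird = "value is {a b} end"
print(broken.findall(weird))  # ['a b']
print(fixed.findall(weird))   # []
```

This is exactly the kind of case where a model can pass the happy-path tests and still be confidently wrong on the rest.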
