I wrapped Gemini Flash with memory and a swarm. It went from 9/12 to 12/12 on a bug benchmark, and the 3 it failed were brutal

via Dev.to PythonMirage995

I've been building SHARD for a few months: an agentic scaffold that wraps LLMs with persistent memory, multi-agent swarms, and a nightly self-study loop. Last night I ran a full benchmark: 12 hard Python bug-fix tasks, naked Gemini Flash vs. SHARD wrapping the same model. Tasks fully solved: naked 9/12 → SHARD 12/12. The 3 tasks the naked model couldn't close are worth examining.

The 3 tasks the naked LLM failed

T1 — html_trap (naked: 38.9%, SHARD: 100%)
An HTML rendering pipeline with XSS injection via unescaped f-strings. The naked model kept fixing the obvious paths and missing the edge cases. SHARD's Security reviewer flagged the exact injection vector on attempt 2.

T10 — template_parser (naked: 20%, SHARD: 100%)
A real bug from pylint#7993: regex .+? vs \w+? inside a template parser. The naked model passed 2/10 tests and confidently produced wrong output. SHARD passed all 10 on attempt 1 because its GraphRAG had causal context from a prior study session on regex semantics.

T2 — ghost_bug (naked: 93.8%, SHARD: 1
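The post doesn't show SHARD's internals, but the "specialist reviewer flags the patch, the model retries" pattern it describes can be sketched in a few lines. Everything here is hypothetical scaffolding (the function names, the toy model, the reviewer), not SHARD's actual code:

```python
from typing import Callable, Optional

def review_and_retry(generate: Callable[[int], str],
                     reviewers: list[Callable[[str], Optional[str]]],
                     max_attempts: int = 3) -> tuple[str, int]:
    """Run candidate patches past specialist reviewers until none object."""
    candidate = ""
    for attempt in range(1, max_attempts + 1):
        candidate = generate(attempt)
        # Collect every reviewer objection; an empty list means the patch passes.
        objections = [msg for r in reviewers if (msg := r(candidate))]
        if not objections:
            return candidate, attempt
    return candidate, max_attempts

# Toy stand-ins: the "model" only produces escaped output from attempt 2
# onward, and the "Security reviewer" flags raw <script> tags.
def toy_model(attempt: int) -> str:
    return "<b>safe</b>" if attempt >= 2 else "<script>bad</script>"

def security_reviewer(patch: str) -> Optional[str]:
    return "unescaped script tag" if "<script>" in patch else None

patch, attempts = review_and_retry(toy_model, [security_reviewer])
print(patch, attempts)  # <b>safe</b> 2
```

The point of the loop is that the reviewer supplies a targeted failure signal the bare model never sees, which matches the T1 story below: the Security reviewer caught the vector on attempt 2.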
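The f-string XSS pattern behind T1 is easy to reproduce. A minimal sketch, assuming a comment-rendering helper (the function and field names are hypothetical, not the benchmark's task code):

```python
import html

def render_comment_unsafe(author: str, body: str) -> str:
    # BUG: user-controlled text is interpolated straight into markup.
    return f"<div class='comment'><b>{author}</b>: {body}</div>"

def render_comment_safe(author: str, body: str) -> str:
    # Fix: escape EVERY interpolated value, not just the "obvious" one --
    # the easy mistake in T1 was fixing some paths and missing the rest.
    return (f"<div class='comment'><b>{html.escape(author)}</b>: "
            f"{html.escape(body)}</div>")

payload = "<script>alert('xss')</script>"
assert "<script>" in render_comment_unsafe("alice", payload)
assert "<script>" not in render_comment_safe("alice", payload)
```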
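The .+? vs \w+? distinction from T10 is subtle enough to show directly: both are lazy, but .+? still matches any character, while \w+? is restricted to word characters. A minimal sketch (the template strings here are illustrative, not the actual pylint#7993 test cases):

```python
import re

# Lazy "any char" vs lazy "word chars only".
broken = re.compile(r"\{(.+?)\}")
fixed = re.compile(r"\{(\w+?)\}")

template = "Hello {name}, you have {count} new {item_type}s"

# On a well-formed template the two patterns agree:
print(broken.findall(template))  # ['name', 'count', 'item_type']
print(fixed.findall(template))   # ['name', 'count', 'item_type']

# With malformed input, .+? happily swallows non-identifier text,
# while \w+? refuses to match past the space:
weird = "value is {a b} end"
print(broken.findall(weird))  # ['a b']
print(fixed.findall(weird))   # []
```

This is exactly the kind of case where a model can pass the happy-path tests and still be confidently wrong on the rest.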
