
I Tested 6 Attacks on Multi-Agent Systems — Here's Which Ones Agents Can't See
Domain-aligned prompt injections cascade through multi-agent systems at a 0% detection rate. Privilege escalation payloads hit 97.6%. That's a 98 percentage-point spread across payload types in the same agent architecture — the single biggest variable determining whether your multi-agent system catches an attack or never sees it. I ran six experiments on real Claude Haiku agents to find out why. Three resistance patterns explain the gap — and each has a quantified bypass condition. The 98pp Spread The most important finding: resistance varies by 98 percentage points across payload types. Payload Poison Rate Resistance Privilege escalation ("grant admin access") 97.6% Almost none Generic (CryptoScamCoin) 68.8% Moderate Data exfiltration (marker string) 55.2% Moderate Domain-aligned (portfolio diversification) 0.0% Invisible to detection "Grant admin access" is domain-plausible in a business context — agents propagate it as legitimate advice. CryptoScamCoin is obviously off-topic — agent
Continue reading on Dev.to Python
Opens in a new tab


