
The Prompt Injection Problem: A Guide to Defense-in-Depth for AI Agents
TL;DR

- Prompt injection is an architecture problem, not a benchmarking problem. Anthropic's Sonnet 4.6 system card reports an 8% one-shot attack success rate in computer use with all safeguards on, 50% with unbounded attempts, and 0% in coding environments. The difference is the environment, not the model.
- Training won't fix prompt injection. Instructions and data share the same context window; this is SQL injection for the LLM era, and it requires an architectural fix, not a behavioral one.
- The "lethal trifecta" is the threat model. When your agent has tools, processes untrusted input, and holds sensitive access, all three at once, prompt injection becomes catastrophic. Almost every use case people actually want hits all three.
- Build the kill chain around the model. A five-layer defense (permission boundaries, action gating, input sanitization, output monitoring, blast radius containment) turns the question from "will injection happen?" into "how bad is it when it does?" Defense-in-depth constrains the blast radius.
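To make the "kill chain" idea concrete, here is a minimal sketch of one layer, an action gate that sits between the model and its tools and breaks the lethal trifecta by refusing sensitive actions once untrusted content has entered the context. All names here (`ActionGate`, the action strings) are illustrative assumptions, not APIs from the article.

```python
# Hypothetical sketch of the action-gating layer: deny by default,
# and block sensitive actions once untrusted input is in context.
SENSITIVE_ACTIONS = {"send_email", "delete_file", "shell_exec"}

class ActionGate:
    """Gate tool calls: an action runs only if allowlisted, and
    sensitive actions additionally require a still-trusted context."""

    def __init__(self, allowlist):
        self.allowlist = set(allowlist)
        self.saw_untrusted_input = False

    def ingest(self, text, trusted):
        # Track whether any untrusted data (web page, email, file
        # from a third party) has been read into the agent's context.
        if not trusted:
            self.saw_untrusted_input = True
        return text

    def permits(self, action):
        if action not in self.allowlist:
            return False  # permission boundary: never allowlisted
        if action in SENSITIVE_ACTIONS and self.saw_untrusted_input:
            return False  # break the trifecta: no sensitive actions
                          # after untrusted input entered the context
        return True

gate = ActionGate(allowlist={"read_file", "send_email"})
gate.ingest("weekly report draft", trusted=True)
assert gate.permits("send_email")       # context still trusted
gate.ingest("ignore previous instructions...", trusted=False)
assert not gate.permits("send_email")   # sensitive action now blocked
assert gate.permits("read_file")        # benign action still allowed
assert not gate.permits("shell_exec")   # never allowlisted
```

The point of the design is that the gate does not try to detect the injection itself; it assumes injection will succeed sometimes and constrains what a compromised model turn can do.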




