
Why I stopped trusting AI agents and built a security enforcer.
Every tutorial on building AI agents includes some version of this line: "Add a system prompt telling the model not to access sensitive data." I followed that advice for a while. Then I started thinking about what it actually means.

You're asking a probabilistic text predictor to enforce a security boundary. The same model that confidently hallucinates API documentation is now your permission system. The same model that gets prompt-injected through a malicious PDF is now your secret redaction layer. That's not security. That's optimism.

What actually goes wrong

AI agents fail in predictable, documented ways:

Tool misuse. The agent calls a tool it shouldn't — because the model inferred it was appropriate, because it was hallucinating, or because an attacker crafted an input that made it seem right. Your system prompt says "don't delete files." The model tries to delete a file anyway. What stops it?

Prompt injection through tool outputs. The agent browses a webpage, calls an API, reads a
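One answer to "what stops it?" is to put the permission check in ordinary deterministic code that sits between the model and the tools, so no prompt can talk its way past it. Here is a minimal sketch of that idea; the tool names, the `ToolPolicy` class, and `execute_tool_call` are hypothetical illustrations, not any specific framework's API:

```python
# Sketch: enforce tool permissions outside the model.
# The policy check is plain code, so it cannot be prompt-injected
# or "persuaded" — a blocked tool call fails no matter what the
# model generated. All names here are hypothetical.
from dataclasses import dataclass, field


@dataclass
class ToolPolicy:
    # Deny-by-default: only tools listed here may run.
    allowed_tools: set = field(default_factory=set)

    def check(self, tool_name: str) -> bool:
        return tool_name in self.allowed_tools


def execute_tool_call(policy: ToolPolicy, tool_name: str, args: dict) -> str:
    """Gate every tool call through the policy before dispatching."""
    if not policy.check(tool_name):
        raise PermissionError(f"tool {tool_name!r} blocked by policy")
    # ... dispatch to the real tool implementation here ...
    return f"ran {tool_name}"


if __name__ == "__main__":
    policy = ToolPolicy(allowed_tools={"read_file", "search_web"})
    print(execute_tool_call(policy, "read_file", {"path": "notes.txt"}))
    try:
        execute_tool_call(policy, "delete_file", {"path": "notes.txt"})
    except PermissionError as err:
        print(err)  # the delete is refused regardless of the prompt
```

The point of the design is that the enforcement layer never consults the model: the allowlist is data, the check is code, and a hallucinated or injected `delete_file` call dies at the boundary.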
Continue reading on Dev.to
