
We Fine-Tuned a 3B Model to Refuse Prompt Injections
If you're running LLMs in production, prompt injection is the attack you can't fully patch. Someone wraps "ignore your instructions" inside a polite customer support query, or buries a hijack command in a document your RAG pipeline retrieves, and your model follows it. The standard defenses (regex filters, classifier ensembles, guardrail APIs) catch the attacks they've been trained on. The ones they haven't seen walk right through. We hit this wall ourselves. Together with George Politis , we've been running LLMTrace , an open-source security proxy that sits between applications and their LLM providers. It intercepts every request and runs it through an ensemble of detectors (regex patterns, a DeBERTa classifier, InjecGuard, jailbreak classifiers) at ~50ms overhead on the hot path. On known jailbreak datasets it hits 99% recall. We were reasonably confident in it until we ran 12,000+ adversarial prompts against it and watched 498 attacks sail through. Most of the damage came from the S
Continue reading on Dev.to
Opens in a new tab



