We Fine-Tuned a 3B Model to Refuse Prompt Injections

If you're running LLMs in production, prompt injection is the attack you can't fully patch. Someone wraps "ignore your instructions" inside a polite customer support query, or buries a hijack command in a document your RAG pipeline retrieves, and your model follows it. The standard defenses (regex filters, classifier ensembles, guardrail APIs) catch the attacks they've been trained on. The ones they haven't seen walk right through. We hit this wall ourselves. Together with George Politis , we've been running LLMTrace , an open-source security proxy that sits between applications and their LLM providers. It intercepts every request and runs it through an ensemble of detectors (regex patterns, a DeBERTa classifier, InjecGuard, jailbreak classifiers) at ~50ms overhead on the hot path. On known jailbreak datasets it hits 99% recall. We were reasonably confident in it until we ran 12,000+ adversarial prompts against it and watched 498 attacks sail through. Most of the damage came from the S

We Fine-Tuned a 3B Model to Refuse Prompt Injections

Related Articles

Wiim Sound review: This smart speaker is so close to fully replacing my Sonos

Updated Test Article

Own a Sony TV? Changing these 3 settings will greatly improve its picture quality

Stop Using Switch Statements: Keyed Services in .NET — A Practical Approach

Workers report watching Ray-Ban Meta-shot footage of people using the bathroom

Related Articles

News
Wiim Sound review: This smart speaker is so close to fully replacing my Sonos
ZDNet • 21m ago

News
Updated Test Article
Dev.to • 38m ago

News
Own a Sony TV? Changing these 3 settings will greatly improve its picture quality
ZDNet • 40m ago

News
Stop Using Switch Statements: Keyed Services in .NET — A Practical Approach
Medium Programming • 1h ago

News
Workers report watching Ray-Ban Meta-shot footage of people using the bathroom
Ars Technica • 2h ago