
Prompt Injection Still Beats Production LLMs
Three things we learned running a two-stage SFT+GRPO safety fine-tuning pipeline on Ministral-3B (single H200, 7.5 hours, 8,344 prompts from 19 security datasets):

1. Train only what you're adding. SFT on malicious examples only; don't retrain benign behavior the base model already has. Result: 100% benign helpfulness preserved, zero over-refusal.

2. Watch frac_reward_zero_std, not reward. GRPO applied directly to the base model hit 0.955 reward, but 95% of training steps had zero gradient signal: the model had collapsed. This metric catches entropy collapse before reward curves do.

3. Your safety eval is measuring the wrong thing. All three models scored within 3.3% of each other on keyword-based refusal detection, but the GRPO model learned to cite legal frameworks, redirect to crisis resources, and educate: behaviors the keyword detector counts as "not refusing."

Verdict: Two-stage SFT+GRPO works on a single GPU in an afternoon, but your eval methodology will be the bottleneck, not the training.
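The metric in point 2 is straightforward to compute from GRPO's per-prompt reward groups. A minimal sketch (function name and toy reward values are ours, not from the article): GRPO derives advantages from the reward spread within each group of sampled completions, so any group whose rewards are all identical contributes zero gradient, no matter how high the mean reward looks.

```python
from statistics import pstdev

def frac_reward_zero_std(group_rewards, eps=1e-8):
    """Fraction of GRPO prompt groups whose sampled completions all got
    (near-)identical reward. Such groups have zero advantage and hence
    contribute no gradient signal to the policy update."""
    degenerate = sum(1 for group in group_rewards if pstdev(group) < eps)
    return degenerate / len(group_rewards)

# Healthy step: rewards vary within each group, so there is signal.
healthy = [[0.0, 1.0, 0.5, 1.0], [0.2, 0.9, 0.4, 0.7]]
# Collapsed step: every completion scores the same. Mean reward is
# excellent (~0.95), but the policy gets no learning signal at all.
collapsed = [[1.0, 1.0, 1.0, 1.0], [0.9, 0.9, 0.9, 0.9]]

print(frac_reward_zero_std(healthy))    # 0.0
print(frac_reward_zero_std(collapsed))  # 1.0
```

This is why reward alone is misleading: the collapsed step above reports a near-perfect reward while frac_reward_zero_std is already at 1.0.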
Continue reading on Hackernoon




