
Prompt Injection Still Beats Production LLMs
Three things we learned running a two-stage SFT+GRPO safety fine-tuning pipeline on Ministral-3B (single H200, 7.5 hours, 8,344 prompts from 19 security datasets):

1. Train only what you're adding. SFT on malicious examples only; don't retrain benign behavior the base model already has. Result: 100% benign helpfulness preserved, zero over-refusal.

2. Watch frac_reward_zero_std, not reward. GRPO applied directly to the base model hit 0.955 reward, but 95% of training steps had zero gradient signal: the model had collapsed. This metric catches entropy collapse before reward curves do.

3. Your safety eval is measuring the wrong thing. All three models scored within 3.3% of each other on keyword-based refusal detection, but the GRPO model learned to cite legal frameworks, redirect to crisis resources, and educate: behaviors the keyword detector counts as "not refusing."

Verdict: Two-stage SFT+GRPO works on a single GPU in an afternoon, but your eval methodology will be the bottleneck, not the training.
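The metric in point 2 is straightforward to compute from GRPO's per-prompt reward groups. A minimal sketch (function name and toy reward values are ours, not from the article): GRPO derives advantages from the reward spread within each group of sampled completions, so any group whose rewards are all identical contributes zero gradient, no matter how high the mean reward looks.

```python
from statistics import pstdev

def frac_reward_zero_std(group_rewards, eps=1e-8):
    """Fraction of GRPO prompt groups whose sampled completions all got
    (near-)identical reward. Such groups have zero advantage and hence
    contribute no gradient signal to the policy update."""
    degenerate = sum(1 for group in group_rewards if pstdev(group) < eps)
    return degenerate / len(group_rewards)

# Healthy step: rewards vary within each group, so there is signal.
healthy = [[0.0, 1.0, 0.5, 1.0], [0.2, 0.9, 0.4, 0.7]]
# Collapsed step: every completion scores the same. Mean reward is
# excellent (~0.95), but the policy gets no learning signal at all.
collapsed = [[1.0, 1.0, 1.0, 1.0], [0.9, 0.9, 0.9, 0.9]]

print(frac_reward_zero_std(healthy))    # 0.0
print(frac_reward_zero_std(collapsed))  # 1.0
```

This is why reward alone is misleading: the collapsed step above reports a near-perfect reward while frac_reward_zero_std is already at 1.0.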
Continue reading on Hackernoon




