
The Prompt Change That Broke Production at 2am
Why This Keeps Happening

When you test traditional software, you test a deterministic function. Same input, same output. If the output changes, something broke: the test fails, you investigate.

LLM agents are not deterministic functions. They're probabilistic systems with behavioral contracts. The contract isn't "return exactly this string." The contract is: given this class of inputs, the output must satisfy these structural and semantic properties. The word "liability" should appear. The summary should be in bullet points. The termination clause should be mentioned. These are the invariants your downstream systems depend on, and they're completely untested in most production LLM pipelines.

The industry's response to this has been more evals. MMLU benchmarks, human preference ratings, red-team suites. Valuable for model builders. Useless for application developers who need to know whether their specific prompts still produce outputs their specific systems can rely on. You're not tryi
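To make this concrete, here is a minimal sketch of what asserting such a behavioral contract could look like. The specific invariants (the "liability" keyword, bullet formatting, a termination-clause mention) are taken from the examples above; the function name and inputs are hypothetical, not from any particular framework.

```python
def check_summary_contract(output: str) -> list[str]:
    """Return the list of violated invariants (empty list = contract holds)."""
    violations = []
    text = output.lower()

    # Semantic property: a required term must appear somewhere in the output.
    if "liability" not in text:
        violations.append("missing required term: 'liability'")

    # Structural property: every non-empty line must be a bullet point.
    lines = [ln for ln in output.splitlines() if ln.strip()]
    if not lines or not all(ln.lstrip().startswith(("-", "*", "\u2022")) for ln in lines):
        violations.append("summary is not formatted as bullet points")

    # Semantic property: the termination clause must be mentioned.
    if "termination" not in text:
        violations.append("termination clause not mentioned")

    return violations


# A conforming output passes; a vague one trips every check.
good = "- Liability is capped at fees paid.\n- The termination clause allows 30 days' notice."
bad = "The contract looks fine overall."
```

A test suite built on checks like these runs the same prompt many times and asserts the invariants on every sample, rather than diffing outputs against a golden string.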


