
Waxell vs. Braintrust: When Evaluation Isn't Enough
Consider a team running a tight eval suite. Every Friday, they run 500 real production transcripts through Braintrust scorers, iterate on prompts with Loop, and ship only when quality hits above 8.5/10. Their evals are genuinely good, not the performative kind.

Then one of their agents starts routing customer support tickets through an external summarization API. PII goes with the tickets. The eval score? Still 8.7/10. The summarization is excellent. The governance isn't.

The problem wasn't Braintrust. Braintrust was doing exactly what it's designed to do: measure and optimize quality. The problem was that "quality" and "safe to run in production" are different questions, and the team was using one tool to answer both.

Braintrust is a developer-centric evaluation and experiment platform: score outputs, tune prompts, track quality regressions, and use AI-powered optimization to improve agent behavior before you ship. Waxell is a runtime governance control plane: enforce policies at execution
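The split can be sketched in code: an eval scorer judges output quality, while a separate runtime gate decides whether an action is allowed to execute at all, and the two can disagree. This is a minimal illustration of the concept only; every name and rule below is hypothetical and does not reflect Braintrust's or Waxell's actual APIs.

```python
import re

# Hypothetical PII pattern: email addresses stand in for sensitive data.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def quality_score(summary: str) -> float:
    """Stand-in for an eval scorer: judges output quality only.
    A great summary can score highly while the surrounding action leaks PII."""
    return 8.7

def policy_gate(destination: str, payload: str) -> tuple[bool, str]:
    """Hypothetical runtime policy: block outbound calls that would
    send PII to an external service, regardless of output quality."""
    if destination.startswith("external:") and EMAIL.search(payload):
        return False, "blocked: PII in payload to external destination"
    return True, "allowed"

ticket = "Customer jane.doe@example.com reports a billing error."
summary = "Billing error reported; customer email on file."

print(quality_score(summary))                        # the eval says: ship it
print(policy_gate("external:summarize-api", ticket)) # the runtime gate says: no
print(policy_gate("internal:audit-log", ticket))     # same payload, internal: fine
```

The point of the sketch is that the two checks answer different questions: the scorer never sees the destination, and the gate never sees the score.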
Continue reading on Dev.to



