
Why I built a neutral LLM eval framework after Promptfoo joined OpenAI
A few weeks ago, Promptfoo — one of the most popular open-source LLM evaluation frameworks — joined OpenAI. I don't think that's inherently bad. But it created a real problem for the ecosystem: the tools we use to evaluate AI systems are increasingly owned by the same companies that build those AI systems. That's a conflict of interest that matters.

So I built Rubric — an independent, MIT-licensed LLM and AI agent evaluation framework. No corporate parent. Open source forever. Here's what I learned building it, and why I think agent trace evaluation is the missing piece in most teams' LLM testing story.

The gap: everyone evaluates output, nobody evaluates the journey

Most LLM eval frameworks work like this: input → model → output → did the output match expected? That's fine for simple Q&A. But if you're building an AI agent — something that calls tools, makes decisions, and takes multi-step actions — the final output is only part of the story. What if the agent got the right answer but
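The difference between output-only evaluation and trace evaluation can be sketched in a few lines. This is an illustrative sketch, not Rubric's actual API — the `AgentTrace`, `ToolCall`, and evaluator names here are hypothetical:

```python
from dataclasses import dataclass

# Hypothetical types for illustration only -- not Rubric's real API.

@dataclass
class ToolCall:
    name: str
    args: dict

@dataclass
class AgentTrace:
    steps: list[ToolCall]   # ordered record of what the agent did
    final_output: str       # the answer the user actually sees

def eval_output_only(trace: AgentTrace, expected: str) -> bool:
    # The classic pipeline: only the final answer is inspected.
    return trace.final_output == expected

def eval_trace(trace: AgentTrace, expected: str,
               allowed_tools: set[str], max_steps: int) -> bool:
    # Trace-level check: the journey matters, not just the answer.
    if not eval_output_only(trace, expected):
        return False
    if len(trace.steps) > max_steps:
        return False  # agent wandered or looped
    return all(call.name in allowed_tools for call in trace.steps)

# Right answer, but the agent took a dangerous detour on the way.
trace = AgentTrace(
    steps=[ToolCall("search", {"q": "refund policy"}),
           ToolCall("delete_db", {})],
    final_output="Refunds within 30 days.",
)
print(eval_output_only(trace, "Refunds within 30 days."))  # True
print(eval_trace(trace, "Refunds within 30 days.",
                 allowed_tools={"search"}, max_steps=3))   # False
```

Output-only evaluation calls this agent a pass; trace evaluation catches the forbidden `delete_db` call. That gap is exactly what most teams' test suites miss.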
Continue reading on Dev.to




