
Zero-Dependency LLM Judge: Benchmarking Faithfulness and Pairwise Evaluation
TL;DR: We built agent-eval-lite, a zero-dependency Python framework for LLM-as-judge evaluation. It achieves κ=0.68 on FaithBench (faithfulness) and PCAcc=91-100% on JudgeBench (pairwise comparison), competitive with heavier frameworks that require 40+ dependencies.

The Problem

You've built an AI agent. It answers 10,000 questions a day. How do you know it isn't hallucinating? Manual review doesn't scale. LLM-as-judge, using one LLM to evaluate another, is the practical answer. But existing frameworks (DeepEval, Ragas) drag in torch, transformers, langchain, and dozens of transitive dependencies. agent-eval-lite does the same job with zero external dependencies: just urllib from Python's stdlib.

What's New in v0.5

1. Multi-Model Jury Voting

Different models have different biases. GPT-5.2 is lenient (high false-positive rate), while Grok is too strict (high false-negative rate). Claude Sonnet 4.6 is the most
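To make the "just urllib" claim concrete, here is a minimal sketch of a stdlib-only judge call. The endpoint, model name, prompt wording, and PASS/FAIL protocol are illustrative assumptions, not agent-eval-lite's actual API.

```python
import json
import urllib.request

# Hypothetical endpoint; any OpenAI-compatible chat API would work the same way.
API_URL = "https://api.openai.com/v1/chat/completions"


def build_judge_request(question: str, answer: str, api_key: str) -> urllib.request.Request:
    """Build a faithfulness-judge HTTP request using only the stdlib."""
    prompt = (
        "You are a strict evaluator. Given the question and answer below, "
        "reply with exactly PASS or FAIL.\n\n"
        f"Question: {question}\nAnswer: {answer}"
    )
    body = json.dumps({
        "model": "gpt-4o-mini",  # placeholder model name
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,  # deterministic verdicts
    }).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def judge(question: str, answer: str, api_key: str) -> str:
    """Send the request and return the judge's verdict text."""
    req = build_judge_request(question, answer, api_key)
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"].strip()
```

No torch, no transformers, no langchain: the request is a plain JSON POST, which is why the whole framework can stay dependency-free.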
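Jury voting of the kind described above can be sketched in a few lines. This is an assumed majority-vote aggregator, including a conservative tie-break, and not necessarily the exact rule agent-eval-lite uses.

```python
from collections import Counter


def jury_verdict(votes: dict) -> str:
    """Aggregate per-model PASS/FAIL votes by simple majority.

    `votes` maps a model name to its verdict. Ties fall back to FAIL,
    a conservative default chosen for this sketch.
    """
    counts = Counter(votes.values())
    if counts["PASS"] > counts["FAIL"]:
        return "PASS"
    return "FAIL"
```

Because a lenient judge and a strict judge err in opposite directions, pooling several models this way tends to cancel out individual bias, e.g. `jury_verdict({"gpt": "PASS", "grok": "FAIL", "claude": "PASS"})` yields "PASS".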
Continue reading on Dev.to



