
Zero-Dependency LLM Judge: Benchmarking Faithfulness and Pairwise Evaluation
TL;DR: We built agent-eval-lite, a zero-dependency Python framework for LLM-as-judge evaluation. It achieves κ=0.68 on FaithBench (faithfulness) and PCAcc=91-100% on JudgeBench (pairwise comparison), competitive with heavier frameworks that require 40+ dependencies.

The Problem

You've built an AI agent. It answers 10,000 questions a day. How do you know it isn't hallucinating? Manual review doesn't scale. LLM-as-judge, using one LLM to evaluate another, is the practical answer. But existing frameworks (DeepEval, Ragas) drag in torch, transformers, langchain, and dozens of transitive dependencies. agent-eval-lite does the same job with zero external dependencies: just urllib from Python's stdlib.

What's New in v0.5

1. Multi-Model Jury Voting

Different models have different biases. GPT-5.2 is lenient (high false-positive rate), while Grok is too strict (high false-negative rate). Claude Sonnet 4.6 is the most
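To make the "just urllib" claim concrete, here is a minimal sketch of a stdlib-only judge call. The endpoint, model name, prompt wording, and PASS/FAIL protocol are illustrative assumptions, not agent-eval-lite's actual API.

```python
import json
import urllib.request

# Hypothetical endpoint; any OpenAI-compatible chat API would work the same way.
API_URL = "https://api.openai.com/v1/chat/completions"


def build_judge_request(question: str, answer: str, api_key: str) -> urllib.request.Request:
    """Build a faithfulness-judge HTTP request using only the stdlib."""
    prompt = (
        "You are a strict evaluator. Given the question and answer below, "
        "reply with exactly PASS or FAIL.\n\n"
        f"Question: {question}\nAnswer: {answer}"
    )
    body = json.dumps({
        "model": "gpt-4o-mini",  # placeholder model name
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,  # deterministic verdicts
    }).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def judge(question: str, answer: str, api_key: str) -> str:
    """Send the request and return the judge's verdict text."""
    req = build_judge_request(question, answer, api_key)
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"].strip()
```

No torch, no transformers, no langchain: the request is a plain JSON POST, which is why the whole framework can stay dependency-free.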
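Jury voting of the kind described above can be sketched in a few lines. This is an assumed majority-vote aggregator, including a conservative tie-break, and not necessarily the exact rule agent-eval-lite uses.

```python
from collections import Counter


def jury_verdict(votes: dict) -> str:
    """Aggregate per-model PASS/FAIL votes by simple majority.

    `votes` maps a model name to its verdict. Ties fall back to FAIL,
    a conservative default chosen for this sketch.
    """
    counts = Counter(votes.values())
    if counts["PASS"] > counts["FAIL"]:
        return "PASS"
    return "FAIL"
```

Because a lenient judge and a strict judge err in opposite directions, pooling several models this way tends to cancel out individual bias, e.g. `jury_verdict({"gpt": "PASS", "grok": "FAIL", "claude": "PASS"})` yields "PASS".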
Continue reading on Dev.to



