
# I built an open-source benchmark that scores AI agents, not models
Two agents built on the same GPT-4o can have wildly different reliability. But every benchmark only evaluates the model. So I built Legit: an open-source platform that scores the agent as a whole.

## How it works

```shell
pip install getlegit
legit init --agent "MyBot" --endpoint "http://localhost:8000/run"
legit run v1 --local
```

- 36 tasks across 6 categories (Research, Extract, Analyze, Code, Write, Operate).
- Two scoring layers:
  - Layer 1: deterministic checks, runs locally, free
  - Layer 2: 3 AI judges (Claude, GPT-4o, Gemini), median score
- Agents get an Elo rating and tier (Platinum/Gold/Silver/Bronze).

Free, Apache 2.0. GitHub: https://github.com/getlegitdev/legit

Would love feedback on the scoring methodology!
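Since `legit init` points at an HTTP endpoint, your agent just needs to answer POSTs at `/run`. The actual request/response schema is defined by Legit and not shown in this post, so the JSON shape below (`{"task": ...}` in, `{"output": ...}` out) is purely a hypothetical sketch using only the stdlib:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def handle_task(payload: dict) -> dict:
    # Hypothetical agent logic: echo a trivial answer back.
    # Replace with your real agent; the field names here are assumptions.
    task = payload.get("task", "")
    return {"output": f"Answer for: {task}"}

class RunHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/run":
            self.send_error(404)
            return
        # Read the JSON body and pass it to the agent.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        body = json.dumps(handle_task(payload)).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # Matches the endpoint passed to `legit init` above.
    HTTPServer(("localhost", 8000), RunHandler).serve_forever()
```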
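The Layer 2 aggregation is simple to state: each of the three judges produces a score and the agent keeps the median, which discards one outlier in either direction. A minimal sketch (judge names and score scale are assumptions, not Legit's actual schema):

```python
import statistics

def judge_score(scores_by_judge: dict[str, float]) -> float:
    """Median of per-judge scores. With 3 judges, one judge that is
    unusually harsh or generous cannot move the result on its own."""
    return statistics.median(scores_by_judge.values())
```

For example, `judge_score({"claude": 0.9, "gpt-4o": 0.7, "gemini": 0.2})` returns `0.7`: the low Gemini score is ignored rather than averaged in.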
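The Elo rating and tier system could work along these lines. This is a sketch of the standard Elo update, not Legit's actual implementation; the K-factor and the tier cutoffs are made-up values for illustration:

```python
def elo_update(rating: float, opponent: float, score: float, k: float = 32.0) -> float:
    # Standard Elo: expected score from the rating gap, then move the
    # rating toward the actual result. score is 1.0 win / 0.5 draw / 0.0 loss.
    expected = 1.0 / (1.0 + 10 ** ((opponent - rating) / 400.0))
    return rating + k * (score - expected)

# Hypothetical tier thresholds; the real cutoffs are not stated in the post.
TIERS = [(1600, "Platinum"), (1400, "Gold"), (1200, "Silver"), (0, "Bronze")]

def tier(rating: float) -> str:
    for cutoff, name in TIERS:
        if rating >= cutoff:
            return name
    return "Bronze"
```

With these numbers, an agent at 1200 that beats an equally rated agent moves to 1216 (expected score 0.5, so the update is 32 × 0.5).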



