
Evaluate LLM code generation with LLM-as-judge evaluators
Which AI model writes the best code for your codebase? Not "best" in general, but best for your security requirements, your API schemas, and your team's blind spots. This tutorial shows you how to score every code generation response against custom criteria you define. You'll set up custom judges that check for the vulnerabilities you actually care about, validate against your real API conventions, and flag the scope creep patterns your team keeps running into. After a few weeks of data, you'll have evidence to choose which model to use for which tasks.

What you will build

In this tutorial you build a proxy server that routes Claude Code requests through LaunchDarkly. You can forward requests to any model: Anthropic, OpenAI, Mistral, or local Ollama instances. Every response gets scored by custom judges you create.

You will build three judges:

- Security: Checks for SQL injection, XSS, hardcoded secrets, and the specific vulnerabilities you care about
- API contract: Validates code against …
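To make the routing idea concrete, here is a minimal sketch of how a proxy might map an incoming model identifier to a provider backend. Everything here is an assumption for illustration: the `BACKENDS` table, the `pick_backend` helper, and the model-name prefixes are not from the tutorial, and a real setup would read this routing from LaunchDarkly configuration rather than a hardcoded dict.

```python
# Hypothetical routing table mapping providers to base URLs.
# In the tutorial's setup this would come from LaunchDarkly config,
# not a hardcoded dict.
BACKENDS = {
    "anthropic": "https://api.anthropic.com/v1",
    "openai": "https://api.openai.com/v1",
    "mistral": "https://api.mistral.ai/v1",
    "ollama": "http://localhost:11434/v1",  # local Ollama instance
}

def pick_backend(model: str) -> str:
    """Choose a backend base URL from a model identifier prefix (illustrative)."""
    if model.startswith("claude"):
        return BACKENDS["anthropic"]
    if model.startswith(("gpt", "o1", "o3")):
        return BACKENDS["openai"]
    if model.startswith(("mistral", "mixtral")):
        return BACKENDS["mistral"]
    # Fall back to a local Ollama instance for anything unrecognized.
    return BACKENDS["ollama"]
```

A proxy handler would call `pick_backend` on each request and forward the payload to the returned URL; the point is only that the routing decision is one small, swappable function.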
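The security judge above can be sketched as two pieces: a prompt that states your criteria and asks for a structured verdict, and a parser that extracts that verdict from the judge model's reply. This is a hedged sketch, not the tutorial's implementation: `SECURITY_CRITERIA`, `build_judge_prompt`, `parse_verdict`, and the JSON verdict shape are all illustrative names and formats.

```python
import json
import re

# Illustrative rubric; the criteria you actually configure will differ.
SECURITY_CRITERIA = [
    "SQL injection (queries built by string concatenation instead of parameters)",
    "XSS (unescaped user input rendered into HTML)",
    "Hardcoded secrets (API keys, passwords, tokens in source)",
]

def build_judge_prompt(code: str) -> str:
    """Assemble the prompt sent to the judge model (hypothetical format)."""
    criteria = "\n".join(f"- {c}" for c in SECURITY_CRITERIA)
    return (
        "You are a security reviewer. Score the code below from 1-10 against "
        'these criteria, then reply with JSON {"score": <int>, "findings": [...]}.\n'
        f"Criteria:\n{criteria}\n\nCode:\n{code}"
    )

def parse_verdict(reply: str) -> dict:
    """Pull the JSON verdict out of a judge reply, tolerating surrounding prose."""
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if not match:
        return {"score": None, "findings": []}
    return json.loads(match.group(0))
```

The structured-verdict-plus-parser pattern is the usual way to make judge scores machine-readable so they can be aggregated over a few weeks of traffic, which is what the model comparison in this tutorial rests on.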
Continue reading on Dev.to


