
Beyond Text: How We Built an Evaluation Framework for Multi-File AI Outputs
Most LLM benchmarks evaluate text. HumanEval checks if a function passes unit tests. SWE-bench measures whether a model can patch a repository. MBPP scores single-function completions. None of these work when your AI agent generates an entire website.

I run a site builder agent that takes a template and a set of business requirements (brand colours, fonts, content, images, layout), and produces a deployable multi-file artifact: index.html, css/styles.css, js/main.js, and an assets/ directory. The output isn't a string. It's a folder. And a correct index.html paired with a broken styles.css produces a broken site, even though each file might look reasonable in isolation.

I needed an evaluation framework that could score these outputs the way a QA engineer would: structurally, visually, semantically, and at the code level. Over six days, I built one. It evaluated 467 actions across 5 models, and the results changed how I think about AI code generation. This article explains the framework.
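To make the "score it like a QA engineer" idea concrete, here is a minimal sketch of the structural layer only: check that the expected files exist and that local href/src references in index.html actually resolve on disk. This is an illustration using the Python standard library, not the framework's actual code; the REQUIRED list and the structural_score helper are my own names.

```python
from pathlib import Path
from html.parser import HTMLParser

# Illustrative file set; a real template would define its own manifest.
REQUIRED = ["index.html", "css/styles.css", "js/main.js"]

class RefCollector(HTMLParser):
    """Collect local href/src references from an HTML document."""
    def __init__(self):
        super().__init__()
        self.refs = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value and not value.startswith(
                ("http", "//", "#", "data:")
            ):
                self.refs.append(value)

def structural_score(site_dir: str) -> dict:
    """Check a generated site folder: required files present, local refs resolve."""
    root = Path(site_dir)
    missing = [f for f in REQUIRED if not (root / f).exists()]
    broken = []
    index = root / "index.html"
    if index.exists():
        parser = RefCollector()
        parser.feed(index.read_text(encoding="utf-8"))
        broken = [r for r in parser.refs if not (root / r).exists()]
    return {
        "missing_files": missing,
        "broken_refs": broken,
        "pass": not missing and not broken,
    }
```

The point of running this as a single check over the folder, rather than per file, is exactly the failure mode described above: each file can parse cleanly on its own while the artifact as a whole is broken.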



