
Beyond Text: How We Built an Evaluation Framework for Multi-File AI Outputs
Most LLM benchmarks evaluate text. HumanEval checks if a function passes unit tests. SWE-bench measures whether a model can patch a repository. MBPP scores single-function completions. None of these work when your AI agent generates an entire website.

I run a site builder agent that takes a template and a set of business requirements (brand colours, fonts, content, images, layout), and produces a deployable multi-file artifact: index.html, css/styles.css, js/main.js, and an assets/ directory. The output isn't a string. It's a folder. And a correct index.html paired with a broken styles.css produces a broken site, even though each file might look reasonable in isolation.

I needed an evaluation framework that could score these outputs the way a QA engineer would: structurally, visually, semantically, and at the code level. Over six days, I built one. It evaluated 467 actions across 5 models, and the results changed how I think about AI code generation. This article explains the framework.
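To make the "score it like a QA engineer" idea concrete, here is a minimal sketch of the structural layer only: check that the expected files exist and that local href/src references in index.html actually resolve on disk. This is an illustration using the Python standard library, not the framework's actual code; the REQUIRED list and the structural_score helper are my own names.

```python
from pathlib import Path
from html.parser import HTMLParser

# Illustrative file set; a real template would define its own manifest.
REQUIRED = ["index.html", "css/styles.css", "js/main.js"]

class RefCollector(HTMLParser):
    """Collect local href/src references from an HTML document."""
    def __init__(self):
        super().__init__()
        self.refs = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value and not value.startswith(
                ("http", "//", "#", "data:")
            ):
                self.refs.append(value)

def structural_score(site_dir: str) -> dict:
    """Check a generated site folder: required files present, local refs resolve."""
    root = Path(site_dir)
    missing = [f for f in REQUIRED if not (root / f).exists()]
    broken = []
    index = root / "index.html"
    if index.exists():
        parser = RefCollector()
        parser.feed(index.read_text(encoding="utf-8"))
        broken = [r for r in parser.refs if not (root / r).exists()]
    return {
        "missing_files": missing,
        "broken_refs": broken,
        "pass": not missing and not broken,
    }
```

The point of running this as a single check over the folder, rather than per file, is exactly the failure mode described above: each file can parse cleanly on its own while the artifact as a whole is broken.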



