
5 Models, 467 Actions, 1 Winner — What We Learned Comparing LLMs on Real Code Generation
We tested five AI models on the same task 467 times. Each run produced a complete, deployable website: not a code snippet, not a function, not a patch, but a real site with HTML, CSS, JavaScript, and assets. The question: can cheaper models match Claude Sonnet for production code generation? The short answer is no. The longer answer is more interesting.

## The Models

Five models, spanning a 15x cost range:

| Model | Provider | Input / 1M Tokens | Output / 1M Tokens | Why We Tested It |
|---|---|---|---|---|
| Claude Sonnet 4.6 | OpenRouter | $3.00 | $15.00 | Assumed gold standard |
| Claude Haiku 4.5 | OpenRouter/CLI | $1.00 | $5.00 | Same family, lower tier |
| Kimi K2.5 | OpenRouter | $0.42 | $2.20 | Moonshot AI's latest |
| DeepSeek V3.2 | OpenRouter | $0.26 | $0.38 | Budget option |
| DeepSeek R1 | OpenRouter | $0.70 | $2.50 | Reasoning-focused |

These five represent distinct price tiers and architectural approaches. Sonnet and Haiku share a lineage. Kimi is multimodal. DeepSeek V3.2 optimises for cost. R1 optimises for step-by-step reasoning.

## The 16-Action Pipeline

Each model received …
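To make the price spread concrete, here is a minimal sketch that turns the per-million-token rates above into a per-run cost. The token counts are hypothetical placeholders, not figures from the article, so the resulting ratio illustrates the spread rather than reproducing the article's numbers; the exact multiple depends on the input/output mix of a real run.

```python
# Per-million-token prices (USD) from the comparison table: (input, output).
PRICES = {
    "Claude Sonnet 4.6": (3.00, 15.00),
    "Claude Haiku 4.5": (1.00, 5.00),
    "Kimi K2.5": (0.42, 2.20),
    "DeepSeek V3.2": (0.26, 0.38),
    "DeepSeek R1": (0.70, 2.50),
}

def run_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """USD cost of a single generation run at the listed rates."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Hypothetical per-run token counts (assumed for illustration only).
IN_TOK, OUT_TOK = 50_000, 20_000

sonnet = run_cost("Claude Sonnet 4.6", IN_TOK, OUT_TOK)
budget = run_cost("DeepSeek V3.2", IN_TOK, OUT_TOK)
print(f"Sonnet: ${sonnet:.3f}  DeepSeek V3.2: ${budget:.4f}  ratio: {sonnet / budget:.0f}x")
```

At this assumed mix, one Sonnet run costs roughly twenty times one DeepSeek V3.2 run, which is why the "can cheaper models match Sonnet?" question is worth 467 trials.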
Continue reading on Dev.to Webdev



