
Inside Agent Arcade: Building a Real-Time AI Benchmarking Arena
In the rapidly evolving world of Large Language Models (LLMs), we often ask: "How smart is this model, really?" Standard benchmarks like MMLU or HumanEval are great, but they are increasingly "contaminated" by training data. Enter Agent Arcade (formerly Prison Break AI), a project designed to test AI models in a dynamic, visual, and interactive environment.

The Vision: Beyond Static Text

The goal was to create an app where users could watch an AI model solve puzzles in real time. I wanted to see the "thinking" process: the failures, the retries, and the eventual "Aha!" moments.

Technical Architecture

1. The Engine-Agent Loop

The core of the app is a state machine. The Agent Runner manages the lifecycle of a "Run":

1. Generate: The Game Engine creates a fresh puzzle state.
2. Prompt: The Engine converts that state into a natural language prompt for the AI.
3. Inference: The Model Provider sends the prompt to either a local Ollama instance or a cloud API (AIsa.one).
4. Validate: The Engine par
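The loop above can be sketched in a few lines of JavaScript. This is a minimal, hypothetical illustration, not the project's actual code: the names (`GameEngine`, `mockProvider`, `runAgent`) are assumptions, and a toy arithmetic puzzle stands in for the real Game Engine, with a mock function in place of the Ollama or cloud call.

```javascript
// Hypothetical sketch of the Generate -> Prompt -> Inference -> Validate loop.
const GameEngine = {
  // Generate: create a fresh puzzle state.
  generate() {
    const a = Math.floor(Math.random() * 10);
    const b = Math.floor(Math.random() * 10);
    return { a, b, answer: a + b };
  },
  // Prompt: convert the state into a natural language prompt.
  toPrompt(state) {
    return `What is ${state.a} + ${state.b}? Reply with a number only.`;
  },
  // Validate: check the model's reply against the puzzle state.
  validate(state, reply) {
    return Number.parseInt(reply, 10) === state.answer;
  },
};

// Inference: a stand-in Model Provider. The real app would send the
// prompt to a local Ollama instance or a cloud API here.
async function mockProvider(prompt) {
  const [a, b] = prompt.match(/\d+/g).map(Number);
  return String(a + b);
}

// The Agent Runner drives one "Run" through the state machine,
// retrying inference until the reply validates or attempts run out.
async function runAgent(provider, maxAttempts = 3) {
  const state = GameEngine.generate();
  const prompt = GameEngine.toPrompt(state);
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const reply = await provider(prompt);
    if (GameEngine.validate(state, reply)) {
      return { solved: true, attempt };
    }
  }
  return { solved: false, attempt: maxAttempts };
}

runAgent(mockProvider).then((result) => console.log(result));
```

Keeping Generate, Prompt, and Validate on the engine side means the Model Provider stays a pure prompt-in, text-out function, which is what makes swapping between a local model and a cloud API straightforward.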
Continue reading on Dev.to




