
You are building a Skill, an MCP server, or a custom prompt strategy that is supposed to make an AI coding assistant better at a specific job. You make a change. The next session feels smoother. The agent seems to reach for the right context at the right time. But how do you know? That question came up in two parallel problems. I was building and iterating on MCP servers to support a coding agent. New tool, new tool definition, new prompting strategy. Each change felt like an improvement. Sessions seemed smoother. But I had no numbers. I had vibes. A colleague was working on the same problem from the other side: he was building and refining AI coding Skills -- structured prompt packs that teach the agent how to work in a specific context. Same issue. A lot of iteration, a lot of gut feel, no hard signal on whether the changes were actually moving the needle. We joined forces and built something to fix this. The result is Pitlane -- named after the place in motorsport where engineers sw
Continue reading on Dev.to


