I Turned Karpathy's Autoresearch Into a Skill That Optimizes Anything — Here Is the Architecture

via Dev.to, by Reza Rezvani

Karpathy released autoresearch last week. 31,000 stars. 100 ML experiments overnight on one GPU. Everyone wrote about the ML training loop. I saw something different: a pattern.

One file. One metric. One loop. Modify → Evaluate → Keep or Discard → Repeat.

That pattern has nothing to do with machine learning. So I built a skill that applies it to:

→ API response time (benchmark_speed evaluator)
→ Bundle size (benchmark_size evaluator)
→ Headline click-through (LLM judge evaluator)
→ System prompt quality (LLM judge evaluator)
→ Test pass rate, build speed, memory usage

Works across 11 tools: Claude Code, Codex, Gemini CLI, Cursor, Windsurf, OpenClaw, and more.

The hardest problem: evaluating things that are not numbers. Headlines do not come with a val_bpb metric. Solution: LLM judges using the agent's own subscription.

Critical constraint: the agent cannot modify its own evaluator. (The alignment problem in miniature.)

What I have not done yet: run 100 experiments…
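To make the pattern concrete, here is a minimal sketch of the Modify → Evaluate → Keep or Discard → Repeat loop as a generic hill climber. The names `optimize`, `mutate`, and `evaluate` are hypothetical illustrations, not the skill's actual API; the point is that any artifact with a score function plugs in.

```python
import random

def optimize(artifact, mutate, evaluate, steps=100):
    """Hill-climbing loop: modify -> evaluate -> keep or discard -> repeat.

    `mutate` proposes a variant of the artifact; `evaluate` returns a score
    where higher is better. Both are pluggable, which is what lets the same
    loop optimize response times, bundle sizes, or headlines.
    """
    best_score = evaluate(artifact)
    for _ in range(steps):
        candidate = mutate(artifact)
        score = evaluate(candidate)
        if score > best_score:
            # Keep: the candidate becomes the new baseline.
            artifact, best_score = candidate, score
        # Discard: otherwise the candidate is simply thrown away.
    return artifact, best_score

# Toy usage: "optimize" a number toward 42 by random nudges.
random.seed(0)
result, score = optimize(
    artifact=0.0,
    mutate=lambda x: x + random.uniform(-1, 1),
    evaluate=lambda x: -abs(x - 42),  # higher (closer to 0) is better
    steps=5000,
)
```

For non-numeric targets like headlines, the same loop works: only `evaluate` changes, swapping the benchmark function for an LLM judge that returns a score.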

Continue reading on Dev.to
