
Two Landmines That Nearly Killed Our 20-Agent Cluster — And How We Fixed Them
Lessons in resilience from running a 9-node, 20+ agent OpenClaw cluster in production.

Introduction

I run 20+ AI agents around the clock on 9 GMK Mini PCs and a Mac Mini at home. They handle everything from business automation to learning to family support. The use cases vary, but they share one common concern: they can't go down. Today alone I hit two failures and had to implement fixes. Both were "obvious in hindsight" problems that could have been prevented.

Landmine 1: Anthropic API Overload (Every Agent Goes Silent)

What Happened

The Claude Opus API became overloaded. OpenClaw's Gateway retries with backoff, but after consecutive failures, sessions get interrupted. Because no fallback model was configured, every agent on all 9 nodes went unresponsive simultaneously. A textbook single point of failure (SPOF). When you depend on an API provider, this risk is unavoidable.

Fix: Cluster-Wide Codex Fallback Deployment

OpenClaw supports model.fallbacks to specify fallback models. We ch
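
The retry-then-fallback behavior described above can be sketched in a few lines of Python. This is a minimal illustration of the general pattern, not OpenClaw's actual implementation: the function `call_with_fallbacks`, the exception `ModelUnavailable`, and the two stand-in model clients are all hypothetical names invented for this example, and the retry count and delays are arbitrary assumptions.

```python
import time
from typing import Callable, Sequence


class ModelUnavailable(Exception):
    """Hypothetical error a model client raises when the provider is overloaded."""


def call_with_fallbacks(
    prompt: str,
    models: Sequence[tuple[str, Callable[[str], str]]],
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[str, str]:
    """Try each model in order, retrying with exponential backoff before moving on.

    Returns (model_name, reply). Raises ModelUnavailable if every model fails.
    """
    for name, call in models:
        for attempt in range(max_retries):
            try:
                return name, call(prompt)
            except ModelUnavailable:
                # Wait base_delay, then 2x, then 4x ... between attempts on one model.
                if attempt < max_retries - 1:
                    sleep(base_delay * 2 ** attempt)
        # Retries exhausted for this model: fall through to the next one in the list.
    raise ModelUnavailable("all configured models failed")


# Stand-in clients for illustration only (assumptions, not real API calls).
def overloaded_primary(prompt: str) -> str:
    raise ModelUnavailable("primary overloaded")


def healthy_fallback(prompt: str) -> str:
    return f"fallback reply to: {prompt}"


if __name__ == "__main__":
    chain = [("primary", overloaded_primary), ("fallback", healthy_fallback)]
    # Use a short base delay so the demo finishes quickly.
    print(call_with_fallbacks("ping", chain, base_delay=0.01))
```

With only one entry in the list, exhausted retries end in an exception, which matches the failure mode where every agent goes silent. Adding a second entry from a different provider is what removes the single point of failure.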
Continue reading on Dev.to DevOps




