Two Landmines That Nearly Killed Our 20-Agent Cluster — And How We Fixed Them

Lessons in resilience from running a 9-node, 20+ agent OpenClaw cluster in production. Introduction I run 20+ AI agents around the clock on 9 GMK Mini PCs and a Mac Mini at home. They handle everything from business automation to learning to family support. The use cases vary, but they share one common concern: they can't go down . Today alone I hit two failures and had to implement fixes. Both were "obvious in hindsight" problems that could have been prevented. Landmine 1: Anthropic API Overload — Every Agent Goes Silent What Happened The Claude Opus API became overloaded. OpenClaw's Gateway retries with backoff, but after consecutive failures, sessions get interrupted. Because no fallback model was configured, every agent on all 9 nodes went unresponsive simultaneously . A textbook single point of failure (SPOF). When you depend on an API provider, this risk is unavoidable. Fix: Cluster-Wide Codex Fallback Deployment OpenClaw supports model.fallbacks to specify fallback models. We ch

Two Landmines That Nearly Killed Our 20-Agent Cluster — And How We Fixed Them

Related Articles

How to Shape the Engineering Culture in Software Companies

“Learn to Code” Is Dead — Here’s What You Should Do Instead

The pattern that spreads

ABM Mahi: A CSE Student from Natore Building His Journey in Tech

Google Preferred Source CTA Plugin for WordPress

Related Articles

How-To
How to Shape the Engineering Culture in Software Companies
InfoQ • 5h ago

How-To
“Learn to Code” Is Dead — Here’s What You Should Do Instead
Medium Programming • 5h ago

How-To
The pattern that spreads
Dev.to • 5h ago

How-To
ABM Mahi: A CSE Student from Natore Building His Journey in Tech
Medium Programming • 8h ago

How-To
Google Preferred Source CTA Plugin for WordPress
Dev.to • 8h ago