
Your Retry Config is Wrong (And So Was Mine)
On May 12, 2022, DoorDash went down for over three hours. Not because a database failed — because a database got slow. A routine latency spike in the order storage layer triggered retries. Those retries hit downstream services, which triggered retries of their own. Within minutes, what started as 50ms of added latency became a full retry storm: every service in the chain hammering every service below it, each one multiplying the load on the next. The shared circuit breaker — designed to protect against exactly this — tripped and took out unrelated services that happened to share the same dependency. Three hours of downtime. All because every service had the same retry config: retries: 3.

DoorDash isn't alone. In December 2024, OpenAI went down for over four hours when a telemetry deploy caused every node in their largest clusters to execute resource-intensive Kubernetes API operations simultaneously — a thundering herd that overwhelmed the control plane and locked engineers out of recovery, too.
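The arithmetic behind a retry storm is worth making concrete. Neither incident report publishes the exact configs involved, so this is an illustrative sketch: with a naive retries: 3 at every hop, each layer can turn one call into up to four attempts, and the amplification compounds multiplicatively down the chain. The function names here (worst_case_amplification, backoff_delays) are hypothetical; the jitter scheme is the widely used "full jitter" approach.

```python
import random

def worst_case_amplification(layers: int, retries_per_call: int) -> int:
    """Worst-case attempts reaching the bottom of a call chain.

    retries: 3 means up to 4 attempts per call, and each layer
    multiplies: a 4-deep chain yields 4**4 = 256 attempts at the base.
    """
    return (retries_per_call + 1) ** layers

def backoff_delays(base: float = 0.1, cap: float = 10.0, attempts: int = 4) -> list[float]:
    """Exponential backoff with full jitter (an illustrative sketch).

    Each delay is drawn uniformly from [0, min(cap, base * 2**attempt)],
    so concurrent clients spread out instead of retrying in lockstep --
    the synchronized hammering that turns a latency blip into a storm.
    """
    return [random.uniform(0, min(cap, base * 2 ** a)) for a in range(attempts)]

# A 4-deep service chain, each hop configured with retries: 3:
print(worst_case_amplification(4, 3))  # 256 attempts at the bottom layer
```

Jitter alone doesn't cap total load — that's what retry budgets and per-dependency circuit breakers are for — but it removes the synchronization that makes every client retry at the same instant.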