
Your Retry Config is Wrong (And So Was Mine)
On May 12, 2022, DoorDash went down for over three hours. Not because a database failed — because a database got slow. A routine latency spike in the order storage layer triggered retries. Those retries hit downstream services, which triggered retries of their own. Within minutes, what started as 50ms of added latency became a full retry storm: every service in the chain hammering every service below it, each one multiplying the load on the next. The shared circuit breaker — designed to protect against exactly this — tripped and took out unrelated services that happened to share the same dependency. Three hours of downtime. All because every service had the same retry config: retries: 3.

DoorDash isn't alone. In December 2024, OpenAI went down for over four hours when a telemetry deploy caused every node in their largest clusters to execute resource-intensive Kubernetes API operations simultaneously — a thundering herd that overwhelmed the control plane and locked engineers out of recovery, too.
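The arithmetic behind a retry storm is worth making concrete. Neither incident report publishes the exact configs involved, so this is an illustrative sketch: with a naive retries: 3 at every hop, each layer can turn one call into up to four attempts, and the amplification compounds multiplicatively down the chain. The function names here (worst_case_amplification, backoff_delays) are hypothetical; the jitter scheme is the widely used "full jitter" approach.

```python
import random

def worst_case_amplification(layers: int, retries_per_call: int) -> int:
    """Worst-case attempts reaching the bottom of a call chain.

    retries: 3 means up to 4 attempts per call, and each layer
    multiplies: a 4-deep chain yields 4**4 = 256 attempts at the base.
    """
    return (retries_per_call + 1) ** layers

def backoff_delays(base: float = 0.1, cap: float = 10.0, attempts: int = 4) -> list[float]:
    """Exponential backoff with full jitter (an illustrative sketch).

    Each delay is drawn uniformly from [0, min(cap, base * 2**attempt)],
    so concurrent clients spread out instead of retrying in lockstep --
    the synchronized hammering that turns a latency blip into a storm.
    """
    return [random.uniform(0, min(cap, base * 2 ** a)) for a in range(attempts)]

# A 4-deep service chain, each hop configured with retries: 3:
print(worst_case_amplification(4, 3))  # 256 attempts at the bottom layer
```

Jitter alone doesn't cap total load — that's what retry budgets and per-dependency circuit breakers are for — but it removes the synchronization that makes every client retry at the same instant.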