
How We Simulate 2,000+ Infrastructure Failures Without Touching Production
It is 2 a.m. Your pager fires. A terraform apply that "just changed a timeout" has taken down the payment service, the order queue, and half the API layer. The plan output looked clean. The PR had two approvals. And yet here you are, staring at a cascading failure that nobody predicted. This is the scenario that led me to build FaultRay.

The problem with breaking things to test things

The standard chaos engineering playbook, pioneered by Netflix's Chaos Monkey in 2011 and carried forward by tools like Gremlin, Steadybit, and AWS FIS, follows a simple premise: inject real faults into real systems, observe what breaks, fix it. This works, but it has structural limitations:

It requires a production-like environment. Staging is always out of sync, so the failure you test in staging may not match what happens in prod.

It tests only scenarios you think of. You write the experiments. You choose what to break. The failures you did not imagine are the ones that page you.

It cannot answer the ceiling question. No
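The inject-observe-fix loop of that standard playbook can be sketched in a few lines. This is a hypothetical illustration, not FaultRay or any real tool's API: call_service and run_experiment are invented names, and the fault here is simulated latency rather than a real injected failure.

```python
def call_service(latency_injected=False):
    """Hypothetical downstream call; returns its response time in seconds.

    With fault injection enabled, we simulate 500 ms of added latency,
    standing in for a real fault a chaos tool would inject.
    """
    base = 0.01  # nominal 10 ms response
    if latency_injected:
        base += 0.5  # injected fault: extra latency
    return base

def run_experiment(timeout=0.1, trials=100):
    """The classic chaos loop: measure a steady-state success rate,
    inject a fault, and measure again to see if the timeout budget holds."""
    steady = sum(call_service() <= timeout for _ in range(trials)) / trials
    faulted = sum(call_service(latency_injected=True) <= timeout
                  for _ in range(trials)) / trials
    return steady, faulted

steady, faulted = run_experiment()
print(f"steady-state success: {steady:.0%}, under fault: {faulted:.0%}")
# → steady-state success: 100%, under fault: 0%
```

Note that even this toy version presupposes an environment you are willing to degrade, which is exactly the limitation the list above starts with.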


