
FaultRay: Why We Formalized Cascade Failure Propagation as a Labeled Transition System

The gap that motivated this project

Production fault injection tools (Gremlin, Steadybit, AWS FIS) are powerful, and the chaos engineering discipline they represent has genuinely matured over the past decade. But every tool in that class shares a structural constraint: it operates on running systems.

That constraint is fine for many organizations. It is not fine for regulated industries operating under mandates like the EU Digital Operational Resilience Act (DORA), where touching production with fault injection commands introduces risk that regulators may not accept. And it is not fine for the more fundamental question that fault injection cannot answer: what is the highest availability your architecture is mathematically capable of reaching, given its dependency structure and external SLA commitments?

Classical reliability methods, namely Fault Tree Analysis and Reliability Block Diagrams, do answer availability ceiling questions analytically. But they operate on static trees under a com

