
OpenSRM: An Open Specification for Service Reliability
A team sets a 99.99% availability target for their checkout service. It's ambitious but achievable: they've done the work, invested in redundancy, and their metrics look solid. Six months later, they're missing their target every single month.

The postmortem reveals the problem: their critical path flows through three upstream services. The authentication service promises 99.9%. The payment gateway promises 99.95%. The inventory service promises 99.9%. Every checkout request needs all three, so, assuming their failures are independent, the availabilities multiply: 0.999 × 0.9995 × 0.999 ≈ 0.9975. Their theoretical ceiling is 99.75%, not 99.99%. The target for the checkout service was impossible from day one.

Nobody caught this because there's no standard way to express it. SLOs are set per-service, in isolation. Dependency information lives in architecture diagrams that nobody updates, service catalogs that are perpetually stale, and the heads of engineers who've since left the company. Nobody owns the cross-service math.

This is one of the things I've been building to
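
The ceiling arithmetic from the checkout example can be sketched in a few lines of Python. This is only an illustration, not part of OpenSRM: the function name `availability_ceiling` and the dictionary keys are hypothetical, and the calculation assumes every dependency sits on the critical path and fails independently of the others.

```python
from math import prod


def availability_ceiling(dependency_slos):
    """Upper bound on availability for a service that needs every listed
    dependency on each request, assuming independent failures."""
    return prod(dependency_slos)


# SLO targets from the checkout example, as fractions rather than percentages.
# The key names are illustrative only.
upstream = {
    "authentication": 0.999,
    "payment_gateway": 0.9995,
    "inventory": 0.999,
}

ceiling = availability_ceiling(upstream.values())
target = 0.9999

print(f"ceiling: {ceiling:.4%}")                        # ceiling: 99.7502%
print(f"target achievable: {target <= ceiling}")        # target achievable: False
```

A check like this could run wherever SLO targets are declared, so that a target above the dependency ceiling is flagged when it is set instead of six months later.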
Continue reading on Dev.to



