We audit our code regularly, why don't we audit our monitoring?
How-To, DevOps

via Dev.to (paulg7516)

I've been thinking about this for a while. We have automated checks for code quality, security, and test coverage, but for monitoring we just hope it's fine. Last year I was on call when a critical service went down. It took us a while to even get paged, because the alert that was supposed to catch it had been disabled three months earlier during a maintenance window. Nobody re-enabled it, and no one noticed. After the postmortem, we spent time going through our PagerDuty and Datadog configs and found escalation policies pointing to people who'd left the company, services with zero alert rules (we were collecting metrics from them but never alerting on anything), notification channels that weren't referenced by any policy, and dashboard panels stuck in a permanent "no data" state that everyone had just learned to ignore. So we had dashboards, we had monitoring, we had alerts, but we had no way to know what we were missing. So I decided to build something. I started working on a tool that connects

Continue reading on Dev.to
