
How to Audit Your Monitoring Stack (Before the Next Incident Does It for You)
If you've been in this industry long enough, you've sat through a post-mortem where someone says "we should have had monitoring for that". Maybe it was an endpoint that went down and nobody knew until a customer reported about it. Maybe it was an escalation policy that pointed to someone who left the company six months ago. The annoying part is that these aren't hard problems. They're just invisible ones. Nobody wakes up and thinks "I should check if our PagerDuty escalation policies all have valid responders". You set things up, they work for a while, and then stuff drifts. So here's what you should actually check when you audit a monitoring setup. Not the theoretical "you should have observability" stuff - the specific, concrete things that catch on fire when you don't look at them. Escalation policies with nobody home This one bites more teams than you'd think. Someone leaves, their PagerDuty schedule doesn't get updated, and now there's a gap every third Tuesday where alerts go to
Continue reading on Dev.to
Opens in a new tab




