
I Stopped Fighting My Logging Tools and Built an AI Co-Investigator
TL;DR: I restructured my team's scattered documentation into an AI-queryable format, modelled every service's Splunk log events as TypeScript types, and built an investigation workflow around it. Complex incident investigations went from ~2 hours to ~30 minutes, and the system gets smarter with every investigation archived.

Who this is for: Backend and platform engineers dealing with on-call rotations, incident investigation across multiple services, and documentation that never stays current.

It's 2 AM. PagerDuty wakes you up. The alert says something is wrong with a service you haven't touched in months. You open Splunk. You open New Relic. You open your IDE. You open Slack. And then you open your team's documentation (Confluence, a wiki, whatever your team uses), and the real challenge starts.

Not the tool's fault. This is a people problem. The documentation is scattered across dozens of pages written by different engineers in different eras. Half of it is stale. The service got re
Continue reading on Dev.to



