
I ran incident response on my own homelab. Here's the postmortem.
I run a 3-node Proxmox cluster at home with 11 LXC containers. Last week one of them turned into an incident. Not a dramatic one. No data loss. No outage that affected anyone else. But it hit the same failure modes I see documented in enterprise postmortems, and handling it the same way taught me more than any homelab YouTube video has. Here's what happened and what I changed.

The incident

00:47 - My homelab control panel stops responding. The web UI that ties together monitoring, service status, and agent health is down.

00:47–01:09 - PM2 restarts the service. Then restarts it again. 32 times total, with exponential backoff, over about 22 minutes.

01:09 - Prometheus alert fires. Wazuh catches the anomaly in PM2 process metrics. I get paged.

01:11 - I SSH in. `pm2 logs sjvik-control-panel` shows the immediate cause: `Cannot find module tsx`. The package is gone from `node_modules`.

01:13 - `npm install && pm2 restart sjvik-control-panel`. Service is back. Total downtime from first failure t
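As a rough sketch, the gaps between the timeline entries can be worked out with a few lines of Python. The timestamps are copied from the timeline; the phase names (`time_to_detect`, `time_to_respond`, `time_to_fix`) and the `minutes_between` helper are labels made up for this illustration, not terms from the incident itself. The figures are plain arithmetic on the minute-granularity timestamps, assuming the service was back at 01:13, and are not a substitute for the post's own totals.

```python
from datetime import datetime, timedelta

# Timestamps copied from the timeline (local time, same night).
# The key names are hypothetical labels chosen for this sketch.
TIMELINE = {
    "first_failure": "00:47",
    "alert_fired": "01:09",
    "ssh_in": "01:11",
    "service_restored": "01:13",
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes from start to end (HH:MM).

    Assumes end falls within 24 hours after start, so a negative
    difference is treated as crossing midnight.
    """
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    if delta < timedelta(0):
        delta += timedelta(days=1)
    return int(delta.total_seconds() // 60)


time_to_detect = minutes_between(TIMELINE["first_failure"], TIMELINE["alert_fired"])
time_to_respond = minutes_between(TIMELINE["alert_fired"], TIMELINE["ssh_in"])
time_to_fix = minutes_between(TIMELINE["ssh_in"], TIMELINE["service_restored"])
total_minutes = minutes_between(TIMELINE["first_failure"], TIMELINE["service_restored"])

print(f"detect:  {time_to_detect} min")
print(f"respond: {time_to_respond} min")
print(f"fix:     {time_to_fix} min")
print(f"total:   {total_minutes} min")
```

Split this way, most of the elapsed time sits between the first failure and the alert, while the restart loop was running, rather than in the hands-on fix.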


