
Reducing the time between a production crash and a fix
You ship code, everything works, and then suddenly a crash appears in production.

Even in well-instrumented systems, the investigation process often looks like this:

- check the monitoring alert
- dig through logs
- search the codebase
- try to reproduce the issue
- write a fix
- open a pull request

In many teams, this process can easily take hours.

After several years working on complex applications and critical data workflows, I started wondering if part of this investigation process could be automated. Could we shorten the loop between crash detection and a validated fix?

This is what led me to start building Crashloom.

Crashloom is an experiment around using AI agents to investigate crashes, identify potential root causes, and propose fixes that can be validated before creating a pull request. The idea is to reduce the time between a production crash and a safe fix by assisting developers in the investigation workflow.

crash → investigation → sandbox validation → pull request

The project i
Continue reading on Dev.to


