The cron job failure mode nobody talks about

A few months ago, a nightly ETL job at a previous employer nearly cost us a major client. Not because it failed. Because it took four hours instead of forty minutes, and nobody noticed for six days.

The job ran. It completed. It exited zero. Every monitoring dashboard showed green. Meanwhile, the downstream data pipeline was ingesting half-processed records, and reports were silently wrong. By the time a client flagged it, we had six days of corrupted reporting to unpick.

This is the failure mode nobody talks about: the job that doesn't die, it just... drags.

Why your existing monitoring misses it

If you're using Healthchecks.io, Better Uptime, or a similar dead man's switch tool, here's how it works: your cron job pings a URL at the end of each run. If the ping doesn't arrive within a grace window, you get an alert.

That's genuinely useful. It catches jobs that crash, hang indefinitely, or never start. But what it doesn't catch is a job that completes in 240 minutes when it should take
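To make the gap concrete, here is a minimal Python sketch of a wrapper that times a job and treats "finished, but far too slowly" as its own outcome instead of a success. Everything in it is an assumption for illustration: the `run_timed` name, the `slow_factor` threshold, and the `on_success` / `on_slow` callbacks are hypothetical and not part of any monitoring tool's API. In a real setup, `on_success` is where the usual end-of-run ping would go.

```python
import time
from typing import Callable


def run_timed(
    job: Callable[[], None],
    expected_seconds: float,
    slow_factor: float,
    on_success: Callable[[float], None],
    on_slow: Callable[[float], None],
    clock: Callable[[], float] = time.monotonic,
) -> float:
    """Run `job`, measure its duration, and report slow runs separately.

    Hypothetical helper: a run counts as slow when it takes longer than
    expected_seconds * slow_factor, even if the job itself succeeded.
    """
    start = clock()
    job()  # an exception here propagates, so a crash never reports success
    elapsed = clock() - start

    if elapsed > expected_seconds * slow_factor:
        on_slow(elapsed)      # finished, but dragged: alert on this
    else:
        on_success(elapsed)   # finished in a normal amount of time
    return elapsed


if __name__ == "__main__":
    def nightly_etl() -> None:
        time.sleep(0.1)  # stand-in for the real work

    run_timed(
        nightly_etl,
        expected_seconds=40 * 60,  # assumed normal runtime: forty minutes
        slow_factor=2.0,           # assumed threshold: twice the normal runtime
        on_success=lambda s: print(f"ok in {s:.1f}s"),
        on_slow=lambda s: print(f"SLOW: took {s:.1f}s"),
    )
```

The clock is injectable so the slow path can be exercised without waiting four hours. A fixed multiplier is the crudest possible threshold; it is used here only to keep the sketch short.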
