
The production disasters we've watched happen, and the habit that would have prevented all of them
The Tuesday the database got smaller

A client of ours ran a loyalty program with about 120,000 members. On a Tuesday afternoon their agency pushed a "cleanup migration" to production. The intent was to merge duplicate accounts where the same email had signed up twice with different casing.

The script ran, the dashboard was snappier than usual, and someone in the client's marketing team noticed by Wednesday morning that roughly 40,000 members had vanished from the list. The migration had matched on normalized email, yes, but it had also silently deleted the "loser" row instead of merging the points balances first. There was no soft delete. There was no dry run log. The backup was 19 hours old, which meant a full day of new signups and point redemptions was gone by the time anyone restored it.

The agency's post-mortem used the word "oversight" four times. The real word is "untested". Nobody had run the script against a production-shaped dataset before Tuesday. The staging DB had 300 rows.
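To make the failure concrete, here is a minimal sketch of what a safer version of that migration could have looked like. The schema (`members` with `id`, `email`, `points`, `deleted_at`) and the function name are hypothetical, not the agency's actual code; the point is the shape: merge balances into a winner, soft-delete the losers, and default to a dry run that only reports the plan.

```python
import sqlite3

def merge_duplicate_members(conn, dry_run=True):
    """Merge member rows whose emails differ only by casing.

    Points are summed into the oldest account; duplicates are
    soft-deleted (deleted_at set), never removed. With
    dry_run=True, nothing is written -- the plan is returned
    so it can be inspected against a production-shaped dataset.
    """
    cur = conn.cursor()
    # Find groups of live rows that collide on normalized email.
    cur.execute("""
        SELECT LOWER(email) AS norm, COUNT(*) AS n
        FROM members
        WHERE deleted_at IS NULL
        GROUP BY norm
        HAVING n > 1
    """)
    plan = []
    for norm, _count in cur.fetchall():
        rows = cur.execute(
            "SELECT id, points FROM members "
            "WHERE LOWER(email) = ? AND deleted_at IS NULL "
            "ORDER BY id",
            (norm,),
        ).fetchall()
        winner_id = rows[0][0]            # oldest account wins
        loser_ids = [r[0] for r in rows[1:]]
        total_points = sum(points for _id, points in rows)
        plan.append((norm, winner_id, loser_ids, total_points))

    if dry_run:
        return plan  # the "dry run log" that was missing

    for norm, winner_id, loser_ids, total_points in plan:
        # Merge balances FIRST, then soft-delete the losers.
        cur.execute(
            "UPDATE members SET points = ? WHERE id = ?",
            (total_points, winner_id),
        )
        qmarks = ",".join("?" * len(loser_ids))
        cur.execute(
            f"UPDATE members SET deleted_at = datetime('now') "
            f"WHERE id IN ({qmarks})",
            loser_ids,
        )
    conn.commit()
    return plan
```

Run with `dry_run=True` against a copy of production data, diff the plan against expectations, and only then run it for real. A soft delete also means the "losers" are still recoverable without reaching for a 19-hour-old backup.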
