26,000 EBS Snapshots, a 15-Minute Wall, and the Architecture That Finally Worked

Originally published on Medium

A real-world breakdown of 5 compounding failure modes (memory exhaustion, Lambda timeouts, SNS limits, missing retry logic) and three progressively powerful architectures to fix them.

The Sunday Night That Changed Everything

Picture this: it's Sunday at 4 PM. A scheduled EventBridge rule quietly fires off your Lambda function. Its job? Simple. Scan all your EBS snapshots, find anything older than 90 days, delete it, and send a confirmation email.

Except it never sends that email. Because it never finishes.

Ten minutes pass. The Lambda runtime does what it always does when a function overstays its welcome: kills it. Hard stop. No cleanup. No notification. No idea how many snapshots (if any) were actually deleted.

And then, because Lambda has a retry policy for async invocations, it tries again. And again. Three times total. All timeouts.

The scale problem in numbers: 26,000+ EBS snapshots in a single AWS account. A Lambda function loading ALL of them in

