26,000 EBS Snapshots, a 15-Minute Wall, and the Architecture That Finally Worked

via Dev.to DevOps, by Sowmya Katherla

Originally published on Medium

A real-world breakdown of 5 compounding failure modes, including memory exhaustion, Lambda timeouts, SNS limits, and missing retry logic, and three progressively powerful architectures to fix them.

The Sunday Night That Changed Everything

Picture this: it's Sunday at 4 PM. A scheduled EventBridge rule quietly fires off your Lambda function. Its job? Simple. Scan all your EBS snapshots, find anything older than 90 days, delete it, and send a confirmation email.

Except it never sends that email. Because it never finishes.

Ten minutes pass. The Lambda runtime does what it always does when a function overstays its welcome: it kills it. Hard stop. No cleanup. No notification. No idea how many snapshots (if any) were actually deleted.

And then, because Lambda has a retry policy for async invocations, it tries again. And again. Three times total. All timeouts.

The scale problem in numbers: 26,000+ EBS snapshots in a single AWS account. A Lambda function loading ALL of them in
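
The cleanup job described above (scan snapshots, pick those older than 90 days) can be sketched so that it walks the results page by page instead of holding every snapshot in memory at once. This is a minimal illustration, not the author's code: the function name `iter_expired_snapshot_ids` is hypothetical, and the page shape (a `Snapshots` list whose items carry `SnapshotId` and a timezone-aware `StartTime`) is assumed to match what the EC2 `describe_snapshots` call returns.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, Iterator


def iter_expired_snapshot_ids(
    pages: Iterable[dict],
    now: datetime,
    max_age_days: int = 90,
) -> Iterator[str]:
    """Yield IDs of snapshots strictly older than max_age_days.

    Hypothetical helper: `pages` is any iterable of dicts shaped like
    describe_snapshots responses. Only one page is examined at a time,
    so memory use does not grow with the total snapshot count.
    """
    cutoff = now - timedelta(days=max_age_days)
    for page in pages:
        # A page with no "Snapshots" key is treated as empty.
        for snap in page.get("Snapshots", []):
            if snap["StartTime"] < cutoff:
                yield snap["SnapshotId"]


if __name__ == "__main__":
    # Fake pages standing in for real API responses (illustrative data only).
    now = datetime(2024, 6, 1, tzinfo=timezone.utc)
    fake_pages = [
        {"Snapshots": [
            {"SnapshotId": "snap-old", "StartTime": now - timedelta(days=120)},
            {"SnapshotId": "snap-new", "StartTime": now - timedelta(days=10)},
        ]},
        {"Snapshots": []},
    ]
    for snapshot_id in iter_expired_snapshot_ids(fake_pages, now):
        print(snapshot_id)
```

With boto3, the pages would plausibly come from `boto3.client("ec2").get_paginator("describe_snapshots").paginate(OwnerIds=["self"])`, with `now` set to `datetime.now(timezone.utc)`. Streaming the pages addresses only the memory side of the problem; it does not by itself keep a single invocation under the Lambda timeout.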
