Autoscaling Is Not Elasticity
The first autoscaling incident I handled personally left me staring at CloudWatch graphs at 3 AM, watching our infrastructure commit suicide by optimization. The Auto Scaling Group was doing exactly what we'd configured it to do — launching EC2 instances to meet target capacity — except AWS's control plane was choking on an internal partition failure, and every instance we launched hung in pending state, timed out health checks, got terminated, triggered a replacement launch. Over and over. We'd built a perfect feedback loop of failure, and the irony was we'd followed the documentation to the letter. The fix took four minutes once we understood what was happening: manually pin the ASG at current capacity, stop all scaling activity, let the control plane recover on its own schedule. But we didn't have a procedure for that. No Terraform config sitting in version control with min_size = desired = max_size = 12 . No runbook that said "when AWS is on fire, turn off your autoscaler first." W
Continue reading on DZone
Opens in a new tab


