This has happened a couple of times since we moved our cluster project from Google to AWS.
We have an EFS volume that's mounted on a load-balanced cluster in a Beanstalk project.
I will be in the middle of setting something up, either uploading a large ZIP file to that EFS volume (via an instance on the load-balanced cluster) or unzipping one from an SSH session on a cluster instance, when the instance is suddenly ripped out from under me: the cluster has spawned two (or more) new instances and is shutting down the one I was accessing.
What is going on here? The instances are all "t2.micro" instances; are they inadequate for the sustained load and running out of burst capacity? Has anybody seen anything like this?
So you've got this t2.micro in an Auto Scaling Group (ASG), I assume? And this ASG is configured to scale up/down based on average CPU load?
You overload it with some large ZIP file manipulation and run out of CPU credits; CloudWatch notices the average CPU load has gone above the threshold and the ASG launches a new instance. As expected.
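If you want to confirm the credit exhaustion part, the CPUCreditBalance metric in CloudWatch shows how many credits the instance has left. A minimal boto3 sketch, where the region and instance ID are placeholders, not values from your environment:

```python
# Sketch: check an instance's remaining CPU credits via CloudWatch.
# The region and instance ID below are placeholders.
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

now = datetime.now(timezone.utc)
resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUCreditBalance",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Period=300,                 # 5-minute datapoints (basic monitoring)
    Statistics=["Average"],
)

for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"])   # near 0 == credits exhausted
```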
The new instance brings the average CPU load back down, so the ASG scales in again and terminates an instance according to its termination policy, quite possibly the one you're working on. Also as expected.
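You can check which policy your ASG uses when it picks an instance to terminate; the group name here is a made-up placeholder:

```python
# Sketch: inspect the ASG's termination policy and current sizing.
# "my-beanstalk-asg" is a placeholder name.
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

group = autoscaling.describe_auto_scaling_groups(
    AutoScalingGroupNames=["my-beanstalk-asg"]
)["AutoScalingGroups"][0]

print(group["TerminationPolicies"])                        # e.g. ['Default']
print(group["MinSize"], group["DesiredCapacity"], group["MaxSize"])
```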
I guess your scale-up and scale-down thresholds are too close to each other (maybe you've got scale up when load > 60% and scale down when load < 50%). Configure a bigger gap, e.g. 60% / 30%, as sketched below.
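A rough sketch of what that gap looks like with simple scaling policies and CloudWatch alarms; all names, the ASG name and the thresholds are illustrative, not taken from your setup:

```python
# Sketch: scale out at 60% average CPU, scale in at 30%, using simple
# scaling policies. All names and numbers are illustrative.
import boto3

ASG = "my-beanstalk-asg"
autoscaling = boto3.client("autoscaling", region_name="us-east-1")
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

scale_out = autoscaling.put_scaling_policy(
    AutoScalingGroupName=ASG,
    PolicyName="cpu-scale-out",
    AdjustmentType="ChangeInCapacity",
    ScalingAdjustment=1,
    Cooldown=300,
)
scale_in = autoscaling.put_scaling_policy(
    AutoScalingGroupName=ASG,
    PolicyName="cpu-scale-in",
    AdjustmentType="ChangeInCapacity",
    ScalingAdjustment=-1,
    Cooldown=300,
)

def cpu_alarm(name, threshold, operator, policy_arn):
    """Average-CPU alarm over the whole ASG that triggers a scaling policy."""
    cloudwatch.put_metric_alarm(
        AlarmName=name,
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "AutoScalingGroupName", "Value": ASG}],
        Statistic="Average",
        Period=300,
        EvaluationPeriods=2,
        Threshold=threshold,
        ComparisonOperator=operator,
        AlarmActions=[policy_arn],
    )

cpu_alarm("cpu-high", 60.0, "GreaterThanThreshold", scale_out["PolicyARN"])
cpu_alarm("cpu-low", 30.0, "LessThanThreshold", scale_in["PolicyARN"])
```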
Don't overload T2/T3 instances: either enable T2/T3 Unlimited, or switch to an instance type like M4, M5 or C5 that doesn't rely on CPU credits and provides consistent performance.
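Flipping an existing T2/T3 instance to Unlimited mode is a single EC2 API call (the instance ID is a placeholder); just be aware that sustained bursting above baseline is then billed extra:

```python
# Sketch: switch an existing T2/T3 instance to "unlimited" credit mode.
# The instance ID is a placeholder.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

ec2.modify_instance_credit_specification(
    InstanceCreditSpecifications=[
        {"InstanceId": "i-0123456789abcdef0", "CpuCredits": "unlimited"}
    ]
)
```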
Treat instances in an ASG as immutable: you should never need to log in to them, and all their configuration should happen automatically through Launch Configuration / user-data scripts, because you never know when they will start or stop.
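In a Beanstalk environment the launch configuration is managed for you (customisation usually goes through .ebextensions), but if you were running the ASG yourself, the idea is that all setup lives in user data on the launch template rather than in manual SSH sessions. A hedged sketch, with every name, AMI ID and filesystem ID made up:

```python
# Sketch: bake instance setup into a launch template's user data so any
# instance the ASG starts configures itself. All names/IDs are made up.
import base64

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

user_data = """#!/bin/bash
set -euo pipefail
yum install -y amazon-efs-utils unzip
mkdir -p /mnt/efs
mount -t efs fs-12345678:/ /mnt/efs   # placeholder EFS filesystem ID
"""

ec2.create_launch_template(
    LaunchTemplateName="web-node",
    LaunchTemplateData={
        "ImageId": "ami-0123456789abcdef0",   # placeholder AMI
        "InstanceType": "t3.small",
        "UserData": base64.b64encode(user_data.encode()).decode(),
    },
)
```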
Hope that helps :)