I've been using AWS for decades at this point, and I started using spot instances for various reasons, including professional ones such as testing software I develop. I mostly use Windows Server 2022 instances, which I connect to using RDP.
I've built myself a set of scripts that spawn new instances from snapshots when I need them and automatically save them back to snapshots when I don't.
For the last year or so, it's been going very well. A few times my instances were suddenly shut down, and I always assumed it was because AWS needed the capacity for users who are willing to pay more or who use on-demand instances.
These past two months, I've had several shutdowns, sometimes back-to-back. Sometimes it happens multiple times within a short span of time, and then it goes weeks without happening at all.
I tried raising the price of the spot request as high as I could, just for testing purposes, and the instance would still go down after a while. If I look at the pricing history, everything is low and I see a very stable, flat line, indicating that the prices are not fluctuating. Each time I look into it in more detail, I see:
- Details of the instance in EC2: I would see
State transition reason / User initiated
- Details of the spot requests in EC2: I would see
Status failed, bad-parameters
- But what I see in CloudTrail tells a different story. In the logs I would see:
// most of this is redacted but here are some relevant parts:
"invokedBy": "spot.amazonaws.com",
"eventType": "AwsApiCall",
"managementEvent": true,
"eventCategory": "Management"
So CloudTrail is basically telling me that it is NOT user initiated, which tells me that the problem is not with my instance. At first I thought that maybe my Windows Server instance was just crashing and that's why it said "user initiated", but no.
It has been very annoying, and yes, I guess the solution would be to use on-demand instances instead, but I'm spending a lot of money on AWS and would like to keep using spot instances.
Has anyone had similar experiences? Could anyone help me debug this issue further?
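For anyone who wants to run the same check across many CloudTrail records, here is a minimal Python sketch. It assumes the record has already been loaded as a dict and that `invokedBy` sits under `userIdentity`, which is where CloudTrail normally places it. The helper names `invoked_by` and `is_spot_initiated` are made up for illustration, and the sample record contains only the fields quoted above.

```python
import json
from typing import Optional


def invoked_by(record: dict) -> Optional[str]:
    """Return the AWS service that made the call, or None if there is none."""
    # Assumption: invokedBy is nested under userIdentity. The top-level
    # lookup is only a fallback for records flattened during redaction.
    identity = record.get("userIdentity") or {}
    return identity.get("invokedBy") or record.get("invokedBy")


def is_spot_initiated(record: dict) -> bool:
    """True when the call was made by the EC2 Spot service itself."""
    return invoked_by(record) == "spot.amazonaws.com"


# Illustrative record built only from the fields quoted in the post.
sample = json.loads("""
{
  "userIdentity": {"invokedBy": "spot.amazonaws.com"},
  "eventType": "AwsApiCall",
  "managementEvent": true,
  "eventCategory": "Management"
}
""")

print(is_spot_initiated(sample))
```

A record for a call I made myself would have no `invokedBy` value, so the same function would return `False` for it.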
Thanks!
"The best way for you to gracefully handle Spot Instance interruptions is to architect your application to be fault-tolerant. To accomplish this, you can take advantage of Spot Instance interruption notices. We recommend that you check for these interruption notices every 5 seconds.
"The interruption notices are made available as a EventBridge event and as items in the instance metadata on the Spot Instance. Interruption notices are emitted on a best effort basis."
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-instance-termination-notices.html#ec2-spot-instance-interruption-warning-event
https://aws.amazon.com/blogs/compute/taking-advantage-of-amazon-ec2-spot-instance-interruption-notices/
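As a rough illustration of the polling described in that quote, here is a Python sketch that checks the instance metadata every 5 seconds. It assumes Python is installed on the instance and that IMDSv2 (token-based metadata access) is enabled. The function names and the `on_notice` callback are hypothetical; the callback is where something like your snapshot script could be hooked in. Treat it as a starting point rather than a tested implementation.

```python
import json
import time
import urllib.error
import urllib.request
from typing import Callable, Optional

IMDS = "http://169.254.169.254/latest"


def fetch_token(ttl_seconds: int = 21600, timeout: float = 2.0) -> str:
    """Request an IMDSv2 session token (only works from inside the instance)."""
    req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": str(ttl_seconds)},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode()


def parse_notice(body: str) -> tuple:
    """Split the instance-action JSON into (action, time)."""
    data = json.loads(body)
    return data["action"], data["time"]


def check_interruption(token: str, timeout: float = 2.0) -> Optional[tuple]:
    """Return (action, time) if an interruption is scheduled, else None."""
    req = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return parse_notice(resp.read().decode())
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None  # no interruption notice at the moment
        raise


def watch(on_notice: Callable[[str, str], None], interval: float = 5.0) -> None:
    """Poll until a notice appears, then call on_notice(action, time) once."""
    token = fetch_token()
    while True:
        try:
            notice = check_interruption(token)
        except urllib.error.HTTPError as err:
            if err.code != 401:
                raise
            token = fetch_token()  # token expired, get a new one and retry
            continue
        if notice is not None:
            on_notice(*notice)
            return
        time.sleep(interval)
```

On the instance you would start it with something like `watch(lambda action, when: print(action, when))`. Since the notices are best effort, this can reduce lost work but cannot guarantee a clean save.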
Spot instances can be shut down at any time by AWS with very little notice. If someone wants an on-demand instance and there's no capacity available, spot instances get shut down regardless of the bid price.
If you can't cope with your instances being shut down, use on-demand instances.