Obviously, Amazon EC2 servers are still running on physical hardware and therefore can have catastrophic failures. And architecturally, I should be building an app that scales horizontally and works around those failures.
How can I simulate an EC2 instance suddenly breaking down? This should have characteristics of a real equipment failure:
- Processes don't terminate cleanly.
- Data in memory is not given a chance to write to disk.
- Files (e.g., on EBS volumes) are not cleanly closed.
- Open sockets don't FIN, they just hang.
There are an enormous number of ways a system can fail, so you probably can't test for, and work around, all of them.
Perhaps you should look at it from another perspective: identify the services that are essential, then find a way to explicitly kill them at random. This simulates a failure you care about, regardless of the cause.
For example, if your instance runs an httpd and an FTP server, you can kill these daemons occasionally and make sure you can recover from it. You can even programmatically terminate the whole server with the AWS API if you want to.
This will also exercise your monitoring infrastructure if the recovery doesn't work :-)
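As a rough illustration of the idea above, here is a minimal Python sketch. It assumes a Linux instance with `pgrep` available, and the service names, the function names, and the `dry_run` flag are all hypothetical choices made for this example. The whole-instance variant assumes `boto3` is installed and AWS credentials are configured.

```python
import os
import random
import signal
import subprocess

# Hypothetical service names; replace with the daemons your instance depends on.
ESSENTIAL_SERVICES = ["httpd", "vsftpd"]


def pids_of(name):
    """Return the PIDs of processes whose name matches exactly (assumes pgrep exists)."""
    result = subprocess.run(["pgrep", "-x", name], capture_output=True, text=True)
    return [int(pid) for pid in result.stdout.split()]


def kill_random_service(services, find_pids=pids_of, rng=random, dry_run=True):
    """Pick one service at random and SIGKILL its processes.

    SIGKILL cannot be caught, so the process gets no chance to flush
    buffers or shut down cleanly. With dry_run=True nothing is killed.
    """
    victim = rng.choice(services)
    pids = find_pids(victim)
    for pid in pids:
        if not dry_run:
            os.kill(pid, signal.SIGKILL)
    return victim, pids


def terminate_instance(instance_id):
    """Terminate the whole instance through the EC2 API (needs boto3 and credentials)."""
    import boto3  # imported lazily so the rest of the sketch runs without it

    ec2 = boto3.client("ec2")
    ec2.terminate_instances(InstanceIds=[instance_id])


if __name__ == "__main__":
    # Dry run by default: only report what would be killed.
    print(kill_random_service(ESSENTIAL_SERVICES))
```

Run it from cron or a scheduler at irregular intervals, with `dry_run=False` once you trust it. Keep in mind that when a single process is killed the kernel still closes its sockets, so this exercises service-level recovery rather than the "sockets just hang" behaviour of a dead host.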