I'm currently running a single EC2 instance and plan to move to a fault-tolerant architecture eventually. Knowing EC2's MTBF (mean time between failures) would help me decide how urgent this migration is.
Is there any data about how often EC2 machines fail?
I generally expect the MTBF to be higher for EC2 instances than for high-end hardware I would buy and put in a data center.
The big difference is that I can design my EC2 setup so that when an instance fails, I can bring up a new one within minutes of being alerted and getting to an Internet connection. That's a huge contrast with what I used to have to do when a server failed in a colo 40 minutes away: drive down there, debug the hardware, and install replacement parts if I happened to have them on hand.
For example, if an instance's underlying hardware fails, you can throw it away and switch to new hardware with a couple of commands:
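As a minimal sketch with the AWS CLI (the instance, AMI, key pair, and security group IDs are placeholders you'd swap for your own): stopping and starting an EBS-backed instance typically brings it back up on different underlying hardware, while terminating it and launching a replacement from your AMI is the more drastic option.

```
# Stop and start an EBS-backed instance; on start it typically lands on new hardware.
aws ec2 stop-instances --instance-ids i-0123456789abcdef0
aws ec2 start-instances --instance-ids i-0123456789abcdef0

# Or throw the instance away entirely and launch a fresh one from your AMI.
aws ec2 terminate-instances --instance-ids i-0123456789abcdef0
aws ec2 run-instances --image-id ami-0123456789abcdef0 --instance-type t3.micro \
    --key-name my-key --security-group-ids sg-0123456789abcdef0
```

The stop/start route keeps the instance ID and attached EBS volumes; the terminate/launch route only works painlessly if your setup is scripted or baked into the AMI, which is why the notes below matter.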
So, though I sometimes design for replication and automated recovery or failover, other times I tend to find myself living with the risk of a little downtime because it's so easy to recover manually.
- Document/script your instance setup (software installation and configuration) so you can reproduce it at a moment's notice.
- Take regular snapshots (see the sketch below).
- Make regular backups of your data (in addition to snapshots).
- Keep copies of backups offsite (outside of EC2).
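For the snapshot and offsite-backup items above, a minimal sketch with the AWS CLI (the volume ID, backup path, and bucket name are placeholders):

```
# Snapshot an EBS volume (placeholder volume ID).
aws ec2 create-snapshot --volume-id vol-0123456789abcdef0 \
    --description "nightly snapshot $(date +%F)"

# Push a copy of your backups off the instance to S3 (placeholder bucket and path),
# so losing the instance and its volumes doesn't take your only copy with it.
aws s3 sync /var/backups s3://my-backup-bucket/backups/
```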
If you need extra nines of uptime, go for the more complicated replicated, redundant, failover, and autoscaling architectures, which AWS also makes easier to build than physical hardware does.
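If you do head that way, a hedged sketch of the smallest building block: an Auto Scaling group spanning two Availability Zones, assuming you already have a launch template describing your instance (the template ID, group name, and zones are placeholders).

```
# Keep at least two instances running across two AZs; instances that fail
# health checks get terminated and replaced automatically.
aws autoscaling create-auto-scaling-group \
    --auto-scaling-group-name web-asg \
    --launch-template LaunchTemplateId=lt-0123456789abcdef0,Version='$Latest' \
    --min-size 2 --max-size 4 --desired-capacity 2 \
    --availability-zones us-east-1a us-east-1b
```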
This is something I've researched for a company project and, unfortunately, it's not really possible to quantify. Because EC2 has such a huge number of nodes, and cluster computing is inherently unstable with that many machines in play, it really comes down to one question: can your application handle failures?
Of note, the biggest issues seem to be single points of failure (obviously): don't rely on a single database instance, a single file store, and so on. Disk failures on EC2 aren't exactly commonplace, but I've seen reported failure rates ranging from 0.0001% to 2%. Googling around (and checking the EC2 forums) will turn up more evidence of this. For long-term--or "more reliable"--storage, check out Amazon S3.
Overall, you shouldn't view EC2 instances as stand-in replacements for servers in your own data center or colo. Rather, you should view them as part-time workers--many will show up, most will do a good job, but every once in a while one of them will call in sick or quit. When that happens, your application needs to be able to handle the loss, be it data corruption or a server going offline. If it can (as you say), then cloud computing is a good idea.
There are no published MTBF statistics; "more often than you would like" is about the best answer you are going to get. Beyond that, the other posters have provided excellent advice on how to architect your application.