Of the folks managing their own clusters (i.e. not using/paying for Amazon Auto Scaling, RightScale, Scalr, etc.), how are you managing your instances on EC2 and handling (e.g.) failover? I suspect most folks just end up writing boatloads of their own scripts against the EC2 API - I'm wondering whether that's the case.
That's certainly our approach: we whipped up our own Python/Boto-based monitoring/restarting daemon that runs off-site, listening for UDP keep-alives from our instances. On failure, it snapshots volumes, registers images, starts new instances, deletes old volumes, and so on.
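For reference, a bare-bones sketch of that kind of off-site watchdog in Python/boto (the port, region, AMI ID, timeout, and instance type below are placeholders, and the snapshot/register-image steps are left out):

```python
# Minimal off-site watchdog sketch: listen for UDP keep-alives from instances
# and launch a replacement when one goes quiet. All IDs/ports are placeholders;
# a real daemon would also snapshot volumes and register an image first.
import socket
import time
import boto.ec2

REGION = 'us-east-1'
AMI_ID = 'ami-xxxxxxxx'        # placeholder replacement image
TIMEOUT = 60                   # seconds of silence before we declare failure
last_seen = {}                 # instance_id -> timestamp of last keep-alive

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(('0.0.0.0', 9999))   # instances send their instance ID here
sock.settimeout(5)

conn = boto.ec2.connect_to_region(REGION)

while True:
    try:
        data, _ = sock.recvfrom(1024)
        last_seen[data.strip()] = time.time()
    except socket.timeout:
        pass

    now = time.time()
    for instance_id, seen in list(last_seen.items()):
        if now - seen > TIMEOUT:
            # Instance went silent: terminate it and start a replacement.
            conn.terminate_instances(instance_ids=[instance_id])
            conn.run_instances(AMI_ID, instance_type='m1.small')
            del last_seen[instance_id]
```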
Every so often, when hacking on our scripts, I think there must be some open-source tools out there that deal with these issues already, and which don't have the constraints of (say) Scalr, but I always come back from Google empty-handed. (Things like Scalr are pretty limited in the supported set/versions/configurations of software, and have specialized and, IMO, cumbersome ways of manipulating these setups.)
Also, the Linux-HA/Pacemaker ecosystem (Heartbeat, ldirectord, etc.) doesn't sound like it's really suited to EC2. (But then I found this - though I'm not sure it's really a high-quality solution.)
Well, I don't mean to just state the obvious, but the general idea is to push this complexity into the services managed by Amazon.
So on the frontend, you would use Amazon Elastic Load Balancing (ELB) to provide highly available load balancing. On the backend, you use Amazon Relational Database Service (hosted MySQL), SimpleDB, and S3 for storage. All of these are managed by Amazon and include some form of high availability / failover handling.
This typically leaves your web application servers and any less common server types you might be running (rendering servers, self-installed NoSQL data stores, etc.).
Webapp servers are usually handled well enough by the health checks built into ELB. You can either accept a small performance degradation while one webapp server is down, or consistently provision one server more than you need. And if your configuration is simple, then when a webapp server fails, ELB together with CloudWatch and Auto Scaling can automatically spawn a new webapp server for you.
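If it helps, the ELB side of that is only a few boto calls (boto 2-style API; the load balancer name, zones, ports, and health-check URL are placeholders):

```python
# Create a load balancer and attach a health check so unhealthy webapp
# servers are pulled out of rotation. Names, ports, and IDs are placeholders.
import boto.ec2.elb
from boto.ec2.elb.healthcheck import HealthCheck

conn = boto.ec2.elb.connect_to_region('us-east-1')

# Forward HTTP on port 80 at the ELB to port 8080 on the webapp servers.
lb = conn.create_load_balancer('my-webapp-lb',
                               zones=['us-east-1a', 'us-east-1b'],
                               listeners=[(80, 8080, 'http')])

# Mark an instance unhealthy after 3 failed checks of /healthcheck;
# it rejoins rotation after 2 consecutive successful checks.
hc = HealthCheck(interval=20, healthy_threshold=2, unhealthy_threshold=3,
                 target='HTTP:8080/healthcheck')
lb.configure_health_check(hc)

# Register the webapp servers (or attach an Auto Scaling group instead).
lb.register_instances(['i-12345678', 'i-87654321'])
```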
Your own custom servers are another matter. For these, it's true: you're on your own, and you'll need to make do with each application's built-in mechanisms, or duct-tape something together with custom scripts and open-source HA tools.
Buying RightScale's solution might be too expensive, but less expensive Amazon tools such as ELB, basic CloudWatch alerting (now free at 5-minute resolution), and Auto Scaling are well worth it if you need high availability.
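For instance, a basic CloudWatch CPU alarm takes only a few lines of boto (boto 2-style API; the instance ID and SNS topic ARN are placeholders):

```python
# Alert when average CPU on an instance stays above 90% for two 5-minute
# periods. The instance ID and SNS topic ARN below are placeholders.
import boto.ec2.cloudwatch
from boto.ec2.cloudwatch.alarm import MetricAlarm

cw = boto.ec2.cloudwatch.connect_to_region('us-east-1')

alarm = MetricAlarm(name='webapp-high-cpu',
                    namespace='AWS/EC2',
                    metric='CPUUtilization',
                    statistic='Average',
                    comparison='>',
                    threshold=90,
                    period=300,              # 5-minute resolution (free tier)
                    evaluation_periods=2,
                    dimensions={'InstanceId': 'i-12345678'},
                    alarm_actions=['arn:aws:sns:us-east-1:123456789012:ops-alerts'])

cw.put_metric_alarm(alarm)
```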
RightScale has some great articles on how to automate failover on EC2. While most of them show you how to do it using RightScale itself, the principles are general and probably helpful to anyone thinking of how to set up a failover architecture on EC2.
The issues you describe (HA, monitoring custom servers, 'duct-taping' services together) are generally handled by a PaaS provider. RightScale and Scalr were already mentioned in a previous answer, and there are other good options (see this question for some PaaS suggestions: https://stackoverflow.com/questions/9542784/looking-for-paas-providers-recommendations).
You should consider which of the providers gives the closest fit to what you need.
Full disclosure: I work for Cloudify, an open-source PaaS provider.
I recently wrote a post on our engineering blog about how to use ELB in conjunction with Auto Scaling to achieve automatic failover for any kind of app. It covers how ELB health checks can be used to ping the status of your app and trigger auto scaling actions.
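(The post itself isn't reproduced here, but the general shape of the pattern in boto 2 looks something like the sketch below; the AMI ID, key name, and group/load-balancer names are placeholders, and the group is pinned at a fixed size so that any instance failing the ELB health check simply gets terminated and replaced.)

```python
# Auto Scaling group that uses ELB health checks: instances the ELB marks
# unhealthy are terminated and replaced automatically. Names/IDs are placeholders.
import boto.ec2.autoscale
from boto.ec2.autoscale import LaunchConfiguration, AutoScalingGroup

asg = boto.ec2.autoscale.connect_to_region('us-east-1')

lc = LaunchConfiguration(name='webapp-lc', image_id='ami-xxxxxxxx',
                         key_name='mykey', instance_type='m1.small')
asg.create_launch_configuration(lc)

group = AutoScalingGroup(name='webapp-asg',
                         launch_config=lc,
                         availability_zones=['us-east-1a', 'us-east-1b'],
                         load_balancers=['my-webapp-lb'],
                         health_check_type='ELB',   # use the ELB health check
                         health_check_period=120,   # grace period in seconds
                         min_size=2, max_size=2)    # keep exactly 2 healthy
asg.create_auto_scaling_group(group)
```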
You install Heartbeat on both servers. You attach an Elastic IP to the 'active' server. You configure a script that performs the failover by issuing an API request to take over the Elastic IP. As soon as the 'stand-by' server gets the Elastic IP (which takes about 30-60 seconds), it can become the master/active server.
I don't have the specifics to provide here.
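Roughly, though, the takeover script that Heartbeat invokes boils down to a single Elastic IP reassociation; a sketch with boto (EC2-Classic style; the region, Elastic IP, and instance ID are placeholders):

```python
# Run on the standby when Heartbeat declares the active server dead:
# pull the shared Elastic IP over to this instance. Values are placeholders.
import boto.ec2

ELASTIC_IP = '203.0.113.10'     # Elastic IP shared by the active/standby pair
MY_INSTANCE_ID = 'i-87654321'   # this (standby) server's instance ID

conn = boto.ec2.connect_to_region('us-east-1')
conn.associate_address(instance_id=MY_INSTANCE_ID, public_ip=ELASTIC_IP)
```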
Amazon already provides Elastic Load Balancing... Why reinvent the wheel?