Does the Pacemaker ecosystem (Corosync etc.) make sense in the context of EC2? Until some point, Corosync required IP multicast (not available on EC2), but I think it supports broadcast now. Still, are Pacemaker et al. the right tool for a cluster to manage itself on EC2, e.g. having nodes monitor each other for failure and trigger bringing up new instances to replace failed ones?
I guess part of the problem is that I've been spending quite a bit of time just straightening out all the players here (Heartbeat, Corosync, OpenAIS, etc.), and I'm still trying to figure out what these things actually are (beyond nebulous terms, e.g. that Pacemaker is a "cluster resource manager" and that Corosync provides "reliable messaging and membership infrastructure").
Hence, apologies if my question itself is a bit bumbling or doesn't completely make sense. Any insights would be greatly appreciated. Thanks.
Does EC2 monitor the health of services inside the guests?
If not, and that is something you want, then Pacemaker would be relevant here. Corosync probably isn't an option yet, as it only does multicast and broadcast, so it would be a Pacemaker + Heartbeat scenario.
Here's a guide to how people do it with Linode instances; much of it is likely to be relevant on EC2 as well: http://library.linode.com/linux-ha/
To answer the question of what the pieces are, Pacemaker is the thing that starts and stops services and contains logic for ensuring both that they're running, and that they're running in only one location (to avoid data corruption).
But it can't do that without the ability to talk to its counterparts on the other node(s), which is where Heartbeat and/or Corosync come in.
Think of Heartbeat and Corosync as a bus that any node can throw messages on and know that they'll be received by all its peers. The bus also ensures that everyone agrees on who is (and is not) connected to the bus, and tells you when that list changes.
For two nodes Pacemaker could just as easily use sockets, but beyond that the complexity grows quite rapidly and is very hard to get right - so it really makes sense to use existing components that have proven to be reliable.
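To make the bus analogy concrete, here is a toy in-memory sketch. It is not how Heartbeat or Corosync are implemented (there is no network, no failure detection, no ordering guarantees), and the names `Bus` and `Node` are made up for illustration. It only shows the two services described above: every message reaches every member, and every member is told the new membership list whenever it changes.

```python
class Node:
    """A cluster member; records what the bus delivers to it."""

    def __init__(self, name):
        self.name = name
        self.inbox = []   # (sender, message) pairs delivered by the bus
        self.views = []   # membership lists, one entry per change


class Bus:
    """Hypothetical in-memory stand-in for a messaging + membership layer."""

    def __init__(self):
        self.members = {}  # name -> Node

    def join(self, node):
        self.members[node.name] = node
        self._notify_membership()

    def leave(self, name):
        self.members.pop(name, None)
        self._notify_membership()

    def broadcast(self, sender, message):
        # Deliver to every current member (the sender included).
        for node in self.members.values():
            node.inbox.append((sender, message))

    def _notify_membership(self):
        # Every remaining member receives the same agreed-upon view.
        view = sorted(self.members)
        for node in self.members.values():
            node.views.append(view)


bus = Bus()
a, b, c = Node("a"), Node("b"), Node("c")
for n in (a, b, c):
    bus.join(n)
bus.broadcast("a", "hello")
bus.leave("b")  # a and c are both told that b is gone
```

The hard part that real implementations solve, and this sketch skips entirely, is getting all nodes to agree on the same view when the network itself is unreliable.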
My gut instinct is to say no, those are really not the right tools for cluster management on EC2. I've used them on standalone hardware and found you have to have a very specific set of needs / failure cases for them to really make sense there. I cannot think of a use case that would demand those tools over more specific cloud monitoring systems and tooling, such as messaging developed with the platform in mind.
That said, I don't consider my answer authoritative here. I am really hoping somebody with a little more experience with that tool set on EC2 chimes in.
For management purposes, EC2 instances are very similar to real hardware. If an instance goes down (or its physical host does), it stays down; there is no intrinsic failover mechanism on EC2. The benefit is that you can restart the instance and it will "magically" reappear without any physical intervention or maintenance, but you still have to trigger that restart, either manually or automatically (maybe EC2 will restart it for you; I don't know). This can mean an outage of several minutes.
If you want an HA solution, recovery will probably be faster, but you have to keep two EC2 instances up all the time.
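The "automatically" option could look something like the sketch below. It is only an outline under stated assumptions: `is_healthy` and `restart` are hypothetical callables you would supply yourself (for example, a TCP or HTTP probe and a call to the EC2 API), and the threshold of three consecutive failures is an arbitrary choice to avoid restarting on a single missed check.

```python
def watch_once(instance_ids, is_healthy, restart, failures, threshold=3):
    """Run one monitoring pass and restart instances that keep failing.

    is_healthy and restart are caller-supplied (hypothetical) callables;
    failures is a dict carrying consecutive-failure counts between passes.
    Returns the list of instance IDs restarted during this pass.
    """
    restarted = []
    for iid in instance_ids:
        if is_healthy(iid):
            failures[iid] = 0          # any success resets the counter
            continue
        failures[iid] = failures.get(iid, 0) + 1
        if failures[iid] >= threshold:
            restart(iid)
            failures[iid] = 0          # give the instance time to come back
            restarted.append(iid)
    return restarted


# Usage with stubbed-out health data instead of real probes.
health = {"i-a": True, "i-b": False}   # made-up instance IDs
restart_calls = []
counts = {}
results = [
    watch_once(["i-a", "i-b"], health.get, restart_calls.append, counts)
    for _ in range(3)
]
```

In practice this would run on a timer, and the watcher itself becomes something that must not fail, which is the problem the clustering tools exist to address.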
But the ideal architecture for EC2 is to have multiple instances for the service you want, all running in parallel and taking requests, and if one dies, the others will pick up the slack.
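As a rough illustration of "the others will pick up the slack", here is a minimal round-robin picker that skips instances that are not alive. The `Pool` class and the `is_alive` callable are hypothetical; on EC2 this job would normally be done by a load balancer rather than by your own code.

```python
class Pool:
    """Round-robin over backends, skipping any that are not alive."""

    def __init__(self, backends):
        self.backends = list(backends)
        self._next = 0  # index to try first on the next pick

    def pick(self, is_alive):
        n = len(self.backends)
        for offset in range(n):
            idx = (self._next + offset) % n
            candidate = self.backends[idx]
            if is_alive(candidate):
                self._next = (idx + 1) % n
                return candidate
        raise RuntimeError("no healthy backends")


pool = Pool(["a", "b", "c"])
alive = {"a", "b", "c"}
before = [pool.pick(alive.__contains__) for _ in range(4)]

alive.discard("b")  # one instance dies; requests go to the rest
after = [pool.pick(alive.__contains__) for _ in range(3)]
```

No failover step is needed here: a dead instance simply stops receiving requests, which is why this layout tends to recover faster than restarting a single instance.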