We need some more advanced functionality than ELB provides (mostly L7 inspection), but it's not obvious how to handle things like heartbeat and high availability with something like haproxy using EC2. There's a high likelihood we'd need 3 or more haproxy nodes in the cluster, so simple heartbeat between two nodes isn't going to work.
Seems like having a heartbeat'd layer in front of the haproxy nodes would be the way to go, possibly using IPVS, but handling the configuration changes as the EC2 cluster changes (either via intentional changes, like expansion, or unintentional, like losing an EC2 node) seems non-trivial.
Preferably the solution would span at least two Availability Zones.
In answer to Qs: No, sessions aren't sticky. And yes, we'll need SSL, but that could in theory be handled by another setup entirely - we're able to direct SSL traffic to a different location than non-SSL traffic.
OK, I've never built an AWS load balancing solution with traffic at SmugMug's levels myself, but just thinking about theory and AWS's services, a few ideas come to mind.
The original question was missing a few things that tend to impact the load balancing design (mainly whether sessions are sticky and how SSL is handled), but those have since been answered in the edit above.
I'm answering from the perspective of how to keep the load balancing layer itself highly available. Keeping the application servers HA is handled by the health checks built into your L7 load balancers.
OK, a few ideas that should work:
1) "The AWS way":
Benefits/idea: Put ELB in L4 (TCP) mode in front of a group of EC2 instances running your L7 load balancer. The L7 load balancers can be fairly simple EC2 AMIs, all cloned from the same AMI and using the same configuration. Amazon's tools can then handle all the HA needs: ELB monitors the L7 load balancers, and if an L7 LB dies or becomes unresponsive, ELB & CloudWatch together spawn a new instance automatically and bring it into the ELB pool.
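Roughly, the wiring for this option could look something like the boto3 sketch below. The load balancer name, ports, zones and instance IDs are all placeholders, and a VPC setup would pass Subnets instead of AvailabilityZones - this is just the shape of it, not a tested recipe:

```python
# Sketch: a classic ELB in plain TCP (L4) mode fronting a pool of haproxy
# instances. Names, ports, zones and instance IDs are illustrative only.
import boto3

elb = boto3.client("elb", region_name="us-east-1")

# L4 listener: ELB just forwards TCP; all L7 logic stays in haproxy.
elb.create_load_balancer(
    LoadBalancerName="l7-lb-frontend",               # placeholder name
    Listeners=[{
        "Protocol": "TCP",
        "LoadBalancerPort": 80,
        "InstanceProtocol": "TCP",
        "InstancePort": 80,
    }],
    AvailabilityZones=["us-east-1a", "us-east-1b"],  # span two AZs
)

# Register the haproxy instances (cloned from the same AMI) behind the ELB.
elb.register_instances_with_load_balancer(
    LoadBalancerName="l7-lb-frontend",
    Instances=[{"InstanceId": "i-0aaa1111"}, {"InstanceId": "i-0bbb2222"}],
)
```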
2) "The DNS round robin with monitoring way":
Benefits/idea: Hand out several Elastic IPs, one per haproxy instance, in a round-robin DNS record. Compliant user agents should automatically switch over to another IP address if one becomes unresponsive. Thus, in the case of a failure, only about 1/3 of your users (with three EIPs) should be impacted, and most of these shouldn't notice anything since their UA silently fails over to another IP. Your external monitoring box will notice that an EIP has become unresponsive and rectify the situation within a couple of minutes.
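A hedged sketch of what that external monitor might do with boto3 is below. The EIPs, allocation IDs, standby instance ID and the "TCP connect on port 80" health probe are all assumptions for illustration:

```python
# Sketch of an external EIP monitor: probe each Elastic IP, and if one stops
# answering, re-point it at a known-good standby instance. All addresses and
# IDs below are placeholders.
import socket
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Elastic IP -> allocation ID, as handed out in the round-robin DNS record.
EIPS = {
    "203.0.113.10": "eipalloc-aaa",
    "203.0.113.11": "eipalloc-bbb",
    "203.0.113.12": "eipalloc-ccc",
}
STANDBY_INSTANCE = "i-0standby"   # healthy haproxy instance kept in reserve

def is_alive(ip, port=80, timeout=5):
    """Consider the EIP healthy if something accepts a TCP connection."""
    try:
        socket.create_connection((ip, port), timeout=timeout).close()
        return True
    except OSError:
        return False

for ip, alloc_id in EIPS.items():
    if not is_alive(ip):
        # Re-associate the dead EIP with the standby haproxy instance so the
        # DNS record keeps pointing at something that answers.
        ec2.associate_address(
            InstanceId=STANDBY_INSTANCE,
            AllocationId=alloc_id,
            AllowReassociation=True,
        )
```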
3) DNS RR to pairs of HA servers:
Basically this is Don's own suggestion of simple heartbeat between a pair of servers, but scaled out: multiple independent pairs, each keeping one IP address alive, with DNS round robin across those IPs.
Benefits/idea: In AWS's completely virtualized environment it's actually not that easy to reason about L4 services and failover modes. By simplifying the problem to one pair of identical servers keeping just one IP address alive, it becomes simpler to reason about and test.
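For the takeover step itself, the action that heartbeat (or keepalived) runs on the surviving node of a pair could be as small as the boto3 sketch below. The allocation ID is a placeholder; the instance-metadata URL is the standard EC2 endpoint:

```python
# Sketch of an EIP-takeover action for the surviving node of a pair: look up
# our own instance ID from EC2 instance metadata, then pull the pair's shared
# Elastic IP over to us. The allocation ID is a placeholder.
import urllib.request
import boto3

SHARED_EIP_ALLOCATION = "eipalloc-pair1"   # the one IP this pair keeps alive

def my_instance_id():
    # Standard EC2 instance-metadata endpoint (IMDSv1-style, for brevity).
    with urllib.request.urlopen(
            "http://169.254.169.254/latest/meta-data/instance-id",
            timeout=2) as resp:
        return resp.read().decode()

ec2 = boto3.client("ec2", region_name="us-east-1")
ec2.associate_address(
    InstanceId=my_instance_id(),
    AllocationId=SHARED_EIP_ALLOCATION,
    AllowReassociation=True,   # take the EIP over from the failed peer
)
```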
Conclusion: Again, I haven't actually tried any of this in production. Just from my gut feeling, option one, with ELB in L4 mode and self-managed EC2 instances as L7 LBs, seems most aligned with the spirit of the AWS platform, and it's where Amazon is most likely to invest and expand later on. This would probably be my first choice.
If you're not doing sticky sessions, or if you're using Tomcat/Apache-style stickiness (appending the node ID to the session ID, as opposed to storing session state in the LB), then I'd use ELB in front of a group of haproxies. ELB has a health check built in, so you can have it monitor the haproxies and take any that go down out of the pool. It's a lot less to set up than heartbeat failover.
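For what it's worth, that health check is just a bit of configuration on the ELB - something along these lines with boto3, where the load balancer name, TCP-port-80 target and thresholds are only illustrative:

```python
# Sketch: tell the classic ELB how to decide that an haproxy instance is down.
# Load balancer name, target and thresholds are placeholders.
import boto3

elb = boto3.client("elb", region_name="us-east-1")
elb.configure_health_check(
    LoadBalancerName="l7-lb-frontend",
    HealthCheck={
        "Target": "TCP:80",        # or an HTTP target that haproxy serves
        "Interval": 10,            # seconds between probes
        "Timeout": 5,
        "UnhealthyThreshold": 2,   # failures before removal from the pool
        "HealthyThreshold": 3,     # successes before re-admission
    },
)
```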
As far as propagating changes goes, I don't have a great answer. Puppet is great for initial configuration and for implementing changes, but for adding/removing nodes you tend to want a faster response than its 30-minute polling interval.
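One faster-than-Puppet approach (untested here, just the shape of it) is to have each haproxy box poll the EC2 API itself and rewrite its backend list. The tag name, file paths and reload command in the sketch below are all assumptions for whatever your setup actually uses:

```python
# Sketch: regenerate haproxy's backend section from the currently running EC2
# instances tagged Role=appserver, then reload haproxy. Tag names, file paths
# and the reload command are placeholders.
import subprocess
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Find running app servers by tag.
reservations = ec2.describe_instances(Filters=[
    {"Name": "tag:Role", "Values": ["appserver"]},
    {"Name": "instance-state-name", "Values": ["running"]},
])["Reservations"]

servers = [
    inst["PrivateIpAddress"]
    for res in reservations
    for inst in res["Instances"]
]

# Render a minimal backend section and append it to a static base config.
backend = "backend app\n" + "".join(
    f"    server app{i} {ip}:8080 check\n" for i, ip in enumerate(servers)
)
with open("/etc/haproxy/haproxy.cfg", "w") as cfg:
    cfg.write(open("/etc/haproxy/haproxy.base.cfg").read() + backend)

# Graceful reload so existing connections aren't dropped.
subprocess.run(["systemctl", "reload", "haproxy"], check=True)
```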
I haven't used it myself, but I've seen a lot of people mention using Puppet to handle this sort of problem on EC2.