We have a web application with low load but high availability requirements. It consists of a single front-end load balancer and a couple of back-end servers. The load balancer is there primarily for masking failures, not for spreading load.
The back-end servers are made highly available via replication across two Availability Zones. But how do you make the front-end tip itself highly available? It's currently a single point of failure.
We may go with AWS Elastic Load Balancing, but it's a bit pricey and, again, we don't really need the load balancing part, so: how would you solve this problem another way?
One idea that comes close is to monitor the front-end with pings or heartbeats; on timeout, switch the front-end's Elastic IP to another machine configured to also serve as the front-end. My main concern with this approach is that it apparently can take 10 minutes for the elastic IP assignment to propagate.
Anything with a faster response time than this approach? Think zero downtime is possible?
Spinning this question another way: how would you accomplish this in a regular self-hosted data center, one where you don't have AWS Elastic Load Balancing?
Fast, Reliable, Cheap. Pick any two.
Honestly, though, "zero downtime" is, for all intents and purposes, impossible. You want zero downtime, but it doesn't appear that you're willing to spend the money necessary to get it.
I believe you're on the right track with heartbeat and swinging the front end's IP to another node. Anything more involved than that would either involve contracting the services of a CDN like Akamai or Limelight or alternatively, obtaining an AS number, configuring BGP, getting an IP allocation, setting up gear in two geographically-distant colos and replicating data between them. Either of those options would be quite expensive and complex to implement.
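To make the heartbeat idea concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: `is_alive` and `promote_standby` are hypothetical callables you would supply yourself (for example, an HTTP health check and a call to the EC2 API that re-associates the Elastic IP with the standby machine). Requiring several consecutive misses before failing over avoids flapping on a single dropped heartbeat.

```python
import time
from typing import Callable, Optional


def monitor_and_failover(
    is_alive: Callable[[], bool],          # hypothetical health check for the active front end
    promote_standby: Callable[[], None],   # hypothetical action, e.g. move the Elastic IP
    max_missed: int = 3,                   # consecutive failures tolerated before failover
    interval_s: float = 1.0,               # seconds between heartbeats
    max_checks: Optional[int] = None,      # None means run until failover happens
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll the active front end; promote the standby after repeated misses.

    Returns True if failover was triggered, False if max_checks ran out first.
    """
    missed = 0
    checks = 0
    while max_checks is None or checks < max_checks:
        checks += 1
        if is_alive():
            missed = 0  # any successful heartbeat resets the counter
        else:
            missed += 1
            if missed >= max_missed:
                promote_standby()
                return True
        sleep(interval_s)
    return False
```

With these defaults the standby is promoted roughly `max_missed * interval_s` seconds after the front end stops answering; how long clients take to actually reach the new machine still depends on how quickly the IP reassignment takes effect.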
When looking at Amazon's ELB service keep in mind that it uses a CNAME record so you won't be able to load balance the root of your domain (example.com). You'd have to use a subdomain like www.example.com and have the machine accepting traffic sent to example.com redirect clients to www.example.com. This gives you a single point of failure. More discussion on this issue can be found on the Amazon forums: http://developer.amazonwebservices.com/connect/thread.jspa?threadID=32044
Your own AS number across two or more carrier-class networks, with multiple physical sites online, is as close to zero downtime as you will get. That said, EC2 is close to zero downtime.
By running two load balancers in an active/passive or active/active configuration, you can avoid the single point of failure.
Just keep in mind that in an active/active scenario both load balancers serve traffic at the same time, and if either fails, the other one takes over the full load.
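As a rough illustration of the difference between the two modes, here is a toy model in Python. The `LoadBalancer` class and `serving_nodes` function are made up for this sketch and do not correspond to any real load balancer's API; the sketch only shows which nodes carry traffic in each mode.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LoadBalancer:
    name: str
    healthy: bool = True


def serving_nodes(lbs: List[LoadBalancer], mode: str) -> List[str]:
    """Return the names of the load balancers that carry traffic right now."""
    healthy = [lb.name for lb in lbs if lb.healthy]
    if mode == "active/active":
        # Every healthy node serves traffic at the same time.
        return healthy
    if mode == "active/passive":
        # Only the first healthy node serves; the rest wait as standbys.
        return healthy[:1]
    raise ValueError(f"unknown mode: {mode}")


pair = [LoadBalancer("lb1"), LoadBalancer("lb2")]
print(serving_nodes(pair, "active/active"))   # both nodes serve
print(serving_nodes(pair, "active/passive"))  # only lb1 serves

pair[0].healthy = False                       # lb1 fails
print(serving_nodes(pair, "active/active"))   # lb2 carries everything
print(serving_nodes(pair, "active/passive"))  # lb2 is promoted
```

In both modes the survivor ends up handling all traffic after a failure, so each node should be sized to carry the full load on its own.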