Sorry for being unclear previously.
We have a VMware virtualized server instance that is our main production server. It hosts a series of web-based applications on close to a hundred unique top-level domains. For serving web pages we use a LAMP stack. This server also runs our primary and secondary DNS servers (on two IP addresses different from the one used for serving web content). Finally, we also host our mail (POP and SMTP) using Exim (I believe).
Recently we've had issues causing our root filesystem to become read-only, preventing apache2 and MySQL connections and preventing incoming email, essentially taking down the web presence and email for many thousands of clients. The nature of the issue (still undetermined but under control) did not affect BIND, so DNS was still resolving fine.
Since then, we have begun to mirror the production web sites and associated mysql databases onto a secondary server. This server is completely production ready.
My question is: what are the recommended methods for failover, so that if Apache on our main production server fails (for whatever reason), traffic is quickly, if not automatically, forwarded as seamlessly as possible to the secondary?
DNS round robin is undesirable for us since we do not wish to spread the load over two servers; in fact, we only ever want the secondary to receive HTTP requests when the main server is non-responsive. This is partly because our mirroring process is one-way, so changes made on the secondary server would not be reflected on the main server and could even be lost.
DNS round robin is not recommended because:
1- Different servers may not receive the same number of requests, so they will be loaded unevenly.
2- DNS load balancing does not take server availability into account. A failed server's DNS record remains in place and may still be handed out to clients.
3- DNS caching will make it even worse. You don't have control over the DNS caches of your clients or of any intermediate DNS servers in between. If you plan to make your TTL value smaller, it may not work as expected. Look at this post. The accepted answer says that
Many DNS servers do not honor your TTL.
The recommended solution is to install a load balancer like HAProxy along with a high-availability solution like Heartbeat. This setup should be installed on two machines. If one goes down, the other will take over the VIP (via Heartbeat). The running machine takes care of checking the health of the backend servers and distributing the load (via HAProxy).
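The caching concern in point 3 can be put into numbers. The following is only a rough Python sketch: `worst_case_stale_seconds`, the resolver names and the TTL floors are all made up for illustration, and real resolvers may behave differently.

```python
def worst_case_stale_seconds(record_ttl, resolver_min_ttl=0):
    """Upper bound on how long a resolver may keep serving the old address.

    record_ttl: TTL you publish on the record, in seconds.
    resolver_min_ttl: hypothetical floor a caching resolver enforces,
        ignoring lower TTLs (0 means the resolver honors your TTL).
    """
    return max(record_ttl, resolver_min_ttl)


# You publish a 60 s TTL, but three hypothetical resolvers behave differently.
resolver_floors = {"honors-ttl": 0, "5-minute-floor": 300, "1-hour-floor": 3600}

for name, floor in resolver_floors.items():
    stale = worst_case_stale_seconds(60, floor)
    print(f"{name}: clients may hit the dead server for up to {stale} s")
```

However low you set the TTL, the slowest cache in the path decides how long some clients keep using the failed address.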
EDIT:
If you want the servers to work in active-passive mode, you don't need a load balancer. You can install Heartbeat with Pacemaker to monitor system resources such as Apache, MySQL, etc. The cluster can be configured to keep only one server active.
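To illustrate what "only one active server" means in practice, here is a minimal Python sketch of the decision logic. It is not how Heartbeat or Pacemaker are implemented; `ActivePassiveMonitor`, `probe` and `max_failures` are hypothetical names, and the health probe is injected so the logic can be tried without a network. It assumes failback is manual, since the question says the mirroring is one-way.

```python
class ActivePassiveMonitor:
    """Decide which node should hold the service in an active-passive pair."""

    def __init__(self, probe, max_failures=3):
        self.probe = probe                # callable(node) -> bool, True if healthy
        self.max_failures = max_failures  # consecutive failures before failing over
        self.active = "primary"
        self.failures = 0

    def tick(self):
        """Run one check cycle and return the node that should be active."""
        if self.active == "secondary":
            # No automatic failback: the mirror is one-way, so an operator
            # must resync data before the primary takes traffic again.
            return self.active
        if self.probe("primary"):
            self.failures = 0
        else:
            self.failures += 1
            # Fail over only if the secondary itself looks healthy.
            if self.failures >= self.max_failures and self.probe("secondary"):
                self.active = "secondary"
        return self.active


health = {"primary": True, "secondary": True}
monitor = ActivePassiveMonitor(lambda node: health[node])
print(monitor.tick())                      # primary
health["primary"] = False
print([monitor.tick() for _ in range(3)])  # ['primary', 'primary', 'secondary']
```

Requiring several consecutive failures avoids failing over on a single dropped check.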
Install nginx in front of Apache. If one Apache server is down, nginx will exclude it and serve data from the other "workers".
So the setup should look like this:
nginx -> worker #1 (Apache), worker #2, worker #3 etc.
Of course nginx should be installed on a dedicated box. One problem you have to resolve: what if nginx goes down, but...
nginx website: http://nginx.org
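The exclusion behaviour described in this answer can be sketched in Python. This is only a loose model of the idea behind nginx's `max_fails`, `fail_timeout` and `backup` upstream parameters, not nginx's actual algorithm; the `Backend` and `Upstream` classes and the backend names are hypothetical. It always prefers the first available non-backup server, so the mirror only receives traffic while the main server is marked as failed.

```python
import time


class Backend:
    def __init__(self, name, backup=False):
        self.name = name
        self.backup = backup   # used only when no non-backup backend is available
        self.fail_times = []   # timestamps of recent failures


class Upstream:
    """Pick a backend, skipping ones that failed too often recently."""

    def __init__(self, backends, max_fails=1, fail_timeout=10.0, clock=time.monotonic):
        self.backends = backends
        self.max_fails = max_fails
        self.fail_timeout = fail_timeout
        self.clock = clock

    def _available(self, backend):
        # Forget failures older than fail_timeout, then compare with the limit.
        cutoff = self.clock() - self.fail_timeout
        backend.fail_times = [t for t in backend.fail_times if t > cutoff]
        return len(backend.fail_times) < self.max_fails

    def report_failure(self, backend):
        backend.fail_times.append(self.clock())

    def pick(self):
        """Return the first available primary, else the first available backup."""
        for want_backup in (False, True):
            for b in self.backends:
                if b.backup == want_backup and self._available(b):
                    return b
        return None


now = [0.0]  # fake clock so the example does not need to sleep
up = Upstream([Backend("apache-main"), Backend("apache-mirror", backup=True)],
              clock=lambda: now[0])
print(up.pick().name)   # apache-main
up.report_failure(up.backends[0])
print(up.pick().name)   # apache-mirror
now[0] = 11.0
print(up.pick().name)   # apache-main
```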
The Linux Virtual Server is a highly scalable and highly available server built on a cluster of real servers, with the load balancer.
UCARP allows a couple of hosts to share common virtual IP addresses in order to provide automatic failover.
From what I gather, you are looking for a high-availability solution - i.e. when one server goes down, another server can take over.
What ThomK suggested is one way to do it, except that the nginx box will be the single point of failure. Another thing you can look at is using HAProxy (or even nginx) together with some sort of IP-based failover.
You can get a lot of ideas from elsewhere.
For redundancy within a single site, on a single Internet feed, you want to put clustered hardware on your front end, with a standby box ready to take over the IP address of a failed box. But you'll be out of action if your ISP has a failure, or your site loses power or suffers some other problem.
If you want protection against loss of a whole site, or loss of your ISP, then there are really only two options. One is to get your own BGP autonomous system number, and run your own BGP routes, with peering (well, paid transit) with several different ISPs. The smallest netblock you can do that with is a /24, so you'll need to have a netblock at least that big. You can then advertise different routes to a different site if your main site goes down.
Your other option, as you suggest, is round robin DNS. Some people advise against this on theoretical grounds, and there are problems with Windows Vista clients selecting addresses non-randomly, but it should work fine for redundancy, with the backup box just reverse-proxying traffic back to the main box unless the main box/site/Internet feed goes down.
Really? I'd be very interested to see the articles describing this.
You've not mentioned what OS this runs on - which has a lot of relevance to how the clustering is implemented, nor what software you currently use for DNS / how easy it is to change this.
If you can possibly avoid it, I'd strongly recommend using a replicated-type cluster rather than a shared-storage cluster, particularly where the data is not changing frequently.
BIND already provides master/slave replication, which distributes the data across multiple platforms, so it's just a matter of working out some way of routing the requests to an available server.
Similarly for MySQL.
Implementing the same for your webserver files is simple using rsync.
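As a rough idea of what the rsync mirroring could look like when driven from a script, here is a small Python sketch. The function names, host and paths are hypothetical; the only rsync options used are `-a` (archive), `-z` (compress) and `--delete`.

```python
import subprocess


def build_rsync_command(src_dir, dest_host, dest_dir, delete=True):
    """Build an rsync command that mirrors src_dir to dest_host:dest_dir."""
    cmd = ["rsync", "-az"]        # -a: archive mode, -z: compress in transit
    if delete:
        cmd.append("--delete")    # remove files on the mirror that vanished at the source
    # A trailing slash on the source copies the directory's contents,
    # not the directory itself.
    cmd += [src_dir.rstrip("/") + "/", f"{dest_host}:{dest_dir}"]
    return cmd


def mirror(src_dir, dest_host, dest_dir):
    """Run the mirror once; raises CalledProcessError if rsync fails."""
    subprocess.run(build_rsync_command(src_dir, dest_host, dest_dir), check=True)


# Hypothetical host and path, shown without actually running rsync.
print(build_rsync_command("/var/www", "standby.example.com", "/var/www"))
```

Run from cron on the master, this keeps the copy one-way, which is what you want if only the master ever accepts changes.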
While round-robin failover is a little slow compared with other methods, it's MASSIVELY more robust, simpler to administer and cheaper than other approaches. Without knowing your reasons for disliking round-robin, it's hard to suggest something which might be more acceptable. (The guys who wrote HAProxy recommend using at least 2 proxy servers with round-robin in front of an HA cluster.)
I would suggest that you use (at least internally) virtual addresses for the master and slave servers; this simplifies the business of promoting the slave. You might even want to use virtual addresses for each service/cluster node. This makes it easy to set up a heartbeat tied to address failover. This won't be any faster than round-robin but gives marginal benefits during reduced service, assuming it's implemented correctly and you can resolve split-brain problems.
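One common way to reduce the split-brain risk is to let the slave claim the virtual address only when it can still reach some third party. The Python sketch below shows just that decision. `should_promote`, the witness idea as coded here and the 30-second `deadtime` default are illustrative assumptions, not the behaviour of any particular heartbeat package.

```python
def should_promote(seconds_since_peer_heartbeat, witness_reachable, deadtime=30.0):
    """Decide whether the standby should claim the virtual address.

    Promoting only when a third-party witness (for example the default
    gateway) is reachable guards against split brain: if the standby is the
    one cut off from the network, it stays passive.
    """
    peer_looks_dead = seconds_since_peer_heartbeat > deadtime
    return peer_looks_dead and witness_reachable


print(should_promote(45.0, witness_reachable=True))    # True: master silent, network fine
print(should_promote(45.0, witness_reachable=False))   # False: standby itself is isolated
print(should_promote(5.0, witness_reachable=True))     # False: master still alive
```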