For the sake of their reputation, I won't mention any names. Instead I'll use:
Business I worked for previously - ABC Web Dev
Hosting company they used - XYZ Hosting
I recently found out that XYZ Hosting had some sort of incident where they ended up losing a lot of their clients' data, including ABC Web Dev's. ABC Web Dev was able to recover some of their customers' websites by pulling them from their local development computers and putting them up on another hosting provider. They ended up losing a lot of clients because of it, and their reputation was ruined.
I'm starting my own web dev company and I don't want to run into this same issue. I'm planning on using Rackspace, but although they are a great company, according to Wikipedia they have still had downtime in the past. I thought it might be a good idea to run two providers at once, so that if anything happened at one, the websites would still be live because of the other.
I know the websites would have to be pulling from one server at all times, but if there's a way to redirect requests to the second server if the first one is down that would solve my issue.
As a note, we will have a staging environment set up locally, which will allow for quick recovery if a provider did have any issues. However, I'd like to avoid any downtime at all if possible.
So my questions are:
Has anyone tried running two providers simultaneously?
Would this be considered good practice or am I going too far?
Is there really any way to run two simultaneously where one server acts as a backup?
You shouldn't need to run two sites simultaneously if you have good, tested local backups.
If both web hosts run cPanel, and you have a cPanel backup of your entire account, you should be able to deploy it to your other web host quickly.
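On the "tested" part: a rough sketch of a sanity check you could run on each backup archive before trusting it. It assumes the backup is a tar archive (optionally compressed); the function name and the member paths in the usage note are made up for illustration, so substitute whatever your own backups actually contain. It is not a replacement for a real trial restore.

```python
import tarfile


def verify_backup(archive_path, required_members):
    """Return the required paths that are missing from a tar backup.

    Every regular file is read to the end, which forces the whole
    archive to be decompressed. A truncated or corrupt archive should
    therefore raise an error here instead of during a real restore.
    """
    seen = set()
    with tarfile.open(archive_path, "r:*") as archive:
        for member in archive:
            seen.add(member.name)
            if member.isfile():
                handle = archive.extractfile(member)
                # Read in 1 MiB chunks so large files don't fill memory.
                while handle.read(1024 * 1024):
                    pass
    return [name for name in required_members if name not in seen]
```

For example, `verify_backup("backup.tar.gz", ["homedir/public_html/index.php"])` (hypothetical paths) returns an empty list when the file is present and the archive reads cleanly.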
The problem with running multiple sites at once is keeping everything up to date. It's hard, especially when databases are involved, and most sites run databases these days.
Let's be honest here: the reason your previous employer lost the clients was not that the websites were down for a few hours (or even a few days), but that there were no backups and the clients lost their entire websites forever.
Here's our disaster recovery plan:
This way we know we can recover from an outage in about 3-4 hours.
Running multiple hosts at once is much, much more difficult, and you still have to wait for DNS propagation anyway when you do want to switch hosts.
Dealing with this without any downtime at all is going to be hard. The Big Boys handle this by using BGP to reroute traffic for the entire block of IPs via the new facility. For everyone else, you're going to have to pick a solution that provides the level of redundancy you want.
In our case, we were looking for "Site A ceases to exist"-level protection (as opposed to "Site A accidentally creates a network loop and the network grinds to a halt for 15 minutes"). To do this, we set up our primary DNS server at Site A and our secondary DNS server at Site B, so that if A goes down permanently, we can change the zone files on B to point everything to B, then update our domain to replace the Site A nameserver with one at some other site. There are situations we can't handle this way (for instance, we're locked out of Site A but the DNS server is still running there, in which case we can't really do anything until the domain record has been updated), but for everything else, things are running again once the cached DNS records have expired.
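To put a rough number on "once the cached DNS records have expired", here is a back-of-the-envelope sketch. It assumes resolvers honor the record TTL (not all do), and the figures are hypothetical, not measurements from our setup.

```python
def worst_case_outage_seconds(detect_seconds, update_seconds, record_ttl_seconds):
    """Upper bound on how long some visitors keep hitting the dead site.

    A resolver that cached the old record just before the zone change
    may keep serving it for up to one full TTL after the change.
    """
    return detect_seconds + update_seconds + record_ttl_seconds


# Hypothetical figures: 10 minutes to notice the outage, 20 minutes to
# edit the zone on Site B, and a 1-hour TTL on the records.
print(worst_case_outage_seconds(600, 1200, 3600))  # 5400 seconds = 90 minutes
```

The TTL is the only term you can shrink ahead of time, which is why a lower TTL on the records you expect to switch makes the change take effect sooner, at the cost of more DNS queries.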
Depending on the specific "threat" you could set up other options: for instance, for loss of all data at site A but the server there is still working (rm accident?), you could set up a basic webserver that redirects all the traffic to backup.example.com. If someone is on call to react to the situation, this can be much faster than switching DNS records.
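A minimal sketch of such a redirect-everything webserver, using only Python's standard library. The port and the choice of a temporary (302) redirect are my assumptions; in practice you would more likely do this with a one-line rule in whatever webserver is already running on the box.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hostname from the example above; swap in your real backup site.
BACKUP_ORIGIN = "http://backup.example.com"


class RedirectToBackup(BaseHTTPRequestHandler):
    """Answer every GET or HEAD with a redirect to the backup site."""

    def _redirect(self):
        # 302 is temporary, so browsers should not cache the redirect
        # once Site A is serving real content again.
        self.send_response(302)
        # self.path includes the query string, so deep links survive.
        self.send_header("Location", BACKUP_ORIGIN + self.path)
        self.send_header("Content-Length", "0")
        self.end_headers()

    do_GET = _redirect
    do_HEAD = _redirect


def make_server(port=8080):
    """Build the server; call serve_forever() on the result to run it."""
    return HTTPServer(("", port), RedirectToBackup)


# To run it: make_server(8080).serve_forever()
```

This only handles GET and HEAD. Form posts and anything served over HTTPS would need more thought (a certificate for the original hostname, for a start).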
All of this assumes that you're keeping data synchronized between the sites. How to do so will depend entirely on what it is you need to synchronize, how far behind you're willing to let Site B be, how much you intend to spend on this, and whether that rm accident should be replicated as well... a subject for a different question entirely.