Over this past weekend we had severe storms here in Virginia, and of course the crisis in Japan is a reminder that things can go bad in a heartbeat! A question I keep asking myself: "What if a tornado hit my data center? Am I prepared?"
I have great backup systems "in my rack," including tape backup. Because the data center is not close by, moving tapes off site is not practical. What I'd like to find or create is a system that, on a schedule, can back up critical items such as web sites and databases and copy them somewhere remote, i.e. my server at home. I have FiOS with 35 Mbit service, so I have the broadband; what I need is the "system" to do this. I am a programmer, so I could create something that FTPs information down on a schedule, but I'm curious whether something already exists that would fill this remote backup need. My SQL Servers are backed up to storage arrays; I could pull those backups down, or even schedule my SQL Server here to sync with the production servers. I use Windows Server 2008 R2 and SQL Server 2008 R2.
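To give an idea of what I mean, here's a rough sketch of the kind of scheduled pull I could write myself (the host, account, and paths are just placeholders, and it would run from Task Scheduler on the home server):

    import ftplib
    import os

    # Rough sketch only: pull any new backup files from the data center
    # down to the home server. Host, account, and paths are placeholders.
    HOST = "backup.example.com"
    REMOTE_DIR = "/backups/sql"        # where the nightly .bak files land
    LOCAL_DIR = r"D:\offsite\sql"      # destination on the home server

    def pull_new_files():
        os.makedirs(LOCAL_DIR, exist_ok=True)
        ftp = ftplib.FTP(HOST)
        ftp.login("backupuser", "********")   # dedicated, limited account
        ftp.cwd(REMOTE_DIR)
        for name in ftp.nlst():
            local_path = os.path.join(LOCAL_DIR, name)
            if os.path.exists(local_path):    # already copied on an earlier run
                continue
            with open(local_path, "wb") as fh:
                ftp.retrbinary("RETR " + name, fh.write)
        ftp.quit()

    if __name__ == "__main__":
        pull_new_files()   # scheduled nightly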
What do you all recommend for an off-site strategy in a crisis such as a natural disaster knocking out our data center? Are you prepared? I hope others ask themselves this question and learn from the natural disasters we've been seeing all too frequently.
Your options should be dictated by your service-level agreements with your customers and limited by your budget.
At the very minimum, you should have off-site backups of all critical data. That is to say, any data which you cannot recreate from scratch needs to be stored elsewhere. Offline backups are better: online backups or replication might help when a tornado hits, but what happens if an angry employee drops a database or destroys a filesystem?
From a baseline of offline backups, you can begin to explore options which speed up recovery in exchange for higher cost. There is a huge range here, from the single host for online backups you describe all the way to completely replicated environments with synchronous data replication running active-active for near-zero downtime.
You'll find recovery-from-scratch to be much easier if you separate your data from your infrastructure as neatly as possible. For example, from-scratch recovery is going to be much, much faster if you deploy using systems like puppet or chef rather than by hand. Redoing all of the work you've put into building your systems will be much faster if you can automate as much as possible. Keeping data separate also reduces the amount of data you need to back up: don't spin off gigabytes of OS if you only really need a few megs of system configs and application data.
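As a purely illustrative example (the paths are assumptions, not anything specific to your environment): once deployment is automated, the off-site backup job only needs to grab configuration and the data you can't recreate, along the lines of:

    import tarfile
    from datetime import date

    # Illustrative only: with deployment automated, the off-site backup set
    # shrinks to configuration plus application data. Paths are hypothetical.
    PATHS = [
        "/etc/puppet",            # the manifests that rebuild the box
        "/etc/nginx",             # service configuration
        "/var/www/app/uploads",   # application data you cannot recreate
    ]

    archive = "/backups/config-and-data-%s.tar.gz" % date.today()
    with tarfile.open(archive, "w:gz") as tar:
        for path in PATHS:
            tar.add(path)         # a few megs, not gigabytes of OS image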
The options can get quite pricey, so you need to determine what your company is willing to spend on disaster recovery and how much downtime your customers can tolerate. Eliminate the options which are too expensive or too slow for your customers.
Once you choose a disaster recovery solution, make sure you practice it. I would recommend at least once a year or whenever your architecture changes, whichever happens more often.
Business Continuity goes much, much further than just making sure you've got access to readable backups. But confining the scope of the answer to just that: ultimately it's only going to be viable where the end-to-end bandwidth from the datacenter to the backup location is sufficiently large to handle the volume of data changes.
When you're talking about a datacenter, for most people that's gigabytes of data per week.
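As a back-of-the-envelope check (all figures below are illustrative assumptions, including the change volume), you can work out whether a nightly delta even fits in the transfer window:

    # Rough sanity check: does the nightly change volume fit the pipe?
    # All figures are illustrative assumptions.
    nightly_delta_gb = 20    # data changed per day
    uplink_mbit = 35         # e.g. the FiOS upstream mentioned in the question
    efficiency = 0.7         # protocol overhead and other traffic on the link

    hours = (nightly_delta_gb * 8 * 1000) / (uplink_mbit * efficiency) / 3600
    print("%.1f hours per night" % hours)   # ~1.8 h here; a 200 GB delta would need ~18 h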
IME, even on a small scale the best solution is a distributed (or mirrored) operation. Plan it right and there should be little cost overhead compared with a single datacenter.
But if you must copy all the data out to a standby location, or even just to remote storage, then:
1) don't use FTP: it gives you no delta transfer, no integrity checking, and it sends credentials in the clear
2) for generic files, use something like rsync, which is optimized for exactly this purpose (see the sketch after this list)
3) for databases, look at the tools available specifically for your DBMS; the on-disk file structure can change massively even when the data hasn't changed much. NB this includes the Windows registry and Active Directory data.
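A minimal sketch of point 2, assuming an SSH-reachable standby host (names and paths are placeholders) and that the job gets scheduled nightly:

    import subprocess

    # Mirror a file tree to the standby location with rsync over SSH.
    # Source and destination are placeholders.
    SRC = "/srv/www/"    # trailing slash: copy the directory's contents
    DEST = "backup@standby.example.com:/backups/www/"

    subprocess.run(
        ["rsync", "-az", "--delete", "-e", "ssh", SRC, DEST],
        check=True,      # fail loudly so the scheduler can flag a bad run
    )

For the SQL Server side of point 3, the DBMS-native route is BACKUP DATABASE plus log shipping (and then copying the resulting .bak files), rather than copying the live MDF/LDF files.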
We have a VPN from our office to our offsite datacenter. At the offsite datacenter we have a server hosting a network share that we configure as a destination in our backup software (we run Symantec Backup Exec), i.e. \\OFFSITEDATACENTER\OFFSITESTORAGE
We then do:
- a full backup over the weekend to that location
- an incremental each evening
as well as our normal "onsite" backups.
We also run VMware VDR to take images of our main servers each week; those images go onto a 2 TB SATA disk encrypted with FreeOTFE, which I take home each week.
We have a number of separate active/active or active/semi-active data centres with >50 miles between them, different power suppliers, separate security, and diversely-routed 10 Gbps meshed links between them - oh, and we ship our backup disks between them too. This does for us.
The specifics of handling a particular backup scheme have been covered ad nauseam here and elsewhere. I'm going to approach this question from a more high-level viewpoint: the general guidelines that help you decide how to approach disaster recovery. I've been in quite a few situations where planning had to be in place in case the datacenter became a smoking crater. Thankfully, we only had to use it once. The most important things to remember are:
1) Don't waste your time overengineering everything to fail over with <1 ms precision if you don't have to. A complete failure of that magnitude will generally excuse a few hours' worth of recovery.
2) As a corollary to #1, make sure that expectations are realistically determined and codified in a policy somewhere. Having a set recovery-time goal is important, since you can spend unlimited time and funds making it "even better."
3) Prioritize your systems. The plan for recovery needs to be built around a definitive list of the importance of each and every system. Don't miss the obvious things, either, like getting DNS and AD up before the rest of the Windows servers.
4) If it's not offsite AND off-network, it's just a copy. This goes right in line with another key thing to remember: RAID is not a backup plan.
5) Test, Test, TEST! Test every inch of your plan that you can. If you can get a weekend-long maintenance window, disconnect the uplink and/or building power and test your team's reaction time and effectiveness. A disaster recovery plan that's never tested is just wishful thinking.