I'm looking at implementing a very large storage server to be used as live NAS for several other servers (all Linux-based).
By very large, I mean between 4TB and 20TB usable space (although it's unlikely we'll actually make it 20TB).
The storage server will be RAID 10 for data security and performance, but we'll still need a backup solution including off-site backup.
My question is: How do you back up that much data!?
It's not like I can just connect a portable hard-drive and transfer the files over. We currently have no other devices with this much storage space.
Do I need to budget for a second, off-site storage server or is there a better solution?
There are many ways of handling data that size. A lot of it depends on your environment and how much money you're willing to spend. In general there are a few overall 'get the data off the server' strategies: back it up to tape, back it up to disk somewhere else, or replicate it to a second storage array.
That's the 100Km view. Once you start zooming in, things get a lot more fragmented. As already mentioned, LTO5 is a specific tape technology designed for these kinds of high-density loads. Another identical storage array is a good target, especially if you can use something like GlusterFS or DRBD to get the data over there. Also, whether you need a backup rotation or just the ability to keep running in case the array fails will affect what you put into place.
Once you've settled on a 100Km view method, getting into software will be the next big task. Factors influencing this are what you can install on your storage server in the first place (if it's a NetApp, that's one thing, a Linux server with a bunch of storage is another thing entirely, as is a Windows server with a bunch of storage), what hardware you pick (not all FOSS backup packages handle tape-libraries well, for instance), and what kind of backup retention you require.
You really need to figure out what kind of Disaster Recovery you want. Simple live-replication is easier, but it only lets you restore from just now, not from last week. If the ability to restore from last week is important to you, then you need to design for that sort of thing. By law (in the US and elsewhere) some data needs to be preserved for 7+ years.
Simple replication is the easiest to do, and it's what DRBD is designed for. Once the initial copy is done, it just sends changes. The main complicating factor here is network locality; if your 2nd array is not near the primary, DRBD may not be feasible. You'll also need a 2nd storage server with at least as much storage space as the first.
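To give a feel for it, a minimal DRBD resource definition looks something like the sketch below. The resource name, device paths, hostnames and addresses are all made up, the exact syntax varies a bit between DRBD versions, and you'd pick the replication protocol to match the distance between the two boxes.

    # /etc/drbd.d/nas.res -- names, devices and addresses are examples only
    resource nas {
      protocol A;                  # asynchronous; more forgiving over a slow or distant link
      device    /dev/drbd0;
      disk      /dev/vg0/nas;      # the backing volume on each node
      meta-disk internal;
      on nas-primary {
        address 10.0.0.10:7789;
      }
      on nas-backup {
        address 10.0.0.11:7789;
      }
    }

After running drbdadm create-md nas and drbdadm up nas on both nodes, you promote one side to primary and the initial full sync starts; from then on only changed blocks cross the wire.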
About tape backup...
LTO5 can hold 1.5TB of data without compression. Feeding these monsters requires very fast connectivity, either Fibre Channel or 6Gb SAS. Since you need to back up more than 1.5TB in one whack, you need to look into autoloaders (here is an example: link, a 24-slot, 1-drive autoloader from HP). With software that supports them, they'll handle changing tapes mid-backup for you. They're great. You'll still have to pull tapes out to send off-site, but that's a damn sight better than hanging around all night to load tapes yourself when the backup calls for them.
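Just to show what the autoloader is doing under the covers, here is a rough, hand-driven sketch; the changer and tape device names are assumptions for your hardware, and in practice the backup software drives all of this for you:

    # Device names (/dev/sg3, /dev/nst0) are examples; check lsscsi on your box
    mtx -f /dev/sg3 status            # show which slots contain tapes
    mtx -f /dev/sg3 load 1 0          # move the tape in slot 1 into drive 0
    tar -cvf /dev/nst0 /srv/data      # stream data to the non-rewinding tape device
    mtx -f /dev/sg3 unload 1 0        # return the tape to its slot when done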
If tape gives you the 'legacy, ew' heebie-jeebies, a Virtual Tape Library may be more your speed (such as this one from Quantum: link). These pretend to be tape libraries to your backup software while actually storing things to disk with robust (you hope) de-duplication techniques. The fancier ones will even copy virtual-tapes to real-tapes for you, if you like that sort of thing, which can be very handy for off-site rotations.
If you don't want to muck about with even virtual tapes, but still want to do direct-to-disk backups, you'll need a storage array sized big enough to handle that 20TB, plus however much net-change data you want to keep hold of. Different backup packages handle this differently. Some de-duplication technologies are really nice, others are hacky kludges. I personally don't know the state of FOSS backup software packages in this area (I've heard of Bacula), but they may be sufficient. A lot of commercial backup packages have local agents you install on servers to be backed up in order to increase throughput, which has a lot of merit.
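I can't vouch for how well it scales, but to give an idea of the shape of the FOSS route, a Bacula job definition is roughly the sketch below; every name in it (client, fileset, schedule, storage, pool) is hypothetical and has to be defined elsewhere in bacula-dir.conf:

    # Rough sketch only -- all resource names here are made up
    Job {
      Name     = "NAS-Backup"
      Type     = Backup
      Level    = Incremental
      Client   = nas-fd
      FileSet  = "NAS Data"
      Schedule = "WeeklyCycle"
      Storage  = LTO5-Library
      Pool     = Offsite
      Messages = Standard
    }

    FileSet {
      Name = "NAS Data"
      Include {
        Options {
          signature = MD5
        }
        File = /srv/data
      }
    }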
An LTO-5 jukebox? You'd need somewhere between three and fifteen tapes to back that array up, which isn't a crazily large number. The jukebox will take care of changing the tapes for you, and good backup software (e.g. Bacula) will keep track of which file(s) are on which tape.
You will also want to consider the time required to back up a file system that large, inasmuch as it is very likely the FS will change during that period. For best results, a file system that supports snapshots would be very helpful, so you can take an instantaneous snapshot and perform full or incremental backups against that, instead of against the live filesystem.
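If the data lives on LVM (or a filesystem with native snapshots), the workflow is roughly the sketch below; the volume group, volume, size and mount point names are examples:

    # Freeze a point-in-time view of the data volume (names are examples)
    lvcreate --snapshot --size 50G --name data-snap /dev/vg0/data
    mount -o ro /dev/vg0/data-snap /mnt/data-snap

    # Back up the consistent snapshot instead of the live, changing filesystem
    tar -czf /backups/data-$(date +%F).tar.gz -C /mnt/data-snap .

    umount /mnt/data-snap
    lvremove -f /dev/vg0/data-snap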
You should probably be looking at backing up to disk, since tape will take a long time and, being sequential-access, restores will take forever.
Definitely take advantage of differential or incremental backups - only backing up changes, at whatever frequency makes sense for you.
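One cheap way to get that on plain Linux is rsync with --link-dest, which hard-links unchanged files against the previous run, so every dated directory looks like a full backup but only costs the space of what changed; the paths below are examples:

    # Each day's directory is a "full" view, but unchanged files are hard links.
    # On the very first run there is no previous directory, so everything is copied.
    TODAY=$(date +%F)
    YESTERDAY=$(date -d yesterday +%F)
    rsync -a --delete \
          --link-dest=/backups/$YESTERDAY \
          /srv/data/ /backups/$TODAY/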
Probably the ideal solution would have a 2nd similarly sized server at another location, where incremental backups are sent regularly, and that could be swapped into place quickly if the main server ever died. However another option would be to use removable drives on-location, which are then taken offsite for storage.
When you're dealing with that much data, it also makes sense to break your backups up into smaller backup jobs, and if they can't all be backed up every day, stagger them so set A gets backed up one day and set B the next.
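A crude way to stagger the sets is straight from cron; the scripts, times and paths here are hypothetical, and serious backup software has its own scheduler for this:

    # /etc/crontab sketch -- set A on Mon/Wed/Fri, set B on Tue/Thu/Sat
    30 1 * * 1,3,5   root   /usr/local/sbin/backup-set-a.sh
    30 1 * * 2,4,6   root   /usr/local/sbin/backup-set-b.sh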
Always be thinking about the restore procedure. We got stung once when we had to restore a file from a several-hundred-gig backup job, which took a lot of memory and a lot of time to rebuild the backup index and restore. In the end we couldn't complete it in a day, and had to build a dedicated restore server to allow our main backup server to continue its nightly jobs!
--added--
You also want to be thinking about deduplication technologies, which can save huge amounts of space by not backing up the same information multiple times for multiple users. Many backup solutions and filesystems offer deduplication as part of their functionality.
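For example, if the backup target happens to be a ZFS pool, deduplication can be switched on per dataset; the pool and dataset names below are made up, and bear in mind that ZFS dedup wants a lot of RAM to perform well:

    zfs set dedup=on tank/backups       # deduplicate new writes to this dataset
    zpool get dedupratio tank           # see how much space it is actually saving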
First, enumerate the risks you are protecting against.
Then evaluate the cost of the various risk-avoidance solutions.
Then evaluate rotation strategies (how far back do you want to be able to recover, how much data can you afford to lose).
Then decide what your data is worth and pick accordingly.
I have a customer with two similar 12 TB systems in two different buildings, connected at 1Gb. One is the production system; it's backed up incrementally (with daily snapshots) to the other with the great rdiff-backup utility. rdiff-backup should be available in your standard distribution's repositories.
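The nightly job boils down to one command per filesystem, something like the sketch below (the hostnames and paths are examples); the nice part is that the mirror stays directly readable while older states can still be pulled out of the increments:

    # Pull the production data into a mirror plus reverse increments
    rdiff-backup prod-nas::/srv/data /backup/prod-nas/data

    # See what restore points exist, and restore a file as it was 7 days ago
    rdiff-backup --list-increments /backup/prod-nas/data
    rdiff-backup -r 7D /backup/prod-nas/data/somefile /tmp/somefile.7daysago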
Off-site, on-line backup (remote mirror)
use rsync over ssh (transferring only changes) - the first backup has to be done locally, but after that backups will be a breeze, depending on how much changes (see the sketch after this list)
if you need to keep versioned history of the changes - rdiff-backup
http://www.nongnu.org/rdiff-backup/
the btrfs file system in Linux sounds promising, but it is still under heavy development
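A minimal sketch of the rsync-over-ssh mirror mentioned above, with made-up hostnames and paths; the first run moves everything, every run after that only the deltas:

    # -a preserves permissions/ownership/times, -z compresses over the wire
    rsync -az --delete -e ssh /srv/data/ backup@offsite.example.com:/backup/data/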
Take a look at your actual "content" and how often it changes before you plan your strategy. Many times people just churn the same data to tape weekly over and over for no good reason.
Deduplication and snapshotting technologies from some vendors can save you from having to do individual file restores from backup, but you will always need an off-site copy for protection.