I'm starting to see clients with hundreds of terabytes of data (in SQL Server installations). As the total volume of data in some enterprises approaches meaningful fractions of a petabyte, I'd like to canvass the collective knowledge base out there to see what people dealing with that magnitude of data are doing to safeguard it.
The obvious issue is that storing multiple backups of that much data is prohibitively expensive on enterprise-class storage - heck, even on plain RAID-5.
Options I see are as follows:
1. Create a mirror copy of the data in another data center, and continually ship differences to it, using whatever mechanism is available for your data source - e.g. log shipping or database mirroring with SQL Server (a minimal log-shipping sketch follows this list).
2. Take regular backups using a hefty compression algorithm (probably only suitable if the data lends itself well to being heavily compressed).
3. Take piecemeal backups of the critical/changing parts of the data.
4. Don't back up the data and trust the corruption gods.
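For concreteness, here's a minimal sketch of the plumbing behind #1 when it's implemented with log shipping rather than database mirroring. The database name, file names, and backup share are placeholders, and a real setup would be driven by scheduled jobs rather than ad-hoc statements:

```sql
-- On the primary: take a log backup (this would run on a schedule, every few minutes).
BACKUP LOG BigDB
TO DISK = N'\\backupshare\BigDB\BigDB_20090101_1200.trn'
WITH INIT, CHECKSUM;

-- Copy the file to the other data center (robocopy, SAN replication, etc.), then
-- on the secondary: apply it, leaving the database restoring so that further
-- log backups can still be applied.
RESTORE LOG BigDB
FROM DISK = N'\\backupshare\BigDB\BigDB_20090101_1200.trn'
WITH NORECOVERY;

-- To fail over, recover the secondary (after which no more log can be applied):
-- RESTORE DATABASE BigDB WITH RECOVERY;
```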
I'm seeing option #4 being adopted as the default, and as an HA/DR expert I find that really scary - but what do I advise as an alternative? I think #1 is the best approach, but "I don't think so" is the usual answer when any alternative apart from #4 (and possibly #3) is suggested.
Now, of course it depends on the change-rate and criticality of the data. No need to answer with that as I used to be responsible for all the HA features of SQL Server while I worked at Microsoft so I'm well-versed in the 'it depends' arguments - that's my catch-phrase :-)
I'd be very interested to hear of any alternatives I've missed, or to hear that everyone else is in the same boat and there's no realistic alternative to spending lots of money on more storage.
Thanks in advance - due credit will be given to all well thought-out and expressed answers.
Off-the-wall idea - is all of the stored information needed, or even useful?
How much is the information actually worth? It seems obviously ridiculous to spend more in upkeep and management than the data is worth.
Is the data in the database appropriate for storage in a database? For example, does keeping compressed multi-gigabyte core files in the support organization's database really provide any actual benefit?
Is there a lot of duplicated data in the database? For example, are a thousand people keeping ten copies each of a weekly 10MB newsletter?
Does some of the data have an "expiration date" after which it does not provide any value? Returning to the support organization example, for various reasons there is virtually no benefit in keeping around customer core files more than a few months after a fix has been delivered.
Another thought - is keeping that much data opening the company up to liabilities? Some data one must, by law, keep. Some data, however, should be "shredded" because of the risks posed if it is accidentally, or maliciously, released to inappropriate parties.
Yeah, another option is storage virtualization: a device that sits between your servers and the SAN, like IBM SVC. SVC manages SAN-to-SAN copies, and can do remote replication (although that's obviously pretty painful at the petabyte level unless you have really low data change rates and really high bandwidth.)
The slick part is that the whole process is invisible to the servers involved. If you're using SQL Server, you design your filegroups to keep things with a low change rate together (like sales archives from >3 years ago), and things with a high change rate (like current sales) on a separate filegroup. They don't even have to be completely read-only - you just want to design it so that you can use different replication methods for each filegroup. The SAN gear can sync LUNs over the network, via tape, or by physically shipping parts of the SAN back and forth. This is more effective with gear like LeftHand's, where the SAN is made up of a pool of participating units.
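To make that concrete, here's a hedged sketch of that kind of filegroup layout; the database name, filegroup names, drive letters, and table definitions are all made up:

```sql
-- Hypothetical layout: volatile data on one filegroup, cold archive data on
-- another, so each can sit on its own LUN and be replicated (or shipped) by a
-- different mechanism.
ALTER DATABASE Sales ADD FILEGROUP CurrentData;
ALTER DATABASE Sales ADD FILEGROUP ArchiveData;

ALTER DATABASE Sales
ADD FILE (NAME = N'Sales_Current',
          FILENAME = N'H:\HighChurnLUN\Sales_Current.ndf',
          SIZE = 100GB)
TO FILEGROUP CurrentData;

ALTER DATABASE Sales
ADD FILE (NAME = N'Sales_Archive',
          FILENAME = N'S:\ColdLUN\Sales_Archive.ndf',
          SIZE = 500GB)
TO FILEGROUP ArchiveData;

-- Current sales land on the high-change filegroup...
CREATE TABLE dbo.Orders
(
    OrderID    bigint        NOT NULL PRIMARY KEY,
    OrderDate  datetime      NOT NULL,
    Amount     decimal(18,2) NOT NULL
) ON CurrentData;

-- ...while sales from >3 years ago sit on the cold filegroup that the SAN
-- replicates on a much lazier schedule (or ships physically).
CREATE TABLE dbo.OrdersArchive
(
    OrderID    bigint        NOT NULL PRIMARY KEY,
    OrderDate  datetime      NOT NULL,
    Amount     decimal(18,2) NOT NULL
) ON ArchiveData;
```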
Then you can sync the low change rate stuff over the wire automatically, and sync the high change rate with sneakernet. (Sounds like I've got that backwards, but it's true - you can't sync the high change rate stuff over the wire due to volume.) Even some of the low-end gear accommodates this now: LeftHand lets you replicate to other LeftHand units in your datacenter, and then ship them to your offsite datacenter. Plug 'em in, join them to the remote side by changing IPs and groups, and now they're part of your remote backup SAN. The LeftHand sales pitch on this is just brilliant: set up your two SANs side-by-side in your primary datacenter, get them in sync, then you can ship parts of them over to the remote datacenter while some of them stay in your current datacenter to keep in sync. Gradually move 'em over without getting way out of sync.
I haven't done this at the petabyte level, though. You know what they say - in theory, in theory and in practice are the same. In practice...
Option 1 is mirroring, which is almost as bad as #4: any bug that corrupts data, and isn't discovered immediately, will corrupt both copies.
If the data is critical, consider dedicated solutions; read about IBM's Shark products, for example, or competing products from EMC, etc. They have features like FlashCopy, which lets you instantly create a logical copy of a volume without doubling disk requirements; you can then back that copy up to (e.g.) tape. Look into robotic tape backup as well.
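If the array-level features aren't in budget, SQL Server's own database snapshots are a rough software analogue: a sparse, copy-on-write logical copy created near-instantly. (It lives on the same storage as the source, so it's a reporting/recovery convenience rather than a true second copy.) A sketch with made-up names, assuming BigDB_Data is the source database's logical data file name:

```sql
-- Hypothetical sparse, copy-on-write snapshot of BigDB. Only pages that change
-- after this point consume space in the snapshot file.
CREATE DATABASE BigDB_Snap_20090101 ON
    (NAME = N'BigDB_Data', FILENAME = N'E:\Snapshots\BigDB_Data_20090101.ss')
AS SNAPSHOT OF BigDB;
```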
Point out to those that want to store a Petabyte of data that storage ain't cheap.
I get so fed up with people moaning about not having an extra Terabyte of online storage because disc is cheap - disc may be, but managed storage sure as hell isn't.
If it's prohibitively expensive to store the backups then it's prohibitively expensive to store the data in a safe manner, so the proposed solution isn't viable.
One of the most important reasons for having backups is protection from user error (most hardware failure problems can be dealt with by hardware solutions) but even database mirroring is no protection against a dropped table (OK, you can protect against that, but it's still possible to get unremovable guff into your DB - unless the reason the DB is so big is that it only ever issues inserts).
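This is where log backups earn their keep: with a full backup and an unbroken log chain you can restore a copy of the database to just before the mistake and pull the dropped table back out. A hedged sketch, with placeholder names, paths, and times:

```sql
-- Hypothetical point-in-time recovery of a dropped table: restore a copy of the
-- database to just before the DROP, then copy the table back into production.
RESTORE DATABASE BigDB_Recover
FROM DISK = N'\\backupshare\BigDB\BigDB_Full.bak'
WITH MOVE N'BigDB_Data' TO N'R:\Recover\BigDB_Recover.mdf',
     MOVE N'BigDB_Log'  TO N'R:\Recover\BigDB_Recover.ldf',
     NORECOVERY;

RESTORE LOG BigDB_Recover
FROM DISK = N'\\backupshare\BigDB\BigDB_20090101_1200.trn'
WITH STOPAT = '2009-01-01 11:58:00',   -- just before the table was dropped
     RECOVERY;
```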
As I see it, tape is no longer a viable solution - it is now cheaper to just work with disc arrays (though physical storage can be awkward). So I think your only option is some method of splitting the data into chunks small enough to be restored in a sensible timeframe, and then getting them onto disc storage on a regular basis (and here EMC-type solutions can help, if you've got the cash).
Here's an interesting video detailing myspace.com's architecture (SQL Server 2005 backend). I'm not sure whether they have individual petabyte DBs, as they scale out across multiple databases. They use SAN snapshot backups.
http://wtv.watchtechvideos.com/topic70.html
ZFS. Sure, it's still just getting started, but there are a number of areas where ZFS is designed to handle just this sort of thing. First off, there's its ability to handle a large amount of data, as well as a multitude of different storage devices (local, SAN, fiber, etc.), all while keeping data safe with checksums and "layer violating" awareness of device health and failures. But how does this help with backing up that much data?
One method is to use snapshots. Take a snapshot and send it to tape/disk/net for transfer to the remote site. Subsequent snapshots only send the data that has changed since the previous one, and you can keep live data on both ends if need be.
The other is to use the Solaris Cluster software, where (so long as you have sufficient network bandwidth) you can have live mirroring between two servers, and if one goes down, the second can take over. It's more for use where high availability (HA) is important, but I would guess that most places with that much data want HA.
And if you say that ZFS isn't supported on Windows, the usual place you'll find SQL Server - well, maybe you run Sun/ZFS on the storage backend and connect via iSCSI. Maybe that's a horrid idea too, but it's at least worth giving some thought so you know what not to do.
Have you looked into Amazon Glacier as an option?
IMO, unless you have some kind of godzilla-level hardware, if you have that much data you should be using a backup compression technology. I'm most familiar with LiteSpeed, but there are similar products from other vendors and (of course) a similar feature is built into SQL Server 2008. You might not get 10-to-1 compression, but it cuts the storage requirements for the backup, and it can also shrink your backup window. If your goal is to keep multiple backup sets (yesterday plus the day before that, plus one from last week and one from last month, or a series of differentials plus fulls, which can get plenty big if you change a lot of the data in the database), then it simply comes down to storage space.
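For reference, the native SQL Server 2008 flavour looks roughly like this (LiteSpeed and the other third-party tools expose the same idea through their own syntax); the database name and paths are placeholders:

```sql
-- Compressed full backup (native to SQL Server 2008).
BACKUP DATABASE BigDB
TO DISK = N'\\backupshare\BigDB\BigDB_Full_Compressed.bak'
WITH COMPRESSION, CHECKSUM, STATS = 5;

-- A compressed differential, for the 'fulls plus differentials' scheme above.
BACKUP DATABASE BigDB
TO DISK = N'\\backupshare\BigDB\BigDB_Diff_Compressed.bak'
WITH DIFFERENTIAL, COMPRESSION, CHECKSUM;
```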
Filegroup-based backup (IOW, put non-volatile data onto certain FGs and back them up infrequently) never seems to fly, because devs or users won't or can't decide what data is volatile and what isn't, and in brownfield scenarios you often can't take the risk.
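When it does fly, the mechanics themselves are simple; a hedged sketch, assuming the archive data sits on its own (ideally read-only) filegroup and all names are made up:

```sql
-- Back up the read-only archive filegroup once (or rarely)...
BACKUP DATABASE Sales
FILEGROUP = N'ArchiveData'
TO DISK = N'\\backupshare\Sales\Sales_ArchiveFG.bak'
WITH CHECKSUM;

-- ...and take a partial backup of just the read-write filegroups (plus the log)
-- on the normal schedule.
BACKUP DATABASE Sales
READ_WRITE_FILEGROUPS
TO DISK = N'\\backupshare\Sales\Sales_ReadWriteFGs.bak'
WITH CHECKSUM;
```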
If a failover site is a requirement, then in addition to thinking about database mirroring you might want to talk to your clients' storage vendor to see if they offer something like SRDF, which is a hardware-based data replication technology. Naturally, replication (of any sort, but particularly realtime or near-realtime replication) is not a substitute for backups.
I don't think you have much of a choice here on tape v. disk. Tape won't likely cut it in a regular backup window unless you stripe it, and I'm not sure the reliability is there.
So you are down to disk backups. Are you versioning? Meaning do you worry about going back to backup 2 (current db minus 2 backups)? Or backup 3? In that case, you might have issues, but likely what you have to handle is log backups, not so much data backups.
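As a quick sanity check on what "versions" you actually have to fall back on, the msdb backup history shows which fulls, differentials, and log backups exist for a database; a sketch with a placeholder database name:

```sql
-- What backups do we actually have for BigDB, newest first?
SELECT  bs.type,                              -- D = full, I = differential, L = log
        bs.backup_start_date,
        bs.backup_finish_date,
        bs.backup_size / 1024 / 1024 AS size_mb,
        bmf.physical_device_name
FROM    msdb.dbo.backupset AS bs
JOIN    msdb.dbo.backupmediafamily AS bmf
        ON bmf.media_set_id = bs.media_set_id
WHERE   bs.database_name = N'BigDB'
ORDER BY bs.backup_finish_date DESC;
```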
If you can split off some of the data as read-only/non changing, then perhaps you have manageable backup sizes/windows. Or at least you are hoping that backup technology and bandwidth is catching up with data growth.
I don't think you're backing up so much as keeping a second copy around in order to recover from issues with your primary - hardware failure, corruption, etc. - and you are praying daily that errors aren't being shipped to that second copy. The copies are most likely being made SAN-to-SAN with some snapshotting technology, although the original copy might travel via FedEx rather than across the wire. Bandwidth to move 100TB is not easy to come by for anyone.
I think you need a combination of 1, 2, and 3 (not 4), with excellent log backup management.
Actually, I think that at any point in time you are really looking at three copies of your data: you run CHECKDB on one of the copies while the second copy is actually receiving changes, then you snapshot that second copy back to the first and continue. With this much data, I'd imagine you would need some diligence here. Paul, how does CHECKDB work on a multi-user, 100TB DB that is online?
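(For what it's worth, CHECKDB runs against an internal database snapshot so it can stay online, but at 100TB the I/O and runtime are still painful. A hedged sketch of one common compromise, with a hypothetical database name: cheap physical checks on production, full logical checks on the copy that has the headroom.)

```sql
-- On the busy production copy: physical-only checks are much cheaper at this scale.
DBCC CHECKDB (N'BigDB') WITH PHYSICAL_ONLY, NO_INFOMSGS;

-- On the restored/second copy, where the I/O headroom lives: the full check.
DBCC CHECKDB (N'BigDB') WITH NO_INFOMSGS, ALL_ERRORMSGS;
```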
As mentioned, aren't log backups, and probably a log reader, critical? Don't you need to recover drop tables/user error from the logs rather than a backup? You can potentially shortcut this by sending SAN copies through some delay, but I haven't seen that technology. A Log Shipping SAN that can delay changes by 4 hours (or some interval) to allow you to recover from issues before overwriting data. Or some log-reader-of-SAN-block-changes tool? Without that, you need to manage those transaction logs, which might be a whole other level of tracking those backups on various file systems for some xxx hours to allow you to potentially recover from non-fatal errors.
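For what it's worth, you can get the delay without any SAN magic: SQL Server log shipping has a built-in option to delay restores on the secondary, or you can roll your own by scheduling the restore job a few hours behind the copy job. A hedged sketch of the do-it-yourself version, with placeholder names and paths, using STANDBY so the secondary stays readable between restores:

```sql
-- The copy job pulls log backups across immediately, but the restore job only
-- applies files older than, say, 4 hours - leaving a window to catch a dropped
-- table before it reaches the second copy. STANDBY keeps the database readable
-- between restores so you can inspect the data first.
RESTORE LOG BigDB
FROM DISK = N'\\backupshare\BigDB\BigDB_20090101_0800.trn'   -- 4+ hours old
WITH STANDBY = N'R:\Standby\BigDB_undo.dat';
```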
Technically, storage is cheap, but at the petabyte level, not so much. It really depends on the application, but I'd say some combination of strategy #2 and #3 is going to be the answer, with #2 a given and #3 depending on how much investment you can make in storage, and on the kind of storage and IO/computational power that will let you get away with as little incrementalism and as many discrete, full backups as possible.
Alternatively, something like Amazon S3 may also come into play depending on your bandwidth and how much change there is in the data -- at this volume, putting at least some of it on someone else's servers and letting them worry about redundancy gets more and more cost effective.