What is the best way to back up data on content servers? For example, I have 15 servers that hold only content, with no applications running on them. Each server has a 250 GB hard drive, so it's a pretty large amount of data. All of the data is accessible externally (via HTTP). So, the question is: what methodology is best in my case?
The most useful method I know is cross-backup: each server holds its own data plus a backup of one other server's data. But that halves the total usable capacity.
RAID?
RAID is not backup.
Now that that's out of the way, if you have 15 servers which only hold content, and each one is 250 GB, it's time to ask yourself some questions.
0) Should the data be centralized?
Unless you just happen to like managing the storage on 15 machines, you should probably shoot for pooled, managed storage. This does come with a cost, though: storage is cheap, but managed storage is expensive. If you don't want to (or can't) manage it centrally, then you need a tape solution. The cheapest option would be one server with a large number of disks (in a RAID configuration) attached to a fairly large tape changer (ideally an automated one, since I assume you don't want to change tapes manually every day). You could also get 15 tape drives and attach one to each server, but that's dumb.
1) What is your data retention policy?
In other words, are you going to keep the data forever, or only for a limited period of time?
2) What is your size delta?
How much does your data change per day? That needs to be factored into your future storage plans. Equipment purchases are not just IT-related; accounting needs to be factored in too. If you depreciate your purchases over 3 years, you need to purchase storage that will last you 3 years. Do the math or pay the price later.
3) Where are you going to put it?
15 × 250 GB is roughly 3.75 TB, as you mentioned. You've got to figure out where you're going to put it. If you want it to be "live", you've got to get a storage array of some sort. If you want to back it up to tape, you're going to need a tape changer attached to a server with some big storage.
4) How much of the data is a copy of the other servers?
If you centralize the storage, you have the opportunity to invest in a storage array that offers "data deduplication", which saves tons and tons (and tons) of space. Essentially, if a file over here has the same data as a file over there, the data is only stored once, and each place instead holds a small token pointing to it, which is much smaller than the original data. Solutions that provide this are expensive, though.
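The core idea of deduplication can be sketched with nothing more than a content hash and hard links: identical files collapse to one copy on disk, and each "file" is just a cheap pointer to it. This is only a toy illustration of the principle (paths under /tmp are made up), not how any commercial array implements it.

```shell
# Toy content-addressed dedup: store each unique blob once under pool/,
# named by its SHA-256, and hard-link files to it.
mkdir -p /tmp/dedup/pool /tmp/dedup/files

store() {  # store <name> <sourcefile>: dedups by content hash
    hash=$(sha256sum "$2" | cut -d' ' -f1)
    # Copy into the pool only if this content has never been seen.
    [ -f "/tmp/dedup/pool/$hash" ] || cp "$2" "/tmp/dedup/pool/$hash"
    ln -f "/tmp/dedup/pool/$hash" "/tmp/dedup/files/$1"
}

# Two servers holding an identical 1000-byte file:
head -c 1000 /dev/zero > /tmp/serverA_report
head -c 1000 /dev/zero > /tmp/serverB_report
store serverA_report /tmp/serverA_report
store serverB_report /tmp/serverB_report

ls /tmp/dedup/pool | wc -l   # one blob in the pool, not two
```

Real dedup appliances do this at the block level rather than whole files, so even partially overlapping files share storage.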
Please tell us more about the current network topology, data characteristics, server specifics, and whatever else you can.
RAID isn't a backup. Say it with me, and repeat it to yourself again and again. RAID protects you from equipment failure, but not disaster.
Whatever you do, keeping a backup offline is essential. If someone can maliciously or accidentally trash all your backups because they're all online and accessible via the network, your backups weren't really backups. (Read up on what happened to avsim.com when they got hacked if you want to see what I'm talking about.)
RAID will only protect you in the case of hardware failure. What you need is backup software that makes a duplicate copy of all content on another server, preferably in a different geographic location.
I'd buy a backup server with a few 1TB drives and backup everything to the backup server.
Took this answer from a previous question about backups, as I believe it still applies here (FYI, it was my answer, not someone else's):
Depending on how much you need to back up I would recommend the following:
1. JungleDisk / Amazon S3 - works VERY well.
2. rsync to a remote machine also works very well, driven by a cron job every XX hours.
We back up almost a TB of data to Amazon's S3 cloud and have a "warm standby" at our colo backing up from the master several times a day (via rsync). The cost for transfer/storage on Amazon S3 is extremely cheap (i.e. cheaper than burning to a DVD, but not cheaper than backing up to HDD). I know some folks who simply plug a 1 TB USB "My Book" or similar into the server and back it up weekly/monthly. Depending on your needs, one or two of those might be the cheapest solution for you.
Now that's just talking about DATA backups... not backing up the server itself...
Depending on your needs, Norton Ghost or even Acronis (http://www.acronis.com) might be of help to you. Tools like Norton Ghost tend to rely on being able to actually turn OFF the computer to make the backup. Some of us don't have that luxury, but if YOU do, then Norton Ghost is a VERY good product.
RAID should not be used as a backup solution. I'd get external drives, or set up a backup server with something like BackupPC, and then rotate the disks and store at least one copy off-site.
What kind of data? Database? Regular files? Do you need it to be a live sync?
Some backup solutions will allow restores to any point in the case of a database.
We're also getting into the triangle of cost, quality, speed. Sacrifice one to get the other two.
Cost in this case is money. Quality is the detail of the backup (more restore points, off-site copies). Speed is the performance you gain or lose with different solutions.
Figuring out what is more important can help you decide on a solution.
If you're willing to part with hard cash, we use R1Soft CDP across our platform. It's pretty good.
If you are serious about backing up nearly 4 TB of data, which is what you're talking about with 15 servers of 250 GB each, you have a bunch of questions to answer.
1. How much of the data is already duplicated intentionally or not across your environment?
If you have a ton of duplicated data you can greatly reduce your consumed space, and the quantity of data you have to back up.
2. Can you centralize the data to a smaller number of servers?
Patching, licensing, and maintaining 15 servers is a time-consuming process when they could be consolidated to one NAS or SAN. Combining them wouldn't pose any "security risk" if permissions were managed correctly. (This was the biggest complaint from my users when we consolidated storage: they felt that if they didn't have their OWN servers, other people could see their data. Education resolved it.) If they can't all be condensed for reasons of geography, that's understandable. That will also change your backup strategy, as nobody wants to drag tons of data across a WAN for backups.
3. Why are you backing up your data? Disaster recovery? Protection from accidental deletion? Potential hardware failure? All of the above? These answers drive your retention window and your methodology. As others have said, RAID is only good against hardware failure; if you delete a file on a RAID set, it is as good as gone. If you need to get back things users have deleted, then you have to know how often the data is used. A month of backups on a file that is only used quarterly means that you won't have the file when they notice it is gone. I'm not advocating keeping 3 months of incremental data here, but retaining month-end backups, kept for a year, might be a good idea. If disaster recovery is a consideration, then you need to think about getting your data off-site as well as off the servers. Also, knowing why you're backing up will tell you how often you should back up. Weekly full backups with nightly incremental or differential backups is a traditional method and a good sort of default, but if your data changes very fast or very slowly, this could be nowhere near often enough or way too often.
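The weekly-full-plus-nightly-incremental rotation mentioned above can be sketched with GNU tar's snapshot mechanism: a full backup records file state in a snapshot file, and later runs against the same snapshot file capture only what changed. Paths under /tmp are placeholders standing in for the content directory and the backup target.

```shell
# Weekly full + nightly incremental using GNU tar --listed-incremental.
mkdir -p /tmp/demo /tmp/dumps
echo "original content" > /tmp/demo/a.txt

# Sunday: full backup. Removing the snapshot file starts a new cycle.
rm -f /tmp/dumps/state.snar
tar --listed-incremental=/tmp/dumps/state.snar \
    -czf /tmp/dumps/full.tar.gz -C /tmp demo

# Monday: a new file appears; the incremental run, reusing state.snar,
# captures only what changed since the full backup.
echo "added monday" > /tmp/demo/b.txt
tar --listed-incremental=/tmp/dumps/state.snar \
    -czf /tmp/dumps/incr1.tar.gz -C /tmp demo
```

Restoring means extracting the full archive first, then each incremental in order; that's the trade-off against differentials, which need only the full plus the latest one.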
4. How much budget do you have for backups? This will be a big determining factor in what you end up choosing. For 4 TB of data all in one location I'd go for a small tape changer of some sort and backup software to automate the backups. Or possibly for a disk based backup unit with deduplication. Cross backup is sort of cheap at the outset, but doesn't provide any disaster recovery value, and gets more costly as your data set grows larger. There are also services out there that can back up your data across the Internet even at this scale, in an automated form with encryption and deduplication, which might work better if your data is on many sites.
Well, the architecture is:
15 servers each running an HTTP server; all files are regular files (no databases, no applications) and available for download (it's a file-sharing project). They're running under MogileFS.
There are also a couple of application servers, which I don't count here since they live their own life. The point of the backups is: if something happens, I want to roll out the data from backup as fast as possible.
So, I mentioned RAID as an option; of course it's not a backup solution, but it will help reduce total failures.
As a real option I see Amazon S3 with its simple API; I already have an account there for daily database backups.
And my interest is simple, I just want to know how people deal with such tasks.
Something like MogileFS would be able to help in this instance. It's a large-scale storage solution with no single point of failure; rather than backing the system up as a whole, it keeps multiple copies of the data scattered around the cluster. Individual drives (or spindles) can fail, but the more important a file is, the more copies of it exist around the cluster. Thumbnails that could be easily recreated may only have 1 or 2 copies, but the original pictures might have more, according to the class of data that file belongs to.
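Those per-class replica counts are configured through MogileFS's `mogadm` tool. A sketch of the idea, against a running tracker (the domain and class names here are made up for illustration, and flags may vary between MogileFS versions):

```shell
# Configuration sketch: one domain, two classes with different
# replication levels. Requires a running MogileFS tracker; "photos",
# "thumbnail", and "original" are illustrative names.
mogadm domain add photos

# Thumbnails are cheap to regenerate, so keep fewer copies.
mogadm class add photos thumbnail --mindevcount=2

# Originals are irreplaceable, so spread more copies across devices.
mogadm class add photos original --mindevcount=4
```

Note that replication protects against drive and host failure within the cluster; it does not protect against an accidental delete being faithfully replicated everywhere, which is why an off-cluster backup still matters.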
Similar techniques are used by Google and Facebook to store their own files.