Earlier this week I had a 'perfect storm' moment on my servers: two backup jobs (one for each RAID10 array on the system) had been humming along for 18 hours when we had a sustained spike in traffic on my I/O-intensive application. The result was unacceptably slow performance, and I had to force our administrator to cancel the backup. (He was not happy about this...not at all. "I'm not responsible if...")
The end result was lots of stress, unhappy customers, and a very grouchy Stu.
The bottleneck was disk utilization. Once the jobs were canceled, everything was working just fine. What can I suggest to my administrators to lessen the impact on my servers?
Here are some of the gory details:
The backup command itself (I pulled this out of ps, but I don't really know what it means):
bpbkar -r 1209600 -ru root -dt 0 -to 0 -clnt xtx-le00 -class F_Full_on_Thursday
-sched Incr_Fri_to_Wed -st INCR -bpstart_to 300 -bpend_to 300 -read_to 300
-blks_per_buffer 127 -stream_count 8 -stream_number 8 -jobgrpid 223932 -tir -tir_plus
-use_otm -use_ofb -b svr_1259183136 -kl 28 -fso
The system
- RHEL4 64-bit
- 4GB RAM (~half used by applications)
- DL380G5 with two attached SAS RAID10 partitions, ~550GB and ~825GB
The data
- 1TB
- ~10 million files
The application
- busy from 0900 to 2300 on weekdays
- I/O intensive (99% read) mostly focused on a few hundred MB of files
We have a system where we rsync live servers to backup servers (which are built out of cheap 1TB SATA disks), then take full tape backups of the backup servers. It's excellent.
I'm not really sure how bpbkar works, but I would use rsync to back up all the files offsite and then keep them in sync, which consumes very few resources, as only changed files are transferred. Naturally, the initial backup would take quite some time, but you already say you've been 'humming for 18 hours'.
You would then simply manage the backed-up data from the other machine however you wanted to.
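For illustration, a minimal sketch of that approach - the hostname, paths, and bandwidth cap below are made up, and the cap is optional:

# first run copies everything (slow); later runs only transfer changed files
rsync -a --delete --bwlimit=10000 /data/ backuphost:/backups/appdata/

Note that with --delete the remote copy is a mirror rather than an archive, so you'd still want some form of rotation or snapshots on the backup host.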
Small edit: if you choose to step away from tape backups to disk backups, you may want to use RAID6, which offers dual parity.
If your backups take 18 hours to run normally, deprioritising them probably isn't going to solve the problem (unless you want to run your backups for a couple of days at a time). I'd be inclined to set up a disk replication mechanism to another machine (I like DRBD, myself), then use LVM on that machine to take a point-in-time snapshot, back that up, and move on. Because it's running on a separate machine, (a) the backup can hammer as hard as it likes without affecting the live app, and (b) it won't be contending with the live app for disk I/O, meaning it'll probably run a whole lot faster as well.
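A rough sketch of the snapshot-and-backup step on the replica, assuming DRBD sits on top of an LVM logical volume there - the volume group, LV, and mount point names are hypothetical:

cat /proc/drbd                                 # confirm the replica is Connected and UpToDate
lvcreate -L 10G -s -n data_snap /dev/vg0/data  # snapshot needs room for changes made during the backup
mount -o ro /dev/vg0/data_snap /mnt/snap
tar -czf /backups/data-snapshot.tar.gz -C /mnt/snap .   # or point your backup client at /mnt/snap
umount /mnt/snap
lvremove -f /dev/vg0/data_snap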
One thing I can say for sure: anything you do on the same machine is going to completely bone your disk cache -- as the backup process reads everything that needs backing up off the disk (even if it just checks mtimes rather than reading and checksumming every file), that's still a lot of metadata blocks running through your cache, and they'll kick useful data out and cause more disk I/O than is otherwise warranted.
bpbkar is the Veritas NetBackup backup client. It supports throttling, so that the combination of normal I/O and backup I/O doesn't saturate your disks. Have a look here:
http://seer.entsupport.symantec.com/docs/265707.htm
Is there anything stopping you doing full backups at the weekend (you say the system is mostly busy on weekdays) and incremental backups during the week? That would help you get the backups done during the quiet slot between 2300 and 0900.
Another vote for rsync. I use it to back up 9TB daily from a very heavily used fileserver and have never had an issue. If you're concerned about 'point in time' consistency, create an LVM snapshot, mount it, rsync from it, umount it, and destroy the snapshot (see the sketch below). That puts somewhat higher load on the server, but still takes far (far!) less time than a full copy.
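Something like this, following the steps above (all names invented; size the snapshot so it can absorb whatever gets written during the copy):

lvcreate -L 20G -s -n data_snap /dev/vg0/data
mount -o ro /dev/vg0/data_snap /mnt/data_snap
rsync -a /mnt/data_snap/ backuphost:/backups/appdata/
umount /mnt/data_snap
lvremove -f /dev/vg0/data_snap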
If the administrator says that it positively, absolutely must be bpbkar, first do an rsync to a less-used system, and then run bpbkar from there. There's no need to hog your production system.

An anecdote from testing: when we approached the 8TB limit of ext3, we ran some 'pull the plug' tests to see how likely it is for a file to be corrupted by a hardware failure mid-copy. We pulled the plug on the server, on the storage boxes, and on the SAN wiring, while copying tens of millions of files.
Conclusions:
In short, rsync works really, really well. Any errors are better attributed to your hardware and/or filesystem; bpbkar wouldn't perform any better facing the same failures.

Judging by the command you posted, and looking at the -class and -sched options, it looks like you're running a full backup on Thursday - probably not the best plan considering your usage schedule (0900-2300 on weekdays).
With huge datasets like that, you should look at the timing of your full backup, plus the type of incremental backup you take during the week. There are two types of incremental backup in NetBackup: Differential Incrementals, which capture everything changed since the last backup of any kind, and Cumulative Incrementals, which capture everything changed since the last Full.
I would consider shifting your backup strategy for that system to a Full backup on Saturday or Sunday, and Differential Incremental backups for the rest of the week. That runs the full backup when there's plenty of time to do so (no or few users) and short incrementals in the few hours of low usage that you have. The issue with this method is that restores can be a bit more convoluted: you would need more tapes - the tape for the full, plus all the incrementals from that full up to the point you need the data restored to.
From your question, it sounds like you aren't terribly familiar with the backup system. I understand separating the sysadmins from the backup operators, but some discussion needs to happen between them. If the backup operators have no idea how the system is being used, they can't form a proper policy and schedule for the system.
Get your NetBackup admins to schedule the backups better - do full backups on alternating weeks for each RAID array.
You might also want to look into synthetic full backups so you don't need to do as many full backups.
A couple of suggestions:
The other rsync suggestions are also good -- there is no reason why the rsynced copy of the data wouldn't be as good as the image on the primary server unless this is a database application. If it is a database sort of application, you should be copying the transaction logs and backup images to another system as they're created, and backing those up.
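As a rough illustration of the transaction-log idea - the directory, host, and schedule are hypothetical and depend entirely on which database you're running:

# ship archived transaction/redo logs to another box as they are written
# (run from cron every few minutes), then back that directory up instead
rsync -a /var/lib/db/archived_logs/ backuphost:/backups/db_logs/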
I would back up the data on the rsync target with NetBackup, but I'd also back up the OS and everything except the program data (the stuff that's taking up the space) on both the primary and the rsync targets. Backing up the OS and everything other than the program data should be easy and fast, and it should probably be in a different backup policy anyhow.
There are two issues at play -- one is of your architecture and the other is of your implementation.
You can easily optimize your implementation by doing things like changing backup windows or doing backups less often or buying faster disks or networks or tape drives or by duplicating the data to another system. These changes are valid, appropriate, and with Moore's law on your side, they may keep your service running properly forever.
You may also be getting into a situation where you're going to run into scaling issues more and more often. If you're even a little worried that that's the case, you need to think about how to redesign your system so it scales better. Such changes aren't easy, and precisely because they're not easy, you need to plan for them well before you've got a gun to your head.
An example of adjusting your architecture might be moving all your data to a NAS-type system such as a NetApp filer or a box running Solaris and ZFS. With a setup like this, you back up the server itself, which is mostly your program and configuration, and you use the data management features of the filer to back up the data it holds - things like snapshots and transaction logs taken against those snapshots.
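For instance, on a ZFS box the point-in-time copy and its transfer might look something like this (pool and dataset names are made up):

zfs snapshot tank/appdata@nightly
zfs send tank/appdata@nightly | ssh backuphost zfs receive backup/appdata
# later runs can send just the delta between two snapshots with zfs send -i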
You may also do something similar to what archive.org does, where you store the data on lots of different systems - usually any given piece of data exists on several of them - and a farm of front-end systems routes each request to whichever system actually hosts the data.
Lastly -- are you sure your backups even work? Running a backup for 18 hours on a live system results in a backup that reflects that system over the whole 18 hours. Ideally a backup reflects the system at a single atomic point in time, not some crazy rolling image where some data is from one point in time and some from almost a whole day later. If any of your data depends on or points to other data elsewhere, those dependencies will get royally messed up if the backup catches things mid-change, and with a dataset this large, if that can happen at all, it will happen on every backup you take.