We run a small business network with a few Windows servers and Backup Exec + an LTO4 tape library to back them all up. We use a yearly, monthly, weekly rota, with tapes going off-site. I should also mention that we use LTO barcodes.
My question is really this: what paperwork/spreadsheets/databases/etc. do you use around a backup rota to achieve goals such as the following:
a) Ensure there is a written record of accountability showing that engineers have checked backup logs to ensure jobs are completing successfully, the tape is in good condition, etc (aside from anything else, this seems to be a good way to encourage the process to be followed if people must sign their name to say they've done it).
b) Ability to track where all tapes are currently stored (Backup Exec helps with this, but a separate record seems sensible). It would also be good if this record were somehow stored off-site so that it remains accessible in the event of a disaster such as an office fire.
c) In a disaster recovery situation, there aren't just tapes stored off-site, but also a written record explaining exactly which jobs the tapes correspond to, with a record showing the job completed successfully, and so on.
d) Anything else that's important
In short, an audit trail. An audit trail that is designed in such a way that it is resilient to disaster situations such as office fires.
Do people tend to roll their own solution, or are there off-the-shelf solutions? Do you tend to keep it all paper based, or do you have some electronic method? Do you keep any paperwork with the off-site tapes?
I should say that we already have a basic system in place, but I'm interested to see what makes up a good audit trail system, in the hope I can improve ours.
Many thanks!
Backup Exec has a feature called "vaulting" to keep track of tapes sent offsite.
a) seems more like an exercise in bureaucratic box-ticking.
b) you have two or three records of where tapes are: your offsite storage provider's reports; Backup Exec/library; and possibly your own list/spreadsheet/database.
One task after each tape rotation must be to reconcile these. This should be done by computer: enter all the records into files (in some common format) and have the computer compare them. Ticking off tape IDs on a piece of paper is too error-prone.
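For example, a minimal sketch in Python, assuming each source's list has been exported to a plain text file with one barcode per line (the file names here are placeholders, not anything Backup Exec produces itself):

    # Compare tape barcode lists exported from each record-keeping source and
    # flag any tape that does not appear in all three. One barcode per line.
    def load_ids(path):
        with open(path) as f:
            return {line.strip().upper() for line in f if line.strip()}

    sources = {
        "provider": load_ids("provider_report.txt"),        # offsite provider's report
        "backup-exec": load_ids("backup_exec_media.txt"),   # exported from Backup Exec/library
        "internal": load_ids("internal_tape_list.txt"),     # our own spreadsheet, saved as text
    }

    for tape in sorted(set().union(*sources.values())):
        listed_in = [name for name, ids in sources.items() if tape in ids]
        if len(listed_in) != len(sources):
            print(f"{tape}: only listed by {', '.join(listed_in)}")

A fuller version would also compare the recorded location of each tape, but even this much catches the tape that has quietly gone missing from one of the lists.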
c) seems pointless. In a DR situation, you need to be able to recreate your backup installation quickly, so you need detailed (and tested and rehearsed) instructions for that, and catalog backups (at least daily) on both tape and disk.
Make sure there is a proper (and accessible) record of who has the authority to recall tapes from offsite storage. What happens if they are all on holiday when needed?
(a) is important, but it shouldn't be left as a process issue for humans. Checking that all these things are happening, with appropriate periodicity, should be one of the functions of your monitoring system.
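For instance, here's a minimal sketch of the kind of check I mean, written as a Nagios-style plugin; the log directory and the success string are assumptions (a real check would parse Backup Exec's own job logs or alerting output):

    import re, sys, time
    from pathlib import Path

    # Alert if the newest backup log is too old or does not report success.
    # Exit 0 = OK, exit 2 = critical, which most monitoring systems understand.
    LOG_DIR = Path(r"C:\BackupLogs")     # placeholder path
    MAX_AGE_HOURS = 26                   # a daily job, plus some slack
    SUCCESS = re.compile(r"completed successfully", re.IGNORECASE)

    logs = sorted(LOG_DIR.glob("*.log"), key=lambda p: p.stat().st_mtime)
    if not logs:
        print("CRITICAL: no backup logs found")
        sys.exit(2)

    latest = logs[-1]
    age = (time.time() - latest.stat().st_mtime) / 3600
    if age > MAX_AGE_HOURS:
        print(f"CRITICAL: newest log {latest.name} is {age:.0f}h old")
        sys.exit(2)
    if not SUCCESS.search(latest.read_text(errors="ignore")):
        print(f"CRITICAL: {latest.name} does not report success")
        sys.exit(2)

    print(f"OK: {latest.name} reports success, {age:.0f}h old")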
(b) is the job of the backup software. Recall the principle "one datum, one location"; if your backup software says a tape's in one place, and your other internal process says it's in another, which will you believe? If your onsite/offsite requests are generated automatically (as they should be), it's helpful to keep (soft) copies of those; they can always be used as an emergency fallback check of the backup software's memory.
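As an illustration of that fallback check: a sketch that replays archived movement requests to rebuild where every tape should be, then compares that against what the software currently says. The request format (dated CSV files of barcode,action rows) is an assumption, not a Backup Exec export.

    import csv
    from pathlib import Path

    # Replay archived onsite/offsite requests (oldest first) to derive the
    # expected offsite set, then diff it against the backup software's view.
    expected_offsite = set()
    for request in sorted(Path("vault_requests").glob("*.csv")):   # dated file names
        with open(request, newline="") as f:
            for row in csv.reader(f):
                if len(row) < 2:
                    continue
                barcode, action = row[0].strip().upper(), row[1].strip().lower()
                if action == "send":
                    expected_offsite.add(barcode)
                else:                                              # "return"
                    expected_offsite.discard(barcode)

    with open("software_offsite_list.txt") as f:                   # exported from the software
        software_offsite = {line.strip().upper() for line in f if line.strip()}

    for tape in sorted(expected_offsite ^ software_offsite):
        side = "requests only" if tape in expected_offsite else "software only"
        print(f"{tape}: {side}")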
(c) is also the job of the backup software. Any good software package will have the concept of a "bare metal restore" built into it, and that should include the bare metal restore of the backup server itself. My preferred backup software, bacula, details this in its documentation, which assumes that everything has been lost except the stack of offsite backup tapes, and that you have acquired replacement hardware. It explains which tools you'd use to index the tapes, how to find the most recent catalogue backup, how to restore that into a fresh, empty bacula instance, and how you'd go about restoring the clients from there.
Make sure your own backup software's documentation covers this too. Test that the procedure works, and keep your notes from those tests.
As for (d), I think you've already covered most of the important points. The one I'd reiterate is that you should test your restores frequently: not just once every six months, but at least once a month. Pick a random employee, ask them which file they'd hate to lose; check this can be restored to their satisfaction. Ask a random IT person which server they'd most hate to lose; restore it to another box and have them check it over for functionality. Test your DR procedures every six to twelve months, in full. Yes, this all costs: lots of time as well as offsite callback charges. But untested backups and procedures may well be worthless, and certainly can't be relied on.
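One small thing that helps keep the monthly test honest is to take the choice of person out of the engineer's hands and leave a written record of each test; a rough sketch (staff.txt and the log file name are made up):

    import csv, random
    from datetime import date

    # Pick this month's restore-test subject at random and log the test,
    # so it actually happens and leaves a record behind.
    with open("staff.txt") as f:                  # one name per line
        staff = [line.strip() for line in f if line.strip()]

    subject = random.choice(staff)
    with open("restore_tests.csv", "a", newline="") as f:
        # date, person asked, file chosen, restore result, engineer sign-off
        csv.writer(f).writerow([date.today().isoformat(), subject, "", "", ""])

    print(f"Ask {subject} which file they'd hate to lose, then restore it for them.")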