How does a small organization with limited resources go about doing a restore test of its data backup system?
The constant cajoling to "Test your backups!" seems unrealistic when faced with the reality of what a full-scale restore test would involve without affecting the mainline systems.
Assume the organization doesn't have tens of thousands of dollars' worth of reserve server capacity just lying around to allocate for a temporary spin-up of a full test environment to verify the nightly backups are restorable.
Is there any way to justify purchasing all the mainline hardware a second time just to do annual restore testing, when it would otherwise sit in storage, powered down and unused?
It has been suggested in other Server Fault discussions on media restore testing to use a separate tape drive, to confirm that the media is usable in another device.
For a small site with only a few servers and a single production tape drive, it seems hard to justify spending thousands of dollars on an additional LTO-7 tape drive, plus the additional backup software licensing to go with it, only to use it for a once-per-year media restore / test environment verification process and then leave it on a shelf, unused, until next year's test.
You test your backups primarily to test your restore procedures, so that in a crisis you'll know exactly what to do. While everybody else is panicking, you'll be competent, confident and calm, and you'll know which steps to take and roughly how long the restore will take, because by then restoring backups is a routine event.
The second thing you probably want to do is test data integrity: once you've restored your critical data, can production be resumed? Is nothing corrupted or incomplete?
You can and probably should test both of those things one small piece at a time. Only once you have the basics down should you attempt restoring a whole datacenter.
If you make backups of file systems and network shares, for instance, a suitable test would be to restore a specific directory to an alternate location and compare file sizes, hashes and permissions with the original.
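As a minimal sketch of that comparison (the share and restore paths below are invented, and it assumes both trees are reachable from one verification host), something like this Python script will flag anything that differs:

```python
import hashlib
from pathlib import Path

def file_sha256(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def compare_trees(original: Path, restored: Path) -> list[str]:
    """Compare size, content hash and permission bits of every file
    under `original` against the same relative path under `restored`."""
    problems = []
    for src in original.rglob("*"):
        if not src.is_file():
            continue
        dst = restored / src.relative_to(original)
        if not dst.is_file():
            problems.append(f"missing: {dst}")
            continue
        if src.stat().st_size != dst.stat().st_size:
            problems.append(f"size differs: {dst}")
        elif file_sha256(src) != file_sha256(dst):
            problems.append(f"content differs: {dst}")
        if (src.stat().st_mode & 0o7777) != (dst.stat().st_mode & 0o7777):
            problems.append(f"permissions differ: {dst}")
    return problems

if __name__ == "__main__":
    # Hypothetical paths -- point these at the live directory and the restore target.
    issues = compare_trees(Path("/srv/shares/projects"), Path("/restore-test/projects"))
    print("\n".join(issues) or "restored copy matches the original")
```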
The next time you need to clone a database for testing, restore a production database from backup instead.
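If, for example, your databases happen to be PostgreSQL and your nightly backup is a pg_dump custom-format archive (an assumption; the dump path, database name and table name below are invented), the test can be as small as restoring into a scratch database and running a sanity query:

```python
import subprocess

DUMP = "/backups/nightly/erp.dump"      # hypothetical dump produced by `pg_dump -Fc`
SCRATCH_DB = "restore_test_erp"         # hypothetical scratch database, must not collide with production

# Create an empty database and restore last night's dump into it.
subprocess.run(["createdb", SCRATCH_DB], check=True)
subprocess.run(["pg_restore", "--no-owner", "-d", SCRATCH_DB, DUMP], check=True)

# A cheap sanity check: does a core table exist and does its row count look plausible?
out = subprocess.run(
    ["psql", "-d", SCRATCH_DB, "-tA", "-c", "SELECT count(*) FROM customers;"],
    check=True, capture_output=True, text=True,
)
rows = int(out.stdout.strip())
print(f"customers table restored with {rows} rows")
assert rows > 0, "restored database looks empty -- investigate the backup"

# Clean up the scratch database afterwards.
subprocess.run(["dropdb", SCRATCH_DB], check=True)
```

Anything beyond "the restore completed and the data looks plausible" (pointing a test instance of the application at the scratch database, for instance) only increases confidence.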
Do a "bare-metal" OS restore on a VM if need be.
But backups and restores are just one aspect of a larger disaster recovery strategy and business continuity plan.
What will your business do if your current location is lost to a natural disaster (fire, flooding, hurricane, etc.)? Can it continue to operate from other existing locations? Or, if yours is the only location, will the business simply go bankrupt, or will insurance money be used to rent emergency offices/containers?
That was the BCP strategy at one company a couple of years ago: a contract with HP (or maybe IBM at the time) to supply a datacenter in a container once a year for complete datacenter disaster recovery tests, and to have that on standby as well in case of an acute disaster.
That company had one office facility, with only tapes (or maybe a tape robot) off-site and everything else in-house. The idea was that renting temporary furnished office space, getting internet connectivity, rerouting telephone numbers, and sourcing desktops, printers and so on would be mostly commodity and easy to arrange; IT, slightly less so. The cost-benefit calculation for a twin datacenter was unfavourable.
So they did do a complete BCP test, initially every six months and later once a year, but on temporary rented hardware: deploying VMware, restoring the backup server, then restoring VMs with AD domain controllers, mail servers, database and application servers, and file shares.
A more contemporary BCP strategy could be cloud-based, with an off-premises backup copy kept online and the DR restore tested in the cloud as well. If you only need them for a couple of days, even a fairly large number of VMs won't break the bank.
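To put a rough number on that claim, here is a back-of-the-envelope sketch; every figure in it is an assumption, not any particular provider's pricing:

```python
# Back-of-the-envelope cost of a cloud DR test. All numbers are assumptions --
# substitute your own VM count and your provider's actual rates.
vm_count        = 20      # VMs needed to stand up the critical services
hourly_rate     = 0.10    # USD per VM-hour for a modest general-purpose instance
test_duration_h = 3 * 24  # a three-day restore exercise
storage_gb      = 2000    # restored disk capacity
storage_rate    = 0.10    # USD per GB-month, prorated to the test window

compute = vm_count * hourly_rate * test_duration_h
storage = storage_gb * storage_rate * (test_duration_h / (30 * 24))
print(f"compute ~ ${compute:.0f}, storage ~ ${storage:.0f}, total ~ ${compute + storage:.0f}")
# compute ~ $144, storage ~ $20, total ~ $164
```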
To paraphrase an old adage: nobody wants backups, everybody wants restores.
In short, backup and restore tests are absolute needs. To have a good backup and restore plan, I would like to stress the following points:
- Use tar (or, even better, rsync or a similar tool) to also keep a filesystem-level backup of your data. With such tools you can very easily inspect your backup and get an at-a-glance idea of whether all (or most) of it is present and accessible; see the spot-check sketch after this list.
- For fast, cost-effective restores it is critical to make ample use of temporary virtual machines, run on cheap hardware (read: retired servers or workstations). If disk space is a problem, make wide use of thin provisioning. If available RAM is the problem, restore only a small VM subset (even a single one) each time.
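As an illustration of the "inspect at a glance" point above, a few lines of Python can list a tar archive and spot-check a single file against its live counterpart. The archive name and member paths are invented, and a file that legitimately changed since the backup ran will of course show a mismatch:

```python
import hashlib
import tarfile
from pathlib import Path

# Assumptions: a filesystem-level backup written with tar, and the live file
# still available on the server for comparison; all paths are hypothetical.
ARCHIVE = "/backups/fileserver-nightly.tar.gz"
MEMBER  = "srv/shares/projects/budget.xlsx"
LIVE    = Path("/srv/shares/projects/budget.xlsx")

with tarfile.open(ARCHIVE, "r:*") as tar:
    print(f"{len(tar.getmembers())} entries in the archive")
    extracted = tar.extractfile(MEMBER)          # spot-check a single file
    backup_hash = hashlib.sha256(extracted.read()).hexdigest()

live_hash = hashlib.sha256(LIVE.read_bytes()).hexdigest()
print("spot-check OK" if backup_hash == live_hash else "MISMATCH -- check the backup job")
```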
For a small site with only a few servers and a single production tape drive, it seems hard to justify spending thousands of dollars on an additional LTO-7 tape drive, plus the additional backup software licensing to go with it, only to use it for a once-per-year media restore / test environment verification process and then leave it on a shelf, unused, until next year's test.
Most companies don't actually do that. They assume that in the unlikely event of a complete and catastrophic loss, they can purchase the required replacement backup hardware and have it within a matter of hours (for a price). So your plan doesn't necessarily need to include purchasing reserve backup hardware, software, licenses, etc. up front.