It is a common situation: an administrator sets up automated backups and then forgets about them. Only after a system fails does he notice that the backup system broke some time before, or that the backups cannot be restored because of some fault, and that there is no current backup to restore from... So what are the best practices for avoiding such situations?
Run fire drills. Every couple of months it is a good idea to say "the XYZ system is down", then actually go through the motions of bringing it back online on a new VM and so on. It keeps things honest and helps you catch mistakes.
soapbox mode: ON
I would say it's as simple as this: backups that aren't tested regularly are worthless.
At my previous job we had a policy that every system (production, test, development, monitoring, etc.) should be test-restored every 6 months.
This was also the job of the most junior admin, which kept the documentation up to date. "Junior" was defined by how much work he or she had done on the specific system, so sometimes (quite often, actually) it was the "group manager" who did it.
We had special hardware dedicated to this (one Intel and one IBM/AIX box) that was low-spec for everything but disk space, since we did not need to run anything real on the restored host.
It was quite a lot of work for the first couple of rounds, but it led us to streamline the restore process, which is the part of backup that really matters.
Since you seem to be referring to the administrator not noticing that the backup job "breaks", rather than a seemingly good backup failing to restore correctly, I would suggest building some sort of monitoring scripts around the backups.
When building a home-grown backup solution, I would do something like this:
Once all of that is done, you should be fine. One extra thing to do would be to perform regular test restores, if you have extra hardware to donate to the cause.
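For a home-grown setup, such a monitoring check can be very small. Here is a minimal sketch in Python; the archive location, the age and size thresholds, and the alert address are made-up assumptions, not anything specific to your environment. It simply confirms that the newest backup archive exists, is recent, and isn't suspiciously small, and emails an alert otherwise:

```python
#!/usr/bin/env python3
"""Minimal backup sanity check, meant to run from cron after the backup job.
The paths, thresholds and addresses below are illustrative assumptions, not
part of any particular backup product."""

import glob
import os
import smtplib
import time
from email.message import EmailMessage

BACKUP_GLOB = "/backups/daily/*.tar.gz"   # assumed location of backup archives
MAX_AGE_HOURS = 26                        # a daily backup should be well under 26 h old
MIN_SIZE_BYTES = 50 * 1024 * 1024         # anything smaller than 50 MB is suspicious here
ALERT_TO = "sysadmin@example.com"         # hypothetical alert address
SMTP_HOST = "localhost"

def check_latest_backup():
    """Return a list of problems found with the newest backup archive."""
    problems = []
    candidates = glob.glob(BACKUP_GLOB)
    if not candidates:
        return ["no backup files match %s" % BACKUP_GLOB]

    newest = max(candidates, key=os.path.getmtime)
    age_hours = (time.time() - os.path.getmtime(newest)) / 3600
    size_bytes = os.path.getsize(newest)

    if age_hours > MAX_AGE_HOURS:
        problems.append("%s is %.1f hours old" % (newest, age_hours))
    if size_bytes < MIN_SIZE_BYTES:
        problems.append("%s is only %d bytes" % (newest, size_bytes))
    return problems

def send_alert(problems):
    """Email the list of problems to the admin."""
    msg = EmailMessage()
    msg["Subject"] = "BACKUP CHECK FAILED"
    msg["From"] = ALERT_TO
    msg["To"] = ALERT_TO
    msg.set_content("\n".join(problems))
    with smtplib.SMTP(SMTP_HOST) as smtp:
        smtp.send_message(msg)

if __name__ == "__main__":
    issues = check_latest_backup()
    if issues:
        send_alert(issues)
```

Run it from cron shortly after the backup window; the point is to never treat silence from the backup job itself as success.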
Where I work we have a warm site. Once a month we randomly choose a system or database, go to the warm site, and perform a bare-metal test restoration to ensure we can actually recover our data.
Honestly, if your data is very important to you, it is in your best interest to invest in some software to manage your backups for you. There are hundreds of products out there for this, from the cheap and simple to the enterprise class.
If you are relying on a set of hand-written scripts running from crontab for your company's backups, sooner or later you will likely get burned.
We have 60%-size 'Reference' versions of our 'Production' systems that we use for final testing of changes. We restore 'Production' backups to these systems - it tests the backups and ensures both environments stay in step with each other.
One approach is to script a "recovery" job to run periodically, for instance one that grabs a specific text file from the most recent backup and emails you its contents. If it's possible, this should -- at least sometimes -- be done using a different box than the one that created or backed up the data, just to ensure it will work if you should need to do so. The advantage is that you can be sure your encryption/decryption, compression, and storage mechanisms are all working.
This is a little more involved for specialized backups such as email and database servers, but performing some kind of small-scale recovery from a small database or a brick-level mailbox backup and verifying the contents is certainly possible.
This approach also shouldn't replace a periodic full restore to ensure you can recover data in the event of an emergency -- it just allows you to be a little more confident about the integrity of your day-to-day backup job.
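As a rough sketch of that canary-file idea (assuming the backups are plain tar.gz archives containing a known "canary" text file; the paths and addresses below are placeholders), the job could look like this in Python. Ideally it runs on a box other than the backup server, as noted above:

```python
#!/usr/bin/env python3
"""Sketch of a scheduled "canary restore": pull one known text file out of
the newest backup archive and mail its contents. The archive location, the
canary path inside the archive and the addresses are assumptions."""

import glob
import os
import smtplib
import tarfile
from email.message import EmailMessage

BACKUP_GLOB = "/backups/daily/*.tar.gz"   # assumed backup archives
CANARY_MEMBER = "etc/backup-canary.txt"   # hypothetical file stored in every backup
MAIL_TO = "sysadmin@example.com"
SMTP_HOST = "localhost"

def extract_canary():
    """Read the canary file straight out of the newest archive."""
    newest = max(glob.glob(BACKUP_GLOB), key=os.path.getmtime)
    with tarfile.open(newest, "r:gz") as tar:
        handle = tar.extractfile(CANARY_MEMBER)  # raises KeyError if the member is absent
        if handle is None:
            raise RuntimeError("%s is not a regular file in %s" % (CANARY_MEMBER, newest))
        return newest, handle.read().decode("utf-8", errors="replace")

def mail_contents(archive, body):
    """Send the recovered contents so a human sees proof of a working restore path."""
    msg = EmailMessage()
    msg["Subject"] = "Canary restore from %s" % archive
    msg["From"] = MAIL_TO
    msg["To"] = MAIL_TO
    msg.set_content(body)
    with smtplib.SMTP(SMTP_HOST) as smtp:
        smtp.send_message(msg)

if __name__ == "__main__":
    archive, contents = extract_canary()
    mail_contents(archive, contents)
```

If the archive or the canary file is missing, the script raises an exception instead of mailing anything, so a day without the expected email is itself a warning sign worth investigating.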
When performing a test restore, I don't really feel comfortable stopping at the point of "this looks nice, the files are restored, no file seems to be missing, even the sizes match", or at the point of "this looks nice, I started my application... it doesn't crash and displays some decent data".
I want to restore the server or cluster from scratch and then actually use it for production. Not for a minute, not for an hour, but permanently. If you claim that your restore was successful, then there is absolutely no reason not to run production on it. This is not some "dirty" system to be thrown away afterwards; it is the system you will be facing after a real disaster. So, if it passes the "looks nice" stage, live with it. Back it up the next night. Forget about the original one. You will probably discover some glitches with this approach, and you will be forced to fix all of them. The next restore of the same system then has a decent chance of being 100% successful.
This includes your backup software and server. Yes, you need to restore these too.
Have no budget to buy dedicated hardware for restores? This approach does not need any: the restored system takes over as production, which frees up the original hardware to be the target of the next restore.
You'll probably find that some backup types can easily be restore-tested by scripts (databases, for example) while others need some manual input (an Active Directory restore, say). Automate as much of this as you can, make sure some kind of reporting is in place, and make sure "someone" performs the manual tests at regular intervals as well. An isolated environment (a downscaled copy of production) will make restore testing easier.
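For the scripted database case, here is a minimal sketch assuming PostgreSQL: it restores the newest custom-format dump (pg_dump -Fc output) into a throwaway database on a scratch server and runs one sanity query. The dump location, database name, and table are placeholders, and other database engines would need their own restore commands:

```python
#!/usr/bin/env python3
"""Sketch of a scripted restore test for a database backup, assuming
PostgreSQL, custom-format dumps from pg_dump -Fc, and a scratch server on
which the script may freely create and drop databases. File, database and
table names below are placeholders."""

import glob
import os
import subprocess

DUMP_GLOB = "/backups/db/*.dump"                 # assumed pg_dump -Fc output files
SCRATCH_DB = "restore_test"                      # throwaway database name
SANITY_SQL = "SELECT count(*) FROM customers;"   # hypothetical table to sanity-check

def run(cmd):
    """Run a command, raise if it fails, and return its stdout."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

def restore_test():
    newest = max(glob.glob(DUMP_GLOB), key=os.path.getmtime)

    run(["dropdb", "--if-exists", SCRATCH_DB])   # start from a clean slate
    run(["createdb", SCRATCH_DB])
    run(["pg_restore", "--no-owner", "-d", SCRATCH_DB, newest])

    # The restore only "passes" if real data comes back out.
    rows = int(run(["psql", "-d", SCRATCH_DB, "-t", "-A", "-c", SANITY_SQL]).strip())
    if rows == 0:
        raise RuntimeError("restore of %s produced an empty sanity table" % newest)
    print("OK: %s restored, %d rows in sanity table" % (newest, rows))

if __name__ == "__main__":
    restore_test()
```

Scheduling this in the isolated environment mentioned above keeps the test away from production, and a non-zero exit code from the script is easy to hook into whatever reporting you already have.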
While we don't test the backups themselves, we do have a centralized backup checking and reporting component in the system we developed, BackupRadar.com. Feel free to check it out to see if it helps with that piece. It attaches a copy of the success/failure emails to the backup policy, and it can also attach screenshots if your backup software is capable of sending those.
Thanks, Patrick
Make sure backup activity is logged, then write something (in Perl, of course) that parses those logs looking for failures, distills them down, and sends the result to you as a daily email.
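Roughly the same idea as a short Python sketch (rather than Perl); the log location, failure keywords, and addresses are assumptions about your setup:

```python
#!/usr/bin/env python3
"""Daily backup-log digest: scan the backup logs for failure markers and
mail a one-page summary. Log location, keywords and addresses are
assumptions about the local setup."""

import glob
import smtplib
from email.message import EmailMessage

LOG_GLOB = "/var/log/backup/*.log"               # assumed backup log files
FAILURE_MARKERS = ("ERROR", "FAILED", "cannot")  # strings worth alerting on
MAIL_TO = "sysadmin@example.com"
SMTP_HOST = "localhost"

def collect_failures():
    """Return 'file: line' entries for every suspicious log line."""
    hits = []
    for path in sorted(glob.glob(LOG_GLOB)):
        with open(path, errors="replace") as log:
            for line in log:
                if any(marker in line for marker in FAILURE_MARKERS):
                    hits.append("%s: %s" % (path, line.rstrip()))
    return hits

def mail_report(hits):
    """Send the daily digest, even when everything looks clean."""
    msg = EmailMessage()
    msg["Subject"] = "Backup report: %d problem line(s)" % len(hits)
    msg["From"] = MAIL_TO
    msg["To"] = MAIL_TO
    msg.set_content("\n".join(hits) if hits else "All backup logs look clean.")
    with smtplib.SMTP(SMTP_HOST) as smtp:
        smtp.send_message(msg)

if __name__ == "__main__":
    mail_report(collect_failures())
```

Scheduled daily from cron, this at least guarantees that failures in the backup logs reach someone's inbox the next morning instead of sitting unread on the backup server.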