At my work, backups have a surprisingly low priority. The backup strategy was implemented a while ago, and since then it's just assumed the backups are fine. If you ask the sysadmins, they'll say everything is backed up.
But when you ask for a SPECIFIC backup, half the time it isn't there:
- The disk got full
- The tape failed
- Looks like someone disabled the backup job
- The network connection had downtime
- We ordered that disk years ago, but finance hasn't approved the purchase order
- The files are corrupt
- The backup file contains the wrong database
- Only transaction log backups were taken (useless without a full backup)
A few weeks ago, disaster came really close when one of the servers lost one too many RAID disks. Luckily, one failing disk was still kind enough to give up its data if you retried the copy enough times.
But even after that near-disaster, I can't seem to convince the sysadmins to improve the situation. So I'm wondering, any tips for opening people's eyes? It seems to me we're walking along the edge of a cliff.
You always have to get these things fixed from the top.
Is the current backup strategy backed by and understood by management? If not, it's useless.
Executive management needs to know about the problems and what risks are involved (losing financial data you're legally required to be able to produce, or customer data that has taken years to collect?) and weigh that when deciding on actions, or when deciding to let someone (like you) take action.
If you can't get to management, try business controllers or others in financial positions where data retrieval and its integrity are of high importance to the company's reports. They in turn can "start the storm" if needed...
Where to begin? This is a disaster waiting to happen. A sysadmin's primary job function is to ensure data is backed up and recoverable. Everything else is secondary. No ifs, no buts.
Here are a few things you can do:
Track KPIs for restores. It should be possible to produce a report showing how many restore requests have been successful (see the sketch below, after these suggestions). Anything less than 100% should be investigated thoroughly. Management love reports, and this is hard evidence.
There should be documented procedures for all backup and restore operations, covering every system and its backup strategy, tape rotations, schedules, escalation paths, test restores, etc. Ask to see them.
Speak to the sysadmins' manager and voice your concerns. Go armed with proof that restores aren't working. If you get no joy, go higher.
Seriously - kick up a fuss. Stuff like this can destroy a company.
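To make the KPI point concrete, here is a minimal sketch in Python of the kind of report meant. It assumes a hypothetical restore_requests.csv log with date, system, requested_file and outcome columns; that log is an assumption for the sketch, not something your sysadmins will already have lying around.

```
import csv

def restore_kpis(log_path="restore_requests.csv"):
    # Hypothetical log format: date,system,requested_file,outcome
    # where outcome is "restored" or a failure reason ("tape failed", ...).
    successes, failures = 0, []
    with open(log_path, newline="") as f:
        for row in csv.DictReader(f):
            if row["outcome"].strip().lower() == "restored":
                successes += 1
            else:
                failures.append(row)
    total = successes + len(failures)
    if total == 0:
        print("No restore requests logged.")
        return
    print(f"Restore requests: {total}")
    print(f"Successful: {successes} ({100 * successes / total:.1f}%)")
    # Anything below 100% gets listed so it can be investigated.
    for row in failures:
        print(f"  FAILED {row['date']} {row['system']}: "
              f"{row['requested_file']} ({row['outcome']})")

if __name__ == "__main__":
    restore_kpis()
```

Even a crude report like this gives management one hard number and a list of failed restores to wave at the next meeting.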
Propose (at minimum) yearly disaster recovery tests. The work required to successfully execute the test should reveal shortcomings.
Where I work we have a seriously good IT department: every year they get together from every office around Europe and have a 'restore fest' onto rented servers in a datacentre, effectively simulating what would happen if staff came to work one day and found the office had burnt down during the night.
Get the big boss involved and remind him that if disaster struck, he'd be out of a bonus that year (or worse!), so maybe it would be prudent to organise a similar disaster recovery exercise. It shouldn't take long or cost much: admins get sent away with their offsite backup tapes and told to bring up an identical office environment from them.
Then sit back and watch IT get better. Once management realise that the company data is dangerously close to being permanently lost, sparks will fly (from the rockets that will be strategically placed in said admins).
It is easy to blame the admins, but Oskar has it right: these things are driven from the top. If management won't spend the bucks to make backups a priority, then the sysadmins are usually out of luck and do the best they can with the resources they have.
The key, if you are one of those unlucky admins (and I have been in this boat on some customer engagements), is to ensure that management is briefed, repeatedly and in a paper-trail-confirmable way, that this is a risk to the business.
My strategy is to constantly hammer at the problems. Sometimes the problems actually get fixed, but mostly it's so that whoever I report to can't hide behind the "I was never briefed" excuse. As a consultant, I can usually go one better: I can get my bosses to brief management more senior than anyone I could reach myself that there is a vulnerability. This spreads the blame around, or at least focuses it at a level higher than mine.
At the same time you have to be inventive and work hard to minimize the risks with whatever resources the customer can provide.
While in some cases the admins may be culpable, management is always responsible: either for knowing the risk and not doing enough to mitigate it, or for hiring people who don't alert them to these risks.
I am responsible for about 200 servers spread across the North West of the UK, and this is obviously far too many to check manually.
I configure the backup so that on completion it runs a (VBScript) script that looks through the backup log, works out whether the backup worked or not and writes a record into a central database with the backup result. Then at head office I run a script that queries this database and presents me with a list of sites where either the backup reported an error or there was no report from the site.
The end result is that when I sit down at my desk I have a list of all the sites where I need to check the backup.
The point of all this is that the default assumption is that the backup failed; the backup is considered to have worked only if my VBScript detected no errors and wrote this conclusion into my database. This makes sure backup failures don't go unnoticed.
Some of the servers use Backup Exec, some NTBackup, and some just copy their files to another server across the network. It doesn't matter what type of backup the servers do, as it's easy to tweak my VBScript to check for errors. My script is actually pretty basic: it just opens the backup report as a text file and greps for phrases like "failed to mount", "tape full", "CRC error", etc. I'm sure a professional programmer would do a slicker job. However, the whole thing is simple and robust, and it's proactive in the sense that I see the backup failure report whether I want to or not; I'd only fail to notice an error if I consciously decided to ignore the report.
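For illustration, here is a minimal sketch of the same idea in Python rather than JR's VBScript. The error phrases, the log path, and the SQLite file standing in for the central results database are all assumptions for the sketch, not his actual setup.

```
import socket
import sqlite3
from datetime import date
from pathlib import Path

# Phrases that mark a failed backup in the log; purely illustrative.
ERROR_PHRASES = ("failed to mount", "tape full", "crc error")

def check_backup_log(log_path="C:/BackupLogs/latest.log",
                     db_path="backup_results.db"):
    log = Path(log_path)
    if not log.exists():
        status = "NO REPORT"  # a missing log counts as a failure, not a pass
    else:
        text = log.read_text(errors="ignore").lower()
        hits = [p for p in ERROR_PHRASES if p in text]
        status = "FAILED: " + ", ".join(hits) if hits else "OK"

    # Record the verdict centrally; SQLite stands in here for whatever
    # database head office actually queries each morning.
    with sqlite3.connect(db_path) as db:
        db.execute("""CREATE TABLE IF NOT EXISTS backup_results
                      (site TEXT, run_date TEXT, status TEXT)""")
        db.execute("INSERT INTO backup_results VALUES (?, ?, ?)",
                   (socket.gethostname(), date.today().isoformat(), status))

if __name__ == "__main__":
    check_backup_log()
```

The head-office side is then just a query against backup_results that flags any site whose status isn't OK, or that has no row for today at all, so a silent site shows up as a failure rather than quietly disappearing.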
JR
PS 99% of the backup failures are because the users forgot to change the backup tape. Don't you just love lusers :-)
A backup that isn't tested is no backup whatsoever.