Google did a very thorough study on hard drive failures which found that a significant portion of hard drives fail within the first 3 months of heavy usage.
My coworkers and I are thinking we could implement a burn-in process for all our new hard drives that could potentially save us some heartache from losing time on new, untested drives. But before we implement a burn-in process, we would like to get some insight from others who are more experienced:
- How important is it to burn in a hard drive before you start using it?
- How do you implement a burn-in process?
- How long do you burn in a hard drive?
- What software do you use to burn in drives?
- How much stress is too much for a burn-in process?
EDIT: Due to the nature of the business, RAIDs are impossible to use most of the time. We have to rely on single drives that get mailed across the nation quite frequently. We back up drives as soon as we can, but we still encounter failure here and there before we get an opportunity to back up data.
UPDATE
My company has implemented a burn-in process for a while now, and it has proven to be extremely useful. We immediately burn in all new drives that we get in stock, allowing us to find many errors before the warranty expires and before installing them into new computer systems. It has also proven useful to verify that a drive has gone bad. When one of our computers starts encountering errors and a hard drive is the main suspect, we'll rerun the burn-in process on that drive and look at any errors to make sure the drive actually was the problem before starting the RMA process or throwing it in the trash.
Our burn-in process is simple. We have a designated Ubuntu system with lots of SATA ports, and we run badblocks in read/write mode with 4 passes on each drive. To simplify things, we wrote a script that prints a "DATA WILL BE DELETED FROM ALL YOUR DRIVES" warning and then runs badblocks on every drive except the system drive.
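A minimal sketch of what such a script might look like (the device pattern, log location, and the assumption that /dev/sda is the system drive are all hypothetical; our actual script differs in the details):

```bash
#!/bin/bash
# Hypothetical burn-in sketch: destructively tests every /dev/sd? drive
# except the system drive. /dev/sda as the system drive is an assumption.
set -u
SYSTEM_DRIVE=/dev/sda   # adjust to match the boot disk

echo "WARNING: DATA WILL BE DELETED FROM ALL YOUR DRIVES (except $SYSTEM_DRIVE)."
read -r -p "Type YES to continue: " reply
[ "$reply" = "YES" ] || exit 1

for drive in /dev/sd?; do
    [ "$drive" = "$SYSTEM_DRIVE" ] && continue
    # -w: destructive read/write test (patterns 0xaa, 0x55, 0xff, 0x00)
    # -s: show progress, -v: verbose error reporting
    badblocks -wsv "$drive" > "/var/log/burnin-$(basename "$drive").log" 2>&1 &
done
wait
echo "Burn-in finished; review the logs and each drive's SMART data."
```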
IMNSHO, you shouldn't be relying on a burn-in process to weed out bad drives and "protect" your data. Developing and implementing this procedure will take time that could be better spent elsewhere, and even if a drive passes burn-in, it may still fail months later.
You should be using RAID and backups to protect your data. Once that is in place, let it worry about the drives. Good RAID controllers and storage subsystems will have 'scrubbing' processes that go over the data every so often and ensure everything is good.
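That applies to RAID controllers and storage subsystems, but as an illustration, on Linux software RAID (md) a scrub can also be started by hand; this assumes an array at /dev/md0:

```bash
# Trigger a consistency check (scrub) of the array.
echo check > /sys/block/md0/md/sync_action

# Watch progress, then see how many mismatched blocks were found.
cat /proc/mdstat
cat /sys/block/md0/md/mismatch_cnt
```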
Once that all is taken care of, there's no need to do disk scrubbing, though as others have mentioned it doesn't hurt to do a system load test to ensure that everything is working as you expect. I wouldn't worry about individual disks at all.
As has been mentioned in the comments, it doesn't make a lot of sense to use hard drives for your particular use case. Shipping them around is far more likely to introduce errors that weren't there when you did the burn-in.
Tape media is designed to be shipped around. You can get 250 MB/s (or up to 650 MB/s compressed) with a single IBM TS1140 drive, which should be faster than your hard drive. And bigger as well: a single cartridge can give you up to 4 TB (uncompressed).
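For illustration only (the device name and paths are assumptions), writing a dataset to a Linux tape drive and reading it back to verify might look like:

```bash
# /dev/st0 is the usual first SCSI tape device; adjust for your hardware.
mt -f /dev/st0 rewind
tar -cvf /dev/st0 /data/to/ship      # write the data to tape
mt -f /dev/st0 rewind
tar -tvf /dev/st0 > /dev/null        # list the archive back to verify it reads cleanly
```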
If you don't want to use tape, use SSDs. They can be treated far rougher than HDDs and satisfy all the requirements you've given so far.
After all that, here are my answers to your questions:
- How important is it to burn in a hard drive? Not at all.
- How long do you burn it in? One or two runs.
- What software? A simple run of, say, shred and badblocks will do. Check the SMART data afterwards (a sketch follows this list).
- How much stress is too much? No stress is too much. You should be able to throw anything at a disk without it blowing up.
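As a rough sketch of that kind of minimal run (the device name and the grep pattern are placeholders, not a prescription):

```bash
# Overwrite the whole disk once with random data, then run a destructive
# badblocks pass. /dev/sdX is a placeholder for the drive under test.
shred -v -n 1 /dev/sdX
badblocks -wsv /dev/sdX

# Check the SMART data afterwards for reallocated or pending sectors.
smartctl -a /dev/sdX | grep -iE 'reallocated|pending|uncorrect'
```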
If you have good backups and good high-availability systems, then not very much, since restoring from a failure should be pretty easy.
I will typically run badblocks against a drive or new system when I get it, and I run it whenever I resurrect a computer from the spares pile. A command like this:

badblocks -c 2048 -sw /dev/sde

will write to every block 4 times, each time with a different pattern (0xaa, 0x55, 0xff, 0x00). This test does not do anything to exercise lots of random reads/writes, but it should prove that every block can be written to and read back.

You could also run bonnie++ or iometer, which are benchmarking tools. These should stress your drives a bit, and drives shouldn't fail even if you try to max them out, so you might as well see what they can do. I don't do this myself, but getting an I/O benchmark of your storage system right at install/setup time may be very useful later when you are looking at performance issues (a possible bonnie++ invocation is sketched below).
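If you do want that baseline benchmark, a typical bonnie++ invocation might look like the following (the mount point and user are assumptions):

```bash
# Benchmark the filesystem mounted at /mnt/newdisk; bonnie++ won't run as
# root unless told which user to drop privileges to.
bonnie++ -d /mnt/newdisk -u nobody
```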
A single run of badblocks is enough in my opinion, but I have a very strong backup system, and my HA needs are not that high; I can afford some downtime to restore service on most of the systems I support. If you are so worried that you think a multi-pass setup may be required, then you probably should have RAID, good backups, and a good HA setup anyway.
If I am in a rush, I may skip the burn-in; my backups and RAID should be fine.
Given your clarification, it doesn't sound like any burn-in process would be of any use to you. Drives fail primarily because of mechanical factors, usually heat and vibration, not because of any sort of hidden time bomb. A "burn-in" process tests the installation environment as much as anything else. Once you move the thing, you're back to where you started.
But here are a few pointers that might help you:
Laptop drives are usually designed to withstand more jostling and vibration than desktop drives. My friends who work in data-recovery shops always ship data to clients on laptop drives for that reason. I've never tested this myself, but it seems to be "common knowledge" in select industries.
Flash drives (e.g. USB thumb drives) are about the most shock-resistant medium you'll find. You'll be even less likely to lose data in transit if you use flash media.
If you ship a Winchester drive, do a surface scan before putting it in use. Or better yet, just don't put it into use. Instead, you may want to designate certain drives as "shipping" drives, which see all the abuse but which you don't rely on for data integrity (i.e., copy data onto the drive for shipping, copy it off after shipping, verify checksums on both sides, that kind of thing).
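A minimal way to do that checksum step (the mount points are illustrative):

```bash
# On the sending side: record checksums for everything on the shipping drive.
cd /mnt/shipping_drive
find . -type f ! -name SHA256SUMS -print0 | xargs -0 sha256sum > SHA256SUMS

# On the receiving side, after copying the data off the drive:
cd /mnt/copy_destination
sha256sum -c /mnt/shipping_drive/SHA256SUMS
```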
Your process is wrong. You should be using RAID arrays. Where I work we have built ruggedized RAID arrays that are designed to be transported around. It's not rocket science.
Shock-mounting the drives in oversized enclosures with big rubber vibration isolators will improve reliability hugely. (Seagate Constellation ES drives, as an example, are rated for 300 G shock but only 2 G vibration, non-operating, so the shipping case needs to isolate the drive from vibration. See http://www.novibes.com/Products&productID=62 or http://www.novibes.com/Products&productId=49 [part #50178].)
However, if you really want to burn-in-test hard drives, here is how.
I've worked on systems like hard drives, and burn-in did find some problems, but...
For accelerated lifecycle testing of PCBs to bring out faults, nothing beats some hot/cold cycles. (Operating hot/cold cycles work even better, but they're harder for you to do, especially with banks of HDDs.)
Get yourself an environmental chamber big enough for the number of drives you acquire at a time. (These are pretty expensive; it'd be cheaper to ship RAID arrays around.) You can't skimp on the test chambers: you will need humidity control and programmable ramps.
Program in two repeating temperature ramps, down to the minimum storage temperature and up to the maximum storage temperature, and make the ramps steep enough to upset the application engineer from your hard drive manufacturer. Three cold-hot cycles in 12 hours should see the drives failing pretty quickly. Run the drives like this for at least 12 hours. If any still work afterwards, I'll be surprised.
I didn't think this up: at one place I worked, a production engineer did this to get more products shipped with the same test equipment. There was a huge surge in faults during testing, but the dead-on-arrival rate dropped to practically zero.
I disagree with all the answers that basically say "Don't bother with burn-in, have good backups".
While you should always have backups, I spent 9 hours yesterday (on top of my usual 10-hour shift) restoring from backups because the system was running with drives that hadn't been burned in.
There were 6 drives in a RAIDZ2 config (ZFS equivalent to RAID-6) and we had 3 drives die over the course of 18 hours on a box that had been running for approximately 45 days.
The best solution I've found is to purchase drives from one particular manufacturer (don't mix-and-match), then run their provided tool for exercising the drives.
In our case we buy Western Digital and use their DOS-based drive diagnostics from a bootable ISO. We fire it up, run the option to write random garbage to the entire disk, then run the short SMART test followed by the long SMART test. That's usually enough to weed out all the bad sectors, read/write reallocations, etc...
I'm still trying to find a decent way to 'batch' it so I can run it against 8 drives at a time. Might just use 'dd if=/dev/urandom of=/dev/whatever' in Linux or 'badblocks'.
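One way to batch it under Linux (a rough sketch, not our actual process; the drive range is a placeholder):

```bash
# Write random data over several drives in parallel, then queue SMART self-tests.
# /dev/sd{b..i} is a placeholder for whatever drives are attached.
for d in /dev/sd{b..i}; do
    dd if=/dev/urandom of="$d" bs=1M oflag=direct &
done
wait

for d in /dev/sd{b..i}; do
    smartctl -t short "$d"     # follow up later with: smartctl -t long "$d"
done
# Read the results afterwards with: smartctl -a /dev/sdX
```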
EDIT: I found a nicer way to 'batch' it. I finally got around to setting up a PXE boot server on our network to address a particular need, and noticed that the Ultimate Boot CD can be PXE booted. We now have a handful of junk machines sitting around that can be PXE booted to run drive diagnostics.
How important is it to burn in a hard drive before you start using it?
It Depends.
If you're using it in a RAID that provides redundancy (1, 5, 6, 10)? Not very.
If you're using it standalone? A little bit, but you're better off just running smartd or something similar to monitor it instead, at least in my opinion.
This naturally leads to my answer to "How do you implement a burn-in process?" -- I don't.
Rather than trying to "burn in" disks I run them in redundant pairs and use predictive monitoring (like SMART) to tell me when a drive is getting wonky. I've found that the extra time required to do a full burn-in (really exercising the whole disk) is substantially more expensive than dealing with a disk failure and swap-out.
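If you go the smartd route, a single directive in /etc/smartd.conf is enough for basic monitoring with scheduled self-tests (the email address here is a placeholder):

```
# Monitor all detected drives, email on failures/errors, and run a long
# self-test every Saturday between 01:00 and 02:00.
DEVICESCAN -a -m admin@example.com -s L/../../6/01
```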
Combining RAID and good backups, your data should be very safe, even when dealing with infant mortality (or the other end of the bathtub curve, when you start having drives die of old age).
Spinrite (grc.com) will read and write back all the data on the drive. It's a good thing to do for a new drive even if you're not trying to get it to fail. It takes a long time to run at level 4, usually a couple of days for current-size drives. I should also add that it is non-destructive; in fact, if it finds data in bad spots it will move and recover it. Of course, you would never run it on an SSD.
I'm sure once-a-week benchmarking and error checking would suffice for "burning in" hard drives, though apart from your post I've never heard of such a thing.
Quoted from "6_6_6" on Storagereview.com:
In all, I personally think it's a bad idea.
Source: http://forums.storagereview.com/index.php/topic/27398-new-hdd-burn-in-routines/
First, I agree with other posters that your use case suggests that tape drives will be the better option.
If that is not possible: since you have to fly drives across the nation, a true RAID doesn't seem to be an option, as you would have to transport many more drives, increasing the risk of failure. However, what about a simple mirroring scheme, sending one drive and keeping the other at the source site?
Then, if the drive fails on arrival, a new copy can be made and sent. If the drive is good on arrival, the spare can be reused, either for sending or for backing up the original data.
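A straightforward way to make that spare (device names and mount points are placeholders, and the block-level copy assumes the drives are the same size):

```bash
# Clone the prepared drive onto the spare before shipping one of them.
dd if=/dev/sdX of=/dev/sdY bs=4M conv=fsync status=progress

# Or, if only the files matter, copy at the filesystem level instead.
rsync -aHAX /mnt/master/ /mnt/spare/
```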
You haven't really said why the drives are being shipped: is this just a way of sending data, do they carry complete application/OS images ready to be booted in a PC, or something else?
I agree with the other answers that RAID or backups are better than scanning, due to the risks of shipping a drive causing mechanical issues.
A more general way of putting this would be "rely on redundant data to catch and correct errors" - either ship 2 drives for each set of data, or ship redundant data on a single drive. Something like Parchive lets you add a defined level of redundancy to data, enabling recovery even if a large part of the data is corrupted. Since disks are quite cheap these days, just buying a larger disk than strictly required will often be cheaper than scanning the drive, shipping a replacement drive, or shipping 2 drives.
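As a concrete illustration with par2 (the par2cmdline tool; the directory and the 20% redundancy figure are just examples):

```bash
# Before shipping: create recovery data equal to roughly 20% of the payload.
cd /data/shipment
par2 create -r20 shipment.par2 ./*

# At the destination: verify the payload and repair it if blocks were corrupted.
par2 verify shipment.par2
par2 repair shipment.par2
```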
This would protect against non-catastrophic failures of the drive. However, it's still best not to re-use a shipped drive except for shipping, as suggested earlier; i.e., treat it like a tape whose contents must be copied onto a "real" drive that's permanently installed and not shipped anywhere.
This should let you ship a large amount of data (or even application/OS images) and reduce the impact of disk errors to whatever level is economic.