How does one fully evaluate a RAID configuration?
Pulling drives is one thing, but are there tools and techniques for more?
I've considered putting a nail through a running drive (with a powder-actuated nail gun) to see what would happen, or simulating various electrical anomalies (shorts/opens in a cable, power overloads and surges, etc.).
What should be tested, and how?
-Adam
I think your testing should cover the reasonable cases that you plan for. If you're trying to set up a server in the bush, then electrical fluctuations are a reasonable thing to test. If you're in a data center, the service agreement probably covers power.
If you think a drive wildly exploding inside a rack is reasonable, then test it. Maybe you're setting up a server in a command center in Baghdad. But once again, that's less likely if you're in Washington State.
As a general rule, your tests should cover all expected cases, plus reasonable extreme cases.
And most importantly: RAID by itself doesn't protect against drives silently corrupting data! So make sure you're hashing your files and verifying them!
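As a rough illustration of the "hashes and file verification" idea, here is a minimal Python sketch. It records a SHA-256 digest for every file under a directory, then later reports which files no longer match. The function names (`file_sha256`, `build_manifest`, `verify_manifest`) are made up for this example, and where you store the manifest is left up to you; it's one possible approach, not a recommendation of a specific tool.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its current digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): file_sha256(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_manifest(root, manifest):
    """Return the relative paths that are missing or whose digest changed."""
    root = Path(root)
    bad = []
    for rel, expected in manifest.items():
        p = root / rel
        if not p.is_file() or file_sha256(p) != expected:
            bad.append(rel)
    return bad
```

Build the manifest right after writing known-good data, keep it somewhere other than the array under test, and run `verify_manifest` after each failure experiment. An empty list means every file still reads back the way it was written.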
It is indeed important to test a drive failing inelegantly if you care about the ultimate reliability of the overall solution. Every failed RAID solution I have seen (meaning the redundancy did not protect against failing drives) was due to a failure to test real drive failures. The normal test is to pull a drive, claim that drive failure has been tested, and move on.
The best solution is probably to have a collection of marginal drives, or modified firmware that causes inconsistent responses. Only storage vendors are reasonably likely to have this capability.
I like the idea of putting a nail through a running drive, but the forces on adjacent drives might result in an unrealistically catastrophic failure. Or, conversely, the drive dying outright may result in an unrealistically clean failure.
If I were allowed to do legitimate testing of a RAID, I would destroy a few drives by varying means. I would hook up wires to random components on the drive's board and fry them or short them. I would indeed put a nail through a drive if the geometry of the enclosure made this unlikely to destroy adjacent drives (I think the resulting jostling of the remainder of the array is a reasonable test). And I would intercept a drive's data path and return every possible error, nonsensical results, or correct results delayed by random amounts of time.
Expect drives to return the wrong block sometimes. Expect drives to cause any conceivable electrical problem on their connection.
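To make the "intercept the data path" idea a bit more concrete, here is a small Python sketch of the kinds of misbehavior described above: read errors, the wrong block, nonsensical data, and random delays. It is purely an in-memory simulation, not a way to intercept a real drive, and `FlakyDevice`, `survey`, and the probability parameters are hypothetical names invented for this example.

```python
import hashlib
import random
import time


class FlakyDevice:
    """In-memory stand-in for a misbehaving drive (illustration only)."""

    def __init__(self, blocks, p_error=0.0, p_wrong_block=0.0,
                 p_garbage=0.0, max_delay=0.0, seed=0):
        self.blocks = list(blocks)
        self.p_error = p_error              # chance a read raises an I/O error
        self.p_wrong_block = p_wrong_block  # chance a read returns a neighboring block
        self.p_garbage = p_garbage          # chance a read returns random bytes
        self.max_delay = max_delay          # upper bound on injected latency, in seconds
        self.rng = random.Random(seed)      # seeded so a test run is repeatable

    def read(self, index):
        if self.max_delay > 0:
            time.sleep(self.rng.uniform(0, self.max_delay))
        roll = self.rng.random()
        if roll < self.p_error:
            raise IOError(f"injected read error on block {index}")
        roll -= self.p_error
        if roll < self.p_wrong_block:
            # Valid-looking data, but from the wrong location.
            return self.blocks[(index + 1) % len(self.blocks)]
        roll -= self.p_wrong_block
        if roll < self.p_garbage:
            size = len(self.blocks[index])
            return bytes(self.rng.getrandbits(8) for _ in range(size))
        return self.blocks[index]


def survey(device, checksums):
    """Read every block once; count reported errors and silent mismatches."""
    errors = mismatches = 0
    for index, expected in enumerate(checksums):
        try:
            data = device.read(index)
        except IOError:
            errors += 1
            continue
        if hashlib.sha256(data).hexdigest() != expected:
            mismatches += 1
    return errors, mismatches
```

The distinction `survey` draws is the one that matters here: an error the drive reports is the easy case, while a wrong or garbage block comes back looking like a successful read and is only caught by comparing against a checksum taken earlier.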
My experience is that no one considering a storage purchase wants to do real testing, because it could expose real problems. I'd be very interested to hear if there is anyone who actually tests storage reliability; if so, they are certainly not publishing their results.