Given the MTTF T of an individual drive (say, 100,000 hours) and the average time r it takes the operator to replace a failed drive and the array controller to rebuild the array (say, 10 hours), how long will it take, on average, for a second drive to fail while the first failed drive is still being replaced and rebuilt, thus dooming the entire N-drive RAID5 array?
In my own calculations I keep coming up with results of many centuries -- even for large values of N and r -- which means that using "hot spares" to reduce the recovery time is a waste... Yet so many people choose to dedicate a slot in a RAID enclosure to a hot spare (instead of using it for more capacity) that it baffles me...
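For concreteness, here is the kind of back-of-the-envelope estimate I mean -- a quick Python sketch using the standard MTTDL ~ T^2 / (N * (N-1) * r) approximation, which assumes independent drives and exponentially distributed failure times:

```
# Rough mean time to data loss (MTTDL) for an N-drive RAID5 array,
# assuming independent drives, exponential failure times, and r << T.
def raid5_mttdl_hours(T, r, N):
    # A first failure occurs at rate N/T; during the repair window r,
    # any of the remaining N-1 drives failing destroys the array.
    return T * T / (N * (N - 1) * r)

T = 100_000   # per-drive MTTF in hours
r = 10        # hours to replace the drive and rebuild the array
N = 10        # drives in the array

hours = raid5_mttdl_hours(T, r, N)
print(f"MTTDL ~ {hours:,.0f} hours ~ {hours / (24 * 365):,.0f} years")
# prints roughly 11,111,111 hours, i.e. about 1,268 years
```

Even at N = 20 and r = 24 hours this still comes out to well over a century, which is why the hot-spare habit puzzles me.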
Let's try a 10-drive RAID5 array with a 3% AFR and a two-day rebuild time and do some rough calculations:
A 3% AFR across 10 drives means we have roughly a 30% chance that some drive fails in a given year.
With a two-day rebuild, the chance that one of the nine remaining drives fails during the rebuild is about 0.15% (3 * 9 * 2 / 365). Multiplying the two gives roughly a 0.05% chance per year (0.3 * 0.15%) of a catastrophic double failure with service interruption.
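The same rough numbers in a few lines of Python (a back-of-the-envelope sketch that treats failures as independent and ignores compounding):

```
afr = 0.03          # annual failure rate per drive
n_drives = 10
rebuild_days = 2

# Chance that at least one of the 10 drives fails in a given year
# (linear approximation; the exact value 1 - 0.97**10 is ~26%).
p_first = n_drives * afr                                # ~0.30

# Chance that one of the 9 surviving drives fails during the rebuild window.
p_second = (n_drives - 1) * afr * rebuild_days / 365    # ~0.0015

# Chance per year of a catastrophic double failure.
p_data_loss = p_first * p_second                        # ~0.0004

print(f"P(rebuild triggered in a year)   ~ {p_first:.1%}")
print(f"P(second failure during rebuild) ~ {p_second:.2%}")
print(f"P(data loss in a given year)     ~ {p_data_loss:.3%}")
```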
I agree that a hot spare is not the right solution to this problem. It only shortens the vulnerability window a little: the spare removes the wait for someone to physically swap the drive, but the rebuild itself takes just as long.
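To put a number on that, suppose the two-day window above is roughly one day waiting for someone to swap the drive plus one day of actual rebuild (an illustrative split, not a measured one). A hot spare removes only the waiting part:

```
afr = 0.03
n_drives = 10

def p_data_loss_per_year(window_days):
    # Same back-of-the-envelope model as above: chance a rebuild is
    # triggered in a year times chance a second drive dies in the window.
    p_first = n_drives * afr
    p_second = (n_drives - 1) * afr * window_days / 365
    return p_first * p_second

# Hypothetical split: 1 day operator response + 1 day rebuild.
print(f"no hot spare (2-day window): {p_data_loss_per_year(2):.3%}")
print(f"hot spare    (1-day window): {p_data_loss_per_year(1):.3%}")
```

Even with that generous assumption, the spare only halves an already small number.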