This may sound like an odd question, but it's generated some spirited discussion with some of my colleagues. Consider a moderately sized RAID array consisting of something like eight or twelve disks. When buying the initial batch of disks, or buying replacements to enlarge the array or refresh the hardware, there are two broad approaches one could take:
- Buy all the drives in one order from one vendor, and receive one large box containing all the disks.
- Order one disk apiece from a variety of vendors, and/or spread several single-disk orders out over a period of days or weeks.
There's some middle ground, obviously, but these are the main opposing mindsets. I've been genuinely curious which approach is more sensible in terms of reducing the risk of catastrophic failure of the array. (Let's define that as "25% of the disks fail within a time window equal to how long it takes to resilver the array once.") The logic being, if all the disks came from the same place, they might all have the same underlying defects waiting to strike. The same timebomb with the same initial countdown on the clock, if you will.
I've collected a couple of the more common pros and cons for each approach, but some of them feel like conjecture and gut instinct instead of hard evidence-based data.
Buy all at once, pros
- Less time spent in research/ordering phase.
- Minimizes shipping cost if the vendor charges for it.
- Disks are pretty much guaranteed to have the same firmware version and the same "quirks" in their operational characteristics (temperature, vibration, etc.)
- Price increases or stock shortages are unlikely to stall the project midway.
- Each disk is on hand the moment it's needed for installation.
- Serial numbers are all known upfront, disks can be installed in the enclosure in order of increasing serial number. Seems overly fussy, but some folks seem to value that. (I guess their management interface sorts the disks by serial number instead of hardware port order...?)
Buy all at once, cons
- All disks (probably) came from the same factory, made at the same time, of the same materials. They were stored in the same environment, and subject to the same potential abuses during transit. Any defect or damage present in one is likely present in all.
- If the drives are being swapped one at a time into an existing array and each new disk needs to be resilvered individually, it could potentially be weeks before the last disk from the order is installed and discovered to be faulty. The return/replacement window with the vendor may expire during this time.
- Can't take advantage of near-future price decreases that may occur during the project.
Buy individually, pros
- If one disk fails, it shares very little manufacturing/transit history with any of the other disks. If the failure was caused by something in manufacturing or transit, the root cause likely did not occur in any other disk.
- If a disk is dead on arrival or fails during the first hours of use, that will be detected shortly after the shipment arrives and the return process may go more smoothly.
Buy individually, cons
- Takes a significant amount of time to find enough vendors with agreeable prices. Order tracking, delivery failure, damaged item returns, and other issues can be time-consuming to resolve.
- Potentially higher shipping costs.
- A very real possibility exists that a new disk will be required but none will be on-hand, stalling the project.
- Imagined benefit. Regardless of the vendor or date purchased, all the disks came from the same place and are really the same. Manufacturing defects would have been detected by quality control and substandard disks would not have been sold. Shipping damage would have to be so egregious (and plainly visible to the naked eye) that damaged drives would be obvious upon unpacking.
If we're going simply by bullet point count, "buy in bulk" wins pretty clearly. But some of the pros are weak, and some of the cons are strong. Many of the bullet points simply state the logical inverse of some of the others. Some of these things may be absurd superstition. But if superstition does a better job at maintaining array integrity, I guess I'd be willing to go along with it.
Which group is most sensible here?
UPDATE: I have data relevant to this discussion. The last array I personally built (about four years ago) had eight disks. I ordered from one single vendor, but split the purchase into two orders of four disks each, about one month apart. One disk of the array failed within the first hours of running. It was from the first batch, and the return window for that order had closed in the time it took to spin everything up.
Four years later, the seven original disks plus one replacement are still running error-free. (knock on wood.)
In practice, people who buy from enterprise vendors (HPE, Dell, etc.) do not worry about this.
Drives sourced by these vendors are already spread across multiple manufacturers under the same part number.
An HP disk under a particular SKU may be HGST, Seagate, or Western Digital: same HP part number, with variation in manufacturer, lot number, and firmware.
You shouldn't try to outsmart/outwit the probability of batch failure, though. You're welcome to try if it gives peace of mind, but it may not be worth the effort.
Good practices like clustering, replication and solid backups are the real protection against batch failures. Add hot and cold spares. Monitor your systems closely. Take advantage of smart filesystems like ZFS :)
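As a rough illustration of the "monitor closely" point, here is a minimal, cron-able health-check sketch. It assumes ZFS is in use and that `zpool` is on the PATH; the alert action is just a placeholder.

```python
#!/usr/bin/env python3
"""Minimal sketch of a pool health check, assuming ZFS and enough
privilege to call `zpool`. The alert step is a print placeholder."""
import subprocess

def unhealthy_pool_report() -> str:
    # `zpool status -x` prints "all pools are healthy" when nothing is wrong,
    # otherwise it prints details only for the degraded/faulted pools.
    result = subprocess.run(["zpool", "status", "-x"],
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()

if __name__ == "__main__":
    report = unhealthy_pool_report()
    if "all pools are healthy" not in report:
        # Replace this with the mail/pager/webhook of your choice.
        print("ALERT: a pool needs attention:\n" + report)
```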
And remember, hard drive failures aren't always mechanical...
In deference to the answer from ewwhite, some sysadmins do order in batches. I would never, myself, order drives on an individual basis, but standard ops at the last place I worked in such a capacity was to order drives in batches. For a twelve drive machine, SOP dictated that the drives be split into three batches, giving the machine a three tier redundancy profile.
However, other small outfits that I have consulted at have followed different protocols, some not concerned with the batch, and others splitting batches into two or four arrays. The short answer is do what feels appropriate for the level of service you need to achieve.
Side note: The last place I worked was certainly doing the right thing. The app storage machine decided to fail on an entire batch of drives, and we discovered that this particular batch all had the same fault. Had we not followed a batch protocol, we would have suffered a catastrophic loss of data.
Honest answer from someone who's spent a lot of time dealing with dying RAID arrays and difficult drives: don't have all your drives from the same batch if you can avoid it.
My experience only applies to spinning disks, SSDs have their own issues and benefits to consider when bulk ordering.
Exactly how best to handle things depends mostly on how big the array you're working with is. If you're working with something like 6-drive arrays with 2-drive redundancy, you can probably safely buy similar drives from 3 manufacturers and split the array that way.
If you're using an odd number of drives, or you're working with arrays that can't be easily partitioned like that, you can try other approaches: buy the same drive from different vendors, or, if you're buying in bulk, look through the shipment and try to separate the drives based on how likely they are to have been manufactured together.
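As a rough sketch of that "separate the drives" idea (my own illustration, not a prescribed method): bucket the drives by model and serial-number prefix, a crude proxy for "made in the same run", and deal the buckets out round-robin across your redundancy groups. The inventory below is entirely hypothetical.

```python
from collections import defaultdict
from itertools import cycle

# Hypothetical inventory of (model, serial) pairs, e.g. collected via `smartctl -i`.
drives = [
    ("MODEL-A", "ZA1A0001"), ("MODEL-A", "ZA1A0002"), ("MODEL-A", "ZA1B0003"),
    ("MODEL-B", "VCG10001"), ("MODEL-B", "VCG10002"), ("MODEL-B", "VCG20003"),
]

def spread_across_groups(drives, n_groups):
    """Bucket drives by (model, serial prefix) and deal the buckets out
    round-robin, so no redundancy group is dominated by one likely batch."""
    buckets = defaultdict(list)
    for model, serial in drives:
        buckets[(model, serial[:4])].append((model, serial))  # crude batch key
    groups = [[] for _ in range(n_groups)]
    slots = cycle(range(n_groups))
    for batch in buckets.values():
        for drive in batch:
            groups[next(slots)].append(drive)
    return groups

for i, group in enumerate(spread_across_groups(drives, 3)):
    print(f"group {i}: {group}")
```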
If you're running a small enough array with the right underlying tech, it might even be worth your time to build it incrementally from heterogeneous disk supplies. Start with the minimum number of drives you can get away with and buy the next batch a month or two later, or when you fill the system. That also lets you get a feel for any issues there might be with the particular models you picked.
The reason behind this advice is a combination of two quirks of drives.
MTBF is remarkably broken when you have a lot of drives with similar origins. In statistics we'd call it a sampling bias: because of the similarity in your samples, the averaging effects will tend to be less useful. If there's a fault with the batch, or even with the design itself (and it happens more often than you'd think), then drives from that batch will fail sooner than MTBF would suggest.
If the drives are spread out, you might get [50%, 90%, 120%, 200%] of MTBF, but if all the drives come from that 50% batch you've got a mess on your hands.
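As a toy illustration of why the clustering matters (made-up numbers, purely to show the shape of the effect): if every drive in the array shares one batch-characteristic lifetime, failures bunch up, whereas a mixed population spreads them out.

```python
import random

ARRAY_SIZE = 8
RESILVER_WINDOW = 48      # hours; illustrative only
TRIALS = 50_000

def drive_lifetimes(same_batch: bool):
    """Toy model: each batch has its own characteristic lifetime (which varies
    a lot from batch to batch), with a small spread inside the batch. All the
    numbers here are invented purely for illustration."""
    def batch_mean():
        return random.uniform(20_000, 60_000)          # hours
    shared = batch_mean()
    means = [shared if same_batch else batch_mean() for _ in range(ARRAY_SIZE)]
    return sorted(random.gauss(m, 700) for m in means)  # tight spread within a batch

def second_failure_in_window(lifetimes):
    # Roughly the question's "catastrophe": another drive dies before the
    # rebuild triggered by the first failure could plausibly have finished.
    return (lifetimes[1] - lifetimes[0]) < RESILVER_WINDOW

for label, same_batch in (("mixed batches", False), ("single batch", True)):
    hits = sum(second_failure_in_window(drive_lifetimes(same_batch))
               for _ in range(TRIALS))
    print(f"{label:13s}: {hits / TRIALS:.2%} of simulated arrays lose a second drive mid-rebuild")
```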
RAID array reassembly kills disks. No, really. If you get a drive failure and the array rebuilds, it's going to put extra load on the other drives while it scans the data off them. If you have a drive close to failure, the rebuild may well take it out, or it may already have a failure location that you just weren't aware of because that section hadn't been read recently.
If you've got a lot of drives from the same batch, the chances of this kind of cascade failure occurring are much higher than if they're from different batches. You can mitigate this by having regular patrol scans, scrubs, resilvering, or whatever the recommended practice is for the type of array you're using, but the downside is that it will impact performance and can take hours to complete.
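If ZFS happens to be the stack in use, those regular scrubs can be kicked off from cron during a quiet window; a minimal sketch (the pool name is a placeholder, and hardware RAID or md have their own patrol-read / check mechanisms):

```python
#!/usr/bin/env python3
"""Sketch: start a ZFS scrub from a cron job during a low-traffic window.
Assumes ZFS is in use; the pool name below is a placeholder."""
import subprocess
import sys

POOL = "tank"   # placeholder pool name

if __name__ == "__main__":
    try:
        # `zpool scrub` returns immediately; the scrub runs in the background
        # and progress is visible in `zpool status <pool>`.
        subprocess.run(["zpool", "scrub", POOL], check=True)
        print(f"scrub started on {POOL}; watch `zpool status {POOL}` for progress")
    except (subprocess.CalledProcessError, FileNotFoundError) as err:
        sys.exit(f"could not start scrub on {POOL}: {err}")
```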
For some context on how wildly the longevity of drives varies, Backblaze do a regular drive failure stat report... I'm not affiliated with the company in any way, but they should know what they're talking about on the subject of drive reliability. An example is https://www.backblaze.com/blog/hard-drive-failure-rates-q1-2017/ ... your sample set will likely be smaller, so outlying data can skew your own experience, but it's still a good reference.
I had to consider this issue for a customer a couple years ago. I have a combination of practical experience and research to back up the recommendation to multisource.
Setting aside your pros and cons for the moment, as well as ewwhite's excellent answer, prudence suggests that if you are buying the drives yourself, you multisource them. A quick look at the Wikipedia discussion of RAID weaknesses points to two interesting references.
The first reference is the ACM paper RAID: High-Performance, Reliable Secondary Storage (Chen, Lee, Gibson, Katz and Patterson. ACM Computing Surveys. 26:145-185). In section 3.4.4 the authors point out that hardware failures are not always statistically independent events, and give the reasons why. At the time I am writing this answer, the paper is available online; pp 19-22 discuss reliability (http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.41.3889).
The second reference is Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you? (Schroeder, Gibson. 5th USENIX Conference on File and Storage Technologies.) The authors present statistical data to back up the assertion that drive failures may be clustered in time at a rate higher than predicted for independent events. At the time I am writing this answer, this paper is also available online (https://www.usenix.org/legacy/events/fast07/tech/schroeder/schroeder_html/index.html).
Dell explicitly recommended against RAID 5 back in 2012 because of correlated disk failures in large-disk environments; RAID 6 is predicted to become unreliable for similar reasons around 2019 (a ZDNet article, "Why RAID 6 stops working in 2019": http://www.zdnet.com/article/why-raid-6-stops-working-in-2019/). While a key element of both of these is disk size and rebuild times, smaller drive sizes and multisourcing were recommended as mitigations for the RAID 5 issue.
So yes, multisource the drives if you can; if you are buying from an enterprise vendor as described in ewwhite's answer this may be happening for you transparently. However ... my customer bought 16 2TB drives from an enterprise vendor. They just happened to be from the same manufacturer and appeared to be manufactured at the same time. Two of the drives failed within two weeks of configuring the RAID01 arrays. So check the drives when you get them. (You already check them anyway, right?)
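For the "check the drives when you get them" step, here is a minimal acceptance-check sketch, assuming smartmontools is installed; the device paths are placeholders, and a long self-test (or a full write/read pass) is the bare minimum before trusting a new drive.

```python
#!/usr/bin/env python3
"""Sketch of an acceptance check for newly arrived drives: record identity
(model, serial, firmware) so you know which batch each drive came from, then
start a SMART extended self-test. Assumes smartmontools is installed and the
device paths below are replaced with your own."""
import subprocess

DEVICES = ["/dev/sda", "/dev/sdb"]   # placeholders

for dev in DEVICES:
    # Log model, serial number and firmware version for each drive.
    info = subprocess.run(["smartctl", "-i", dev],
                          capture_output=True, text=True).stdout
    print(info)
    # Kick off the extended self-test; results show up later in `smartctl -a <dev>`.
    subprocess.run(["smartctl", "-t", "long", dev], check=False)
```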
Another potential disadvantage to ordering drives individually is packaging and handling.
Hard drives are almost never supplied in retail packaging. If you buy them one at a time, they will almost certainly be repacked by the seller. I have found this repackaging to be highly variable: sometimes you get a nice box with plenty of padding, but other times you get hardly any padding at all.
A smaller box is also more vulnerable to being tossed around by carriers without obvious outward damage.
I always buy used/bulk. The orders I place are almost always for the same device model, and being used at least mitigates the concern about a "bad batch". There's so much fire-sale hardware floating around the web that I have a hard time justifying buying new drives (or anything else, for that matter) unless it's for mission-critical hardware (and all our backup hardware is still refurb!)
+PRO: Competitive online pricing and the constant flood of hardware from shifting business environments mean it takes almost no effort to get 50-80% off retail for working-environment pulls.
+PRO: Price. The low price frees up budget to over-purchase and maintain a solid back stock of replacement hardware.
+PRO: Seller relations. I have a handful of online sellers who give me slight discounts on top of the already sizeable discount for refurb/used hardware. You're not usually going to get that with Monoprice unless you are buying in huge quantities or have an SLA with them. Also, especially with hard drives, just make sure you test them right out of the box. I've never had a problem with a seller not refunding or replacing DOA hardware (unless it was a scam I failed to catch).
-CON: Warranty and legitimacy issues. Warranty is based on the manufacture date of the device, and you'll also need to keep a lookout for online hucksters trying to sell you re-brands, clones, etc.
-CON: Testing. You need to factor in the overhead of testing. Then again, you should be testing fresh hardware anyway, so I'm not sure this really applies.
-CON: Lifespan. Difficult to judge; used drives are slightly more susceptible to failure.
Note: if it's a client build and they don't explicitly request refurb/used, always buy shiny/new!
It is possible to get more reliability by using hard drives that come from different batches and, ideally, different manufacturers; otherwise they may fail too close together in time. The excellent answer from @Eliodorus explains this well enough.
Of course, it does not matter who does the shuffling. If your vendor confirms that it already mixes batches for you, there is no need to worry about it yourself. However, it does not seem reasonable to do forensics on what may even be a different vendor's stock and conclude that somebody is doing it for you when you have not been told so directly. Vendors are usually not shy about advertising the various measures they take to increase the reliability of their drives.
If you are trying to mitigate the "bad batch" scenario, which means every drive in a particular purchase batch can/will fail near the same time, it is also important to consider the size of the array, and the RAID level being used.
If you are considering multiple orders, no single standard applies across the board. People recommending two to four purchasing tiers should ask themselves: if one entire tier of drives fails, will the array still be online? For redundancy levels like RAID 1/5/10/50 that would mean buying drives one at a time; for RAID 6 you could purchase two at a time.
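Written out as arithmetic (illustrative only, using the usual worst-case fault tolerances): a purchase batch may not be larger than the number of drives the array can lose at once, which dictates how many separate purchases a given array needs.

```python
import math

# Illustrative worst-case fault tolerances: how many drives the array can
# lose simultaneously and stay online. Adjust for your own layout
# (span width, mirror depth, hot spares, etc.).
FAULT_TOLERANCE = {
    "RAID 1/5/10/50": 1,   # guaranteed to survive only one arbitrary failure
    "RAID 6": 2,           # survives any two simultaneous failures
}

ARRAY_SIZE = 12   # example array size from the discussion above

for level, tolerance in FAULT_TOLERANCE.items():
    purchases = math.ceil(ARRAY_SIZE / tolerance)
    print(f"{level}: at most {tolerance} drive(s) per batch "
          f"-> {purchases} separate purchases for {ARRAY_SIZE} drives")
```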
Regardless of how you purchase the drives, I would recommend that you back up regularly and purchase adequate hot/cold spares for your array size and RAID type.
Actually, it depends on the RAID (redundant array of inexpensive disks) level. In RAID 2, 3, 4, 5 and 6, it does help to have drives from several different batches, but it is not decisive: one already inherently forfeits reliability and performance in using these levels.
Now, for the usually sane choice, that of RAID 1 (mirroring) or RAID 1+0 (striping over mirrors), it is indeed useful to have different drives on different sides of each mirror (each RAID 1 pair), so as not to have the mirror fail during a recovery. There should also be hot spares to minimize the recovery window.
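A tiny sketch of that pairing idea, with made-up batch labels: put one drive from each batch on each side of every mirror, so a bad batch can only ever take out one half of any RAID 1 pair.

```python
# Hypothetical inventory: drive identifiers tagged by the purchase batch
# they came from (labels invented for illustration).
batch_a = ["A1", "A2", "A3"]
batch_b = ["B1", "B2", "B3"]

# One drive from each batch per mirror: a bad batch can degrade every
# mirror, but never destroy both halves of the same one.
mirrors = list(zip(batch_a, batch_b))
print(mirrors)   # [('A1', 'B1'), ('A2', 'B2'), ('A3', 'B3')]
```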
For more information, check out the tongue-in-cheek but informative Battle Against Any Raid 'F' (BAARF) Web site, by the prestigious OakTable Network of senior DBAs. Wikipedia also sums up the issue nicely.
As far as I know, the quality checking of disk storage at the factory is pretty high, and I personally would not be afraid of bulk hardware failure due to manufacturing reasons.
And if I were slightly paranoid, I would just buy storage from two different manufacturers that I know don't share factories, through the same vendor.
Storage is so cheap that it does not make sense for a company NOT to buy in bulk, and the company will write off the storage after a couple of years anyway, so the investment is not that great. The time it takes to purchase from individual vendors probably costs more than it saves.
If you are still afraid of bulk disk failure, buy more than you need: if you know you need 12 disks, buy 5 to 7 spares. At roughly $48 per terabyte that is a modest outlay (for example, five spare 4 TB disks would come to about $960), and you can go cheaper still without making the system unstable or unsafe, thanks to bulk discounts or second-hand disks (which are safe enough). As for resilvering / re-initializing the array: I have no way of knowing how large your storage solution is, but if you are spending weeks on that task, I would probably consider reconfiguring the organizational storage, since that sounds to me more like a misconfiguration than anything else.
If we then become REALLY paranoid, get 2x of whatever storage solution you are running; depending on how sensitive your organisation is to a storage breakdown, this could even be the cheaper option, and it is not only an option for Fortune 500 companies.
We can also talk about offloading data we don't need here and now, such as (random example) years of historical financial data, to "cloud" vendors after encrypting it first. This removes demand from our own storage, which frees us up either financially or functionally.
Based on who you are, where you are and what you do, different solutions will work best for you.