We've struggled with the RAID controller in our database server, a Lenovo ThinkServer RD120. It is a rebranded Adaptec that Lenovo / IBM dubs the ServeRAID 8k.
We have patched this ServeRAID 8k up to the very latest and greatest:
- RAID BIOS version
- RAID backplane BIOS version
- Windows Server 2008 driver
This RAID controller has had multiple critical BIOS updates even in the short four months we've owned it, and the change history is just... well, scary.
We've tried both write-back and write-through strategies on the logical RAID drives. We still get intermittent I/O errors under heavy disk activity. They are not common, but serious when they happen, as they cause SQL Server 2008 I/O timeouts and sometimes failure of SQL connection pools.
We were at the end of our rope troubleshooting this problem. Short of hardcore stuff like replacing the entire server or the RAID hardware itself, we were getting desperate.
When I first got the server, I had a problem where drive bay #6 wasn't recognized. Switching to a different brand of hard drive, strangely, fixed this -- and updating the RAID BIOS (for the first of many times) fixed it permanently, so I was able to use the original "incompatible" drive in bay 6. On a hunch, I began to suspect that the Western Digital SATA hard drives I chose were somehow incompatible with the ServeRAID 8k controller.
Buying 6 new hard drives was one of the cheaper options on the table, so I went for 6 Hitachi (aka IBM, aka Lenovo) hard drives under the theory that an IBM/Lenovo RAID controller is more likely to work with the drives it's typically sold with.
Looks like that hunch paid off -- we've been through three of our heaviest-load days (Monday, Tuesday, Wednesday) without a single I/O error of any kind. Prior to this we regularly had at least one I/O "event" in that time frame. It sure looks like switching brands of hard drive has fixed our intermittent RAID I/O problems!
While I understand that IBM/Lenovo probably tests their RAID controller exclusively with their own brand of hard drives, I'm disturbed that a RAID controller would have such subtle I/O problems with particular brands of hard drives.
So my question is: is this sort of SATA drive incompatibility common with RAID controllers? Are there some brands of drives that work better than others, or are "validated" against particular RAID controllers? I had sort of assumed that all commodity SATA hard drives were alike and would work reasonably well in any given RAID controller (of sufficient quality).
Even for non-RAID, plain-old desktop hard drives, buying drives from the vendor (at the expected ridiculous markup) can often make a difference. For example, Apple is careful to only ship drives that actually honor Mac OS X's F_FULLFSYNC fcntl() flag, which goes a long way towards making sure things like Time Machine backups work reliably. Again, this is plain vanilla desktop use with no RAID involved. Anything more complex than that and you definitely want to buy, if not the vendor's own over-priced drives, then at least drive models that you know for sure are on the vendor's "approved" list.
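For the curious, here's a minimal sketch of what honoring that flag looks like from the application side. It assumes a Mac OS X host (F_FULLFSYNC is an Apple-specific fcntl), and the file name is made up for illustration:

```
/* Sketch only: ask Mac OS X to flush data all the way to the disk platter.
 * "important.dat" is a placeholder file name. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    int fd = open("important.dat", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) { perror("open"); return 1; }

    const char record[] = "critical record\n";
    if (write(fd, record, strlen(record)) < 0) { perror("write"); return 1; }

    /* fsync() only guarantees the data reached the drive, which may still be
     * sitting in the drive's volatile write cache. F_FULLFSYNC additionally
     * asks the drive itself to flush that cache; a drive that silently
     * ignores the request leaves the data at risk anyway. */
    if (fcntl(fd, F_FULLFSYNC) == -1) {
        perror("fcntl(F_FULLFSYNC)");
        fsync(fd);   /* weaker fallback if the fcntl isn't supported */
    }

    close(fd);
    return 0;
}
```

The whole scheme only works if the drive actually commits data on that flush, which is exactly why Apple screens the drives it ships.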
So, to answer your question, is it common? I'd say, yes, more common than you might think, even beyond the enterprise realm.
Yes, I have encountered this with low-end cards and buggy drivers. However, no, not on an up-to-date Adaptec rebranded card. Wow is all I can say. One thing to consider: maybe it is more a bug in the drive than in the RAID controller.
I don't have a good answer, but since you seem to have exhausted most of your options other than replacing the card (and replacing the drives did the trick), here are a few ideas to consider for your troubleshooting:
- The WD drives were RE (RAID Edition) drives, right? Time-limited error recovery is important: without it, a drive that is attempting to recover a bad sector will give you a looooong pause, and if the RAID controller is being patient and not dropping the drive, you'll have a big problem on your hands.
- Check the SMART data on the drives you removed and see if there is anything interesting.
Another comment about the importance of the time-limited error recovery (TLER) feature, from NAS / RAID vendor support:
I don't think it's common per se. However, as soon as you start using enterprise storage controllers, whether that be SAN's or standalone RAID controllers, you'll generally want to adhere to their compatibility list rather closely.
You may be able to save some bucks on the sticker price by buying a cheap range of disks, but that's probably one of the last areas I'd want to save money on - given the importance of data in most scenarios.
In other words, outright incompatibility is very uncommon, but sticking to the explicit compatibility list is advisable.
I wouldn't dream of using SATA disks for a server - none of them have the expected duty cycle of a server-quality drive, and they don't have the rich command set that SCSI/SAS has for monitoring drive performance and health. Lenovo servers are cheap, and great if you have lots of servers and none of them is really that important, but there's a reason HP's 300-series servers account for 40% of the market: they work. In particular, their 'SmartArray' disk controllers are matchless in reliability and performance, and their pre-failure guarantee is a welcome addition. Not the cheapest, but how much is your time worth? I've been buying their (well, Compaq's first, tbh) servers for twenty years now and have no issue whatsoever buying the 500-800 new ones a year that I do. Seriously, check them out.
The answer as always is "it depends".
For certain enterprise storage (say EMC), the vendor will specifically qualify drives and even go to the extent of loading custom firmware.
As Mark says, I find it best to follow a vendor's approved list if there is one. The initial cost savings are outweighed by the time spent trying to hunt down gremlins.
You have a SAS controller, and that might be the problem. While the SAS protocol can be used to tunnel ATA commands, the signaling at the physical level is a bit different (SAS uses a higher voltage and a wider differential). Almost all controllers are able to speak directly to SATA drives, but if there's a (big? crappy?) backplane in the middle, the signal might be disrupted. In the enterprise world, attaching SATA drives directly to a SAS controller is usually not officially supported; you're supposed to use an interposer, a small logic board that connects directly to the disk and speaks the full SAS protocol on one side and ATA on the other, so that the backplane carries the higher-voltage SAS signaling.
Somewhat related: mixing SAS and SATA drives on the same backplane tends to fail, because the signaling of all drives (including SAS) is lowered to SATA level.
Most probably your WD drives need a firmware update. See this IBM note for downloading and applying the update. As you can see from the instructions, the WD drives are far from the only ones with problems.
If you are going to put your drives in a taxing server environment, you are bound to run into more problems than in a typical enthusiast desktop configuration.
Could you maybe comment on why you chose to go with the desktop class Deskstar series of drives instead of the Enterprise/RAID class Ultrastar series? Do you feel the extra cost is not worth the added reliability and speed?
As an engineer that works with RAID controllers, I can say that it is not uncommon for some brands of drives to have problems with certain RAID controllers. Every drive has its particular quirks, and any drive model listed on the controller's "compatible devices" list will have its quirks accounted for by the controller. For a drive model to show up on the list, it has to meet the controller manufacturer's standards for performance and reliability. Any drive not on this list might work, but since it hasn't gone through the same rigorous testing as "approved" devices, YMMV.
In particular, the SATA protocol allows for vendor-specific (non-standardized) commands that can be defined by the drive or the controller. In your case, you might be seeing a controller that expects a drive to respond to a particular proprietary command, or a drive that expects to see a proprietary command that never arrives.
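To make that concrete, here's a rough sketch of a host pushing a raw ATA command at a drive. It assumes a Linux host with the sg passthrough driver; /dev/sdb and the 0xF0 opcode are made-up placeholders, not a real vendor command. A drive that doesn't implement the opcode will typically just abort it, and firmware on either side that expects the command to succeed can turn that into exactly this kind of odd, intermittent error:

```
/* Sketch only: send one raw ATA command to a SATA drive via Linux SG_IO.
 * /dev/sdb and the 0xF0 opcode are placeholders for illustration. */
#include <fcntl.h>
#include <scsi/sg.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void) {
    int fd = open("/dev/sdb", O_RDWR);           /* placeholder device node */
    if (fd < 0) { perror("open"); return 1; }

    unsigned char cdb[16] = {0};                 /* ATA PASS-THROUGH (16) CDB */
    unsigned char sense[32] = {0};

    cdb[0]  = 0x85;       /* ATA PASS-THROUGH (16) */
    cdb[1]  = 3 << 1;     /* protocol 3 = non-data */
    cdb[2]  = 0x20;       /* CK_COND: report ATA status back in sense data */
    cdb[14] = 0xF0;       /* made-up "vendor-specific" ATA opcode */

    struct sg_io_hdr io;
    memset(&io, 0, sizeof io);
    io.interface_id    = 'S';
    io.cmdp            = cdb;
    io.cmd_len         = sizeof cdb;
    io.sbp             = sense;
    io.mx_sb_len       = sizeof sense;
    io.dxfer_direction = SG_DXFER_NONE;
    io.timeout         = 5000;                   /* milliseconds */

    if (ioctl(fd, SG_IO, &io) < 0)
        perror("SG_IO");                         /* kernel/transport problem */
    else if (io.status || io.host_status || io.driver_status)
        fprintf(stderr, "drive rejected or aborted the command\n");
    else
        printf("drive accepted the command\n");

    close(fd);
    return 0;
}
```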
Another possibility is that your problematic drives do not behave very well under certain stressful workloads, and the behavior you see was enough for Adaptec/IBM to not list that drive model as supported.
Unfortunately, storage protocols (SATA, SAS, etc) are not as nice as other standardized interfaces (USB, PCI, etc) where all you need is a bus and a device that speak the same language and everything's fine. Especially when it comes to Enterprise-grade equipment, device manufacturers and drive manufacturers spend a lot of collaborative time and energy ensuring that customers get the best possible performance out of the configurations used by the majority of customers (that is, using drives off of the "supported devices" list). A drive not on that list may have been designed to perform optimally with a different brand of controller, and the errors you are seeing are a side effect of the optimization.