It's pretty common to see advice to disable the write cache on individual disks used for databases because otherwise some disks will acknowledge writes that haven't yet made it to the disk surface.
This implies that some disks don't acknowledge writes until they've made it to the disk surface (Update: or that they report accurately when asked to flush the cache). Where can I find such disks, or where can I look for authoritative information on where to find them?
I'm setting up some DB servers that would really benefit from using write caching, but the application is price sensitive and I'd rather not double the cost of my disk subsystem for a caching RAID controller, because I don't have enough information to know whether I can trust the cache in each drive.
Generally speaking, in direct answer to your question: I am not aware of any major brand of SATA drive where the drive itself has had bugs in its handling of write caching. That is, from the drive's perspective alone, the drive does what it is supposed to do when caching is enabled. I would also note that even with write caching enabled, the delay between a write arriving on the SATA cable and the data physically reaching the rotating media is still very short (typically ~50 to 100 ms). It's not like the dirty cache data just sits there for seconds at a time; the drive is continually moving dirty data from the cache onto the physical media as soon as it can. This is not just a question of data safety, but of being ready to accept future writes without any delay (i.e., write posting).
The issue that arises when caching is enabled is that the write order seen on the SATA cable and the write order to the rotating media are not the same. This never causes a problem unless you lose power or the system crashes before all of the cache contents make it to disk. Why?
The issue here is the robustness of the file system's and/or database's transaction logic in the face of these out-of-order lost writes. Those potentially lost, out-of-order writes can corrupt the integrity of transaction guarantees that would otherwise have been provided by the disk writes reaching the media in a very specific order.
Now, of course, the designers of file systems, databases, RAID controllers, etc. are aware (or certainly should be aware) of this phenomenon. Write caching is extremely desirable from a performance standpoint in most random-access I/O scenarios; in fact, having the write cache available is a key element of getting any real benefit from Native Command Queuing (NCQ), which is supported on newer SATA and the last few generations of PATA implementations. So, to guarantee ordering to the physical media at the critical moments, the file system and/or application can explicitly request a flush of the write caches to the media. When that sync request completes, everything pending in (potentially) file buffers, OS disk caches, the physical disk cache, etc. is actually on the media, exactly as the transaction design requires at those critical operations.

That is, this happens correctly if the programmers make the right call(s) at the top AND every element of the chain of software and hardware layers does its job correctly: no bugs in this regard in the drive, the RAID controller, the disk drivers, the OS caches, the file system, the database engine, and so on. That is a lot of software that all has to work exactly right. Verifying correctness here is also very difficult, because in almost every normal situation the write order doesn't matter at all, and power-failure and crash scenarios are hard tests to construct. So, in the end, "turning off write caching" at one or more of these layers (and under one or more meanings of that term) has earned a reputation for "fixing" certain kinds of issues. In effect, shutting off the write-caching behavior of the RAID controller, the OS disk cache, the drive, etc. avoids one or more bugs somewhere in the system, and that is the source of the lore.
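As a concrete illustration of the flush request described above, here is a minimal Python sketch (assuming POSIX semantics; the file name is just a placeholder) of how an application asks the OS to push everything buffered for a file down through the caches before it considers a transaction durable:

```python
import os

# Write a "transaction record" and only consider it durable after fsync()
# returns. fsync() asks the kernel to flush its buffers and, on most modern
# systems, to issue a cache-flush command to the drive as well -- but the
# guarantee only holds if every layer below (driver, controller, drive
# firmware) honors it.
with open("journal.log", "ab") as f:
    f.write(b"BEGIN;UPDATE accounts SET ...;COMMIT\n")
    f.flush()             # push the user-space buffer into the OS page cache
    os.fsync(f.fileno())  # ask the kernel to flush the page cache and the drive cache
# Only after fsync() returns without error should the application report
# the transaction as committed.
```

Whether that fsync() actually forces the drive's write cache onto the media depends on the OS, file system, and mount options; that is exactly the chain of layers that all has to behave correctly.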
Anyway, getting back to the core of the question: under SATA, the handling of the disk read/write commands and the flush-cache commands is well defined by the SATA specifications. Additionally, drive manufacturers should have detailed documentation for each drive model or drive family describing their implementation of and compliance with these rules, like this example for Seagate Barracuda drives. In particular, see the details of the SATA SET FEATURES command, which controls the drive's operational mode; subcommand 82h can be used to disable write caching at the drive level, because the default is certainly write caching enabled on every drive I am aware of. If you really want to disable the cache, this command has to be issued after every drive reset or power-up, and that is typically under the control of the disk driver for your operating system. You may be able to get your OS driver to set this mode via an IOCTL and/or registry-setting type mechanism, but this varies widely.
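On Linux, for example, the usual userland path to that SET FEATURES subcommand is the hdparm tool. A rough Python sketch (the device path is a placeholder, and this needs root):

```python
import subprocess

DEVICE = "/dev/sdX"  # placeholder -- substitute the real drive

# Query the current write-cache setting (hdparm -W with no value reports it).
subprocess.run(["hdparm", "-W", DEVICE], check=True)

# Disable the drive's write cache (this is what issues the SET FEATURES
# disable-write-cache subcommand). Note that the setting is not guaranteed
# to persist across a drive reset or power cycle, so it generally has to
# be reapplied at every boot.
subprocess.run(["hdparm", "-W0", DEVICE], check=True)
```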
It's been my experience that a battery-backed caching disk controller will disable the on-drive cache. I'm not aware of a way to disable the on-disk cache otherwise. Even if you could disable the on-disk cache, performance would suffer significantly.
For a low cost option, you can use an inexpensive UPS that can signal your system for an orderly shutdown.
One of the misconceptions about disk write-back caches is that they only lose data on power loss. This is not always the case, especially on SATA devices. If a SATA device hits an error (such as a corner-case firmware bug or controller bug) and it resets or is reset externally, there is no guarantee that the data in the write-back cache is still available after the hang.
This can lead to scenarios where a device hits a transient error, gets reset, loses whatever dirty data was in its cache, and all of this is silent above the block layer of the driver stack.
Worse, disabling the drive cache via OS tools is also lost on a device reset: even if a device has its cache disabled at start of day, a reset will re-enable write-back caching, and at the next reset the device can then lose data.
SCSI/SAS drives and some SATA drives have the ability to save the write-back cache setting persistently, so that it is not lost across resets -- but in practice this is rarely used.
RAID controllers which integrate the block layer into the upper layers can notice drive resets and disable the write-back cache again -- but standard SATA and SAS controllers will not do this.
This limitation also applies to other SET FEATURES and similar parameters that are configured for performance or reliability; anything set at boot may need to be re-checked and reapplied after a reset (see the sketch below).
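If you do rely on disabling the drive cache, one crude workaround is to periodically re-check the setting and reapply it when a reset has silently undone it. A rough Python sketch, again using hdparm on Linux (the device path and interval are placeholders, not a recommendation):

```python
import subprocess
import time

DEVICE = "/dev/sdX"   # placeholder -- the drive whose cache should stay off
CHECK_INTERVAL = 60   # seconds between checks; tune to taste

def write_cache_enabled(device: str) -> bool:
    """Parse `hdparm -W <device>` output; hdparm reports '(on)' or '(off)'."""
    out = subprocess.run(["hdparm", "-W", device],
                         capture_output=True, text=True, check=True).stdout
    return "(on)" in out

while True:
    # If a reset re-enabled the write cache, turn it back off and note it.
    if write_cache_enabled(DEVICE):
        print(f"{DEVICE}: write cache re-enabled (possible device reset); disabling again")
        subprocess.run(["hdparm", "-W0", DEVICE], check=True)
    time.sleep(CHECK_INTERVAL)
```

This only narrows the window rather than closing it (any dirty data cached between the reset and the next check is still at risk), which is why controller-level or persistent drive-level settings are preferable where available.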
I use a RAID system with a supercapacitor rather than a battery to maintain the cache. Batteries wear out, must be monitored, must be replaced, and represent a potential point of failure in those respects. A capacitor charges on startup, flushes the cache when power from the UPS fails, lasts virtually forever, does not require monitoring, etc. However, unless you are running a business on the poverty line (not uncommon these days) you should have a UPS and software that shuts the system down cleanly on power failure - I usually give the power 5-15 minutes to come back (depending on the UPS load and therefore the battery runtime available) before shutting down.
During a thunderstorm you may (or may have - power systems are getting better) see the lights flicker, sometimes just before they go out. This is a device called a recloser. It's a circuit breaker that, when tripped, tries to reclose the opened switch in case the overload was transient, which most are. If it fails to stay closed after, say, three tries, it stays open. Then some poor guy has to go out in the rain and deal with it. Don't feel too sorry for him: while he's making only twice what you and I do (and twice that again if it's overtime), it is dangerous work.
As you say, a proper battery-backed RAID controller will be expensive, but you can find Dell Perc5/i controllers on eBay for £100 ($150), and especially with RAID5 the speed of a controller like the Perc5/i will amaze you. I have several servers with Perc5/is and six-disk RAID5 arrays, and they are amongst the fastest disks I have ever seen. Especially for database applications, fast disks will really improve performance.
I would bite the bullet and buy a RAID controller.
JR
As far as I understand, fsync() faking is a property of battery-backed RAID controllers, not drives. The RAID controller contains a battery that can preserve its write cache contents until power is restored and the writes can be safely committed to the disks. This allows the controller to acknowledge the write to the OS immediately, because it can offer some level of guarantee that the write will eventually reach the disk.
It should be noted that if the write-back cache fills up, writes will block until cached data has been flushed to the drive. This means the cache is generally not as effective under sustained writes.
How many IOPS does your application require? Are you sure that you are being limited by the drive's write cache, or that a small cache on the drive (small compared to the memory of your server) will be of much benefit?
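If you're not sure, one quick way to get a ballpark figure is to time small synchronous writes on the volume in question: if each write+fsync pair completes in well under a millisecond, something in the stack (controller or drive cache) is almost certainly acknowledging the flush early rather than waiting on the platters. A rough Python sketch (the file path is a placeholder):

```python
import os
import time

PATH = "/path/to/db-volume/fsync-test.dat"  # placeholder -- put this on the volume under test
N = 1000                                     # number of synchronous writes to time

fd = os.open(PATH, os.O_WRONLY | os.O_CREAT, 0o600)
start = time.perf_counter()
for _ in range(N):
    os.pwrite(fd, b"x" * 512, 0)  # small write at the same offset each time
    os.fsync(fd)                  # force it through the caches (in theory)
elapsed = time.perf_counter() - start
os.close(fd)
os.unlink(PATH)

print(f"{N / elapsed:.0f} synchronous writes/sec "
      f"({elapsed / N * 1000:.2f} ms per write+fsync)")
# A 7200 rpm disk whose cache honors flushes can only manage on the order of
# 100-200 of these per second; thousands per second suggests a cache is
# acknowledging the flush before the data hits the media.
```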