Supposedly (see, e.g., a question about it here), with NCQ-enabled drives the drive write cache is safe, in that the drive doesn't lie to the OS about data being committed to the platters when it isn't. I'm trying to figure out what settings are required to make this a reality.
I'm using diskchecker.pl to confirm whether all blocks survive a pull of the power plug. The server is configured like this:
- 4x ST3500514NS running in Linux MD RAID10. Intel 3420 chipset. In AHCI mode.
- LVM running on RAID10.
- Tested filesystem is ext4 (with barrier=1,data=ordered) on a logical volume. I also tried testing directly on a logical volume (block device); that didn't help.
- Debian 6.0 (squeeze); kernel 2.6.32-5-amd64
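For reference, the property being tested is roughly: once fsync() has returned for a block, that block must still be readable after the power is cut. Below is a minimal Python sketch of the write side only. It is not diskchecker.pl's code; the 4096-byte block size and the helper names durable_write and read_block are made up for illustration. A real power-pull test also needs the acknowledgements recorded somewhere that stays powered.

```python
import os

BLOCK_SIZE = 4096  # assumed block size for this sketch


def durable_write(path, block_index, payload):
    """Write one padded block at its offset and return only after fsync()."""
    data = payload.ljust(BLOCK_SIZE, b"\0")[:BLOCK_SIZE]
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        os.lseek(fd, block_index * BLOCK_SIZE, os.SEEK_SET)
        view = memoryview(data)
        while len(view) > 0:
            # os.write may write fewer bytes than requested, so loop.
            written = os.write(fd, view)
            view = view[written:]
        # Ask the kernel to push the data (and, with working barriers,
        # the drive's volatile cache) to stable storage.
        os.fsync(fd)
    finally:
        os.close(fd)
    return data


def read_block(path, block_index):
    """Read one block back, e.g. after the machine has been power-cycled."""
    with open(path, "rb") as f:
        f.seek(block_index * BLOCK_SIZE)
        return f.read(BLOCK_SIZE)
```

If any layer between the filesystem and the platters drops the flush, durable_write still returns normally, which is exactly why the check has to happen after a real power cut.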
If I turn off the write cache (hdparm -W0), then it works (at a huge performance penalty). So it seems like the upper layers are capable.
I've tried enabling FUA in libata (by passing fua=1 when loading the module, and confirming via dmesg), but that did not help.
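Besides dmesg, the setting can probably be double-checked from sysfs. The sketch below assumes the libata parameter is exposed at /sys/module/libata/parameters/fua, which may not hold on every kernel or when libata is configured differently; the function name is hypothetical.

```python
from pathlib import Path

# Assumed sysfs location of the libata "fua" module parameter.
LIBATA_FUA_PARAM = Path("/sys/module/libata/parameters/fua")


def libata_fua_enabled(param_path=LIBATA_FUA_PARAM):
    """Return True or False for the parameter, or None if it cannot be read."""
    try:
        text = Path(param_path).read_text().strip()
    except OSError:
        # Parameter file missing or unreadable: report "unknown".
        return None
    return text in ("1", "Y", "y")
```

A True result only says the module parameter is set; it says nothing about whether the layers above (MD, LVM) actually pass the flush requests down.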
Any suggestions on how to make this work?
edit: found the reason (see my answer); any suggestions on how to get at least some of the performance back?
Upgrading to kernel 2.6.38-2-amd64 (from sid) fixes the problem, at the cost of a huge performance penalty (very similar to just turning off the write caches).
Doing some research into this, it seems that MD didn't support I/O barriers (except on RAID1) until 2.6.33-rc1 (commit a2826aa92e2e14db372eda01d333267258944033).
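As a rough way to spot the same situation on other machines, here is a small Python sketch that compares a kernel release string against 2.6.33 and treats RAID1 as the exception. The function names and the "raid1"/"raid10" level strings are made up for illustration, and it ignores distribution backports, so treat the result as a hint rather than proof.

```python
import re

# First mainline release where MD passes barriers on levels other than RAID1,
# per the commit mentioned above.
MD_BARRIER_MIN = (2, 6, 33)


def parse_kernel_release(release):
    """Turn a string like '2.6.32-5-amd64' into a tuple like (2, 6, 32)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError("unrecognized kernel release: %r" % release)
    return tuple(int(part) if part else 0 for part in match.groups())


def md_passes_barriers(release, raid_level):
    """Guess whether MD on this kernel forwards barriers for raid_level."""
    if raid_level == "raid1":
        # RAID1 already supported barriers before 2.6.33.
        return True
    return parse_kernel_release(release) >= MD_BARRIER_MIN
```

On a live system the release string could come from platform.release(); with the kernels from this question, the squeeze kernel on RAID10 comes out unsupported and the sid kernel comes out supported.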
Yes, as far as I know this is the cost of being safe. The PostgreSQL mailing lists have many threads about data safety and its speed cost at every filesystem and storage layer. Lately they have been discussing SSD safety, for example: only the Vertex 2 Pro or the latest Intel SSD series, which have a small protected memory attached (like the battery-backed cache in a RAID controller), are safe for database use, and the problem with SSDs can't be fixed by disabling the write cache.
Here are two links, but there are many more examples in the mailing list if you search it.
http://archives.postgresql.org/pgsql-performance/2010-06/msg00076.php
http://archives.postgresql.org/pgsql-general/2011-04/msg00709.php
That's why you really should be using a hardware RAID controller with a BBU (battery backup unit). Then you can have your write cache on and still be safe.