I'm familiar with what a BBWC (battery-backed write cache) is intended to do - and I previously used them in my servers, even with a good UPS. There are obviously failures it does not protect against. I'm curious to understand whether it actually offers any real benefit in practice.
(NB: I'm specifically looking for responses from people who run a BBWC, have had crashes/failures, and can say whether the BBWC helped recovery or not.)
Update
After the feedback here, I'm increasingly skeptical as to whether a BBWC adds any value.
To have any confidence about data integrity, the filesystem MUST know when data has been committed to non-volatile storage (not necessarily the disk - a point I'll come back to). It's worth noting that a lot of disks lie about when data has been committed to the disk (http://brad.livejournal.com/2116715.html). While it seems reasonable to assume that disabling the on-disk cache might make the disks more honest, there's still no guarantee that this is the case either.
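To make this concrete, here's a minimal sketch (Python; the filename is arbitrary) of what requesting durability looks like from user space - and why a lying disk defeats it:

```python
import os

# Minimal sketch: ask the kernel to make a write durable with fsync().
# On Linux, an fsync() on a barrier-enabled filesystem ends with a cache-flush
# command (ATA FLUSH CACHE / SCSI SYNCHRONIZE CACHE) being sent to the device.
# A disk that lies acknowledges that command while the data is still only in
# its volatile on-board cache - so even this sequence proves nothing by itself.
fd = os.open("important.dat", os.O_WRONLY | os.O_CREAT, 0o644)
try:
    os.write(fd, b"transaction record\n")
    os.fsync(fd)  # returns only when the kernel *believes* the data is stable
finally:
    os.close(fd)
```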
Due to the typically large buffers in a BBWC, a barrier can require significantly more data to be committed to disk, therefore causing delays on writes: the general advice is to disable barriers when using a non-volatile write-back cache (and to disable on-disk caching). However, this would appear to undermine the integrity of the write operation - just because more data is maintained in non-volatile storage does not mean that it will be more consistent. Indeed, without demarcation between logical transactions there is arguably less opportunity to ensure consistency than otherwise.
If the BBWC were to acknowledge barriers at the point the data enters its non-volatile storage (rather than when it is committed to disk), then it would appear to satisfy the data-integrity requirement without a performance penalty - implying that barriers should still be enabled. However, since these devices generally exhibit behaviour consistent with flushing the data to the physical device (significantly slower with barriers), and given the widespread advice to disable barriers, they cannot be behaving in this way. WHY NOT?
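One way to test which behaviour a given controller actually exhibits is to measure synchronous-write latency, in the spirit of the disk-checking test Brad Fitzpatrick describes in the post linked above. This is only a rough probe, and the numbers in the comments assume a ~7200 rpm disk:

```python
import os, time

# Rough probe: time small synchronous writes for five seconds. On a bare
# 7200 rpm disk honouring cache flushes, each fsync() costs on the order of
# a platter rotation (~8 ms), i.e. at most ~100-150 writes/sec. Sustained
# rates of thousands per second mean the acknowledgement comes from a cache:
# ideally a BBWC, otherwise a volatile cache that is lying.
fd = os.open("probe.dat", os.O_WRONLY | os.O_CREAT, 0o644)
start, count = time.time(), 0
while time.time() - start < 5.0:
    os.pwrite(fd, b"x" * 512, 0)
    os.fsync(fd)
    count += 1
os.close(fd)
print(f"{count / 5.0:.0f} synchronous writes/sec")
```

A BBWC that acknowledged barriers from its non-volatile RAM would show cache-like rates with barriers still enabled - which is exactly the behaviour the advice to disable barriers suggests these devices don't have.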
If the I/O in the OS is modelled as a series of streams, then there is some scope to minimise the blocking effect of a write barrier when write caching is managed by the OS - since at this level only the logical transaction (a single stream) needs to be committed. A BBWC with no knowledge of which bits of data make up a transaction, on the other hand, would have to commit its entire cache to disk. Whether the kernel/filesystems actually implement this in practice would require more effort to establish than I'm willing to invest at the moment.
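The distinction is roughly the one between fdatasync() and sync() at the system-call level - a sketch, assuming Linux semantics and an arbitrary filename:

```python
import os

# fdatasync() commits a single logical stream (one file's data), while
# sync() asks the kernel to commit *all* dirty data - the analogue of a
# BBWC that must flush its whole cache because it cannot see transaction
# boundaries.
fd = os.open("journal.log", os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
os.write(fd, b"commit record\n")
os.fdatasync(fd)  # barrier scoped to this one stream
os.close(fd)

os.sync()         # whole-system flush: every dirty page, every filesystem
```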
A combination of disks telling fibs about what has been committed and sudden loss of power undoubtedly leads to corruption - and with a journalling or log-structured filesystem, which doesn't do a full fsck after an outage, it's unlikely that the corruption will even be detected, let alone an attempt made to repair it.
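One way to detect that kind of silent damage is to checksum the data at the application level - a sketch (the helper name and filenames are my own, nothing standard):

```python
import hashlib

def sha256_of(path: str) -> str:
    # Stream the file so arbitrarily large files fit in constant memory.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# After an unclean shutdown, a journal replay makes the filesystem metadata
# consistent, but only a content check will notice damaged file data, e.g.:
#   if sha256_of("important.dat") != checksum_recorded_earlier: ...
```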
In terms of the modes of failure: in my experience, most sudden power outages occur because of loss of mains power, which is easily mitigated with a UPS and managed shutdown. People pulling the wrong cable out of a rack implies poor datacentre hygiene (labelling and cable management). There are some types of sudden power-loss event which are not prevented by a UPS - a failure in the PSU or VRM, for example. A BBWC with barriers would provide data integrity in the event of such a failure, but how common are these events? Very rare, judging by the lack of responses here.
Certainly, moving the fault tolerance higher up the stack is significantly more expensive than a BBWC - however, implementing a server as a cluster has lots of other benefits for performance and availability.
An alternative way to mitigate the impact of sudden power loss would be to implement a SAN - AoE makes this a practical proposition (I don't really see the point in iSCSI) but again there's a higher cost.
Sure. I've had battery-backed cache (BBWC) and later flash-backed write cache (FBWC) protect in-flight data following crashes and sudden power loss.
On HP ProLiant servers, the typical message at POST is:

1792-Drive Array Reports Valid Data Found in Array Accelerator. Data will automatically be written to drive array.
Which means, "Hey, there's data in the write cache that survived the reboot/power-loss!! I'm going to write that back to disk now!!"
An interesting case was my post-mortem of a system that lost power during a tornado; the array's POST sequence was:
The 1793 POST error is unique. While the system was in use, power was interrupted while data was in the Array Accelerator memory. But because this was a tornado, power was not restored for four days, so the array batteries were depleted and the data within them was lost. The server had two RAID controllers; the other controller had an FBWC unit, which lasts far longer than a battery. The array behind it recovered properly. Some data corruption resulted on the array backed by the depleted battery.
Despite plenty of battery runtime at the facility, four days without power and hazardous conditions made it impossible for anyone to shut the servers down safely.
Yes, had that case.
Server "without UPS" in a data center (with the data center having a UPS). PDU failure - system crashed hard. No data loss.
And that basically is it. The good thing about a BBWC is that it is in the machine, whereas a UPS is external. Have a UPS, by all means - but believe me, sometimes someone does something stupid, like pulling the wrong cable. Oh, THAT cable ;)
I've had two cases where the battery-backed cache in a HW RAID controller failed completely (at two separate companies).
A BBWC relies on the unsurprising assumption that the battery works. The catch is that at some point the battery in the controller fails, and what's devastating is that in many HW RAID controllers it fails silently. We thought we had a cache protected against power loss, but we did not.
On power loss, the data loss on the RAID array was so extensive that the entire disk contents were rendered unrecoverable. Everything was lost. One of the cases involved a machine dedicated entirely to testing - but still.
After that I said "never again", switched to software-based disk mirroring (mdadm) on Linux plus a journal-based filesystem with decent resilience against power loss (ext4), and never looked back. Granted, I've used it on servers that did not have extremely high I/O load.
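For reference, a hedged sketch of that setup (device names are examples only; this must run as root, and only against disks you mean to wipe):

```python
import subprocess

# Two-disk md RAID 1 mirror with ext4 on top, as described above.
# Barriers are left at their ext4 default (enabled), since without a
# non-volatile write cache there is no justification for disabling them.
subprocess.run(["mdadm", "--create", "/dev/md0", "--level=1",
                "--raid-devices=2", "/dev/sda1", "/dev/sdb1"], check=True)
subprocess.run(["mkfs.ext4", "/dev/md0"], check=True)
```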
This seems to necessitate a second answer to the question...
I just had a standalone VMware ESXi host lose a drive in a RAID 5 array. The degraded array impacted performance at the VM and application level.
The IT person at this firm was not aware that a drive had failed, and hard-reset the server (to make it all better?).
The interesting effect of doing this to a compromised array with busy virtual machines running atop was this:
So even though the system was halted abruptly, the in-flight data was protected by the BBWC. The virtual machines all recovered properly and the system is in good shape now.
In addition to "saving your data", a BBWC is also good at buffering writes in the cache, which improves the performance of the I/O subsystem by keeping the disk write queue short. This is particularly important for servers where interactive performance is paramount - for example, Citrix XenApp or Windows Terminal Services.
This is less important for a web server or a file server, where you might not notice - or may even be used to - a little lag. But when you click on an icon in an Office application, you expect responsiveness. And so does your CEO.
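To put rough numbers on that responsiveness, a sketch measuring the tail latency of small synchronous writes (the filename is arbitrary, and the figures in the comments are typical ballpark assumptions, not vendor specs):

```python
import os, statistics, time

# What interactive users feel is the *tail* latency of small synchronous
# writes, not the average. With a BBWC absorbing the writes, p99 typically
# stays well under a millisecond; with a busy uncached disk queue it can
# reach tens of milliseconds - enough to make every click feel sluggish.
fd = os.open("latency.dat", os.O_WRONLY | os.O_CREAT, 0o644)
samples = []
for _ in range(1000):
    t0 = time.perf_counter()
    os.pwrite(fd, b"y" * 4096, 0)
    os.fsync(fd)
    samples.append(time.perf_counter() - t0)
os.close(fd)

samples.sort()
print(f"median: {statistics.median(samples) * 1e3:.2f} ms, "
      f"p99: {samples[int(len(samples) * 0.99)] * 1e3:.2f} ms")
```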