I have recently looked into advanced filesystems (Btrfs, ZFS) for data redundancy and availability and got interested in the additional functionality they provide, especially their "self-healing" capabilities against data corruption.
However, I think I need to take a step back and try to understand if this benefit outweighs their disadvantages (Btrfs bugs and unresolved issues & ZFS availability and performance impact) for general home/SMB-usage, compared to a conventional mdadm-Raid1 + Ext4 solution. A mirrored backup is available either way.
Let's assume I have a couple of file servers which are used for archival purposes and have limited resources, but ECC memory and a stable power source.
- How likely am I to even encounter actual data corruption that makes files unreadable? And how can it happen?
- Can Ext4 or the system file manager already detect data errors on copy/move operations, making me at least aware of a problem?
- What happens if one of the mdadm-Raid1 drives holds different data because one drive has bad sectors? Will I still be able to retrieve the correct file, or will the array be unable to decide which copy is the correct one and lose it entirely?
Yes, a functional checksummed filesystem is a very good thing. However, the real motivation is not the mythical "bitrot" which, while it does happen, is very rare. Rather, the main advantage is that such a filesystem provides an end-to-end data checksum, actively protecting you from erroneous disk behavior such as misdirected writes and data corruption related to the disk's own private DRAM cache failing and/or misbehaving due to power supply problems.
I experienced that issue first hand, when a Linux RAID 1 array went bad due to a power supply issue. The cache of one disk started corrupting data, and the ECC embedded in the disk sectors themselves did not catch anything, simply because the written data were already corrupted and the ECC was calculated on the corrupted data themselves.
Thanks to its checksummed journal, which detected something strange and suspended the filesystem, XFS limited the damage; however, some files/directories were irremediably corrupted. As this was a backup machine facing no immediate downtime pressure, I rebuilt it with ZFS. When the problem re-occurred, during the first scrub ZFS corrected the affected blocks by reading the good copies from the other disks. Result: no data loss and no downtime. These are two very good reasons to use a checksumming filesystem.
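For illustration, this kind of repair can also be triggered by hand; a minimal sketch, assuming a pool named tank (the name is a placeholder):

```
# Re-read every block in the pool and verify it against its checksum;
# on redundant vdevs, blocks that fail verification are rewritten from a good copy.
zpool scrub tank

# Show scrub progress and per-device read/write/checksum error counters.
zpool status -v tank
```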
It's worth noting that data checksumming is so valuable that a device mapper target providing it (by emulating the T-10 DIF/DIX specs), called dm-integrity, was developed precisely to extend this protection to classical block devices (especially redundant ones such as RAID1/5/6). By virtue of the Stratis project, it is going to be integrated into a comprehensive management CLI/API.
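As a rough sketch of how that looks in practice (device and mapping names are placeholders), a standalone dm-integrity device can be created with the integritysetup tool shipped with cryptsetup, and the resulting mapped device can then be used as an md RAID member:

```
# Write integrity metadata to the underlying partition (this wipes it),
# then open it as a device-mapper target that verifies a checksum on every read.
integritysetup format /dev/sdb1
integritysetup open /dev/sdb1 sdb1_int

# /dev/mapper/sdb1_int now returns I/O errors for corrupted sectors,
# which lets a RAID1/5/6 layer above it rebuild the data from redundancy.
```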
However, you have a point that any potential advantage brought by such filesystems should be weighed against the disadvantages they inherit. ZFS's main problem is that it is not mainlined into the standard kernel, but otherwise it is very fast and stable. BTRFS, on the other hand, while mainlined, has many important issues and performance problems (the common suggestion for databases or VMs is to disable CoW which, in turn, disables checksumming - which is, frankly, not an acceptable answer). Rather than using BTRFS, I would use XFS and hope for the best, or use dm-integrity-protected devices.
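To make the trade-off concrete: the usual no-CoW workaround for databases/VMs on BTRFS is set per directory (the path below is just an example), and it silently gives up checksumming for everything written there:

```
# New files created under this directory inherit the No_COW attribute,
# which disables copy-on-write *and* data checksumming for them on BTRFS.
mkdir -p /srv/vm-images
chattr +C /srv/vm-images
```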
I had a Seagate HDD that started failing checksums every time I ran zfs scrub. It failed after a few weeks. ZFS and Btrfs have checksums for both data and metadata; ext4 has only metadata checksums.
Only CRC errors and metadata checksum errors are detected. Silent data corruption can still happen.
If a drive has bad sectors, that is not a problem: the entire disk will be marked as "failed", and you still have the other disk, which is "fine". The problem is when data has a correct CRC but is nevertheless corrupted. Given the size of modern disks, this is bound to happen occasionally.
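If you want to check what ext4 actually gives you, a quick sketch (the device name is a placeholder); note that metadata_csum covers only filesystem metadata, never file contents:

```
# See whether an existing ext4 filesystem was created with metadata checksums.
dumpe2fs -h /dev/sdc1 | grep -i features

# Recent e2fsprogs enable this by default when creating a new filesystem.
mkfs.ext4 -O metadata_csum /dev/sdc1
```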
I have been using ZFS in production, for both servers and a home office NAS, under both Linux and FreeBSD, for over 6 years. I have found it to be stable, fast and reliable, and I have personally seen it detect and (when able to) correct errors which a simple md device or ext4 filesystem would not have been able to.

Regarding licensing, ZFS is open source; it's just released under the CDDL license, which is not legally compatible with the GPLv2 license that the linux kernel is released under. Details here. This does not mean it's in a state of "licensing-limbo for a while", nor does it mean there's any technical incompatibility. It simply means the mainline linux kernel source doesn't include the modules and they have to be retrieved from somewhere like https://zfsonlinux.org . Note that some distros, like debian, include ZFS in their distribution; installing ZFS on Debian / Ubuntu can normally be done with a single apt command.

As for performance, given sufficient RAM, ZFS performance for me is anywhere from close to ext4 to surpassing ext4, depending on memory, available pool space, and compressibility of data. ZFS's biggest disadvantage in my opinion is memory usage: if you have less than 16 GiB of RAM for a production server, you may want to avoid ZFS. That is an overly simplified rule of thumb; there is much information online about memory requirements for ZFS. I personally run a 10TB pool and an 800GB pool along with some backup pools on a home office linux system with 32GB RAM, and performance is great. This server also runs LXC and has multiple services running.
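As a sketch of the setup I'm describing (package availability and the right ARC size depend on your distro and workload, so treat the values below as examples):

```
# On Debian (with contrib enabled) or Ubuntu this pulls in the kernel modules
# and the zfs/zpool userland tools.
apt install zfsutils-linux

# Optionally cap the ARC so ZFS doesn't compete with other services for RAM;
# the value is in bytes (8 GiB here) and applies after a module reload or reboot.
echo "options zfs zfs_arc_max=8589934592" > /etc/modprobe.d/zfs.conf
```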
ZFS features go well beyond the data checksumming and self-healing capabilities; its powerful snapshots are much better than LVM snapshots, and its inline lz4 compression can actually improve performance by reducing disk writes. I personally achieve a 1.55x compression ratio on the 10TB pool (storing 9.76GiB of data in only 6.3GiB of space on disk).
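Enabling compression and checking the resulting ratio is a one-liner each; the dataset name is a placeholder, and only data written after the change gets compressed:

```
# Turn on inline lz4 compression for a dataset and inspect the achieved ratio.
zfs set compression=lz4 tank/archive
zfs get compression,compressratio tank/archive
```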
In my experience, ZFS performance degrades when the pool reaches 75% or 80% usage. So long as you stay below that point, performance should be more than sufficient for general home/SMB-usage.
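Pool usage is easy to keep an eye on, for example:

```
# Show size, allocation, free space and capacity (%) for every pool.
zpool list -o name,size,allocated,free,capacity,fragmentation,health
```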
In the cases I have seen ZFS detect and correct bad data, the root cause was unclear but was likely a bad disk block. I also have ECC memory and use a UPS, so I don't believe the data was corrupted in RAM (ECC RAM is strongly recommended with ZFS, but its checksums are useful even without it). I have seen a handful (~10-15) of blocks fail checksums over the past 6 years. One major advantage of ZFS over an md RAID is that ZFS knows which files are affected by a checksum error. So in cases where a backup pool without redundancy had a checksum error, ZFS told me the exact files which were affected, allowing me to replace those files.
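This is what that looks like in practice; the pool name and file path below are made up for illustration:

```
# On a pool without redundancy ZFS cannot repair a bad block,
# but it names the affected files so they can be restored from elsewhere.
zpool status -v backup
#   errors: Permanent errors have been detected in the following files:
#           /backup/photos/2017/IMG_0042.jpg
```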
Despite the license ZFS uses not being compatible with the linux kernel, installing the modules is very easy (at least on Debian) and, once familiar with the toolset, management is straightforward. Despite many people on the internet citing fear of total data loss with ZFS, I have never lost any data since making the move to ZFS, and the combination of ZFS snapshots and data checksums/redundancy has personally saved me from experiencing data loss multiple times. It's a clear win and I'll personally never go back to an md array.

Given enough time, it's almost certain to happen. Coincidentally, it happened to me last week. My home file server developed some bad RAM that was causing periodic lockups. Eventually I decided to simply retire the machine (which was getting rather old) and moved the drives to an enclosure on a different machine. The post-import scrub found and repaired 15 blocks with checksum errors, out of an 8TB pool, which were presumably caused by the bad RAM and/or the lockups. The disks themselves had a clean bill of health from SMART, and tested fine on a subsequent scrub.
No, not really. There might be application-level checksums in some file formats, but otherwise, nothing is keeping an eye out for the kind of corruption that happened in my case.
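The closest you can get on plain ext4 is doing it yourself; a minimal sketch with standard tools (paths are examples):

```
# Record checksums once, then re-verify later or after a copy/move
# to at least *detect* silent corruption (this cannot repair anything).
find /srv/archive -type f -print0 | xargs -0 sha256sum > /root/archive.sha256
sha256sum -c --quiet /root/archive.sha256
```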
If you know definitively that one drive is bad, you can fail that drive out of the array and serve all reads from the good drive (or, more sensibly, replace the bad drive, which will copy the data from the good drive onto the replacement). But if the data on the drives differs due to random bit flips on write (the kind of thing that happened to me and shodanshok) there is no definitive way to choose which of the two is correct without a checksum.
Also, md generally won't notice that two drives in a mirror are out of sync during normal operation — it will direct reads to one disk or the other in whatever way will get the fastest result. There is a 'check' function that will read both sides of a mirror pair and report mismatches, but only if you run it, or if your distribution is set up to run it periodically and report the results.
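For reference, both operations look roughly like this with md (array and device names are placeholders):

```
# Run a consistency check over the mirror and see how many mismatched
# sectors were found; md cannot tell which side of the mirror is correct.
echo check > /sys/block/md0/md/sync_action
cat /sys/block/md0/md/mismatch_cnt

# If one member is known to be bad, fail and remove it so all reads
# come from the healthy disk, then add a replacement to resync.
mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1
mdadm /dev/md0 --add /dev/sdc1
```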
I can add that ZFS is insanely robust, mostly thanks to its origins (it was developed by Sun Microsystems back in 2001). The open source version currently available is a fork of one of the last open source versions released by Sun Microsystems around 10 years ago; it has been further developed by the open source community since Oracle closed the ZFS source after acquiring Sun Microsystems.
Oracle themselves still also maintain a closed source version of ZFS that's used in their enterprise storage systems.
ZFS has a bit of a learning curve though; as it's quite powerful, there are quite a lot of things that can be tweaked. It's also one of the few storage filesystems I've worked on where maintenance is actually easy. I had one case where a pool needed to be migrated from a RAID5 setup to a RAID6 (or, more correctly, from a RAID-Z1 to a RAID-Z2). Normally, an operation like this would mean copying out all the data, reconfiguring the RAID, and copying the data back in. In ZFS, you attach your secondary storage, copy the pool out with one command, reconfigure the array as you like, and copy the pool back in with another command.
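Roughly, that migration boils down to something like this (pool names are placeholders, and the destroy/recreate step in the middle is elided):

```
# Snapshot everything and replicate the whole pool to temporary storage.
zfs snapshot -r tank@migrate
zfs send -R tank@migrate | zfs receive -F spare/tank

# ...destroy "tank", recreate it as a raidz2 vdev, then replicate back:
zfs send -R spare/tank@migrate | zfs receive -F tank
```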
There are some gotchas, though.
For beginners and home environments I generally recommend FreeNAS; it's very well maintained and simple to set up, which is good for a beginner.
Obviously, given infinite time you're certain to encounter it.
Realistically though, it's still pretty likely unless you have very expensive enterprise grade hardware, and even then it's not hugely unlikely.
More likely, though, you'll end up encountering data corruption that just changes the file contents but doesn't make them unreadable (unless you've got insane numbers of tiny files, simple statistics means you're more likely to have corruption in file data than in file metadata). When this happens, you can get all kinds of odd behaviors, just like if you had bad hardware (though usually it will be more consistent and localized than bad hardware). If you're lucky, it's some non-critical data that gets corrupted, and you can easily fix things. If you're moderately unlucky, you have to rebuild the system from scratch. If you're really unlucky, you just ran into an error that caused you to go bankrupt, because it happened to hit critical data in a production system and your service is now down while you rebuild the whole thing from scratch and try to put the database back the way it should be.
Short answer, data corruption is likely enough that even home users should be worrying about it.
Ext4 is notoriously bad on this point. Its default behavior on running into an internal consistency error is to mark the filesystem for checking on the next remount, and then continue as if nothing were wrong. I've lost whole systems in the past because of this behavior.
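You can at least make ext4 fail loudly instead; a small sketch (the device is a placeholder):

```
# Check the current error behavior and switch it to remount-ro,
# so an internal inconsistency stops further writes instead of being ignored.
tune2fs -l /dev/sda2 | grep -i "errors behavior"
tune2fs -e remount-ro /dev/sda2
```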
More generically, in most cases, the best you can hope for from a filesystem not specifically designed to verify its data is to remount read-only if it runs into an internal error with its own data structures or file metadata. The thing is, though, that unless the filesystem specifically handles verification of its own internal structures beyond simple stuff like bounds checking, this won't catch everything; things will just go wrong in odd ways.
To get anything more, you need the filesystem to verify its own internal data structures with checksums, error-correcting codes, erasure coding, or some similar approach. Even then, unless it does the same for file data, you're still at non-negligible risk of data loss.
It depends on the RAID level, the exact RAID implementation, and whether or not you have it set to auto-recover. Assuming you have auto recovery on:
For RAID1 and RAID10:
For RAID4/5/6 and other cases of erasure coding, almost everything behaves the same when it comes to recovery, either data gets rebuilt from the remaining devices if it can be, or the array is effectively lost. ZFS and BTRFS in this case just give you a quicker (in terms of total I/O) way to check if the data is correct or not.
Note that none of these operate on a per-file basis, and most don't allow you to easily pick the 'correct' copy; they either work completely, fail completely, or alternately return good or bad data for the out-of-sync region.
For completeness, I'd like to mention https://bcachefs.org, which is admittedly not in the kernel yet, but IMHO slated to supplant ZFS and btrfs once it does.
It's based on bcache, which has already been in the kernel for a long time, and builds filesystem features on top of bcache's B-tree system.
The lone developer works on it full time, sponsored via Patreon, and has a strong focus on reliability.
Not for the faint of heart at the moment, but as this comment ages, bcachefs should improve :)