I'm researching ways to build and run a huge storage server (it must run Linux) where, for every data array, I can run a consistency check and repair while the usual applications using the arrays (reads and writes) keep working as usual.
Say you have many TB of data on a single traditional Linux filesystem (EXT4, XFS) that is used by hundreds of users, and suddenly the system reports a consistency/corruption problem with it, or you know that the machine recently went down in a dirty way and filesystem corruption is very likely.
Taking the filesystem offline and running the filesystem check can easily take many hours/days of downtime, since neither EXT4 nor XFS can run check & repair while in normal operation; the filesystem needs to be taken offline first.
How can I avoid this weakness of EXT4/XFS on Linux? How can I build a large storage server without ever needing to take it offline for hours of maintenance?
I've read a lot about ZFS and its reliability due to its data/metadata checksumming. Is it possible to run a consistency check and repair on a ZFS filesystem without taking it offline? Would some other, newer filesystem, or some other organization of the data on disk, be better?
One other option I'm considering is to divide the data array into a ridiculously large number (hundreds) of partitions, each with its own independent filesystem, and to fix the applications so they know how to use all those partitions. Then, when one of them needs to be checked, only that one has to be taken offline. Not a perfect solution, but better than nothing.
Is there a perfect solution to this problem?
This would be a case for XFS or ZFS. FSCK is not a concept in the ZFS world.
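For a sense of what replaces fsck in the ZFS world, here is a minimal sketch, assuming a pool named tank (the pool name is only an example):

    # "tank" is an example pool name
    # Start an online scrub; the pool stays imported and applications keep running
    zpool scrub tank

    # Watch progress and any checksum errors found/repaired so far
    zpool status -v tank

    # Cancel the scrub if it competes too much with production I/O
    zpool scrub -s tank

The scrub walks every allocated block, verifies its checksum and, where the pool has redundancy (mirror or RAID-Z), repairs bad copies, all while the pool remains in normal use.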
There's a good amount of skill in building something like this in a robust manner. If there's a budget for bringing in an expert or ZFS consultant, your organization should consider doing so.
The crude reality is that legacy filesystems are not really well suited for multi-TB volumes. For example, Red Hat recommends EXT4 filesystems no bigger than 50 TB, with fsck time being one of the limiting factors. XFS is in better shape, both due to the much faster xfs_repair (compared to the old xfs_check) and to the ongoing project to add online scrub.

EXT4, XFS and other filesystems (BTRFS excluded) can be checked online by taking a snapshot of the main volume and running fsck against the snapshot rather than against the main filesystem itself. This will catch any serious error without requiring downtime, but it clearly needs a volume manager (with snapshot capability) in place under the filesystem. As a side note, this is one of the main reasons why Red Hat uses LVM by default.

That said, the best-known and most reliable filesystem with online scrubbing is clearly ZFS: it was designed from the start to efficiently support very large arrays, and its online scrub facility is extremely effective. If anything, it has the opposite problem: it lacks an offline fsck, which would be useful to correct some rare classes of errors.
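As a rough illustration of the snapshot-based check described above, here is a sketch assuming an EXT4 filesystem on an LVM logical volume /dev/vg0/data (volume names and the snapshot size are made up):

    # /dev/vg0/data and the 20G snapshot size are only examples
    # Create a copy-on-write snapshot of the live volume; size it for the
    # expected write activity during the check
    lvcreate --snapshot --name data_check --size 20G /dev/vg0/data

    # Force a full read-only check against the snapshot while the real
    # filesystem stays mounted and in use (-f: force, -n: change nothing)
    e2fsck -f -n /dev/vg0/data_check

    # Drop the snapshot once the result has been recorded
    lvremove -y /dev/vg0/data_check

If the check reports damage you still need a maintenance window to repair the real filesystem, but at least the outage is planned instead of being discovered at mount time.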
Do a business continuity analysis by asking the organization how much downtime for this storage is acceptable. Doing better than a handful of planned outages and a couple of hours of downtime per year usually requires investing in a multi-node solution.
Protect against as many downtime risks as you can think of. For example, a fire in the data center will shut things down for a couple hours, whatever the storage technology. If service must continue, replicate the data to a different system in a different building.
Regarding the file system, pick something you can fix and/or that your vendor can support. EXT4 will strongly encourage you to fsck every so many mounts. XFS's fsck does nothing thanks to the journal, but xfs_check runs offline. ZFS has no fsck; instead it has online scrubs.
Splitting the data into multiple volumes might make sense to some extent; it would isolate failures, perhaps by organizational unit or application. However, hundreds of small volumes just to keep fsck fast increases the administrative workload, and one supposed advantage of centrally managed storage was less administrative work.
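To make the XFS point concrete, a short sketch, assuming the filesystem lives on /dev/sdb1 and is mounted at /srv/data (both names are illustrative):

    # /dev/sdb1 and /srv/data are only examples
    # fsck.xfs is intentionally a no-op; real checking and repair are done by
    # xfs_repair, which requires the filesystem to be unmounted
    umount /srv/data

    # Dry run: report problems without changing anything
    xfs_repair -n /dev/sdb1

    # Actual repair, still offline
    xfs_repair /dev/sdb1

    mount /dev/sdb1 /srv/data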
For multi-node availability and performance, consider adding another layer: a scale-out distributed file system such as Ceph, Lustre or Gluster. That is quite different from one large array. Implementations vary in whether they use a local file system underneath, and in whether they present block or file protocols to users.
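As one hedged example of what that extra layer can look like, a GlusterFS sketch with three assumed nodes (node1, node2, node3) and made-up brick paths:

    # node names and the /bricks/gv0 paths are only examples
    # From node1: form the trusted pool
    gluster peer probe node2
    gluster peer probe node3

    # Create a 3-way replicated volume from one brick directory per node, then start it
    gluster volume create gv0 replica 3 \
        node1:/bricks/gv0 node2:/bricks/gv0 node3:/bricks/gv0
    gluster volume start gv0

    # Clients mount the volume over the network; any node can serve the mount
    mount -t glusterfs node1:/gv0 /mnt/shared

With a layout like this, a single node can be taken down for local filesystem maintenance while clients keep reading and writing through the remaining replicas.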