I have an XFS file system with file system errors affecting some non-critical files. I wish to repair it; the business wishes to continue to run with those errors. What are the known risks of not repairing an XFS file system that has "Structure needs cleaning" errors?
The business wishes to avoid the possibly lengthy maintenance window that will be needed. I have always taken it on faith that file system corruption must not be tolerated. The business is going to ask me for reasons to fix it other than my own FUD.
What kind of answers are needed
I already have an opinion; I need more than that.
Answers should be backed by evidence. Anecdotes are OK, but only if they are documented first-hand; we don't need "someone told me" answers. Expert opinions are OK, such as an answer from the XFS FAQ or from a developer familiar with XFS internals.
No lay opinions, please. I'm looking for evidence, reliable anecdote, and XFS expert opinion.
Negative answers (e.g. "under similar circumstances, I ran for a year and experienced no serious problems") are OK.
File system details
The file system is 5.4T, with 3.9T (72%) used.
There are 46.6M files.
Error details
There are 55 corrupt directories that cause applications such as ls and find to report "Structure needs cleaning", as mentioned in this XFS FAQ entry:
Q: I see applications returning error 990 or "Structure needs cleaning", what is wrong?
The error 990 stands for EFSCORRUPTED which usually means XFS has detected a filesystem metadata problem and has shut the filesystem down to prevent further damage. Also, since about June 2006, we converted from EFSCORRUPTED/990 over to using EUCLEAN, "Structure needs cleaning." The cause can be pretty much anything, unfortunately - filesystem, virtual memory manager, volume manager, device driver, or hardware. There should be a detailed console message when this initially happens. The messages have important information giving hints to developers as to the earliest point that a problem was detected. It is there to protect your data. You can use xfs_repair to remedy the problem (with the file system unmounted).
XFS errors logged to syslog all look like this:
XFS (sdb): Metadata corruption detected at xfs_inode_buf_verify+0x6d/0xe0 [xfs], block 0x50
XFS (sdb): Unmount and run xfs_repair
XFS (sdb): First 64 bytes of corrupted metadata buffer:
ffff88073fa79000: 49 4e 41 ff 02 01 00 00 00 00 01 f6 00 00 01 f7 INA.............
ffff88073fa79010: 00 00 00 04 00 00 00 00 00 00 00 00 00 00 00 ed ................
ffff88073fa79020: 59 1b af d2 09 62 5c 17 4f e8 f8 73 00 00 00 00 Y....b\.O..s....
ffff88073fa79030: 57 e0 73 b2 27 23 63 cd 00 00 00 00 00 00 00 2f W.s.'#c......../
XFS (sdb): metadata I/O error: block 0x50 ("xfs_trans_read_buf_map") error 117 numblks 16
XFS (sdb): xfs_imap_to_bp: xfs_trans_read_buf() returned error 117.
These errors are repeated many times but only for two blocks.
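To back up the "only two blocks" observation with something reproducible, the kernel log can be reduced to one line per distinct device and block. This is a minimal sketch, assuming the messages have the same shape as the excerpt above; the function name summarise_xfs_corruption is made up for illustration.

```shell
#!/bin/sh
# Sketch only: summarise_xfs_corruption is a hypothetical helper name.
# Reads kernel log text on stdin and prints, for each distinct
# (device, block) pair in "Metadata corruption detected" lines,
# how many times it was reported.
summarise_xfs_corruption() {
    # Keep only the device name and the block address from matching lines,
    # then count identical pairs.
    sed -n 's/.*XFS (\([^)]*\)): Metadata corruption detected at [^,]*, block \(0x[0-9a-fA-F]*\).*/\1 \2/p' |
        sort | uniq -c
}

# Usage (assumes the messages are still in the kernel ring buffer):
#   dmesg | summarise_xfs_corruption
```

Each output line is a count followed by the device and block. If new blocks start appearing over time, that is evidence the damage is not static.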
The filesystem really should be taken offline and checked/repaired, for at least two very good reasons:
You cannot ls the corrupt directories, or create/remove files inside them.
Some suggestions:
Before running xfs_repair, you can dump all filesystem metadata using xfs_metadump and run a "dummy" xfs_repair on them. This will let you observe what xfs_repair will do to your filesystem.
You should repair your filesystem because it could be indicative of an underlying problem with the storage array or hardware.
Make the time for downtime or maintenance... or make the case for better redundancy.
I would be checking into the health of the hardware at this point.
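The xfs_metadump rehearsal suggested earlier can be sketched roughly as follows. This is an illustration under assumptions, not a tested procedure: the function name rehearse_repair, the device argument, and the scratch directory are placeholders, and the scratch directory should live on a different filesystem from the damaged one. Note that the xfs_metadump man page says it should only be used on unmounted, read-only mounted, or frozen filesystems; a dump taken from a live read-write mount may itself be inconsistent.

```shell
#!/bin/sh
# Sketch only: rehearse_repair is a hypothetical helper name.
# Rehearses xfs_repair against a copy of the metadata, never the real device.
#   $1 = block device holding the XFS filesystem (placeholder, e.g. /dev/sdb)
#   $2 = scratch directory on a different filesystem (placeholder)
rehearse_repair() {
    dev=$1
    workdir=$2

    # 1. Copy the metadata (not file contents) into a dump file.
    xfs_metadump "$dev" "$workdir/meta.dump" || return 1

    # 2. Turn the dump back into a filesystem image stored in a regular file.
    xfs_mdrestore "$workdir/meta.dump" "$workdir/meta.img" || return 1

    # 3. No-modify pass over the image: -n reports what would be fixed,
    #    -f tells xfs_repair the target is a regular file.
    xfs_repair -n -f "$workdir/meta.img" > "$workdir/repair-dry-run.log" 2>&1
    echo "xfs_repair -n exit status: $?"
}

# Usage (as root, both arguments are placeholders):
#   rehearse_repair /dev/sdb /mnt/scratch
```

Because the image is disposable, xfs_repair can also be run on it without -n to see the actual changes it would make.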
Assuming you're using an enterprise Linux OS (and not Arch Linux), there's a creative solution available. You could use whatever the current release of the Linux HotCopy utility/driver is and take a block-level snapshot of your filesystem. Mount that filesystem with something like:
From there, you can run xfs_repair on the snapshot to get a feel for the severity of the issue, obtain a list of issues, and use it as a timing test.
Unmount and destroy the snapshot once done.
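A rough sketch of the timing test on a snapshot, assuming the snapshot device is not mounted while xfs_repair runs. The function name check_snapshot, the snapshot device path, and the default log location are hypothetical; how the snapshot is created depends on the snapshot tool and is not shown. The exit-status meanings follow the xfs_repair man page for no-modify mode (0 for no corruption detected, 1 for corruption detected).

```shell
#!/bin/sh
# Sketch only: check_snapshot is a hypothetical helper name.
# Times a no-modify xfs_repair pass and interprets its exit status.
#   $1 = snapshot block device (placeholder)
#   $2 = optional log file path (defaults to a placeholder under /tmp)
check_snapshot() {
    snap_dev=$1
    log=${2:-/tmp/xfs_repair-n.log}

    start=$(date +%s)
    xfs_repair -n "$snap_dev" > "$log" 2>&1
    status=$?
    end=$(date +%s)

    echo "elapsed_seconds=$((end - start))"
    case $status in
        0) echo "result=clean" ;;
        1) echo "result=corruption_detected" ;;
        *) echo "result=error status=$status" ;;
    esac
}

# Usage (as root, device path is a placeholder):
#   check_snapshot /dev/your_snapshot_device /root/xfs_repair-n.log
```

A no-modify pass does not write anything, so the elapsed time is probably best treated as a lower bound for the real repair window rather than an estimate of it.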