A 42 TB LUN, formatted with XFS and shared via NFS, was reported 'unavailable' by customers. In the end I was forced to restart the file server. The XFS LUN won't mount until it is repaired, and to repair it I need to mount it so the log can replay and commit the uncommitted changes. In the past I've learned that dumping the log and then running the repair results in loss of filenames for a portion of the files and folders in the LUN; with 42 TB and potentially hundreds of thousands of files, loss of filenames equates to data loss.
I have a backup, but restoring will require gathering resources. There is roughly 30 TB of data in that LUN that I would need to restore and copy back into place, so I need 30 TB of free space, which is not readily available.
Is there another way of forcing XFS to mount in order to replay those logs and commit the changes?
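For context, the recovery sequence I'm familiar with looks roughly like this (the device path is purely illustrative); the last step is the one that loses filenames, since disconnected files are reattached under lost+found named by inode number:

    # try a normal mount first so XFS replays its own log
    mount -o ro /dev/mapper/san-lun42 /mnt/lun42

    # if the mount refuses, see what xfs_repair would do without changing anything
    xfs_repair -n /dev/mapper/san-lun42

    # optional safety net: capture the metadata before doing anything destructive
    xfs_metadump -g /dev/mapper/san-lun42 /root/lun42.metadump

    # last resort: zero (dump) the log and repair; orphaned files end up in
    # lost+found named by inode number, which is the filename loss described above
    xfs_repair -L /dev/mapper/san-lun42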
This is the third time I've had a LUN 'freeze' on me, be reported as XFS-corrupted in the logs, and force me to reboot the server to bring it back online. XFS seems to have a solid reputation, it has been around for a long time, and it is the default for the file server's OS (RHEL 7). Have I made some terrible error in my configuration that is killing these LUNs?
The SAN presents the LUN, which is mounted nodev,nosuid,nofail on the file server. The file server exports it to workstations, which mount the share as synchronous. Is there something in this combination that would hang the file server?
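To make the setup concrete, the configuration looks roughly like this (device, export path, and hostnames are illustrative):

    # /etc/fstab on the file server
    /dev/mapper/san-lun42  /export/lun42  xfs  nodev,nosuid,nofail  0 0

    # /etc/exports on the file server
    /export/lun42  *(rw,sync,no_subtree_check)

    # on a workstation, mounting the share synchronously
    mount -t nfs -o sync fileserver:/export/lun42 /mnt/lun42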
I came across this question while checking for updates to bugs #1681410 and #1686687 on Launchpad, which have also affected me, with symptoms similar to what you describe (also with XFS, but a larger LUN, and running Ubuntu 16.04 server).
We checked our storage system (which provides extensive logs) in considerable depth, with support from the manufacturer, but ended up not finding any errors or misconfigurations there.
Having run into this several times, we managed to narrow the occurrence of this behaviour down to a time window in which nobody could have been actively working on the system, which let us look at other factors as well. We finally found evidence that the cron-scheduled weekly runs of fstrim (enabled by default on Ubuntu 16.04 server!) seem to trigger the corruption on our filesystem, especially as it takes considerable time to fstrim a LUN of over 100 TB.
I believe the bugs posted on Launchpad quite likely describe this issue, but as far as I can tell it has been reported upstream and never really fixed so far. So for now we simply make sure that no fstrim is run by removing the respective entry from cron.weekly, roughly as shown below. We also check whether the cron job has been re-added after running updates, which is something I'd like to see solved differently.
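As far as I recall, the weekly job lives in /etc/cron.weekly/fstrim on Ubuntu 16.04 (verify the path on your own install); disabling and re-checking it looks like this:

    # confirm the weekly fstrim job is present
    ls -l /etc/cron.weekly/fstrim

    # disable it by moving the script out of cron's path
    mv /etc/cron.weekly/fstrim /root/fstrim.cron.weekly.disabled

    # after package updates, check that it has not been restored
    test -e /etc/cron.weekly/fstrim && echo "fstrim cron job is back"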