After a nasty server crash I was unable to mount a JFS partition on Linux. The jfs_fsck tool returns:
Duplicate block references have been detected in Metadata. CANNOT CONTINUE.
processing terminated: <date> <time> with return code: 10060 exit code: 4.
The 12 TB partition holds results of scientific computations that can be reproduced in a matter of a few weeks and are not backed up, though I cannot exclude the possibility of some non-reproducible data lying around due to user negligence.
My plan to recover the partition was as follows (a rough command sketch follows the list):
1. Replay the journal and mount the partition read-only.
2. Copy the files that can be read to another filesystem.
3. Identify the blocks with duplicate references using jfs_fsck -v.
4. Identify the inodes corresponding to these blocks with jfs_debugfs.
5. Find the filesystem objects corresponding to the inodes using find -inum.
6. Unlink the objects altogether using jfs_debugfs.
7. Run jfs_fsck again and hope it completes without an error.
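Roughly, the commands I had in mind for steps (1) and (3) look like this; the device name and mount point are placeholders, and the flags are taken from my reading of fsck.jfs(8), so they should be double-checked against the installed jfsutils version:

    # step (1): replay the journal, then mount read-only
    jfs_fsck --replay_journal_only /dev/sdb1
    mount -t jfs -o ro /dev/sdb1 /mnt/jfs

    # step (3): after copying data out, unmount and list the duplicate
    # block references without modifying anything (-n = report only)
    umount /mnt/jfs
    jfs_fsck -n -v /dev/sdb1 | tee fsck-verbose.log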
This plan worked out only for steps (1) to (4). It first failed at step (5), where find did not turn up a single inode after running for several hours and might well have run forever: when copying files I had found that some directories had their B+trees turned into graphs with loops, so it was entirely possible that a directory traversal would never terminate.
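In hindsight, each lookup could at least have been bounded so that a looping traversal cannot hang indefinitely; the time limit, depth limit, and the suspect-inodes.txt file below are arbitrary examples:

    # bound each inode lookup so a corrupted (cyclic) directory graph
    # cannot keep the search running forever
    while read -r ino; do
        timeout 2h find /mnt/jfs -xdev -maxdepth 20 -inum "$ino" -print \
            >> inode-paths.txt 2>> find-errors.log
    done < suspect-inodes.txt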
I jumped straight to step (6) and first unlinked the directories where I could find corrupted structures, but this did not help jfs_fsck run to completion. I then removed all directories except the root directory entry, yet jfs_fsck still failed to complete.
I guess I have to edit not only the directory structure but also the block allocation maps. However, I could not find a way to do it with jfs_debugfs.
Are there tools that can help make a partition with duplicate block references amenable to recovery?
If you can mount the disk read-only at all, you could try to copy out whatever data is still readable. If only the journal is corrupted, it may be that just the last few file changes are lost, so most of the files should come out intact.
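For example (device, mount point and destination directory are placeholders), something along these lines copies out what is readable and logs what is not:

    mount -t jfs -o ro /dev/sdb1 /mnt/jfs
    # rsync skips files it cannot read and keeps going; read errors end up in the log
    rsync -a /mnt/jfs/ /srv/rescue/ 2> rsync-errors.log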
However, for the data files themselves, how would you know whether they are correct at all, or whether they have been corrupted as well?
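There is no general way to tell from the filesystem alone; at best you could checksum the rescued copies so that later comparisons (against regenerated results, or against copies users may have kept elsewhere) are cheap. The destination path below is just an example:

    # build a checksum manifest of the rescued files for later spot checks
    find /srv/rescue -type f -exec sha256sum {} + > rescue-manifest.sha256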
Of course, a journal corruption could also be hiding a more serious disk problem.
At this point, my thought would be that to ensure the integrity of the data, you'll probably have to rerun the simulations.