I recently added a 7th 2TB drive to a Linux md software RAID 6 setup. After md finished reshaping the array from 6 to 7 drives (growing it from 8TB to 10TB), I was still able to mount the file system without problems. In preparation for resize2fs, I then unmounted the partition and ran fsck -Cfyv, and was greeted with an endless stream of millions of random errors. Here is a short excerpt:
Pass 1: Checking inodes, blocks, and sizes
Inode 4193823 is too big. Truncate? yes
Block #1 (748971705) causes symlink to be too big. CLEARED.
Block #2 (1076864997) causes symlink to be too big. CLEARED.
Block #3 (172764063) causes symlink to be too big. CLEARED.
...
Inode 4271831 has a extra size (39949) which is invalid Fix? yes
Inode 4271831 is in use, but has dtime set. Fix? yes
Inode 4271831 has imagic flag set. Clear? yes
Inode 4271831 has a extra size (8723) which is invalid Fix? yes
Inode 4271831 has EXTENTS_FL flag set on filesystem without extents support. Clear? yes
...
Inode 4427371 has compression flag set on filesystem without compression support. Clear? yes
Inode 4427371 has a bad extended attribute block 1242363527. Clear? yes
Inode 4427371 has INDEX_FL flag set but is not a directory. Clear HTree index? yes
Inode 4427371, i_size is 7582975773853056983, should be 0. Fix? yes
...
Inode 4556567, i_blocks is 5120, should be 5184. Fix? yes
Inode 4566900, i_blocks is 5160, should be 5200. Fix? yes
...
Inode 5628285 has illegal block(s). Clear? yes
Illegal block #0 (4216391480) in inode 5628285. CLEARED.
Illegal block #1 (2738385218) in inode 5628285. CLEARED.
Illegal block #2 (2576491528) in inode 5628285. CLEARED.
...
Illegal indirect block (2281966716) in inode 5628285. CLEARED.
Illegal double indirect block (2578476333) in inode 5628285. CLEARED.
Illegal block #477119515 (3531691799) in inode 5628285. CLEARED.
Compression? Extents? I've never had ext4 anywhere near this machine!
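(As a sanity check of my own, not from the original post: tune2fs can show which feature flags the file-system actually carries; the device name is an example.)

# An ext3 volume should show flags like has_journal/dir_index here,
# and nothing about extents or compression
tune2fs -l /dev/md0 | grep -i features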
Now, the problem is that fsck keeps dying with the following error message:
Error storing directory block information (inode=5628285, block=0, num=316775570): Memory allocation failed
At first I was able to simply re-run fsck and it would die at a different inode, but now it's settled on 5628285 and I can't get it to go beyond that.
I've spent the last few days searching for fixes to this and found the following 3 "solutions":
- Use 64-bit Linux. /proc/cpuinfo contains lm as one of the processor flags, getconf LONG_BIT returns 64, and uname -a has this to say: Linux <servername> 3.2.0-4-amd64 #1 SMP Debian 3.2.46-1 x86_64 GNU/Linux. Should be all good, no?
- Add a [scratch_files] section with directory = /var/cache/e2fsck to /etc/e2fsck.conf (see the snippet after this list). Did that, and every time I re-run fsck it adds another 500K *-dirinfo-* and an 8M *-icount-* file to the /var/cache/e2fsck directory. So that seems to have its desired effect as well.
- Add more memory or swap space to the machine. 12GB of RAM and a 32GB swap partition should be sufficient, no?
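For reference, the complete stanza in /etc/e2fsck.conf looks like this (the directory must already exist, and should ideally live on a file-system other than the one being checked):

[scratch_files]
directory = /var/cache/e2fsck

With this in place, e2fsck stores its directory-info and inode-count tables in files there instead of holding them in RAM.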
Needless to say: Nothing helped, otherwise I wouldn't be writing here.
Naturally, the file system is now marked as bad and I can't mount it any more. So, as of right now, I've lost 8TB of data to a disk-check?!?!?
This leaves me with 3 questions:
- Is there anything I can do to fix this file system (remember, everything was fine before I ran fsck!) other than spending a month learning the ext3 disk format and then trying to fix it manually with a hex editor???
- How is it possible that something as mission-critical as fsck, for a file-system as popular as ext3, still has issues like this??? Especially since ext3 is over a decade old.
- Is there an alternative to ext3 that doesn't have these sorts of fundamental reliability issues? Maybe jfs?
(I'm using e2fsck 1.42.5 on 64-bit Debian Wheezy 7.1 now, but had the same issues with an earlier version on 32-bit Debian Squeeze)
Just rebuild the array and restore the data from a backup. The whole point of RAID is to minimize downtime. By messing around and trying to fix a problem like this, you just increase your downtime, defeating the whole purpose of RAID. RAID doesn't protect against data loss; it protects against downtime.
After playing around with fsck some more, I found some remedies:

Preventing the 'Memory allocation failed' error

fsck seems to have a major issue with memory leakage. If it is run on a file-system with some problems (real or imaginary), it will "fix" them one-by-one (see screen dump in original question). As it does so, it consumes more and more memory (maybe keeping a change-log?), pretty much without bounds. But fsck can be cancelled at any time (Ctrl-C) and restarted. In this case, it will continue where it left off, but its memory use is reset to next-to-nothing (for a while). With this in mind, the three things that need to be done are:

- Use 64-bit Linux (so that fsck can use the available memory)
- Add plenty of swap space (fsck runs for about 12 hours with it)
- Cancel and restart fsck whenever its memory use gets close to the limit (see the monitoring sketch below)

NOTE: I have no idea if canceling and restarting fsck brings with it any other dangers (probably does), but it seems to work for me.
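One way to babysit it (my own sketch, not from the thread; assumes the checker process is e2fsck):

# Print e2fsck's resident memory (RSS, in KB) every 60 seconds;
# when it creeps toward the RAM+swap limit, Ctrl-C fsck and re-run it
watch -n 60 'ps -C e2fsck -o pid,rss,etime,args'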
Dealing with the resulting damage, if the 'Memory allocation failed' error occurs (IMPORTANT!)

fsck handles the Memory allocation failed error in the worst possible way: it destroys perfectly good data. I'm not sure why, but my guess is that it does some final data-write to disk of things that it had kept in memory, which (due to the error) have meanwhile gotten corrupted.

In my case, the most visible problem was that when I restarted fsck after the error, it sometimes reported a corrupted super-block. The problem is: I have no idea how corrupted the super-block was, especially in the cases where it didn't report it as corrupted. Maybe, if restarted after the error, it then uses incorrect drive meta-data found in the corrupted super-block to do all further checks and ends up fixing "issues" that aren't really there, destroying good data in the process.

Therefore, if fsck ever dies with the Memory allocation failed error, it needs to be restarted using the -b parameter to use a backup super-block that (hopefully) wasn't corrupted by the error. The location of the backup super-blocks can be found using mke2fs -n /dev/....
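Put together, that looks something like this (the device name is an example; if the file-system was created with non-default options, give mke2fs the same block size, or the reported locations will be wrong):

# Dry run: prints the backup super-block locations, writes nothing
mke2fs -n /dev/md0
# Restart the check from one of the listed backups, e.g. block 32768
fsck -b 32768 /dev/md0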
Since I don't know what happens if fsck dies with the backup super-block selected, I usually just abort fsck immediately when it gets to Pass 1: Checking inodes, blocks, and sizes and restart it again without -b, at which point it starts without complaining about a bad super-block. I.e. it seems like the first thing fsck -b does is to restore the main super-block.

Now the one we've all been waiting for:
How to mount a file-system without letting fsck run to completion
This, I found by accident: It turns out that after running fsck -b and aborting it as soon as it prints Pass 1: Checking inodes, blocks, and sizes (before any errors are found), the file-system is left in a mountable state (Yay! I got pretty much all of my data back!).

(Note: There may be another way using mount -o force, but it wasn't needed in my case.)
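Condensed into commands, the sequence that recovered the data (device, super-block location, and mount point are examples, not from the original post):

fsck -b 32768 /dev/md0
# the moment "Pass 1: Checking inodes, blocks, and sizes" appears: Ctrl-C
mount /dev/md0 /mnt/recovered
# then copy everything off the array before touching it further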
How to avoid all these issues in the first place

There seem to be two ways:

- Never let fsck write to the file-system: run it with parameter -n, which opens the fs read-only and answers 'no' to every question (a sketch follows this list). If it shows any problems, delete the entire fs and restore everything from the backup. Since, in this scenario, one would be relying very heavily on the backup, I suggest keeping a backup of the backup. Also, use a copy-tool that somehow ensures that the restore does not create random errors in the process (an MTBF of a trillion r/w-ops is small when dealing with TBs of data). Make sure to plan for the resulting down-time, too, as a multi-TB restore probably takes a while...
- Avoid file-systems whose tools (in particular fsck) aren't robust enough for real production use (yet?). The way fsck
handles the memory error and the fact that the error occurs in the first place are not acceptable in my mind. I will be trying xfs from now on, but don't yet have enough experience with it to tell whether it's any better.
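Such a read-only check could look like this (my sketch; the device name is an example):

# -f forces a check even if the fs looks clean; -n opens it read-only
# and answers "no" to every question, so nothing on disk is changed
fsck -fn /dev/md0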
Unfortunately, I'm not able to "add a comment" but had to chime in here and thank the OP. I had a RAID6 failure and manually assembled 6 of the 8 drives with closely matching Event Counts. However, I wasn't able to mount the assembled array.
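For anyone in the same situation, comparing Event Counts and force-assembling the closest-matching members looks roughly like this (my sketch; device names are made up):

# Compare the Events counter across all member drives
mdadm --examine /dev/sd[b-i]1 | grep -E '^/dev/|Events'
# Force-assemble from the 6 members whose counts (nearly) match
mdadm --assemble --force /dev/md0 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdg1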
It appeared that I needed to use a backup Super-block. Running fsck -b <location> ... eventually died with out-of-memory, which led me to this thread/question.
In short, using fsck -b <location>... and then doing Ctrl+C allowed me to mount my array and recover my files.

Thanks!