Folks please help - I am a newb with a major headache at hand (perfect storm situation).
I have three 1TB HDDs in my Ubuntu 11.04 box configured as software RAID 5. The data had been copied weekly onto a separate external hard drive until that drive completely failed and was thrown away. A few days back we had a power outage, and after rebooting my box wouldn't mount the RAID. In my infinite wisdom I entered
mdadm --create -f...
command instead of
mdadm --assemble
and didn't notice the travesty I had done until afterwards. It started the array degraded and proceeded to build and sync it, which took ~10 hours. After I was back I saw that the array was successfully up and running, but the RAID was not:
I mean the individual drives are partitioned (partition type fd
), but the md0
device is not. Realizing in horror what I had done, I am trying to find some solutions. I just pray that --create
didn't overwrite the entire content of the hard drives.
Could someone PLEASE help me out with this? The data on the drives is very important and unique: ~10 years of photos, docs, etc.
Is it possible that specifying the participating hard drives in the wrong order made mdadm
overwrite them? When I do
mdadm --examine --scan
I get something like ARRAY /dev/md/0 metadata=1.2 UUID=f1b4084a:720b5712:6d03b9e9:43afe51b name=<hostname>:0
Interestingly enough, the name used to be 'raid' and not the hostname with :0 appended.
Here are the 'sanitized' config entries:
DEVICE /dev/sdf1 /dev/sde1 /dev/sdd1
CREATE owner=root group=disk mode=0660 auto=yes
HOMEHOST <system>
MAILADDR root
ARRAY /dev/md0 metadata=1.2 name=tanserv:0 UUID=f1b4084a:720b5712:6d03b9e9:43afe51b
Here is the output from mdstat
cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid5 sdd1[0] sdf1[3] sde1[1]
1953517568 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]
unused devices: <none>
fdisk shows the following:
fdisk -l
Disk /dev/sda: 80.0 GB, 80026361856 bytes
255 heads, 63 sectors/track, 9729 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000bf62e
Device Boot Start End Blocks Id System
/dev/sda1 * 1 9443 75846656 83 Linux
/dev/sda2 9443 9730 2301953 5 Extended
/dev/sda5 9443 9730 2301952 82 Linux swap / Solaris
Disk /dev/sdb: 750.2 GB, 750156374016 bytes
255 heads, 63 sectors/track, 91201 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000de8dd
Device Boot Start End Blocks Id System
/dev/sdb1 1 91201 732572001 8e Linux LVM
Disk /dev/sdc: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00056a17
Device Boot Start End Blocks Id System
/dev/sdc1 1 60801 488384001 8e Linux LVM
Disk /dev/sdd: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000ca948
Device Boot Start End Blocks Id System
/dev/sdd1 1 121601 976760001 fd Linux raid autodetect
Disk /dev/dm-0: 1250.3 GB, 1250254913536 bytes
255 heads, 63 sectors/track, 152001 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000
Disk /dev/dm-0 doesn't contain a valid partition table
Disk /dev/sde: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x93a66687
Device Boot Start End Blocks Id System
/dev/sde1 1 121601 976760001 fd Linux raid autodetect
Disk /dev/sdf: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0xe6edc059
Device Boot Start End Blocks Id System
/dev/sdf1 1 121601 976760001 fd Linux raid autodetect
Disk /dev/md0: 2000.4 GB, 2000401989632 bytes
2 heads, 4 sectors/track, 488379392 cylinders
Units = cylinders of 8 * 512 = 4096 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 524288 bytes / 1048576 bytes
Disk identifier: 0x00000000
Disk /dev/md0 doesn't contain a valid partition table
Per suggestions, I cleaned up the superblocks and re-created the array with the --assume-clean
option, but with no luck at all.
Is there any tool that will help me revive at least some of the data? Can someone tell me what mdadm --create does when it syncs that destroys the data, so I can write a tool to undo whatever was done?
After re-creating the RAID I ran fsck.ext4 /dev/md0, and here is the output:
root@tanserv:/etc/mdadm# fsck.ext4 /dev/md0
e2fsck 1.41.14 (22-Dec-2010)
fsck.ext4: Superblock invalid, trying backup blocks...
fsck.ext4: Bad magic number in super-block while trying to open /dev/md0
The superblock could not be read or does not describe a correct ext2
filesystem. If the device is valid and it really contains an ext2
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
e2fsck -b 8193 <device>
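A note on that generic message: the 8193 it suggests assumes a 1KB block size; filesystems with 4KB blocks keep their first backup superblock at block 32768 instead (which matches the mkfs.ext4 -n output further down). A small sketch of the arithmetic, assuming only the standard rule that a group spans 8 blocks per block-size byte, with the first data block at 1 for 1KB filesystems:

```shell
# First backup superblock = start of block group 1.
# blocks_per_group = 8 * block_size; 1KB filesystems start at block 1.
for bs in 1024 4096; do
  bpg=$(( 8 * bs ))
  first=$(( bs == 1024 ? 1 : 0 ))
  echo "block size $bs -> first backup superblock at block $(( bpg + first ))"
done
```

This prints 8193 for 1KB blocks and 32768 for 4KB blocks, so on a large array the -b 8193 hint is almost never the right number.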
Per Shane's suggestion I tried:
root@tanserv:/home/mushegh# mkfs.ext4 -n /dev/md0
mke2fs 1.41.14 (22-Dec-2010)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=128 blocks, Stripe width=256 blocks
122101760 inodes, 488379392 blocks
24418969 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=0
14905 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
102400000, 214990848
and ran fsck.ext4 with every backup block, but all returned the following:
root@tanserv:/home/mushegh# fsck.ext4 -b 214990848 /dev/md0
e2fsck 1.41.14 (22-Dec-2010)
fsck.ext4: Invalid argument while trying to open /dev/md0
The superblock could not be read or does not describe a correct ext2
filesystem. If the device is valid and it really contains an ext2
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
e2fsck -b 8193 <device>
Any suggestions?
Regards!
Ok - something was bugging me about your issue, so I fired up a VM to dive into the behavior that should be expected. I'll get to what was bugging me in a minute; first let me say this:
Back up these drives before attempting anything!!
You may have already done damage beyond what the resync did; can you clarify what you meant when you said:
If you ran a
mdadm --misc --zero-superblock
, then you should be fine.
Anyway, scavenge up some new disks and grab exact current images of them before doing anything at all that might do any more writing to these disks.
That being said.. it looks like data stored on these things is shockingly resilient to wayward resyncs. Read on, there is hope, and this may be the day that I hit the answer length limit.
The Best Case Scenario
I threw together a VM to recreate your scenario. The drives are just 100 MB so I wouldn't be waiting forever on each resync, but this should be a pretty accurate representation otherwise.
Built the array as generically and default as possible - 512k chunks, left-symmetric layout, disks in letter order.. nothing special.
So far, so good; let's make a filesystem, and put some data on it.
Ok. We've got a filesystem and some data ("data" in
datafile
, and 5MB worth of random data with that SHA1 hash in
randomdata
) on it; let's see what happens when we do a re-create.
The resync finished very quickly with these tiny disks, but it did occur. So here's what was bugging me from earlier: your
fdisk -l
output. Having no partition table on the
md
device is not a problem at all; it's expected. Your filesystem resides directly on the fake block device with no partition table.
Yeah, no partition table. But...
Perfectly valid filesystem, after a resync. So that's good; let's check on our data files:
Solid - no data corruption at all! But this is with the exact same settings, so nothing was mapped differently between the two RAID groups. Let's drop this thing down before we try to break it.
Taking a Step Back
Before we try to break this, let's talk about why it's hard to break. RAID 5 works by using a parity block that protects an area the same size as the block on every other disk in the array. The parity isn't just on one specific disk, it's rotated around the disks evenly to better spread read load out across the disks in normal operation.
The XOR operation to calculate the parity looks like this:
So, the parity is spread out among the disks.
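The parity relationship can be sketched with plain shell arithmetic on single bytes (the values here are arbitrary examples, not real chunk contents):

```shell
# Parity is the XOR of the data chunks in a stripe.
d1=$(( 0xA5 )); d2=$(( 0x3C ))
parity=$(( d1 ^ d2 ))
printf 'parity     = 0x%02X\n' "$parity"
# "Lose" d1 and rebuild it from the parity and the surviving chunk:
rebuilt=$(( parity ^ d2 ))
printf 'rebuilt d1 = 0x%02X\n' "$rebuilt"
```

The same XOR rebuild works no matter which single chunk of the stripe is missing, which is exactly why a resync that merely recomputes blocks in place can leave the data intact.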
A resync is typically done when replacing a dead or missing disk; it's also done on
mdadm create
to ensure that the data on the disks aligns with what the RAID's geometry is supposed to look like. In that case, the last disk in the array spec is the one that is 'synced to' - all of the existing data on the other disks is used for the sync.
So, all of the data on the 'new' disk is wiped out and rebuilt; either building fresh data blocks out of parity blocks for what should have been there, or else building fresh parity blocks.
What's cool is that the procedure for both of those things is the exact same: an XOR operation across the data from the rest of the disks. The resync process in this case may have in its layout that a certain block should be a parity block, and think it's building a new parity block, when in fact it's re-creating an old data block. So even if it thinks it's building this:
...it may just be rebuilding
DISK5
from the layout above.
So, it's possible for data to stay consistent even if the array's built wrong.
Throwing a Monkey in the Works
(not a wrench; the whole monkey)
Test 1:
Let's make the array in the wrong order!
sdc
, then
sdd
, then
sdb
...
Ok, that's all well and good. Do we have a filesystem?
Nope! Why is that? Because while the data's all there, it's in the wrong order; what was once 512KB of A, then 512KB of B, A, B, and so forth, has now been shuffled to B, A, B, A. The disk now looks like gibberish to the filesystem checker, so it won't run. The output of
mdadm --misc -D /dev/md1
gives us more detail; it looks like this:
When it should look like this:
So, that's all well and good. We overwrote a whole bunch of data blocks with new parity blocks this time out. Re-create, with the right order now:
Neat, there's still a filesystem there! Still got data?
Success!
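The wrong-order shuffle can be illustrated without any RAID at all, by striping a 16-byte "file" across two files in 4-byte chunks (a toy sketch, not mdadm's actual on-disk format):

```shell
# Chunks 0 and 2 live on disk1; chunks 1 and 3 on disk2.
printf 'AAAACCCC' > disk1.bin
printf 'BBBBDDDD' > disk2.bin
# Reading the disks in the correct order reproduces A,B,C,D:
paste -d '' <(fold -w4 disk1.bin) <(fold -w4 disk2.bin) | tr -d '\n'; echo
# Reading them in the wrong order yields the same bytes, shuffled:
paste -d '' <(fold -w4 disk2.bin) <(fold -w4 disk1.bin) | tr -d '\n'; echo
```

Every byte survives; only the chunk sequence is wrong, which is why re-creating with the right order brings the filesystem straight back.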
Test 2
Ok, let's change the chunk size and see if that gets us some brokenness.
Yeah, yeah, it's hosed when set up like this. But, can we recover?
Success, again!
Test 3
This is the one that I thought would kill data for sure - let's do a different layout algorithm!
Scary and bad - it thinks it found something and wants to do some fixing! Ctrl+C!
Ok, crisis averted. Let's see if the data's still intact after resyncing with the wrong layout:
Success!
Test 4
Let's also just prove real quick that superblock zeroing isn't harmful:
Yeah, no big deal.
Test 5
Let's just throw everything we've got at it. All 4 previous tests, combined.
Onward!
The verdict?
Wow.
So, it looks like none of these actions corrupted data in any way. I was quite surprised by this result, frankly; I expected moderate odds of data loss on the chunk size change, and some definite loss on the layout change. I learned something today.
So .. How do I get my data??
As much information as you have about the old system would be extremely helpful to you: the filesystem type, any old copies of your
/proc/mdstat
with information on drive order, algorithm, chunk size, and metadata version. Do you have mdadm's email alerts set up? If so, find an old one; if not, check
/var/spool/mail/root
. Check your
~/.bash_history
to see if your original build is in there.
So, the list of things that you should do:
- Back these drives up with
dd
before doing anything!!
- Try to
fsck
the current, active md - you may have just happened to build in the same order as before. If you know the filesystem type, that's helpful; use that specific
fsck
tool. If any of the tools offer to fix anything, don't let them unless you're sure that they've actually found the valid filesystem! If an
fsck
offers to fix something for you, don't hesitate to leave a comment to ask whether it's actually helping or just about to nuke data.
- Try re-creating the array with different configurations. If you have an old
/proc/mdstat
, then you can just mimic what it shows; if not, then you're kinda in the dark - trying all of the different drive orders is reasonable, but checking every possible chunk size with every possible order is futile. For each,
fsck
it to see if you get anything promising.
So, that's that. Sorry for the novel; feel free to leave a comment if you have any questions, and good luck!
footnote: under 22 thousand characters; 8k+ shy of the length limit
I had a similar problem:
after a failure of a software RAID5 array I fired
mdadm --create
without giving it
--assume-clean
, and could not mount the array anymore. After two weeks of digging I finally restored all the data. I hope the procedure below will save someone some time.
Long Story Short
The problem was caused by the fact that
mdadm --create
made a new array that was different from the original in two aspects: a different order of partitions, and a different RAID data offset.
As has been shown in the brilliant answer by Shane Madden,
mdadm --create
does not destroy the data in most cases! After finding the partition order and data offset, I could restore the array and extract all data from it.
Prerequisites
I had no backups of RAID superblocks, so all I knew was that it was a RAID5 array on 8 partitions created during installation of Xubuntu 12.04.0. It had an ext4 filesystem. Another important piece of knowledge was a copy of a file that was also stored on the RAID array.
Tools
Xubuntu 12.04.1 live CD was used to do all the work. Depending on your situation, you might need some of the following tools:
a version of mdadm that allows specifying the data offset
bgrep - searching for binary data
hexdump, e2fsck, mount and a hexadecimal calculator - standard tools from repos
Start with Full Backup
Naming of device files, e.g.
/dev/sda2
/dev/sdb2
etc., is not persistent, so it's better to write down your drives' serial numbers, given by:
Then hook up an external HDD and back up every partition of your RAID array like this:
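The original backup command isn't reproduced here; the general shape is one dd per member partition, demonstrated below on an ordinary file (on real hardware the input would be the /dev/sdX2 partition and the output a file on the external drive):

```shell
# Stand-in "partition": 2MB of random data.
dd if=/dev/urandom of=member.img bs=1M count=2 2>/dev/null
# Image it, then verify the copy is byte-identical before proceeding.
dd if=member.img of=member.backup bs=1M conv=noerror,sync 2>/dev/null
cmp member.img member.backup && echo "backup verified"
```

conv=noerror,sync keeps dd going past read errors on a dying drive (padding unreadable blocks), which is usually what you want for a rescue image.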
Determine Original RAID5 Layout
Various layouts are described here: http://www.accs.com/p_and_p/RAID/LinuxRAID.html
To find how stripes of data were organized on the original array, you need a copy of a random-looking file that you know was stored on the array. The default chunk size currently used by
mdadm
is 512KB. For an array of N partitions, you need a file of size at least (N+1)*512KB. A JPEG or video is good as it provides relatively unique substrings of binary data. Suppose our file is called
picture.jpg
. We read 32 bytes of data at N+1 positions, starting from 100k and incrementing by 512k:
We then search for occurrences of all of these bytestrings on all of our raw partitions, so in total (N+1)*N commands, like this:
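The probe-cutting half of that procedure can be sketched like this (a stand-in file is generated so the snippet is self-contained; a real run would read your actual picture.jpg, and the search would then be one bgrep invocation per probe per partition):

```shell
# Stand-in for picture.jpg; a real run uses the known file itself.
dd if=/dev/urandom of=picture.jpg bs=1M count=5 2>/dev/null
# N = 8 partitions -> N+1 = 9 probes of 32 bytes each,
# starting at 100KB and stepping by the presumed 512KB chunk size.
for i in $(seq 0 8); do
  dd if=picture.jpg of=probe_$i.bin bs=1 \
     skip=$(( 100*1024 + i*512*1024 )) count=32 2>/dev/null
done
ls probe_*.bin
```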
These commands can be run in parallel for different disks. A scan of a 38GB partition took around 12 minutes. In my case, every 32-byte string was found only once among all eight drives. By comparing the offsets returned by bgrep, you obtain a picture like this:
We see a normal left-symmetric layout, which is the default for
mdadm
. More importantly, now we know the order of partitions. However, we don't know which partition is the first in the array, as they can be cyclically shifted.
Note also the distance between the found offsets. In my case it was 512KB. The chunk size can actually be smaller than this distance, in which case the actual layout will be different.
Find Original Chunk Size
We use the same file
picture.jpg
to read 32 bytes of data at different intervals. We know from above that the data at offset 100k lies on
/dev/sdh2
, at offset 612k on
/dev/sdb2
, and at 1124k on
/dev/sdd2
. This shows that the chunk size is not bigger than 512KB. We then verify that it is not smaller than 512KB. For this we dump the bytestring at offset 356k and look at which partition it sits on:
It is on the same partition as offset 612k, which indicates that the chunk size is not 256KB. We eliminate smaller chunk sizes in a similar fashion. I ended up with 512KB chunks being the only possibility.
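The elimination logic hinges on where each probe falls relative to chunk boundaries; since the file's start isn't chunk-aligned, probes 256KB apart can land in the same chunk or in adjacent ones. A sketch with an assumed in-chunk phase that reproduces the observations above:

```shell
chunk=$(( 512*1024 ))        # candidate chunk size
phase=$(( 200*1024 ))        # assumed offset of the file start within its chunk
for off_k in 100 356 612 1124; do
  echo "file offset ${off_k}k -> chunk $(( (phase + off_k*1024) / chunk ))"
done
```

With this phase, 356k and 612k share chunk 1 (the same partition) while 100k sits in chunk 0 and 1124k in chunk 2; that is consistent with 512KB chunks but impossible with 256KB ones, where 356k and 612k would always be exactly one chunk apart.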
Find First Partition in Layout
Now we know the order of partitions, but we don't know which partition should be the first, and which RAID data offset was used. To find these two unknowns, we will create a RAID5 array with correct chunk layout and a small data offset, and search for the start of our file system in this new array.
To begin with, we create an array with the correct order of partitions, which we found earlier:
We verify that the order is obeyed by issuing:
Now we determine the offsets of the N+1 known bytestrings in the RAID array. I ran a script overnight (the Live CD doesn't ask for a password on sudo :):
Output with comments:
Based on this data we see that the 3rd string was not found. This means that the chunk at
/dev/sdd2
is used for parity. Here is an illustration of the parity positions in the new array:
Our aim is to deduce which partition to start the array from, in order to shift the parity chunks into the right place. Since the parity should be shifted two chunks to the left, the partition sequence should be shifted two steps to the right. Thus the correct layout for this data offset is
ahbdcefg
:
At this point our RAID array contains the data in the right form. You might be lucky enough that the RAID data offset is the same as it was in the original array, in which case you will most likely be able to mount the partition. Unfortunately this was not my case.
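The two-step rotation is pure string shuffling; as a quick sanity check, starting from the order that the bgrep offsets implied (the sequence that rotates into the ahbdcefg layout named above):

```shell
# Shift the partition sequence two steps to the right.
order="bdcefgah"
shifted="${order: -2}${order:0:${#order}-2}"
echo "$shifted"   # ahbdcefg
```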
Verify Data Consistency
We verify that the data is consistent over a stripe of chunks by extracting a copy of
picture.jpg
from the array. For this we locate the offset of the 32-byte string at 100k:
We then subtract 100*1024 from the result and use the obtained decimal value as the
skip=
parameter for
dd
. The
count=
is the size of
picture.jpg
in bytes:
Check that
extract.jpg
is the same as
picture.jpg
.
Find RAID Data Offset
A sidenote: the default data offset for
mdadm
version 3.2.3 is 2048 sectors, but this value has changed over time. If the original array used a smaller data offset than your current
mdadm
, then
mdadm --create
without
--assume-clean
can overwrite the beginning of the file system.
In the previous section we created a RAID array. Verify which RAID data offset it had by issuing, for some of the individual partitions:
2048 512-byte sectors is 1MB. Since chunk size is 512KB, the current data offset is two chunks.
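The conversion, spelled out in shell arithmetic:

```shell
sectors=2048
chunk_kb=512
echo "$(( sectors * 512 / 1024 )) KB = $(( sectors * 512 / (chunk_kb * 1024) )) chunks"
```

This prints 1024 KB = 2 chunks.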
If at this point you have a two-chunk offset, it is probably small enough, and you can skip this paragraph.
We create a RAID5 array with the data offset of one 512KB-chunk. Starting one chunk earlier shifts the parity one step to the left, thus we compensate by shifting the partition sequence one step to the left. Hence for 512KB data offset, the correct layout is
hbdcefga
. We use a version of
mdadm
that supports a data offset (see the Tools section). It takes the offset in kilobytes:
Now we search for a valid ext4 superblock. The superblock structure can be found here: https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout#The_Super_Block
We scan the beginning of the array for occurrences of the magic number
s_magic
followed by
s_state
and
s_errors
. The bytestrings to look for are:
Example command:
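The actual search used bgrep against the md device; the same idea can be demonstrated on a small blob with od and grep, assuming s_magic=0xEF53 stored little-endian, s_state=1 (clean), and s_errors=1 (continue), i.e. the byte string 53 ef 01 00 01 00:

```shell
# Plant the six-byte pattern a few bytes into a test blob, then locate it.
printf 'pad\x53\xef\x01\x00\x01\x00rest' > blob.bin
od -A d -t x1 blob.bin | grep '53 ef 01 00 01 00'
```

Note od only works here because the pattern doesn't straddle a 16-byte output line; bgrep has no such limitation, which is why it was used on the real array.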
The magic number starts 0x38 bytes into the superblock, so we subtract 0x38 to calculate the offset and examine the entire superblock:
This seems to be a valid superblock.
s_log_block_size
field at 0x18 is 0002, meaning that the block size is 2^(10+2) = 4096 bytes.
s_blocks_count_lo
at 0x4 is 03f81480 blocks, which is 254GB. Looks good.
We now scan for the occurrences of the first bytes of the superblock to find its copies. Note the byte flipping as compared to the hexdump output:
This aligns perfectly with the expected positions of backup superblocks:
Hence the file system starts at the offset 0xdc80000, i.e. 225792KB from the partition start. Since we have 8 partitions of which one is for parity, we divide the offset by 7. This gives 33030144 bytes offset on every partition, which is exactly 63 RAID chunks. And since the current RAID data offset is one chunk, we conclude that the original data offset was 64 chunks, or 32768KB. Shifting
hbdcefga
63 times to the right gives the layout
bdcefgah
.
We finally build the correct RAID array:
Voilà!
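The offset arithmetic behind that last step checks out in plain shell (numbers from this answer):

```shell
fs_start=$(( 0x0dc80000 ))     # superblock offset found above, in bytes
per_disk=$(( fs_start / 7 ))   # 8 partitions, 7 of which carry data per stripe
chunk=$(( 512*1024 ))
echo "$per_disk bytes = $(( per_disk / chunk )) chunks per partition"
# 63 right-rotations of the one-chunk-offset layout:
order="hbdcefga"
n=$(( 63 % ${#order} ))
echo "${order: -n}${order:0:${#order}-n}"
```

This prints 33030144 bytes = 63 chunks per partition, and the final layout bdcefgah.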
If you are lucky you might have some success with getting your files back with recovery software that can read a broken RAID-5 array. Zero Assumption Recovery is one I have had success with before.
However, I'm not sure if the process of creating a new array has gone and destroyed all the data, so this might be a last chance effort.
I had a similar issue. I formatted and reinstalled my OS/boot drive with a clean install of Ubuntu 12.04, then ran the mdadm --create... command and couldn't mount it.
It said it didn't have a valid superblock or partition.
Moreover, when I stopped the mdadm raid, I could no longer mount the regular device.
I was able to repair the superblock with mke2fs and e2fsck:
Then ran:
That restored the superblock so I could mount and read the drive.
To get the array working without destroying the superblock or partitions I used build:
After verifying the data, I will add the other drive:
I'm just updating some of the information given earlier. I had a 3-disk RAID5 array working OK when my motherboard died. The array held /dev/md2 as the 1.2TB /home partition and /dev/md3 as the 300GB /var partition.
I had two backups of "important" stuff and a bunch of random things I had grabbed from various parts of the internet that I really should have gone through and selectively dumped. Most of the backups were broken into .tar.gz files of 25GB or less, and a separate copy of /etc was backed up also.
The rest of the filesystem was held on two small raid0 disks of 38GB.
My new machine was similar to the old hardware, and I got the machine up and running simply by plugging all five disks in and selecting an old generic kernel. So I had five disks with clean filesystems, though I could not be certain that the disks were in the right order, and needed to install a new version of Debian Jessie to be sure that I could upgrade the machine when needed, and sort out other problems.
With the new generic system installed on two RAID0 disks, I began to put the arrays back together. I wanted to be sure that I had the disks in the right order. What I should have done was to issue:
But I didn't. It seems that mdadm is pretty smart and, given a UUID, can figure out which drives go where. Even if the BIOS designates /dev/sdc as /dev/sda, mdadm will put it together correctly (YMMV though).
Instead I issued:
mdadm --create /dev/md2 without the --assume-clean
, and allowed the resync on /dev/sde1 to complete. The next mistake I made was to work on /dev/sdc1 instead of the last drive in /dev/md2, /dev/sde1. Anytime mdadm thinks there is a problem, it is the last drive that gets kicked out or re-synced.
After that, mdadm could not find any superblock, and e2fsck -n couldn't either.
After I found this page, I went through the procedure of trying to find the sequence of the drives (done), checking for valid data (verified 6MB of a 9MB file), getting the disks in the right sequence (cde), grabbing the UUIDs of /dev/md2 and /dev/md3 from the old /etc/mdadm.conf, and trying to assemble.
Well,
/dev/md3
started, andmdadm --misc -D /dev/md3
showed three healthy partitions, and the disks in the right order.
/dev/md2
also looked good, until I tried to mount the filesystem.
The filesystem refused to be mounted, and e2fsck couldn't find any superblocks. Further, when checking for superblocks as described above, the total block count, found as a880 0076 or 5500 1176, did not match the disk capacity of 1199.79 reported by mdadm. Also, none of the locations of the "superblocks" aligned with the data in the posts above.
I backed up all of /var, and prepared to wipe the disks. To see if it was possible to wipe just /dev/md2 (I had nothing else to lose at this point), I did the following:
All seemed OK, except for the change to the UUID. So after a couple more checks, I wrote 600GB of backed-up data onto /dev/md2. Then I unmounted it and tried to re-mount the drive:
Are you ********* kidding me? What about my 600GB of files?
Ah - easily fixed. I uncommented one line in /etc/mdadm.conf:
Yippie!