Yesterday, I added a second 500 GB hard drive to a system. That system had been installed as RAID-1 with only one drive, because I didn't have the other one on hand at the time.
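(For context: a one-disk RAID-1 is normally created with a 'missing' placeholder, so the array still knows it should have two members. Mine was apparently set up with --raid-devices=1 instead, which matters later. A rough sketch of the usual approach, not what my installer actually ran:)

    # the usual way to build a degraded two-member RAID-1 on a single disk
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 missing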
After finally adding the second disk, I ran "sfdisk -d /dev/sda | sfdisk --force /dev/sdb" to copy the partition table over, as I have done many times before.
Then I ran "mdadm --add /dev/md0 /dev/sdb1", and the RAID started syncing.
After it finished, it turned out the new partitions had been added as spares, not as active devices. This seems to have happened because the RAID-1 array thought it only had room for one active device, due to the odd one-disk installation I did.
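You can see this state in /proc/mdstat and mdadm --detail; a spare member is flagged with an (S) suffix. Something like this (illustrative, not my actual output):

    # a spare shows up with an (S) suffix in /proc/mdstat, e.g.
    #   md0 : active raid1 sdb1[1](S) sda1[0]
    cat /proc/mdstat
    # --detail makes the mismatch explicit: "Raid Devices : 1" next to "Spare Devices : 1"
    mdadm --detail /dev/md0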
So, today, I ran "mdadm --grow --raid-devices 2 /dev/md0" (note that I didn't put an '=' before the '2').
Immediately, my whole filesystem disappeared!
I am still logged in over an SSH session, but I am limited to bash's built-in commands, which is rather painful.
I cobbled together a cat replacement out of bash builtins, and can still cat some files. /proc/mdstat looks fine and dandy, and indicates that the new drive is now actually active.
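For anyone curious, the stand-in looked roughly like this (mine was probably uglier); it only uses bash builtins, so it works even though /bin/cat can no longer be loaded:

    # poor man's cat, built from bash builtins only (read and printf)
    cat() {
        local line
        while IFS= read -r line || [[ -n $line ]]; do
            printf '%s\n' "$line"
        done < "$1"
    }
    cat /proc/mdstat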
/var/log/messages (which, strangely, is still accessible even though most other files are not) gives me thousands of lines like:

    attempt to access beyond end of device md0: rw=0, want=868055984, limit=4

(the number after 'want' varies). The messages were all generated within a couple of seconds of running mdadm --grow, and then stopped.
As mentioned, this is a remote machine.
- What the hell happened here?
- Is there any way to undo whatever it is that --grow did?
- Can I remove the new disk from the RAID device just by echoing into obscure /proc files (since mdadm isn't found anymore)?
- Should I trigger a SysRq reboot and hope for the best? (A rough sketch of what I have in mind for these last two options is below.)
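What I have in mind, though I haven't dared to run either yet, is roughly the following; note that the md state files actually live under /sys rather than /proc, and the exact paths may differ per kernel version:

    # mark the newly added member faulty, then drop it, via the md sysfs interface
    echo faulty > /sys/block/md0/md/dev-sdb1/state
    echo remove > /sys/block/md0/md/dev-sdb1/state

    # or: sync, remount read-only and reboot via magic SysRq (echo is a builtin)
    echo 1 > /proc/sys/kernel/sysrq
    echo s > /proc/sysrq-trigger
    echo u > /proc/sysrq-trigger
    echo b > /proc/sysrq-trigger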
EDIT: Well, a hard reboot did fix the issue, strangely enough.
After the reboot, the machine came up normally, and it is now rebuilding the RAID-1 array again, with the additional drive once more marked as a spare.
So it seems the grow command instantly made the filesystem and all disk access disappear, so quickly that even the grow command's own effects were never written to disk.
Strange.
EDIT: It turns out the drive that held the data had bad sectors, so the initial sync failed, and mdadm dropped the new (not completely synced) drive back to 'spare'. My temporary workaround was to overwrite the bad sectors with zeros using hdparm, which is something you should normally not do (google "hdparm write bad sectors"). For some odd reason this worked (even though a little data was lost), and the array managed to finish its initial sync. Now I can pull the bad drive and sync the new drive to an even newer drive.
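For completeness, the hdparm procedure was roughly this; the LBA below is made up, the real one comes out of the kernel log / SMART output for the failing reads:

    # check whether the sector really is unreadable
    hdparm --read-sector 1234567 /dev/sda

    # overwrite that single sector with zeros, forcing the drive to remap it
    # (this destroys whatever was stored there, hence the scary flag)
    hdparm --write-sector 1234567 --yes-i-know-what-i-am-doing /dev/sda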