We've been using Windows Dynamic Disks for our customers' system disks as a software RAID 1 solution (i.e. we install the system on one disk, then mirror it after installation using Dynamic Disks). I've run into a problem that suggests my disk-replacement procedure isn't quite right, though it has worked in the past.
Normally, after being notified that a disk had failed, I would look at the Disk Management snap-in and see which disk had a (!) icon beside it. Then I'd power down the system, replace the disk, boot it back up, and recreate the mirror on the fresh disk. Windows seems to take care of the partitioning and boot-loader details, which is nice.
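(For reference, the same check can be done from the command line with `diskpart`; the output below is a rough sketch of what a degraded mirror looks like, not a transcript from this system:)

```
DISKPART> list disk

  Disk ###  Status    Size    Free    Dyn  Gpt
  --------  --------  ------  ------  ---  ---
  Disk 0    Online    465 GB      0B   *    *
  Disk M0   Missing      0 B      0B   *

DISKPART> list volume

  Volume ###  Ltr  Label    Fs    Type    Size    Status
  ----------  ---  -------  ----  ------  ------  ---------
  Volume 0     C   System   NTFS  Mirror  465 GB  Failed Rd
```

A failed or removed dynamic disk shows up as "Missing", and the mirrored volume reports "Failed Rd" (failed redundancy) while it's running on a single plex.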
However, I've got one customer system where replacing the bad disk results in a failure to find \Windows\system32\winload.exe. The system boots into Windows Boot Manager, but whether I choose the normal boot entry or the "secondary plex" option, I get the same error. I even tried swapping the disks between SATA ports as a desperate measure. The only way I found to get it back up was to put the bad disk back exactly where it was in the system.
This is bugging me, as the only option I've got left is to back up and restore the system over the network, and we need to be able to replace Windows disks reliably.
Can anyone who knows more about Windows Dynamic Disks suggest what's wrong with my disk-replacement procedure? The KB articles I've found:
- http://support.microsoft.com/kb/113977,
- http://support.microsoft.com/kb/969751 and
- http://support.microsoft.com/kb/323432
don't seem to cover the issue.
My guess (and it's only a guess) is that the GPT partition table header on the new disk has a missing, invalid, or unrecognized disk GUID, and that's what's causing the problem. My suggestion would be to change your process: break or remove the mirror, replace the failed disk, then re-create the mirror.
Note: breaking and removing a mirror are two very different things, but in your case I don't think it matters which you choose, since you're ultimately going to replace the failed disk in order to bring the mirror back to a healthy state.
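Sketched in `diskpart` terms (the disk and volume numbers here are examples; check `list disk` / `list volume` on your system first):

```
REM Remove the failed plex entirely ("remove the mirror"):
DISKPART> select volume C
DISKPART> break disk=1 nokeep

REM (Omitting "nokeep" instead *breaks* the mirror, keeping both
REM  halves as independent simple volumes.)

REM After physically replacing the failed disk:
DISKPART> select disk 1
DISKPART> convert dynamic
DISKPART> select volume C
DISKPART> add disk=1
```

The same steps can of course be done through the Disk Management snap-in (Remove Mirror / Break Mirrored Volume, then Add Mirror on the new disk).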
I think I can explain it now: the mirror had never been created successfully in the first place, due to a fault on the installation drive (one that seems to have occurred between the stress test and the server being delivered). Booting off the mirrored drive therefore can't work, and it's a recovery job rather than something the RAID layer can help with.