Due to Hurricane Matthew, our company shut down all servers for two days. One of the servers was an ESXi host with an attached HP StorageWorks MSA60.
When we powered things back up today and logged into the vSphere Client, we noticed that none of our guest VMs were available (they were all listed as "inaccessible"). When I look at the hardware status in vSphere, the array controller and all attached drives appear as "Normal", but the drives all show up as "unconfigured disk".
We rebooted the server and tried going into the RAID config utility to see what things look like from there, but we received the following message:
An invalid drive movement was reported during POST. Modifications to the array configuration following an invalid drive movement will result in loss of old configuration information and contents of the original logical drives
Needless to say, we're very confused by this because nothing was "moved"; nothing changed. We simply powered up the MSA and the server, and have been having this issue ever since.
The MSA is attached via a single SAS cable, and the drives are labeled with stickers, so I know the drives weren't moved or switched around:
---------------------
| 01 | 04 | 07 | 10 |
---------------------
| 02 | 05 | 08 | 11 |
---------------------
| 03 | 06 | 09 | 12 |
---------------------
At the moment, I don't know what make and model the drives are, but they are all 1TB SAS drives.
I have two main questions/concerns:
Since we did nothing more than power the devices off and back on, what could've caused this to happen? I of course have the option to rebuild the array and start over, but I'm leery about the possibility of this happening again (especially since I have no idea what caused it).
Is there a snowball's chance in hell that I can recover our array and guest VMs, instead of having to rebuild everything and restore our VM backups?
Right, this is a very precarious situation...
So the HP Smart Array controller can handle a certain number of physical drive movements before it breaks the array configuration. Remember that HP RAID metadata lives on the physical drives and not the controller...
The MSA60 is a 12-bay 3.5" first-generation SAS JBOD enclosure. It went end-of-life in 2008/2009. It's old enough that it shouldn't be in the critical path of any vSphere deployment today.
In this case, the P411 controller is trying to protect you. You may have sustained a multiple drive failure condition, hit a firmware bug, lost one of the two controller interfaces in the rear of the MSA60 or some other odd error.
This sounds like an older server setup as well. So I'd like to know the server involved and the Smart Array P411 firmware revision.
I'd suggest removing power to all of the components. Waiting a few minutes. Powering on... and watching POST prompts very closely.
See the details in my answer here:
logical drives on HP Smart Array P800 not recognized after rebooting
There may be an option to re-enable a previously failed logical drive, with a prompt to press F1 or F2. If presented, try F2.
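If the host will boot, it's also worth capturing the controller's view of things from the ESXi shell before changing anything. This is only a sketch: it assumes HP's hpssacli utility (from the HP ESXi bundle) is installed at the usual path and that the P411 reports as slot 1; adjust the path and slot number for your system.

```
# Sketch only -- assumes the HP hpssacli VIB is installed on the ESXi host
# and that the Smart Array P411 is in slot 1 (check "ctrl all show" first).

# List all controllers, their firmware revisions, and the arrays/logical drives they know about
/opt/hp/hpssacli/bin/hpssacli ctrl all show detail
/opt/hp/hpssacli/bin/hpssacli ctrl all show config

# Show the status of any logical drives still visible to the controller
/opt/hp/hpssacli/bin/hpssacli ctrl slot=1 ld all show

# Show the physical drives as the controller sees them (assigned vs. unassigned)
/opt/hp/hpssacli/bin/hpssacli ctrl slot=1 pd all show
```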
You guys are not going to believe this...
First I attempted a fresh cold boot of the existing MSA, waited a couple of minutes, then powered up the ESXi host, but the issue remained. I then shut down the host and MSA, moved the drives into our spare MSA, powered it up, waited a couple of minutes, then powered up the ESXi host; the issue still remained.
At that point, I figured I was pretty much screwed, and at no point during the RAID controller's initialization was I offered an option to re-enable a failed logical drive. So I booted into the RAID config utility, verified again that there were no logical drives present, and created a new logical drive (RAID 1+0 with two spare drives; the same configuration we used about two years ago when we first set up this host and storage).
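For anyone who ends up in the same spot, the equivalent logical drive can also be created from the command line with HP's array CLI instead of the boot-time config utility. This is only a sketch: the slot number, array letter, and drive IDs below are placeholders, not the actual values from our controller.

```
# Sketch only -- the slot number, array letter, and drive IDs are placeholders;
# substitute the values reported by "ctrl all show config" on your system.

# Create a RAID 1+0 logical drive from the unassigned drives
hpssacli ctrl slot=1 create type=ld drives=allunassigned raid=1+0

# Add two hot spares to the new array (array letter and drive IDs are examples)
hpssacli ctrl slot=1 array A add spares=1E:1:11,1E:1:12

# Verify the result
hpssacli ctrl slot=1 ld all show status
```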
Then I let the server boot back into vSphere and accessed it via vCenter. The first thing I did was remove the host from inventory and re-add it (I was hoping to clear all the inaccessible guest VMs this way, but it didn't clear them from the inventory). Once the host was back in my inventory, I removed each of the guest VMs one at a time.

Once the inventory was cleared, I verified that no datastore existed and that the disks were basically ready and waiting as "data disks". So I went ahead and created a new datastore (again, the same as we did a couple of years ago, using VMFS). I was eventually prompted to specify a mount option, and one of the choices was "keep the existing signature". At that point, I figured it was worth a shot to keep the signature - if things didn't work out, I could always blow it away and re-create the datastore.

After I finished building the datastore with the keep-signature option, I tried navigating to the datastore to see if anything was in it - it appeared empty. Just out of curiosity, I SSH'd to the host and checked from there, and to my surprise, I could see all my old data and all my old guest VMs! I went back into vCenter, re-scanned storage, refreshed the console, and all of our old guest VMs were there. I re-registered each VM and was able to recover everything. All of our guest VMs are back up and successfully communicating on the network.
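For reference, the same keep-the-existing-signature mount and VM re-registration can also be done from the ESXi shell. This is just a sketch; the datastore label and VM paths are placeholders for whatever your environment actually uses.

```
# Sketch only -- "datastore1" and the VM paths are placeholders.

# List VMFS volumes that the host has detected as snapshots/copies of an existing datastore
esxcli storage vmfs snapshot list

# Mount the volume while keeping its existing signature (the "keep the existing signature" option)
esxcli storage vmfs snapshot mount -l "datastore1"

# Confirm the datastore is mounted
esxcli storage filesystem list

# Re-register a guest VM from its .vmx file (repeat for each VM)
vim-cmd solo/registervm /vmfs/volumes/datastore1/myvm/myvm.vmx
```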
I think most people in the IT community would agree that the chances of having something like this happen are extremely low to impossible.
As far as I'm concerned, this was a miracle of God...