I'm running an EMC NetWorker server on a Windows Server 2008 R2 SP1 VM on an ESXi host. The VMDK is stored on a VNXe along with the VMDKs for all the other VMs our organization runs. None of the other VMs have this issue:
Each night for the latter half of this week, sometime after 9pm, this server loses its hard drive. Checking the systems in the morning, I find this machine sitting at the boot prompt after it has tried PXE, reporting that it cannot find a bootable device. Checking the VM settings, I find there is no hard drive attached to the machine.
Recovering is as simple as adding a hard drive to the VM and pointing it at the existing VMDK, which is still sitting on the datastore hosted on the VNXe.
The vSphere server doesn't report anything wrong or any errors.
There is no information in the system log on the server itself so I'm pretty sure it has no clue what happened to it.
The issue began as I started ramping up backups using the NetWorker system, adding new hosts to back up. Currently I am only backing up virtual hosts using the configured VADP proxy built into the NetWorker server, along with a test SQL server (also a VM) using the NetWorker client installed locally on that machine. I was backing up the NetWorker server itself, as the documentation noted there shouldn't be an issue with that, but I disabled that backup shortly after finding this issue.
I need to find out how and why the VMDK is becoming detached from the NetWorker server. It would be nice for someone to tell me explicitly, but help finding the vSphere logs that show everything that goes on with the systems would be a good pointer in the right direction.
UPDATE: additional details
Backups of the VMs are scheduled to begin at 9pm each night.
From the vSphere logs for this VM:
- 2/21 at 9:00:11pm: Task: Create virtual machine snapshot.
- 2/22 at 2:18:57am: Task: Remove snapshot. This was the first attempted scheduled backup OF this VM FROM itself and indicates the successful and correct operation of the backup system.
- 2/22: I migrated the machine to a different ESXi host (there are three identical hosts in an HA configuration) to better arrange resources.
- 2/22 at 9:00:15pm: Task: Reconfigure virtual machine. This is the first time the HDD was removed from the VM.
- 2/23 around 8:25am: Checking the systems, I find the HDD missing on this VM for the first time. This leads me to believe the snapshot operation triggered by the NetWorker scheduled backup is being translated into "remove the HDD from this VM" by the ESXi host.
- 2/23 at 9:00:14pm: Task: Reconfigure virtual machine.
- 2/24: I reattached the HDD and disabled all the scheduled backups of this VM in NetWorker.
- 2/24 at 9:31:32pm: Task: Reconfigure virtual machine.
- 2/25 at 9:00:15pm and 2/26 at 9:00:11pm: The same Reconfigure virtual machine task removes the HDD from this VM, and I reattach it the following morning.
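To make the timing pattern in this log easier to see, here is a minimal Python sketch that measures how long after the scheduled 9pm backup start each Reconfigure task fired. The timestamps are copied from the entries above; the function names and the assumption that backups start at exactly 9:00:00pm are only for illustration.

```python
# Scheduled backup start, per the question. Exactly 9:00:00pm is an assumption.
BACKUP_START = "9:00:00pm"

# (date, time) of each "Reconfigure virtual machine" task from the log above.
RECONFIGURE_EVENTS = [
    ("2/22", "9:00:15pm"),
    ("2/23", "9:00:14pm"),
    ("2/24", "9:31:32pm"),
    ("2/25", "9:00:15pm"),
    ("2/26", "9:00:11pm"),
]


def to_seconds(time_text):
    """Convert a 12-hour clock string such as '9:00:15pm' to seconds since midnight."""
    clock, meridiem = time_text[:-2], time_text[-2:].lower()
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    if meridiem == "pm" and hours != 12:
        hours += 12
    if meridiem == "am" and hours == 12:
        hours = 0
    return hours * 3600 + minutes * 60 + seconds


def offsets(events, start_text=BACKUP_START):
    """Map each date to the number of seconds between backup start and the task."""
    start = to_seconds(start_text)
    return {day: to_seconds(time_text) - start for day, time_text in events}


if __name__ == "__main__":
    for day, secs in offsets(RECONFIGURE_EVENTS).items():
        print(f"{day}: {secs} s after scheduled start")
```

Four of the five tasks land within about 15 seconds of the scheduled start; only the 2/24 entry is later, roughly half an hour after 9pm.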
Based on this log I need to check the following:
- Does the issue persist when the VM is running on a different host?
- Does the issue persist when no backups run at all?
I'll check these and report back on success or failure.
Update 2: troubleshooting report
One more thing I found: In the configuration for each VM client in NetWorker, there is a place to record the ESXi host that VM is on. When I vMotion a VM to a different ESXi host, this value is not updated, even with VM autodetect enabled in NetWorker. So I updated this value in the VM client configuration to the current ESXi host. It would be nice if AutoDetect kept it updated by itself.
So, to report on the troubleshooting I tried yesterday:
First, the HDD was still attached this morning, which confirms the issue was at least being triggered by NetWorker. I disabled all backups yesterday, and I moved the NetWorker server to a new ESXi host. I also updated the ESXi host information noted in the previous paragraph.
Today I've re-enabled most of the backups (leaving off high-availability systems like SQL and Exchange).
If the HDD is removed tonight then it is the backup configuration that is the issue.
If the HDD is NOT removed tonight, then it is the host configuration information or the host itself causing the issue.
Update 3: Troubleshooting followup
The HDD was lost again last night, which means the issue is probably the NetWorker configuration.
Just to recap: Last night I ran scheduled backups of several VMs (but not of the NetWorker server), and just after 9pm I saw the same log entries I noted earlier in the question, leaving the VM with no HDD attached.
There's another thing I'll try: Based on EMC documentation, the NetWorker server can also be a storage node, and most of the VMs are processing their backups through this node (this is separate from the VADP). I'll disable these through-the-node backups and see if that makes a difference.
Also, physical system backups and an NDMP backup from our NAS/network drives are working OK.
I'll begin isolating the VMs and adding one at a time to the backup to see if I can determine if a particular VM is causing the issue. This is something I should be able to test during work hours.
Update 4: Testing shines a light
OK, the problem occurs whenever I try to back up a VM using VADP.
I tested backing up running and powered-off VMs using a variety of settings permutations, and the only factor that determined whether the NetWorker server lost its drive was the backup method: backing up through the NetWorker client installed on the target VM, or backing up through VADP.
When configuring a backup using the client wizard, first you choose whether you are configuring a new VADP proxy, or a VM backup client, or a NetWorker client.
If you choose VM backup client, you then get to choose whether you're backing it up using VADP (this is the default) or using the NetWorker client installed on the VM (this is for when you need any special configuration for the backup). VADP hits the actual VMDK and integrates with VMware. With the client option, NetWorker still "knows" the client is a VM, but you can specify particular drives, VSS, and other functions. VADP backs up VMs without using any guest resources, relying entirely on the ESXi host. The NetWorker client software uses the client's own resources to run the backup.
So, running a VADP backup of a VM is what removes the HDD from the NetWorker server. And there are more log entries that show up in the vSphere client when the HDD is dropped:
- About 20 seconds after a VADP proxy backup is initiated, vSphere reports an attempt to migrate the NetWorker server from VM2 to VM2
- then the NetWorker server is reset
- then an event states "a ticket of type mks has been acquired"
- then a warning regarding the amount of video memory assigned to the VM
- finally a report that the NetWorker server VM is powered on.
It's probably too late, but this may be helpful for future planning.
The reason this happened: after using HotAdd transport mode to back up the virtual machine that serves as the backup proxy, the backup completes successfully, but during cleanup the regular virtual disk is mistakenly removed along with the HotAdded disk.
This was a known issue with the VDDK at that time: http://www.vmware.com/support/developer/vddk/VDDK-1.2.1-Relnotes.html. When building a HotAdd environment, it's very important NOT to back up the proxy with VADP.
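As a rough illustration of that rule, the sketch below flags any backup definition that would back up a proxy VM through VADP. Everything in it is hypothetical: the dictionary fields (`name`, `method`), the function name, and the sample client names are made up for the example and do not reflect NetWorker's actual configuration schema or any NetWorker API.

```python
def proxies_backed_up_with_vadp(clients, proxy_names):
    """Return the names of proxy VMs that are configured for VADP backup.

    `clients` is a list of dicts with hypothetical 'name' and 'method' keys;
    `proxy_names` lists the VMs acting as VADP proxies.
    """
    proxies = {name.lower() for name in proxy_names}
    return [
        client["name"]
        for client in clients
        if client["method"].lower() == "vadp" and client["name"].lower() in proxies
    ]


if __name__ == "__main__":
    # Illustrative data only; these are not real host names from the question.
    clients = [
        {"name": "networker01", "method": "vadp"},
        {"name": "sql-test", "method": "client"},
        {"name": "fileserver", "method": "vadp"},
    ]
    print(proxies_backed_up_with_vadp(clients, ["networker01"]))
```

An empty result means no proxy is scheduled for a VADP backup; anything listed should be switched to a client-based backup or excluded.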
The solution ended up being to completely rebuild the NetWorker server, which was a good thing for a couple reasons.
Backups are now running and the drives of the NetWorker server/VADP proxy are not getting dropped.