I'm using Backup Exec and VCB to back up a few VMs. As I understand it, the pre-job script creates a snapshot of each VM and mounts it as a virtual directory on my backup server, and then my Backup Exec job just backs up the local folders like normal. The problem occurs during the pre-job script: the directory for one particular server is never mounted.
When I look at the VI client and look at recent activity, I see the snapshot has begun but hasn't finished. It appears to time out after 15 minutes, and thus the server is never backed up.
I have multiple VMs being backed up this way and the others work fine. The troublesome VM has a virtual disk of 85 GB, but another VM that does work has a virtual disk of nearly 100 GB.
I'm wondering what else about a VM could cause the snapshot to take a long time to create. Is it an issue with the VM host perhaps? The VM host is a very powerful server, and none of the VM guests are used heavily, plus the backups are run off-hours so it shouldn't be a case that the server is just overworked. Are there any logs or tools I can use to see what's slowing the snapshot down?
VMware uses the term snapshot rather loosely. It does not actually create a copy of your server. Instead, it stops making any changes to the existing disk file and redirects the changes to a delta file for the life of the snapshot.
What this means is:
I think the VCB process takes a snapshot (so that the data doesn't change during the copy) and then makes a clone of the frozen file for backing up. This can take some time, although you mention that it succeeds for a larger server, so this probably isn't the issue.
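To make the delta-file idea concrete, here is a toy model in Python. It is only a sketch of the redirect-on-write behaviour described above, not how VMware implements it: the class name, the block list, and the commit step are all made up for illustration.

```python
class SnapshotDisk:
    """Toy model of a frozen base disk plus a delta (illustrative only)."""

    def __init__(self, base_blocks):
        self.base = list(base_blocks)  # treated as read-only while the snapshot exists
        self.delta = {}                # block index -> data written after the snapshot

    def write(self, index, data):
        # Writes never touch the base; they are redirected to the delta.
        if not 0 <= index < len(self.base):
            raise IndexError(index)
        self.delta[index] = data

    def read(self, index):
        # Reads prefer the delta and fall back to the frozen base.
        return self.delta.get(index, self.base[index])

    def commit(self):
        # Removing the snapshot folds the delta back into the base.
        for index, data in self.delta.items():
            self.base[index] = data
        self.delta.clear()


disk = SnapshotDisk(["a", "b", "c", "d"])
disk.write(1, "B")                      # guest keeps running and changes a block
frozen_copy = list(disk.base)           # what a backup of the frozen file would see
current_view = [disk.read(i) for i in range(len(disk.base))]
```

The backup sees the unchanged base (`frozen_copy`), while the running guest sees its own writes (`current_view`). Only the changed block lives in the delta until the snapshot is removed.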
One possibility is if you have any virtual disks marked as independent. If so, these are ignored by the snapshot process, and possibly by VCB as well. Not sure how VCB mounts the drives, but perhaps it requires a drive that is marked as independent?
Latency and SCSI reservations have already been mentioned and those are often the cause.
Other things to check:
Are VMware Tools installed and running properly in this particular VM? Is the VM running an outdated version of VMware Tools? VMware Tools is key to getting a good snapshot. For example, more recent versions of ESX 3.5 and VMware Tools support using VSS as the snapshot provider for Windows VMs, but the updated version of VMware Tools would need to be installed with VSS support, and that would need to be configured.
The backup resource: is this particular job queuing for an extended period of time? If the disk stage or tape drive is in use and the job remains at the snapshot stage for an extended period of time, the snap may never actually get taken. This seems unlikely given your description but in general it might be something to check.
Check the latency on your SAN when this is occurring. It could be that another VM or process (a SQL Server job?) is hitting the SAN at the same time.
How many VMs do you host on that same LUN? How busy are they?
We have had major trouble here with some VMware ESX servers placing so many SCSI reservations on a LUN that other ESX servers using the same LUN weren't able to write to it anymore. You should be able to see this in the log files though.
ESX sets a SCSI reservation on an entire LUN while it performs metadata updates. It is possible that VCB adds to the already heavy load on the LUN here.
Officially, this problem has been fixed for some months, but we still experience trouble every now and then.
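If you copy the relevant log off the host, a few lines of Python can show whether conflicts cluster around the backup window. This is a rough sketch under assumptions: that conflict messages contain the text "reservation conflict", and that each line starts with a syslog-style timestamp such as "Mar 13 02:15:07". The sample lines below are made up for the demonstration and are not real ESX output, so check both assumptions against your own logs.

```python
import re
from collections import Counter

# Assumption: conflict messages contain this phrase (adjust to match your logs).
PATTERN = re.compile(r"reservation conflict", re.IGNORECASE)


def count_conflicts_by_hour(lines, hour_chars=9):
    """Count matching lines, grouped by the first `hour_chars` characters.

    With a syslog-style timestamp ("Mar 13 02:15:07 ..."), the first nine
    characters are month, day and hour.
    """
    counts = Counter()
    for line in lines:
        if PATTERN.search(line):
            counts[line[:hour_chars]] += 1
    return counts


# Made-up sample lines, only to show the shape of the input.
sample = [
    "Mar 13 02:15:07 esx01 example: reservation conflict on some LUN",
    "Mar 13 02:15:09 esx01 example: unrelated message",
    "Mar 13 02:47:30 esx01 example: Reservation Conflict on some LUN",
    "Mar 13 03:01:12 esx01 example: reservation conflict on some LUN",
]
counts = count_conflicts_by_hour(sample)
```

For a real file, pass the open file object instead of `sample`. A spike in the hour when the pre-job script runs would point at the LUN rather than at the VM itself.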
Another thing: make sure that defragging never runs while there is a snapshot against the VM. Defragmenting rewrites most of the disk, and every rewritten block lands in the delta file, so the delta just explodes in size.
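A back-of-the-envelope way to see why, assuming (as a simplification) that the delta has to hold one copy of every distinct block written while the snapshot exists. The block counts and the 90% figure are invented for illustration.

```python
def delta_blocks(writes):
    """Distinct blocks written, i.e. blocks the delta must hold in this toy model."""
    return len(set(writes))


# Light workload: the guest rewrites the same 20 blocks over and over.
light_writes = [i % 20 for i in range(5000)]

# Defrag-like workload: 900 of 1000 blocks are each relocated once (assumed).
defrag_writes = list(range(900))

light_delta = delta_blocks(light_writes)    # 20 blocks
defrag_delta = delta_blocks(defrag_writes)  # 900 blocks
```

The light workload issues far more writes but touches only a few blocks, so the delta stays small. The defrag pass touches nearly the whole disk once, and the delta ends up close to the size of the disk itself.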