I have an ESXi box with HP LeftHand storage exposed via iSCSI.
I have a virtual machine with a 1TB disk, of which 800GB is consumed. The disk is thick provisioned on the LeftHand storage.
A snapshot was open on the VM (so that Veeam Backup & Replication could do its thing), and it stayed open for around 6 hours. A delta disk of around 5GB was created during this time.
The snapshot removal has now taken over 5 hours and still isn't complete. The storage array is reporting virtually no IOPS on that array (around 600, which is background noise), virtually no throughput (around 8MB/sec, which is again background noise), and an average queue depth of 9.
In other words, the snapshot consolidation process doesn't seem to be IO bound, and I can't see anything that would cause the snapshot removal to be so damn slow. It is working, judging by the changing delta files.
Is there anything else I should look at to explain why this (relatively small) snapshot is so slow to remove?
As per the VMware documentation, I'm watching the output of ls -lh | grep -E "delta|flat|sesparse" right now, and I see two delta files that are changing:
-rw------- 1 root root 194.0M Jun 15 01:28 EXAMPLE-000001-delta.vmdk
-rw------- 1 root root 274.0M Jun 15 01:27 EXAMPLE-000002-delta.vmdk
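To put numbers on how fast those files change, the same check can be scripted. This is only a rough sketch, assuming a Python 3 interpreter that can read the VM's datastore directory; the function names and the path in the usage comment are made up for illustration, and the filename pattern is the one from the grep above.

```python
import os
import re
import time

# Same filename pattern as the grep in the post above.
PATTERN = re.compile(r"delta|flat|sesparse")


def snapshot_file_sizes(directory):
    """Map each matching file name in `directory` to its size in bytes."""
    sizes = {}
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if PATTERN.search(name) and os.path.isfile(path):
            sizes[name] = os.path.getsize(path)
    return sizes


def watch(directory, interval_s=10, samples=6):
    """Print each matching file's size and its change since the last sample."""
    previous = snapshot_file_sizes(directory)
    for _ in range(samples):
        time.sleep(interval_s)
        current = snapshot_file_sizes(directory)
        for name in sorted(current):
            change = current[name] - previous.get(name, 0)
            print(f"{name}: {current[name] / 2**20:.1f} MiB ({change:+d} bytes)")
        previous = current


# Hypothetical usage (the path is an assumption, not taken from the post):
# watch("/vmfs/volumes/datastore1/EXAMPLE", interval_s=30, samples=10)
```

Dividing the per-sample change by the interval gives an approximate MB/sec figure to compare against what the array reports.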
I'm deducing that one snapshot file is being consolidated whilst the other one collects delta during the consolidation process. Then the new one is consolidated, and another delta is created during that process.
The file sizes are dropping with each iteration (well, most iterations), so I assume that eventually this consolidation procedure will complete (maybe I'll need to take the VM off the network for 30 minutes to let this finish without generating any changes).
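My guess about the iterations can be turned into a toy model. The sketch below is only that: it assumes a constant consolidation rate and a constant guest write rate, and it does not claim to describe what ESXi does internally. The function name, parameters and the example write rate are all hypothetical.

```python
def simulate_consolidation(initial_delta_mb, consolidate_mb_per_s,
                           guest_write_mb_per_s, stop_below_mb=1.0,
                           max_passes=1000):
    """Return a list of (pass_number, delta_mb, seconds_for_pass).

    Simplified model: while one delta is merged, the guest writes a new
    delta at a constant rate, and that new delta is merged on the next pass.
    """
    if consolidate_mb_per_s <= 0:
        raise ValueError("consolidation rate must be positive")
    passes = []
    delta = float(initial_delta_mb)
    for n in range(1, max_passes + 1):
        seconds = delta / consolidate_mb_per_s
        passes.append((n, delta, seconds))
        # Changes written by the guest while this pass was running.
        delta = guest_write_mb_per_s * seconds
        if delta < stop_below_mb:
            break
    return passes


# Assumed numbers: a 274 MB delta, 100 MB per 2 minutes of consolidation,
# and a guessed 0.5 MB/s of guest writes.
for n, delta_mb, seconds in simulate_consolidation(274, 100 / 120, 0.5):
    print(f"pass {n}: {delta_mb:.1f} MB, {seconds / 60:.1f} min")
```

In this model the deltas shrink only while the guest writes more slowly than the merge runs. If the two rates are close, each pass is barely smaller than the last, which would fit the idea of taking the VM off the network to let it finish.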
It's taking around 2 minutes per hundred megs of delta to consolidate. This has certainly never happened before. Snapshot removal under a normal Veeam backup takes around 40 minutes (so certainly not fast, but not this slow).
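For scale, here is the arithmetic behind that figure as a quick sketch. The helper names are made up, and I'm assuming 1GB = 1024MB.

```python
def consolidation_rate_mb_per_s(megabytes, minutes):
    """Effective merge rate in MB/s from an observed size and duration."""
    return megabytes / (minutes * 60.0)


def hours_to_merge(delta_gb, rate_mb_per_s):
    """Hours needed to merge a delta of `delta_gb` at a constant rate."""
    return delta_gb * 1024.0 / rate_mb_per_s / 3600.0


rate = consolidation_rate_mb_per_s(100, 2)
print(f"{rate:.2f} MB/s")                   # 0.83 MB/s
print(f"{hours_to_merge(5, rate):.1f} h")   # 1.7 h for the original 5GB delta
```

So the effective rate is under 1MB/sec, well below even the 8MB/sec of background throughput the array reports. At that rate a single pass over the original 5GB delta would take under two hours, so the multi-hour total looks more like repeated passes than one slow merge.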
After 6 hours and 2 minutes, the snapshot has finally been removed. However, I'd still like to know how you would normally troubleshoot this sort of issue (outside of storage performance).
It is my understanding that ESXi snapshot removal can (and usually does) take a long time. Before a snapshot can be removed, the changes from the old snapshot need to be written to the next snapshot in order. I was taught to always delete snapshots from oldest to most recent to help this process run as quickly and efficiently as possible.
Naturally, the more changes between snapshots the longer the merge will take.