I have a Dell PowerEdge R630 with 4 drives in a RAID array. I'm not sure whether it's RAID 10 or RAID 5, because I didn't order or set up the server originally. I'm just the default network admin; it's not my primary job. The server runs vSphere Essentials with ESXi 6.7 and hosts half a dozen VMs.
I use Altaro VM Backup, running in a VM on another host, to back up this host as well as an ESXi 6.5 host. When I started backing up the VMs on this host, I found that the backups would fail at random. On any given night, 2 or 3 of the 5 VMs I'm backing up would fail, but not the same VMs each night. A couple of weeks ago they started failing every time.
While working with Altaro support to find out why the backups were failing, they found this in the Altaro logs:
2019/09/24 00:11:31.034: DISKLIB-LINK : "san://snapshot-155[Storage] VMName/[email protected]:[email protected]/XXX" : failed to open (Unknown error).
2019/09/24 00:11:31.034: DISKLIB-CHAIN : "san://snapshot-155[Storage] VMName/[email protected]:[email protected]/XXX" : failed to open (Unknown error).
2019/09/24 00:13:18.446: VixDiskLib: Detected DiskLib error 2338 (NBD_ERR_NETWORK_CONNECT).
2019/09/24 00:13:18.446: VixDiskLib: VixDiskLib_Read: Read 437 sectors at 19619760 failed. Error 14009 (The server refused connection) (DiskLib error 2338: NBD_ERR_NETWORK_CONNECT) at 5235.
Their support says these log entries, I assume the last line in particular, came directly from the host.
Not being an ESXi expert, I'm not sure which ESXi log files to look at to figure out what is going wrong, to confirm that it's a drive problem on the host, and to determine which drive it is so I can replace it. So far vCenter is not raising any alerts or warnings about a drive problem, and the host is not reporting a problem with the array.
Another data point: most of these VMs run Windows. Each of them runs Windows Backup internally to a separate drive, and those backups all complete with no errors. I find it interesting that Windows can back up its drives from inside the VM, yet there is a read error when the backup is made from outside through ESXi.
It's not a host hard drive problem. The log is telling you that the backup failed to open the VM's virtual hard drive because of a network error (NBD_ERR_NETWORK_CONNECT, "The server refused connection"), not because of a disk read error.
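To see how often each error shows up, the relevant codes can be pulled out of the Altaro log with a few lines of Python. This is only a rough sketch: the regular expression is written against the two VixDiskLib lines quoted in the question and may not match other log formats, and the function name `tally_disklib_errors` is made up for illustration.

```python
import re
from collections import Counter

# Matches both forms seen in the quoted log (assumption: other lines follow suit):
#   "DiskLib error 2338 (NBD_ERR_NETWORK_CONNECT)"
#   "DiskLib error 2338: NBD_ERR_NETWORK_CONNECT"
DISKLIB_ERROR = re.compile(r"DiskLib error (\d+)[:\s(]+([A-Z_]+)")


def tally_disklib_errors(lines):
    """Count (code, name) pairs for every DiskLib error found in the lines."""
    counts = Counter()
    for line in lines:
        for code, name in DISKLIB_ERROR.findall(line):
            counts[(int(code), name)] += 1
    return counts


# The two VixDiskLib lines from the question, used as sample input.
SAMPLE = [
    "2019/09/24 00:13:18.446: VixDiskLib: Detected DiskLib error 2338 "
    "(NBD_ERR_NETWORK_CONNECT).",
    "2019/09/24 00:13:18.446: VixDiskLib: VixDiskLib_Read: Read 437 sectors "
    "at 19619760 failed. Error 14009 (The server refused connection) "
    "(DiskLib error 2338: NBD_ERR_NETWORK_CONNECT) at 5235.",
]

for (code, name), n in tally_disklib_errors(SAMPLE).items():
    print(code, name, n)
```

If every failure in the full log comes back as an NBD_ERR_NETWORK_* code, that points at the path between the backup VM and the host rather than at the array.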
My guess is that backups of the VMs that are on the same host as the Altaro backup VM don't fail. Is that right?
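A quick way to test the network theory is to check, from the VM that runs Altaro, whether it can open a TCP connection to the ESXi host at all. The sketch below assumes the backup is using NBD transport, which normally talks to the host on TCP port 902; confirm the port and transport mode for your setup. The host name in the usage comment is a placeholder, and `can_connect` is just an illustrative helper.

```python
import socket

# Assumption: NBD/NBDSSL transport to an ESXi host normally uses TCP 902.
NBD_PORT = 902


def can_connect(host, port=NBD_PORT, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


# Example usage (replace the placeholder with your ESXi host's name or IP):
#   print(can_connect("esxi-host.example.local"))
```

Running this repeatedly during the backup window would show whether the connection is refused only intermittently, which would match the random failures. A successful connect only proves the port is reachable; it says nothing about whether the host later drops the session.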