How can a guest inside ESX find I/O problems like this?
[ 40.601502] end_request: critical target error, dev sdg, sector 430203456
[ 40.601563] sd 2:0:6:0: [sdg] Unhandled sense code
[ 40.601582] sd 2:0:6:0: [sdg] Result: hostbyte=invalid driverbyte=DRIVER_SENSE
[ 40.601622] sd 2:0:6:0: [sdg] Sense Key : Hardware Error Sense Key : Hardware Error [current] [current]
[ 40.601661] sd 2:0:6:0: [sdg] Add. Sense: Internal target failureAdd. Sense: Internal target failure
[ 40.601695] sd 2:0:6:0: [sdg] CDB: Write(10)Write(10):: 2a 2a 00 00 02 19 64 a4 05 62 c0 80 00 00 00 00 40 40 00 00
- Physically the data is on VMFS, stored on a RAID6 array (Adaptec 5805), which seems healthy
- The ESX host also does not log any problems
- The disk size reported by the guest matches the provisioned disk size
- Through ESX the guest has 9 identical 'drives' attached, and only 2 exhibit this problem
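One thing I can try to narrow this down from inside the guest is to check whether the exact sector from the dmesg output is reproducibly unreadable (a minimal sketch, assuming GNU dd so that iflag=direct is available to bypass the page cache; the sector number is the one from the log above):

# Re-read the single failing sector reported by the kernel on /dev/sdg
dd if=/dev/sdg of=/dev/null bs=512 skip=430203456 count=1 iflag=direct
# Sweep a ~32 MiB region around that sector (210060 is roughly sector 430203456 expressed in 1 MiB blocks)
dd if=/dev/sdg of=/dev/null bs=1M skip=210060 count=32 iflag=direct

If these fail consistently while the ESX host stays silent, the problem presumably sits below the guest, in the datastore or the underlying array.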
I've experienced a similar thing on a backup volume for MS SQL in a Windows 2008 guest under ESX 4.0 - it's a raw volume exposed from a NetApp filer.
The guest OS reported bad sectors on that volume (and still reports them).
I think this happened because of too many I/O write operations, a temporary timeout, or filer overload.
No more bad sectors have been reported since. NetApp "disk scrubbing" says all is OK, and no filer errors are logged.
But we are going to recreate this volume anyway and see if that fixes it.
How about your other volumes on this filer? Can you please check this volume with the "badblocks /dev/sdg" command? (caution: huge read overhead)
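If a full badblocks pass is too expensive, it can also be pointed at just the region around the failing sector (a sketch; non-destructive read-only mode is the badblocks default, and the block range here is only illustrative):

# Full read-only scan of the whole virtual disk (slow, heavy read load)
badblocks -sv /dev/sdg
# Or restrict it to roughly 100k sectors around sector 430203456
# (badblocks takes the last block first, then the first block)
badblocks -sv -b 512 /dev/sdg 430303456 430103456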
It was a hardware/firmware problem after all. While the Adaptec 5805 (with the latest firmware) was reporting all RAID6 volumes to be in optimal state, it also reported one volume to contain 'Failed Stripes'. The effect of this seems to be that part of the RAID6 volume becomes unreadable (causing the errors quoted in the question). ESX does not seem to see this directly, but running
dd if=/dev/zero of=file-on-damaged-volume
directly on the ESXi console ended in an I/O error while there was still plenty of space on the volume. No amount of arcconf verify / verify_fix runs on volumes and physical devices was able to detect or fix anything.
Eventually I moved all data away from the volume and re-created it at the Adaptec level. Now all is well, but my trust in Adaptec's ability to safeguard my data is severely damaged.
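In case someone else runs into this: the 'Failed Stripes' indication only showed up in the per-logical-device details, not in the overall 'Optimal' status, which is presumably why it is easy to overlook. A sketch of the commands involved, assuming arcconf is installed on the host and the controller is number 1 (the logical drive number is just an example):

# Show logical device details on controller 1; check the "Failed Stripes" line
# for each volume instead of trusting the "Optimal" status alone
arcconf getconfig 1 ld
# The verify/fix task that did not help in my case, for reference
arcconf task start 1 logicaldrive 0 verify_fix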