The problem: I have a lot of disk I/O errors on my server, and they are causing multiple server failures:
- VMs are rebooting because of IO errors "task xyz/sdaX blocked for more than 120 seconds"
- Backups are not working because VSS needs too much time.
- Writing to the HDDs is not possible, or the transfer is extremely slow with massive retry events
- Disks are disappearing and stay disappeared until I power cycle the server
Windows: "The IO operation at logical block address X for Disk (2|5|7|8) was retried"
Linux: "Buffer I/O error on dev sdX1, logical block Y, lost async page write"
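To see whether the errors cluster on one disk or are spread across all of them (which would point towards a shared component rather than a single drive), the log lines can be tallied per disk. This is only a rough sketch: the two patterns are modelled on the messages quoted above, the function name `count_errors_per_disk` is made up for illustration, and real log lines may be worded differently.

```python
import re
from collections import Counter

# Patterns modelled on the two messages quoted above (assumption: real logs
# follow the same wording; adjust the expressions if they do not).
WINDOWS_RETRY = re.compile(
    r"The IO operation at logical block address (?P<lba>\S+) for Disk (?P<disk>\d+)"
)
LINUX_BUFFER = re.compile(
    r"Buffer I/O error on dev (?P<dev>[a-z]+)\d*, logical block (?P<block>\d+)"
)


def count_errors_per_disk(lines):
    """Return a Counter mapping a disk identifier to its number of error lines."""
    counts = Counter()
    for line in lines:
        match = WINDOWS_RETRY.search(line)
        if match:
            counts["Disk " + match.group("disk")] += 1
            continue
        match = LINUX_BUFFER.search(line)
        if match:
            # Partitions (sdb1, sdb2, ...) are folded into their parent disk (sdb).
            counts[match.group("dev")] += 1
    return counts
```

Feeding it an exported event log or saved kernel log, one line per entry, gives a per-disk count; lines that match neither pattern are ignored.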
My Server:
Mainboard: Supermicro XDRi
CPU: 2x E5-2630v3
RAM: 8x32GB DDR4 (8x Samsung M386A4G40DM0)
Disks:
4x WD Red 3TB
2x WD Red 6TB
2x SM863 2TB
1x Intel SSDSC2BX200G4 200GB
1x Samsung 940 Evo - 256GB
OS: Hyper-V 2012 R2
Controller: Onboard Intel C612 | HighPoint Rocket 2720SGL | HighPoint Rocket 640L
RAID: I'm not using any hardware RAID - I use MS Storage Spaces, but the problem also occurs without any software RAID.
What I tried:
- Changed all SATA / SAS cables (2x!)
- Changed the SATA controller (2x!)
- Changed the HDD bay slot
- Tested every single disk in my workstation - no SMART / write / read errors
- Reinstalled the host system
- Installed older / newer drivers
- Updated BIOS / firmware
- Reset BIOS settings / disabled power saving options
- CPU / RAM test
I can reproduce the I/O errors whenever I write data to the disks (only the HDDs - no issues with my SSDs). It does not matter whether the host runs Windows or Linux.
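A write test along these lines could be used to trigger the problem on demand. It is a minimal sketch and not the exact procedure I used: the function name `write_test` and its parameters are made up for illustration, and some write-back errors may only appear in the system log instead of being raised to the program.

```python
import os
import time


def write_test(path, total_mib=64, chunk_mib=4):
    """Write total_mib MiB to path and return (bytes_written, seconds, error)."""
    chunk = os.urandom(chunk_mib * 1024 * 1024)
    written = 0
    error = None
    start = time.monotonic()
    try:
        with open(path, "wb") as f:
            for _ in range(total_mib // chunk_mib):
                f.write(chunk)
                written += len(chunk)
            f.flush()
            # Ask the OS to push the data to the device, so that a failing
            # write is more likely to be reported here than silently later.
            os.fsync(f.fileno())
    except OSError as exc:
        error = exc
    elapsed = time.monotonic() - start
    return written, elapsed, error
```

Calling it with a file on the affected disk (for example a hypothetical `/mnt/hdd1/test.bin`) and a `total_mib` of a few thousand, then dividing the bytes written by the elapsed seconds, gives a rough throughput figure to compare between the HDDs and the SSDs.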
Do you have any idea what this could be?
It seems that the power cables were not OK. I replaced the power cables from the PSU to the backplane, and now everything is working - I was able to test at 1.5 Gb/s without a single disk I/O error.
I still can't imagine how this could happen.