When copying large files (50+ GB) from an NVMe disk to a 7200 rpm SATA HDD, I see the following error in the logs on a fully patched Ubuntu 20.04 system:
Aug 08 00:45:59 host kernel: ata6.00: exception Emask 0x20 SAct 0x0 SErr 0x0 action 0x6 frozen
Aug 08 00:45:59 host kernel: ata6.00: irq_stat 0x20000000, host bus error
Aug 08 00:45:59 host kernel: ata6.00: failed command: WRITE DMA EXT
Aug 08 00:45:59 host kernel: ata6.00: cmd 35/00:08:30:a2:e0/00:00:e8:00:00/e0 tag 23 dma 4096 out
res 50/00:00:00:00:00/00:00:00:00:00/00 Emask 0x20 (host bus error)
Aug 08 00:45:59 host kernel: ata6.00: status: { DRDY }
Aug 08 00:45:59 host kernel: ata6: hard resetting link
Aug 08 00:46:00 host kernel: ata6: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Aug 08 00:46:00 host kernel: ata6.00: configured for UDMA/133
Aug 08 00:46:00 host kernel: ata6: EH complete
ata6.00 is the disk that is being written to.
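For anyone who wants to double-check which block device sits behind a given ATA port, here is a minimal sketch. It assumes the usual libata sysfs layout, where the resolved path of /sys/block/<dev> contains an ataN component. The function name ata_port_of and the optional sysfs-root argument are my own illustrative choices, not part of any tool.

```shell
# Sketch: print the ATA port (e.g. "ata6") that a block device hangs off.
# Assumes the resolved sysfs path of the device contains an /ataN/ component.
# $1 = block device name (e.g. sda); $2 = sysfs root (hypothetical, defaults to /sys).
ata_port_of() {
    dev="$1"
    root="${2:-/sys}"
    # Resolve the /sys/block/<dev> symlink to the full device path.
    path=$(readlink -f "$root/block/$dev") || return 1
    # Extract the last /ataN/ path component, if any.
    printf '%s\n' "$path" | sed -n 's|.*/\(ata[0-9][0-9]*\)/.*|\1|p'
}
```

Usage would be something like `ata_port_of sda`, which should print ata6 if sda really is the disk shown in the log above.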
The issue is intermittent. Sometimes it does not appear for 24 hours; sometimes it appears a couple of times per hour.
Often the disk recovers, but sometimes the filesystem becomes corrupt and has to be unmounted, repaired (if possible), and remounted.
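To put numbers on "intermittent", the events can be counted per hour from the kernel log. This is a rough sketch that assumes the default journalctl line format shown above (month, day, time, host, ...). It counts only the irq_stat line, so each event is counted once. The function name count_bus_errors_per_hour is just an illustrative choice.

```shell
# Sketch: count "host bus error" events per hour from kernel log lines on stdin.
# Only the "irq_stat ... host bus error" line is matched, so the "res ..." line
# that repeats the same phrase does not double-count an event.
count_bus_errors_per_hour() {
    awk '/irq_stat.*host bus error/ {
             split($3, t, ":")                  # $3 is HH:MM:SS
             key = $1 " " $2 " " t[1] ":00"     # bucket by month, day, hour
             n[key]++
         }
         END { for (k in n) print k, n[k] }' | sort
}
```

It can be fed from the journal, for example `journalctl -k --no-pager | count_bus_errors_per_hour`. Each output line is an hour bucket followed by the number of events in that hour.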
What I tried:
- I tried 3 different brands of HDD. All have the same issue.
- I suspected a hardware issue, so I replaced the motherboard and the SATA cables. Neither helped.
- I have another server with an identical configuration. The issue does not occur there. Same workload.
- I have yet another server with a completely different configuration (Intel vs. AMD). The issue occurs there. Same workload.
- I disabled NCQ via echo 1 > /sys/block/sda/device/queue_depth. This did not help.
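For completeness, this is roughly how the NCQ setting from the last item can be verified. It is a sketch that assumes a queue depth of 1 means NCQ is effectively off for a libata device. The function name ncq_state and the optional sysfs-root argument are hypothetical helpers, not an existing utility.

```shell
# Sketch: report whether NCQ looks enabled for a block device, based on its
# sysfs queue_depth (a depth of 1 is treated as "NCQ disabled").
# $1 = block device name (e.g. sda); $2 = sysfs root (hypothetical, defaults to /sys).
ncq_state() {
    dev="$1"
    root="${2:-/sys}"
    f="$root/block/$dev/device/queue_depth"
    if [ ! -r "$f" ]; then
        echo "unknown"
        return 1
    fi
    depth=$(cat "$f")
    if [ "$depth" -le 1 ]; then
        echo "disabled"
    else
        echo "enabled"
    fi
}
```

Running `ncq_state sda` after the echo should print disabled. Note that a value written to sysfs this way does not survive a reboot, so it has to be reapplied after each boot.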
I have run out of ideas.
These are all data center grade components. Given the steps I've taken, I suppose it's not a hardware manufacturing defect.
Could this be software/OS/BIOS related?
Any ideas what else I should try?