When copying large files (50+ GB) from an NVMe disk to a 7200 rpm SATA HDD, I see the following error in the logs on a fully patched Ubuntu 20.04:
Aug 08 00:45:59 host kernel: ata6.00: exception Emask 0x20 SAct 0x0 SErr 0x0 action 0x6 frozen
Aug 08 00:45:59 host kernel: ata6.00: irq_stat 0x20000000, host bus error
Aug 08 00:45:59 host kernel: ata6.00: failed command: WRITE DMA EXT
Aug 08 00:45:59 host kernel: ata6.00: cmd 35/00:08:30:a2:e0/00:00:e8:00:00/e0 tag 23 dma 4096 out
res 50/00:00:00:00:00/00:00:00:00:00/00 Emask 0x20 (host bus error)
Aug 08 00:45:59 host kernel: ata6.00: status: { DRDY }
Aug 08 00:45:59 host kernel: ata6: hard resetting link
Aug 08 00:46:00 host kernel: ata6: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Aug 08 00:46:00 host kernel: ata6.00: configured for UDMA/133
Aug 08 00:46:00 host kernel: ata6: EH complete
ata6.00 is the disk which is being written to.
The issue is intermittent. Sometimes it does not appear for 24 hours; sometimes it occurs a couple of times per hour.
Often the disk recovers, but sometimes the filesystem becomes corrupt and needs to be unmounted, repaired (if possible), and remounted.
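To put numbers on how often the error recurs, the exception lines can be tallied per device and per hour. The following is a minimal sketch, assuming the log lines have the same short timestamp format as the excerpt above; `count_ata_exceptions` is a hypothetical helper name, not part of any existing tool.

```python
import re
from collections import Counter

# Matches lines such as:
#   Aug 08 00:45:59 host kernel: ata6.00: exception Emask 0x20 SAct 0x0 ...
# Only the "exception" line is counted, so each error event is counted once.
EXCEPTION_RE = re.compile(
    r"^(?P<mon>\w{3})\s+(?P<day>\d{1,2}) (?P<hour>\d{2}):\d{2}:\d{2} "
    r"\S+ kernel: (?P<dev>ata\d+(?:\.\d+)?): exception Emask"
)


def count_ata_exceptions(lines):
    """Return a Counter mapping (device, 'Mon DD HH') to the number of exceptions."""
    counts = Counter()
    for line in lines:
        m = EXCEPTION_RE.match(line)
        if m is None:
            continue  # not an ATA exception line
        bucket = f"{m['mon']} {int(m['day']):02d} {m['hour']}"
        counts[(m["dev"], bucket)] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Example: journalctl -k | python3 this_script.py
    for (dev, hour), n in sorted(count_ata_exceptions(sys.stdin).items()):
        print(f"{dev}  {hour}:00  {n}")
```

Comparing the hourly counts against the times of the large copies would show whether the errors track the write load or appear independently of it.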
What I tried:
- I tried 3 different brands of HDD. All have the same issue.
- I suspected a hardware issue, so I replaced the motherboard and the SATA cables. Neither helped.
- I have another server with an identical configuration. The issue does not occur there. Same workload.
- I have yet another server with a completely different configuration (Intel vs. AMD). The issue occurs there. Same workload.
- I disabled NCQ via `echo 1 > /sys/block/sda/device/queue_depth`. It did not help.
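Because a value written to sysfs does not survive a reboot, it is worth confirming that the queue depth was still 1 at the time an error occurred. The following is a minimal sketch, assuming `sda` is the block device behind ata6.00; the `sysfs_root` parameter is a hypothetical addition that lets the function be tried against a copy of the directory tree.

```python
from pathlib import Path


def read_queue_depth(device="sda", sysfs_root="/sys"):
    """Return the current queue depth of a block device as an int."""
    path = Path(sysfs_root) / "block" / device / "device" / "queue_depth"
    return int(path.read_text().strip())


def ncq_disabled(device="sda", sysfs_root="/sys"):
    """A queue depth of 1 allows one outstanding command, so NCQ is not in use."""
    return read_queue_depth(device, sysfs_root) == 1


if __name__ == "__main__":
    print("queue_depth:", read_queue_depth("sda"))
    print("NCQ disabled:", ncq_disabled("sda"))
```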
I ran out of ideas...
These are all data center grade components. Given the steps I've taken, I suppose it's not a hardware manufacturing defect.
Could this be software/OS/BIOS related?
Any ideas what else I should try?
Perhaps this is more a problem of operating temperature? Once the disk is in constant use, it may, depending on its physical position, gain heat faster than it can shed it, leading to erratic behaviour.
On newer kernels like yours, the drive temperature is exposed in sysfs at this path:
Make sure that the drivetemp module is loaded with `modprobe drivetemp`.
You could monitor the files in there while starting a large file copy again; the kernel documentation explains how these files are to be interpreted.
They include useful values such as the minimum and maximum operating temperatures. Some drivers also offer alarm indicators, which are chip-dependent and are triggered on a fault.
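The monitoring suggested above could be sketched as follows. This is a rough example under two assumptions the answer does not confirm: that the sensors appear under the standard hwmon class directory (`/sys/class/hwmon/hwmon*/`), and that the drive reports the optional `temp1_min`, `temp1_max`, and `temp1_crit` limits. hwmon temperature files hold millidegrees Celsius, so the values are divided by 1000. `read_drive_temps` is a hypothetical helper name.

```python
from pathlib import Path

# Attribute files to read; any the drive does not provide are skipped.
TEMP_ATTRS = ("temp1_input", "temp1_min", "temp1_max", "temp1_crit")


def read_drive_temps(hwmon_root="/sys/class/hwmon"):
    """Return {hwmon_dir_name: {attribute: degrees_celsius}} for drivetemp sensors."""
    result = {}
    for hwmon in sorted(Path(hwmon_root).glob("hwmon*")):
        name_file = hwmon / "name"
        # Skip sensors that belong to other drivers (CPU, NIC, ...).
        if not name_file.is_file() or name_file.read_text().strip() != "drivetemp":
            continue
        readings = {}
        for attr in TEMP_ATTRS:
            attr_file = hwmon / attr
            if attr_file.is_file():
                # hwmon reports temperatures in millidegrees Celsius.
                readings[attr] = int(attr_file.read_text().strip()) / 1000.0
        result[hwmon.name] = readings
    return result


if __name__ == "__main__":
    import time

    # Poll every 10 seconds while the large copy is running.
    while True:
        print(time.strftime("%H:%M:%S"), read_drive_temps())
        time.sleep(10)
```

Logging these readings alongside the kernel log timestamps would show whether the ATA exceptions coincide with the drive approaching its reported maximum.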
Seems to be resolved by upgrading to Ubuntu 21.04. No idea why though. The server runs stable now without any ATA issues.