TL;DR: I thought I had data-at-rest corruption on 2 SSDs, but I now think the corruption happens after the data is read from disk. How can I diagnose which part is failing?
My ML training algorithm opens thousands of files (read-only), and yesterday one of the files showed up corrupted. However, when I started exploring the differences between the 3 copies (1 on each of 2 SSDs and 1 on an HDD), things got stranger. All of the dates and sizes matched perfectly, but the md5sums showed differences in 10 files.
What is even stranger, after I made sure all 3 copies were in sync (using rsync with checksum), a different file on 1 SSD randomly showed corruption. So I compared the md5sum, and it was the odd one out of the 3 copies. However, when I tested it again 2 minutes later, the md5sum matched the other 2. This shows that it isn't corruption on the disk (data-at-rest).
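The reasoning above (a mismatch that disappears on a re-read points away from the data on disk) can be checked more systematically. The following is only a minimal sketch, not the procedure I actually used: the names `md5_of_file` and `classify_reads` are made up for illustration, and the `os.posix_fadvise` call is an advisory hint that asks the kernel to drop cached pages so that a re-read is more likely to hit the drive again; it is not a guarantee.

```python
import hashlib
import os
from collections import Counter


def md5_of_file(path, chunk_size=1 << 20, drop_cache=False):
    """Return the hex MD5 of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        if drop_cache and hasattr(os, "posix_fadvise"):
            # Advisory only: ask the kernel to discard cached pages for this
            # file, so the read below is less likely to be served from RAM.
            os.posix_fadvise(fh.fileno(), 0, 0, os.POSIX_FADV_DONTNEED)
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def classify_reads(path, reference_md5, attempts=5):
    """Hash `path` several times and compare each result to a known-good MD5."""
    results = Counter(
        md5_of_file(path, drop_cache=True) for _ in range(attempts)
    )
    bad = attempts - results[reference_md5]
    if bad == 0:
        return "consistent"
    if bad == attempts:
        return "always-wrong"  # looks like corruption of the data at rest
    return "intermittent"      # suggests a problem somewhere in the read path
```

Here `reference_md5` is assumed to be the value that the other 2 copies agree on. An "intermittent" result for a file matches what I saw: the same file hashing differently on different reads.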
How do I go about figuring out what is failing? I'm going to run a long memtest (which previously passed, a year ago), but I'm unsure what else I can do.
Specs
- Dell T7500 (A18 BIOS - latest from Dell)
- 2x Xeon X5675
- 64GB (4x16GB ECC)
- Drives:
  - Samsung 850 EVO 250GB (SSD FW: EMT03B6Q)
  - Samsung 860 EVO 500GB (SSD FW: RVT01B6Q)
  - WD Blue 4TB (HDD FW: 80.00A80)
- All 3 drives are connected to:
  - IO Crest 4-port SATA III PCIe 2.0 x2 Controller Card Green, SI-PEX40057 (chipset Marvell 88SE9230)
    - Used because the motherboard is SATA 2.0, and I needed the higher throughput. This was the only SATA card that I could boot from, given the Dell's BIOS limitations.
output of free -h
(cache is full because I just ran a new batch of md5sums on all 3 drives)
              total        used        free      shared  buff/cache   available
Mem:            62G        1.2G        312M         11M         61G         61G
Swap:          2.0G          0B        2.0G
output of sudo lshw -C memory
(I can confirm the 4 sticks are sitting in the correct slots according to the manual. MB DIMM 1 and 2, Riser DIMM 1 and 2)
*-firmware
description: BIOS
vendor: Dell Inc.
physical id: 0
version: A18
date: 10/15/2018
size: 64KiB
capacity: 1984KiB
capabilities: pci pnp apm upgrade shadowing escd cdboot bootselect edd int13floppytoshiba int13floppy720 int5printscreen int9keyboard int14serial int17printer acpi usb biosbootspecification netboot
*-cache:0
description: L1 cache
physical id: 700
size: 384KiB
capacity: 384KiB
capabilities: internal write-back unified
configuration: level=1
*-cache:1
description: L2 cache
physical id: 701
size: 1536KiB
capacity: 1536KiB
capabilities: internal varies unified
configuration: level=2
*-cache:2
description: L3 cache
physical id: 704
size: 12MiB
capacity: 12MiB
capabilities: internal varies unified
configuration: level=3
*-cache:0
description: L1 cache
physical id: 702
size: 384KiB
capacity: 384KiB
capabilities: internal write-back unified
configuration: level=1
*-cache:1
description: L2 cache
physical id: 703
size: 1536KiB
capacity: 1536KiB
capabilities: internal varies unified
configuration: level=2
*-cache:2
description: L3 cache
physical id: 705
size: 12MiB
capacity: 12MiB
capabilities: internal varies unified
configuration: level=3
*-memory
description: System Memory
physical id: 1000
slot: System board or motherboard
size: 64GiB
capabilities: ecc
configuration: errordetection=multi-bit-ecc
*-bank:0
description: DIMM DDR3 1333 MHz (0.8 ns)
product: 9965516-433.A00LF
vendor: AMD
physical id: 0
serial: CF38EF94
slot: DIMM 1
size: 16GiB
width: 64 bits
clock: 1333MHz (0.8ns)
*-bank:1
description: DIMM DDR3 1333 MHz (0.8 ns)
product: 9965434-110.A00LF
vendor: AMD
physical id: 1
serial: 2D25C605
slot: DIMM 2
size: 16GiB
width: 64 bits
clock: 1333MHz (0.8ns)
*-bank:2
description: DIMM DDR3 Synchronous [empty]
vendor: FFFFFFFFFFFF
physical id: 2
serial: FFFFFFFF
slot: DIMM 3
width: 64 bits
*-bank:3
description: DIMM DDR3 Synchronous [empty]
vendor: FFFFFFFFFFFF
physical id: 3
serial: FFFFFFFF
slot: DIMM 4
width: 64 bits
*-bank:4
description: DIMM DDR3 Synchronous [empty]
vendor: FFFFFFFFFFFF
physical id: 4
serial: FFFFFFFF
slot: DIMM 5
width: 64 bits
*-bank:5
description: DIMM DDR3 Synchronous [empty]
vendor: FFFFFFFFFFFF
physical id: 5
serial: FFFFFFFF
slot: DIMM 6
width: 64 bits
*-bank:6
description: DIMM DDR3 1333 MHz (0.8 ns)
product: 9965434-110.A00LF
vendor: AMD
physical id: 6
serial: 2E25EB05
slot: RISER DIMM 1
size: 16GiB
width: 64 bits
clock: 1333MHz (0.8ns)
*-bank:7
description: DIMM DDR3 1333 MHz (0.8 ns)
product: 9965434-110.A00LF
vendor: AMD
physical id: 7
serial: 2F25DC05
slot: RISER DIMM 2
size: 16GiB
width: 64 bits
clock: 1333MHz (0.8ns)
*-bank:8
description: DIMM DDR3 Synchronous [empty]
vendor: FFFFFFFFFFFF
physical id: 8
serial: FFFFFFFF
slot: RISER DIMM 3
width: 64 bits
*-bank:9
description: DIMM DDR3 Synchronous [empty]
vendor: FFFFFFFFFFFF
physical id: 9
serial: FFFFFFFF
slot: RISER DIMM 4
width: 64 bits
*-bank:10
description: DIMM DDR3 Synchronous [empty]
vendor: FFFFFFFFFFFF
physical id: a
serial: FFFFFFFF
slot: RISER DIMM 5
width: 64 bits
*-bank:11
description: DIMM DDR3 Synchronous [empty]
vendor: FFFFFFFFFFFF
physical id: b
serial: FFFFFFFF
slot: RISER DIMM 6
width: 64 bits
Update 1
Dell's built-in system diagnostics ran without issue (I skipped their memory tests and ran memtest86 instead).
Finished tests 1-8 of memtest86 v4 without issues.
I wrote a python script to get a dictionary of all the md5sums in a directory and ran it against the 3 copies simultaneously (but only 1 thread per drive*). It found 7 new discrepancies (out of 3000 files). These were about evenly divided among the 3 drives (so it isn't just an issue with the SSDs). And when I went back to check each of the 7 odd ones out, each md5sum now matched the other 2.
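For anyone who wants to repeat this kind of check, here is a rough sketch of the idea. It is not my actual script: the function names `md5_tree` and `odd_ones_out` are hypothetical, and unlike my run it hashes the copies one after another rather than one thread per drive.

```python
import hashlib
import os
from collections import Counter


def md5_tree(root, chunk_size=1 << 20):
    """Map each file's path (relative to `root`) to its hex MD5."""
    sums = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            digest = hashlib.md5()
            with open(full, "rb") as fh:
                for chunk in iter(lambda: fh.read(chunk_size), b""):
                    digest.update(chunk)
            sums[os.path.relpath(full, root)] = digest.hexdigest()
    return sums


def odd_ones_out(trees):
    """Given {label: md5_tree(...)}, return {relpath: [labels that disagree]}.

    The "majority" MD5 is simply the most common value for that file. If all
    copies differ, the first one seen is treated as the majority.
    """
    all_paths = set().union(*(t.keys() for t in trees.values()))
    report = {}
    for rel in sorted(all_paths):
        seen = {label: t.get(rel) for label, t in trees.items()}
        majority, _count = Counter(seen.values()).most_common(1)[0]
        odd = sorted(label for label, md5 in seen.items() if md5 != majority)
        if odd:
            report[rel] = odd
    return report
```

Calling `md5_tree` once per copy and passing the results to `odd_ones_out` (keyed by a label for each drive) gives a per-file list of which copy disagreed. A file missing from one copy shows up as a disagreement too, since its MD5 is recorded as `None`.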
Current ideas:
- I thought that having 2-3 workers accessing files on each drive simultaneously might have been the issue, but I've now done a few tests showing that the errors still show up with sequential access.
- The SATA card is bad in some way. I'll reconnect all 3 drives to the motherboard and run the same test again.
Seems likely to be the SATA card
I have now run 3 passes on all 3 drives after connecting them directly to the motherboard, with 0 md5sum discrepancies. It looks like the SATA card is flaky and destined for the trash.