I am seeing irregularities with large files. I have 64 GB of RAM, and my storage drives are all Samsung 860 EVO. I am running mdcrypt on top of my raw drives, luks RAID on top of that, and ext4 as file systems. I have lots of free drive space, and am not running swap.
My distribution is Ubuntu 18.04 LTS (4.18.0-25-generic #26~18.04.1-Ubuntu SMP Thu Jun 27 07:28:31 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux).
I first noticed this irregularity when cloning large USB thumb drives. Then I noticed that large loop-mounted file systems would also become corrupted.
Snippets follow:
I start my test by creating a 32 GB file of zeros:
$ dd if=/dev/zero of=zero-file_32GB bs=1024k count=32768
32768+0 records in
32768+0 records out
34359738368 bytes (34 GB, 32 GiB) copied, 55.3081 s, 621 MB/s
Then I SHA256 sum that file to get the hash code. Note how the hash changes with multiple iterations:
$ sha256sum zero-file_32GB
5f7f8ea75d87ac7d64c07ecc2c5fdbe407540913ac0feb050ede768990140b38 zero-file_32GB
$ sha256sum zero-file_32GB
64bcf7372df895319ce9e54758aec2814600fa3335fb82c5996a7636e7d637be zero-file_32GB
$ sha256sum zero-file_32GB
3475353b2a00e5abebb1878a9ddb5956eb829c94af26d9cd079f991fbd84435c zero-file_32GB
$ sha256sum zero-file_32GB
cf65fa70ba04d7bb4055b72fdf2ac90bf65ac8457cc80b8e673af5acb57d22d1 zero-file_32GB
The same inconsistencies happen with MD5 sum:
$ md5sum zero-file_32GB
8633b9ba83a8ac04c9b56fad0a065ec2 zero-file_32GB
$ md5sum zero-file_32GB
cc289d380b25235b7610a7b86bc4fd47 zero-file_32GB
$ md5sum zero-file_32GB
249f66bd3843b6fcad8316fd0a3e660c zero-file_32GB
$ md5sum zero-file_32GB
888ac00592204be7a026c27e98159ff2 zero-file_32GB
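The repeated runs above can be scripted so that a mismatch is flagged automatically. This is only a minimal sketch: hash_is_stable is a hypothetical helper name, and it assumes sha256sum from GNU coreutils is on the PATH.

```shell
# Hash FILE several times and report whether every run agrees.
# Usage: hash_is_stable FILE [RUNS]   (RUNS defaults to 4)
# Returns 0 if all runs agree, 1 on a mismatch, 2 if the file cannot be hashed.
hash_is_stable() {
    local file="$1" runs="${2:-4}" first="" sum i=1
    while [ "$i" -le "$runs" ]; do
        sum=$(sha256sum -- "$file") || return 2
        sum=${sum%% *}                      # keep only the hash column
        echo "run $i: $sum"
        if [ -z "$first" ]; then
            first=$sum
        elif [ "$sum" != "$first" ]; then
            echo "MISMATCH on run $i" >&2
            return 1
        fi
        i=$((i + 1))
    done
    echo "stable across $runs runs"
}
```

For example, hash_is_stable zero-file_32GB 4 would repeat the four sha256sum runs shown above and stop at the first differing hash.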
By now I am fairly confident that it is my file that is corrupted, and not the hashing algorithms. To test this hypothesis, I split my 32 GB zero-file into thirty-two 1 GB chunks:
$ split --verbose -b 1G zero-file_32GB split-1G_
creating file 'split-1G_aa'
creating file 'split-1G_ab'
creating file 'split-1G_ac'
creating file 'split-1G_ad'
creating file 'split-1G_ae'
creating file 'split-1G_af'
creating file 'split-1G_ag'
creating file 'split-1G_ah'
creating file 'split-1G_ai'
creating file 'split-1G_aj'
creating file 'split-1G_ak'
creating file 'split-1G_al'
creating file 'split-1G_am'
creating file 'split-1G_an'
creating file 'split-1G_ao'
creating file 'split-1G_ap'
creating file 'split-1G_aq'
creating file 'split-1G_ar'
creating file 'split-1G_as'
creating file 'split-1G_at'
creating file 'split-1G_au'
creating file 'split-1G_av'
creating file 'split-1G_aw'
creating file 'split-1G_ax'
creating file 'split-1G_ay'
creating file 'split-1G_az'
creating file 'split-1G_ba'
creating file 'split-1G_bb'
creating file 'split-1G_bc'
creating file 'split-1G_bd'
creating file 'split-1G_be'
creating file 'split-1G_bf'
I then SHA256 sum the new file splits. They should all be identical because they each should consist of only zeros. But notice the inconsistency at splits az and ba:
$ sha256sum split-1G_??
49bc20df15e412a64472421e13fe86ff1c5165e18b2afccf160d4dc19fe68a14 split-1G_aa
49bc20df15e412a64472421e13fe86ff1c5165e18b2afccf160d4dc19fe68a14 split-1G_ab
49bc20df15e412a64472421e13fe86ff1c5165e18b2afccf160d4dc19fe68a14 split-1G_ac
49bc20df15e412a64472421e13fe86ff1c5165e18b2afccf160d4dc19fe68a14 split-1G_ad
49bc20df15e412a64472421e13fe86ff1c5165e18b2afccf160d4dc19fe68a14 split-1G_ae
49bc20df15e412a64472421e13fe86ff1c5165e18b2afccf160d4dc19fe68a14 split-1G_af
49bc20df15e412a64472421e13fe86ff1c5165e18b2afccf160d4dc19fe68a14 split-1G_ag
49bc20df15e412a64472421e13fe86ff1c5165e18b2afccf160d4dc19fe68a14 split-1G_ah
49bc20df15e412a64472421e13fe86ff1c5165e18b2afccf160d4dc19fe68a14 split-1G_ai
49bc20df15e412a64472421e13fe86ff1c5165e18b2afccf160d4dc19fe68a14 split-1G_aj
49bc20df15e412a64472421e13fe86ff1c5165e18b2afccf160d4dc19fe68a14 split-1G_ak
49bc20df15e412a64472421e13fe86ff1c5165e18b2afccf160d4dc19fe68a14 split-1G_al
49bc20df15e412a64472421e13fe86ff1c5165e18b2afccf160d4dc19fe68a14 split-1G_am
49bc20df15e412a64472421e13fe86ff1c5165e18b2afccf160d4dc19fe68a14 split-1G_an
49bc20df15e412a64472421e13fe86ff1c5165e18b2afccf160d4dc19fe68a14 split-1G_ao
49bc20df15e412a64472421e13fe86ff1c5165e18b2afccf160d4dc19fe68a14 split-1G_ap
49bc20df15e412a64472421e13fe86ff1c5165e18b2afccf160d4dc19fe68a14 split-1G_aq
49bc20df15e412a64472421e13fe86ff1c5165e18b2afccf160d4dc19fe68a14 split-1G_ar
49bc20df15e412a64472421e13fe86ff1c5165e18b2afccf160d4dc19fe68a14 split-1G_as
49bc20df15e412a64472421e13fe86ff1c5165e18b2afccf160d4dc19fe68a14 split-1G_at
49bc20df15e412a64472421e13fe86ff1c5165e18b2afccf160d4dc19fe68a14 split-1G_au
49bc20df15e412a64472421e13fe86ff1c5165e18b2afccf160d4dc19fe68a14 split-1G_av
49bc20df15e412a64472421e13fe86ff1c5165e18b2afccf160d4dc19fe68a14 split-1G_aw
49bc20df15e412a64472421e13fe86ff1c5165e18b2afccf160d4dc19fe68a14 split-1G_ax
49bc20df15e412a64472421e13fe86ff1c5165e18b2afccf160d4dc19fe68a14 split-1G_ay
702301f26e8df8cf784ca6b45954f1ca3524d1e22c322ee271ab1ac20b4face2 split-1G_az
bd9442046cecfcdec29169f5e8485ee0e226f56fab24cfded23b4ad15275b5d9 split-1G_ba
49bc20df15e412a64472421e13fe86ff1c5165e18b2afccf160d4dc19fe68a14 split-1G_bb
49bc20df15e412a64472421e13fe86ff1c5165e18b2afccf160d4dc19fe68a14 split-1G_bc
49bc20df15e412a64472421e13fe86ff1c5165e18b2afccf160d4dc19fe68a14 split-1G_bd
49bc20df15e412a64472421e13fe86ff1c5165e18b2afccf160d4dc19fe68a14 split-1G_be
49bc20df15e412a64472421e13fe86ff1c5165e18b2afccf160d4dc19fe68a14 split-1G_bf
Similar inconsistencies with the MD5 sum:
$ md5sum split-1G_??
cd573cfaace07e7949bc0c46028904ff split-1G_aa
cd573cfaace07e7949bc0c46028904ff split-1G_ab
cd573cfaace07e7949bc0c46028904ff split-1G_ac
cd573cfaace07e7949bc0c46028904ff split-1G_ad
cd573cfaace07e7949bc0c46028904ff split-1G_ae
cd573cfaace07e7949bc0c46028904ff split-1G_af
cd573cfaace07e7949bc0c46028904ff split-1G_ag
cd573cfaace07e7949bc0c46028904ff split-1G_ah
cd573cfaace07e7949bc0c46028904ff split-1G_ai
cd573cfaace07e7949bc0c46028904ff split-1G_aj
cd573cfaace07e7949bc0c46028904ff split-1G_ak
cd573cfaace07e7949bc0c46028904ff split-1G_al
cd573cfaace07e7949bc0c46028904ff split-1G_am
cd573cfaace07e7949bc0c46028904ff split-1G_an
cd573cfaace07e7949bc0c46028904ff split-1G_ao
cd573cfaace07e7949bc0c46028904ff split-1G_ap
cd573cfaace07e7949bc0c46028904ff split-1G_aq
cd573cfaace07e7949bc0c46028904ff split-1G_ar
cd573cfaace07e7949bc0c46028904ff split-1G_as
cd573cfaace07e7949bc0c46028904ff split-1G_at
cd573cfaace07e7949bc0c46028904ff split-1G_au
cd573cfaace07e7949bc0c46028904ff split-1G_av
cd573cfaace07e7949bc0c46028904ff split-1G_aw
cd573cfaace07e7949bc0c46028904ff split-1G_ax
cd573cfaace07e7949bc0c46028904ff split-1G_ay
7036950003e53e471654b020330b386e split-1G_az
0a82f6068a91bef3b46294e1e30687be split-1G_ba
cd573cfaace07e7949bc0c46028904ff split-1G_bb
cd573cfaace07e7949bc0c46028904ff split-1G_bc
cd573cfaace07e7949bc0c46028904ff split-1G_bd
cd573cfaace07e7949bc0c46028904ff split-1G_be
cd573cfaace07e7949bc0c46028904ff split-1G_bf
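With this many chunks it is easy to miss an outlier by eye. A rough sketch that prints only the chunks whose hash differs from the most common one follows; odd_hashes is a hypothetical helper name, and it assumes the usual "HASH  NAME" output layout of sha256sum and md5sum.

```shell
# Read "HASH  NAME" lines (sha256sum or md5sum output) on stdin and print
# only the lines whose hash differs from the most common hash.
odd_hashes() {
    awk '
        { count[$1]++; line[NR] = $0; hash[NR] = $1 }
        END {
            # Find the hash that occurs most often (ties are broken arbitrarily).
            for (h in count) if (count[h] > best) { best = count[h]; common = h }
            # Print every input line that does not carry that hash.
            for (i = 1; i <= NR; i++) if (hash[i] != common) print line[i]
        }
    '
}
```

Usage would be along the lines of sha256sum split-1G_?? | odd_hashes, which for the listing above should leave only the az and ba lines.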
I considered splitting into progressively smaller chunks to determine the size of the actual discrepancy, and then analyzing it with a hex editor, but I doubt that would provide any insight into what is causing this data degradation. My ISO images, video files, and EXT4 looped filesystems are becoming damaged. Any idea what the culprit is?
Since this only starts to happen at 32 GB, which happens to be half the size of my 64 GB of RAM (and I am not using swap), I am inclined to believe that it is a memory issue. What say you?
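On the idea of narrowing down the damaged region: because the file should contain nothing but zeros, one alternative to repeated splitting is to compare a chunk directly against /dev/zero. The sketch below assumes GNU cmp (for the -n option); nonzero_bytes is a hypothetical helper name. If the reads themselves are unreliable, as the changing hashes suggest, the reported offsets may differ from run to run.

```shell
# List up to LIMIT bytes of FILE that are not zero (LIMIT defaults to 20).
# Output columns come from "cmp -l": 1-based byte offset, the byte value
# found in FILE (octal), and the expected value (always 0).
nonzero_bytes() {
    local file="$1" limit="${2:-20}" size
    size=$(wc -c < "$file") || return 2
    # -n stops the comparison at the end of FILE; /dev/zero never ends.
    cmp -l -n "$size" -- "$file" /dev/zero | head -n "$limit"
}
```

For example, nonzero_bytes split-1G_az would show the first few offsets in that chunk that do not read back as zero; an empty result means every byte read as zero on that pass.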
Update #1:
Unfortunately, the memory test did not take nearly as long as expected. :-(
Update #2 (Resolution!)
- I removed all 4 16-GB sticks from the computer.
- I then inserted only the lowest serial-numbered stick into the first DIMM slot, Slot #1 (my slots are numbered from 1 to 4). I ran MemTest86 for 3:44 (three hours, forty-four minutes), and it completed with zero errors.
- I replaced that stick with the next sequentially serial-numbered stick (after the obvious power down and electrostatic precautions). I once again used Slot #1 (because I wished to test all the memory first, before starting to check my slots). I ran MemTest86 again. This time the test aborted almost instantaneously, due to too many errors.
- Accordingly I inserted the third DIMM into Slot #1. MemTest86 ran for 3:43, without errors.
- The fourth DIMM in Slot #1 test also ran for 3:43 and without errors.
- I then inserted the three known good DIMMs into the first three slots. MemTest86 ran for 8:54 and without any errors.
I found it interesting that testing three DIMMs together (8:54) took significantly less time than the three single-module tests combined (3:44 + 3:43 + 3:43 = 11:10). I assume that some tests were done in tandem.
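For anyone checking the arithmetic, the run times can be compared in minutes. This is just a small sketch; to_minutes is a hypothetical helper name, and the durations are the ones quoted above.

```shell
# Convert an H:MM duration to minutes. One leading zero is stripped from each
# part so that values such as "08" are not parsed as invalid octal numbers.
to_minutes() {
    local h="${1%%:*}" m="${1##*:}"
    h=${h#0}; h=${h:-0}
    m=${m#0}; m=${m:-0}
    echo $(( h * 60 + m ))
}

separate=$(( $(to_minutes 3:44) + $(to_minutes 3:43) + $(to_minutes 3:43) ))
combined=$(to_minutes 8:54)
echo "separate: $separate min, combined: $combined min, saved: $(( separate - combined )) min"
# prints "separate: 670 min, combined: 534 min, saved: 136 min"
```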
I sha256sum'ed a new 32 GB zeros file. The sum remained unchanged even after multiple iterations. My sum was 97af759fc4597bc41706df77cbab318a57d935bacb262bd409e3ab767e07066f, the same number @bernard.wei presented.
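A reference sum for a file of zeros can also be produced without writing anything to disk, by hashing a stream from /dev/zero of the same length. The following is a sketch under that assumption; zeros_sha256 and is_all_zeros are hypothetical helper names. On a machine with faulty RAM the reference itself could of course come out wrong, so it is best computed on a known-good system.

```shell
# SHA-256 of SIZE zero bytes streamed from /dev/zero; nothing touches the disk.
zeros_sha256() {
    head -c "$1" /dev/zero | sha256sum | cut -d' ' -f1
}

# Succeeds only if FILE hashes to the same value as an equally long zero stream.
is_all_zeros() {
    local file="$1" size actual expected
    size=$(wc -c < "$file") || return 2
    actual=$(sha256sum -- "$file") || return 2
    actual=${actual%% *}                    # keep only the hash column
    expected=$(zeros_sha256 "$size")
    [ "$actual" = "$expected" ]
}
```

Running zeros_sha256 34359738368 gives a value to compare against the sum quoted above, and is_all_zeros zero-file_32GB performs the same comparison in one step.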
I would like to thank @heynnema for his advice on MemTest86. That was instrumental in troubleshooting this problem.
I consider this matter resolved. Thanks all!
- check for firmware updates for your Samsung 860 EVO. Samsung Magician is a Windows app used to check your firmware.
- check your BIOS version with
$ sudo dmidecode -s bios-version
and then go to the manufacturer's web site to check for a newer BIOS.
- run memtest to check your 64G RAM. Go to https://www.memtest86.com/ and download/run the free memtest to test your memory. Get at least one complete pass of all the tests to confirm good memory. This will take many hours to complete.
Update #1:
memtest failed in test 2/4, [Address test, own address]
memtest can fail for a few reasons...
- wrong spec RAM installed
- the BIOS is set to overclock the memory, or run them at max speed
- DIMM is incorrectly seated in its slot
- DIMM is defective
DIMMs are normally installed in pairs of equal size to take advantage of the speed gain from memory interleaving across two channels, A & B (or more in some cases). The first pair of DIMMs goes into slots A1/B1, and the second pair into A2/B2. (This assumes a desktop computer with four or more DIMM slots.)
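To see which DIMM the firmware reports in which slot before opening the case, dmidecode can also list the memory devices. The sketch below assumes the usual "Size:", "Locator:" and "Serial Number:" field labels in the output of dmidecode --type memory, which may vary between vendors; summarize_dimms is a hypothetical helper name.

```shell
# Condense "dmidecode --type memory" output (read from stdin) into one line
# per memory device: slot locator, size, serial number (tab-separated).
# Assumes Size and Locator appear before Serial Number within each device.
summarize_dimms() {
    awk -F': *' '
        { gsub(/^[ \t]+/, "", $1) }          # drop the leading indentation
        $1 == "Size"          { size = $2 }
        $1 == "Locator"       { slot = $2 }
        $1 == "Serial Number" { print slot "\t" size "\t" $2 }
    '
}
```

Typical use would be sudo dmidecode --type memory | summarize_dimms. Slot names reported by the firmware do not always match the labels printed on the motherboard, so treat the result as a hint only.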
Step #1:
Step #2:
Step #3:
- touch chassis ground, unplug the computer, hold the power button for 10 seconds
- reseat all DIMMs
- retest with memtest
- if memtest runs successfully, you've probably fixed the problem
Step #4:
- touch chassis ground, unplug the computer, hold the power button for 10 seconds
- identify the A2/B2 DIMMs and carefully remove them
- retest with memtest
- if memtest runs successfully, the A1/B1 DIMMs are good
- if memtest fails, then either the A1 or the B1 DIMM is bad; pull one of them out and retest with memtest
- if memtest runs successfully, the A1/B1 DIMM that you pulled out is the defective one
- if memtest fails, the other A1/B1 DIMM is defective
Assuming that you have four 16G DIMMs, continue cycling the remaining DIMM sticks through slot A1/B1 until only one defective DIMM remains uninstalled. Keep in mind that you may actually have more than one defective DIMM.
Update #2:
Using memtest, one bad DIMM was identified. Checksums are now fine and consistent.