ras-mc-ctl --errors
is reporting results like:
661 2019-08-20 08:42:29 -0400 error: corrected filtering (some unreported errors in same region) Generic CACHE Level-3 Generic Error, mcg mcgstatus=0, mci Corrected_error Threshold based error status: yellow, mcgcap=0x00000c09, status=0x8c400c400001110b, addr=0x3334c0000080b06, misc=0x00b501c0, tsc=0x3c6571e2bbea4, walltime=0x5d5beab4, cpuid=0x000806e9, bank=0x00000008
and more frequently:
728 2019-08-31 13:35:59 -0400 error: corrected filtering (some unreported errors in same region) Generic CACHE Level-3 Generic Error, mcg mcgstatus=0, mci Corrected_error Threshold based error status: green, Large number of corrected cache errors. System operating, but might leadto uncorrected errors soon, mcgcap=0x00000c09, status=0x8c2000c00001110b, addr=0x2b6b100000374cf, misc=0x0001bdc0, tsc=0x376c4b0d8828, walltime=0x5d6aafff, cpuid=0x000806e9, bank=0x00000008
What do these messages actually mean, and what could/should one do about them?
Additional info:
- This is an Intel NUC 7i7BNH, with 16 GB memory, 500 GB SSD, and 4K monitor.
- It runs Ubuntu 18.0, with recent "apt upgrade".
- The BIOS was updated to the recent July version.
- I've made no hardware modifications.
lshw -C memory
shows:
*-firmware
description: BIOS
vendor: Intel Corp.
physical id: 0
version: BNKBL357.86A.0080.2019.0725.1139
date: 07/25/2019
size: 64KiB
capacity: 8128KiB
capabilities: pci upgrade shadowing cdboot bootselect socketedrom edd int13floppy1200 int13floppy720 int13floppy2880 int5printscreen int14serial int17printer acpi usb biosbootspecification uefi
*-memory
description: System Memory
physical id: 28
slot: System board or motherboard
size: 16GiB
*-bank:0
description: SODIMM DDR4 Synchronous Unbuffered (Unregistered) 2133 MHz (0.5 ns)
product: CMSO16GX4M1A2133C15
vendor: AMI
physical id: 0
serial: 00000000
slot: ChannelA-DIMM0
size: 16GiB
width: 64 bits
clock: 2133MHz (0.5ns)
*-bank:1
description: [empty]
physical id: 1
slot: ChannelB-DIMM0
*-cache:0
description: L1 cache
physical id: 2d
slot: L1 Cache
size: 128KiB
capacity: 128KiB
capabilities: synchronous internal write-back unified
configuration: level=1
*-cache:1
description: L2 cache
physical id: 2e
slot: L2 Cache
size: 512KiB
capacity: 512KiB
capabilities: synchronous internal write-back unified
configuration: level=2
*-cache:2
description: L3 cache
physical id: 2f
slot: L3 Cache
size: 4MiB
capacity: 4MiB
capabilities: synchronous internal write-back unified
configuration: level=3
*-memory UNCLAIMED
description: Memory controller
product: Sunrise Point-LP PMC
vendor: Intel Corporation
physical id: 1f.2
bus info: pci@0000:00:1f.2
version: 21
width: 32 bits
clock: 33MHz (30.3ns)
capabilities: bus_master
configuration: latency=0
resources: memory:dc244000-dc247fff
Test results:
Running memtest86 produced some interesting results:
- After about 5 minutes, it displayed the Intel logo and rebooted.
- The same thing happened again, but I managed to record most of the messages first.
- The third time, it completed a full pass (about 45 minutes), and then crashed again a few minutes into the second pass.
- I'll leave it running, but I doubt it will make it through 4 passes.
The second attempt resulted in:
Test 4: Addr: 33090D380 Expected 08080808 Actual: 18080808 CPU:2
Test 4: Addr: 33090D38C Expected 08080808 Actual: 08080818 CPU:2
Test 4: Addr: 33090D390 Expected 08080808 Actual: [???]
Test 4: Addr: 33090D394 Expected 08080808 Actual: [???]
The third time, which made it through the first pass, showed:
Note that the addresses aren't the same as the previous time (though both had 4 errors).
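The two complete lines recorded from the second attempt can be compared bit by bit. The sketch below XORs the expected and actual values to list which bits flipped, and converts each address to GiB to show where it falls in the 16 GiB range. The helper name `flipped_bits` is made up for illustration; the values are copied from the memtest86 lines above.

```python
from typing import List


def flipped_bits(expected: int, actual: int) -> List[int]:
    """Return the bit positions (0 = least significant) where the values differ."""
    diff = expected ^ actual
    return [bit for bit in range(diff.bit_length()) if (diff >> bit) & 1]


# (address, expected, actual) taken from the two fully recorded Test 4 lines.
readings = [
    (0x33090D380, 0x08080808, 0x18080808),
    (0x33090D38C, 0x08080808, 0x08080818),
]

for addr, expected, actual in readings:
    bits = flipped_bits(expected, actual)
    print(f"addr {addr:#x} ({addr / 2**30:.2f} GiB): flipped bits {bits}")
```

Each of the two readings differs from the expected pattern in exactly one bit (bit 28 and bit 4 respectively). That alone does not say whether the fault is in the DIMM or in the cache.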
Go to https://www.memtest86.com/ and download and run their free memtest to test your memory. Get at least one complete pass of all the 4/4 tests to confirm good memory. This will take many hours to complete.
Update #1: memtest failed. You've either got a defective 16G RAM stick, or bad cache memory on your motherboard. Try re-seating the 16G RAM stick and see if it helps. FYI: for optimum memory speed, it's better to have two 8G RAM sticks instead of one 16G RAM stick. It also makes it easier to troubleshoot memory issues.
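To see where the "yellow" and "green" wording in the ras-mc-ctl lines comes from, the status= value can be decoded by hand. This is a minimal sketch, assuming the IA32_MCi_STATUS bit layout as I understand it from the Intel SDM machine-check chapter: bits 54:53 hold the threshold-based status when MCG_CAP bit 11 is set, bits 52:38 hold the corrected error count when MCG_CAP bit 10 is set, and the low 16 bits are the MCA error code. The function name `decode_mci_status` is hypothetical and not part of rasdaemon.

```python
THRESHOLD_STATUS = {0: "no tracking", 1: "green", 2: "yellow", 3: "reserved"}


def decode_mci_status(status: int, mcg_cap: int) -> dict:
    """Decode selected IA32_MCi_STATUS fields (layout assumed from the Intel SDM)."""
    code = status & 0xFFFF
    fields = {
        "valid": bool((status >> 63) & 1),
        "overflow": bool((status >> 62) & 1),
        "uncorrected": bool((status >> 61) & 1),
        "misc_valid": bool((status >> 59) & 1),
        "addr_valid": bool((status >> 58) & 1),
        "context_corrupt": bool((status >> 57) & 1),
        "mca_error_code": code,
        # Cache-hierarchy codes have the form 000F 0001 RRRR TTLL (F = filtering bit).
        "cache_hierarchy_error": (code & 0xEF00) == 0x0100,
        "corrected_filtering": bool((code >> 12) & 1),
    }
    if (mcg_cap >> 10) & 1:  # MCG_CMCI_P: corrected error count is reported
        fields["corrected_count"] = (status >> 38) & 0x7FFF
    if (mcg_cap >> 11) & 1:  # MCG_TES_P: threshold-based status is reported
        fields["threshold_status"] = THRESHOLD_STATUS[(status >> 53) & 0b11]
    return fields


# status= and mcgcap= values copied from the two log lines in the question.
for status in (0x8C400C400001110B, 0x8C2000C00001110B):
    decoded = decode_mci_status(status, mcg_cap=0x00000C09)
    print(f"{status:#018x}: {decoded}")
```

Under these assumptions, both records decode as corrected (not uncorrected) cache-hierarchy errors. The first carries a yellow status with a count of 49 and the second a green status with a count of 3, which matches the text ras-mc-ctl printed.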
Check to make sure that your CPU is not overclocked, or that memory XMP is not enabled in your BIOS.
Check your BIOS version with
sudo dmidecode -s bios-version
and then go to the manufacturer's web site and check for a newer BIOS.
Update #1: User has the latest BIOS, version: BNKBL357.86A.0080.2019.0725.1139, date: 07/25/2019