I woke up this morning to a first for me: one of my systems had logged DRAM ECC error notifications. Three of them, in fact, all for what is, as far as I can tell, the exact same memory location (obviously, the system isn't actually named localhost):
Aug 31 05:00:46 localhost kernel: [719099.816034] [Hardware Error]: CPU:0 MC4_STATUS[-|CE|MiscV|-|AddrV|-|-|CECC]: 0x9c6c40006b080a13
Aug 31 05:00:46 localhost kernel: [719099.816046] [Hardware Error]: MC4_ADDR: 0x0000000641f49d20
Aug 31 05:00:46 localhost kernel: [719099.816051] [Hardware Error]: Northbridge Error (node 0): DRAM ECC error detected on the NB.
Aug 31 05:00:46 localhost kernel: [719099.816059] EDAC amd64 MC0: CE ERROR_ADDRESS= 0x641f49d20
Aug 31 05:00:46 localhost kernel: [719099.816070] EDAC MC0: CE page 0x641f49, offset 0xd20, grain 0, syndrome 0x6bd8, row 2, channel 0, label "": amd64_edac
Aug 31 05:00:46 localhost kernel: [719099.816075] [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)
The above was followed by an identical notification at system time 05:10:46 (719699.8160) and then one more at 05:20:46 (720299.8160) which also had Over on the CPU:0 MC4_STATUS line (status 0xdc6c40006b080813). So far the system has been stable since, with no further errors logged. System activity is normal, and the system in question has been running with ECC RAM since 2014 but never logged any ECC errors.
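To see exactly what changed between the first and third status values, the two words can be compared bit by bit. The following is only a minimal sketch: the bit positions are the commonly documented x86 machine-check status layout, which I am assuming applies to MC4_STATUS on this CPU, and the name `decode_flags` is made up for illustration.

```python
# Assumed bit layout: the usual x86 machine-check status flags.
# Bit 46 (CECC) is AMD-specific and matches the "CECC" tag in the log line.
STATUS_FLAGS = {
    63: "Val",
    62: "Over",
    61: "UC",
    60: "En",
    59: "MiscV",
    58: "AddrV",
    57: "PCC",
    46: "CECC",
}


def decode_flags(status: int) -> list[str]:
    """Return the names of the flag bits set in a 64-bit status word."""
    return [
        name
        for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
        if (status >> bit) & 1
    ]


first = 0x9C6C40006B080A13  # status logged at 05:00:46
third = 0xDC6C40006B080813  # status logged at 05:20:46

print(decode_flags(first))
print(decode_flags(third))
print(hex(first ^ third))  # bits that differ between the two reports
```

Under that assumed layout, the only flag that differs is Over (bit 62), which matches the kernel's own decoding. The XOR also shows one differing bit in the low 16-bit error-code field, which this sketch does not try to interpret.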
I wouldn't be too worried about a single correctable ECC error. The almost exactly ten minutes (down to a few microseconds, in fact) between the errors being logged could simply be due to RAM scrubbing happening every ten minutes; unfortunately, on this particular system, the scrub interval is not exposed as a setting. However, three consecutive errors in the same memory location (same value for CE ERROR_ADDRESS) do have me a little bit concerned.
Update: The host in question has logged several more since I originally posted this question, all with the same value for CE ERROR_ADDRESS.
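For anyone wanting to confirm that repeated reports really do hit the same location, here is a rough sketch that tallies corrected errors per physical address from kernel log lines in the format shown above. The regular expression and the function name `tally_ce_addresses` are my own, and the 4 KiB page size (`page_shift=12`) is an assumption, although it is consistent with the log, where ERROR_ADDRESS 0x641f49d20 is reported as page 0x641f49, offset 0xd20.

```python
import re
from collections import Counter

# Matches the "EDAC MC0: CE page ..., offset ..." line format seen in the log above.
CE_LINE = re.compile(r"EDAC MC\d+: CE page (0x[0-9a-f]+), offset (0x[0-9a-f]+)")


def tally_ce_addresses(lines, page_shift=12):
    """Count corrected-error reports per physical address (assumes 4 KiB pages)."""
    counts = Counter()
    for line in lines:
        match = CE_LINE.search(line)
        if match:
            page = int(match.group(1), 16)
            offset = int(match.group(2), 16)
            counts[(page << page_shift) | offset] += 1
    return counts


sample = [
    "Aug 31 05:00:46 localhost kernel: [719099.816046] [Hardware Error]: MC4_ADDR: 0x0000000641f49d20",
    'Aug 31 05:00:46 localhost kernel: [719099.816070] EDAC MC0: CE page 0x641f49, offset 0xd20, '
    'grain 0, syndrome 0x6bd8, row 2, channel 0, label "": amd64_edac',
]

for address, count in tally_ce_addresses(sample).items():
    print(hex(address), count)
```

Feeding it the full kernel log instead of `sample` shows at a glance whether every report lands on one address or whether the errors are spread around.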
How seriously should I take this? What's a good response: order replacement RAM right away and schedule its installation ASAP, treat this as just a momentary glitch, or stay on my toes and replace the RAM if it happens again, with no specific action right now?
ECC RAM tends to be used on critical servers, and the system is reporting a hardware failure. If it's not a critical system and you don't mind everything that passes through it potentially being corrupted, then sure, wait and see what happens. But if you care about your data more than about the cost of the RAM, replace the faulty RAM ASAP.
I'd suggest running memtest86+ (http://www.memtest.org). It's also included as a standard package in some distributions. It may confirm your suspicion of a faulty memory module.
Wikipedia's webpage on Memory Scrubbing says:
That webpage contains a link to the SuperMicro X9SRA motherboard manual which explains the scrubbing interval:
Thus, scrubbing is not the cause. It's possible that there is a faulty bit. While a fault might occur suddenly, it seems odd that it goes away and comes back, especially when it occurs so frequently.
Pavel Machek, who invented the nohammer kernel module, says:
You can exchange the RAM modules and see if the error report follows the chip, sticks with the memory location, or occurs elsewhere.
HPE recommends (for a faulty memory module):
Suggested course of action:
Switching RAM in its sockets will tell you if it's a specific RAM module or if the fault is in other circuitry.
As long as you don't get more than one bit error every few days, there's no need to panic or rush.
If you're getting hit every 10 minutes you might be getting hammered.
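To put the question's three reports next to the "one bit error every few days" rule of thumb, the kernel uptime stamps can be turned into intervals and a daily rate. This is only an illustrative sketch: the helper names are invented, and the threshold of one error per three days is my own reading of "every few days", not a documented limit.

```python
def intervals(uptimes):
    """Seconds between consecutive error reports, rounded to 0.1 ms."""
    return [round(later - earlier, 4) for earlier, later in zip(uptimes, uptimes[1:])]


def errors_per_day(uptimes):
    """Average rate of repeat errors over the observed span, per 24 hours."""
    span = uptimes[-1] - uptimes[0]
    return (len(uptimes) - 1) / span * 86400


# Kernel uptime stamps of the three reports quoted in the question.
uptimes = [719099.8160, 719699.8160, 720299.8160]

# Hypothetical threshold: "one bit error every few days" taken as one per three days.
THRESHOLD_PER_DAY = 1 / 3

print(intervals(uptimes))
print(errors_per_day(uptimes))
print(errors_per_day(uptimes) > THRESHOLD_PER_DAY)
```

At one report every 600 seconds, the observed rate is far above that rule of thumb, which is why the ten-minute pattern deserves attention.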
See also: "Defending against RowHammer in the kernel" and "ECCploit: ECC Memory Vulnerable to Rowhammer Attacks After All". For ARM processors there's: "Android GuardION patches to mitigate DMA-based Rowhammer attacks on ARM".