Question

user

Asked: 2017-09-01 00:28:27 +0800 CST

What to do in response to repeat DRAM ECC error notifications for the same memory location?

772

I woke up this morning to what's a first for me; one of my systems had logged DRAM ECC error notifications. Three of them, in fact, for as far as I can tell the exact same memory location (obviously, the system isn't actually named localhost):

Aug 31 05:00:46 localhost kernel: [719099.816034] [Hardware Error]: CPU:0   MC4_STATUS[-|CE|MiscV|-|AddrV|-|-|CECC]: 0x9c6c40006b080a13
Aug 31 05:00:46 localhost kernel: [719099.816046] [Hardware Error]:         MC4_ADDR: 0x0000000641f49d20
Aug 31 05:00:46 localhost kernel: [719099.816051] [Hardware Error]: Northbridge Error (node 0): DRAM ECC error detected on the NB.
Aug 31 05:00:46 localhost kernel: [719099.816059] EDAC amd64 MC0: CE ERROR_ADDRESS= 0x641f49d20
Aug 31 05:00:46 localhost kernel: [719099.816070] EDAC MC0: CE page 0x641f49, offset 0xd20, grain 0, syndrome 0x6bd8, row 2, channel 0, label "": amd64_edac
Aug 31 05:00:46 localhost kernel: [719099.816075] [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)

The above was followed by an identical notification at system time 05:10:46 (719699.8160) and then one more at 05:20:46 (720299.8160) which also had Over on the CPU:0 MC4_STATUS line (status 0xdc6c40006b080813). So far the system has been stable since, with no further errors logged. System activity is normal, and the system in question has been running with ECC RAM since 2014 but never logged any ECC errors.

I wouldn't be too worried about a single correctable ECC error. The almost exactly ten minutes (down to a few microseconds, in fact) in between the errors being logged could be simply for RAM scrubbing happening every ten minutes; unfortunately, on this particular system, the scrub interval is not exposed as a setting. However, the three consecutive errors in the same memory location (same value for CE ERROR_ADDRESS) do have me a little bit concerned.
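To make the "same memory location" observation concrete, here is a minimal Python sketch that pulls the uptime and address out of the EDAC line quoted above and splits the address into the page and offset that the next log line reports. The helper names (parse_ce_line, split_page_offset) and the 4 KiB page size are assumptions made for illustration; the three uptime values are the ones given in the question, truncated as written there.

```python
import re

# Matches the "EDAC amd64 MCn: CE ERROR_ADDRESS= 0x..." line from the question.
CE_RE = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+EDAC amd64 MC\d+: CE ERROR_ADDRESS=\s*(0x[0-9a-fA-F]+)"
)


def parse_ce_line(line):
    """Return (uptime_seconds, address) or None if the line does not match."""
    m = CE_RE.search(line)
    if m is None:
        return None
    return float(m.group(1)), int(m.group(2), 16)


def split_page_offset(address, page_shift=12):
    """Split a physical address into (page, offset); 4 KiB pages assumed."""
    return address >> page_shift, address & ((1 << page_shift) - 1)


line = ("Aug 31 05:00:46 localhost kernel: [719099.816059] "
        "EDAC amd64 MC0: CE ERROR_ADDRESS= 0x641f49d20")
uptime, addr = parse_ce_line(line)
page, offset = split_page_offset(addr)
print(hex(page), hex(offset))  # 0x641f49 0xd20

# Uptime stamps of the three notifications, as stated in the question.
stamps = [719099.8160, 719699.8160, 720299.8160]
intervals = [round(b - a, 3) for a, b in zip(stamps, stamps[1:])]
print(intervals)  # [600.0, 600.0]
```

The page and offset agree with the "CE page 0x641f49, offset 0xd20" line in the log, and the intervals confirm the ten-minute spacing.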

Update: The host in question has logged several more since I originally posted this question, all with the same value for CE ERROR_ADDRESS.

How seriously should I take this? What's a good response: order replacement RAM right away and schedule to install it ASAP, treat this as just a momentary glitch, or stay on my toes to replace the RAM if it happens again but take no specific action right now?

3 Answers

Tim · Answer 1 · 2017-09-04T23:26:58+08:00

Best Answer

ECC RAM tends to be used on critical servers. The system is reporting a hardware failure. If it's not a critical system and you don't mind everything that goes through it potentially being corrupted, sure, wait and see what happens. But if you care about your data more than the cost of the RAM, replace the faulty RAM ASAP.

2

Jaroslav Kucera · Answer 2 · 2017-09-04T23:16:43+08:00

I'd suggest running memtest86+:

http://www.memtest.org

It's also included in some distributions as a standard package.

It may confirm your suspicion of a faulty memory module.

0

Rob · Answer 3 · 2018-11-27T17:13:13+08:00

I woke up this morning to what's a first for me; one of my systems had logged DRAM ECC error notifications. Three of them, in fact, for ... I wouldn't be too worried about a single correctable ECC error. The almost exactly ten minutes (down to a few microseconds, in fact) in between the errors being logged could be simply for RAM scrubbing happening every ten minutes; unfortunately, on this particular system, the scrub interval is not exposed as a setting.

Wikipedia's webpage on Memory Scrubbing says:

"Over 8% of DIMM modules experience at least one correctable error per year. This can be a problem for DRAM and SRAM based memories. The probability of a soft error at any individual memory bit is very small.".

"In order to not disturb regular memory requests from the CPU and thus prevent decreasing performance, scrubbing is usually only done during idle periods. As the scrubbing consists of normal read and write operations, it may increase power consumption for the memory compared to non-scrubbing operation. Therefore, scrubbing is not performed continuously but periodically. For many servers, the scrub period can be configured in the BIOS setup program."

That webpage contains a link to the SuperMicro X9SRA motherboard manual which explains the scrubbing interval:

"Patrol Scrub
Patrol Scrubbing is a process that allows the CPU to correct correctable memory errors detected on a memory module and send the correction to the requestor (the original source). When this item is set to Enabled, the North Bridge will read and write back one cache line every 16K cycles, if there is no delay caused by internal processing. By using this method, roughly 64 GB of memory behind the North Bridge will be scrubbed every day. The options are Enabled and Disabled.".

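Thus, scrubbing is probably not the cause: at a rate like the one quoted, a given memory location would be revisited far less often than every ten minutes. It's possible that there is a faulty bit. While a fault might occur suddenly, it seems odd that it goes away and comes back, especially when it recurs so frequently.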
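The arithmetic behind that conclusion can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the 64 GB per day figure from the Supermicro manual (treated here as GiB), which may not match the scrub rate of the asker's AMD system, and the 32 GiB memory size is purely hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60
GIB = 2 ** 30


def scrub_pass_seconds(memory_bytes, scrub_bytes_per_day=64 * GIB):
    """Seconds for a patrol scrubber to cover all of memory once."""
    return memory_bytes / scrub_bytes_per_day * SECONDS_PER_DAY


def bytes_scrubbed_in(seconds, scrub_bytes_per_day=64 * GIB):
    """Bytes a patrol scrubber covers in the given number of seconds."""
    return scrub_bytes_per_day * seconds / SECONDS_PER_DAY


# Hypothetical 32 GiB system: one full pass takes half a day.
print(scrub_pass_seconds(32 * GIB) / 3600)        # 12.0 (hours)

# Memory covered in ten minutes at the quoted rate, in MiB.
print(round(bytes_scrubbed_in(600) / 2 ** 20, 1))  # 455.1
```

At that rate, the scrubber could only return to the same address every ten minutes if the system had roughly 455 MiB of memory or less.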

"How seriously should I take this? What's a good response: order replacement RAM right away and schedule to install it ASAP, treat this as just a momentary glitch, or stay on my toes to replace the RAM if it happens again but take no specific action right now?"

Pavel Machek, who invented the nohammer kernel module, says:

"It is fairly hard to do rowhammer by accident, so if you are hitting it, someone is probably doing it on purpose. ... Well, there's more than three orders of magnitude difference between cosmic rays and rowhammer. IIRC cosmic rays are expected to cause 2 bit flips a year... rowhammer can do bitflip in 10 minutes, and that is old version, not one of the optimized ones.".
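As a quick check of the "three orders of magnitude" remark, the two rates in the quote can be compared directly. This is a sketch using only the quoted figures (2 bit flips a year versus one bit flip per 10 minutes); the variable names are illustrative.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60          # 525600

cosmic_flips_per_year = 2                  # figure from the quote
rowhammer_flips_per_year = MINUTES_PER_YEAR / 10  # one flip per 10 minutes

ratio = rowhammer_flips_per_year / cosmic_flips_per_year
print(ratio)                               # 26280.0
print(round(math.log10(ratio), 1))         # 4.4
```

So the quoted figures differ by a bit more than four orders of magnitude, consistent with "more than three". The ten-minute spacing in the question matches the rowhammer figure, not the cosmic-ray one.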

You can exchange the RAM modules and see if the error report follows the chip, sticks with the memory location, or occurs elsewhere.

HPE recommends (for a faulty memory module):

"SYMPTOM: The below error message is found in the OS logs:

host1 kernel: Northbridge Error (node X): DRAM ECC error detected on the NB.

FIX:
1. Identify the Memory module number that has failed (if mentioned in the error)
2. Check IML for Error relating to Memory module. Ex Proc x slot x
3. Update System BIOS
4. If no errors are found run diagnostics and replace the memory module (5-6 loops of Memory Diagnostics to isolate the memory module)"

Suggested course of action:

Switching the RAM modules between their sockets will tell you whether the fault is in a specific module or in other circuitry.
As long as you don't get more than one bit error every few days, there's no panic (rush).
If you're getting hit every 10 minutes, you might be getting hammered.
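The rule of thumb above can be written down as a small triage helper. This is a rough sketch, not a diagnostic tool: the function names, the reading of "every few days" as three days, the 10-minute alarm threshold, and the returned labels are all assumptions chosen for illustration.

```python
def mean_interval_seconds(timestamps):
    """Average gap between consecutive error timestamps, or None if < 2."""
    if len(timestamps) < 2:
        return None
    ordered = sorted(timestamps)
    return (ordered[-1] - ordered[0]) / (len(ordered) - 1)


def triage(timestamps, relaxed_after=3 * 86400, alarm_within=10 * 60):
    """Classify correctable-error frequency using assumed thresholds."""
    interval = mean_interval_seconds(timestamps)
    if interval is None or interval >= relaxed_after:
        return "no rush"
    # Rounded to whole seconds so float noise does not hide a 600 s gap.
    if round(interval) <= alarm_within:
        return "investigate now"
    return "swap modules and watch"


# Uptime stamps of the three notifications, as stated in the question.
print(triage([719099.8160, 719699.8160, 720299.8160]))  # investigate now
```

A single logged error, or errors a week apart, would fall under "no rush" with these thresholds.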

See also: "Defending against RowHammer in the kernel" and "ECCploit: ECC Memory Vulnerable to Rowhammer Attacks After All". For ARM processors there's: "Android GuardION patches to mitigate DMA-based Rowhammer attacks on ARM".
