Question

Jens Erat

Asked: 2014-11-12 07:00:18 +0800 CST

How do I get notified of ECC errors in Linux?

How do I get notified, when a Linux machine equipped with ECC memory recognizes a memory failure? I'm interested in both correctable and uncorrectable errors.

if a message is written to dmesg/the syslog, this is already fine, but I'd love to know what to look for
installing additional daemons (like smartmontools for hard drives) is acceptable
Nagios/Icinga monitoring would be another way to go
not all machines to be monitored have IPMI

The systems of interest have Supermicro boards (X9SCM-F). I'm also curious about an HP N54L MicroServer, but don't care too much about it. All systems run Debian or Ubuntu Linux.

5 Answers

maxschlepzig · Answer 1 · 2017-12-17T09:41:17+08:00

The Linux kernel supports the error detection and correction (EDAC) features of some chipsets. On a supported system with ECC the status of your memory controller is accessible via sysfs:

/sys/devices/system/edac/mc

The directory tree under that location should correspond to your hardware, e.g.:

/sys/devices/system/edac/mc/mc0/csrow2/power
/sys/devices/system/edac/mc/mc0/csrow0/power
/sys/devices/system/edac/mc/mc0/dimm2/power
/sys/devices/system/edac/mc/mc0/dimm0/power
/sys/devices/system/edac/mc/mc1/power
...
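
As a rough sketch of reading those counters directly: the kernel's EDAC sysfs interface exposes ce_count and ue_count files in each mcN directory (these file names are not mentioned above; they come from the kernel EDAC documentation). The helper name edac_totals and its optional directory argument are hypothetical and only there for illustration.

```sh
#!/bin/sh
# edac_totals: hypothetical helper that sums the corrected (CE) and
# uncorrected (UE) error counters of all memory controllers.
# Assumes each mcN directory contains ce_count and ue_count files.
edac_totals() {
    base=${1:-/sys/devices/system/edac/mc}
    ce=0
    ue=0
    for mc in "$base"/mc[0-9]*; do
        # Skip controllers (or an unmatched glob) without readable counters.
        [ -r "$mc/ce_count" ] && [ -r "$mc/ue_count" ] || continue
        ce=$((ce + $(cat "$mc/ce_count")))
        ue=$((ue + $(cat "$mc/ue_count")))
    done
    echo "CE=$ce UE=$ue"
}
```

Calling edac_totals without arguments reads the live sysfs tree; on a machine without a loaded EDAC driver it simply reports zero for both counters, so it cannot tell "no errors" apart from "no driver".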

Depending on your hardware, you might have to explicitly load the right edac driver, cf.:

find /lib/modules/$(uname -r) -name '*edac*'

The edac-utils package provides a command line frontend and a library for accessing that data, e.g.:

edac-util -rfull          
mc0:csrow0:mc#0memory#0:CE:0
mc0:csrow2:mc#0memory#2:CE:0
mc0:noinfo:all:UE:0
mc0:noinfo:all:CE:0
mc1:noinfo:all:UE:0
mc1:noinfo:all:CE:0

You can set up a cron job that periodically calls edac-util and feeds the results into your monitoring system, where you can then configure notifications.
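
A check along those lines might look like the following sketch. It assumes the colon-separated line format shown in the sample output above (the last field is a count, the field before it is CE or UE) and uses the Nagios plugin return-code convention (0 OK, 1 WARNING, 2 CRITICAL). The function name check_edac_report is hypothetical. It adds up every line, so treat the numbers as an alert signal rather than an exact error total.

```sh
#!/bin/sh
# check_edac_report: hypothetical helper that reads `edac-util -rfull`
# style lines on stdin and prints a Nagios-style status line.
# Returns 0 (OK), 1 (WARNING: correctable errors) or
# 2 (CRITICAL: uncorrectable errors).
check_edac_report() {
    awk -F: '
        NF >= 2 && $(NF-1) == "CE" { ce += $NF }
        NF >= 2 && $(NF-1) == "UE" { ue += $NF }
        END {
            if (ue > 0) {
                printf "CRITICAL - %d uncorrectable, %d correctable\n", ue, ce
                exit 2
            } else if (ce > 0) {
                printf "WARNING - %d correctable errors\n", ce
                exit 1
            }
            print "OK - no ECC errors reported"
            exit 0
        }'
}
```

It would be used as `edac-util -rfull | check_edac_report`, either from cron or as a Nagios/Icinga plugin command.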

In addition, running mcelog is generally a good idea. Depending on the system, correctable and uncorrectable ECC errors are likely to be reported as machine check exceptions (MCEs) as well. Even brief periods of CPU throttling due to high temperature are reported as MCEs.

Michael Hampton · Answer 2 · 2014-11-12T07:50:03+08:00

mcelog will monitor the memory controller and report memory error events to syslog, and in some configurations can offline bad memory pages. This is, of course, in addition to its usual use to monitor machine check exceptions and a variety of other hardware errors.

Most Linux distributions have a service set up to run it as a daemon, e.g. for EL 6:

chkconfig mcelog on
service mcelog start

spaceman spiff · Answer 3 · 2020-01-05T15:01:56+08:00

The rasdaemon package was created as a replacement for edac-utils, and newer kernels don't even support edac-utils or mcelog.

An update to the EDAC Linux kernel drivers changed how the memory error counters are exposed to userspace, so edac-utils and mcelog are effectively deprecated.
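
With rasdaemon installed, its ras-mc-ctl tool can print per-DIMM counters via `ras-mc-ctl --error-count`. The sketch below totals that output. The helper name ras_totals is hypothetical, and the parsing rests on an assumption about the output layout: one header line, then one row per DIMM whose last two whitespace-separated fields are the CE and UE counts. Check the output on your own system before relying on it.

```sh
#!/bin/sh
# ras_totals: hypothetical helper that reads `ras-mc-ctl --error-count`
# style output on stdin and prints the summed counters.
# ASSUMPTION: first line is a header; on every other row the last two
# fields are the corrected (CE) and uncorrected (UE) counts.
ras_totals() {
    awk 'NR > 1 && NF >= 3 { ce += $(NF-1); ue += $NF }
         END { printf "CE=%d UE=%d\n", ce, ue }'
}
```

Usage would be `ras-mc-ctl --error-count | ras_totals`, with the result fed into the same kind of monitoring check as any other counter.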

ewwhite · Answer 4 · 2014-11-12T07:52:28+08:00

This depends on your server hardware. A whitebox or a Supermicro system will handle this differently than a Dell, HP or IBM...

One of the value-add features of high-end servers is that there's a level of hardware/OS integration. Nicer servers will report what you're looking for as part of the management agents and/or out-of-band management solution (ILO, DRAC, IPMI).

You should use the tools native to your hardware platform.

Excerpt from an HP ProLiant server running Linux and the HP management agents:

Trap-ID=6056
ECC Memory Correctable Errors  detected.

and

Trap-ID=6052
Advanced ECC Memory  Engaged

or a more severe

Trap-ID=6029
A correctable memory log entry indicates a memory module needs to be
replaced.

or the worst: ignoring an error for 6 days until the server crashes because of bad RAM:

0004 Repaired       22:21  12/01/2008 22:21  12/01/2008 0001
LOG: Corrected Memory Error threshold exceeded (Slot 1, Memory Module 1)

0007 Repaired       02:58  12/07/2008 02:58  12/07/2008 0001
LOG: POST Error: 201-Memory Error Single-bit error occured during 
memory initialization, 
Board 1, DIMM 1. Bank containing DIMM(s) has been disabled.

0008 Repaired       19:31  12/08/2009 19:31  12/08/2009 0001
LOG: ASR Detected by System ROM

These were logged, plus SNMP traps and emails were sent.

Generically, you'll see Machine Check Exceptions in the kernel ring buffer, so you can check dmesg or run mcelog. In my experience with Supermicro gear without IPMI, that didn't catch everything, and I still had RAM errors slip through the cracks and cause outages. Unfortunately, this led to archaic RAM burn-in policies before system deployments.
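
For the dmesg route, a minimal filter might look like this. The patterns are assumptions based on common kernel message prefixes (EDAC driver messages start with "EDAC", and machine check reports are usually tagged "[Hardware Error]"). They are not an exhaustive list, and the function name filter_hw_errors is hypothetical.

```sh
#!/bin/sh
# filter_hw_errors: hypothetical helper that keeps only kernel log lines
# that look like EDAC or machine-check reports. Reads the log on stdin.
# The patterns are assumptions, not a complete list of error messages.
filter_hw_errors() {
    # "|| true" keeps the exit status at 0 when nothing matches.
    grep -E 'EDAC|\[Hardware Error\]|[Mm]achine [Cc]heck' || true
}
```

It can be run as `dmesg | filter_hw_errors`, or pointed at a syslog file in the same way.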

gabriele · Answer 5 · 2020-03-13T00:09:14+08:00

As mentioned by another poster, mcelog is deprecated and effectively replaced by rasdaemon. I made a write-up on how to install and configure it on many Linux distributions, including instructions for properly setting up DIMM labels.
