I've noticed a bunch of errors that recently appeared in /var/log/messages on one of our servers (below). However, the mcelog client seems less certain of the error source than the decoded entries in syslog. Is there some sort of key for interpreting the MCE output?
Nov 12 04:19:19 areion kernel: [14698753.176035] Machine check events logged
Nov 12 04:19:19 areion mcelog: HARDWARE ERROR. This is *NOT* a software problem!
Nov 12 04:19:19 areion mcelog: Please contact your hardware vendor
Nov 12 04:19:19 areion mcelog: MCE 0
Nov 12 04:19:19 areion mcelog: CPU 0 BANK 8
Nov 12 04:19:19 areion mcelog: MISC 640738dd0009159c ADDR 96236c6c0
Nov 12 04:19:19 areion mcelog: TIME 1352711959 Mon Nov 12 04:19:19 2012
Nov 12 04:19:19 areion mcelog: MCG status:
Nov 12 04:19:19 areion mcelog: MCi status:
Nov 12 04:19:19 areion mcelog: MCi_MISC register valid
Nov 12 04:19:19 areion mcelog: MCi_ADDR register valid
Nov 12 04:19:19 areion mcelog: MCA: MEMORY CONTROLLER RD_CHANNELunspecified_ERR
Nov 12 04:19:19 areion mcelog: Transaction: Memory read error
Nov 12 04:19:19 areion mcelog: STATUS 8c0000400001009f MCGSTATUS 0
Nov 12 04:19:19 areion mcelog: MCGCAP 1c09 APICID 20 SOCKETID 1
Nov 12 04:19:19 areion mcelog: CPUID Vendor Intel Family 6 Model 44
All errors seem to be connected with the same memory bank:
areion:~# awk -F'mcelog:' '/mcelog:.*BANK/{ print $2; }' < /var/log/messages |uniq
CPU 0 BANK 8
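For reference, the ADDR fields can be tallied the same way (a rough one-liner that just prints the token following each ADDR; the errors seem to cluster around a handful of physical addresses):
areion:~# awk '/mcelog:/{ for (i = 1; i <= NF; i++) if ($i == "ADDR") print $(i+1) }' /var/log/messages | sort | uniq -c | sort -rn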
I have the mcelog daemon running, but when I query it for error information, it doesn't seem to know where the errors are coming from, only that they are associated with CPU 0 (we have only one CPU in this box):
Memory errors
SOCKET 1 CHANNEL any DIMM any
corrected memory errors:
77 total
77 in 24h
uncorrected memory errors:
0 total
0 in 24h
Per page corrected memory statistics:
359ffc000: total 2 2 in 24h online
3b93cc000: total 2 2 in 24h online
3ce45c000: total 2 2 in 24h online
96236c000: total 20 20 in 24h online triggered
96545c000: total 9 9 in 24h online
96a82c000: total 9 9 in 24h online
96a8ec000: total 1 1 in 24h online
96fb6c000: total 15 15 in 24h online triggered
9c2edc000: total 15 15 in 24h online triggered
9c5eac000: total 1 1 in 24h online
9c6a1c000: total 1 1 in 24h online
It's not at all clear how to interpret this information. On the one hand, the mcelog client doesn't indicate a channel or DIMM, yet the decoded syslog entries point at bank 8. On top of that, dmesg seems to indicate that only 42 events were logged:
[14698753.176035] Machine check events logged
[14698753.629174] Machine check events logged
[14698815.338595] __ratelimit: 38 callbacks suppressed
[14698815.338628] Machine check events logged
[14698816.020797] Machine check events logged
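(The four "Machine check events logged" lines plus the 38 suppressed callbacks account for the 42.) A rough way to total that without counting by hand, assuming the message formats above (the suppressed count is the third-from-last field):
areion:~# dmesg | awk '/Machine check events logged/{ n++ } /callbacks suppressed/{ n += $(NF-2) } END { print n+0 }'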
I seem to be getting mixed messages, which leaves me unsure how much weight to give the information reported by each source.
Misc info:
areion:~# grep 'model name' /proc/cpuinfo |uniq
model name : Intel(R) Xeon(R) CPU X5670 @ 2.93GHz
areion:~# apt-cache policy mcelog |grep Installed
Installed: 1.0~pre3-3
areion:~# lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description: Debian GNU/Linux 6.0.6 (squeeze)
Release: 6.0.6
Codename: squeeze
You might like to try replacing the DIMM in question (CPU 0, SOCKET 1, BANK 8) and seeing whether the MCE messages continue to be generated.
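Mapping a machine-check bank and physical address back to a particular slot is platform-specific, and usually needs either an EDAC driver for the chipset or the vendor's diagnostics. As a rough starting point, dmidecode can at least enumerate the populated slots (the locator strings it prints are vendor-defined, so match them against the board manual):
# list populated slots; the Locator/Bank Locator names are vendor-defined
dmidecode -t memory | grep -E 'Size|Locator'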
The mcelog package ships with default thresholds for various MCE events that occur over time; check out /etc/mcelog/mcelog.conf for details. For memory page errors the threshold is 10 events over 24 hours. (I'm not really sure where this number comes from, but it's probably a reasonable reference point; see the snippet at the end of this answer.) Your post mentions 77 correctable events over 24 hours against a whole bunch of pages, so it's pretty likely that the DIMM has developed a problem which may or may not turn into something more serious.

I wouldn't be too upset about receiving inconsistent information from different sources. In general I have found that anything at the firmware level is pretty platform-specific (i.e. particular to that hardware model). My rule of thumb for firmware-related problems is that the vendor tools are usually the most accurate but the least usable, while the more generic open source tools are easier to work with but may not provide enough information to show exactly what's going on.
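For reference, in upstream mcelog the page-error settings live in a [page] stanza of mcelog.conf; something like this should show yours (the key names here are from current upstream, so treat them as an assumption on my part; the 1.0~pre3 package in squeeze may phrase things differently):
# show the page-error stanza; expect something like memory-ce-threshold = 10 / 24h
grep -A4 '^\[page\]' /etc/mcelog/mcelog.conf
If a memory-ce-action is configured to soft-offline pages that cross the threshold, that would also explain the "triggered" flags in the per-page statistics above.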