We often have DIMMs go bad in our servers, with errors like the following in syslog:
    May 7 09:15:31 nolcgi303 kernel: EDAC k8 MC0: general bus error: participating processor(local node response), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)
    May 7 09:15:31 nolcgi303 kernel: MC0: CE page 0xa0, offset 0x40, grain 8, syndrome 0xb50d, row 2, channel 0, label "": k8_edac
    May 7 09:15:31 nolcgi303 kernel: MC0: CE - no information available: k8_edac Error Overflow set
    May 7 09:15:31 nolcgi303 kernel: EDAC k8 MC0: extended error code: ECC chipkill x4 error
We can use the HP SmartStart CD to determine which DIMM has the error but that requires taking the server out of production. Is there a cunning way to work out which DIMM's bust while the server is up? All our servers are HP hardware running RHEL 5.
MC0, row 2, and channel 0 are the significant parts: MC0 is memory controller 0, which on these Opteron (k8) systems means CPU0, and in the EDAC mapping csrow2/channel 0 corresponds to DIMM_A1. Try replacing DIMMA1 on CPU0.
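If you want to confirm which counter is incrementing before pulling hardware, the per-row, per-channel correctable-error counts are also exposed under sysfs (a quick check, assuming the standard EDAC sysfs layout):

    # CE counter for memory controller 0, chip-select row 2, channel 0
    cat /sys/devices/system/edac/mc/mc0/csrow2/ch0_ce_count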
By way of example, I had to identify a bad DIMM in a Linux server with 16 fully populated DIMM slots and two CPUs. These are the errors I saw on the console:
The bad DIMM in my server was DIMMA0 on CPU1.
EDAC stands for Error Detection And Correction and is documented at http://www.kernel.org/doc/Documentation/edac.txt and /usr/share/doc/kernel-doc-2.6*/Documentation/drivers/edac/edac.txt on my system (RHEL5). CE stands for "correctable errors" and as the documentation indicates, "CEs provide early indications that a DIMM is beginning to fail."
Going back to the EDAC errors above I saw on my server's console, MC1 (Memory Controller 1) means CPU1, row 1 is referred to as csrow1 (Chip-Select Row 1) in the Linux EDAC documentation, and channel 0 means memory channel 0. I checked the chart at http://www.kernel.org/doc/Documentation/edac.txt to see that csrow1 and Channel 0 correspond to DIMM_A0 (DIMMA0 on my system):
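The relevant part of that chart looks roughly like this (reproduced here from memory, so double-check the current edac.txt; the pattern simply continues, e.g. csrow4/csrow5 map to DIMM_A2/DIMM_B2):

                Channel 0     Channel 1
    ===================================
    csrow0  |  DIMM_A0   |  DIMM_B0  |
    csrow1  |  DIMM_A0   |  DIMM_B0  |
    ===================================
    csrow2  |  DIMM_A1   |  DIMM_B1  |
    csrow3  |  DIMM_A1   |  DIMM_B1  |
    ===================================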
(As another example, if I had seen errors on MC0, csrow4, and Channel 1, I would have replaced DIMMB2 on CPU0.)
Of course, there are actually two DIMM slots called DIMMA0 on my server (one for each CPU), but again the MC1 error corresponds to CPU1, which is listed under "Bank Locator" in the output of dmidecode:
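If you don't have that output handy, something along these lines pulls out the interesting fields (the exact field names can vary slightly between BIOS vendors, but "Locator" and "Bank Locator" are the usual ones):

    # Show each memory device's slot (Locator) and CPU bank (Bank Locator),
    # plus size and identifying numbers where the BIOS reports them
    dmidecode --type memory | grep -E 'Locator|Size|Part Number|Serial Number'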
(On my workstation, dmidecode actually shows the Part Number and Serial Number for my DIMMs, which is very useful.)
In addition to looking at errors on the console and in logs, you can also see errors per MC/CPU, row/csrow, and channel by examining /sys/devices/system/edac. In my case the errors were only on MC1, csrow1, channel 0:
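For example, this dumps every per-channel CE counter with its path in one go (assuming the sysfs layout used by RHEL 5 era EDAC):

    # Print each correctable-error counter; non-zero values point at the
    # failing memory controller / csrow / channel
    grep . /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count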
I hope this example is helpful for anyone trying to identify a bad DIMM based on EDAC errors. For more information, I highly recommend reading all of the Linux EDAC documentation at http://www.kernel.org/doc/Documentation/edac.txt
In addition to using the EDAC codes, you can use HP's CLI-only utilities to determine this while the machine is online. The CLI versions are far more lightweight than the web-based ones and don't require you to open ports or keep a daemon constantly running.
hpasmcli will give you the cartridge and module numbers of the failed modules, which is a little quicker than analyzing EDAC output.
Example:
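(A sketch from memory; hpasmcli ships with HP's ProLiant Support Pack / hp-health package, and the exact output varies by model.)

    # List each memory cartridge/module and its status via the HP management driver
    hpasmcli -s "show dimm"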
The Status field will change for failed modules.