As the title says, one of my BL460s has Red Hat installed, and I get a recurring message in /var/log/messages from the mcelog daemon:
mcelog: Corrected memory errors on page 61a5dd000 exceed threshold 10 in 24h: 10 in 24h
mcelog: Location SOCKET:1 CHANNEL:1 DIMM:0 []
mcelog: Offlining page 61a5dd000
mcelog: Offlining page 61a5dd000 failed: Input/output error
I have two questions:
Is this message "normal"? I mean, the system sees errors and corrects them, so once everything has been corrected, should those errors stop appearing in /var/log/messages (even though it means a DIMM module has some errors)?
I tried to locate the DIMM module, but I can't find it. I located PROC 1 of the BL460 and the CHANNEL 1 pair, but on the BL460 the DIMMs are labeled 1 to 6. I assumed DIMM:0 was physical DIMM 1, but after removing it the message still appears in /var/log/messages. (I then removed both 1 and 2 to check, because both are on CHANNEL 1, but the result was the same.) How can I work out which physical DIMM it is?
Thank you :)
This is a case where you should have the HPE management agents installed. I don't use mcelog on proper HPE server equipment.
See: HP ProLiant DL380e Gen8 server - SPP use
For RHEL/CentOS, these drivers manage system health and reporting to the OS. Granted, you can also get this information directly from the iLO.
Example output:
Or via iLO...
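
If the agents are not an option right away, a rough way to check whether the errors keep coming from the same location is to tally the mcelog "Location" lines. This is only a minimal sketch: it assumes the log lines look exactly like the ones quoted in the question, and `count_locations` and `summarize` are hypothetical helper names, not part of mcelog. The SOCKET/CHANNEL/DIMM numbers are whatever mcelog reports and may not match the slot numbers printed on the board.

```python
import re
from collections import Counter

# Matches lines such as "mcelog: Location SOCKET:1 CHANNEL:1 DIMM:0 []"
# (format assumed from the messages quoted in the question).
LOCATION_RE = re.compile(r"mcelog: Location SOCKET:(\d+) CHANNEL:(\d+) DIMM:(\d+)")


def count_locations(lines):
    """Count how often each (socket, channel, dimm) triple is reported."""
    counts = Counter()
    for line in lines:
        match = LOCATION_RE.search(line)
        if match:
            counts[tuple(int(group) for group in match.groups())] += 1
    return counts


def summarize(path="/var/log/messages"):
    """Print one line per reported location, most frequent first."""
    # Reading this file usually requires root.
    with open(path, errors="replace") as log:
        for (socket, channel, dimm), n in count_locations(log).most_common():
            print(f"SOCKET:{socket} CHANNEL:{channel} DIMM:{dimm} reported {n} times")
```

Calling `summarize()` as root prints the tally. If the same triple keeps showing up after a DIMM has been pulled, the mcelog numbering probably does not map to the slot that was removed.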