One of our Supermicro servers reports an error like this during POST:
Failing DIMM: DIMM location (Correctable memory component found)
DIMMB2
I can also see this in the Health Event Log in the IPMI web interface:
Failing DIMM: DIMM location. (Correctable memory component found) (DIMMB2)
Until I rebooted it (for unrelated reasons), the server had been running fine, so I had no idea anything was wrong with its RAM. Is there any way to find errors like this without rebooting the server, e.g. with some ipmitool command?
If not, is there at least a scriptable way to see these errors after a server has been rebooted, i.e. without using the web interface? I tried ipmitool sel elist, but it shows these entries as "Unknown" events:
5 | 10/11/2019 | 11:21:25 | Unknown #0xff | | Asserted
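For scripting, the pipe-delimited output above can be scanned for entries that ipmitool could not decode. This is only a minimal sketch: the helper names (parse_sel_line, unknown_events, run_sel_elist) are made up for illustration, and the field layout is assumed from the single sample line shown above, so other firmware or ipmitool versions may format entries differently.

```python
import subprocess


def parse_sel_line(line):
    """Split one pipe-delimited SEL line into whitespace-stripped fields."""
    return [field.strip() for field in line.split("|")]


def unknown_events(sel_output):
    """Return the parsed entries whose sensor field starts with 'Unknown'.

    Assumes the layout seen in the sample line:
    id | date | time | sensor | description | direction
    """
    entries = []
    for line in sel_output.splitlines():
        fields = parse_sel_line(line)
        if len(fields) >= 4 and fields[3].startswith("Unknown"):
            entries.append(fields)
    return entries


def run_sel_elist():
    """Run 'ipmitool sel elist' and return its stdout (hypothetical helper).

    Requires ipmitool to be installed and the caller to have access to the BMC.
    """
    result = subprocess.run(
        ["ipmitool", "sel", "elist"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

On a host with ipmitool available, unknown_events(run_sel_elist()) would list the entries that need decoding with another tool. It only flags them; it cannot recover the DIMM name that ipmitool failed to decode.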
Edit: I found that Supermicro's proprietary tool, IPMICFG, can show these events (IPMICFG-Linux.x86_64 -sel list), but it would still be nice to have a way to do this with ipmitool and, most importantly, without rebooting.
Try FreeIPMI instead (ipmi-sel, for instance): there's a good chance it will give you more information than ipmitool, as its codebase is more actively maintained.
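To try this from a script, a small wrapper can prefer ipmi-sel when it is installed and fall back to ipmitool sel elist otherwise, then keep only the lines that look memory-related. This is a rough sketch under assumptions: the function names and the keyword list are hypothetical, the filter is a plain substring match instead of a real parser, and whether ipmi-sel actually decodes this particular Supermicro event depends on the board and the FreeIPMI version.

```python
import shutil
import subprocess


def find_sel_tool():
    """Return the argv of the first SEL reader found on PATH, or None."""
    candidates = [["ipmi-sel"], ["ipmitool", "sel", "elist"]]
    for argv in candidates:
        if shutil.which(argv[0]):
            return argv
    return None


def memory_lines(sel_output, keywords=("dimm", "memory", "ecc")):
    """Keep only lines mentioning one of the keywords (case-insensitive).

    The keyword list is a guess at how memory events are worded.
    """
    return [
        line for line in sel_output.splitlines()
        if any(keyword in line.lower() for keyword in keywords)
    ]


def read_memory_events():
    """Run the available SEL tool and return the memory-related lines."""
    argv = find_sel_tool()
    if argv is None:
        raise RuntimeError("neither ipmi-sel nor ipmitool was found on PATH")
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return memory_lines(result.stdout)
```

Calling read_memory_events() from a cron job or monitoring check would surface logged DIMM events without opening the web interface. It only reports what the BMC has already written to the event log, so it does not by itself answer whether the error can be seen before a reboot.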