I have a Linux server that has logged the following mcelog error:
Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 20
MISC 800000
TIME 1476167381 Tue Oct 11 06:29:41 2016
MCG status:
MCi status:
Corrected error
MCi_MISC register valid
MCA: BUS error: 0 0 Level-3 Generic Generic Other-transaction
Request-did-not-timeout
QPI:
Intel QPI physical layer detected a QPI in-band reset but aborted
initialization
STATUS 8800004000200e0f MCGSTATUS 0
MCGCAP 7000c16 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 63
I can find reference to this error in Intel system programming docs, and monitoring code on github, but nothing explaining the cause, effect and suggested actions. I have read through the latest microcode update notes to see if it's mentioned but can't find anything.
The error might be a 'cosmic radiation-type' one-off or a 'non-event' to ignore, but can anyone elaborate with some real world System Admin-level guidance?
Thanks
I assume that is a pair of E5-2640v4 processors (the v# at the end matters).
You need to check the processor errata sheet (search for the "specification update" documents for your specific processor), as there are several errata about QPI issues on many processor models...
Ok: http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-e5-v4-spec-update.pdf does not list any errata that would result in that QPI behavior. You might have a hardware defect, or you hit an unlisted erratum (more common than you'd think).
However, Supermicro is absolute crap at keeping their BIOS up-to-date (they still have that outrageous statement about never updating your BIOS on their support pages), so we can safely assume it will have outdated platform firmware kit components, such as microcode updates and platform setup bytecode.
So, you can still hope a firmware update would help. As expected from Supermicro, even the latest BIOS for that motherboard ships a microcode update that is too old, below the minimum version recommended when running Linux (you want at least revision 0x0b00001d, from 2016-06-06). Please install the microcode update package for your distro (it must be based on Intel's version 20160714 or later); that might help.
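To see which microcode revision is actually loaded, you can read the `microcode` field that Linux exposes in /proc/cpuinfo on x86. The following is only a rough sketch: the function names are made up for illustration, and the 0x0b00001d threshold is the one quoted above, which is only meaningful for the processor family discussed here (revision numbers are not comparable across different CPU signatures).

```python
import re
from pathlib import Path

# Minimum revision quoted in the answer above; only meaningful for that CPU family.
MIN_REVISION = 0x0B00001D


def microcode_revisions(cpuinfo_text):
    """Return the set of microcode revisions found in /proc/cpuinfo-style text."""
    matches = re.findall(
        r"^microcode\s*:\s*(0x[0-9a-fA-F]+)", cpuinfo_text, re.MULTILINE
    )
    return {int(m, 16) for m in matches}


def outdated_revisions(cpuinfo_text, minimum=MIN_REVISION):
    """Return the sorted revisions that are older than the given minimum."""
    return sorted(r for r in microcode_revisions(cpuinfo_text) if r < minimum)


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        text = cpuinfo.read_text()
        for rev in sorted(microcode_revisions(text)):
            print(f"loaded microcode revision: {rev:#010x}")
        if outdated_revisions(text):
            print(f"at least one CPU is below {MIN_REVISION:#010x}")
```

Run it once before and once after installing the distro's microcode package (and rebooting, or reloading microcode) to confirm the update was really applied.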
Supermicro support is typically quite good at addressing the issues caused by their joke of a server/workstation firmware management lifecycle, so report the issue to them directly and request a beta BIOS with updated firmware (processor microcode, chipset, ME/AMT/TPM firmware and platform setup components). They might tell you to RMA the board instead, though, if they consider it more likely to be a hardware defect.
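As a side note, the STATUS value in the log can be decoded by hand, which helps when judging whether this was a one-off. The sketch below splits it into the architectural IA32_MCi_STATUS fields as laid out in the Intel SDM; the function name and dictionary keys are just illustrative, and the corrected-error count field (bits 52:38) is only defined when the CPU reports CMCI support in MCG_CAP bit 10, which the MCGCAP value in the log (7000c16) does.

```python
def decode_mci_status(status):
    """Split an IA32_MCi_STATUS value into its architectural fields."""
    return {
        "valid": bool((status >> 63) & 1),        # VAL: register holds a valid error
        "overflow": bool((status >> 62) & 1),     # OVER: an earlier error was overwritten
        "uncorrected": bool((status >> 61) & 1),  # UC: error was not corrected by hardware
        "enabled": bool((status >> 60) & 1),      # EN: error reporting was enabled
        "miscv": bool((status >> 59) & 1),        # MISCV: MCi_MISC register is valid
        "addrv": bool((status >> 58) & 1),        # ADDRV: MCi_ADDR register is valid
        "pcc": bool((status >> 57) & 1),          # PCC: processor context corrupted
        "corrected_count": (status >> 38) & 0x7FFF,      # bits 52:38
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
        "mca_code": status & 0xFFFF,                     # bits 15:0
    }


if __name__ == "__main__":
    # STATUS value taken from the mcelog output in the question.
    fields = decode_mci_status(0x8800004000200E0F)
    for name, value in fields.items():
        shown = hex(value) if name.endswith("_code") else value
        print(f"{name}: {shown}")
```

For this log entry the uncorrected and overflow bits are clear and the corrected-error count is 1, which matches mcelog's "Corrected error" line and is consistent with a single event in that bank. It does not tell you the cause, so keep watching the log for repeats.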