I have a Fedora 23 Linux box running as a dev server, and several times over the last few weeks I have gone to log into it and found that it had reset. One time it rebooted right in front of me: it appeared to reset to the BIOS and then power up again.
This seems to happen about once every 2 or 3 days. The server log shows only normal operations (cron etc.) until it resets and reboots:
https://paste.fedoraproject.org/518600/33737531/
Jan 01 20:01:02 pc03.config run-parts[19540]: (/etc/cron.hourly) starting mcelog.cron
Jan 01 20:01:02 pc03.config run-parts[19544]: (/etc/cron.hourly) finished mcelog.cron
Jan 01 20:09:10 pc03.config puppet-agent[19565]: Applied catalog in 0.03 seconds
-- Reboot --
Jan 01 20:17:57 pc03.config systemd-journal[372]: Runtime journal is using 8.0M (max allowed 1.5G, trying to leave 2.3G free of 15.6G available → current limit 1.5G).
Jan 01 20:17:57 pc03.config systemd-journal[372]: Runtime journal is using 8.0M (max allowed 1.5G, trying to leave 2.3G free of 15.6G available → current limit 1.5G).
Jan 01 20:17:57 pc03.config kernel: Linux version 4.8.13-100.fc23.x86_64 ([email protected]) (gcc version 5.3.1 20160406 (Red Hat 5.3.1-6) (GCC) ) #1 SMP Fri Dec 9 14:51:40 UTC 2016
Jan 01 20:17:57 pc03.config kernel: Command line: BOOT_IMAGE=/vmlinuz-4.8.13-100.fc23.x86_64 root=/dev/mapper/fedora_pc03-root ro rd.lvm.lv=fedora_pc03/root rd.lvm.lv=fedora_pc03/swap rhgb quiet nouveau.modeset=0 rd.driver.blacklist=nouveau video=vesa:off LANG=en_GB.UTF-8
Jan 01 20:17:57 pc03.config kernel: x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
Jan 01 20:17:57 pc03.config kernel: x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
However, there are lots of these messages in the journal:
Jan 01 17:05:20 pc03.config kernel: {680}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
Jan 01 17:05:20 pc03.config kernel: {680}[Hardware Error]: It has been corrected by h/w and requires no further action
Jan 01 17:05:20 pc03.config kernel: {680}[Hardware Error]: event severity: corrected
Jan 01 17:05:20 pc03.config kernel: {680}[Hardware Error]: Error 0, type: corrected
Jan 01 17:05:20 pc03.config kernel: {680}[Hardware Error]: fru_text: CorrectedErr
Jan 01 17:05:20 pc03.config kernel: {680}[Hardware Error]: section_type: PCIe error
Jan 01 17:05:20 pc03.config kernel: {680}[Hardware Error]: port_type: 0, PCIe end point
Jan 01 17:05:20 pc03.config kernel: {680}[Hardware Error]: version: 0.0
Jan 01 17:05:20 pc03.config kernel: {680}[Hardware Error]: command: 0xffff, status: 0xffff
Jan 01 17:05:20 pc03.config kernel: {680}[Hardware Error]: device_id: 0000:80:02.3
Jan 01 17:05:20 pc03.config kernel: {680}[Hardware Error]: slot: 0
Jan 01 17:05:20 pc03.config kernel: {680}[Hardware Error]: secondary_bus: 0x00
Jan 01 17:05:20 pc03.config kernel: {680}[Hardware Error]: vendor_id: 0xffff, device_id: 0xffff
Jan 01 17:05:20 pc03.config kernel: {680}[Hardware Error]: class_code: ffffff
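To get a feel for how often each device is being reported, something like the following could tally the records by PCI address. This is only a rough sketch: the function name `tally_devices` and the regular expression are my own, and it assumes the journal text has the same layout as the excerpt above (for example, kernel messages saved to a file and read line by line).

```python
import re
from collections import Counter

# Matches the PCI address line of an APEI record, e.g.
#   "... kernel: {680}[Hardware Error]: device_id: 0000:80:02.3"
# The "vendor_id: 0xffff, device_id: 0xffff" line is deliberately not matched,
# because the pattern requires a domain:bus:device.function address.
DEVICE_RE = re.compile(
    r"\[Hardware Error\]:\s+device_id:\s+"
    r"([0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7])"
)

def tally_devices(lines):
    """Count Hardware Error records per PCI address in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = DEVICE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# Two lines copied from the excerpt above; only the first carries a PCI address.
SAMPLE = [
    "Jan 01 17:05:20 pc03.config kernel: {680}[Hardware Error]: device_id: 0000:80:02.3",
    "Jan 01 17:05:20 pc03.config kernel: {680}[Hardware Error]: vendor_id: 0xffff, device_id: 0xffff",
]

if __name__ == "__main__":
    for address, count in tally_devices(SAMPLE).most_common():
        print(address, count)
```

If every record points at the same address, that would narrow the search to one slot or one card rather than the board as a whole.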
I checked the BIOS SMBIOS event log, and it only has the reboot code 0x17 showing the machine coming up after the reset; it has not registered any memory resets as I expected.
Unfortunately the machine does not support IPMI, as the board is a Supermicro X9DAi.
I am not sure how to interpret the error code in that Hardware Error message, but it seems that 0000:80:02 corresponds to:
[root@pc03 ~]# lspci -s 0000:80:02
80:02.0 PCI bridge: Intel Corporation Xeon E5/Core i7 IIO PCI Express Root Port 2a (rev 07)
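Note that the log names function 3 (80:02.3), while lspci lists only function 0 (80:02.0) for that slot. As a small sketch of how the address breaks down, here is a parser for the domain:bus:device.function form; the names `PciAddress` and `parse_bdf` are just mine for illustration.

```python
from typing import NamedTuple

class PciAddress(NamedTuple):
    domain: int
    bus: int
    device: int
    function: int

def parse_bdf(text):
    """Split a 'dddd:bb:dd.f' PCI address into its numeric fields (all hex)."""
    location, function = text.split(".")
    domain, bus, device = location.split(":")
    return PciAddress(int(domain, 16), int(bus, 16), int(device, 16), int(function, 16))

if __name__ == "__main__":
    reported = parse_bdf("0000:80:02.3")   # address from the Hardware Error record
    listed = parse_bdf("0000:80:02.0")     # address lspci printed for the root port
    # Same domain, bus and device, but a different function number.
    print(reported[:3] == listed[:3], reported.function, listed.function)
```

So the error record and the root port share a bus and device number, but the record refers to a function that lspci does not show on this machine.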
I am currently monitoring the server for temps/cpu, and so I will have a good idea of the sensor states when it crashes next. Are there any other steps I can take to determine the root cause of this crashing?