Symptom:
The system freezes anywhere from two minutes to an hour after boot, then spontaneously reboots about ten seconds later. It doesn't matter if the system is sitting at the login screen, idle at a desktop, watching a video, etc. Temperature readings are normal leading up to the freeze+reboot.
I thought that implied a memory issue, but I've tried reseating modules, swapping slots, increasing DRAM voltage, etc. Threads on Ryzen and the Aorus motherboard sent me down rabbit holes, and I've been toggling C-states off, increasing idle DRAM power, etc. No joy.
Note that the AMD Ryzen 5 3600 itself does not appear to be defective: I exchanged it through an AMD RMA and the replacement behaves exactly the same way! (When I install an AMD Ryzen 3400G instead, the system is rock solid. However, I can't use that CPU/APU long-term for this system.)
As much information as you can stand follows. Please let me know if I've missed anything which might help further diagnose what's wrong.
I am weeks of precious time into trying to get this build stable. At this point I feel like I've tried everything except swinging a dead chicken over my head. Please help me find the root cause! I'm at my wit's end and feeling very discouraged. :(
Short list of (potentially) relevant other threads:
- kernel - Ubuntu server 20.04 Ryzen 7 3700X freezes
- AMD Ryzen 5 3600 + Ubuntu 20.04 problems
- https://wiki.gentoo.org/wiki/Ryzen#Random_reboots_with_mce_events
Hardware
- Gigabyte X570 Aorus Elite motherboard (UEFI versions: F11 or F20)
- AMD Ryzen 5 3600 6-Core Processor
- 16GB Corsair Vengeance LPX memory (DDR4 2x8GB 3200MHz)
- MSI GeForce GTX 970 GAMING 4G
- 08:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM204 [GeForce GTX 970] [10de:13c2] (rev a1)
Things I've tried with no change
- Tested the memory exhaustively (overnight, no problems detected)
- Reseating memory
- Swapping memory to the opposite memory bank
- Swapping memory sticks within the same bank
- Swapping out the CPU via RMA with AMD
- Different UEFI versions (F11 and F20)
Errors reported at boot typically look like this:
$ sudo journalctl | grep -i "hardware err"
Jul 13 17:28:36 obelisk-ubuntu kernel: mce: [Hardware Error]: Machine check events logged
Jul 13 17:28:36 obelisk-ubuntu kernel: mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 5: bea0000000000108
Jul 13 17:28:36 obelisk-ubuntu kernel: mce: [Hardware Error]: TSC 0 ADDR 1ffff87930eee MISC d012000100000000 SYND 4d000000 IPID 500b000000000
Jul 13 17:28:36 obelisk-ubuntu kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1594686497 SOCKET 0 APIC 4 microcode 8701013
Jul 13 20:06:36 obelisk-ubuntu kernel: mce: [Hardware Error]: Machine check events logged
Jul 13 20:06:36 obelisk-ubuntu kernel: mce: [Hardware Error]: CPU 4: Machine Check: 0 Bank 5: bea0000000000108
Jul 13 20:06:36 obelisk-ubuntu kernel: mce: [Hardware Error]: TSC 0 ADDR 1ffffbbf30eee MISC d012000100000000 SYND 4d000000 IPID 500b000000000
Jul 13 20:06:36 obelisk-ubuntu kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1594695977 SOCKET 0 APIC a microcode 8701021
Jul 15 16:57:44 obelisk-ubuntu kernel: mce: [Hardware Error]: Machine check events logged
Jul 15 16:57:44 obelisk-ubuntu kernel: mce: [Hardware Error]: CPU 1: Machine Check: 0 Bank 5: bea0000000000108
Jul 15 16:57:44 obelisk-ubuntu kernel: mce: [Hardware Error]: TSC 0 ADDR 1ffff89330eee MISC d012000100000000 SYND 4d000000 IPID 500b000000000
Jul 15 16:57:44 obelisk-ubuntu kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1594857445 SOCKET 0 APIC 1 microcode 8701021
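To make these entries easier to compare, here is a minimal Python sketch that pulls the CPU, bank, and status word out of journal lines like the ones above and decodes the top flag bits of the status word. It assumes the standard x86 MCi_STATUS layout for bits 63 to 57 and treats the low 16 bits as the raw MCA error code. It does not decode any AMD-specific bits or say what the error code means. The names `decode_status` and `summarize` are made up for illustration.

```python
import re
from collections import defaultdict

# Architectural MCi_STATUS flag bits (bit position -> short name).
# Assumption: only the vendor-neutral bits 63..57 are decoded here.
STATUS_FLAGS = {
    63: "VAL",    # the bank holds a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # the error was not corrected
    60: "EN",     # reporting of this error was enabled
    59: "MISCV",  # the MISC register is valid
    58: "ADDRV",  # the ADDR register is valid
    57: "PCC",    # processor context may be corrupt
}

# Matches e.g. "CPU 2: Machine Check: 0 Bank 5: bea0000000000108"
MCE_LINE = re.compile(r"CPU (\d+): Machine Check: \d+ Bank (\d+): ([0-9a-fA-F]+)")


def decode_status(status):
    """Return the set flag names and the low 16-bit MCA error code."""
    flags = [name for bit, name in STATUS_FLAGS.items() if (status >> bit) & 1]
    return {"flags": flags, "error_code": status & 0xFFFF}


def summarize(journal_text):
    """Map each distinct (bank, status) pair to the sorted CPUs that reported it."""
    seen = defaultdict(set)
    for match in MCE_LINE.finditer(journal_text):
        cpu = int(match.group(1))
        bank = int(match.group(2))
        status = int(match.group(3), 16)
        seen[(bank, status)].add(cpu)
    return {key: sorted(cpus) for key, cpus in seen.items()}
```

Fed the three events above, `summarize` yields a single (bank 5, `bea0000000000108`) entry reported by CPUs 1, 2 and 4, and `decode_status` shows both UC and PCC set. That is an uncorrected error with the processor context flagged as possibly corrupt, which would be consistent with the forced reboot.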
UEFI settings
The settings pictured below refer to F20, the most recent stable UEFI release.
Things I've tried with no change (note NO overclocking of any sort)
- Every version of Gigabyte's UEFI between F11 and F20 at "optimized default" settings
- Increasing core DRAM voltage to 1.35V
- Many of the settings below/pictured toggled in one direction or another:
  - CPU Clock Ratio: Auto (36.00)
  - CPU Clock Control: Auto (100.00MHz)
  - Extreme Memory Profile (X.M.P): Disabled
  - CPU Vcore: Auto
  - CPU Vcore Loadline Calibration: Auto
  - CSM Support: Enabled
  - SMT Mode: Disabled
  - Power Supply Idle Control: Typical Current Idle
  - IOMMU: Enabled
  - SVM Mode: Enabled
  - ACS Enabled: Auto
  - Enable AER Cap: Auto
  - Global C-state Control: Disabled
  - DRAM Power Options > Power Down Enable: Disabled
Software
Ubuntu 20.04 LTS
$ uname -a
Kernel: Linux obelisk-ubuntu 5.4.0-40-generic #44-Ubuntu SMP Tue Jun 23 00:01:04 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
$ grep GRUB_CMDLINE_LINUX_DEFAULT /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash atkbd.reset=1 i8042.reset pci=assign-busses apicmaintimer idle=poll reboot=cold,hard processor.max_cstate=1 rcu_nocbs=0-11"
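Edits to /etc/default/grub only take effect after `update-grub` and a reboot, so it can be worth confirming that the running kernel actually received these parameters. The following is a rough Python sketch that compares a wanted parameter string against a running command line such as the contents of /proc/cmdline. The helper names `parse_cmdline` and `missing_params` are hypothetical, and the parsing is deliberately simple (whitespace-separated tokens, split on the first `=`).

```python
import shlex


def parse_cmdline(cmdline):
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in shlex.split(cmdline):
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def missing_params(wanted, running):
    """Return the wanted parameters that are absent from, or differ in, running."""
    absent = object()  # sentinel so that a bare flag (None) still compares correctly
    return {k: v for k, v in wanted.items() if running.get(k, absent) != v}


if __name__ == "__main__":
    wanted = parse_cmdline(
        "quiet splash atkbd.reset=1 i8042.reset pci=assign-busses apicmaintimer "
        "idle=poll reboot=cold,hard processor.max_cstate=1 rcu_nocbs=0-11"
    )
    with open("/proc/cmdline") as f:  # Linux only
        running = parse_cmdline(f.read())
    print(missing_params(wanted, running) or "all parameters are active")
```

An empty result means every parameter from the GRUB default line is present on the running kernel with the same value.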
I have also tried installing the ZenStates package and setting it to disable C6.
Here's a gist with everything else I think you might ask for.
I'm facing the same issue with a 3700X on the same type of mainboard, running Debian Buster with several different kernels. The system had been stable for a long time; the issues started when I updated the BIOS at the same time as I installed new memory. I flashed the BIOS back to version F3 today, and the system now seems to be stable again. Unfortunately, this old BIOS version does not seem to support ECC on my memory.