I'm currently dealing with a Windows Server 2019 Standard machine that hangs under heavy system load. The server hosts Windows Docker containers that are used to compile and test .NET applications.
The server is a dual-socket system with two AMD EPYC 7451 processors and 128 GB of memory. It runs Windows Server 2019 Version 1809 (Build 17763.1158).
The problem appears when the system is under heavy load, meaning around 90% load on both CPUs and roughly 90 GB of memory in use, while Docker containers are being created and destroyed at the same time. When it occurs, the entire system suddenly stops. However, while connected to the server's physical VGA port, I noticed that the desktop was still partly responsive. I had Process Explorer open at the time. The process list and all the graphs stopped dead, but the UI still reacted: I could move windows, and switching tabs in Process Explorer's System Information dialog still worked. However, any action that opened a new window immediately froze the Process Explorer UI as well.
Once the system is frozen, CTRL+ALT+DEL no longer works. I also enabled CTRL+ALT+SCRLK to trigger a BSoD, without success. The mouse cursor still moves, though, and toggling Num Lock on the keyboard works as well. The event log shows no entries once the system is frozen and no errors right before. The last entry in the event log is usually a message from the Hyper-V VMSwitch about either creating or deleting a Hyper-V network.
My guess was that the issue may be related to system handles, because starting applications and creating windows no longer seemed to work, but at the time of the freeze there were only around 250k handles open on the system.
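To check the handle theory against hard numbers, a small logger along the following lines could record system-wide counters right up to the freeze. This is only a sketch under assumptions: it presumes Python is available on the host, the function names and the one-second interval are made up for illustration, and it reads the counters through the Win32 `GetPerformanceInfo` call in `psapi.dll` via `ctypes`. Each line is flushed to disk immediately, so the last line in the file shows the state just before the logger stopped.

```python
import ctypes
import os
import time


class PerformanceInformation(ctypes.Structure):
    """Mirror of the Win32 PERFORMANCE_INFORMATION structure."""
    _fields_ = [
        ("cb", ctypes.c_uint32),
        ("CommitTotal", ctypes.c_size_t),
        ("CommitLimit", ctypes.c_size_t),
        ("CommitPeak", ctypes.c_size_t),
        ("PhysicalTotal", ctypes.c_size_t),
        ("PhysicalAvailable", ctypes.c_size_t),
        ("SystemCache", ctypes.c_size_t),
        ("KernelTotal", ctypes.c_size_t),
        ("KernelPaged", ctypes.c_size_t),
        ("KernelNonpaged", ctypes.c_size_t),
        ("PageSize", ctypes.c_size_t),
        ("HandleCount", ctypes.c_uint32),
        ("ProcessCount", ctypes.c_uint32),
        ("ThreadCount", ctypes.c_uint32),
    ]


def sample_windows():
    """Read system-wide counters via GetPerformanceInfo (Windows only)."""
    info = PerformanceInformation()
    info.cb = ctypes.sizeof(info)
    psapi = ctypes.WinDLL("psapi", use_last_error=True)
    if not psapi.GetPerformanceInfo(ctypes.byref(info), info.cb):
        raise ctypes.WinError(ctypes.get_last_error())

    def to_mib(pages):
        # The memory fields are reported in pages, not bytes.
        return pages * info.PageSize // (1024 * 1024)

    return {
        "handles": info.HandleCount,
        "processes": info.ProcessCount,
        "threads": info.ThreadCount,
        "commit_mib": to_mib(info.CommitTotal),
        "avail_mib": to_mib(info.PhysicalAvailable),
        "paged_pool_mib": to_mib(info.KernelPaged),
        "nonpaged_pool_mib": to_mib(info.KernelNonpaged),
    }


def format_line(timestamp, sample):
    """Render one sample as 'UTC-time key=value key=value ...'."""
    stamp = time.strftime("%Y-%m-%dT%H:%M:%S", time.gmtime(timestamp))
    fields = " ".join(f"{key}={value}" for key, value in sample.items())
    return f"{stamp}Z {fields}"


def append_line(path, line):
    """Append a line and force it to disk so it survives a hard reset."""
    with open(path, "a", encoding="utf-8") as log:
        log.write(line + "\n")
        log.flush()
        os.fsync(log.fileno())


def run_logger(path, interval_s=1.0, sampler=sample_windows, iterations=None):
    """Log one sample per interval; runs until interrupted if iterations is None."""
    count = 0
    while iterations is None or count < iterations:
        append_line(path, format_line(time.time(), sampler()))
        count += 1
        time.sleep(interval_s)
```

On the server this would be started before the load test with something like `run_logger(r"C:\logs\counters.log")`, where the path is just a placeholder. Besides the handle count it also records commit charge and kernel pool sizes, since those are returned by the same call anyway.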
To resolve the issue, I have already updated essentially all hardware drivers, updated the firmware of every component that allows it, and updated the BIOS to the latest version, all without any change. I also ran a stress test on the CPUs and memtest on the RAM. Neither revealed any issues.
I'm running out of ideas about what else to do, or even what to look for at this point. Has anyone here had a similar issue, or does anyone have advice on what else I could try to resolve it?