We have a new Supermicro server (AS-4124GS-TNR) equipped with eight NVIDIA RTX A6000 GPUs. The OS is Ubuntu 20.04.2, the NVIDIA driver version is 460.73.01 (the Nouveau driver is not in use), and the CUDA version is 11.2.
We ran a few long-running tests on the GPUs and the system was stable. However, after the GPUs had been idle for a while, the system crashed repeatedly.
We assume that GpuPowerMizerMode has to be set to 1 to prevent crashes during GPU idling (an assumption backed by other user reports found on the internet). The only way to do this that we know of is to start X (e.g. by starting gdm) and then set the value accordingly via nvidia-settings (running nvidia-settings without X/gdm leads to "Unable to init server: Could not connect: Connection refused."). But when X/gdm is stopped, the GpuPowerMizerMode value is automatically reset to 2. Unfortunately, keeping X/gdm running is not an option because this also leads to system instability.
So, our problem seems to be as follows:
- GPU idling + GpuPowerMizerMode != 1 can result in a system freeze.
- GpuPowerMizerMode can only be set via nvidia-settings connected to a running X/dm(?). In order to persistently set the value to 1, X/dm(?) has to keep running.
- A running X/gdm can cause a system crash.
Are our assumptions correct? / Are others also experiencing these specific problems?
How can we solve the problem of freezing during GPU idling?
It should not be necessary to start a GUI session (or even have one installed!) to change settings such as this; nvidia-settings should work fine from the framebuffer console or even in a script you write that runs at startup.
Check to be sure:
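One way to run that check, as a rough sketch: query the current value for a single GPU from the console. GPU index 0 and the helper name gpu_target are assumptions made for illustration. Whether the query succeeds without X is exactly what this tests, since the question reports "Unable to init server" in that situation.

```shell
#!/bin/sh
# Sketch: query the current PowerMizer mode of one GPU.
# gpu_target is a hypothetical helper; GPU index 0 is an assumption.
gpu_target() {
    # Builds a target string such as [gpu:0]/GpuPowerMizerMode
    printf '[gpu:%s]/GpuPowerMizerMode' "$1"
}

if command -v nvidia-settings >/dev/null 2>&1; then
    # -q queries an attribute; -t limits the output to the value itself
    nvidia-settings -q "$(gpu_target 0)" -t
else
    echo "nvidia-settings not found" >&2
fi
```

If this prints a mode value, the setting is reachable from where you ran it; if it prints the connection error from the question instead, it is not.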
For eight GPUs, just write a simple script that applies the setting to each GPU in turn, and run it at startup in whatever manner you find convenient.
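A script of that kind might look roughly like the following. Treat it as a sketch under assumptions: the GPUs are taken to be indexed 0 through 7, the helper names assignment and set_all are made up for illustration, and it only has an effect if nvidia-settings can actually reach the driver from wherever the script runs.

```shell
#!/bin/sh
# Sketch: set GpuPowerMizerMode=1 on every GPU.
# NUM_GPUS=8 (indices 0..7) is an assumption based on the eight-GPU setup.
NUM_GPUS=8
MODE=1

# Hypothetical helper: builds e.g. [gpu:3]/GpuPowerMizerMode=1
assignment() {
    printf '[gpu:%s]/GpuPowerMizerMode=%s' "$1" "$2"
}

# Hypothetical helper: runs the given command once per GPU with
# "-a <assignment>" appended. Pass "echo" for a dry run.
set_all() {
    i=0
    while [ "$i" -lt "$NUM_GPUS" ]; do
        "$@" -a "$(assignment "$i" "$MODE")"
        i=$((i + 1))
    done
}

if command -v nvidia-settings >/dev/null 2>&1; then
    set_all nvidia-settings
else
    echo "nvidia-settings not found; nothing changed" >&2
fi
```

Calling set_all echo prints the eight assignments without applying them, which is handy for checking the target strings before wiring the script into startup.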
I can't say whether your crashes are due to running with GpuPowerMizerMode != 1. If that is the case, then you probably have some sort of defective hardware that you should track down and replace.