Is there any way to clear stale allocated GRES in Slurm?
I have one node where 4 GPUs show as allocated even though no jobs are running on it. Rebooting the node does not release the GPUs.
user@control1:~$ scontrol show node node2
NodeName=node2 Arch=x86_64 CoresPerSocket=64
CPUAlloc=0 CPUTot=256 CPULoad=0.05
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=gpu:tesla:8
NodeAddr=node2 NodeHostName=node2 Version=21.08.5
OS=Linux 5.15.0-83-generic #92-Ubuntu SMP Mon Aug 14 09:30:42 UTC 2023
RealMemory=1025596 AllocMem=0 FreeMem=1025887 Sockets=2 Boards=1
State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=DEFAULT
BootTime=2023-09-19T12:58:23 SlurmdStartTime=2023-09-19T12:59:06
LastBusyTime=2023-09-19T12:59:07
CfgTRES=cpu=256,mem=1025596M,billing=256,gres/gpu=8,gres/gpu:tesla=8
AllocTRES=gres/gpu=4
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
user@control1:~$ squeue -w node2 --state=all
JOBID NAME USER STATE TIME QOS PRIORITY TIME_LIMIT NODELIST(REASON)
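
So the controller still counts 4 GPUs as allocated (AllocTRES=gres/gpu=4) although squeue reports no jobs on the node at all. What I was planning to try next, but have not yet run on this production node, is to check accounting for recently ended jobs and then cycle the node through DOWN/RESUME so slurmctld re-evaluates it. Treat the following as a sketch rather than a confirmed fix (the start date and the Reason text are just examples):

# Look for jobs that recently ended on node2 and might have left the
# allocation behind (start date is just an example).
sacct --nodelist=node2 --starttime=2023-09-18 --allocations \
      --format=JobID,JobName,State,End,AllocTRES

# Cycle the node through DOWN/RESUME so slurmctld drops and re-creates
# its allocation state for the node. DOWN would kill real jobs on the
# node, but according to squeue there are none.
sudo scontrol update NodeName=node2 State=DOWN Reason="stale GRES allocation"
sudo scontrol update NodeName=node2 State=RESUME

# Check whether the stale allocation is gone.
scontrol show node node2 | grep AllocTRES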
I don't know why, but this morning the allocation was gone. The node had been turned off overnight to conserve energy (a cluster-wide configuration); apart from that, I didn't change anything in the configuration.
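If anyone knows a way to force this without waiting for the nightly power-down, I would still like to hear it. My untested guess is that restarting slurmctld on the controller would have the same effect, since as far as I understand it rebuilds the per-node allocation state from its job records when it starts up. The commands below are just that guess, not something I have verified:

# Untested guess: restart the controller so it reconstructs node
# allocation state from the remaining job records. Run this on the
# control host (control1), not on node2.
sudo systemctl restart slurmctld

# Then check whether the stale allocation is gone.
scontrol show node node2 | grep AllocTRES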