At the moment, we have set up Slurm to manage a small cluster of six nodes with four GPUs each. That has been working great so far, but now we also want to use the Intel Core i7-5820K CPUs for jobs that only require CPU processing power. Each CPU has six cores and 12 threads, and each GPU requires one thread (logical core), so 8 threads per node remain that could be used for "CPU-only" jobs.
Current configuration:
cat /etc/slurm-llnl/gres.conf
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1
Name=gpu File=/dev/nvidia2
Name=gpu File=/dev/nvidia3
cat /etc/slurm-llnl/slurm.conf (excerpt)
SchedulerType=sched/builtin
SelectType=select/cons_res
SelectTypeParameters=CR_Core
AccountingStorageType=accounting_storage/none
GresTypes=gpu
MaxTasksPerNode=4
NodeName=node1 CoresPerSocket=4 Procs=8 Sockets=1 ThreadsPerCore=2 Gres=gpu:4 State=UNKNOWN
NodeName=node2 CoresPerSocket=4 Procs=8 Sockets=1 ThreadsPerCore=2 Gres=gpu:4 State=UNKNOWN
NodeName=node3 CoresPerSocket=4 Procs=8 Sockets=1 ThreadsPerCore=2 Gres=gpu:4 State=UNKNOWN
NodeName=node4 CoresPerSocket=4 Procs=8 Sockets=1 ThreadsPerCore=2 Gres=gpu:4 State=UNKNOWN
NodeName=node5 CoresPerSocket=4 Procs=8 Sockets=1 ThreadsPerCore=2 Gres=gpu:4 State=UNKNOWN
NodeName=node6 CoresPerSocket=4 Procs=8 Sockets=1 ThreadsPerCore=2 Gres=gpu:4 State=UNKNOWN
PartitionName=gpu Nodes=node[2-6] Default=NO Shared=NO MaxTime=INFINITE State=UP
PartitionName=short Nodes=node1 Default=YES Shared=NO MaxTime=INFINITE State=UP
I guess the first step would be to change CoresPerSocket=4 Procs=8 to CoresPerSocket=6 Procs=12, because that would match the actual hardware.
I already tried to consult the documentation, but I still don't know what to do. Do I need to modify the gres.conf? Which File= should I specify for a CPU? Then, I thought I would add a third partition, maybe called cpuonly. But is that even the right way to accomplish what I am trying to do? I guess I have to add something to the Gres= parameter in the lines starting with NodeName.
Set MaxCPUsPerNode for each partition.
Set the CPUs parameter for each node. The sum of all MaxCPUsPerNode values should be less than or equal to this (available CPUs/cores/threads).
SelectTypeParameters=CR_CPU
SchedulerType=sched/backfill
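As a rough sketch of how these settings might fit together in slurm.conf: the cpuonly partition name is taken from the question, and the 4/8 split of the 12 hardware threads between the two partitions is an assumption chosen for illustration. This has not been tested on the cluster described above.

```conf
# Sketch only, untested: the partition name and the CPU split are assumptions.
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_CPU
GresTypes=gpu

# Count each of the 12 hardware threads of the i7-5820K as one schedulable CPU.
NodeName=node[2-6] CPUs=12 Gres=gpu:4 State=UNKNOWN

# 4 + 8 = 12, which does not exceed CPUs=12 on each node.
PartitionName=gpu     Nodes=node[2-6] Default=NO MaxCPUsPerNode=4 MaxTime=INFINITE State=UP
PartitionName=cpuonly Nodes=node[2-6] Default=NO MaxCPUsPerNode=8 MaxTime=INFINITE State=UP
```

With a layout like this, gres.conf would stay as it is, since the CPUs are scheduled as CPUs rather than as a generic resource and so need no File= entry or extra Gres= value. The node line deliberately leaves out Sockets, CoresPerSocket and ThreadsPerCore so that each thread is counted as a CPU. The slurm.conf man page describes how specifying them interacts with CR_CPU, so it is worth checking against the installed Slurm version.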