We would like to have an SGE-based compute cluster with a queue that gives access to all nodes for the computational staff, and a second cluster queue that gives access to, say, half the nodes for occasional (but heavy) use by other staff.
We want to limit the resources of the second queue so that computational staff can keep doing some work, even when there is occasional (but heavy) use by non-comp. staff.
Is there a way to set up two (or more) SGE queues over one collection of nodes, such that one queue contains all nodes, a second queue contains a subset of the same nodes, and both queues operate simultaneously?
What specific SGE configuration parameters would I research to set up something like this?
Sure, this is totally possible. SGE queues are independent of one another, so you can assign whatever nodes you would like to each queue, letting them overlap however you wish.
To create a queue, type
qconf -aq
This will open your default editor (usually vim). Enter the name of the queue as qname, add the hosts you would like to assign in hostlist, and for slots, add a comma-delimited list of entries of the format [hostname=numslots]. Typically the number of slots is the number of cores in the host, but you can under- or over-subscribe if you prefer. If you want the queues to overlap, just add the same hosts to multiple queues.
Note, however, that by default the overlapping queues are not aware of each other's usage. They will both cheerfully assign jobs to the same node and expect them to run.
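To make the three fields concrete, here is a rough sketch that prints what you might type into the editor for two overlapping queues. The queue names (all.q, guest.q), hostnames (node01 to node04) and slot counts are made-up placeholders, and the leading 1 in the slots line is the default slot count that, as far as I remember, precedes the per-host entries. It only prints text and does not touch the cluster.

```shell
#!/bin/sh
# Sketch only: queue names, hostnames and slot counts are hypothetical.

# Print the qname, hostlist and slots fields for one queue.
# $1 = queue name, remaining args = host=slots pairs (e.g. node01=8).
queue_fields() {
    qname=$1
    shift
    hostlist=""
    slots="1"                      # assumed default for hosts without an entry
    for spec in "$@"; do
        host=${spec%%=*}           # part before the '='
        hostlist="${hostlist:+$hostlist }$host"
        slots="$slots,[$spec]"
    done
    printf 'qname     %s\nhostlist  %s\nslots     %s\n' "$qname" "$hostlist" "$slots"
}

# Queue for the computational staff: all four nodes.
queue_fields all.q node01=8 node02=8 node03=8 node04=8
echo
# Queue for everyone else: the same first two nodes, so the queues overlap.
queue_fields guest.q node01=8 node02=8
```

Because node01 and node02 appear in both queues, each queue will believe it has 8 free slots there unless you add one of the safeguards described next.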
The most common way to prevent this is to make nodes job-exclusive, so only one job may run on a node at a time. (This is the default in other schedulers like PBS.) In SGE this is a little more complicated: it involves creating a virtual "resource" which can only be used once per node. To do this, type
qconf -mc
to manage consumable resources. This will open an editor listing consumable resources: add a new one called "exclusive". For more information, see the grid engine wiki.
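As a hedged illustration of the "exclusive" resource, the sketch below prints the kind of line I would expect to add in the qconf -mc editor. The column values (shortcut excl, type BOOL, relation EXCL, urgency 1000) are my recollection of the usual recipe and, as far as I know, need a reasonably recent Grid Engine release (6.2u3 or later), so check them against the wiki before using them. The hostname node01 and the script name job.sh in the comments are placeholders.

```shell
#!/bin/sh
# Sketch only: verify the column values against your Grid Engine version.
# Columns: name shortcut type relop requestable consumable default urgency
complex_line='exclusive excl BOOL EXCL YES YES 0 1000'

echo "Line to add in the qconf -mc editor:"
echo "$complex_line"

# Follow-up steps, shown as comments because they change cluster state:
#   1. Offer the resource on each node (node01 is a placeholder), e.g. via
#        qconf -me node01
#      and set:  complex_values exclusive=true
#   2. Request it at submission time:
#        qsub -l exclusive=true job.sh
```

A job submitted with that request should then be the only job placed on its node, regardless of which queue it came through.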
You can also configure what are called subordinate queues. Here you set one queue up so that it automatically overrides the other once more than a certain number of slots per node are in use. To set this up, run
qconf -mq queue1
and under "subordinate_list", specify queue2=N. Then whenever the number of slots used on a node in queue1 is over N, the jobs in queue2 on that node will be suspended until the queue1 jobs are complete.
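For the subordinate setup, a small sketch of the entry you would type, using the queue1/queue2 names from above. The threshold of 4 is an arbitrary example, and the attribute name subordinate_list plus the non-interactive qconf -mattr form in the comment are from memory, so treat them as assumptions to verify on your installation. The script itself only prints the entry.

```shell
#!/bin/sh
# Sketch only: the threshold is a made-up example value.
high_q=queue1      # queue that takes priority
low_q=queue2       # queue whose jobs get suspended
threshold=4        # hypothetical N

subordinate_entry="$low_q=$threshold"

echo "In 'qconf -mq $high_q', set:"
echo "subordinate_list  $subordinate_entry"

# Assumed non-interactive equivalent (changes cluster state, so commented out):
#   qconf -mattr queue subordinate_list "$subordinate_entry" "$high_q"
```

Picking N close to the number of cores per node lets queue2 jobs keep running until queue1 actually needs most of the node.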