How do you use SGE to reserve complete nodes on a cluster?
I don't want 2 processors from one machine, 3 processors from another, and so on. I have a quadcore cluster and I want to reserve 4 complete machines, each having 4 slots. I cannot just specify that I want 16 slots because it does not guarantee that I will have 4 slots on 4 machines each.
Changing the allocation rule to FILL_UP isn't enough because if there are no machines that are completely idle, SGE will simply "fill up" the least loaded machines as much as possible instead of waiting for 4 idle machines and then scheduling the task.
Is there any way I can do this? Is there a better place to ask this question?
I think I found a way, but it probably doesn't work on older SGE versions like mine. It seems newer versions of SGE have exclusive scheduling built in:
https://web.archive.org/web/20101027190030/http://wikis.sun.com/display/gridengine62u3/Configuring+Exclusive+Scheduling
Another possibility I've considered, though quite error-prone, is to use qlogin instead of qsub and manually reserve 4 slots on each desired quadcore machine. Understandably, automating this is not particularly easy or fun.
Lastly, maybe this is a situation where hostgroups can be used: for example, creating a hostgroup with 4 quadcore machines in it and then qsubbing to that specific subset of a queue, requesting a number of processors equal to the maximum total in the group. Unfortunately this amounts to hardcoding and has a lot of drawbacks, e.g. having to wait for people to vacate the particular hardcoded hostgroup, and needing changes if you want to switch to 8 machines instead of 4.
It seems like there is this hidden command-line request to add:
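    -l excl=True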
But you have to configure it in your SGE or Open Grid Scheduler installation first, by adding it to the list of complex values (qconf -mc) and enabling it on each individual host (qconf -me hostname).
See this link for details: http://web.archive.org/web/20130706011021/http://docs.oracle.com/cd/E24901_01/doc.62/e21978/management.htm#autoId61
In summary:
type:
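    qconf -mc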
and add the line:
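    #name        shortcut  type  relop  requestable  consumable  default  urgency
    exclusive    excl      BOOL  EXCL   YES          YES         0        1000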
then:
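    qconf -me <hostname>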
and edit the complex_values line to read:
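    complex_values    exclusive=true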
If there are any host-specific complex_values already in there, just comma-separate them.
SGE is weird with this, and I haven't found a good way to do it in the general case. One thing you can do, if you know the memory size of the node you want, is to qsub while reserving an amount of memory almost equal to the node's full capacity. This ensures the job grabs a system with nothing else running on it.
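For example, on nodes with 32 GB of RAM, something like this (the resource name varies by site; mem_free is a common default, h_vmem another):

    qsub -l mem_free=31G myjob.sh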
I'm trying to do almost exactly the same thing and am looking for ideas. I think a pe_hostsfile is the best option, but I'm not a manager of our SGE system, and there are no host files configured, so I need a quick workaround. I just checked out the Configuring Exclusive Scheduling link and see that it also requires manager rights...
I think a wrapper script could do it. I wrote a bash one-liner to figure out the number of available cores left on a machine (below). Our grid is heterogeneous, with one node having 24 cores, some having 8, and the majority only 4, which makes things a little awkward.
Here's a sketch of that one-liner anyway, assuming qhost reports NCPU in column 3 and LOAD in column 4 (check your version's output):
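    # Estimate free cores on this host: NCPU minus current LOAD, as reported
    # by qhost. Column positions are assumptions; adjust for your version.
    FREE=$(qhost -h "$(hostname)" | tail -n 1 | awk '{n = $3 - $4; if (n < 0) n = 0; printf "%d\n", n}')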
The problem now is how to get this bash variable into an SGE startup-script preprocessing directive. Maybe I'll just provide the argument below in my shell script, since the pvm environment ships with SGE. That doesn't mean it's configured, though...
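Something along these lines on the command line, since #$ directives can't expand shell variables (pvm is the stock example PE that ships with SGE; substitute whatever PE is actually configured at your site):

    qsub -pe pvm "$FREE" myjob.sh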
Sun's page on Managing Parallel Environments is pretty helpful, although again the instructions are mostly aimed at administrators.
We set the allocation rule to the number of slots available on the node (in this case, 4). This means you can only start jobs with n*4 CPUs, but it achieves the desired result: 16 CPUs will be allocated as 4 nodes with 4 CPUs each.
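A minimal sketch of the relevant PE line and a matching submission (the PE name fill4 is a placeholder):

    # in the PE definition (qconf -mp fill4):
    allocation_rule    4

    # request 16 slots; they will be allocated as 4 hosts x 4 slots:
    qsub -pe fill4 16 myjob.sh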
Specify the allocation rule in the PE configuration as $pe_slots. This will cause all of a job's slots to be allocated on a single host.
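A sketch (the PE name smp is a placeholder):

    # in the PE definition (qconf -mp smp):
    allocation_rule    $pe_slots

    # all 4 requested slots must come from one host:
    qsub -pe smp 4 myjob.sh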
I finally found the answer to this. At first I used the
-l excl=True
setup described above. However, this does not quite solve the problem. To fully solve it, I had to set up an additional parallel environment. On my cluster we have a number of 12-core nodes, so I will use those as my example.
I created an additional environment called mpich2_12, pasted below (slot totals and start/stop args here are site-specific placeholders; the essential line is allocation_rule):
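    pe_name            mpich2_12
    slots              120
    user_lists         NONE
    xuser_lists        NONE
    start_proc_args    /bin/true
    stop_proc_args     /bin/true
    allocation_rule    12
    control_slaves     TRUE
    job_is_first_task  FALSE
    urgency_slots      min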
Note that allocation_rule is set to 12, which means the job MUST use 12 cores on a node. If you submit a job requesting 48 CPUs, it will wait and grab 4 FULL nodes when they become available.
I still use the
-l excl=True
option, but I suspect it is irrelevant now. If I have jobs that require only one CPU (and I do), I submit them to the same queue, but without the
-l excl=True
option, and I use my original parallel environment, which has
allocation_rule    $fill_up
Any job submitted with the mpich2_12 environment will wait until there are complete nodes free. My cluster works so much better now.