I know the topic is weird, but so is my problem. On our cluster we have SGE with Open MPI compiled for tight integration. When I set it up it worked just fine in my tests, and there were no complaints until recently. The thing is: when I submit a job using the Open MPI PE and run my binary using mpirun, it fails.
The error messages look like
fully.qualified.host.name - daemon did not report back when launched
and
[hostname:\d{5}] [[63730,0],\d{1,2}] routed:binomial: Connection to lifeline [[63730,0],0] lost
That happens even for something as simple as mpirun -np 40 --pernode hostname
Now here's the weird part: if I turn on verbose output for plm_base, it works. mpirun -np 40 --mca plm_base_verbose 5 --pernode hostname
runs fine! The load of debugging output this produces on stderr contains no indication of a problem whatsoever.
I've tried this multiple times and can always reproduce it, so I'm quite positive this isn't just a fluke. Problem is: I'm quite puzzled now.
I'm certainly missing something, so here are my questions:
- Does setting the verbosity in this case also silently set other parameters?
- What else could cause this weird behaviour?
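To check the first question myself, I compared the MCA parameter defaults with and without the verbosity flag using ompi_info, which ships with Open MPI. A sketch of what I ran (the --level option is needed on newer Open MPI releases to show all parameters; on older versions it may not exist, so adjust for your version):

```shell
# List all MCA parameters of the plm framework's rsh component,
# to see whether plm_base_verbose silently changes other defaults.
ompi_info --param plm rsh --level 9

# For comparison, dump the whole plm framework:
ompi_info --param plm all --level 9
```

Nothing in that output jumped out at me, but maybe I'm reading it wrong.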
Best Regards.
Edit: configuration of the relevant PE:
pe_name ompi-gcc
slots 2000
user_lists NONE
xuser_lists NONE
start_proc_args /bin/true
stop_proc_args /bin/true
allocation_rule $round_robin
control_slaves TRUE
job_is_first_task FALSE
urgency_slots min
Nothing fancy there... Open MPI is compiled for tight integration and detects everything it needs. Nevertheless, it doesn't work with qrsh, i.e. it only works when qrsh is disabled in favour of rsh...
Never mind. After some trial and error with the other parameters of plm, I found that setting plm_rsh_disable_qrsh fixes the problem. However, that doesn't explain why setting plm_base_verbose to something other than 0 also fixed the problem. This is the part I still don't get.