We have a cluster being used to run MPI jobs for a customer. Previously this cluster used Torque as the scheduler, but we are transitioning to Grid Engine 6.2u5 (for some other features). Unfortunately, we are having trouble duplicating some of our maintenance scripts in the Grid Engine environment.
In Torque, we have a prologue.parallel script which is used to carry out an automated health-check on the node. If this script returns a fail condition, Torque will helpfully offline the node and re-queue the job to use a different group of nodes.
In Grid Engine, however, the queue "prolog" only runs on the head node of the job. We can manually run our prologue script from the startmpi.sh initialization script, for the mpi parallel environment; but I can't figure out how to detect a fail condition and carry out the same "mark offline and requeue" procedure.
Any suggestions?
I can't say I've tried it, but having the prolog script return a value other than 0, 99, or 100 should place the queue in an error state. You may be able to use a similar tactic in the start_proc_args script. If that doesn't work, I'm not sure whether what you are asking is possible via prolog scripts. Perhaps you could use a health-check cron job (or your monitoring system of choice) to perform the checks and disable the host's queues if they fail?
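If you do go the prolog route, a minimal sketch might look like the following (the health-check script path is a placeholder, not part of SGE):

    #!/bin/sh
    # Hypothetical prolog health check. Per sge_conf(5):
    #   exit 0   - success
    #   exit 99  - reschedule the job
    #   exit 100 - put the job in an error state
    #   other    - put the queue in an error state
    if ! /opt/site/healthcheck.sh; then
        # Take the queue out of service so no further jobs land on this node.
        exit 1
    fi
    exit 0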
In case it's helpful to others, here's what we ended up doing:
Health checks on a long time-scale that wouldn't interfere with potentially overlapping jobs (e.g. checking for hardware problems in the storage system) were offloaded to periodic cron jobs, with frequencies depending on the check.
Health checks on a long time-scale that might interfere with jobs (e.g. memory performance checks) were offloaded to an SGE job submitted nightly by cron to each node in "exclusive" mode. If the check fails, the node is offlined before any other jobs can arrive.
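For reference, the nightly submission was along these lines; the script names are illustrative, and this assumes an "exclusive" boolean complex has been configured for exclusive host scheduling:

    #!/bin/sh
    # submit_memcheck.sh - run nightly from cron on the submit host, e.g.:
    #   0 2 * * *  /opt/site/submit_memcheck.sh
    # Submits one exclusive check job per execution host.
    for host in $(qconf -sel); do
        qsub -N memcheck -l exclusive=true -l hostname="$host" \
             -b y /opt/site/memcheck.sh
    done
    # memcheck.sh itself runs the test and offlines the host on failure:
    #   /opt/site/run_memtest.sh || qmod -d "*@$(hostname)"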
Checks on the environment conditions right before running a job (looking for stray processes, full memory, etc.) were put in a script run from the PE startup script, startmpi.sh. Commands are sent to the nodes using pdsh, and result codes are returned via STDOUT. (Not ideal, but workable.) If one or more nodes fail, the script offlines them and runs qmod -r $JOB_ID to re-run the job. (Note that the job has to be marked "re-runnable", either in its script or by default.) This forces the list of nodes to be rebuilt before the job script actually runs.

We're currently working on building fault-tolerance into this, but the basics have been confirmed to work. Thanks to @kamil-kisiel and the #gridengine channel on synirc.net for suggestions.
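A rough sketch of that pre-job check (the check script name is hypothetical, and the real version parses the pdsh output more carefully):

    #!/bin/sh
    # Called from startmpi.sh before the MPI machinery is set up.
    # $PE_HOSTFILE and $JOB_ID are set by SGE for parallel jobs
    # (alternatively, pass $pe_hostfile in from start_proc_args).
    BAD=""
    for node in $(awk '{print $1}' "$PE_HOSTFILE" | sort -u); do
        # Run the check remotely; the check script prints FAIL on problems.
        if pdsh -N -w "$node" /opt/site/prejob_check.sh | grep -q FAIL; then
            BAD="$BAD $node"
        fi
    done
    if [ -n "$BAD" ]; then
        for node in $BAD; do
            qmod -d "*@$node"     # offline all queue instances on the bad node
        done
        qmod -r "$JOB_ID"         # re-run the job on a fresh set of nodes
    fi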
Why not create a load sensor that runs on every node and, depending on what you test for, sets a complex?
With this approach you can still run jobs that don't depend on, say, the interconnect when your interconnect network is down.
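A minimal load sensor sketch, assuming a boolean complex (here called interconnect_ok) has been added with qconf -mc; the check command is a placeholder:

    #!/bin/sh
    # Reports interconnect health using the standard load sensor protocol:
    # wait for a line on stdin, then emit a begin/end report block;
    # exit when "quit" is received.
    HOST=$(hostname)
    while read -r line; do
        [ "$line" = "quit" ] && exit 0
        if /opt/site/check_interconnect.sh >/dev/null 2>&1; then
            STATE=1
        else
            STATE=0
        fi
        echo "begin"
        echo "$HOST:interconnect_ok:$STATE"
        echo "end"
    done

Register it as the load_sensor in the host or global configuration (qconf -mconf), and have jobs that actually need the interconnect request it with -l interconnect_ok=1; everything else schedules as usual.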