We have a cluster being used to run MPI jobs for a customer. Previously this cluster used Torque as the scheduler, but we are transitioning to Grid Engine 6.2u5 (for some other features). Unfortunately, we are having trouble duplicating some of our maintenance scripts in the Grid Engine environment.
In Torque, we have a prologue.parallel script which is used to carry out an automated health-check on the node. If this script returns a fail condition, Torque will helpfully offline the node and re-queue the job to use a different group of nodes.
In Grid Engine, however, the queue "prolog" only runs on the head node of the job. We can manually run our prologue script from the startmpi.sh initialization script, for the mpi parallel environment; but I can't figure out how to detect a fail condition and carry out the same "mark offline and requeue" procedure.
Any suggestions?
I can't say I've tried it, but having the prolog script return a value other than 0, 99, or 100 should place the queue in an error state. You may be able to use a similar tactic in the start_proc_args script. If that doesn't work, I'm not sure whether what you are asking is possible via prolog scripts. Perhaps you could use a health-check cron job (or your monitoring system of choice) to perform the checks and disable the host's queues if they fail?
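If you do go the prolog route, a minimal sketch might look like the following (the health-check script path is a placeholder, not part of SGE):

    #!/bin/sh
    # Hypothetical prolog health check. Per sge_conf(5):
    #   exit 0   - success
    #   exit 99  - reschedule the job
    #   exit 100 - put the job in an error state
    #   other    - put the queue in an error state
    if ! /opt/site/healthcheck.sh; then
        # Take the queue out of service so no further jobs land on this node.
        exit 1
    fi
    exit 0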
In case it's helpful to others, here's what we ended up doing:
Health checks on a long time-scale that wouldn't interfere with potentially overlapping jobs (e.g. checking for hardware problems in the storage system) were offloaded to periodic cron jobs, with frequencies depending on the check.
Health checks on a long time-scale that might interfere with jobs (e.g. memory performance checks) were offloaded to an SGE job submitted nightly by cron to each node in "exclusive" mode. If the check fails, the node is offlined before any other jobs can arrive.
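For reference, the nightly submission was along these lines; the script names are illustrative, and this assumes an "exclusive" boolean complex has been configured for exclusive host scheduling:

    #!/bin/sh
    # submit_memcheck.sh - run nightly from cron on the submit host, e.g.:
    #   0 2 * * *  /opt/site/submit_memcheck.sh
    # Submits one exclusive check job per execution host.
    for host in $(qconf -sel); do
        qsub -N memcheck -l exclusive=true -l hostname="$host" \
             -b y /opt/site/memcheck.sh
    done
    # memcheck.sh itself runs the test and offlines the host on failure:
    #   /opt/site/run_memtest.sh || qmod -d "*@$(hostname)"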
Checks on the environment conditions right before running a job (looking for stray processes, full memory, etc.) were put in a script run from the PE startup script, startmpi.sh. Commands are sent to the nodes using pdsh, and result codes are returned via STDOUT. (Not ideal, but workable.) If one or more nodes fail, the script offlines them and runs qmod -r $JOB_ID to re-run the job. (Note that the job has to be marked "re-runnable", either in its script or by default.) This forces the list of nodes to be rebuilt before the job script actually runs.

We're currently working on building fault-tolerance into this, but the basics have been confirmed to work. Thanks to @kamil-kisiel and the #gridengine channel on synirc.net for suggestions.
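A rough sketch of that pre-job check (the check script name is hypothetical, and the real version parses the pdsh output more carefully):

    #!/bin/sh
    # Called from startmpi.sh before the MPI machinery is set up.
    # $PE_HOSTFILE and $JOB_ID are set by SGE for parallel jobs
    # (alternatively, pass $pe_hostfile in from start_proc_args).
    BAD=""
    for node in $(awk '{print $1}' "$PE_HOSTFILE" | sort -u); do
        # Run the check remotely; the check script prints FAIL on problems.
        if pdsh -N -w "$node" /opt/site/prejob_check.sh | grep -q FAIL; then
            BAD="$BAD $node"
        fi
    done
    if [ -n "$BAD" ]; then
        for node in $BAD; do
            qmod -d "*@$node"     # offline all queue instances on the bad node
        done
        qmod -r "$JOB_ID"         # re-run the job on a fresh set of nodes
    fi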
Why not create a load sensor that runs on every node and, depending on what you test for, sets a complex?
With this approach you can still run jobs that don't depend on, say, the interconnect when your interconnect network is down.
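A minimal load sensor sketch, assuming a boolean complex (here called interconnect_ok) has been added with qconf -mc; the check command is a placeholder:

    #!/bin/sh
    # Reports interconnect health using the standard load sensor protocol:
    # wait for a line on stdin, then emit a begin/end report block;
    # exit when "quit" is received.
    HOST=$(hostname)
    while read -r line; do
        [ "$line" = "quit" ] && exit 0
        if /opt/site/check_interconnect.sh >/dev/null 2>&1; then
            STATE=1
        else
            STATE=0
        fi
        echo "begin"
        echo "$HOST:interconnect_ok:$STATE"
        echo "end"
    done

Register it as the load_sensor in the host or global configuration (qconf -mconf), and have jobs that actually need the interconnect request it with -l interconnect_ok=1; everything else schedules as usual.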