I have access to a cluster that uses 'torque' (I think) and we use PBS scripts to submit jobs. I need to run more than 200 instances of an app that I've developed in Java. The app acts as a peer forming a P2P network, which means that those instances communicate with each other through sockets.
I was able to do my tests with 100 instances running on a single node on the cluster, but running 200 instances on a single node doesn't work, and I can't ask for more resources (mem, cores, etc.).
My question is: should I do this the way I'm doing it? With a serial script in which I start all my instances one by one sending them to the background and then wait for them?
Could this be accomplished with a parallel script in which I ask for 2 nodes and start 100 instances of my app on each node? In this case, I have some other questions: How can I do it? And is there any guarantee that both jobs run at the same time? All 200 instances must be running at the same time.
- To form the P2P network, at least one peer IP address must be known. In a serial job, I can get the node IP address in the script and pass it as a parameter to the app, but in a parallel job with 2 nodes, how can I do this?
This is part of the script that I'm currently using...
#PBS -l nodes=1:ppn=4
#PBS -l pmem=6GB
#PBS -l walltime=00:20:00
IP=`/sbin/ifconfig eth0 | grep 'inet ' | awk '{print $2}' | sed 's/addr://'`
PORT_PEER=3000
java -jar $JAR $JAR_PARAMS -ip=$IP -port=$PORT_PEER & # first peer, others connect to this one..
for i in {1..99}
do
PORT_PEER=`expr $PORT_PEER + 2`;
java -jar $JAR $JAR_PARAMS -ip=$IP -port=$PORT_PEER -bootstrap=$IP:3000 &
sleep 1s
done
wait # wait here until all instances terminate
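If you change the resource request in the script to ask for two nodes (nodes=2:ppn=4 instead of nodes=1:ppn=4), you'll get 2 nodes, each with at least 4 available cores, and since both belong to the same job they are allocated together. You may already know that.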
Your TORQUE admin may have also enabled pbsdsh. With the proper arguments, you can use that to run commands on each node reserved by your job. Without pbsdsh, if they've at least enabled rsh access among systems in one queue, you can parse through the contents of the file given by the environment variable $PBS_NODEFILE and rsh to each one that's not the main host, running a shell script on each.
So, untested, but something like:
and
should get you started.
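For the $PBS_NODEFILE route, a rough sketch might look like the following. It is untested on a real cluster and rests on several assumptions: start_peers.sh is a hypothetical script holding the peer-starting loop from the question, it is assumed to sit in $PBS_O_WORKDIR on a filesystem shared by all nodes, the job script is assumed to run on the first host listed in $PBS_NODEFILE, and IP is assumed to have been set as in the original script.

```shell
#!/bin/bash
# Sketch only: assumes rsh works between the nodes reserved by the job.

# Print each distinct host in the node file except the main host.
#   $1 = node file (normally "$PBS_NODEFILE"; one line per reserved core)
#   $2 = name of the main host, spelled as it appears in the node file
remote_hosts() {
    sort -u "$1" | grep -v -x -F -- "$2"
}

# Start the (hypothetical) start_peers.sh on every remote host, in the background.
#   $1 = node file, $2 = main host, $3 = bootstrap address as ip:port
launch_remote_peers() {
    local host
    for host in $(remote_hosts "$1" "$2"); do
        # -n keeps the backgrounded rsh from reading stdin
        rsh -n "$host" "$PBS_O_WORKDIR/start_peers.sh $3" &
    done
}

# Only run inside a TORQUE job, where PBS_NODEFILE is set.
if [ -n "${PBS_NODEFILE:-}" ]; then
    main_host=$(head -n 1 "$PBS_NODEFILE")   # assumption: the script runs on this host
    # ... start the first peer and the local peers here, as in the original script ...
    launch_remote_peers "$PBS_NODEFILE" "$main_host" "${IP:?set IP first}:3000"
    wait   # returns once every background job, local or remote, has exited
fi
```

Passing the main host's address as the bootstrap argument is one way to let peers on the second node find the first peer. If pbsdsh is enabled, its -u option (run the command once per node) may be able to replace the rsh loop, though it would then also run the command on the main host.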