I am running Torque 4.0.1 on openSUSE 12.1 in a cluster environment. When I qsub a job (as simple as "echo hello"), it stays in the 'Q' state and never gets scheduled. I can force the job to run with qrun, and it executes on the first node without error.
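For reference, the whole reproduction is just this (qrun requires operator privileges, which pubuser has according to the qmgr output below):
echo hello | qsub     # creates a job named STDIN, e.g. 16.head below
qstat                 # the job sits in state Q indefinitely
qrun 16               # forces it to run; it completes on sun1 without error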
I have been trying to track this down for the past few days without success. I have read the manual, the logs, and even the source code, but still cannot locate the problem. I have also googled extensively and tried various suggested fixes, but none of them worked.
Here is some information that may be helpful:
- pbs_sched is running, but its logs suggest that it receives no notification when jobs are queued (a quick connectivity check follows the excerpt):
05/13/2012 18:55:08;0002; pbs_sched;Svr;Log;Log opened
05/13/2012 18:55:08;0002; pbs_sched;Svr;TokenAct;Account file /var/spool/torque/sched_priv/accounting/20120513 opened
05/13/2012 18:55:08;0002; pbs_sched;Svr;main;pbs_sched startup pid 32604
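For completeness, this is how I am verifying that the scheduler is up and reachable (assuming pbs_sched uses its default port, 15004):
ps aux | grep pbs_sched        # the daemon is running
netstat -tlnp | grep 15004     # and should be listening on the scheduler port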
- The pbs_server log only shows the job being enqueued into the default queue, batch:
05/13/2012 19:33:08;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 4.0.1, loglevel = 0
05/13/2012 19:33:56;0100;PBS_Server;Job;16.head;enqueuing into batch, state 1 hop 1
05/13/2012 19:33:56;0008;PBS_Server;Job;16.head;Job Queued at request of pubuser@head, owner = pubuser@head, job name = STDIN, queue = batch
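Since tracejob can merge the server, scheduler, and MOM log entries for one job, it is a quick way to confirm that nothing at all happens after enqueuing (run as root on the server host):
tracejob -n 2 16    # gather all log entries for job 16 from the last 2 days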
- qstat -f 16 shows nothing unusual; as far as I can tell from the source, substate = 10 is just JOB_SUBSTATE_QUEUED:
Job Id: 16.head
Job_Name = STDIN
Job_Owner = pubuser@head
job_state = Q
queue = batch
server = head
Checkpoint = u
ctime = Sun May 13 19:33:56 2012
Error_Path = head:/fserver/home/pubuser/STDIN.e16
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = a
mtime = Sun May 13 19:33:56 2012
Output_Path = head:/fserver/home/pubuser/STDIN.o16
Priority = 0
qtime = Sun May 13 19:33:56 2012
Rerunable = True
Resource_List.walltime = 01:00:00
substate = 10
Variable_List = PBS_O_QUEUE=batch,PBS_O_HOME=/,
PBS_O_WORKDIR=/fserver/home/pubuser,PBS_O_HOST=head,PBS_O_SERVER=head,
PBS_O_WORKDIR=/fserver/home/pubuser
euser = pubuser
egroup = users
queue_rank = 4
queue_type = E
etime = Sun May 13 19:33:56 2012
fault_tolerant = False
job_radix = 0
submit_host = head
init_work_dir = /fserver/home/pubuser
- All nodes are free according to pbsnodes:
sun1
state = free
np = 2
ntype = cluster
status = rectime=1336910403,varattr=,jobs=,state=free,netload=44492032184,gres=,loadave=0.00,ncpus=2,physmem=888188kb,availmem=1697420kb,totmem=1802616kb,idletime=241085,nusers=0,nsessions=0,uname=Linux sun1 3.1.0-1.2-desktop #1 SMP PREEMPT Thu Nov 3 14:45:45 UTC 2011 (187dde0) x86_64,opsys=linux
mom_service_port = 15002
mom_manager_port = 15003
gpus = 0
sun2
state = free
np = 2
ntype = cluster
status = rectime=1336910408,varattr=,jobs=,state=free,netload=39762812881,gres=,loadave=0.00,ncpus=2,physmem=888188kb,availmem=1701012kb,totmem=1802616kb,idletime=239982,nusers=0,nsessions=0,uname=Linux sun2 3.1.0-1.2-desktop #1 SMP PREEMPT Thu Nov 3 14:45:45 UTC 2011 (187dde0) x86_64,opsys=linux
mom_service_port = 15002
mom_manager_port = 15003
gpus = 0
sun3
state = free
np = 2
ntype = cluster
status = rectime=1336910400,varattr=,jobs=,state=free,netload=45984311925,gres=,loadave=0.00,ncpus=2,physmem=888188kb,availmem=1699772kb,totmem=1802616kb,idletime=212303,nusers=0,nsessions=0,uname=Linux sun3 3.1.0-1.2-desktop #1 SMP PREEMPT Thu Nov 3 14:45:45 UTC 2011 (187dde0) x86_64,opsys=linux
mom_service_port = 15002
mom_manager_port = 15003
gpus = 0
sun4
state = free
np = 2
ntype = cluster
status = rectime=1336910407,varattr=,jobs=,state=free,netload=37538584401,gres=,loadave=0.00,ncpus=2,physmem=888188kb,availmem=1805480kb,totmem=1908308kb,idletime=211197,nusers=0,nsessions=0,uname=Linux sun4 3.1.0-1.2-desktop #1 SMP PREEMPT Thu Nov 3 14:45:45 UTC 2011 (187dde0) x86_64,opsys=linux
mom_service_port = 15002
mom_manager_port = 15003
gpus = 0
sun5
state = free
np = 2
ntype = cluster
status = rectime=1336910411,varattr=,jobs=,state=free,netload=173547166,gres=,loadave=0.00,ncpus=2,physmem=888188kb,availmem=1803816kb,totmem=1908308kb,idletime=211199,nusers=0,nsessions=0,uname=Linux sun5 3.1.0-1.2-desktop #1 SMP PREEMPT Thu Nov 3 14:45:45 UTC 2011 (187dde0) x86_64,opsys=linux
mom_service_port = 15002
mom_manager_port = 15003
gpus = 0
sun6
state = free
np = 2
ntype = cluster
status = rectime=1336910411,varattr=,jobs=,state=free,netload=24641446,gres=,loadave=0.00,ncpus=2,physmem=888188kb,availmem=1805704kb,totmem=1908308kb,idletime=212999,nusers=0,nsessions=0,uname=Linux sun6 3.1.0-1.2-desktop #1 SMP PREEMPT Thu Nov 3 14:45:45 UTC 2011 (187dde0) x86_64,opsys=linux
mom_service_port = 15002
mom_manager_port = 15003
gpus = 0
sun7
state = free
np = 2
ntype = cluster
status = rectime=1336910412,varattr=,jobs=,state=free,netload=1548383055,gres=,loadave=0.00,ncpus=2,physmem=888188kb,availmem=1805432kb,totmem=1908308kb,idletime=215630,nusers=0,nsessions=0,uname=Linux sun7 3.1.0-1.2-desktop #1 SMP PREEMPT Thu Nov 3 14:45:45 UTC 2011 (187dde0) x86_64,opsys=linux
mom_service_port = 15002
mom_manager_port = 15003
gpus = 0
sun8
state = free
np = 2
ntype = cluster
status = rectime=1336910400,varattr=,jobs=,state=free,netload=128755968,gres=,loadave=0.00,ncpus=2,physmem=888188kb,availmem=1803448kb,totmem=1908308kb,idletime=211866,nusers=0,nsessions=0,uname=Linux sun8 3.1.0-1.2-desktop #1 SMP PREEMPT Thu Nov 3 14:45:45 UTC 2011 (187dde0) x86_64,opsys=linux
mom_service_port = 15002
mom_manager_port = 15003
gpus = 0
sun9
state = free
np = 2
ntype = cluster
status = rectime=1336910374,varattr=,jobs=,state=free,netload=1371896399,gres=,loadave=0.00,ncpus=2,physmem=888188kb,availmem=1805664kb,totmem=1908308kb,idletime=211161,nusers=0,nsessions=0,uname=Linux sun9 3.1.0-1.2-desktop #1 SMP PREEMPT Thu Nov 3 14:45:45 UTC 2011 (187dde0) x86_64,opsys=linux
mom_service_port = 15002
mom_manager_port = 15003
gpus = 0
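As a sanity check, one can compare these live node states against the server's static node list (paths assume the TORQUE_HOME of /var/spool/torque that appears in the logs above):
pbsnodes -a | grep -c "state = free"     # expect 9
cat /var/spool/torque/server_priv/nodes  # expect lines like: sun1 np=2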
- qmgr -c 'p s':
#
# Create queues and set their attributes.
#
#
# Create and define queue batch
#
create queue batch
set queue batch queue_type = Execution
set queue batch resources_default.walltime = 01:00:00
set queue batch enabled = True
set queue batch started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_hosts = head
set server managers = pubuser@head
set server managers += root@head
set server operators = pubuser@head
set server operators += root@head
set server default_queue = batch
set server log_events = 511
set server mail_from = adm
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 300
set server job_stat_rate = 45
set server poll_jobs = True
set server mom_job_sync = True
set server keep_completed = 0
set server submit_hosts = head
set server next_job_number = 17
set server moab_array_compatible = True
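As far as I understand, re-setting the scheduling attribute is supposed to kick off an immediate scheduling iteration, so this should exercise the server-to-scheduler path on demand (the sched_logs path assumes the same TORQUE_HOME as above):
qmgr -c 'set server scheduling = true'
tail -f /var/spool/torque/sched_logs/20120513   # a working setup should log a new cycle here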
- momctl -d 3 on the first node:
Host: sun1/sun1 Version: 4.0.1 PID: 5362
Server[0]: head (192.168.0.1:15001)
Last Msg From Server: 1584 seconds (DeleteJob)
Last Msg To Server: 7 seconds
HomeDirectory: /var/spool/torque/mom_priv
stdout/stderr spool directory: '/var/spool/torque/spool/' (4457492 blocks available)
MOM active: 229485 seconds
Check Poll Time: 45 seconds
Server Update Interval: 45 seconds
LogLevel: 0 (use SIGUSR1/SIGUSR2 to adjust)
Communication Model: TCP
MemLocked: TRUE (mlock)
TCP Timeout: 0 seconds
Trusted Client List: 127.0.0.1:0,192.168.0.1:0,192.168.0.101:0,192.168.0.101:15003,192.168.0.102:15003,192.168.0.103:15003,192.168.0.104:15003,192.168.0.105:15003,192.168.0.106:15003,192.168.0.107:15003,192.168.0.108:15003,192.168.0.109:15003: 0
Copy Command: /usr/bin/scp -rpB
NOTE: no local jobs detected
diagnostics complete
What does not look normal to me is the TCP Timeout of 0 seconds. Also, while running the diagnostics, the following error showed up in mom_logs:
05/13/2012 20:30:10;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::Resource temporarily unavailable (11) in tcp_read_proto_version, no protocol version number End of File (errno 2)
I googled it, but found nothing.
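Since the message points at the TCP link between the daemons, one more check that might narrow things down is querying a MOM remotely from the head node, which exercises that same connection path:
momctl -d 3 -h sun1    # run on head; a failure here would mean head cannot reach the MOM over TCP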
- I compiled Open MPI against this Torque 4.0.1 installation (for TM support), and I can mpirun test programs without any problem.
I hope someone can help me solve this. Thank you!