Jobs I add to the queue stay there in the "Queued" state, with no attempt to execute them (unless I manually qrun them).
/var/spool/torque/server_logs says only:
04/11/2011 12:43:27;0100;PBS_Server;Job;16.localhost;enqueuing into batch, state 1 hop 1
04/11/2011 12:43:27;0008;PBS_Server;Job;16.localhost;Job Queued at request of test@localhost, owner = test@localhost, job name = Qqq, queue = batch
The job requires just 1 CPU on 1 node.
# qmgr -c "list queue batch"
Queue batch
queue_type = Execution
total_jobs = 0
state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:0 Exiting:0
max_running = 3
acl_host_enable = True
acl_hosts = localhost
resources_min.ncpus = 1
resources_min.nodect = 1
resources_default.ncpus = 1
resources_default.nodes = 1
resources_default.walltime = 00:00:10
mtime = Mon Apr 11 12:07:10 2011
resources_assigned.ncpus = 0
resources_assigned.nodect = 0
kill_delay = 3
enabled = True
started = True
I can't set resources_assigned to a nonzero value because of the error "Cannot set attribute, read only or insufficient permission resources_assigned.ncpus".
When I qrun some task, this goes to mom's log:
04/11/2011 21:27:48;0001; pbs_mom;Svr;pbs_mom;LOG_DEBUG::mom_checkpoint_job_has_checkpoint, FALSE
04/11/2011 21:27:48;0001; pbs_mom;Job;TMomFinalizeJob3;job 18.localhost started, pid = 28592
04/11/2011 21:27:48;0080; pbs_mom;Job;18.localhost;scan_for_terminated: job 18.localhost task 1 terminated, sid=28592
04/11/2011 21:27:48;0008; pbs_mom;Job;18.localhost;job was terminated
04/11/2011 21:27:48;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
04/11/2011 21:27:48;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
04/11/2011 21:27:48;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
04/11/2011 21:27:48;0080; pbs_mom;Job;18.localhost;obit sent to server
Scheduler log (/var/spool/torque/sched_logs/20110705):
07/05/2011 21:44:53;0002; pbs_sched;Svr;Log;Log opened
07/05/2011 21:44:53;0002; pbs_sched;Svr;TokenAct;Account file /var/spool/torque/sched_priv/accounting/20110705 opened
07/05/2011 21:44:53;0002; pbs_sched;Svr;main;/usr/sbin/pbs_sched startup pid 16234
qstat -f:
Job Id: 26.localhost
Job_Name = qwe
Job_Owner = test@localhost
job_state = Q
queue = batch
server = localhost
Checkpoint = u
ctime = Tue Jul 5 21:43:31 2011
Error_Path = localhost:/home/test/jscfi/default/0.738784810485275/qwe.e26
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = a
mtime = Tue Jul 5 21:43:31 2011
Output_Path = localhost:/home/test/jscfi/default/0.738784810485275/qwe.o26
Priority = 0
qtime = Tue Jul 5 21:43:31 2011
Rerunable = True
Resource_List.ncpus = 1
Resource_List.neednodes = 1:ppn=1
Resource_List.nodect = 1
Resource_List.nodes = 1:ppn=1
Resource_List.walltime = 00:01:00
substate = 10
Variable_List = PBS_O_HOME=/home/test,PBS_O_LANG=en_US.UTF-8,
PBS_O_LOGNAME=test,
PBS_O_PATH=/usr/local/bin:/usr/bin:/bin:/usr/bin/X11:/usr/games,
PBS_O_MAIL=/var/mail/test,PBS_O_SHELL=/bin/sh,PBS_SERVER=127.0.0.1,
PBS_O_WORKDIR=/home/test/jscfi/default/0.738784810485275,
PBS_O_QUEUE=batch,PBS_O_HOST=localhost
euser = test
egroup = test
queue_rank = 1
queue_type = E
etime = Tue Jul 5 21:43:31 2011
submit_args = run.pbs
Walltime.Remaining = 6
fault_tolerant = False
How can I make it execute jobs automatically, without a manual qrun?
I spent several hours on a problem with similar symptoms, and in the end it was a single option missing from the server settings:
Normally the scheduler decides when jobs are to be run, i.e. when there are sufficient resources, and tells the server to run them. Are you running a scheduler? TORQUE includes a basic scheduler (pbs_sched), or you could install and run the more sophisticated Maui (free) or Moab (pay-for).
The pbs_server part of PBS/TORQUE is a "resource manager", essentially just a framework. It makes no decisions itself: that is the job of the scheduler.
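
A rough way to check both halves of that is sketched below: whether a scheduler daemon is running, and whether pbs_server is actually asking it to schedule. It assumes a stock TORQUE install with qmgr and pgrep on the PATH, and that the relevant server setting is the `scheduling` attribute. The function names `scheduling_enabled` and `check_torque` are made up for this example.

```shell
#!/bin/sh
# Sketch only: diagnose why TORQUE jobs stay in state Q.
# Assumptions: qmgr and pgrep are on PATH; run on the pbs_server host.

# Reads `qmgr -c "list server"` output on stdin.
# Succeeds only if it contains a line "scheduling = True".
scheduling_enabled() {
    grep -Eq '^[[:space:]]*scheduling[[:space:]]*=[[:space:]]*True[[:space:]]*$'
}

check_torque() {
    # 1. Is the basic scheduler daemon running at all?
    if pgrep -x pbs_sched >/dev/null 2>&1; then
        echo "pbs_sched is running"
    else
        echo "pbs_sched is NOT running (Maui or Moab may be in use instead)"
    fi

    # 2. Does the server hand jobs to the scheduler?
    if qmgr -c "list server" | scheduling_enabled; then
        echo "server scheduling is enabled"
    else
        echo "server scheduling is not enabled"
        echo "as a TORQUE manager, try: qmgr -c 'set server scheduling = True'"
    fi
}
```

After sourcing the file, run `check_torque` as a user with manager rights. If the scheduler is running but `scheduling` is not True, the server never asks it to start jobs, which would match jobs that only run after a manual qrun.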