Last modified: March 19, 2025
For jobs waiting in the queue, see qstat or qquota.
For failed jobs, see the log files.
If a job is waiting in the queue, then there aren’t enough resources (CPU, memory) to run the job.
r | running |
qw | waiting in queue |
Eqw | failed |
For example:
$ qstat
job-ID prior name user state submit/start at queue slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
100 0.55111 test joe r 02/29/2016 12:28:41 all.q@nx13.priv
101 0.55023 test joe qw 02/29/2016 09:46:30
For more information, use qstat -j <job-ID>, e.g.:
$ qstat -j 101
==============================================================
job_number: 101
...
scheduling info: (-l mem_free=2.95G) cannot run in queue "nx20.priv" because it offers only hc:mem_free=0.000
(-l mem_free=2.95G) cannot run in queue "nx18.priv" because it offers only hc:mem_free=2.000G
(-l mem_free=2.95G) cannot run in queue "nx19.priv" because it offers only hc:mem_free=0.000
(-l mem_free=2.95G) cannot run in queue "nx15.priv" because it offers only hc:mem_free=2.000G
(-l mem_free=2.95G) cannot run in queue "nx17.priv" because it offers only hc:mem_free=2.000G
(-l mem_free=2.95G) cannot run in queue "nx14.priv" because it offers only hc:mem_free=2.000G
(-l mem_free=2.95G) cannot run in queue "nx16.priv" because it offers only hc:mem_free=2.000G
cannot run because it exceeds limit "////nx21/" in rule "max_slots_hosts/6"
(-l mem_free=2.95G) cannot run in queue "nx12.priv" because it offers only hc:mem_free=2.000G
(-l mem_free=2.95G) cannot run in queue "nx10.priv" because it offers only hc:mem_free=2.000G
In this example, there aren’t enough resources available to run job ID 101. The limiting resource is memory, except on nx21, where all CPU cores (slots) are in use.
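When the scheduling info is long, it can help to count how many lines point to an unmet resource request and how many point to a quota rule. The following is a minimal sketch, assuming the message wording shown in the example above; summarize_sched_info is a hypothetical helper name, not an SGE command, and other Grid Engine versions may phrase these messages differently.

```shell
#!/bin/sh
# Summarize why a queued job cannot run.
# Reads `qstat -j <job-ID>` output on stdin.
# Assumption: the messages use the wording shown in the example above.
summarize_sched_info() {
    awk '
        # A queue that cannot satisfy a resource request (e.g. mem_free)
        /cannot run in queue/                 { resource++ }
        # A resource quota rule that is already exhausted
        /cannot run because it exceeds limit/ { quota++ }
        END {
            printf "resource-limited queues: %d\n", resource
            printf "quota-limited: %d\n", quota
        }
    '
}

# Usage (101 is the job ID from the example above):
# qstat -j 101 | summarize_sched_info
```

If most lines are resource-limited, lowering the memory request (where the job allows it) is the likelier fix; if they are quota-limited, the job has to wait for slots to free up.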
For example:
$ qquota
resource quota rule limit filter explanation
--------------------------------------------------------------------------------
slots_hosts/3 slots=7/8 hosts nx8 (computer CPU quota). On host 'nx8', 7 of the 8 available CPU cores (slots) are reserved.
slots_hosts/4 slots=4/4 hosts nx9
slots_hosts/5 slots=7/12 hosts nx12
slots_hosts/5 slots=4/12 hosts nx11
slots_hosts/5 slots=4/12 hosts nx10
slots_hosts/5 slots=6/12 hosts nx13
slots_hosts/6 slots=16/16 hosts nx14
slots_hosts/6 slots=11/16 hosts nx16
slots_hosts/6 slots=13/16 hosts nx15
slots_hosts/6 slots=12/16 hosts nx17
slots_users/7 slots=15/50 users joe (user CPU quota). User 'joe' is using 15 of 50 available slots.
slots_groups/2 slots=34/94 users @lab (group CPU quota). Group 'lab' is using 34 of 94 available slots.
mem_groups/2 mem_free=330.000G/40 users @lab (group memory quota). Group 'lab' reserved 330GB of 400GB RAM available
mem_users/2 mem_free=225.000G/26 users joe (user memory quota). User 'joe' reserved 225GB of 260GB RAM available
In this example, user ‘joe’ is probably limited by memory (RAM), not CPU cores (slots).
joe has reserved only 15 slots (cores) of the 50-slot CPU quota, and many slots are available on nx8-nx17.
However, joe reserved 225GB of the 260GB RAM quota.
To learn more about memory reservations, see qmem below.
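To spot the quota that is closest to its limit without reading every row, the used/limit column can be turned into a percentage. This is a rough sketch, assuming the second column has the form name=used/limit (for example slots=15/50), that the full value is present (not cut off), and that used and limit share the same unit. quota_near_limit is a hypothetical helper name, and the 80% default threshold is arbitrary.

```shell
#!/bin/sh
# List quota rules that are at or above a usage threshold (default 80%).
# Reads `qquota` output on stdin.
# Assumptions: column 2 looks like name=used/limit, and used and limit
# are expressed in the same unit (e.g. both in G).
quota_near_limit() {
    awk -v pct="${1:-80}" '
        index($2, "=") && index($2, "/") {
            split($2, kv, "=")      # kv[2] holds used/limit
            split(kv[2], ul, "/")   # ul[1] is used, ul[2] is limit
            used  = ul[1] + 0       # numeric prefix only; drops a trailing unit
            limit = ul[2] + 0
            if (limit > 0 && 100 * used / limit >= pct)
                printf "%s %s %.0f%%\n", $1, $2, 100 * used / limit
        }
    '
}

# Usage:
# qquota | quota_near_limit        # rules at 80% or more
# qquota | quota_near_limit 90     # rules at 90% or more
```

Header and separator lines are skipped because their second field does not contain both "=" and "/".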
The qstats command lists all jobs. You can see your job’s priority in the queue relative to other users. Jobs in the queue are sorted by priority, so the first job in the queue will run next if resources are available. The qstats command prints a lot of output, so pipe it to the ‘less’ command to see a page at a time:
$ qstats | less
... and type ‘q’ to quit.
If there is no useful information in the log files, then edit your script to print more troubleshooting information.
If the output/error files are empty, or there is a ‘line 1’ syntax error, then verify that the first line of your script defines an interpreter that exists. Common interpreters are /bin/sh, /bin/bash, and python. The interpreter line must begin with #!, e.g.:
#!/bin/sh
#!/bin/bash
#!/usr/bin/env python
Confirm that your script doesn’t contain control characters. To do this, use the ‘cat -v <script>’ command, e.g.:
cat -v sge_submit.sh
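Both checks can be combined into one small helper that is run before submitting. This is only a sketch: check_job_script is a hypothetical name, and it looks only for a leading #! and for carriage returns (which cat -v shows as ^M and which usually come from Windows line endings), not for every possible control character.

```shell
#!/bin/sh
# Basic sanity checks on a job script before it is submitted.
# check_job_script is a hypothetical helper, not part of SGE.
check_job_script() {
    script=$1

    # Line 1 should start with #! followed by the interpreter.
    first_line=$(head -n 1 "$script" | tr -d '\r')
    case $first_line in
        '#!'*) echo "interpreter line: $first_line" ;;
        *)     echo "missing #! on line 1" ;;
    esac

    # Count lines that contain a carriage return.
    cr=$(printf '\r')
    crlf=$(grep -c "$cr" "$script" || true)
    echo "lines with carriage returns: $crlf"
}

# Usage:
# check_job_script sge_submit.sh
```

A nonzero carriage-return count suggests the script was edited on Windows and should be converted to Unix line endings before it is resubmitted.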
Please use the qdel command to remove failed jobs from the queue.
qmem prints the memory currently available and the maximum memory reservation for each workstation.