Last modified: February 06, 2024

troubleshooting

For jobs waiting in the queue, see qstat or qquota.

For failed jobs, see log files.

qstat

If a job is waiting in the queue, then there aren’t currently enough free resources (CPU, memory) to run it.

The ‘qstat’ command without any options prints each job’s state and the host where the job is running. The three most common states are:
   
r running
qw waiting in queue
Eqw failed

For example:

$ qstat
job-ID  prior   name       user    state   submit/start at          queue                        slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
100     0.55111 test       joe     r       02/29/2016 12:28:41      all.q@nx13.priv
101     0.55023 test       joe     qw      02/29/2016 09:46:30

In this example, the first job (ID 100) is running on nx13 and the second job (ID 101) is waiting in the queue. To monitor the running job, log into nx13 with the ssh command and run ‘top’ (type ‘q’ to quit the top command). Watch the CPU and memory (RES) columns. If the job is using less than 100% CPU, then it is limited by a resource other than CPU, usually disk I/O. In that case, try to reduce the amount of data the job reads from and writes to your home directory.
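
When many jobs are listed, a per-state count can be easier to read than the full table. The following is a minimal sketch, not part of Grid Engine: count_job_states is a hypothetical helper name, and it assumes the default qstat layout shown above (two header lines, state in the fifth column, job names without spaces).

```shell
# Count jobs per state from `qstat` output read on stdin.
# Assumption: default layout, i.e. two header lines followed by one job
# per line with the state (r, qw, Eqw, ...) in the fifth column.
count_job_states() {
    awk 'NR > 2 && NF >= 5 { count[$5]++ }
         END { for (s in count) print s, count[s] }' | sort
}
```

It would be used as: qstat | count_job_states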

For more information, use qstat -j <job-ID>, e.g.:

$ qstat -j 101
==============================================================
job_number:                 101
...
scheduling info:            (-l mem_free=2.95G) cannot run in queue "nx20.priv" because it offers only hc:mem_free=0.000
                            (-l mem_free=2.95G) cannot run in queue "nx18.priv" because it offers only hc:mem_free=2.000G
                            (-l mem_free=2.95G) cannot run in queue "nx19.priv" because it offers only hc:mem_free=0.000
                            (-l mem_free=2.95G) cannot run in queue "nx15.priv" because it offers only hc:mem_free=2.000G
                            (-l mem_free=2.95G) cannot run in queue "nx17.priv" because it offers only hc:mem_free=2.000G
                            (-l mem_free=2.95G) cannot run in queue "nx14.priv" because it offers only hc:mem_free=2.000G
                            (-l mem_free=2.95G) cannot run in queue "nx16.priv" because it offers only hc:mem_free=2.000G
                            cannot run because it exceeds limit "////nx21/" in rule "max_slots_hosts/6"
                            (-l mem_free=2.95G) cannot run in queue "nx12.priv" because it offers only hc:mem_free=2.000G
                            (-l mem_free=2.95G) cannot run in queue "nx10.priv" because it offers only hc:mem_free=2.000G

In this example, there aren’t enough resources available to run job ID 101. The limiting resource is memory, except on nx21, where all CPU cores (slots) are in use.
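
The scheduling info can be long, so it may help to reduce it to one line per host. The sketch below assumes the exact message wording shown in the example above; list_mem_rejections is a hypothetical helper name, and lines with other wording (such as the slot limit message) are ignored.

```shell
# Read `qstat -j <job-ID>` output on stdin and print, for every queue
# instance rejected because of memory, its name and the mem_free it offers.
# Assumption: the "cannot run in queue ... offers only hc:mem_free=" wording
# matches the example output above.
list_mem_rejections() {
    sed -n 's/.*cannot run in queue "\([^"]*\)" because it offers only hc:mem_free=\(.*\)$/\1 \2/p'
}
```

It would be used as: qstat -j 101 | list_mem_rejections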

qquota

‘qquota’ prints resource quotas on computers (slots_hosts), users (slots_users), and labs (slots_groups).

For example:

$ qquota
resource quota rule limit                filter         explanation
--------------------------------------------------------------------------------
slots_hosts/3      slots=7/8            hosts nx8       (computer CPU quota). On host 'nx8', 7 of the 8 available CPU cores (slots) are reserved.
slots_hosts/4      slots=4/4            hosts nx9
slots_hosts/5      slots=7/12           hosts nx12
slots_hosts/5      slots=4/12           hosts nx11
slots_hosts/5      slots=4/12           hosts nx10
slots_hosts/5      slots=6/12           hosts nx13
slots_hosts/6      slots=16/16          hosts nx14
slots_hosts/6      slots=11/16          hosts nx16
slots_hosts/6      slots=13/16          hosts nx15
slots_hosts/6      slots=12/16          hosts nx17
slots_users/7      slots=15/50          users joe       (user CPU quota). User 'joe' is using 15 of 50 available slots.
slots_groups/2     slots=34/94          users @lab      (group CPU quota). Group 'lab' is using 34 of 94 available slots.
mem_groups/2       mem_free=330.000G/40 users @lab      (group memory quota). Group 'lab' reserved 330GB of 400GB RAM available
mem_users/2        mem_free=225.000G/26 users joe       (user memory quota). User 'joe' reserved 225GB of 260GB RAM available

In this example, user ‘joe’ is probably limited by memory (RAM), not CPU cores (slots).

Joe has reserved only 15 slots of his 50-slot CPU (core) quota, and many slots are still available on nx8-nx17.

However, joe has reserved 225GB of his 260GB RAM quota.

To learn more about memory reservations, see qmem below.
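
To see at a glance which slot quotas are close to their limit, the used/limit pairs can be converted to percentages. This is a rough sketch under the assumption that qquota prints the rule, limit, and filter columns as in the example above; slot_usage is a hypothetical helper name. Memory entries are skipped on purpose, because their limit can be cut off in the output.

```shell
# Read `qquota` output on stdin and print rule, filter, and slot usage as a
# whole-number percentage (rounded down).
# Assumption: column 2 has the form slots=<used>/<limit>; mem_free entries
# are not handled.
slot_usage() {
    awk '$2 ~ /^slots=[0-9]+\/[0-9]+$/ {
             split(substr($2, 7), n, "/")      # drop the "slots=" prefix
             if (n[2] > 0)
                 printf "%s %s %s %d%%\n", $1, $3, $4, int(100 * n[1] / n[2])
         }'
}
```

It would be used as: qquota | slot_usage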

qstats

The qstats command lists all jobs. You can see your job’s priority in the queue relative to other users. Jobs in the queue are sorted by priority, so the first job in the queue will run next if resources are available. The qstats command prints a lot of output, so pipe it to the ‘less’ command to see a page at a time:

$ qstats | less

... and type ‘q’ to quit.

log files

Log files help determine why a job failed or is behaving unexpectedly.
Grid Engine produces output and error files named with the job name and number. They are located in your home directory unless you specify otherwise.
  • If there is no useful information in the log files, then edit your script to print more troubleshooting information.

  • If the output/error files are empty, or there is a ‘line 1’ syntax error, then

    1. Verify that the first line of your script names an interpreter that exists. Common interpreters are /bin/sh, /bin/bash, and python. The line must begin with the #! characters, e.g.:

      #!/bin/sh
      #!/bin/bash
      #!/usr/bin/env python
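    2. Confirm that your script doesn’t contain control characters. To do this, use the ‘cat -v <script>’ command, which displays control characters in caret notation (a Windows-style carriage return appears as ^M), e.g.: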

      cat -v sge_submit.sh
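
The two checks above can be combined into a small script. This is only a sketch: check_job_script is a hypothetical helper name, and it looks for carriage returns only, which are one common kind of control character (typically from files edited on Windows).

```shell
# Report two common causes of empty log files or 'line 1' errors:
# a missing "#!" interpreter line and carriage returns in the script.
# Prints one message per problem; returns 0 if neither is found.
check_job_script() {
    script=$1
    status=0
    first_line=$(head -n 1 "$script")
    case $first_line in
        '#!'*) ;;                                  # interpreter line present
        *) echo "no #! interpreter line"; status=1 ;;
    esac
    cr=$(printf '\r')                              # shown as ^M by `cat -v`
    if grep -q "$cr" "$script"; then
        echo "contains carriage returns"
        status=1
    fi
    return $status
}
```

It would be used as: check_job_script sge_submit.sh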

Please use the qdel command to remove failed jobs from the queue.
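
To find the jobs that need removing, the Eqw entries can be picked out of the qstat output. A minimal sketch, assuming the default qstat layout shown earlier (job ID in the first column, state in the fifth); failed_job_ids is a hypothetical helper name.

```shell
# Read `qstat` output on stdin and print the IDs of jobs in the Eqw state,
# one per line.
# Assumption: default layout with two header lines, job ID in column 1 and
# state in column 5.
failed_job_ids() {
    awk 'NR > 2 && $5 == "Eqw" { print $1 }'
}
```

It would be used as: qstat | failed_job_ids

Review the list first, then pass each ID to qdel, e.g. qdel 102.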

qmem

qmem prints the memory currently available and the maximum memory reservation for each workstation.