Last modified: September 12, 2019

troubleshooting

Determine why a job failed or is waiting in the queue.

qquota

If a job is waiting in the queue, there aren’t enough resources (CPU, memory) available to run it.
The ‘qquota’ command prints resource limits and current usage.

For example:

$ qquota
resource quota rule limit                filter         explanation
--------------------------------------------------------------------------------
slots_hosts/3      slots=7/8            hosts nx8       (host CPU limit). Host 'nx8' has 8 CPUs, and 7 are in use
slots_hosts/4      slots=4/4            hosts nx9
slots_hosts/5      slots=7/12           hosts nx12
slots_hosts/5      slots=4/12           hosts nx11
slots_hosts/5      slots=4/12           hosts nx10
slots_hosts/5      slots=6/12           hosts nx13
slots_hosts/6      slots=16/16          hosts nx14
slots_hosts/6      slots=11/16          hosts nx16
slots_hosts/6      slots=13/16          hosts nx15
slots_hosts/6      slots=12/16          hosts nx17
slots_groups/2     slots=34/94          users @lab      (group CPU limit). Group 'lab' may run 94 jobs; they're currently running 34 jobs
slots_users/7      slots=15/50          users joe       (user CPU limit). User 'joe' may run 50 jobs; joe is currently running 15 jobs
mem_groups/2       mem_free=330.000G/40 users @lab      (group memory limit). Group 'lab' may use 400GB RAM; they're currently using 330GB
mem_users/2        mem_free=225.000G/26 users joe       (user memory limit). User 'joe' may use 260GB RAM; joe is currently using 225GB

In this example, there are many slots (CPU) available. However, joe’s jobs use a lot of memory, and workstations with free CPU may not have free memory. To learn more about why joe’s jobs aren’t running, use ‘qmem’ and/or ‘qstats’ (more information below).
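If the output is long, you can limit it to a single user with qquota’s -u option; for example, to see only joe’s limits (user name taken from the example above):

$ qquota -u joe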

The workstations nx18-nx27 are reserved: nx18-nx23 are reserved for the D’Esposito lab, and nx24-nx27 for the Jagust lab.

log files

Log files help determine why a job failed or is behaving unexpectedly.
Grid Engine produces output and error files named with the job name and number. They are located in your home directory, unless you specify otherwise.
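For example, a job script submitted with the name ‘test’ that is assigned job ID 100 (the same job shown in the qstat example below) normally produces two files in your home directory:

  test.o100     (standard output)
  test.e100     (standard error)

To write the logs somewhere else, pass the -o and -e options to qsub; the directory below is only an illustration:

$ qsub -o ~/logs -e ~/logs sge_submit.sh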
  • If there is no useful information in the log files, then edit your script to print more troubleshooting information (see the example script after this list).

  • If the output/error files are empty, or there is a ‘line 1’ syntax error, then

    1. Verify that the first line of your script specifies an interpreter that exists. Common interpreters are /bin/sh, /bin/bash, and python. The first line must begin with the ‘#!’ characters followed by the path to the interpreter, e.g.:

      #!/bin/sh
      #!/bin/bash
      #!/usr/bin/env python
    2. Confirm that your script doesn’t contain control characters. To do this, use the ‘cat -v <script>’ command, e.g.:

      cat -v sge_submit.sh
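
      If ‘cat -v’ shows characters such as ^M at the end of each line, the script contains Windows-style carriage returns. One way to remove them (using GNU sed) is:

      sed -i 's/\r$//' sge_submit.sh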

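As mentioned in the first bullet above, adding extra output to your script is often the fastest way to see where it stops. A minimal sketch for a bash script (the echo lines are only placeholders around your real commands):

  #!/bin/bash
  set -x                                      # print each command before it runs
  echo "Job started on $(hostname) at $(date)"
  # ... your actual commands go here ...
  echo "Job finished with exit status $?"
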
Please use the qdel command to remove failed jobs from the queue.
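For example, to remove a job by its job ID (101 is the ID used in the qstat example below):

$ qdel 101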

qstat

The qstat command prints job status. The output is useful for troubleshooting why a job failed, isn’t running, or is behaving unexpectedly.

The ‘qstat’ command without any options prints each job’s state and where it is running. The 3 most common states are:
   
r running
qw waiting in queue
Eqw failed

For example:

$ qstat
job-ID  prior   name       user    state   submit/start at          queue                        slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
100     0.55111 test       joe     r       02/29/2016 12:28:41      all.q@nx13.priv
101     0.55023 test       joe     qw      02/29/2016 09:46:30

In this example, the first job (ID 100) is running on nx13, and the second job is waiting in the queue. To monitor the running job, log into nx13 with the ssh command and run ‘top’ (type ‘q’ to quit top). Watch the CPU and memory (RES) columns. If the job is using less than 100% CPU, then the job is limited by a resource other than CPU, usually disk I/O. In that case, try to reduce the amount of data the job reads from and writes to your home directory.
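For example, to watch joe’s processes on nx13 (host and user names taken from the output above):

$ ssh nx13
$ top -u joe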

To determine why a job is queued or has failed, use qstat -j <job-ID>, e.g.:

$ qstat -j 101
==============================================================
job_number:                 101
...
scheduling info:            (-l mem_free=2.95G) cannot run in queue "nx20.priv" because it offers only hc:mem_free=0.000
                            (-l mem_free=2.95G) cannot run in queue "nx18.priv" because it offers only hc:mem_free=2.000G
                            (-l mem_free=2.95G) cannot run in queue "nx19.priv" because it offers only hc:mem_free=0.000
                            (-l mem_free=2.95G) cannot run in queue "nx15.priv" because it offers only hc:mem_free=2.000G
                            (-l mem_free=2.95G) cannot run in queue "nx17.priv" because it offers only hc:mem_free=2.000G
                            (-l mem_free=2.95G) cannot run in queue "nx14.priv" because it offers only hc:mem_free=2.000G
                            (-l mem_free=2.95G) cannot run in queue "nx16.priv" because it offers only hc:mem_free=2.000G
                            cannot run because it exceeds limit "////nx21/" in rule "max_slots_hosts/6"
                            (-l mem_free=2.95G) cannot run in queue "nx12.priv" because it offers only hc:mem_free=2.000G
                            (-l mem_free=2.95G) cannot run in queue "nx10.priv" because it offers only hc:mem_free=2.000G

In this example, there aren’t enough resources available to run job ID 101. The limiting resource is memory, except on nx21, where all CPU cores (slots) are in use.
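If you know the job actually needs less memory than it requested, one option is to remove the queued job and resubmit it with a smaller mem_free request. A sketch, where the 2G value and the script name are only placeholders:

$ qdel 101
$ qsub -l mem_free=2G sge_submit.sh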

qstats

The qstats command lists all jobs. You can see your job’s priority in the queue relative to other users. Jobs in the queue are sorted by priority, so the first job in the queue will run next if resources are available. The qstats command prints a lot of output, so pipe it to the ‘less’ command to see a page at a time:

$ qstats | less

... and type ‘q’ to quit.

qmem

The qmem command prints the memory currently available and the maximum memory reservation for each workstation.
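For example, run it before submitting a large-memory job to see which workstations have room, piping it to ‘less’ if the output is long:

$ qmem | less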