Last modified: |today|

.. _troubleshootge:

troubleshooting
===============

For jobs waiting in the queue, see :ref:`qstattrouble` or :ref:`qquota`.
For failed jobs, see :ref:`sgelogs`.

.. _qstattrouble:

qstat
^^^^^

If a job is waiting in the queue, then there aren't enough resources (CPU, memory) to run the job.

| The 'qstat' command without any options prints each job's state and where the job is running. The 3 most common states are:

======  ====================
State   Meaning
======  ====================
r       running
qw      waiting in queue
Eqw     failed
======  ====================

For example::

   $ qstat
   job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
   -----------------------------------------------------------------------------------------------------------------
       100 0.55111 test       joe          r     02/29/2016 12:28:41 all.q@nx13.priv
       101 0.55023 test       joe          qw    02/29/2016 09:46:30

| In this example, the first job (ID 100) is running on nx13, and the second job is waiting in the queue. If you want to monitor the running job, log into nx13 with the ssh command, and run 'top'. Monitor the CPU and memory (RES) columns. If the job is using less than 100% CPU, then the job is limited by a resource other than CPU, usually disk I/O. In that case, try to reduce the amount of data the job reads from and writes to your home directory.
| (Type 'q' to quit the top command.)

For more information, use qstat -j followed by the job ID, e.g.::

   $ qstat -j 101
   ==============================================================
   job_number:                 101
   ...
   scheduling info:            (-l mem_free=2.95G) cannot run in queue "nx20.priv" because it offers only hc:mem_free=0.000
                               (-l mem_free=2.95G) cannot run in queue "nx18.priv" because it offers only hc:mem_free=2.000G
                               (-l mem_free=2.95G) cannot run in queue "nx19.priv" because it offers only hc:mem_free=0.000
                               (-l mem_free=2.95G) cannot run in queue "nx15.priv" because it offers only hc:mem_free=2.000G
                               (-l mem_free=2.95G) cannot run in queue "nx17.priv" because it offers only hc:mem_free=2.000G
                               (-l mem_free=2.95G) cannot run in queue "nx14.priv" because it offers only hc:mem_free=2.000G
                               (-l mem_free=2.95G) cannot run in queue "nx16.priv" because it offers only hc:mem_free=2.000G
                               cannot run because it exceeds limit "////nx21/" in rule "max_slots_hosts/6"
                               (-l mem_free=2.95G) cannot run in queue "nx12.priv" because it offers only hc:mem_free=2.000G
                               (-l mem_free=2.95G) cannot run in queue "nx10.priv" because it offers only hc:mem_free=2.000G

In this example, there aren't enough resources available to run job ID 101. The limiting resource is memory, except on nx21, where all CPU cores (slots) are in use.

.. _qquota:

qquota
^^^^^^

| 'qquota' prints resource quotas on computers (slots_hosts), users (slots_users), and labs (slots_groups).

For example::

   $ qquota
   resource quota rule limit                filter       explanation
   --------------------------------------------------------------------------------
   slots_hosts/3       slots=7/8            hosts nx8    (computer CPU quota). On host 'nx8', 7 of the 8 available CPU cores (slots) are reserved.
   slots_hosts/4       slots=4/4            hosts nx9
   slots_hosts/5       slots=7/12           hosts nx12
   slots_hosts/5       slots=4/12           hosts nx11
   slots_hosts/5       slots=4/12           hosts nx10
   slots_hosts/5       slots=6/12           hosts nx13
   slots_hosts/6       slots=16/16          hosts nx14
   slots_hosts/6       slots=11/16          hosts nx16
   slots_hosts/6       slots=13/16          hosts nx15
   slots_hosts/6       slots=12/16          hosts nx17
   slots_users/7       slots=15/50          users joe    (user CPU quota). User 'joe' is using 15 of 50 available slots.
   slots_groups/2      slots=34/94          users @lab   (group CPU quota). Group 'lab' is using 34 of 94 available slots.
   mem_groups/2        mem_free=330.000G/40 users @lab   (group memory quota). Group 'lab' reserved 330GB of 400GB RAM available
   mem_users/2         mem_free=225.000G/26 users joe    (user memory quota). User 'joe' reserved 225GB of 260GB RAM available

In this example, user 'joe' is probably limited by memory (RAM), not CPU cores (slots). joe reserved only 15 of the 50 slots in the CPU slot (core) quota, and many slots are available on nx8-nx17. However, joe reserved 225GB of the 260GB RAM quota. To learn more about memory reservations, see :ref:`qmem` below.

qstats
^^^^^^

The qstats command lists all jobs. You can see your job's priority in the queue relative to other users. Jobs in the queue are sorted by priority, so the first job in the queue will run next if resources are available. The qstats command prints a lot of output, so pipe it to the 'less' command to see a page at a time::

   $ qstats | less

... and type 'q' to quit.

.. _sgelogs:

log files
^^^^^^^^^

| Log files help determine why a job failed or is behaving unexpectedly.
| Grid Engine produces output and error files named with the job name and number. They are located in your home directory, unless you specify otherwise.

* If there is no useful information in the log files, then edit your script to print more troubleshooting information.
* If the output/error files are empty, or there is a 'line 1' syntax error, then

  #. Verify the first line of your script defines an interpreter that exists. Common interpreters are /bin/sh, /bin/bash, and python. The interpreter line must begin with the #! symbol, e.g.:

     | #!/bin/sh
     | #!/bin/bash
     | #!/usr/bin/env python

  #. Confirm that your script doesn't contain control characters. To do this, use the 'cat -v