Grid Engine

Grid Engine is batch-processing software that queues and schedules jobs so you can use computing resources efficiently and analyze data in parallel.

Scripting Tutorial

In this tutorial, we write a script to submit a job to the Grid Engine batch processing system.
A script defines the location of the data, and the commands to analyze the data.

Note

Before running a script in batch (parallel), confirm that each job’s input and output are independent of every other job’s (no job overwrites, modifies, or conflicts with another job’s files).

Use variables whenever possible, so it’s easier to modify a name later.

For example, the following script uses the variables HOME, DATADIR, and DATASET ($HOME is a pre-defined variable referencing the user’s home directory):

#!/bin/sh
DATADIR="$HOME/data"
DATASET="subject_101"
echo "my data is here: $DATADIR/$DATASET"

Save the text to a file, e.g. test.sh, and run it from the command line to verify it works:

sh $HOME/test.sh

Then use qsub to run it via Grid Engine:

qsub $HOME/test.sh

You can monitor the job status with qstat.

When complete, Grid Engine creates 2 files in $HOME: an error file (that starts with ‘e’), and an output file (that starts with ‘o’). The file names include the job ID, e.g. 96606:

test.sh.e96606
test.sh.o96606

batch processing

To run jobs in parallel (batch), we need to define the data sets over which to iterate.

defining datasets with numeric_array

If our data sets are numbered in sequential order, then we can use a Grid Engine array, called a ‘numeric array’: Grid Engine executes a script once for each number in the array. The array value is referenced using the variable ‘SGE_TASK_ID’.

For example:

qsub -t 101-110  $HOME/test.sh

executes the $HOME/test.sh script 10 times, once for each value from 101 to 110. Inside the script, use the SGE_TASK_ID variable to reference the current array value.

See Scripting Example 1: qsub numeric array.

defining datasets with ‘submit’ qsub wrapper

Alternatively, a ‘submit’ qsub wrapper can be used to

  1. Specify the directory where data sets are located, or

  2. Specify the file listing the data sets

The ‘submit’ wrapper syntax is:

submit -s /path/to/SCRIPT [ -d /path/to/DIRECTORY | -f /path/to/FILE ] [ -o OPTIONS_FILE ]

For example:

submit -s $HOME/test.sh -d $HOME/datadir

The submit wrapper runs the specified script once for each data set (found via the -d or the -f option). During each iteration, the name of the data set is stored in the SGE_TASK variable. The -o option is not required.
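
The wrapper’s internals aren’t shown here, but the iteration it performs with -f can be sketched as a plain shell loop: one run per line of the file, with the data set name exported in SGE_TASK. This is a self-contained demo using temporary stand-in files (the real wrapper submits each iteration to Grid Engine via qsub rather than running it locally):

```shell
#!/bin/sh
# Self-contained sketch of the iteration the 'submit' wrapper performs
# with -f. The real wrapper dispatches each iteration via qsub.

# demo inputs (stand-ins for $HOME/test.subjects and $HOME/test.sh)
WORK=$(mktemp -d)
printf 'subject_101\nsubject_201\nsubject_301\n' > "$WORK/test.subjects"
cat > "$WORK/test.sh" <<'EOF'
#!/bin/sh
echo "analyzing: $SGE_TASK"
EOF

# the iteration itself: one run per line, with SGE_TASK set to that line
while IFS= read -r dataset; do
    SGE_TASK="$dataset" sh "$WORK/test.sh"
done < "$WORK/test.subjects"
```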

See Scripting Example 2: ‘submit’ qsub wrapper.

Scripting Example 1: qsub numeric array

In this example, we edit the test.sh script to reference 3 data sets named subject_101, subject_102, and subject_103.

Since the data sets are named sequentially with a numeric component (101-103), we can use a numeric array: -t 101-103.

We use the SGE_TASK_ID variable to reference the value of the array for the analysis. During the first iteration, SGE_TASK_ID=101; during the second iteration, SGE_TASK_ID=102; and in the third, SGE_TASK_ID=103.

Below is the modified ‘test.sh’ script:

#!/bin/sh
# define data set location
DATADIR="$HOME/data"
DATASET="subject_${SGE_TASK_ID}"
echo "my data is here: $DATADIR/$DATASET"
#
# Run ‘recon-all’ once per iteration
recon-all -sd $DATADIR -s $DATASET

Run the ‘test.sh’ script with qsub:

qsub -t 101-103  $HOME/test.sh

When complete, 6 files are created in $HOME: one output file and one error file for each task, with the task ID appended (e.g. test.sh.o96606.101).

Scripting Example 2: ‘submit’ qsub wrapper

In this example, we edit the test.sh script to reference 3 non-sequential data sets: subject_101, subject_201, and subject_301. Because these data sets aren’t named with a sequential numeric ID, we use the ‘submit’ command instead of a numeric array.

We create a text file with a list of the data sets, called ‘test.subjects’:

subject_101
subject_201
subject_301
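
For a short list like this, typing it by hand is fine; but when the names happen to follow a pattern (here, IDs 101, 201, 301 in steps of 100), the list can also be generated from the shell. A sketch:

```shell
# generate test.subjects for IDs 101, 201, 301 (seq with a step of 100)
seq 101 100 301 | sed 's/^/subject_/' > "$HOME/test.subjects"
```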

We edit ‘test.sh’ to use the SGE_TASK variable so each iteration analyzes one data set from the test.subjects file above. Specifically, replace ‘subject_101’ with ‘${SGE_TASK}’:

#!/bin/sh
#
# define data set location
DATADIR="$HOME/data"
DATASET="${SGE_TASK}"
#
# Run recon-all once per iteration
recon-all -sd $DATADIR -s $DATASET

Warning

Be careful to reference your data sets correctly. The qsub numeric array uses the SGE_TASK_ID variable, and the submit command uses the SGE_TASK variable. The primary difference is that SGE_TASK_ID is a number, whereas SGE_TASK is usually the full name of your data set.

Submit the ‘test.sh’ script with the ‘submit’ wrapper:

submit -s $HOME/test.sh -f $HOME/test.subjects

When complete, 6 files are created in your home directory: one output file and one error file for each subject.

Scripting Example 3: matlab (with memory reservation)

In this example, we have a matlab script called test.m that requires 4G system RAM (memory).

  1. verify that the test.m matlab script is in your $MATLABPATH, or in the directory $HOME/matlab/:

    mkdir $HOME/matlab
    mv test.m $HOME/matlab
    
  2. create a new file called runme.sh that executes the matlab command:

    matlab -nosplash -nojvm -nodisplay -r test
    

Note

the test.m file is referenced as ‘test’ (without the .m extension)

  3. from a terminal window, submit the runme.sh script and reserve 4G of memory:

    qsub -l mem_free=4G $HOME/runme.sh
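
Putting the steps above together, runme.sh can be created directly from the shell. A minimal sketch (the heredoc writes the matlab command shown in step 2 into the file):

```shell
# create a minimal runme.sh: a shell script whose only job is to
# launch matlab non-interactively and run test.m
cat > "$HOME/runme.sh" <<'EOF'
#!/bin/sh
# run test.m (assumes $HOME/matlab is on the MATLAB path)
matlab -nosplash -nojvm -nodisplay -r test
EOF
```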
    

command reference

Below is a list of popular Grid Engine commands.

qsub

submit a batch job to Grid Engine

Run ‘qsub’ to submit the simple.sh script:

$ qsub simple.sh

Error ‘e’ and output ‘o’ files are created in your home directory:

simple.sh.e96606
simple.sh.o96606

Note

You can monitor job status with qstat.

Below is a subset of qsub options you may define:

-M emailaddress

Change ‘emailaddress’ to specify where you want to receive notifications
The default value is the email address specified in the .forward file in your home directory

-m b|e|a|s|n

The frequency of e-mail notifications.
The default is:
-m as
The arguments have the following meaning:
- b : Mail is sent at the beginning of the job
- e : Mail is sent at the end of the job
- a : Mail is sent when the job is aborted or rescheduled
- s : Mail is sent when the job is suspended
- n : No mail is sent

-e path

The directory for SGE error files
Change ‘path’ to the directory where Grid Engine saves error files
e.g. $HOME/sge/logs
The default is your home directory. If you change the default, verify the directory exists

-o path

The directory for SGE output files (can be the same as above)
Change ‘path’ to the directory where Grid Engine saves output files
e.g. $HOME/sge/logs
The default is your home directory. If you change the default, verify the directory exists
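
Since the directory must exist before the job runs, create it first (using the example path above; -p creates intermediate directories and is harmless if the path already exists):

```shell
# create the log directory for -o/-e if it doesn't already exist
mkdir -p "$HOME/sge/logs"
```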

-N name

The name of the job
e.g. -N test

-j yes

Merge the output and error files

-l mem_free=value

Define the memory requirements for your job
Change ‘value’ to the amount of memory required by your script.
e.g. -l mem_free=15G reserves 15GB of RAM for the script

For example:

qsub -M me@berkeley.edu -N test /path/to/script

Using qsub options file

To specify qsub options in a text file, place one option per line, e.g.:

-o $HOME/sge/logs
-e $HOME/sge/logs
-N test

Pass the options file to qsub with the ‘-@’ flag followed by the filename, e.g.:

qsub -@ /path/to/options /path/to/script
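
One way to create the options file shown above from the shell (the quoted ‘EOF’ keeps $HOME literal in the file, exactly as written above):

```shell
# write the qsub options file, one option per line
cat > "$HOME/sge-options" <<'EOF'
-o $HOME/sge/logs
-e $HOME/sge/logs
-N test
EOF
```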

qload

prints a summary of everyone’s submitted jobs, and lists your running jobs with system CPU/memory information.

qdel

deletes a job from the queue

To delete all of your jobs:

$ qdel -u $USER

Note

$USER will be interpreted by the shell as your unix account name. You may use $USER in shell commands and shell scripts instead of your unix account name.

To delete a single job:

$ qdel 6033

qstat

prints information about incomplete jobs, including errors and scheduling information. To use qstat for troubleshooting, see troubleshooting with qstat.

qacct

prints information about completed jobs, including how much memory a job used. See memory reservation for more information.

memory reservation

A job may be terminated if there is insufficient memory (RAM) on a workstation. In that case, you’ll receive email at the address in your $HOME/.forward file.

If your job uses more than 2GB RAM, then reserve memory via qsub (see below). But PLEASE only reserve what you need… reserved memory can’t be used by others, and there is a limited supply.

reserve memory via qsub

To reserve memory with qsub, use the qsub -l option, and replace <maxvmem> with the numeric value:

qsub -l mem_free=<maxvmem>  /path/to/script

For example, if the memory requirement (maxvmem) is 4G:

qsub -l mem_free=4G /path/to/script

If you don’t know how much memory a job needs, then submit a single job, and monitor it with the qstat command. When it’s finished, use the qacct command as described above.

For more examples with qsub, see the Scripting Tutorial.

reserve memory via qsub ‘submit’ wrapper

When using the ‘submit’ wrapper to run jobs, create an options file with the memory reservation. For example, if the name of your options file is $HOME/sge-options, and you want to reserve 43G of RAM, then you can run:

echo "-l mem_free=43G" >> $HOME/sge-options

When you run submit, specify the options file with ‘-o $HOME/sge-options’.

For more information, see defining datasets with ‘submit’ qsub wrapper above.

modify memory reservation

You can’t change the reservation for running jobs, but you CAN change the memory reservation for jobs waiting in the queue. To do this, run the following command, replacing <maxvmem> with the value obtained from the qacct command:

for jobID in `qstat | grep qw | awk '{print $1}'`; do qalter -l mem_free=<maxvmem> $jobID; done

For example, if the memory reservation is 4G:

for jobID in `qstat | grep qw | awk '{print $1}'`; do qalter -l mem_free=4G $jobID; done
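
The grep above matches ‘qw’ anywhere in the line; a slightly more precise variant tests the state column itself (field 5 in the qstat listing shown under troubleshooting with qstat). A sketch, demonstrated on that sample listing since a live queue isn’t assumed here:

```shell
# print job IDs whose state column (field 5) is exactly "qw".
# In practice you would pipe real output, restricted to your own jobs:
#   qstat -u $USER | awk '$5 == "qw" { print $1 }'
SAMPLE='100     0.55111 test       me      r       02/29/2016 12:28:41      all.q@nx13.priv
101     0.55023 test       me      qw      02/29/2016 09:46:30'
echo "$SAMPLE" | awk '$5 == "qw" { print $1 }'
```

The resulting IDs can then be fed to qalter just as in the loop above.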

Why didn’t my job run?

For jobs waiting in the queue, see troubleshooting with qstat or troubleshooting with qquota.

For failed jobs, see using log files.

troubleshooting with qstat

The ‘qstat’ command without any options prints each job’s state and where it is running. The 3 most common states are:

r

running

qw

waiting in the queue

Eqw

failed

If a job is waiting in the queue, then there aren’t enough resources (CPU, memory) to run the job.

For example:

$ qstat
job-ID  prior   name       user    state   submit/start at          queue                        slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
100     0.55111 test       me      r       02/29/2016 12:28:41      all.q@nx13.priv
101     0.55023 test       me      qw      02/29/2016 09:46:30

In this example, Job-ID 100 is running on nx13. Job-ID 101 is waiting in the queue.

Note

You may log into the server where your job is running to monitor it with the ‘top’ command (type ‘q’ to quit the top command). For example:

ssh nx13
top

For detailed job scheduling information, use qstat -j <job-ID>, e.g.:

$ qstat -j 101
==============================================================
job_number:                 101
...
scheduling info:            (-l mem_free=2.95G) cannot run in queue "nx20.priv" because it offers only hc:mem_free=0.000
                            (-l mem_free=2.95G) cannot run in queue "nx18.priv" because it offers only hc:mem_free=2.000G
                            (-l mem_free=2.95G) cannot run in queue "nx19.priv" because it offers only hc:mem_free=0.000
                            (-l mem_free=2.95G) cannot run in queue "nx15.priv" because it offers only hc:mem_free=2.000G
                            (-l mem_free=2.95G) cannot run in queue "nx17.priv" because it offers only hc:mem_free=2.000G
                            (-l mem_free=2.95G) cannot run in queue "nx14.priv" because it offers only hc:mem_free=2.000G
                            (-l mem_free=2.95G) cannot run in queue "nx16.priv" because it offers only hc:mem_free=2.000G
                            cannot run because it exceeds limit "////nx21/" in rule "max_slots_hosts/6"
                            (-l mem_free=2.95G) cannot run in queue "nx12.priv" because it offers only hc:mem_free=2.000G
                            (-l mem_free=2.95G) cannot run in queue "nx10.priv" because it offers only hc:mem_free=2.000G

In this example, there aren’t enough resources available to run job ID 101. The limiting resource is memory, except on nx21, where all CPU cores (slots) are in use.

troubleshooting with qquota

‘qquota’ prints resource quotas on computers (slots_hosts), users (slots_users), and labs (slots_groups).

For example:

$ qquota
resource quota rule limit                filter         explanation
--------------------------------------------------------------------------------
slots_hosts/3      slots=7/8            hosts nx8       (computer CPU quota). On host 'nx8', 7 of the 8 available CPU cores (slots) are reserved.
slots_hosts/4      slots=4/4            hosts nx9
slots_hosts/5      slots=7/12           hosts nx12
slots_hosts/5      slots=4/12           hosts nx11
slots_hosts/5      slots=4/12           hosts nx10
slots_hosts/5      slots=6/12           hosts nx13
slots_hosts/6      slots=16/16          hosts nx14
slots_hosts/6      slots=11/16          hosts nx16
slots_hosts/6      slots=13/16          hosts nx15
slots_hosts/6      slots=12/16          hosts nx17
slots_users/7      slots=15/50          users me        (user CPU quota). User 'me' is using 15 of 50 available slots.
slots_groups/2     slots=34/94          users @lab      (group CPU quota). Group 'lab' is using 34 of 94 available slots.
mem_groups/2       mem_free=330.000G/400G users @lab     (group memory quota). Group 'lab' reserved 330GB of the 400GB RAM available
mem_users/2        mem_free=225.000G/260G users me       (user memory quota). User 'me' reserved 225GB of the 260GB RAM available

The last 4 lines show the quotas for the user and lab.

To learn more about memory reservations, see print memory availability below.

troubleshooting with qstats

The qstats command lists all jobs, so you can see your job’s priority in the queue relative to other users. Jobs in the queue are sorted by priority: the first job in the queue will run next if resources are available. The qstats command prints a lot of output, so pipe it to the ‘less’ command to view a page at a time:

$ qstats | less

… and type ‘q’ to quit.

using log files

Log files help determine why a job failed or is behaving unexpectedly. Grid Engine produces output and error files named with the job name and number; they are located in your home directory unless you specify otherwise.
  • If there is no useful information in the log files, then edit your script to print more troubleshooting information.

  • If the output/error files are empty, or there is a ‘line 1’ syntax error, then

    1. verify that the first line of your script defines an interpreter that exists. Common interpreters are /bin/sh, /bin/bash, and python. The interpreter line must begin with the #! symbol, e.g.:

      #!/bin/bash
      or
      #!/usr/bin/env python
    2. Confirm that your script doesn’t contain control characters. To do this, use the ‘cat -v <script>’ command, e.g.

      cat -v sge_submit.sh
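
If cat -v shows ^M at the ends of lines, the script has Windows-style line endings. One way to strip them, sketched here with a self-contained demo file (writing to a new file so the original is preserved):

```shell
# demo: a script with Windows line endings (\r\n), shown as ^M by cat -v
WORK=$(mktemp -d)
printf '#!/bin/sh\r\necho hello\r\n' > "$WORK/sge_submit.sh"

# strip the carriage returns; write to a new file to keep the original
tr -d '\r' < "$WORK/sge_submit.sh" > "$WORK/sge_submit_fixed.sh"
cat -v "$WORK/sge_submit_fixed.sh"
```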
      

Please use the qdel command to remove failed jobs from the queue.