.. _analysis:

| :Last modified: |today|

##############
how to use SGE
##############

Below are instructions to:

#. :ref:`run a job with a memory reservation <commandlinescript>`
#. :ref:`analyze data in parallel <gridenginescript>`

.. _commandlinescript:

run a job with a memory reservation
===================================

To submit a script via Grid Engine, run :ref:`qsub` followed by the script name, e.g.::

    qsub myscript.sh

To submit a script that requires more than 2GB of memory, run 'qsub -l mem_free=XG', where 'X' is the amount of memory required, e.g.::

    qsub -l mem_free=4G myscript.sh

.. note:: To determine how much memory your job requires, see :ref:`memory`.

To monitor the job, run :ref:`qstat`. When complete, Grid Engine creates 2 files in your home directory: an error file (that starts with 'e') and an output file (that starts with 'o'). The file names include the job ID, e.g. 96606:

| test.sh.e96606
| test.sh.o96606

submit a matlab script
^^^^^^^^^^^^^^^^^^^^^^

This section describes how to run a matlab script called test.m with a 4GB memory reservation.

#. Verify that the matlab script (for example, test.m) is in your MATLABPATH. If you're not sure, move it to the directory $HOME/matlab/::

      mkdir $HOME/matlab
      mv test.m $HOME/matlab

#. In a new file, put the matlab command you want to run. For example, create a file in your home directory called runme.sh with the contents::

      matlab -nosplash -nojvm -nodisplay -r test

#. From a terminal window, run qsub and reserve 4GB of memory for your script::

      qsub -l mem_free=4G $HOME/runme.sh

.. _gridenginescript:

analyze data in parallel
========================

This section describes how to analyze multiple, independent data sets in parallel. To start, write a script that uses environment variables to define the data sets, for example:

| #!/bin/sh
| #
| # define data set location
| DATADIR="$HOME/DATA"
| DATASET="SUBJECT_101"
| echo "my data is here: $DATADIR/$DATASET"

This script uses the $HOME environment variable, which is a shortcut to your home directory. Save the text to a file, e.g. test.sh, and run it from the command line to verify that it works::

    sh $HOME/test.sh

Then use qsub to run it via Grid Engine::

    qsub $HOME/test.sh

You can monitor the job status with :ref:`qstat`. When complete, Grid Engine creates 2 files in $HOME: an error file (that starts with 'e') and an output file (that starts with 'o'). The file names include the job ID, e.g. 96606:

| test.sh.e96606
| test.sh.o96606

define your data sets
---------------------

The next step is to run the commands on multiple data sets. There are two methods to define the data sets:

#. :ref:`numeric array <numeric_array>`

   The data sets are named with a sequential, numeric component. A script is executed once for each element of the numeric array.

#. :ref:`submit command <submit>`

   A script is executed once for each data set defined in a directory or file.

Both are described below.

.. _numeric_array:

numeric array
^^^^^^^^^^^^^

To use an array, the data sets must be named with a sequential, numeric value (e.g. sub1, sub2, sub3). To define the array, use the qsub -t option followed by the numeric range, e.g. '-t 1-3'. Your script will be run once for each number in the array, and the current array value may be referenced in your script as the SGE_TASK_ID variable.

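
For instance, here is a minimal sketch of an array-aware script (the file name showtask.sh is only an illustrative choice, and this sketch is separate from the subject example that follows) that does nothing but print the task number it was given:

| #!/bin/sh
| #
| # print the numeric array value for this iteration
| echo "this iteration is task number: ${SGE_TASK_ID}"

Submitting it with 'qsub -t 1-3 showtask.sh' should produce three output files, each containing a different task number (1, 2, or 3).
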

numeric array example
*********************

The 'test.sh' script evaluates SUBJECT_101. But there are actually 3 data sets I want to analyze: SUBJECT_101, SUBJECT_102, and SUBJECT_103. The data sets are named with a sequential, numeric component (101-103), so I define the data sets with a qsub numeric array: -t 101-103.

Grid Engine will run the test.sh script 3 times, once for each value in the array. I use the SGE_TASK_ID variable to reference the current value of the array in the analysis: during the first iteration, the value of SGE_TASK_ID is '101'; during the second iteration, SGE_TASK_ID is '102'; and during the third, SGE_TASK_ID is '103'.

Next, I modify the test.sh script to use the SGE_TASK_ID variable to analyze the data corresponding to each subject. Specifically, I replace 'SUBJECT_101' with 'SUBJECT_${SGE_TASK_ID}'. Below is the modified 'test.sh' script:

| #!/bin/sh
| #
| # define data set location
| DATADIR="$HOME/DATA"
| DATASET="SUBJECT_${SGE_TASK_ID}"
| #
| # Run the following commands once per iteration.
| # If the output looks good, remove the word 'echo' to run the recon-all command instead of printing it.
| date
| echo recon-all -sd $DATADIR -s $DATASET

Run the 'test.sh' script with qsub::

    qsub -t 101-103 $HOME/test.sh

When complete, 6 files are created in $HOME - one output file and one error file for each iteration.

.. _submit:

submit command
^^^^^^^^^^^^^^

Alternatively, you may use the 'submit' command to define the location of your data. The 'submit' command has the functionality of 'qsub', plus it allows you to define where your data is located. The submit command requires that you specify either:

* a directory where the data sets are located, or
* a file with a list of data sets

The 'submit' command syntax is::

    submit -s /path/to/SCRIPT [ -d /path/to/DIRECTORY | -f /path/to/FILE ] [ -o OPTIONS_FILE ]

The submit command runs the script specified by the -s option once for each data set specified by the -f or the -d option. During each iteration, the name of the current data set is stored in the SGE_TASK variable. (A file of qsub options may be specified with the -o option; it is not required.)

submit command example
**********************

As shown previously, the test.sh script evaluates data from SUBJECT_101. Now I want to edit the script to analyze 3 non-sequential data sets: SUBJECT_101, SUBJECT_202, and SUBJECT_301. These data sets aren't named with a sequential numeric ID, so I use the 'submit' command to evaluate them.

I create a text file with a list of the data sets, called 'test.subjects':

| SUBJECT_101
| SUBJECT_202
| SUBJECT_301

I edit 'test.sh' to use the SGE_TASK variable to analyze the data sets listed in the test.subjects file above. Specifically, I replace 'SUBJECT_101' with '${SGE_TASK}':

| #!/bin/sh
| #
| # define data set location
| DATADIR="$HOME/DATA"
| DATASET="${SGE_TASK}"
| #
| # Run the following commands once per iteration.
| # If the output looks good, remove the word 'echo' to run the recon-all command instead of printing it.
| date
| echo recon-all -sd $DATADIR -s $DATASET

.. warning:: Be careful to reference your data sets correctly. The :ref:`qsub numeric array <numeric_array>` uses the SGE_TASK_ID variable, and the :ref:`submit command <submit>` uses the SGE_TASK variable. The primary difference is that SGE_TASK_ID is a number, whereas SGE_TASK is usually the full name of your data set.

Submit the 'test.sh' script with the submit command::

    submit -s $HOME/test.sh -f $HOME/test.subjects

When complete, 6 files are created in my home directory - one output file and one error file for each subject.

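
If the data sets are the only entries in a directory, the file list isn't strictly needed. As a sketch, assuming SUBJECT_101, SUBJECT_202, and SUBJECT_301 are the only entries in $HOME/DATA (an assumption made here for illustration; check how 'submit -d' treats any other files in the directory before relying on it), the same analysis could be submitted with the -d option instead::

    submit -s $HOME/test.sh -d $HOME/DATA
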

.. _config_qsub:

configure qsub
--------------

This section is optional. Below is a subset of the qsub options you may define:

==================  ==========================================================================================
option              description
==================  ==========================================================================================
-M emailaddress     | Change 'emailaddress' to specify where you want to receive notifications.
                    | The default value is the address specified in the .forward file in your home directory.
-m b|e|a|s|n        | The frequency of e-mail notifications. The default is '-m as'.
                    | The arguments have the following meaning:
                    | - b : Mail is sent at the beginning of the job
                    | - e : Mail is sent at the end of the job
                    | - a : Mail is sent when the job is aborted or rescheduled
                    | - s : Mail is sent when the job is suspended
                    | - n : No mail is sent
-e path             | Change 'path' to the directory where Grid Engine saves error files.
                    | The default is your home directory. If you change the default, verify the directory exists.
-o path             | Change 'path' to the directory where Grid Engine saves output files.
                    | The default is your home directory. If you change the default, verify the directory exists.
-N name             | Change 'name' to the name of the job.
-j yes              | Merge the Grid Engine output and error files.
-l mem_free=value   | Change 'value' to the amount of memory required by your script,
                    | e.g. -l mem_free=15G reserves 15GB of RAM for the script.
==================  ==========================================================================================

using qsub options
^^^^^^^^^^^^^^^^^^

You may specify qsub options immediately after the qsub command, e.g.::

    qsub -N test /path/to/script

Alternatively, you may put the options in a text file, one option per line, e.g.:

| -o $HOME/sge/logs
| -e $HOME/sge/logs
| -N test

If you use an options text file, pass it to qsub with the '-@' flag followed by the filename, e.g.::

    qsub -@ /path/to/options /path/to/script
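
Putting the pieces together: as a sketch, suppose the three options shown above are saved in a file called $HOME/myoptions (the file name, like the $HOME/sge/logs directory, is an illustrative choice rather than a requirement). The array job from the numeric array example could then be submitted with::

    mkdir -p $HOME/sge/logs
    qsub -@ $HOME/myoptions -t 101-103 $HOME/test.sh

Each task would then write its output and error files to $HOME/sge/logs, and the job would appear in :ref:`qstat` under the name 'test'.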