Slurm

Slurm is a workload manager: it queues jobs and schedules them onto the cluster's compute resources, so that independent analyses can run in parallel and the hardware stays busy.

Useful commands

Below are commonly used Slurm commands with some examples:

  • sbatch

    # use sbatch to submit a script to run in batch
    $ sbatch script.sh
    
  • srun

    # use srun to run a command as a job step
    # Slurm records each srun call as a separate step in the job's accounting
    $ srun ~/slurm/test1.sh
    
  • squeue

    # list the job steps (-s) of your own jobs (--me)
    $ squeue -s --me
    
    # update squeue every 2 seconds:
    $ watch squeue -s --me
    
  • sstat

    # monitor the CPU and memory usage of a running job's steps
    $ watch sstat --format=JobID,MaxRSS,AveCPU -j <jobid>
    

    Explanation of fields that can be listed in the --format string:

    MaxRSS

    The maximum Resident Set Size (the actual physical RAM) used by any task in the job step so far. Useful for detecting whether your script is approaching its --mem limit, beyond which the job may be killed.

    AveRSS

    The average Resident Set Size across all tasks in the step.

    AveCPU

    The average CPU time consumed by all tasks in the step.

    MaxPages

    The maximum page fault count across all tasks in the step. A high count can indicate that the job is swapping to disk because it is short of RAM.

  • sacct

    $ sacct --format=JobID,JobName,Elapsed,NCPUS,State,MaxRSS,ExitCode -j <jobid>
    

    Explanation of fields that can be listed in the --format string:

    JobID

    The unique tracking number assigned to the job and its individual steps.

    Elapsed

    The actual wall-clock time the job took to run from start to finish (DD-HH:MM:SS).

    NCPUS

    The total number of CPUs allocated to the job (on this cluster, one CPU corresponds to one physical core).

    State

    The final status of the job execution (e.g., COMPLETED, FAILED, RUNNING, OUT_OF_MEMORY).

    MaxRSS

    The maximum amount of actual physical RAM used by the job during its run.

    CPUTime

    The total CPU time charged to the job, calculated as NCPUS × Elapsed. It counts the time the CPUs were allocated, whether or not they were busy.

    ExitCode

    The exit status of the job or step, shown as exit code:signal (0:0 means successful completion; a non-zero exit code indicates an error, and a non-zero signal means the process was terminated by that signal).
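
The CPUTime calculation above is plain arithmetic once Elapsed has been converted to seconds. The following is a minimal sketch, not part of Slurm: the helper names elapsed_to_seconds and cpu_time_seconds are hypothetical, and the sketch assumes Elapsed is formatted as [DD-]HH:MM:SS (or MM:SS).

```bash
#!/usr/bin/env bash
# Hypothetical helper: convert an elapsed-time string to whole seconds.
# Assumes the form [DD-]HH:MM:SS or MM:SS.
elapsed_to_seconds() {
    printf '%s\n' "$1" | awk '{
        days = 0; rest = $0
        # An optional "DD-" prefix holds the number of days
        if (index(rest, "-") > 0) { split(rest, d, "-"); days = d[1]; rest = d[2] }
        n = split(rest, t, ":")
        if (n == 3)      secs = t[1] * 3600 + t[2] * 60 + t[3]
        else if (n == 2) secs = t[1] * 60 + t[2]
        else             secs = t[1]
        print days * 86400 + int(secs)
    }'
}

# Hypothetical helper: CPUTime in seconds = NCPUS x Elapsed.
# $1 = NCPUS, $2 = Elapsed
cpu_time_seconds() {
    echo $(( $1 * $(elapsed_to_seconds "$2") ))
}

# A job with 3 CPUs that ran for 11 seconds is charged 33 CPU-seconds
cpu_time_seconds 3 00:00:11   # prints 33
```

Requesting CPUTime directly in the --format string avoids the conversion; the sketch only makes the arithmetic explicit.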

Example 1: Sequential batch job

Below is an example of a batch script that runs 3 programs sequentially: test1.sh, test2.sh, and test3.sh.

Download scripts: sequential.sh | test1.sh | test2.sh | test3.sh

sequential.sh
#!/bin/bash
#SBATCH --job-name=sequential_test
#SBATCH --ntasks=1        # 1 task: the scripts run one after another
#SBATCH --mem-per-cpu=4G  # 4G RAM per allocated CPU

# Each srun starts only after the previous one has finished
srun ~/slurm/test1.sh
srun ~/slurm/test2.sh
srun ~/slurm/test3.sh

$ sbatch sequential.sh
Submitted batch job 90
$ sacct --format=JobID,JobName,Elapsed,NCPUS,State,MaxRSS,ExitCode -j 90

JobID           JobName    Elapsed      NCPUS      State     MaxRSS ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
90           sequentia+   00:00:21          1  COMPLETED                 0:0
90.batch          batch   00:00:21          1  COMPLETED      5248K      0:0
90.extern        extern   00:00:21          1  COMPLETED                 0:0
90.0           test1.sh   00:00:10          1  COMPLETED       856K      0:0
90.1           test2.sh   00:00:05          1  COMPLETED       856K      0:0
90.2           test3.sh   00:00:05          1  COMPLETED       852K      0:0

The job finished in 21 seconds, roughly the sum of its steps, because each step waits for the previous one. Each test script (step) is identified by the suffix ‘.0’, ‘.1’ or ‘.2’:

  • 90.0 (test1.sh) ran 10 seconds

  • 90.1 (test2.sh) ran 5 seconds

  • 90.2 (test3.sh) ran 5 seconds
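
The relationship between the job's elapsed time and its steps can be checked by summing the Elapsed column of the numbered steps. This is a minimal sketch, assuming the sacct column order used above (Elapsed is the third column) and elapsed times shorter than one day; the function name sum_step_seconds is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read sacct output on stdin and sum the Elapsed
# column (HH:MM:SS, third column) of the numbered steps only.
sum_step_seconds() {
    awk '
        # Numbered steps look like "90.0"; the job line, .batch and .extern are skipped
        $1 ~ /^[0-9]+\.[0-9]+$/ {
            split($3, t, ":")
            total += t[1] * 3600 + t[2] * 60 + t[3]
        }
        END { print total + 0 }
    '
}

# The sacct rows from the example above, pasted in as sample input
sample='90           sequentia+   00:00:21          1  COMPLETED                 0:0
90.batch          batch   00:00:21          1  COMPLETED      5248K      0:0
90.extern        extern   00:00:21          1  COMPLETED                 0:0
90.0           test1.sh   00:00:10          1  COMPLETED       856K      0:0
90.1           test2.sh   00:00:05          1  COMPLETED       856K      0:0
90.2           test3.sh   00:00:05          1  COMPLETED       852K      0:0'

printf '%s\n' "$sample" | sum_step_seconds   # prints 20
```

The steps add up to 20 seconds against 21 seconds for the whole job; the remaining second is presumably start-up overhead and rounding.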

Example 2: Parallel batch job

Below is an example that runs the same 3 programs in parallel:

Download script: parallel.sh

parallel.sh
#!/bin/bash
#SBATCH --job-name=parallel_test
#SBATCH --ntasks=3          # reserve 3 task slots, one per parallel script
#SBATCH --cpus-per-task=1   # 1 CPU per task
#SBATCH --mem-per-cpu=4G    # 4G RAM per allocated CPU, so 4G per script

# Launch each job step in the background using '&'
srun --ntasks=1 ~/slurm/test1.sh &
srun --ntasks=1 ~/slurm/test2.sh &
srun --ntasks=1 ~/slurm/test3.sh &

# Block until all background steps finish; otherwise the batch script
# would exit and the job would end while the tests are still running
wait

$ sbatch parallel.sh
Submitted batch job 97
$ sacct --format=JobID,JobName,Elapsed,NCPUS,State,MaxRSS,ExitCode -j 97

JobID           JobName    Elapsed      NCPUS      State     MaxRSS ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
97           parallel_+   00:00:11          3  COMPLETED                 0:0
97.batch          batch   00:00:11          3  COMPLETED     14192K      0:0
97.extern        extern   00:00:11          3  COMPLETED                 0:0
97.0           test3.sh   00:00:05          1  COMPLETED       896K      0:0
97.1           test1.sh   00:00:10          1  COMPLETED       900K      0:0
97.2           test2.sh   00:00:05          1  COMPLETED       904K      0:0

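The parallel job completes in 11 seconds, about as long as its slowest step, while each test script takes the same time as in the sequential run (10 seconds, 5 seconds, 5 seconds). The step numbers (.0, .1, .2) do not match the order in the script because the background steps start in no guaranteed order.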
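
A bare wait returns once all background steps have finished, but it does not report whether any of them failed. One possible refinement, sketched here on the assumption that you want to count failed steps, is to record each background PID and wait on them one at a time. To keep the sketch runnable outside Slurm, the hypothetical run_step function stands in for srun --ntasks=1 <script>.

```bash
#!/usr/bin/env bash
# Hypothetical stand-in for "srun --ntasks=1 <script>":
# sleeps for $1 seconds, then finishes with exit status $2.
run_step() {
    sleep "$1"
    return "$2"
}

# Start the steps in the background and remember their PIDs
pids=""
run_step 1 0 & pids="$pids $!"
run_step 0 3 & pids="$pids $!"   # this step fails on purpose
run_step 0 0 & pids="$pids $!"

# "wait PID" returns the exit status of that background step
failed=0
for pid in $pids; do
    wait "$pid" || failed=$((failed + 1))
done

echo "failed steps: $failed"   # prints "failed steps: 1"
```

In a batch script, finishing with a non-zero exit status when failed is greater than 0 should make the failure visible in the State and ExitCode columns of sacct.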

Example 3: Array batch job

Below is an example that uses a job array to run the test scripts in parallel:

Download script: array.sh

array.sh
#!/bin/bash
#SBATCH --job-name=array_test
#SBATCH --ntasks=1                     # 1 task per sub-job
#SBATCH --cpus-per-task=1              # 1 physical core per sub-job
#SBATCH --mem=4G                       # 4G RAM per sub-job
#SBATCH --array=1-3                    # Creates 3 parallel sub-jobs (1, 2, and 3)

# Slurm sets $SLURM_ARRAY_TASK_ID to a different index (1, 2, or 3) in each sub-job
~/slurm/test${SLURM_ARRAY_TASK_ID}.sh

$ sbatch array.sh
Submitted batch job 116
$ sacct --format=JobID,JobName,Elapsed,NCPUS,State,MaxRSS,ExitCode -j 116

JobID           JobName    Elapsed      NCPUS      State     MaxRSS ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
116_1        array_test   00:00:11          1  COMPLETED                 0:0
116_1.batch       batch   00:00:11          1  COMPLETED      1400K      0:0
116_1.extern     extern   00:00:11          1  COMPLETED                 0:0
116_2        array_test   00:00:05          1  COMPLETED                 0:0
116_2.batch       batch   00:00:05          1  COMPLETED      1532K      0:0
116_2.extern     extern   00:00:05          1  COMPLETED                 0:0
116_3        array_test   00:00:05          1  COMPLETED                 0:0
116_3.batch       batch   00:00:05          1  COMPLETED      1440K      0:0
116_3.extern     extern   00:00:05          1  COMPLETED                 0:0
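
Each sub-job appears in sacct as its own entry (116_1, 116_2, 116_3), and the sub-jobs differ only in the value of $SLURM_ARRAY_TASK_ID. Besides building a file name from the index, a common pattern is to use it to pick one line from a list of inputs. This is a minimal sketch under stated assumptions: the list file, its entries (sub-01 to sub-03) and the function name nth_line are made up for illustration, and the index defaults to 1 so that the sketch also runs outside Slurm.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print line number $2 of file $1,
# so that line N of the list belongs to array task N.
nth_line() {
    sed -n "${2}p" "$1"
}

# Hypothetical input list, one entry per line
list_file=$(mktemp)
printf '%s\n' sub-01 sub-02 sub-03 > "$list_file"

# Slurm sets SLURM_ARRAY_TASK_ID inside an array job; default to 1 elsewhere
task_id="${SLURM_ARRAY_TASK_ID:-1}"
entry=$(nth_line "$list_file" "$task_id")
echo "array task ${task_id} would process ${entry}"

rm -f "$list_file"
```

With --array=1-3, each of the three sub-jobs would read a different line of the list and process a different entry.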

Documentation

For more documentation, see the official Slurm Workload Manager Documentation.