Slurm
Slurm is a workload manager that schedules jobs on the cluster, so that analyses can run in parallel and make efficient use of the available computing resources.
Useful commands
Below are commonly used Slurm commands with some examples:
sbatch
# use sbatch to submit a script to run in batch
$ sbatch script.sh

srun
# use srun to run a command as a job step
# inside a batch script, each srun call is recorded as its own step, so sacct can report time and memory per step
$ srun ~/slurm/test1.sh

squeue
# monitor your jobs in the queue (-s lists job steps, --me limits output to your own jobs)
$ squeue -s --me
# update squeue every 2 seconds:
$ watch squeue -s --me

sstat
# monitor a running job's status, CPU time, and memory usage
$ watch sstat --format=JobID,MaxRSS,AveCPU -j <jobid>
Explanation of fields that can be used in the --format string:
- MaxRSS
The maximum resident set size (the physical RAM actually in use) reached by any task in the job step so far. Useful for spotting a script that is approaching its --mem limit, beyond which Slurm may kill it.
- AveRSS
The average resident set size across all tasks in this step.
- AveCPU
The average CPU time consumed by all tasks in the step.
- MaxPages
The maximum number of page faults of any task in the step. A high value can indicate that the job is short on RAM and is paging to disk.
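To make the MaxRSS check concrete, the sketch below converts memory values such as 856K or 4G to KiB and reports usage as a percentage of a limit. The helper names `to_kib` and `mem_usage_percent` are hypothetical (they are not Slurm commands), and the code assumes that K, M, G, and T are powers of 1024 and that a value without a suffix is in bytes. The 3G figure is an illustrative value, not a measurement.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Slurm) for comparing MaxRSS with a memory limit.

# to_kib VALUE: convert a memory value such as 856K, 1.5M, or 4G to whole KiB.
# Assumption: suffixes are powers of 1024; a value with no suffix is in bytes.
to_kib() {
    awk -v v="$1" 'BEGIN {
        gsub(/[ \t]/, "", v)               # drop column padding around the value
        unit = substr(v, length(v), 1)     # last character: K, M, G, T, or a digit
        num = v + 0                        # leading numeric part of the value
        if      (unit == "K") factor = 1
        else if (unit == "M") factor = 1024
        else if (unit == "G") factor = 1024 * 1024
        else if (unit == "T") factor = 1024 * 1024 * 1024
        else                  factor = 1 / 1024
        printf "%d\n", num * factor
    }'
}

# mem_usage_percent USED LIMIT: print USED as a whole percentage of LIMIT.
mem_usage_percent() {
    used=$(to_kib "$1")
    limit=$(to_kib "$2")
    echo $(( used * 100 / limit ))
}

mem_usage_percent 3G 4G    # illustrative values: prints 75
```

For a running job with a single step, the first argument could come from `sstat --noheader --format=MaxRSS -j <jobid>`, and the second from the memory requested in the batch script.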
sacct
# report accounting data for running and completed jobs
$ sacct --format=JobID,JobName,Elapsed,NCPUS,State,MaxRSS,ExitCode -j <jobid>
Explanation of fields that can be used in the --format string:
- JobID
The unique tracking number assigned to the job and its individual steps.
- Elapsed
The wall-clock time the job took to run from start to finish, shown as [DD-]HH:MM:SS.
- NCPUS
The total number of CPUs allocated to the job (on this cluster, one CPU corresponds to one physical core).
- State
The current or final state of the job (e.g., RUNNING, COMPLETED, FAILED, OUT_OF_MEMORY).
- MaxRSS
The maximum amount of actual physical RAM used by the job during its run.
- CPUTime
The CPU time charged to the job, calculated as NCPUS × Elapsed. This is allocated time, not necessarily time the CPUs spent busy.
- ExitCode
The return code of the job or step, shown as exit_code:signal. 0:0 means successful completion; a non-zero exit code indicates an error, and a non-zero signal means the process was terminated by that signal.
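As a worked example of the CPUTime formula, the sketch below converts an Elapsed value to seconds and multiplies it by NCPUS. The function names `elapsed_to_seconds` and `cpu_time_seconds` are hypothetical helpers, not Slurm commands, and the input is assumed to be in [DD-]HH:MM:SS form.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Slurm) illustrating CPUTime = NCPUS x Elapsed.

# elapsed_to_seconds ELAPSED: convert [DD-]HH:MM:SS to seconds.
elapsed_to_seconds() {
    echo "$1" | awk -F'[-:]' '{
        if (NF == 4) print $1 * 86400 + $2 * 3600 + $3 * 60 + $4   # DD-HH:MM:SS
        else         print $1 * 3600 + $2 * 60 + $3                # HH:MM:SS
    }'
}

# cpu_time_seconds NCPUS ELAPSED: CPU time in seconds for an allocation.
cpu_time_seconds() {
    echo $(( $1 * $(elapsed_to_seconds "$2") ))
}

cpu_time_seconds 3 00:00:11    # 3 CPUs held for 11 seconds: prints 33
```

A job that holds 3 CPUs for 11 seconds is therefore charged 33 CPU-seconds, even if some of those CPUs were idle part of the time.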
Example 1: Sequential batch job
Below is an example of a batch script that runs 3 programs sequentially: test1.sh, test2.sh, and test3.sh.
Download scripts: sequential.sh | test1.sh | test2.sh | test3.sh
#!/bin/bash
#SBATCH --job-name=sequential_test
#SBATCH --ntasks=1 # 1 task; the steps below run one after another
#SBATCH --mem-per-cpu=4G # 4G RAM per allocated CPU

srun ~/slurm/test1.sh
srun ~/slurm/test2.sh
srun ~/slurm/test3.sh
$ sbatch sequential.sh
Submitted batch job 90
$ sacct --format=JobID,JobName,Elapsed,NCPUS,State,MaxRSS,ExitCode -j 90
JobID        JobName    Elapsed    NCPUS      State      MaxRSS     ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
90           sequentia+ 00:00:21   1          COMPLETED             0:0
90.batch     batch      00:00:21   1          COMPLETED  5248K      0:0
90.extern    extern     00:00:21   1          COMPLETED             0:0
90.0         test1.sh   00:00:10   1          COMPLETED  856K       0:0
90.1         test2.sh   00:00:05   1          COMPLETED  856K       0:0
90.2         test3.sh   00:00:05   1          COMPLETED  852K       0:0
The job finished in 21 seconds. Each test script ran as a separate step, identified by the suffixes ‘.0’, ‘.1’, and ‘.2’:
90.0 (test1.sh) ran for 10 seconds
90.1 (test2.sh) ran for 5 seconds
90.2 (test3.sh) ran for 5 seconds
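The per-step times can also be totalled with a short script. The sketch below assumes pipe-delimited rows (JobID|JobName|Elapsed|...), the layout that `sacct --parsable2 --noheader` is expected to produce; `sum_step_seconds` is a hypothetical helper, and the sample rows are the job 90 results above rewritten by hand in that layout.

```shell
#!/bin/sh
# sum_step_seconds: read pipe-delimited sacct rows (JobID|JobName|Elapsed|...) on stdin
# and print the total Elapsed time, in seconds, of the numbered steps (.0, .1, ...).
# Hypothetical helper; the job row and the .batch and .extern rows are skipped.
sum_step_seconds() {
    awk -F'|' '
        $1 ~ /\.[0-9]+$/ {
            n = split($3, t, /[-:]/)
            if (n == 4) total += t[1] * 86400 + t[2] * 3600 + t[3] * 60 + t[4]
            else        total += t[1] * 3600 + t[2] * 60 + t[3]
        }
        END { print total + 0 }
    '
}

# The rows of job 90, rewritten by hand in pipe-delimited form.
printf '%s\n' \
    '90|sequential_test|00:00:21|1|COMPLETED||0:0' \
    '90.batch|batch|00:00:21|1|COMPLETED|5248K|0:0' \
    '90.extern|extern|00:00:21|1|COMPLETED||0:0' \
    '90.0|test1.sh|00:00:10|1|COMPLETED|856K|0:0' \
    '90.1|test2.sh|00:00:05|1|COMPLETED|856K|0:0' \
    '90.2|test3.sh|00:00:05|1|COMPLETED|852K|0:0' |
    sum_step_seconds    # prints 20
```

The steps add up to 20 seconds against 21 seconds for the whole job; the remaining second is presumably the overhead of starting the job and launching each step.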
Example 2: Parallel batch job
Below is an example that runs the same 3 programs in parallel:
Download script: parallel.sh
#!/bin/bash
#SBATCH --job-name=parallel_test
#SBATCH --ntasks=3 # reserve 3 slots for 3 parallel scripts
#SBATCH --cpus-per-task=1 # 1 CPU per task
#SBATCH --mem-per-cpu=4G # 4G RAM per CPU, so 4G for each script

# Launch each job step in the background using '&'
srun --ntasks=1 ~/slurm/test1.sh &
srun --ntasks=1 ~/slurm/test2.sh &
srun --ntasks=1 ~/slurm/test3.sh &

# Pause the batch script so it doesn't exit before the steps finish
wait
$ sbatch parallel.sh
Submitted batch job 97
$ sacct --format=JobID,JobName,Elapsed,NCPUS,State,MaxRSS,ExitCode -j 97
JobID        JobName    Elapsed    NCPUS      State      MaxRSS     ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
97           parallel_+ 00:00:11   3          COMPLETED             0:0
97.batch     batch      00:00:11   3          COMPLETED  14192K     0:0
97.extern    extern     00:00:11   3          COMPLETED             0:0
97.0         test3.sh   00:00:05   1          COMPLETED  896K       0:0
97.1         test1.sh   00:00:10   1          COMPLETED  900K       0:0
97.2         test2.sh   00:00:05   1          COMPLETED  904K       0:0
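The parallel job completes in 11 seconds instead of 21, even though each test script takes the same time as before (10, 5, and 5 seconds). Because the steps start concurrently, step numbers need not follow the order of the srun lines; here 97.0 is test3.sh.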
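The effect of `&` followed by `wait` can be tried on any machine without Slurm. In the sketch below, `sleep` stands in for the test scripts (with shorter durations, as an assumption for the sake of a quick demo), and `run_in_parallel` is a hypothetical function name.

```shell
#!/bin/sh
# Local illustration of the '&' + 'wait' pattern; no Slurm required.
# Assumption: 'sleep' stands in for the test scripts, with shorter durations.

# run_in_parallel: start three tasks in the background, wait for all of them,
# and print the elapsed wall-clock time in whole seconds.
run_in_parallel() {
    start=$(date +%s)
    sleep 2 &
    sleep 1 &
    sleep 1 &
    wait    # block until every background task has exited
    end=$(date +%s)
    echo $(( end - start ))
}

run_in_parallel    # about 2 (the longest task), not 4 (the sum)
```

Without the final `wait`, the function would return almost immediately, which is the local equivalent of a batch script exiting before its steps have finished.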
Example 3: Array batch job
Below is an example that uses a job array to run the test scripts in parallel:
Download script: array.sh
#!/bin/bash
#SBATCH --job-name=array_test
#SBATCH --ntasks=1 # 1 task per subject
#SBATCH --cpus-per-task=1 # 1 physical core per subject
#SBATCH --mem=4G # 4G RAM per subject
#SBATCH --array=1-3 # Creates 3 parallel sub-jobs (1, 2, and 3)

# Slurm automatically updates $SLURM_ARRAY_TASK_ID for each sub-job
~/slurm/test${SLURM_ARRAY_TASK_ID}.sh
$ sbatch array.sh
Submitted batch job 116
$ sacct --format=JobID,JobName,Elapsed,NCPUS,State,MaxRSS,ExitCode -j 116
JobID JobName Elapsed NCPUS State MaxRSS ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
116_1        array_test 00:00:11   1          COMPLETED             0:0
116_1.batch  batch      00:00:11   1          COMPLETED  1400K      0:0
116_1.extern extern     00:00:11   1          COMPLETED             0:0
116_2        array_test 00:00:05   1          COMPLETED             0:0
116_2.batch  batch      00:00:05   1          COMPLETED  1532K      0:0
116_2.extern extern     00:00:05   1          COMPLETED             0:0
116_3        array_test 00:00:05   1          COMPLETED             0:0
116_3.batch  batch      00:00:05   1          COMPLETED  1440K      0:0
116_3.extern extern     00:00:05   1          COMPLETED             0:0
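The array script works because the test scripts are numbered 1 to 3. When inputs have arbitrary names, one common pattern is to keep them in a list file and let each array task pick the line that matches its task ID. The sketch below is a minimal illustration under that assumption: `nth_line` is a hypothetical helper, the subject names are made up, and the task ID defaults to 1 so the script also runs outside Slurm.

```shell
#!/bin/sh
# nth_line FILE N: print line N of FILE (hypothetical helper).
nth_line() {
    sed -n "${2}p" "$1"
}

# Throwaway list for the demonstration; the subject names are made up.
list=$(mktemp)
printf '%s\n' sub-01 sub-02 sub-03 > "$list"

# Inside an array job Slurm sets SLURM_ARRAY_TASK_ID; default to 1 for a local run.
task_id=${SLURM_ARRAY_TASK_ID:-1}
subject=$(nth_line "$list" "$task_id")
echo "task $task_id processes $subject"

rm -f "$list"
```

With `#SBATCH --array=1-3`, the three sub-jobs would pick up lines 1, 2, and 3 of the list respectively.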
Documentation
For more documentation, see the official Slurm Workload Manager Documentation.