Submission of a large number of jobs

Suppose we want to perform a large number of similar, independent and sequential (or weakly parallel) tasks.

For instance, we might want to execute the same sequential program with different input parameters, stored in the following files:

file_1
file_2
...
file_N

Let's suppose my_program takes one of these files as an input parameter. If N is large and/or the computations in my_program take a long time, then processing these N files sequentially will be slow.

The following presents different approaches and briefly discusses pros and cons.

Naive approach

caution

Except for a small number of jobs, this approach is not recommended.

We could write a Slurm script of type single-core which takes a file as an argument and passes it to the program my_program.

naive-manyjobs.slurm
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --time=00:15:00
./my_program $1

Then we could submit this script N times, each time with the corresponding file:

for i in $(seq 1 $N); do sbatch naive-manyjobs.slurm file_$i; done

This approach is simple. However, for a very large number of jobs it will saturate the job queue and may slow down the scheduler. Moreover, Slurm works in cycles and is configured to consider at most 50 jobs per scheduling cycle.

It is also quite inefficient, especially for short-running jobs, as the waiting time and scheduling overhead are likely to exceed the actual running time.

Job arrays

Slurm's Job Array feature allows all of these jobs to be submitted with a single script. For example (with N=1000):

jobarray.slurm
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --time=00:15:00
#SBATCH --array=1-1000%24

./my_program file_$SLURM_ARRAY_TASK_ID

The --array option creates a job array (a collection of jobs with identical submission options) and defines environment variables like SLURM_ARRAY_TASK_ID.

Here an array of 1000 jobs is created (1-1000) and the %24 suffix (optional) limits the number of jobs allowed to run simultaneously to 24.
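
The array index does not have to encode the file name directly. A common pattern for arbitrary input names is to list them in a text file, one per line, and let each task pick its line. A minimal sketch, assuming a hypothetical filelist.txt:

input=$(sed -n "${SLURM_ARRAY_TASK_ID}p" filelist.txt)  # pick the line matching the task index
./my_program "$input"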

By default, stdout is redirected to files named

slurm-${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}.out
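
The name of these files can be customized with the --output option; for job arrays, the pattern %A expands to the master job ID and %a to the task index:

#SBATCH --output=slurm-%A_%a.out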

This approach is more efficient than the previous one.

However, the submission of a very large job array may overload the scheduler; the maximum job-array size is therefore limited to 10000.
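
If more than 10000 tasks are needed, one possible workaround (a sketch, not an official recipe) is to submit several arrays and pass an offset as an argument to the script:

sbatch --array=1-10000 jobarray.slurm 0
sbatch --array=1-10000 jobarray.slurm 10000

with the last line of jobarray.slurm adapted to add the offset:

./my_program file_$((SLURM_ARRAY_TASK_ID + $1))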

Background jobs

In order to avoid saturating the job queue, we could move the for-loop inside the script:

for i in $(seq 1 $N); do ./my_program file_$i; done

The obvious problem with this is the sequential processing of the files.

One solution is to write a script of type multi-core and launch several concurrent instances of my_program on the available cores. For N=1000 it could look like this:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=24
#SBATCH --time=12:00:00

N=1000

for i in $(seq 1 $N)
do
    ./my_program file_${i} &                          # launch in the background (non-blocking)
    [[ $((i % SLURM_CPUS_PER_TASK)) -eq 0 ]] && wait  # every SLURM_CPUS_PER_TASK tasks, wait for the batch
done
wait  # make sure the last batch has finished

The & at the end of the my_program line launches each instance in the background (non-blocking), and the wait command, reached every SLURM_CPUS_PER_TASK iterations, blocks until all background tasks of the current batch have completed. The final wait ensures the last batch finishes before the script (and with it the Slurm job) exits.

This approach is suboptimal because the duration of each group of SLURM_CPUS_PER_TASK tasks is determined by its longest-running task, so for highly imbalanced tasks it wastes cores.
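
A variant that keeps all cores busy starts a new task as soon as any running one finishes. A minimal sketch, requiring bash 4.3 or later for wait -n:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=24
#SBATCH --time=12:00:00

N=1000

for i in $(seq 1 $N)
do
    # while all slots are busy, wait for any one task to finish
    while [[ $(jobs -rp | wc -l) -ge $SLURM_CPUS_PER_TASK ]]
    do
        wait -n
    done
    ./my_program file_${i} &
done
wait  # wait for the remaining tasks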

GNU parallel

A very good solution is to use the GNU Parallel tool.

To get an idea of how it works, issue the following command, which runs ten echo commands in parallel (by default one per available core) and prints the numbers from 1 to 10:

parallel echo {} ::: {1..10}

Inside a Slurm script, it could look like this:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=16
#SBATCH --time=00:10:00
#SBATCH --job-name=gnu_parallel

parallel -j $SLURM_CPUS_PER_TASK --joblog jobs.log 'sleep {%}s && echo core {%} : task {}' ::: {1..100}

The -j $SLURM_CPUS_PER_TASK option tells parallel to run at most 16 tasks simultaneously, and --joblog writes an execution log (start time, runtime, exit status and command of each task) to jobs.log.

Inside the command, {} is replaced by each input value (the numbers from 1 to 100) and {%} by the job slot number (between 1 and the number of simultaneous jobs).
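
Applied to our N=1000 input files, the Slurm script could look like the following sketch; --resume (combined with --joblog) is optional and lets a resubmitted job skip tasks that already completed:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=24
#SBATCH --time=12:00:00
#SBATCH --job-name=gnu_parallel

# run at most SLURM_CPUS_PER_TASK instances of my_program at a time
parallel -j $SLURM_CPUS_PER_TASK --joblog jobs.log --resume ./my_program file_{} ::: {1..1000}

Unlike the background-jobs approach, parallel starts a new task as soon as a slot frees up, so imbalanced task durations are handled gracefully.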