The job scheduler: Slurm

The cluster workload manager is Slurm.
Computations on the cluster are carried out via a job scheduler, which manages the queue and launches computations when the requested resources become available.
Reminder: it is strictly forbidden to run computations directly on the frontend node (zeus)!
Instead of executing their programs directly, users submit scripts to the Slurm job scheduler. These scripts specify the requested resources (which machines, for how long, etc.) and the commands to be executed on the allocated resources.
It is therefore very important that the requested resources match the requirements of the codes to be executed. As long as the requested resources are unavailable the script will not start (so it is counterproductive to ask for more resources than needed), and if the allocated resources are insufficient the program will fail to complete.
This page provides necessary information to write and submit job scripts adapted to your needs.
Maximum walltime
Caution: the maximum allowed walltime depends on the number of cores requested:
- up to 48 cores, the limit is 384 hours (16 days);
- above 48 cores, the limit decreases proportionally with the number of cores requested.
In other words, the product of the number of cores and the walltime cannot exceed 48 x 384 = 18432 core-hours.
The table below gives some examples; the relation is linear, and the small sketch after the table shows how to compute the limit for any core count.
| #cores | Walltime max (hours) | Walltime max (days) |
|---|---|---|
| <=48 | 384 | 16 |
| 64 | 288 | 12 |
| 96 | 192 | 8 |
| 128 | 144 | 6 |
| 192 | 96 | 4 |
| 256 | 72 | 3 |
| 384 | 48 | 2 |
| 512 | 36 | 1.5 |
| 768 | 24 | 1 |
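If in doubt, the limit can be computed directly. The following shell sketch (a hypothetical helper, not a tool provided on the cluster) applies the 48 x 384 = 18432 core-hour rule for a given number of cores:
# Hypothetical helper, not provided on the cluster: apply the 18432 core-hour rule
cores=96                          # number of cores you plan to request
if [ "$cores" -le 48 ]; then
    max_hours=384
else
    max_hours=$(( 18432 / cores ))
fi
echo "With ${cores} cores, the maximum walltime is ${max_hours} hours."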
Interactive jobs
To run short tests you may use the following interactive machines:
| Name | Type | CPU | Memory |
|---|---|---|---|
| interactive-1 | CPU | 2x10 cores @2.20GHz, Intel Xeon Silver 4114 | 192 GB |
| interactive-2 | CPU | 2x10 cores @2.20GHz, Intel Xeon Silver 4114 | 192 GB |
From zeus these machines are accessible directly via ssh.
The command
ssh interactive
will connect you to either of the two machines interactive-[1,2].
Like the frontend, these nodes are shared among all users.
You should therefore avoid launching long-running jobs on them, and keep this sharing in mind when interpreting performance measurements.
For longer tests you must submit a job via the job scheduler.
Job submission (sbatch)
The following command submits a script called job.slurm to the job scheduler
sbatch job.slurm
The job.slurm file contains instructions for Slurm as well as the commands to execute on the granted resources. The sbatch command returns a unique identifier for the job (JOBID).
For example, the following script describes a job that reserves 1 node (--nodes=1) for a duration of at most 10 minutes and executes the commands hostname and sleep 60.
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --time=00:10:00
hostname
sleep 60
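Once submitted, a job can be followed and, if necessary, cancelled with the standard Slurm commands. A minimal sketch of such a session (the JOBID 123456 is purely illustrative):
sbatch job.slurm        # prints something like: Submitted batch job 123456
squeue -j 123456        # show the state of this job (PD = pending, R = running)
squeue -u $USER         # list all of your jobs
scancel 123456          # cancel the job if it is no longer needed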
Slurm instructions (more precisely, sbatch options) start with #SBATCH followed by an option. The main sbatch options are listed below.
Some sample submission scripts for frequently used codes can be found in the /share/doc/submission/ folder.
If you would like to see an example script added or spot a mistake, let us know (hpc@univ-lille.fr).
--time, -t
The maximum duration of the job. Beyond this limit the job will be killed by the system.
Valid formats include:
- #SBATCH --time 24:00:00 → 24 hours (the default value)
- #SBATCH --time 2-12 → 2 days and 12 hours
- #SBATCH --time 30 → 30 minutes
- #SBATCH --time 30:00 → also 30 minutes
The equivalent short syntax is -t 24:00:00.
Any job that requests more than the maximum allowed walltime will remain in the waiting queue indefinitely.
--nodes, -N
The number of requested nodes. For example, to request 1 node (default value):
#SBATCH --nodes=1
To request at least 2 and at most 4 nodes:
#SBATCH --nodes=2-4
--ntasks-per-node
The number of tasks (cores) requested per node (here 1, the default value).
This option is to be used in conjunction with --nodes:
#SBATCH --ntasks-per-node=1
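For illustration, here is a sketch of a script spread over 2 nodes with 32 tasks each (the 32 tasks per node is an assumption; adjust it to the actual size of the nodes). With 64 cores in total, the walltime may not exceed 288 hours (see the table above):
#!/bin/bash
#SBATCH --nodes=2               # 2 nodes
#SBATCH --ntasks-per-node=32    # 32 tasks per node, i.e. 64 cores in total
#SBATCH --time=288:00:00        # the maximum allowed for 64 cores
srun hostname                   # srun runs one instance of the command per task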
--job-name, -J
The name of the job:
#SBATCH --job-name=my_job
--mem
The maximum amount of memory required per node (here 1024 MB or 1 GB, which is the default).
#SBATCH --mem=1024M
Slurm accepts suffixes other than M (the default), for example G, so the following two options are equivalent:
- #SBATCH --mem=2048 → 2048 MB
- #SBATCH --mem=2G → 2 GB = 2048 MB
Be aware of the difference between --mem=128G and --mem=128000! A job that requests --mem=128G (131072 MB) will not be able to run on the 128 GB nodes (see this), as those are configured with MEMORY=128000 in Slurm.
For further information on how to set this value appropriately, see this.
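For example, to make sure a large job can still be scheduled on the 128 GB nodes mentioned above, request the memory in MB rather than with the G suffix (the value below simply matches the MEMORY=128000 configuration quoted above; adapt it to the nodes you target):
#SBATCH --mem=128000            # 128000 MB: fits the nodes configured with MEMORY=128000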
--mail-user
To receive email notifications concerning your job:
#SBATCH --mail-type=ALL
#SBATCH --mail-user=you@univ-lille.fr
--gres
gres stands for generic consumable resource. gres is most often used to reserve GPUs (see this for further information). For instance,
#SBATCH --gres=gpu:2
requests two GPUs per node (without specifying the GPU type).
The command sbatch --gres=help lists the available gres options.
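As a sketch (the job name, GPU count and the use of nvidia-smi are just illustrations, not recommendations for a specific code), a minimal GPU job script could look like:
#!/bin/bash
#SBATCH --job-name=gpu_test
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:1            # one GPU on the node
#SBATCH --time=01:00:00
nvidia-smi                      # shows the GPU(s) actually allocated to the job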
--output, --error
These options specify the output files.
By default, stdout and stderr streams are redirected to a file named slurm-%j.out (where %j is the SLURM_JOB_ID) inside the launch directory.
It is also possible to create one output file per task or per node. The Slurm documentation explains how to create output filenames based on job parameters.
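For instance, the following lines (the filenames are purely illustrative) separate stdout and stderr and name the files after the job name (%x) and the JOBID (%j):
#SBATCH --output=%x-%j.out      # stdout → <job name>-<JOBID>.out
#SBATCH --error=%x-%j.err       # stderr → <job name>-<JOBID>.err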