
Monitoring and managing jobs

squeue

Shows your jobs currently waiting in the queue or running.

$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
... ... ...

This command provides information such as the job ID, the reserved nodes, the elapsed time, etc.

The ST column shows the state of the job. Among the possible states, the most frequent are R (Running), PD (Pending) and F (Failed).

For pending jobs (PD), the (REASON) column shows why the job is still waiting; the list of possible reasons is rather long.

The two reasons most frequently met are Priority (other jobs have a higher priority) and Resources (the job is waiting for resources to become available). If another reason is shown, it may be worth checking whether the resource request can actually be satisfied.

Note: on Zeus, this command shows only your own jobs.
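
If the default layout is not detailed enough, the output can be customised with standard squeue options; for example, the following call (the format string and column widths are only an illustration) lists your pending jobs with a wider name column:

$ squeue -t PENDING -o "%.10i %.9P %.30j %.2t %.10M %R"

Here %i is the job ID, %P the partition, %j the job name, %t the compact state, %M the elapsed time and %R the reason or node list.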

scancel

Allows you to cancel your jobs.

  • scancel JOBID cancels the job JOBID.
  • scancel -n toto cancels all jobs named toto.
  • scancel -n toto -t PENDING cancels all jobs named toto that are in the pending state.
  • scancel -u user.login cancels all jobs of the user user.login (options can be combined, as shown below).
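
For example, the following standard scancel call would cancel all of your pending jobs while leaving the running ones untouched (replace user.login with your own username):

scancel -u user.login -t PENDING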

sinfo

Gives information on the current state of the cluster, the available resources and their configuration.

It is possible to format sinfo's output to get more or less detailed information.

For instance,

  • sinfo -s gives a summary of the state of the cluster
  • sinfo -N --long gives more detailed node-by-node information
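
Like squeue, sinfo also accepts a custom output format. The following format string is given purely as an illustration; it prints, for each partition, its availability, time limit, number of nodes, state and node list:

$ sinfo -o "%P %.5a %.10l %.6D %.6t %N"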

sacct

Gives information on past jobs. For example,

  • sacct -S MMDD

returns a list of jobs submitted since a given date where MM corresponds to the month and DD to the day of the current year.

For instance, to get the history of jobs since July 15:

  • sacct -S 0715

You can also define an end date with the -E option:

  • sacct -S MMDD -E MMDD
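
The -S and -E options can be combined with -o to choose the reported fields. For example, the following call (the field list is only an illustration) shows the jobs that ran between July 15 and August 1, together with their state and exit code:

sacct -S 0715 -E 0801 -o jobid,jobname,state,elapsed,exitcode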

Adjust requested memory

Slurm continuously monitors the resources consumed by each job, both the number of cores and the amount of memory. Jobs that consume more resources than requested are killed automatically.

If a job exceeds the requested amount of memory, the following error message may appear in the output file:

slurmstepd: error: Exceeded step memory limit at some point.

While it may be difficult to accurately estimate the required amount of memory, Slurm allows you to query the amount of memory actually used by a job after its execution.

After completion of a job you may use the following command (replacing JOBID by your job identifier):

sacct -o jobid,jobname,reqnodes,reqcpus,reqmem,maxrss,averss,elapsed -j JOBID

The output will be similar to this one :

    ReqMem     MaxRSS     AveRSS    Elapsed
---------- ---------- ---------- ----------
   55000Mn                         00:08:33
   55000Mn  17413256K  16269776K   00:08:33
   55000Mn  17440808K  16246408K   00:08:32

where ReqMem is the amount of memory requested with

#SBATCH --mem=55000M

MaxRSS is the maximum amount of memory used on one node and AveRSS is the average amount used per node.

Here, the memory consumption peaked at about 18 GB per node (MaxRSS ≈ 17440808 KB ≈ 17.9 GB).

You might consider requesting less memory for similar jobs, for example:

#SBATCH --mem=20G
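
For context, such a directive goes in the header of the job script alongside the other resource requests. A minimal sketch, in which the job name, core count, time limit and program are purely illustrative:

#!/bin/bash
#SBATCH --job-name=myjob      # illustrative job name
#SBATCH --ntasks=24           # illustrative number of cores
#SBATCH --mem=20G             # adjusted memory request
#SBATCH --time=01:00:00       # illustrative time limit

srun ./my_program             # hypothetical executable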

To see this information for past jobs since YYYY-MM-DD:

sacct -o jobid,jobname,reqnodes,reqcpus,reqmem,maxrss,averss,elapsed -S YYYY-MM-DD

Watch out: if you get an error message indicating that you have exceeded the memory limit, the reported MaxRSS value is not necessarily larger than ReqMem, because the job is killed before Slurm records the peak value.

The cluster contains three types of nodes:

  • 24 cores with 128GB or 192GB of memory
  • 32 cores with 192GB of memory
  • 32 cores with 512GB of memory

Jobs are automatically placed on different types of nodes according to the requested resources.
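
Because placement depends on the request, it can help to shape the request so that it fits one node type. For instance, a job that needs the large-memory nodes could ask for something along these lines (the values are illustrative and the exact limits on Zeus may differ):

#SBATCH --ntasks=32
#SBATCH --mem=500G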