Jobs
Here, all you ever wanted to know about jobs: job scripts, submission, cancellation, status, monitoring & accounting...
Introduction
On gaudi, every job (serial or parallel) of any scientific application is submitted via the Slurm Workload Manager. Its primary task is to allocate computational tasks, i.e., batch jobs, among the available computing resources. The Slurm Workload Manager is an open-source distributed resource manager providing control over batch jobs and distributed compute nodes. Its name stands for Simple Linux Utility for Resource Management.
Job scripts
In order to run a job in batch mode, you must submit a script which contains the commands to be executed. The first line (a shebang: #!) indicates the shell used by the script.
The following lines, beginning with #SBATCH, are the Slurm directives.
Then comes a sequence of UNIX command-line instructions, typically: copying the input files to the temporary directory (TMPDIR or SCRDIR), running the program, and copying the output files from TMPDIR (SCRDIR) back to the HOME directory.
Here's a detailed example of an OpenMP submission script (16 cores on 1 node), where the comment lines (those that are neither the shebang nor #SBATCH directives) explain each step and should be removed for a functional script file:
#!/bin/bash
#a Bash shell script
#the 4 following lines are Slurm directives
#SBATCH -J Ultra
#Ultra is the name of the job
#SBATCH --nodes=1
#the job will run on 1 node (nodes=1)
#SBATCH --ntasks=16
#the job will run on 16 cores (ntasks=16)
#SBATCH --mem=42gb
#42gb of memory are required
#SBATCH --time=30:21:56
#the job will be killed if it exceeds 30 h 21 min 56 s
module load program_to_use
#load the module that corresponds to the program you want to use
cd $TMPDIR
#enter the TMPDIR directory on the node where the calculation is performed (typically for serial and OpenMP jobs)
# cd $SCRDIR
#enter the SCRDIR directory on the parallel file system (essentially, but not only, for MPI jobs)
cp $HOME/input_file(s) .
#copy the input file(s) from your HOME to your TMPDIR directory
exec input_file(s)
#exec is the executable file of the desired program
cp -f output_files $HOME/.
#copy the output file(s) to your HOME directory
/bin/rm -rf $TMPDIR $SCRDIR
#remove all traces of the calculation
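Note that, depending on how the application sets its number of threads, you may also need to export OMP_NUM_THREADS before the execution line; a minimal sketch, assuming the cores are reserved with --ntasks as in the example above:
export OMP_NUM_THREADS=$SLURM_NTASKS
#tell the OpenMP runtime to use as many threads as cores requested via --ntasks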
For a serial (sequential) calculation, the ntasks directive should be adjusted:
#SBATCH --ntasks=1
Likewise, for an MPI calculation that uses more than one compute node (say 3), the nodes directive must read:
#SBATCH --nodes=3
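For reference, here is a minimal sketch of a complete 3-node MPI submission script; the core count and the use of srun as launcher are assumptions to adapt to your application and to the MPI module you load:
#!/bin/bash
#SBATCH -J UltraMPI
#SBATCH --nodes=3
#SBATCH --ntasks=48
#SBATCH --mem=42gb
#SBATCH --time=30:21:56
module load program_to_use
cd $SCRDIR
cp $HOME/input_file(s) .
srun exec input_file(s)
cp -f output_files $HOME/.
/bin/rm -rf $SCRDIR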
Notes:
- We recommend that you automatically load the slurm module upon login by adding the following command to your .bashrc file:
module load slurm
- You can list all the installed modules with the command: module avail (a couple of other useful module commands are sketched just after these notes);
- During the calculation, the output and temporary files are stored either in the TMPDIR directory of the compute node (serial & OpenMP jobs) or in the SCRDIR directory of the parallel file system (MPI jobs). You have full control over this by specifying TMPDIR or SCRDIR in your Slurm submission script.
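Besides module avail, the usual commands of the module tool can be used to inspect or reset your environment, for instance (standard Environment Modules/Lmod commands, assumed to be available on gaudi):
$ module list
$ module purge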
Partitions
Depending on the number of nodes (--nodes), of cores (--ntasks), of GPUs (--gres=gpu) and on the amount of memory (--mem) required, your job will belong to a specific partition:
- serial : serial calculations (1 core);
- omp : OpenMP calculations;
- parac4 or parac3 : MPI calculations;
- gpu : calculations that use GPUs;
- c4 : any calculations which use the most recent compute nodes;
- bigmem : calculations that require a huge amount of memory;
- amd : calculations on the AMD compute node (the other nodes have Intel CPUs);
- ens : only for teaching and classroom calculations.
Partition | Affected compute nodes
serial | c3n[00-01]
omp | c4n[00-07],c3n[02-34,39-42]
parac4 or parac3 | c4n[08-15,20-21] or c3n[02-34,39-42]
gpu | c4n[17-19],c3n[36-38]
c4 | c4n[00-15,20-21]
bigmem | c3n35
amd | c4n16
ens | node-d0[1-7]
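If needed, a partition can also be requested explicitly in the submission script with the --partition directive; for example, for a job using one GPU (the gpu:1 count is only an illustration, to be adjusted to your needs):
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1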
Submission
From a script file suitable for the application you want to use, the submission is done via the sbatch command:
$ sbatch script.slurm
If all goes well, the following line then appears:
Submitted batch job 101
which indicates the job identifier assigned by Slurm (JOBID, here 101).
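If you submit from another script and want to reuse this identifier, the --parsable option of sbatch prints only the JOBID, which can then be stored in a variable (the variable name below is just an illustration):
$ JOBID=$(sbatch --parsable script.slurm)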
Monitoring jobs
The squeue command provides a status listing of all jobs and partitions associated with the cluster.
$ squeue
resulting in:
JOBID PARTITION   NAME   USER ST       TIME NODES NODELIST(REASON)
11584       gpu tarass  user1  R      28:23     1 c4n18
11581        c4 boulba  Ostap  R    1:25:47     1 c4n04
11580        c4 est un Bender  R    1:34:37     1 c4n03
11571        c4  roman  user2  R 2-08:44:41     1 c4n02
11555       omp     de  user3  R 4-01:47:35     1 c4n04
11550        c4  gogol  Ostap  R 4-12:00:58     1 c4n03
The columns give respectively:
- the job identifier assigned by Slurm : JOBID;
- the partition in which the job currently resides : PARTITION;
- the job name given by the submitter : NAME;
- the job owner : USER;
- the job's current state : ST;
- the amount of wall time used by the job (dd-hh:mm:ss) : TIME;
- the number of nodes requested by the job : NODES;
- the node(s) allocated to the job : NODELIST.
The most common job state codes are :
- R : job is running;
- PD : job is awaiting resource allocation;
- CD : job has terminated all processes on all nodes;
- F : job terminated with non-zero exit code or other failure condition.
To list only the jobs of a given user (here Ostap):
$ squeue -u Ostap
It will result in:
JOBID PARTITION    NAME   USER ST       TIME NODES NODELIST(REASON)
11581        c4 strange  Ostap  R    1:25:47     1 c4n04
11550        c4    love  Ostap  R 4-12:00:58     1 c4n03
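The listing can also be restricted to a given partition or to jobs in a given state, for instance (standard squeue options):
$ squeue -p gpu
$ squeue -u $USER -t PD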
The scontrol command and its multiple options can be used to show more detailed information about a job, such as the job with ID 101 :
$ scontrol show jobid 101
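Since this output is rather verbose, you may want to keep only a few fields of interest, for instance (JobState, RunTime and NodeList are among the fields usually reported by scontrol show job):
$ scontrol show jobid 101 | grep -E 'JobState|RunTime|NodeList'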
For more options on available information about jobs located in the Slurm scheduling queue, see the squeue and/or scontrol manpages.
Cancellation
For several reasons (that are your own), you may need to kill a job. A batch job may be deleted by its owner using the scancel command followed by the JOBID, as shown below for JOBID = 101:
$ scancel 101
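scancel also accepts filters; for example, to cancel all your own jobs, or only those still waiting in the queue (standard scancel options):
$ scancel -u $USER
$ scancel -u $USER -t PENDING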
How to check current calculations?
It is relatively easy.
- For serial calculations or calculations using OpenMP, the output and (possibly) temporary files are stored on the compute node where the job runs (displayed by using squeue -u $USER). Thus, to check the current output file of a running job, log in via ssh to the node allocated to the job and then enter your TMPDIR directory, which is defined as /tmp/${USER}_${JOBID}:
$ ssh c4n02
$ cd /tmp/Ostap_101
- For calculations using MPI, the output and (possibly) temporary files are usually stored on the parallel file system located on the /scratch partition. Your SCRDIR directory is then /scratch/${USER}_${JOBID}:
$ cd /scratch/Ostap_101
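From there, the progress of the run can be followed with the usual commands, for instance (output_file stands for the file actually produced by your application):
$ tail -f output_file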
Job accounting
The sacct command displays accounting information on jobs, job steps, status, and exit codes, whether the jobs have already finished, are currently running, or are still waiting to run. Its output, as with most Slurm informational commands, can be customized in a large number of ways; see the sacct manpage.
Here are a few of the more useful options:
- Information about the job with ID 101:
$ sacct -j 101
- Information about all the jobs of the user Ostap since May 6th, 1973:
$ sacct -u Ostap --starttime 1973-05-06
- Display the maximum number of bytes written by all tasks and the maximum virtual memory size of all tasks in the job:
$ sacct -j 101 --format=MaxDiskWrite,MaxVMSize
Finally, by using the Slurm sreport command with appropriate options, you can check if you are the best or at least among the best (although it is not really obvious that the term best is the most appropriate):
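A minimal sketch, assuming you are after the ranking of the top users over a given period (the dates are placeholders; see the sreport manpage for the full list of available reports):
$ sreport user TopUsage start=2024-01-01 end=2024-12-31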