Jobs
Here, all you ever wanted to know about jobs: job scripts, submission, cancellation, status, monitoring & accounting...
Introduction
On gaudi, every job (serial or parallel) of any scientific application is submitted via the Slurm Workload Manager. Its primary task is to allocate computational tasks, i.e., batch jobs, among the available computing resources. The Slurm Workload Manager is an open-source distributed resource manager providing control over batch jobs and distributed compute nodes. Its name stands for Simple Linux Utility for Resource Management.
Job scripts
In order to run a job in batch mode, you must submit a script which contains the commands to be executed. The first line (a shebang: #!) indicates the shell used by the script.
The following lines, beginning with #SBATCH, are the Slurm directives.
Then comes a sequence of UNIX command-line instructions, typically: copying the input files to the temporary directory (TMPDIR or SCRDIR), running the program, and copying the output files from TMPDIR (SCRDIR) back to the HOME directory.
Here's a detailed example of an OpenMP submission script (16 cores on 1 node), where the comment lines (those that are neither the shebang nor #SBATCH directives) explain each step and should be removed for a functional script file:
#!/bin/bash
#a Bash shell script
#the 4 following lines are Slurm directives
#SBATCH -J Ultra
#Ultra is the name of the job
#SBATCH --nodes=1
#the job will run on 1 node (nodes=1)
#SBATCH --ntasks=16
#the job will run on 16 cores (ntasks=16)
#SBATCH --mem=42gb
#42gb of memory are required
#SBATCH --time=30:21:56
#the job will be killed if it exceeds 30 h 21 min 56 s
module load program_to_use
#load the module that corresponds to the program you want to use
cd $TMPDIR
#enter the TMPDIR directory on the node where the calculation is performed (typically for serial and OpenMP jobs)
# cd $SCRDIR
#enter the SCRDIR directory on the parallel file system (essentially, but not only, for MPI jobs)
cp $HOME/input_file(s) .
#copy the input file(s) from your HOME to your TMPDIR directory
exec input_file(s)
#exec is the executable file of the desired program
cp -f output_files $HOME/.
#copy the output file(s) to your HOME directory
/bin/rm -rf $TMPDIR $SCRDIR
#remove all traces of the calculation
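Note that, depending on how the application sets its number of threads, you may also need to export OMP_NUM_THREADS before the execution line; a minimal sketch, assuming the cores are reserved with --ntasks as in the example above:
export OMP_NUM_THREADS=$SLURM_NTASKS
#tell the OpenMP runtime to use as many threads as cores requested via --ntasks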
For a serial (sequential) calculation, the ntasks directive should be adjusted:
#SBATCH --ntasks=1
Likewise, for an MPI calculation that uses more than one compute node (say 3), the nodes directive must read:
#SBATCH --nodes=3
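For reference, here is a minimal sketch of a complete 3-node MPI submission script; the core count and the use of srun as launcher are assumptions to adapt to your application and to the MPI module you load:
#!/bin/bash
#SBATCH -J UltraMPI
#SBATCH --nodes=3
#SBATCH --ntasks=48
#SBATCH --mem=42gb
#SBATCH --time=30:21:56
module load program_to_use
cd $SCRDIR
cp $HOME/input_file(s) .
srun exec input_file(s)
cp -f output_files $HOME/.
/bin/rm -rf $SCRDIR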
Notes:
- We recommend that you automatically load the slurm module upon login by adding the following command to your .bashrc file:
module load slurm
- You can list all the installed modules with the command: module avail (a couple of other useful module commands are sketched just after these notes);
- During the calculation, the output and temporary files are stored either in the TMPDIR directory of the compute node (serial & OpenMP jobs) or in the SCRDIR directory of the parallel file system (MPI jobs). You have full control over this by specifying TMPDIR or SCRDIR in your Slurm submission script.
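Besides module avail, the usual commands of the module tool can be used to inspect or reset your environment, for instance (standard Environment Modules/Lmod commands, assumed to be available on gaudi):
$ module list
$ module purge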
Partitions
Depending on the number of nodes (--nodes), of cores (--ntasks), of GPUs (--gres=gpu) and on the amount of memory (--mem) required, your job will belong to a specific partition:
- serial : serial calculations (1 core);
- omp : OpenMP calculations;
- parac4 or parac3 : MPI calculations;
- gpu : calculations that use GPUs;
- c4 : any calculations which use the most recent compute nodes;
- bigmem : calculations that require a huge amount of memory;
- amd : calculations on the AMD compute node (the other nodes have Intel CPUs);
- ens : only for teaching and classroom calculations.
Partition | Affected compute nodes
serial | c3n[00-01]
omp | c4n[00-07],c3n[02-34,39-42]
parac4 or parac3 | c4n[08-15,20-21] or c3n[02-34,39-42]
gpu | c4n[17-19],c3n[36-38]
c4 | c4n[00-15,20-21]
bigmem | c3n35
amd | c4n16
ens | node-d0[1-7]
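If needed, a partition can also be requested explicitly in the submission script with the --partition directive; for example, for a job using one GPU (the gpu:1 count is only an illustration, to be adjusted to your needs):
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1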
Submission
From a script file suitable for the application you want to use, the submission is done via the sbatch command:
$ sbatch script.slurm
If all goes well, the following line then appears:
Submitted batch job 101
which indicates the job identifier assigned by Slurm (JOBID, here 101).
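If you submit from another script and want to reuse this identifier, the --parsable option of sbatch prints only the JOBID, which can then be stored in a variable (the variable name below is just an illustration):
$ JOBID=$(sbatch --parsable script.slurm)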
Monitoring jobs
The squeue command provides a status listing of all jobs and partitions associated with the cluster.
$ squeue
resulting in:
JOBID PARTITION   NAME   USER ST       TIME NODES NODELIST(REASON)
11584       gpu tarass  user1  R      28:23     1 c4n18
11581        c4 boulba  Ostap  R    1:25:47     1 c4n04
11580        c4 est un Bender  R    1:34:37     1 c4n03
11571        c4  roman  user2  R 2-08:44:41     1 c4n02
11555       omp     de  user3  R 4-01:47:35     1 c4n04
11550        c4  gogol  Ostap  R 4-12:00:58     1 c4n03
The columns give respectively:
- the job identifier assigned by Slurm : JOBID;
- the partition in which the job currently resides : PARTITION;
- the job name given by the submitter : NAME;
- the job owner : USER;
- the job's current state : ST;
- the amount of wall time used by the job (dd-hh:mm:ss) : TIME;
- the number of nodes requested by the job : NODES;
- the node(s) allocated to the job : NODELIST.
The most common job state codes are :
- R : job is running;
- PD : job is awaiting resource allocation;
- CD : job has terminated all processes on all nodes;
- F : job terminated with non-zero exit code or other failure condition.
To list only the jobs of a given user (here Ostap):
$ squeue -u Ostap
It will result in:
JOBID PARTITION    NAME   USER ST       TIME NODES NODELIST(REASON)
11581        c4 strange  Ostap  R    1:25:47     1 c4n04
11550        c4    love  Ostap  R 4-12:00:58     1 c4n03
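The listing can also be restricted to a given partition or to jobs in a given state, for instance (standard squeue options):
$ squeue -p gpu
$ squeue -u $USER -t PD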
The scontrol command and its multiple options can be used to show more detailed information about a job, such as the job with ID 101 :
$ scontrol show jobid 101
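Since this output is rather verbose, you may want to keep only a few fields of interest, for instance (JobState, RunTime and NodeList are among the fields usually reported by scontrol show job):
$ scontrol show jobid 101 | grep -E 'JobState|RunTime|NodeList'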
For more options on available information about jobs located in the Slurm scheduling queue, see the squeue and/or scontrol manpages.
Cancellation
For several reasons (that are your own), you may need to kill a job. A batch job may be deleted by its owner using the scancel command followed by the JOBID, as shown below for JOBID = 101:
$ scancel 101
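scancel also accepts filters; for example, to cancel all your own jobs, or only those still waiting in the queue (standard scancel options):
$ scancel -u $USER
$ scancel -u $USER -t PENDING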
How to check current calculations?
It is relatively easy.
- For serial calculations or calculations using OpenMP, the output and (possibly) temporary files are stored on the compute node where the job runs (displayed by using squeue -u $USER). Thus, to check the current output file of a running job, log in via ssh to the node allocated to the job and then enter your TMPDIR directory, which is defined as /tmp/${USER}_${JOBID}:
$ ssh c4n02
$ cd /tmp/Ostap_101
- For calculations using MPI, the output and (possibly) temporary files are usually stored on the parallel file system located on the /scratch partition. Your SCRDIR directory is then /scratch/${USER}_${JOBID}:
$ cd /scratch/Ostap_101
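From there, the progress of the run can be followed with the usual commands, for instance (output_file stands for the file actually produced by your application):
$ tail -f output_file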
Job accounting
The sacct command displays accounting information on jobs, job steps, status, and exit codes, whether the jobs have already finished, are currently running, or are still waiting to run. Its output, as with most Slurm informational commands, can be customized in a large number of ways; see the sacct manpage.
Here are a few of the more useful options:
- Information about the job with ID 101:
$ sacct -j 101
- Information about all the jobs of the user Ostap since May 6th, 1973:
$ sacct -u Ostap --starttime 1973-05-06
- Display the maximum number of bytes written by all tasks and the maximum virtual memory size of all tasks in the job:
$ sacct -j 101 --format=MaxDiskWrite,MaxVMSize
Finally, by using the Slurm sreport command with appropriate options, you can check if you are the best or at least among the best (although it is not really obvious that the term best is the most appropriate):
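A minimal sketch, assuming you are after the ranking of the top users over a given period (the dates are placeholders; see the sreport manpage for the full list of available reports):
$ sreport user TopUsage start=2024-01-01 end=2024-12-31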