Maxwell : Running batch jobs



SchedMD offers exhaustive documentation on how to use SLURM: http://slurm.schedmd.com/. We have collected just a few examples below.

Maxwell useful commands provides a short list of commands which might come in handy.

Basics

sbatch is the prime command to run jobs on the Maxwell cluster. It accepts a script, copies the script to the scheduler, and the scheduler executes the script as soon as compute resources become available. It's particularly handy for running a large number of tasks without the need to worry about the whereabouts of the jobs. Unlike salloc or srun it returns your shell immediately, so you can continue to work, and the jobs will not be affected by accidentally closing a shell or session, or by a crash of the login node used to submit them. The jobs won't even be affected by accidentally deleting the job script: slurm keeps a copy, and you can recreate the job script while the job is still running or pending. The details are explained on a separate page on running batch jobs.
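For example, slurm's copy of the script of a running or pending job can be written back to a file with scontrol (a minimal sketch; the job id and output filename are arbitrary):

# recreate the job script of job 1516 from slurm's copy
[@max-wgs ~]$ scontrol write batch_script 1516 recovered-job.sh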

Runtime parameters

A batch job needs to inform the scheduler (slurm) about certain parameters, like the partition a job should run in, how long the job is expected to run, how many nodes it should allocate and so on. All these parameters are explained in schedmd's sbatch documentation. The parameters can be declared as "comments" in a batch script, or on the command line (see below), and either in extended or abbreviated form. Parameters provided on the command line always override parameters specified in the batch script.
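For example, the following submission overrides whatever partition and runtime my-job.sh declares in its #SBATCH lines (my-job.sh is just a placeholder name):

# command-line options win over the #SBATCH values inside the script
[@max-wgs ~]$ sbatch --partition=allcpu --time=0-01:00:00 my-job.sh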

The most commonly used parameters:

Parameter            | batch comment                           | abbreviation | default                   | environment in     | environment out       | purpose
---------------------+-----------------------------------------+--------------+---------------------------+--------------------+-----------------------+--------------------------------------------------------------
Partition            | #SBATCH --partition=maxcpu             | -p maxcpu    | maxcpu                    | SBATCH_PARTITION   | SLURM_JOB_PARTITION   | set the partition a job should run in
Runtime              | #SBATCH --time=1-00:00:00              | -t 1-01:00   | depends on the partition! | SBATCH_TIMELIMIT   |                       | set the duration of a job
Number of Nodes      | #SBATCH --nodes=1                      | -N 2         | 1 node                    |                    | SLURM_JOB_NUM_NODES   | min-max number of nodes per job
Working Directory    | #SBATCH --chdir=somewhere              | -D somewhere | current working directory |                    |                       | job's starting point
Job ID               |                                        |              |                           |                    | SLURM_JOB_ID          | job ID to be used in scontrol, sacct, ...
Job Name             | #SBATCH --job-name=testjob             | -J testjob   | name of the command       | SBATCH_JOB_NAME    | SLURM_JOB_NAME        | define a custom name, useful for accounting and dependencies
Standard Output File | #SBATCH --output=first-%N-%j.out       | -o job.log   | slurm-<jobid>.out         |                    |                       | file to which STDOUT is written
Error Output File    | #SBATCH --error=first-%N-%j.err        | -e job.error | standard output file      |                    |                       | file to which STDERR is written
Notifications        | #SBATCH --mail-type=END                | n/a          | none                      |                    |                       | type of email notification (BEGIN, END, FAIL, ALL)
Mail Address         | #SBATCH --mail-user=max.muster@desy.de | n/a          | userid@mail.desy.de       |                    |                       | address to which notifications are sent
Constraints          | #SBATCH --constraint=INTEL             | -C AMD       | none                      | SBATCH_CONSTRAINT  |                       | define node features
Dependencies         | #SBATCH --dependency=singleton         | -d singleton | none                      |                    | SLURM_JOB_DEPENDENCY  | set jobs to wait for other jobs to finish
Reservation          | #SBATCH --reservation=my-rsv01         | n/a          | none                      | SBATCH_RESERVATION | SLURM_JOB_RESERVATION | use a reservation

  • environment out: environment variables available inside a running batch job
  • environment in: environment variables to control sbatch.
    • The variable overrides values specified in the batch script.
    • Values specified on the command line always take precedence (see the example below).
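A minimal sketch of both directions, again with my-job.sh as a placeholder:

# environment in: overrides the partition set via #SBATCH inside my-job.sh,
# but would itself be overridden by an explicit --partition on the command line
[@max-wgs ~]$ SBATCH_PARTITION=allcpu sbatch my-job.sh

# environment out: usable inside the running job, e.g. at the end of my-job.sh
echo "job $SLURM_JOB_ID ran in partition $SLURM_JOB_PARTITION on $SLURM_JOB_NUM_NODES node(s)"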

Standard Job Script

As mentioned, slurm needs to be told which partition to use, for how long, and so on. A minimal batch script could look like this:

#!/bin/bash
#SBATCH --partition=maxcpu     
#SBATCH --time=00:10:00        
#SBATCH --nodes=1    
unset LD_PRELOAD
source /etc/profile.d/modules.sh
module purge

compute something          

Remarks:

  • unset LD_PRELOAD: reduces meaningless warnings in the logfiles when submitting from display nodes
  • source /etc/profile.d/modules.sh: makes the module command available inside batch scripts, which can be quite handy (see the example below)
  • module purge: batch jobs inherit the current environment by default. Purging modules at least ensures that no potentially interfering modules are loaded
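For example, after the purge you can load exactly the modules the job needs. A minimal sketch; the openmpi module and the my-app binary are just illustrative placeholders:

source /etc/profile.d/modules.sh
module purge
module load maxwell              # adds /software/tools/bin to PATH
module load openmpi              # illustrative: load whatever your application needs
mpirun ./my-app                  # my-app is a placeholder for your executable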

Pitfalls:

  • slurm interprets #SBATCH instructions only until it finds the first command (anything other than a comment). Any #SBATCH statement after the first command will be ignored
  • it's not possible to use environment variables in #SBATCH statements (the example below illustrates both pitfalls)
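A small sketch of both pitfalls (the script is intentionally broken; MY_TIME and my-job.sh are placeholders):

#!/bin/bash
#SBATCH --partition=maxcpu       # parsed by sbatch
#SBATCH --time=$MY_TIME          # does NOT work: variables are not expanded in #SBATCH lines
echo "starting"                  # first real command: parsing of #SBATCH lines stops here
#SBATCH --nodes=2                # IGNORED: comes after the first command

# workaround for both cases: pass the values on the command line instead, e.g.
#   sbatch --time=$MY_TIME --nodes=2 my-job.sh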

General remarks

Currently all nodes are configured as non-shared resources. A job has exclusive access to the resources requested, and it's up to the job to consume all CPU cores or a subset of them, using MPI, p-threads, forked processes and so on. In many cases it consequently makes little sense to use the --ntasks or --cpus-per-task settings. Also try to avoid requesting specific hostnames; there is no advantage, and it tends to be counter-productive.
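A minimal sketch, assuming a multi-threaded build as the workload; any application using p-threads, forked processes or MPI fills the node in the same way:

#!/bin/bash
#SBATCH --partition=maxcpu
#SBATCH --time=01:00:00
#SBATCH --nodes=1
unset LD_PRELOAD

# the node is exclusively yours: simply use all cores it offers
make -j $(nproc)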

See the FAQ and the schedmd documentation for more details. To get a quick overview of a command's syntax use the man pages (available on any of the slurm login nodes): man <command>. You will most likely need no more than a handful of commands, like salloc, sbatch, scancel, scontrol, sinfo ...

First batch job

Though sbatch usually expects a job script, it can create a wrapper automatically around a command:

simple batch job using wrap option to launch command
# simple job which prints hostname
@max-wgs001:~$ sbatch --wrap hostname
Submitted batch job 1516

@max-wgs001:~$ cat slurm-1516.out
max-wn022.desy.de

# simple python test
@max-wgs001:~$ sbatch --output=numpy.out --time=0-00:01:00 --partition=allcpu --wrap 'python3 -c "import numpy;print(numpy.__version__)"'

@max-wgs001:~$ cat numpy.out 
1.12.1


Running Job-Scripts in Batch

batch script sample
# simple job which prints hostname 
[@max-wgs ~]$ cat hostname.sh
#!/bin/bash
#SBATCH --partition=maxcpu
#SBATCH --time=00:10:00                           # Maximum time requested
#SBATCH --nodes=1                                 # Number of nodes
#SBATCH --chdir=/home/mmuster/slurm/output        # directory must already exist!
#SBATCH --job-name=hostname
#SBATCH --output=hostname-%N-%j.out               # File to which STDOUT will be written
#SBATCH --error=hostname-%N-%j.err                # File to which STDERR will be written
#SBATCH --mail-type=END                           # Type of email notification- BEGIN,END,FAIL,ALL
#SBATCH --mail-user=max.muster@desy.de            # Email to which notifications will be sent. It defaults to <userid@mail.desy.de> if none is set.
unset LD_PRELOAD
source /etc/profile.d/modules.sh
module purge
 
/bin/hostname

# submit to batch queue for one node with one task
# requesting 10 mins of wall time 
[@max-wgs ~]$ sbatch hostname.sh
Submitted batch job 2163

[@max-wgs ~]$ ls
hostname.sh  slurm-2163.out
 
[@max-wgs ~]$ cat slurm-2163.out
max-wn004.desy.de

[@max-wgs ~]$ scontrol show job 2163
JobId=2163 JobName=hostname
   UserId=mmuster(1234) GroupId=cfel(3512)
   Priority=5001 Nice=0 Account=cfel QOS=cfel
   JobState=COMPLETED Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:01 TimeLimit=00:10:00 TimeMin=N/A
   SubmitTime=2016-01-20T14:50:17 EligibleTime=2016-01-20T14:50:17
   StartTime=2016-01-20T14:50:17 EndTime=2016-01-20T14:50:18
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=cfel AllocNode:Sid=max-cfel001:1345
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=max-cfel004
   BatchHost=max-cfel004
   NumNodes=1 NumCPUs=64 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=64,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=0 Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/mmuster/slurm/hostname.sh
   WorkDir=/home/mmuster/slurm/output
   StdErr=/home/mmuster/slurm/output/hostname-%N-2163.err
   StdIn=/dev/null
   StdOut=/home/mmuster/slurm/output/hostname-%N-2163.out
   Power= SICP=0

Using constraints

For many job types - in particular MPI jobs - you want to select specific hardware, and identical hardware for multi-node jobs. This can be done with so-called constraints (sometimes also called features).

constraints
#SBATCH --constraint=GPU                          # select node(s) with GPU
#SBATCH --constraint='A100|V100'                  # select node(s) with A100 or V100 GPU. multi-node jobs might get a mix of both 
#SBATCH --constraint='EPYC&7402'                  # select node(s) with EPYC 7402 CPU
#SBATCH --constraint='[(EPYC&7402)|Gold-6240]'    # select node(s) with EPYC 7402 CPU OR Intel Gold-6240 CPU, but all nodes identical
#SBATCH --constraint=EPYC                         # request AMD EPYC nodes
#SBATCH --constraint="GPUx1&V100"                 # request a node with exactly one NVIDIA V100 GPU.
#SBATCH --constraint=INTEL                        # request intel nodes. 
#SBATCH --constraint="INTEL&V3"                   # request intel nodes with v3 CPUs (haswell). 
#SBATCH --constraint="[AMD|INTEL]"                # request either N*INTEL or N*AMD nodes. Without [] any combination of INTEL and AMD nodes is being requested.

Remarks:

There are a couple of options to list available constraints and their current availability:

information about constraints
[@max-wgs001 ~]$ sinfo -p maxgpu -o '%n %f %t'              # show nodes in maxgpu, with features/constraints and node status
[@max-wgs001 ~]$ sinfo -p maxgpu -o '%n %f %t' | grep idle  # only show idle nodes 

sinfo is however somewhat misleading: a node allocated in the allgpu partition is shown as allocated, even though it is still available to the maxgpu partition, since jobs in the allgpu partition can be preempted. savail gives a better view:

information about constraints
[@max-wgs001 ~]$ /software/tools/bin/savail -p maxgpu -f 'A100|V100'    # show available and preemptable A100|V100 nodes in maxgpu
[@max-wgs001 ~]$ module load maxwell                                    # add /software/tools/bin to PATH, so you don't have to remember the full path
[@max-wgs001 ~]$ savail -p maxgpu -f 'A100|V100'   

Testing Jobs

You can test in advance if the resources requested by your job are available at all, and when the job is expected to start. For example:

# request a single node without particular specs
[@max-wgs001 ~]$ sbatch --test-only my-app.sh  # --test-only indicates a dry-run. It won't submit the job
sbatch: Job 1652551 to start at 2019-01-20T09:29:06 using 32 processors on nodes max-wn004 in partition maxcpu

# request a V100 GPU
[@max-wgs001 ~]$ sbatch  --constraint=V100 --test-only my-app.sh
sbatch: Job 1652553 to start at 2019-01-20T09:29:26 using 40 processors on nodes max-wng019 in partition maxcpu

# request 2 V100 GPUs in a single node, which is an invalid constraint
[@max-wgs001 ~]$ sbatch --constraint="GPUx2&V100" --test-only my-app.sh
allocation failure: Requested node configuration is not available

# ask for a P100 OR V100 in one of the partitions:
[@max-wgs001 ~]$ sbatch --partition="allcpu,petra4,upex" --constraint="V100|P100" --test-only my-app.sh
sbatch: Job 1652560 to start at 2019-01-18T11:34:15 using 40 processors on nodes max-exflg006 in partition upex
# having permission to use neither petra4 nor upex, my job must be running in the allcpu partition.
# --test-only does not validate the proposed partition. Make sure to specify only partitions usable for your account.

# to verify:
[@max-wgs001 ~]$ sbatch --partition="allcpu,petra4,upex" --constraint="V100|P100"  my-app.sh
Submitted batch job 1652561
[schluenz@max-wgs001 ~]$ sacct -j 1652561
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
1652561       my-app.sh     allcpu         it         40  COMPLETED      0:0 
# as expected running in allcpu partition.

Running graphical applications interactively

[@max-wgs ~]$ salloc -N 1 --partition=maxcpu                 
salloc: Granted job allocation 214

[@max-wgs ~]$ ssh -t -Y $SLURM_JOB_NODELIST matlab_R2018a                   # this will work on max-wgs, but crash on max-display!
[@max-wgs ~]$ ssh -t -Y $SLURM_JOB_NODELIST matlab_R2018a -softwareopengl   # this will always work

# you could write a small wrapper named $HOME/bin/s:
#!/bin/bash
if [[ "x$SLURM_JOB_NODELIST" != "x" ]]; then 
    ssh -t -Y $SLURM_JOB_NODELIST "$@"
else
    echo "salloc -N 1 before using s!"
fi

[@max-wgs ~]$ s matlab_R2018a -softwareopengl 

# remember to release the allocation once done!
[@max-wgs ~]$ exit
exit
salloc: Relinquishing job allocation 214
salloc: Job allocation 214 has been revoked.


Controlling Jobs

scancel <jobid>                   # cancel a job
scancel -u <username>             # cancel all the jobs for a user. you can cancel only your own jobs (of course)
scancel -t PENDING -u <username>  # cancel all the pending jobs for a user
scancel --name myJobName          # cancel one or more jobs by name
scontrol hold <jobid>             # pause a particular job
scontrol resume <jobid>           # resume a particular job
scontrol requeue <jobid>          # requeue (cancel and rerun) a particular job
scontrol update jobid=<jobid>     # manipulate a pending job, e.g. scontrol update jobid=12345 partition=allcpu to move a job to the allcpu partition
scontrol show job <jobid>         # show information about a job


Job Information

To get a quick overview of jobs, partitions and so on, slurm provides the tool sview:

[@max-wgs ~]$ sview &

By default it will only show partitions you are entitled to submit jobs to, and consequently only jobs running in these partitions. To view all jobs on all partitions, enable Show Hidden under Options.


squeue -u <username>                                                         # List all current jobs for a user
squeue -u <username> -t RUNNING                                              # List all running jobs for a user
squeue -u <username> -t PENDING                                              # List all pending jobs for a user
squeue -u <username> -p maxcpu                                               # List all current jobs in the maxcpu partition for a user
scontrol show jobid -dd <jobid>                                              # List detailed information for a job (useful for troubleshooting)
scontrol update jobid=<jobid> partition=maxcpu NumNodes=6                    # update a (pending) job for example by setting the number of nodes compliant with partition-limits
 
# Once your job has completed, you can get additional information that was not available during the run. This includes run time, memory used, etc.
sacct -j <jobid> --format=JobID,JobName,MaxRSS,Elapsed                       # To get statistics on completed jobs by jobID:
sacct -u <username> --format=JobID,JobName,MaxRSS,Elapsed                    # To view the same information for all jobs of a user

[@max-wgs ~]$ sstat --format=AveCPU,AvePages,AveRSS,AveVMSize,JobID -j 1652579   # some stats about running job like memory consumption.
    AveCPU   AvePages     AveRSS  AveVMSize        JobID 
---------- ---------- ---------- ---------- ------------ 
 00:00.000          0      1469K     27840K 1652579.0    


# /software/tools/bin/slurm is a convenient tool to extract queue and job information
[@max-wgs ~]$ module load maxwell tools
[@max-wgs ~]$ slurm 
Show or watch job queue:
 slurm [watch] queue     show own jobs
 slurm [watch] q <user>  show user's jobs
 slurm [watch] quick     show quick overview of own jobs
 slurm [watch] shorter   sort and compact entire queue by job size
 slurm [watch] short     sort and compact entire queue by priority
 slurm [watch] full      show everything
 slurm [w] [q|qq|ss|s|f] shorthands for above!

 slurm qos               show job service classes
 slurm top [queue|all]   show summary of active users

Show detailed information about jobs:
 slurm prio [all|short]  show priority components
 slurm j|job <jobid>     show everything else
 slurm steps <jobid>     show memory usage of running srun job steps

Show usage and fair-share values from accounting database:
 slurm h|history <time>  show jobs finished since, e.g. "1day" (default)
 slurm shares

Show nodes and resources in the cluster:
 slurm p|partitions      all partitions
 slurm n|nodes           all cluster nodes
 slurm c|cpus            total cpu cores in use
 slurm cpus <partition>  cores available to partition, allocated and free
 slurm cpus jobs         cores/memory reserved by running jobs
 slurm cpus queue        cores/memory required by pending jobs
 slurm features          List features and GRES
 slurm brief_features    List features with node counts
 slurm matrix_features   List possible combinations of features with node counts

# example: show jobs
slurm q 
JOBID              PARTITION NAME                  TIME       START_TIME    STATE NODELIST(REASON)
1599696            maxcpu    test1                 0:00              N/A  PENDING (QOSMaxJobsPerUserLimit)
1599695            maxcpu    test2                 0:00              N/A  PENDING (QOSMaxJobsPerUserLimit)
1599689            maxcpu    test3              3:53:05 2019-01-18T08:02  RUNNING max-wn[017,026],max-wna[022-025]
# slurm w q would continuously update the view on your jobs

Environment

[@max-wgs001 ~]$ salloc -N 1 -J test         # request a single node in the default partition                                                                                                        
salloc: Granted job allocation 3327                                                                                                                                                                                                          
salloc: Waiting for resource configuration                                                                                                                       
salloc: Nodes max-wn007 are ready for job    # the host(s) allocated. You can ssh into the node, even from a different host (e.g. your windows pc)                                                                

[@max-wgs001 ~]$ env | grep SLURM            # show environment
SLURM_SUBMIT_DIR=/home/schluen
SLURM_SUBMIT_HOST=max-wgs001.desy.de
SLURM_JOB_ID=3327
SLURM_JOB_NAME=test
SLURM_JOB_NUM_NODES=1
SLURM_JOB_NODELIST=max-wn007
SLURM_NODE_ALIASES=(null)
SLURM_JOB_PARTITION=maxcpu
SLURM_JOB_CPUS_PER_NODE=32
SLURM_JOBID=3327
SLURM_NNODES=1
SLURM_NODELIST=max-wn007
SLURM_TASKS_PER_NODE=32
SLURM_CLUSTER_NAME=maxwell
[max-wgs001 ~]$ exit                        # return resources
exit
salloc: Relinquishing job allocation 3327
salloc: Job allocation 3327 has been revoked.


The list of nodes is actually represented as a compact range. If you need a plain hostlist, a scriptlet like the following (see https://rc.fas.harvard.edu/resources/running-jobs/) should do:

#!/bin/bash
# expand the compact nodelist (e.g. max-wn[004-007]) into one hostname per line
hostlist=$(scontrol show hostname $SLURM_JOB_NODELIST)
rm -f hosts

# write one entry per host; ':64' declares the number of slots (cores) per node
for f in $hostlist
do
  echo $f':64' >> hosts
done

Now you can use the 256 cores in an mpi job:

[@max-wgs ~]$ mpirun -n 256 hello-mpi    # would give you 256 lines of "Hello world" back
Hello world from processor max-wn005.desy.de, rank 165 out of 256 processors
Hello world from processor max-wn004.desy.de, rank 124 out of 256 processors
# Note: since mpirun knows about the slurm allocation it will use the allocated hosts, not the local host!

OpenMPI knows about the hosts to use. Running an application like mathematica interactively would however still use the WGS on which you initially made the allocation with salloc; but while your allocation is valid, you can connect with ssh to any allocated host and run, for example, mathematica kernels across the allocated nodes.
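For example, while the salloc allocation is active you can expand the nodelist and log in to any of the allocated nodes (the hostnames below are just illustrative):

[@max-wgs ~]$ scontrol show hostname $SLURM_JOB_NODELIST    # expand the compact range into hostnames
max-wn004
max-wn005
[@max-wgs ~]$ ssh max-wn004                                 # log in to one of the allocated nodes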

Jobs with Dependencies

Jobs can be chained in such a way that one job doesn't start before a set of other jobs has reached a particular state. The most common case is probably to start a job only after some other job has finished:

[@max-wgs ~]$ sbatch --dependency=afterok:1234:1235  dep1.sh                     # start dep1.sh only after jobs with jobid 1234 and 1235 have finished.

[@max-wgs ~]$ sbatch --dependency=singleton --job-name=singleton singleton.sh    #  start this job only after all jobs with the same job-name have finished.     
                                                                                 #  makes sure that only a single job of this name can run at a time.
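A minimal sketch of a two-step chain; step1.sh and step2.sh are placeholders. The --parsable option makes sbatch print just the job id, which can then be fed into the dependency of the next job:

[@max-wgs ~]$ jobid=$(sbatch --parsable step1.sh)            # capture the job id of the first job
[@max-wgs ~]$ sbatch --dependency=afterok:$jobid step2.sh    # step2 only starts if step1 finished successfully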

Job Array

Array jobs allow launching a set of identical, indexed jobs. Let's assume you want to process 10 images with an identical environment:

# array-job.sh
#!/bin/bash
#SBATCH --time    0-00:01:00
#SBATCH --nodes            1
#SBATCH --partition allcpu
#SBATCH --array 1-10
#SBATCH --job-name job-array
#SBATCH --output array-%A_%a.out
unset LD_PRELOAD
source /etc/profile.d/modules.sh
echo "SLURM_JOB_ID           $SLURM_JOB_ID"
echo "SLURM_ARRAY_JOB_ID     $SLURM_ARRAY_JOB_ID"
echo "SLURM_ARRAY_TASK_ID    $SLURM_ARRAY_TASK_ID"
echo "SLURM_ARRAY_TASK_COUNT $SLURM_ARRAY_TASK_COUNT"
echo "SLURM_ARRAY_TASK_MAX   $SLURM_ARRAY_TASK_MAX"
echo "SLURM_ARRAY_TASK_MIN   $SLURM_ARRAY_TASK_MIN"

process image_${SLURM_ARRAY_TASK_ID}.tif


[@max-wgs ~]$ sbatch array-job.sh
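If the individual tasks are heavy on shared resources, the number of simultaneously running array tasks can be limited with the '%' separator. A short sketch:

# run the same 10 tasks, but with at most 2 of them running at the same time
[@max-wgs ~]$ sbatch --array=1-10%2 array-job.sh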