SchedMD offers exhaustive documentation on how to use SLURM: http://slurm.schedmd.com/. We have collected just a few examples below.
The page Maxwell useful commands provides a short list of commands which might come in handy.
Basics
sbatch is the prime command to run jobs on the Maxwell cluster. It accepts a script, copies the script to the scheduler, and the scheduler executes the script as soon as compute resources become available. It's particularly handy for running a large number of tasks without having to worry about the whereabouts of the jobs. Unlike salloc or srun it returns your shell immediately, so you can continue to work, and the jobs will not be affected by accidentally closing a shell or session, or by a crash of the login node used to submit the jobs. The jobs won't even be affected by accidentally deleting the job script: slurm keeps a copy, and you can even recreate the job script while jobs are still running or pending. The details are explained on a separate page on running batch jobs.
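The copy slurm keeps can, for example, be used to reconstruct a deleted job script as long as the job is still known to the scheduler. A short sketch (job ID and script name are made up):

[@max-wgs ~]$ sbatch my-app.sh                                 # slurm stores its own copy of the script
Submitted batch job 4242
[@max-wgs ~]$ rm my-app.sh                                     # the running/pending job is not affected
[@max-wgs ~]$ scontrol write batch_script 4242 recovered.sh    # recreate the script from slurm's copy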
Runtime parameters
A batch job needs to inform the scheduler (slurm) about certain parameters, like the partition a job should run in, how long the job is expected to run, how many nodes it should allocate and so on. All these parameters are explained in SchedMD's sbatch documentation. The parameters can be declared as "comments" in a batch script, or on the command line (see below), in either extended or abbreviated form. Parameters provided on the command line always override parameters specified in the batch script (see the example below the table).
The most commonly used parameters:
Parameter | sbatch comment | abbreviation | default | environment in | environment out | purpose |
---|---|---|---|---|---|---|
Partition | #SBATCH --partition=maxcpu | -p maxcpu | maxcpu | SBATCH_PARTITION | SLURM_JOB_PARTITION | set the partition a job should run in |
Runtime | #SBATCH --time=1-00:00:00 | -t 1-01:00 | depends on the partition! | SBATCH_TIMELIMIT | | set the maximum duration of a job |
Number of Nodes | #SBATCH --nodes=1 | -N 2 | 1 node | | SLURM_JOB_NUM_NODES | min-max number of nodes per job |
Working Directory | #SBATCH --chdir=somewhere | -D somewhere | current working directory | | | the job's starting point |
Job ID | | | | | SLURM_JOB_ID | job ID to be used in scontrol, sacct, ... |
Job Name | #SBATCH --job-name=testjob | -J testjob | name of the command | SBATCH_JOB_NAME | SLURM_JOB_NAME | define a custom name, useful for accounting and dependencies |
Standard Output File | #SBATCH --output=first-%N-%j.out | -o job.log | slurm-<jobid>.out | | | file to which STDOUT is written |
Error Output File | #SBATCH --error=first-%N-%j.err | -e job.error | standard output file | | | file to which STDERR is written |
Notifications | #SBATCH --mail-type=END | n/a | none | | | type of email notification (BEGIN, END, FAIL, ALL) |
Mail Address | #SBATCH --mail-user=max.muster@desy.de | n/a | userid@mail.desy.de | | | address to which notifications are sent |
Constraints | #SBATCH --constraint=INTEL | -C AMD | none | SBATCH_CONSTRAINT | | select node features |
Dependencies | #SBATCH --dependency=singleton | -d singleton | none | | SLURM_JOB_DEPENDENCY | make a job wait for other jobs to finish |
Reservation | #SBATCH --reservation=my-rsv01 | n/a | none | SBATCH_RESERVATION | SLURM_JOB_RESERVATION | use a reservation |
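To illustrate the override behaviour, a short sketch (the script name and time limits are arbitrary): a time limit given on the command line replaces the one declared in the script.

# my-app.sh (hypothetical) declares, among others:
#   #SBATCH --partition=maxcpu
#   #SBATCH --time=01:00:00
[@max-wgs ~]$ sbatch my-app.sh                   # runs with the 1 h limit declared in the script
[@max-wgs ~]$ sbatch --time=04:00:00 my-app.sh   # the command-line value wins: 4 h limit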
Standard Job Script
As mentioned, slurm needs to be instructed what to use, in which partition, and for how long. A minimal batch script could look like this:
#!/bin/bash
#SBATCH --partition=maxcpu
#SBATCH --time=00:10:00
#SBATCH --nodes=1
unset LD_PRELOAD
source /etc/profile.d/modules.sh
module purge
compute something
Remarks:
- unset LD_PRELOAD: reduces meaningless warnings in the logfiles when submitting from display nodes
- source /etc/profile.d/modules.sh: allows you to use the module command inside batch scripts, which can come in quite handy
- module purge: batch jobs inherit the current environment by default. Purging modules at least ensures that no potentially interfering modules are loaded
Pitfalls:
- slurm interprets #SBATCH instructions only until it finds the first command (anything other than a comment). Any #SBATCH statement after the first command is ignored
- it's not possible to use environment variables in #SBATCH statements; pass dynamic values on the command line instead (see the sketch below)
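A minimal sketch of both pitfalls (MYNAME and job.sh are placeholders):

#!/bin/bash
#SBATCH --partition=maxcpu
#SBATCH --job-name=$MYNAME     # does NOT work: environment variables are not expanded in #SBATCH lines
echo "starting"                # first real command: slurm stops parsing #SBATCH lines here ...
#SBATCH --time=00:10:00        # ... so this statement is silently ignored
sleep 60
# workaround: pass dynamic values on the command line instead, e.g.
#   sbatch --job-name="$MYNAME" --time=00:10:00 job.sh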
General remarks
Currently all nodes are configured as non-shared resources. A job has exclusive access to the resources requested, and it's up to the job to consume all CPU cores or only a subset of them, using MPI, pthreads, forked processes and so on (see the sketch below). Consequently, in many cases it makes little sense to use the --ntasks or --cpus-per-task settings. Also try to avoid requesting specific hostnames; there is no advantage, and it tends to be counter-productive.
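A minimal sketch of a single-node job spreading work over all cores of the exclusively allocated node, assuming a hypothetical process_image command and a directory of tif files:

#!/bin/bash
#SBATCH --partition=maxcpu
#SBATCH --time=01:00:00
#SBATCH --nodes=1
unset LD_PRELOAD
# the node is exclusive to this job, so fill it with one process per core;
# process_image and data/ are placeholders
ls data/*.tif | xargs -P "$(nproc)" -I{} ./process_image {}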
See the FAQ and the SchedMD documentation for more details. To get a quick overview of a command's syntax use the man pages (available on any of the slurm login nodes): man <command>. You will most likely need no more than a handful of commands, like salloc, sbatch, scancel, scontrol, sinfo ...
First batch job
Though sbatch usually expects a job script, it can also automatically create a wrapper script around a simple command:
# simple job which prints hostname
@max-wgs001:~$ sbatch --wrap hostname
Submitted batch job 1516
@max-wgs001:~$ cat slurm-1516.out
max-wn022.desy.de

# simple python test
@max-wgs001:~$ sbatch --output=numpy.out --time=0-00:01:00 --partition=allcpu --wrap 'python3 -c "import numpy;print(numpy.__version__)"'
@max-wgs001:~$ cat numpy.out
1.12.1
Running Job-Scripts in Batch
# simple job which prints hostname
[@max-wgs ~]$ cat hostname.sh
#!/bin/bash
#SBATCH --partition=maxcpu
#SBATCH --time=00:10:00                      # Maximum time requested
#SBATCH --nodes=1                            # Number of nodes
#SBATCH --chdir=/home/mmuster/slurm/output   # directory must already exist!
#SBATCH --job-name=hostname
#SBATCH --output=hostname-%N-%j.out          # File to which STDOUT will be written
#SBATCH --error=hostname-%N-%j.err           # File to which STDERR will be written
#SBATCH --mail-type=END                      # Type of email notification: BEGIN,END,FAIL,ALL
#SBATCH --mail-user=max.muster@desy.de       # Email to which notifications will be sent. It defaults to <userid@mail.desy.de> if none is set.
unset LD_PRELOAD
source /etc/profile.d/modules.sh
module purge
/bin/hostname

# submit to batch queue for one node with one task
# requesting 10 mins of wall time
[@max-wgs ~]$ sbatch hostname.sh
Submitted batch job 2163
[@max-wgs ~]$ ls
hostname.sh  slurm-2163.out
[@max-wgs ~]$ cat slurm-2163.out
max-wn004.desy.de
[@max-wgs ~]$ scontrol show job 2163
JobId=2163 JobName=hostname
   UserId=mmuster(1234) GroupId=cfel(3512)
   Priority=5001 Nice=0 Account=cfel QOS=cfel
   JobState=COMPLETED Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:01 TimeLimit=00:10:00 TimeMin=N/A
   SubmitTime=2016-01-20T14:50:17 EligibleTime=2016-01-20T14:50:17
   StartTime=2016-01-20T14:50:17 EndTime=2016-01-20T14:50:18
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=cfel AllocNode:Sid=max-cfel001:1345
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=max-cfel004 BatchHost=max-cfel004
   NumNodes=1 NumCPUs=64 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=64,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=0 Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/mmuster/slurm/hostname.sh
   WorkDir=/home/mmuster/slurm/output
   StdErr=/home/mmuster/slurm/output/hostname-%N-2163.err
   StdIn=/dev/null
   StdOut=/home/mmuster/slurm/output/hostname-%N-2163.out
   Power= SICP=0
Using constraints
For many job types - in particular MPI jobs - you want to select specific hardware, and identical hardware for multi-node jobs. This can be done with so-called constraints (sometimes also called features).
#SBATCH --constraint=GPU                       # select node(s) with GPU
#SBATCH --constraint='A100|V100'               # select node(s) with A100 or V100 GPU. multi-node jobs might get a mix of both
#SBATCH --constraint='EPYC&7402'               # select node(s) with EPYC 7402 CPU
#SBATCH --constraint='[(EPYC&7402)|Gold-6240]' # select node(s) with EPYC 7402 CPU OR Intel Gold-6240 CPU, but all nodes identical
#SBATCH --constraint=EPYC                      # request AMD EPYC nodes
#SBATCH --constraint="GPUx1&V100"              # request a node with exactly one NVIDIA V100 GPU
#SBATCH --constraint=INTEL                     # request intel nodes
#SBATCH --constraint="INTEL&V3"                # request intel nodes with v3 CPUs (haswell)
#SBATCH --constraint="[AMD|INTEL]"             # request either N*INTEL or N*AMD nodes. Without [] any combination of INTEL and AMD nodes is being requested
Remarks:
- constraints are case sensitive!
- consult https://slurm.schedmd.com/sbatch.html for more examples and details
There are a couple of options to list the available constraints and their current availability:
[@max-wgs001 ~]$ sinfo -p maxgpu -o '%n %f %t'              # show nodes in maxgpu, with features/constraints and node status
[@max-wgs001 ~]$ sinfo -p maxgpu -o '%n %f %t' | grep idle  # only show idle nodes
sinfo is however quite misleading: a node allocated in the allgpu partition is shown as allocated, even though it is available in the maxgpu partition, since jobs in the allgpu partition can be preempted. savail gives a better view:
[@max-wgs001 ~]$ /software/tools/bin/savail -p maxgpu -f 'A100|V100'  # show available and preemptable A100|V100 nodes in maxgpu
[@max-wgs001 ~]$ module load maxwell                                  # add /software/tools/bin to PATH, so you don't have to remember the full path
[@max-wgs001 ~]$ savail -p maxgpu -f 'A100|V100'
Testing Jobs
You can test in advance whether the resources requested by your job are available at all, and when the job is expected to start. For example:
# request a single node without particular specs
[@max-wgs001 ~]$ sbatch --test-only my-app.sh   # --test-only indicates a dry-run. It won't submit the job
sbatch: Job 1652551 to start at 2019-01-20T09:29:06 using 32 processors on nodes max-wn004 in partition maxcpu

# request a V100 GPU
[@max-wgs001 ~]$ sbatch --constraint=V100 --test-only my-app.sh
sbatch: Job 1652553 to start at 2019-01-20T09:29:26 using 40 processors on nodes max-wng019 in partition maxcpu

# request 2 V100 GPUs in a single node, which is an invalid constraint
[@max-wgs001 ~]$ sbatch --constraint="GPUx2&V100" --test-only my-app.sh
allocation failure: Requested node configuration is not available

# ask for a P100 OR V100 in one of the partitions:
[@max-wgs001 ~]$ sbatch --partition="allcpu,petra4,upex" --constraint="V100|P100" --test-only my-app.sh
sbatch: Job 1652560 to start at 2019-01-18T11:34:15 using 40 processors on nodes max-exflg006 in partition upex
# having permission to use neither petra4 nor upex, my job must be running in the allcpu partition.
# --test-only does not validate the partition proposed. Make sure to specify only partitions usable for your account.

# to verify:
[@max-wgs001 ~]$ sbatch --partition="allcpu,petra4,upex" --constraint="V100|P100" my-app.sh
Submitted batch job 1652561
[schluenz@max-wgs001 ~]$ sacct -j 1652561
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
     1652561  my-app.sh     allcpu         it         40  COMPLETED      0:0
# as expected, running in the allcpu partition.
Running graphical applications interactively
[@max-wgs ~]$ salloc -N 1 --partition=maxcpu
salloc: Granted job allocation 214
[@max-wgs ~]$ ssh -t -Y $SLURM_JOB_NODELIST matlab_R2018a                  # this will work on max-wgs, but crash on max-display!
[@max-wgs ~]$ ssh -t -Y $SLURM_JOB_NODELIST matlab_R2018a -softwareopengl  # this will always work

# you could write a small wrapper named $HOME/bin/s:
#!/bin/bash
if [[ "x$SLURM_JOB_NODELIST" != "x" ]]; then
    ssh -t -Y $SLURM_JOB_NODELIST "$@"
else
    echo "salloc -N 1 before using s!"
fi

[@max-wgs ~]$ s matlab_R2018a -softwareopengl

# remember to release the allocation once done!
[@max-wgs ~]$ exit
exit
salloc: Relinquishing job allocation 214
salloc: Job allocation 214 has been revoked.
Controlling Jobs
scancel <jobid>                   # cancel a job
scancel -u <username>             # cancel all the jobs for a user. you can cancel only your own jobs (of course)
scancel -t PENDING -u <username>  # cancel all the pending jobs for a user
scancel --name myJobName          # cancel one or more jobs by name
scontrol hold <jobid>             # pause a particular job
scontrol resume <jobid>           # resume a particular job
scontrol requeue <jobid>          # requeue (cancel and rerun) a particular job
scontrol update jobid=<jobid>     # manipulate a pending job, e.g. scontrol update jobid=12345 partition=allcpu to move a job to the allcpu partition
scontrol show job <jobid>         # give information about a job
Job Information
To get a quick overview about jobs, partitions and so on, slurm provides a tool sview:
[@max-wgs ~]$ sview &
By default it will only show partitions you are entitled to submit jobs to, and consequently only jobs running in these partitions. To view all jobs on all partitions, enable "Show Hidden" under Options.
squeue -u <username>              # List all current jobs for a user
squeue -u <username> -t RUNNING   # List all running jobs for a user
squeue -u <username> -t PENDING   # List all pending jobs for a user
squeue -u <username> -p maxcpu    # List all current jobs in the maxcpu partition for a user
scontrol show jobid -dd <jobid>   # List detailed information for a job (useful for troubleshooting)
scontrol update jobid=<jobid> partition=maxcpu NumNodes=6   # update a (pending) job, for example by setting the number of nodes compliant with partition limits

# Once your job has completed, you can get additional information that was not available during the run. This includes run time, memory used, etc.
sacct -j <jobid> --format=JobID,JobName,MaxRSS,Elapsed      # To get statistics on completed jobs by jobID
sacct -u <username> --format=JobID,JobName,MaxRSS,Elapsed   # To view the same information for all jobs of a user
[@max-wgs ~]$ sstat --format=AveCPU,AvePages,AveRSS,AveVMSize,JobID -j 1652579   # some stats about a running job, like memory consumption
    AveCPU   AvePages     AveRSS  AveVMSize        JobID
---------- ---------- ---------- ---------- ------------
 00:00.000          0      1469K     27840K    1652579.0

# /software/tools/bin/slurm is a convenient tool to extract queue and job information
[@max-wgs ~]$ module load maxwell tools
[@max-wgs ~]$ slurm
Show or watch job queue:
  slurm [watch] queue     show own jobs
  slurm [watch] q <user>  show user's jobs
  slurm [watch] quick     show quick overview of own jobs
  slurm [watch] shorter   sort and compact entire queue by job size
  slurm [watch] short     sort and compact entire queue by priority
  slurm [watch] full      show everything
  slurm [w] [q|qq|ss|s|f] shorthands for above!
  slurm qos               show job service classes
  slurm top [queue|all]   show summary of active users
Show detailed information about jobs:
  slurm prio [all|short]  show priority components
  slurm j|job <jobid>     show everything else
  slurm steps <jobid>     show memory usage of running srun job steps
Show usage and fair-share values from accounting database:
  slurm h|history <time>  show jobs finished since, e.g. "1day" (default)
  slurm shares
Show nodes and resources in the cluster:
  slurm p|partitions      all partitions
  slurm n|nodes           all cluster nodes
  slurm c|cpus            total cpu cores in use
  slurm cpus <partition>  cores available to partition, allocated and free
  slurm cpus jobs         cores/memory reserved by running jobs
  slurm cpus queue        cores/memory required by pending jobs
  slurm features          List features and GRES
  slurm brief_features    List features with node counts
  slurm matrix_features   List possible combinations of features with node counts

# example: show jobs
slurm q
JOBID    PARTITION  NAME   TIME     START_TIME        STATE    NODELIST(REASON)
1599696  maxcpu     test1  0:00     N/A               PENDING  (QOSMaxJobsPerUserLimit)
1599695  maxcpu     test2  0:00     N/A               PENDING  (QOSMaxJobsPerUserLimit)
1599689  maxcpu     test3  3:53:05  2019-01-18T08:02  RUNNING  max-wn[017,026],max-wna[022-025]

# slurm w q would continuously update the view on your jobs
Environment
[@max-wgs001 ~]$ salloc -N 1 -J test        # request a single node in the default partition
salloc: Granted job allocation 3327
salloc: Waiting for resource configuration
salloc: Nodes max-wn007 are ready for job   # the host(s) allocated. You can ssh into the node, even from a different host (e.g. your windows pc)
[@max-wgs001 ~]$ env | grep SLURM           # show environment
SLURM_SUBMIT_DIR=/home/schluen
SLURM_SUBMIT_HOST=max-wgs001.desy.de
SLURM_JOB_ID=3327
SLURM_JOB_NAME=test
SLURM_JOB_NUM_NODES=1
SLURM_JOB_NODELIST=max-wn007
SLURM_NODE_ALIASES=(null)
SLURM_JOB_PARTITION=maxcpu
SLURM_JOB_CPUS_PER_NODE=32
SLURM_JOBID=3327
SLURM_NNODES=1
SLURM_NODELIST=max-wn007
SLURM_TASKS_PER_NODE=32
SLURM_CLUSTER_NAME=maxwell
[max-wgs001 ~]$ exit                        # return resources
exit
salloc: Relinquishing job allocation 3327
salloc: Job allocation 3327 has been revoked.
The list of nodes is actually represented as a range. If you need a plain hostlist, a scriptlet like the following (see https://rc.fas.harvard.edu/resources/running-jobs/)
#!/bin/bash
hostlist=$(scontrol show hostname $SLURM_JOB_NODELIST)
rm -f hosts
for f in $hostlist
do
  echo $f':64' >> hosts
done
should do. Now you can use the 256 cores in an mpi job:
[@max-wgs ~]$ mpirun -n 256 hello-mpi   # would give you 256 lines of "Hello world" back
Hello world from processor max-wn005.desy.de, rank 165 out of 256 processors
Hello world from processor max-wn004.desy.de, rank 124 out of 256 processors
# Note: since mpirun knows about the slurm allocation it will use the allocated hosts, not the local host!
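If a launcher does not detect the Slurm allocation by itself, the hosts file generated above can be passed explicitly; a sketch using OpenMPI's --hostfile option (option names differ between MPI implementations):

[@max-wgs ~]$ mpirun -n 256 --hostfile hosts hello-mpi   # use the hosts file written by the scriptlet above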
OpenMPI thus knows which hosts to use. Running an application like Mathematica interactively, however, would still use the WGS on which you originally made the allocation with salloc; but while the allocation is valid, you can ssh into any of the allocated hosts and, for example, run Mathematica kernels across the allocated nodes (see the sketch below).
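A minimal sketch of starting something on each allocated host by hand, assuming a still valid salloc allocation (start-kernel is a placeholder for whatever should run on the nodes):

for h in $(scontrol show hostname "$SLURM_JOB_NODELIST"); do
    ssh "$h" ./start-kernel &   # launch on each allocated node in the background
done
wait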
Jobs with Dependencies
Jobs can be chained such that one job doesn't start before a set of other jobs has reached a particular state. The most common case is probably to start a job only after some other job has finished:
[@max-wgs ~]$ sbatch --dependency=afterok:1234:1235 dep1.sh                      # start dep1.sh only after the jobs with jobid 1234 and 1235 have finished.
[@max-wgs ~]$ sbatch --dependency=singleton --job-name=singleton singleton.sh    # start this job only after all jobs with the same job-name have finished.
                                                                                 # makes sure that only a single job of this name can run at a time.
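Job IDs for such chains don't have to be typed by hand: sbatch --parsable prints just the job ID, so a chain can be built in a small script. A sketch with made-up script names:

# submit a chain of dependent jobs; step1.sh, step2.sh and collect.sh are placeholders
jid1=$(sbatch --parsable step1.sh)
jid2=$(sbatch --parsable --dependency=afterok:${jid1} step2.sh)
sbatch --dependency=afterok:${jid2} collect.sh   # runs only after step2.sh (and hence step1.sh) finished successfully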
Job Array
Array jobs allow you to launch a set of identical, indexed jobs. Let's assume you want to process 10 images with an identical environment:
# array-job.sh
#!/bin/bash
#SBATCH --time      0-00:01:00
#SBATCH --nodes     1
#SBATCH --partition allcpu
#SBATCH --array     1-10
#SBATCH --job-name  job-array
#SBATCH --output    array-%A_%a.out
unset LD_PRELOAD
source /etc/profile.d/modules.sh
echo "SLURM_JOB_ID           $SLURM_JOB_ID"
echo "SLURM_ARRAY_JOB_ID     $SLURM_ARRAY_JOB_ID"
echo "SLURM_ARRAY_TASK_ID    $SLURM_ARRAY_TASK_ID"
echo "SLURM_ARRAY_TASK_COUNT $SLURM_ARRAY_TASK_COUNT"
echo "SLURM_ARRAY_TASK_MAX   $SLURM_ARRAY_TASK_MAX"
echo "SLURM_ARRAY_TASK_MIN   $SLURM_ARRAY_TASK_MIN"
process image_${SLURM_ARRAY_TASK_ID}.tif

[@max-wgs ~]$ sbatch array-job.sh
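The index range can also be given, or overridden, on the command line; a "%" suffix throttles how many array tasks may run simultaneously. A short sketch:

[@max-wgs ~]$ sbatch --array=1-10%2 array-job.sh   # same array, but at most 2 tasks run at the same time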