Overview
Command | Environment | man page | Explained | Example |
---|---|---|---|---|
my-resources | | | Show major resources and their availability for your account | |
my-partitions | | | Show all Maxwell partitions and which ones are accessible for your account | |
my-licenses | | | Show licenses for major commercial products and list all licenses you are currently using | my-licenses -p matlab |
my-quota | | | Show the quota of your Maxwell home directory | |
xwhich | | | Find applications and print the required setup | xwhich Avizo |
Job controls | | | | |
sbatch | | man sbatch | Submit a batch job to the cluster | sbatch -p allcpu --constraint=P100 --time=1-12:00:00 |
salloc | | man salloc | Request an interactive job on the cluster | salloc --partition=allcpu --nodes=1 |
srun | | man srun | Run an interactive job | srun -p allcpu --pty -t 0-06:00 matlab_R2018a |
scancel | | man scancel | Signal or cancel jobs and job steps | scancel 12345 |
scontrol | | man scontrol | View and modify Slurm configuration and state | |
Job & Cluster information | | | | |
Please note that these commands create additional load on the management nodes, so do not run them unnecessarily or at high frequency. | | | | |
savail | module load maxwell | savail -h | Show the real availability of nodes | savail -p maxcpu |
webavail | | | Web-based and much more powerful alternative to sview | |
sview | | man sview | GUI for Slurm showing the current status of the cluster | |
sinfo | | man sinfo | View information about Slurm nodes and partitions | |
squeue | | man squeue | View information about jobs in the scheduling queue | |
sacct | | man sacct | Accounting information for jobs invoked with Slurm | |
sstat | | man sstat | Status information for running jobs invoked with Slurm | |
slurm | module load maxwell | | | |
max-limits | | | Show limits of partitions | max-limits -p jhub -a |
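Taken together, the helper commands above let you check what your account can use before submitting anything. A minimal sketch of such a session, reusing the examples from the table (partition and product names are just examples):

```
# check which partitions and resources your account can use
my-partitions
my-resources

# check the quota of your Maxwell home directory
my-quota

# check license usage for a commercial product, e.g. matlab
my-licenses -p matlab

# locate an application and print the module setup it needs
xwhich Avizo
```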
sbatch
Create a batch script my-script.sh like the following and submit with sbatch my-script.sh:
```
#!/bin/bash
#SBATCH --time 0-00:01:00
#SBATCH --nodes 1
#SBATCH --partition maxcpu
#SBATCH --job-name slurm-01
export LD_PRELOAD=""              # useful on max-display nodes, harmless on others
source /etc/profile.d/modules.sh  # make the module command available
...                               # your actual job
```
That's the core information you should probably keep in any job script. Note: never add a #SBATCH directive after a regular command; it will be ignored like any other comment.
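To illustrate the note above, here is a small sketch of a script with a misplaced directive. sbatch only parses #SBATCH lines that appear before the first executable command:

```
#!/bin/bash
#SBATCH --time 0-00:01:00    # honoured: directives at the top are parsed
#SBATCH --partition maxcpu   # honoured
echo "starting job"          # first regular command
#SBATCH --nodes 2            # IGNORED: after a regular command this is just a comment
```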
A simple example for a Mathematica job:
```
#!/bin/bash
#SBATCH --time 0-00:01:00
#SBATCH --nodes 1
#SBATCH --partition allcpu
#SBATCH --job-name mathematica
export LD_PRELOAD=""                      # useful on max-display nodes, harmless on others
source /etc/profile.d/modules.sh          # make the module command available
module load mathematica
export nprocs=$((`/usr/bin/nproc` / 2))   # we have hyperthreading enabled. nprocs == number of physical cores
math -noprompt -run '<<math-trivial.m'
```

A sample math-trivial.m:

```
tmp = Environment["nprocs"]
nprocs = FromDigits[tmp]
LaunchKernels[nprocs]
Do[Pause[1];f[i],{i,nprocs}] // AbsoluteTiming >> "math-trivial.out"
ParallelDo[Pause[1];f[i],{i,nprocs}] // AbsoluteTiming >>> "math-trivial.out"
Quit[]
```
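Assuming the batch script above is saved as, say, mathematica-job.sh (the file name is just an example), it is submitted and checked like any other batch job. Unless --output is set, Slurm writes the job's stdout/stderr to slurm-&lt;jobid&gt;.out in the submission directory:

```
sbatch mathematica-job.sh   # prints: Submitted batch job <jobid>
squeue -u $USER             # watch the job while it is pending or running
cat math-trivial.out        # results written by math-trivial.m
```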
salloc
salloc uses the same syntax as sbatch.
```
# request one node with a P100 GPU for 8 hours in the allcpu partition:
salloc --nodes=1 --partition=allcpu --constraint=P100 --time=08:00:00

# start an interactive graphical matlab session on the allocated host:
ssh -t -Y $SLURM_JOB_NODELIST matlab_R2018a

# the allocation won't disappear when idle. You have to terminate the session:
exit
```
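If you lose track of an interactive allocation, it can be inspected and released with the standard tools (generic Slurm usage, nothing Maxwell-specific):

```
squeue -u $USER   # the allocation shows up as a regular job
scancel <jobid>   # release it explicitly if you can no longer type exit in the session
```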
scancel
```
scancel 1234                  # cancel job 1234
scancel -u $USER              # cancel all my jobs
scancel -u $USER -t PENDING   # cancel all my pending jobs
scancel --name myjob          # cancel a named job
scancel 1234_3                # cancel an indexed job in a job array
```
sinfo
```
sinfo                # basic list of partitions
sinfo -N -p allcpu   # list all nodes and their state in the allcpu partition

# list all nodes with limits and features in the petra4 partition:
sinfo -N -p petra4 -o "%10P %.6D %8c %8L %12l %8m %30f %N"
```
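The field specifiers used in the -o format string are documented in man sinfo; roughly, they translate as follows (summarized here, so double-check against the man page of the installed Slurm version):

```
# %P  partition name          %D  number of nodes
# %c  CPUs per node           %L  default time limit
# %l  maximum time limit      %m  memory per node (MB)
# %f  node features           %N  node list
# a leading number sets the field width, a leading dot right-justifies the field
```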
squeue
```
squeue                               # show all jobs
squeue -u $USER                      # show all jobs of user
squeue -u $USER -p upex -t PENDING   # all pending jobs of user in upex partition
```
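For pending jobs it is often useful to see why they are waiting and when the scheduler expects to start them (standard squeue options, see man squeue):

```
squeue -u $USER -t PENDING --start               # expected start times of pending jobs
squeue -u $USER -o "%.10i %.12P %.2t %.10M %R"   # %R: reason a job is pending, or its allocated nodes
```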
sacct
Provides accounting information. Never use it for time spans exceeding a month!
```
sacct -j 1628456   # accounting information for jobid 1628456
sacct -u $USER     # today's jobs

# get detailed information about all my jobs since 2019-01-01 and grep for all that FAILED:
sacct -u $USER --format="partition,jobid,state,start,end,nodeList,CPUTime,MaxRSS" --starttime 2019-01-01 | grep FAILED
```
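Instead of grepping, sacct can also filter by state on the server side, which keeps the output small (standard sacct option; keep the time span bounded as noted above):

```
# all of my jobs since 2019-01-01 that failed
sacct -u $USER --starttime 2019-01-01 --state=FAILED \
      --format="partition,jobid,state,start,end,nodeList,CPUTime,MaxRSS"
```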
scontrol
Displays information about currently running and pending jobs as well as the configuration of partitions and nodes. It also allows altering job characteristics of pending jobs.
```
scontrol show job 12345                        # show information about job 12345; shows nothing once the job has finished
scontrol show reservation                      # list current and future reservations
scontrol update jobid=12345 partition=allcpu   # move pending job 12345 to the allcpu partition
```
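Other frequently used scontrol operations on pending jobs follow the same pattern (generic Slurm usage; see man scontrol for the full list):

```
scontrol hold 12345                              # prevent a pending job from starting
scontrol release 12345                           # make it eligible for scheduling again
scontrol update jobid=12345 timelimit=02:00:00   # reduce the requested time limit (users can usually only lower it)
```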
slurm
```
module load maxwell tools slurm

# Show or watch job queue:
slurm [watch] queue       # show own jobs
slurm [watch] q <user>    # show user's jobs
slurm [watch] quick       # show quick overview of own jobs
slurm [watch] shorter     # sort and compact entire queue by job size
slurm [watch] short       # sort and compact entire queue by priority
slurm [watch] full        # show everything
slurm [w] [q|qq|ss|s|f]   # shorthands for the above
slurm qos                 # show job service classes
slurm top [queue|all]     # show summary of active users

# Show detailed information about jobs:
slurm prio [all|short]    # show priority components
slurm j|job <jobid>       # show everything else
slurm steps <jobid>       # show memory usage of running srun job steps

# Show usage and fair-share values from the accounting database:
slurm h|history <time>    # show jobs finished since, e.g. "1day" (default)
slurm shares

# Show nodes and resources in the cluster:
slurm p|partitions        # all partitions
slurm n|nodes             # all cluster nodes
slurm c|cpus              # total cpu cores in use
slurm cpus <partition>    # cores available to partition, allocated and free
slurm cpus jobs           # cores/memory reserved by running jobs
slurm cpus queue          # cores/memory required by pending jobs
slurm features            # list features and GRES
slurm brief_features      # list features with node counts
slurm matrix_features     # list possible combinations of features with node counts
```