Maxwell: Running single-core batch jobs

A separate slurm instance has been created to support single- or few-core jobs. The slurm commands are almost identical to those described for standard full-node jobs; you just need to specify the slurm instance:

max-wgse002:~$ sinfo -M solaris   # or sinfo --cluster=solaris
CLUSTER: solaris
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
solcpu*      up 7-00:00:00      5   idle max-wn[008-012]

The slurm instance - named solaris - contains a single partition - named solcpu - with a handful of old nodes:

max-wgse002:~$ sinfo --cluster=solaris -o '%n %f'
CLUSTER: solaris
HOSTNAMES AVAIL_FEATURES
max-wn008 INTEL,V4,E5-2640,256G
max-wn009 INTEL,V4,E5-2640,256G
max-wn010 INTEL,V4,E5-2640,256G
max-wn011 INTEL,V4,E5-2640,256G
max-wn012 INTEL,V4,E5-2640,256G

Job configuration

The solaris instance supports allocating a specific number of cores and specifying the amount of memory. This means that you have to set sensible limits: otherwise the node will either be poorly utilized, or your jobs will be terminated once they exceed the limits.

The default memory allocated to a job is 4GB.
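Since 4GB is only the default, jobs needing more memory should request it explicitly with --mem. A minimal job script might look like the following sketch (the 8G value, the 2-core request, and the one-hour limit are illustrative choices, not cluster defaults):

```shell
#!/bin/bash
#SBATCH --cluster=solaris
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --mem=8G               # request 8GB instead of the 4GB default
#SBATCH --time=0-01:00:00
unset LD_PRELOAD

echo "Cores   available: $(nproc)"
```

Submitted with sbatch as usual; the --cluster directive inside the script routes the job to the solaris instance, as in the examples below.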

Example 1:

Allocate 4 cores:

#!/bin/bash
#SBATCH --cluster=solaris
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --time=0-00:10:00
unset LD_PRELOAD

np=$(nproc)
echo "Cores   available: $np"

srun -n $np hostname

# Output:
Cores   available: 4
max-wn008.desy.de
max-wn008.desy.de
max-wn008.desy.de
max-wn008.desy.de

Example 2:

Allocate 4 cores and try to use 6 cores:

#!/bin/bash
#SBATCH --cluster=solaris
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=4G
#SBATCH --time=0-00:10:00
unset LD_PRELOAD

np=$(nproc)
echo "Cores   available: $np"

srun -n 6 hostname

# Output:
Cores   available: 4
srun: error: Unable to create step for job 51: More processors requested than permitted

Example 3:

Allocate 4GB of memory and try to use 5GB:

#!/bin/bash
#SBATCH --cluster=solaris
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G
#SBATCH --time=0-00:10:00
unset LD_PRELOAD

np=$(nproc)
echo "Cores   available: $np"

# try to allocate 5G of memory:
timeout 10 cat /dev/zero | head -c 5G | tail

# Output:
/var/spool/slurmd/job00050/slurm_script: line 17: 24886 Broken pipe             timeout 10 cat /dev/zero
     24887                       | head -c 5G
     24888 Killed                  | tail
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=50.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

# Note: the job state will in this case be OUT_OF_MEMORY

Job information

The squeue, sinfo, sacct, ... commands all work as usual; you just need to add --cluster=solaris (or -M solaris). So to see your jobs it's

# squeue
squeue -u $USER -M solaris              # or 
squeue --user=$USER --cluster=solaris

# sacct
sacct -M solaris              # or 
sacct --cluster=solaris       # or
sacct -L                      # for both slurm instances (maxwell,solaris)
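For more detail on finished jobs, sacct also accepts a --format list. The fields below are standard sacct field names; the exact columns available depend on the site's accounting configuration:

```shell
# show state, peak memory and elapsed time of your recent solaris jobs
sacct -M solaris -u $USER --format=JobID,JobName,State,MaxRSS,Elapsed
```

The MaxRSS column is useful for checking how close a job came to its --mem limit, e.g. after an OUT_OF_MEMORY termination as in Example 3.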