Maxwell : Automatic job requeue

Resubmit job when reaching timelimit

slurm allows to requeue a job upon preemption, but not when running into a timelimit. Occasionally, you want to resubmit a job automatically, possibly with a different timelimit or continuing the incomplete calculation. To do so, you need to the trap the signal sent to the job upon reaching the timelimit. and requeue the job inside a signal handler. A simple example is shown below.

#!/bin/bash -l
#SBATCH --job-name=test-restart
#SBATCH --output=test-restart.out
#SBATCH --time=0-00:03:00
#SBATCH --partition=maxcpu
unset LD_PRELOAD

# the sleep-loop at the end is running for max_iteration*30s
max_iteration=10

# only allow a single restart of the job. 
max_restarts=1

# new partition and timelimit for 2nd and subsequent job runs
alt_partition=all
alt_timelimit=0-01:00:00

# just gather some information about the job
scontext=$(scontrol show job $SLURM_JOB_ID)
restarts=$(echo "$scontext" | grep -o 'Restarts=.' | cut -d= -f2)
outfile=$(echo "$scontext"  | grep 'StdOut='       | cut -d= -f2)
errfile=$(echo "$scontext"  | grep 'StdErr='       | cut -d= -f2)
timelimit=$(echo "$scontext" | grep -o 'TimeLimit=.*' | awk '{print $1}' | cut -d= -f2) 

# term handler
# the function is executed once the job gets the TERM signal
term_handler()
{
    echo "executing term_handler at $(date)"
    if [[ $restarts -lt $max_restarts ]]; then
	   # copy the logfile. will be overwritten by the 2nd run
	   cp -v $outfile $outfile.$restarts
	   # requeue the job and put it on hold. It's not possible to change partition otherwise
	   scontrol requeuehold $SLURM_JOB_ID
	   # change timelimit and partition
	   scontrol update JobID=$SLURM_JOB_ID TimeLimit=$alt_timelimit Partition=$alt_partition
 	   # release the job. It will wait in the queue for 2 minutes before the 2nd run can start
	   scontrol release $SLURM_JOB_ID
    fi
}

# declare the function handling the TERM signal
trap 'term_handler' TERM

# print some job-information
cat <<EOF
SLURM_JOB_ID:         $SLURM_JOB_ID
SLURM_JOB_NAME:       $SLURM_JOB_NAME
SLURM_JOB_PARTITION:  $SLURM_JOB_PARTITION
SLURM_SUBMIT_HOST:    $SLURM_SUBMIT_HOST
TimeLimit:            $timelimit
Restarts:             $restarts
EOF

# the actual "calculation"
echo "starting calculation at $(date)"
i=0
while [[ $i -lt $max_iteration ]]; do
    sleep 30
    i=$(($i+1))
    echo "$i out of $max_iteration done at $(date) "
done

echo "all done at $(date)"


The above script will run at least twice (unless it has finished before the timelimit). The first run - shown on the left - executes half of it in the maxcpu partition, before it runs into a timeout.

# output test-restart.out.0 of the first run:
# note: the job keeps the jobID!
SLURM_JOB_ID:         5744919
SLURM_JOB_NAME:       test-restart
# first run on maxcpu partition for 3 minutes
SLURM_JOB_PARTITION:  maxcpu
SLURM_SUBMIT_HOST:    max-display001.desy.de
TimeLimit:            00:03:00
# job hasn't been restarted yet
Restarts:             0
starting calculation at Sun Oct  4 23:21:00 CEST 2020
1 out of 10 done at Sun Oct  4 23:21:30 CEST 2020 
2 out of 10 done at Sun Oct  4 23:22:00 CEST 2020 
3 out of 10 done at Sun Oct  4 23:22:30 CEST 2020 
4 out of 10 done at Sun Oct  4 23:23:00 CEST 2020 
5 out of 10 done at Sun Oct  4 23:23:30 CEST 2020 
6 out of 10 done at Sun Oct  4 23:24:00 CEST 2020 
# after 3 minutes plus a grace period of ~30 seconds the job receives a TERM signal
slurmstepd: error: *** JOB 5744919 ON max-wn050 CANCELLED AT 2020-10-04T23:24:26 DUE TO TIME LIMIT ***
Terminated
executing term_handler at Sun Oct  4 23:24:26 CEST 2020


After the timeout the job-script is executed a second time, this time in the allcpu partition. The timelimit and partition specified in the job-script are overwritten by the scontrol command...

# output test-restart.out of the second run:
# note: the job keeps the jobID!
SLURM_JOB_ID:         5744919
SLURM_JOB_NAME:       test-restart
# second run on all partition with changes timelimit
SLURM_JOB_PARTITION:  allcpu
SLURM_SUBMIT_HOST:    max-display001.desy.de
TimeLimit:            01:00:00
# job has restarted once
Restarts:             1
starting calculation at Sun Oct  4 23:26:58 CEST 2020
1 out of 10 done at Sun Oct  4 23:27:28 CEST 2020 
2 out of 10 done at Sun Oct  4 23:27:58 CEST 2020 
3 out of 10 done at Sun Oct  4 23:28:28 CEST 2020 
4 out of 10 done at Sun Oct  4 23:28:58 CEST 2020 
5 out of 10 done at Sun Oct  4 23:29:28 CEST 2020 
6 out of 10 done at Sun Oct  4 23:29:58 CEST 2020 
7 out of 10 done at Sun Oct  4 23:30:28 CEST 2020 
8 out of 10 done at Sun Oct  4 23:30:58 CEST 2020 
9 out of 10 done at Sun Oct  4 23:31:28 CEST 2020 
10 out of 10 done at Sun Oct  4 23:31:58 CEST 2020 
all done at Sun Oct  4 23:31:58 CEST 2020