Resubmit job when reaching timelimit
Slurm allows a job to be requeued upon preemption, but not when it runs into its time limit. Occasionally you want to resubmit such a job automatically, possibly with a different time limit, or continue the incomplete calculation. To do so, you need to trap the signal sent to the job when it reaches the time limit and requeue the job inside a signal handler. A simple example is shown below.
#!/bin/bash -l
#SBATCH --job-name=test-restart
#SBATCH --output=test-restart.out
#SBATCH --time=0-00:03:00
#SBATCH --partition=maxcpu
unset LD_PRELOAD

# the sleep-loop at the end runs for max_iteration*30s
max_iteration=10

# only allow a single restart of the job
max_restarts=1

# new partition and timelimit for the 2nd and subsequent job runs
alt_partition=allcpu
alt_timelimit=0-01:00:00

# gather some information about the job
scontext=$(scontrol show job $SLURM_JOB_ID)
restarts=$(echo "$scontext" | grep -o 'Restarts=.' | cut -d= -f2)
outfile=$(echo "$scontext" | grep 'StdOut=' | cut -d= -f2)
errfile=$(echo "$scontext" | grep 'StdErr=' | cut -d= -f2)
timelimit=$(echo "$scontext" | grep -o 'TimeLimit=.*' | awk '{print $1}' | cut -d= -f2)

# term handler
# the function is executed once the job receives the TERM signal
term_handler()
{
    echo "executing term_handler at $(date)"
    if [[ $restarts -lt $max_restarts ]]; then
        # copy the logfile; it will be overwritten by the 2nd run
        cp -v $outfile $outfile.$restarts
        # requeue the job and put it on hold; it's not possible to change the partition otherwise
        scontrol requeuehold $SLURM_JOB_ID
        # change timelimit and partition
        scontrol update JobID=$SLURM_JOB_ID TimeLimit=$alt_timelimit Partition=$alt_partition
        # release the job; it will wait in the queue for ~2 minutes before the 2nd run can start
        scontrol release $SLURM_JOB_ID
    fi
}

# register the function handling the TERM signal
trap 'term_handler' TERM

# print some job information
cat <<EOF
SLURM_JOB_ID:        $SLURM_JOB_ID
SLURM_JOB_NAME:      $SLURM_JOB_NAME
SLURM_JOB_PARTITION: $SLURM_JOB_PARTITION
SLURM_SUBMIT_HOST:   $SLURM_SUBMIT_HOST
TimeLimit:           $timelimit
Restarts:            $restarts
EOF

# the actual "calculation"
echo "starting calculation at $(date)"
i=0
while [[ $i -lt $max_iteration ]]; do
    sleep 30
    i=$(($i+1))
    echo "$i out of $max_iteration done at $(date)"
done
echo "all done at $(date)"
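The dummy calculation above simply starts from iteration 0 again after the requeue. To actually continue an incomplete calculation, the job has to record its progress somewhere. A minimal sketch, assuming a hypothetical checkpoint file (checkpoint.$SLURM_JOB_ID) and meant as a drop-in replacement for the sleep-loop at the end of the script:

# minimal checkpoint/restart sketch: replaces the sleep-loop above
# checkpoint.$SLURM_JOB_ID is a hypothetical file name; since a requeued job
# keeps its job id, the 2nd run finds the file written by the 1st run
ckpt=checkpoint.$SLURM_JOB_ID
i=0
# resume from the last recorded iteration, if a previous run left a checkpoint
[[ -f $ckpt ]] && i=$(cat $ckpt)
echo "starting calculation at iteration $i at $(date)"
while [[ $i -lt $max_iteration ]]; do
    sleep 30
    i=$(($i+1))
    # record progress after every iteration
    echo $i > $ckpt
    echo "$i out of $max_iteration done at $(date)"
done
rm -f $ckpt
echo "all done at $(date)"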
The job script above will run twice (unless it finishes before the time limit). The first run, shown in the output below, completes roughly half of the iterations in the maxcpu partition before it runs into the time limit.
# output test-restart.out.0 of the first run:
# note: the job keeps the jobID!
SLURM_JOB_ID:        5744919
SLURM_JOB_NAME:      test-restart
# first run on maxcpu partition for 3 minutes
SLURM_JOB_PARTITION: maxcpu
SLURM_SUBMIT_HOST:   max-display001.desy.de
TimeLimit:           00:03:00
# job hasn't been restarted yet
Restarts:            0
starting calculation at Sun Oct 4 23:21:00 CEST 2020
1 out of 10 done at Sun Oct 4 23:21:30 CEST 2020
2 out of 10 done at Sun Oct 4 23:22:00 CEST 2020
3 out of 10 done at Sun Oct 4 23:22:30 CEST 2020
4 out of 10 done at Sun Oct 4 23:23:00 CEST 2020
5 out of 10 done at Sun Oct 4 23:23:30 CEST 2020
6 out of 10 done at Sun Oct 4 23:24:00 CEST 2020
# after 3 minutes plus a grace period of ~30 seconds the job receives a TERM signal
slurmstepd: error: *** JOB 5744919 ON max-wn050 CANCELLED AT 2020-10-04T23:24:26 DUE TO TIME LIMIT ***
Terminated
executing term_handler at Sun Oct 4 23:24:26 CEST 2020
After the timeout the job script is executed a second time, this time in the allcpu partition. The time limit and partition specified in the #SBATCH directives are overridden by the scontrol update command in the signal handler.
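While the requeued job is still known to Slurm, the new settings can also be checked from a login node. A quick sketch; 5744919 is the job id from this example and has to be replaced by your own:

# show the restart count, time limit and partition of the requeued job
scontrol show job 5744919 | grep -Eo 'Restarts=[0-9]+|TimeLimit=[^ ]+|Partition=[^ ]+'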
# output test-restart.out of the second run:
# note: the job keeps the jobID!
SLURM_JOB_ID:        5744919
SLURM_JOB_NAME:      test-restart
# second run on allcpu partition with changed timelimit
SLURM_JOB_PARTITION: allcpu
SLURM_SUBMIT_HOST:   max-display001.desy.de
TimeLimit:           01:00:00
# job has restarted once
Restarts:            1
starting calculation at Sun Oct 4 23:26:58 CEST 2020
1 out of 10 done at Sun Oct 4 23:27:28 CEST 2020
2 out of 10 done at Sun Oct 4 23:27:58 CEST 2020
3 out of 10 done at Sun Oct 4 23:28:28 CEST 2020
4 out of 10 done at Sun Oct 4 23:28:58 CEST 2020
5 out of 10 done at Sun Oct 4 23:29:28 CEST 2020
6 out of 10 done at Sun Oct 4 23:29:58 CEST 2020
7 out of 10 done at Sun Oct 4 23:30:28 CEST 2020
8 out of 10 done at Sun Oct 4 23:30:58 CEST 2020
9 out of 10 done at Sun Oct 4 23:31:28 CEST 2020
10 out of 10 done at Sun Oct 4 23:31:58 CEST 2020
all done at Sun Oct 4 23:31:58 CEST 2020
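In the example the handler only has the roughly 30 second grace period between the TERM signal and the final KILL to copy files and requeue the job. If the cleanup needs more time, sbatch's --signal option can request a warning signal a chosen number of seconds before the time limit. A minimal sketch, assuming SIGUSR1 and a 120 second head start; neither is used in the example above:

#!/bin/bash -l
#SBATCH --job-name=test-restart
#SBATCH --time=0-00:03:00
# ask Slurm to send USR1 to the batch shell only (B:) ~120 seconds before the time limit
#SBATCH --signal=B:USR1@120

requeue_handler()
{
    echo "received USR1, requeueing at $(date)"
    # requeue the current run; Slurm terminates it and starts it again later
    scontrol requeue $SLURM_JOB_ID
}
trap 'requeue_handler' USR1

# run the payload in the background and wait on it, so the batch shell
# can react to the signal immediately instead of only after the current
# foreground command has finished
sleep 600 &
wait

Running the payload in the background is important here: with B: the signal is delivered only to the batch shell, and bash executes a trap handler only once it is not blocked in a foreground command.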