Computing: Some jobs of a cluster don't complete successfully

Of course, the following also applies to a single failing job or to any other 'something went wrong' situation!

In general, HTCondor knows three outcomes for a job: it completes successfully, it is put on 'hold', or it is 'removed':


The 'hold' state

If something goes formally wrong in the HTCondor universe that is believed to be fixable by user intervention, the job does not leave the queue but goes into the 'hold' state:

A held job
[root@bird-htc-sched13 ~]# condor_q <job id>
-- Schedd: bird-htc-sched13.desy.de : <131.169.223.41:26735> @ 11/04/19 14:14:06
OWNER   BATCH_NAME      SUBMITTED   DONE   RUN    IDLE   HOLD  TOTAL JOB_IDS
<uid> ID: <job id>  11/4  07:39      _      _      _      1      1 <job id>

Held jobs can be listed using 'condor_q -hold', which also displays the 'hold reason':

condor_q -hold
[root@bird-htc-sched13 ~]# condor_q -hold
-- Schedd: bird-htc-sched13.desy.de : <131.169.223.41:26735> @ 11/04/19 14:18:19
 ID          OWNER          HELD_SINCE  HOLD_REASON
<job id>   <uid>        11/4  07:39 Error from slot1@bird431.desy.de: Failed to open '/nfs/<snip>/PromptBkgHist<snip>20779870' as standard output: File name too long (errno 36)

The hold reason should give you an idea of why the job is on 'hold'. After fixing the issue (in the case above, the output file name was too long), you can release your held jobs using 'condor_release'.
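For example, a minimal sketch of that workflow (`<job id>` is a placeholder for your cluster id; '-af' is condor_q's auto-format option and prints just the requested ClassAd attribute):

```shell
# Print only the hold reason for one job:
condor_q -af HoldReason <job id>

# After fixing the problem, release the job again:
condor_release <job id>

# ... or release all of your held jobs at once:
condor_release -all
```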


The 'removed' state

In case of a problem that is not considered solvable by user intervention, the job goes into the 'removed' state and leaves the queue. You can list these jobs together with your successful jobs using 'condor_history':

condor_history
[root@bird-htc-sched13 ~]# condor_history
 ID     OWNER          SUBMITTED   RUN_TIME     ST COMPLETED   CMD            
<job id> <uid>       11/4  13:52   0+00:06:08 C  11/4  15:05 /nfs/dust/cms/user/<snip>/Synch/CMSSW_10_2_0_pre4
<job id> <uid>       11/4  14:41   0+00:22:00 C  11/4  15:05 /nfs/dust/cms/user/<snip>CMSSW_9_4_14/src/TopA
<job id> <uid>       11/4  08:08   0+06:45:56 X  11/4  15:05 /bin/zsh jobs/2018_TTToSemiLeptonic_TuneCP5_13TeV-powheg-pythia

Removed jobs have 'JobStatus = 3', as opposed to successful jobs with 'JobStatus = 4'. You can use this fact to list only the unsuccessful jobs:

condor_history -constraint
[root@bird-htc-sched13 ~]# condor_history -constraint 'JobStatus == 3'
 ID     OWNER          SUBMITTED   RUN_TIME     ST COMPLETED   CMD            
<job id>   <uid>       11/4  15:08              X         /bin/bash parsl.parsl.auto.1572876519.3833995.script
<job id>   <uid>       11/4  15:08              X         /bin/bash parsl.parsl.auto.1572876519.1979938.script
<job id>   <uid>       11/4  15:08              X         /bin/bash parsl.parsl.auto.1572876519.0146322.script
<job id>   <uid>       11/4  15:08              X         /bin/bash parsl.parsl.auto.1572876518.8283262.script
<job id>   <uid>       11/4  15:08              X         /bin/bash parsl.parsl.auto.1572876518.644221.script
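A couple of variants that can help to narrow things down further (again, `<job id>` is a placeholder; '-limit' and '-af' are standard condor_history options):

```shell
# Show only the last 20 removed jobs:
condor_history -constraint 'JobStatus == 3' -limit 20

# Print why a particular job was removed:
condor_history <job id> -af RemoveReason
```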


You couldn't find your jobs in condor_history?

If you could not find your jobs with condor_history to get information on why they did not run successfully, please be aware that, with sometimes millions and millions of tiny jobs per day, we have to rotate the history file regularly. Otherwise we would run out of space pretty fast (plus, even with more log space, parsing through a gigantic history with millions of jobs would probably be no fun ;) ).
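Independently of the schedd history, you can keep your own job event log, which survives the rotation and also records hold and remove events with their reasons. A sketch of the relevant lines in a submit description file (the file names are just examples):

```
# in your submit file:
log    = job.$(ClusterId).$(ProcId).log
output = job.$(ClusterId).$(ProcId).out
error  = job.$(ClusterId).$(ProcId).err
```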