Of course, the following also applies to a failing 'single' job or to any other 'something went wrong' situation!
In general, HTCondor knows three outcomes after a job has finished: successful completion, the 'hold' state, and the 'removed' state.
The 'hold' state
If something goes wrong formally in the HTCondor universe that is believed to be fixable by user intervention, the job does not leave the queue but goes into the 'hold' state:
[root@bird-htc-sched13 ~]# condor_q <job id>

-- Schedd: bird-htc-sched13.desy.de : <131.169.223.41:26735> @ 11/04/19 14:14:06
OWNER   BATCH_NAME     SUBMITTED   DONE  RUN  IDLE  HOLD  TOTAL  JOB_IDS
<uid>   ID: <job id>   11/4 07:39    _    _     _     1      1   <job id>
Held jobs can be listed using 'condor_q -hold', which also displays the 'hold reason':
[root@bird-htc-sched13 ~]# condor_q -hold

-- Schedd: bird-htc-sched13.desy.de : <131.169.223.41:26735> @ 11/04/19 14:18:19
 ID        OWNER   HELD_SINCE  HOLD_REASON
<job id>   <uid>   11/4 07:39  Error from slot1@bird431.desy.de: Failed to open '/nfs/<snip>/PromptBkgHist<snip>20779870' as standard output: File name too long (errno 36)
The hold reason should give you an idea why the job is on 'hold'. After having fixed the issue (in the case above the output file name was too long), you can release your held jobs using 'condor_release' ...
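A minimal sketch of releasing jobs after the fix (the job id is just a placeholder):

# release a single held job after fixing the underlying problem
condor_release <job id>

# or release all of your currently held jobs at once
condor_release -all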
The 'removed' state
In case of a 'problem' that is not considered to be solvable by user intervention, the job goes into the 'removed' state and leaves the queue. You can list these jobs together with your successful jobs using 'condor_history':
[root@bird-htc-sched13 ~]# condor_history
 ID       OWNER   SUBMITTED   RUN_TIME    ST  COMPLETED   CMD
<job id>  <uid>   11/4 13:52  0+00:06:08  C   11/4 15:05  /nfs/dust/cms/user/<snip>/Synch/CMSSW_10_2_0_pre4
<job id>  <uid>   11/4 14:41  0+00:22:00  C   11/4 15:05  /nfs/dust/cms/user/<snip>CMSSW_9_4_14/src/TopA
<job id>  <uid>   11/4 08:08  0+06:45:56  X   11/4 15:05  /bin/zsh jobs/2018_TTToSemiLeptonic_TuneCP5_13TeV-powheg-pythia
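With many finished jobs the full listing can get long; a hedged example of narrowing it down (the job id is a placeholder):

# show only the history entry of one specific job
condor_history <job id>

# or show only the most recent 20 entries
condor_history -limit 20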
Removed jobs have 'JobStatus = 3' as opposed to successful jobs with 'JobStatus = 4'; you can use this fact to list only the unsuccessful jobs:
[root@bird-htc-sched13 ~]# condor_history -constraint 'JobStatus == 3'
 ID       OWNER   SUBMITTED   RUN_TIME  ST  COMPLETED  CMD
<job id>  <uid>   11/4 15:08            X              /bin/bash parsl.parsl.auto.1572876519.3833995.script
<job id>  <uid>   11/4 15:08            X              /bin/bash parsl.parsl.auto.1572876519.1979938.script
<job id>  <uid>   11/4 15:08            X              /bin/bash parsl.parsl.auto.1572876519.0146322.script
<job id>  <uid>   11/4 15:08            X              /bin/bash parsl.parsl.auto.1572876518.8283262.script
<job id>  <uid>   11/4 15:08            X              /bin/bash parsl.parsl.auto.1572876518.644221.script
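To dig into why one particular job was removed, you can also query its classad attributes from the history; a sketch, assuming the placeholder job id (for many removed jobs the RemoveReason attribute carries the explanation):

# print the reason why this job was removed
condor_history <job id> -af RemoveReason

# or dump the full classad of the job for inspection
condor_history <job id> -long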
You couldn't find your jobs in condor_history?
If you couldn't find your jobs with condor_history to get information on why they did not run successfully, please be aware that, with sometimes millions and millions of tiny jobs over a day,
we have to rotate the history file regularly - or else we would run out of space pretty fast (plus: even with more log space, parsing through a gigantic history with millions++ jobs would probably be no fun ;) )
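If you need information that outlives the history rotation, one option is to let HTCondor write a per-job user log that you keep yourself; a minimal sketch of a submit description (all file names are just placeholders):

# the user log records the job's lifecycle: submit, execute, hold, remove, terminate
executable = myjob.sh
log        = myjob.log
output     = myjob.out
error      = myjob.err
queue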