Computing : Checking & Managing Your Own Jobs

Default condor_q output

By default 'condor_q' will show all of your jobs on all schedulers in the pool. This is meant to be helpful in case the hostname of your scheduler has changed due to administrative intervention, or if you submit through different schedulers.

Do not query all schedulers in the pool

You can use the command 'unalias condor_q' to change the default behaviour of condor_q so that, in the current shell, it only queries the default scheduler of the submit host you are working on.

condor_q
[chbeyer@htc-it02]~/htcondor/testjobs% condor_q

-- Schedd: bird-htc-sched04.desy.de : <131.169.56.41:9618?... @ 01/02/19 13:54:59
OWNER BATCH_NAME      SUBMITTED   DONE   RUN    IDLE   HOLD  TOTAL JOB_IDS

Total for query: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended 
Total for chbeyer: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended 
Total for all users: 6 jobs; 0 completed, 0 removed, 3 idle, 1 running, 2 held, 0 suspended


-- Schedd: bird-htc-sched14.desy.de : <131.169.223.42:9618?... @ 01/02/19 13:54:59
OWNER BATCH_NAME      SUBMITTED   DONE   RUN    IDLE   HOLD  TOTAL JOB_IDS

Total for query: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended 
Total for chbeyer: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended 
Total for all users: 1425 jobs; 0 completed, 0 removed, 205 idle, 219 running, 1001 held, 0 suspended


-- Schedd: bird-htc-sched12.desy.de : <131.169.223.40:9618?... @ 01/02/19 13:54:59
OWNER   BATCH_NAME    SUBMITTED   DONE   RUN    IDLE   HOLD  TOTAL JOB_IDS
chbeyer sleep_test   1/2  13:54      _      _      8      _      8 1612637.0 ... 1612644.0

Total for query: 8 jobs; 0 completed, 0 removed, 8 idle, 0 running, 0 held, 0 suspended 
Total for chbeyer: 8 jobs; 0 completed, 0 removed, 8 idle, 0 running, 0 held, 0 suspended 
Total for all users: 16885 jobs; 0 completed, 73 removed, 13748 idle, 3046 running, 18 held, 0 suspended


-- Schedd: bird-htc-sched11.desy.de : <131.169.223.39:9618?... @ 01/02/19 13:54:59
OWNER   BATCH_NAME    SUBMITTED   DONE   RUN    IDLE   HOLD  TOTAL JOB_IDS

Total for query: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended 
Total for chbeyer: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended 
Total for all users: 245 jobs; 0 completed, 4 removed, 0 idle, 241 running, 0 held, 0 suspended


-- Schedd: bird-htc-sched02.desy.de : <131.169.56.95:9618?... @ 01/02/19 13:54:59
OWNER   BATCH_NAME    SUBMITTED   DONE   RUN    IDLE   HOLD  TOTAL JOB_IDS

Total for query: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended 
Total for chbeyer: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended 
Total for all users: 1 jobs; 0 completed, 1 removed, 0 idle, 0 running, 0 held, 0 suspended


-- Schedd: bird-htc-sched01.desy.de : <131.169.56.32:9618?... @ 01/02/19 13:54:59
OWNER   BATCH_NAME    SUBMITTED   DONE   RUN    IDLE   HOLD  TOTAL JOB_IDS

Total for query: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended 
Total for chbeyer: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended 
Total for all users: 2 jobs; 0 completed, 2 removed, 0 idle, 0 running, 0 held, 0 suspended
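If you keep the pool-wide default, condor_q prints one "Total for all users" line per scheduler, and you may want to aggregate them. The following is a minimal sketch that sums the idle-job counts across those lines with awk; the sample file `condor_q_sample.txt` is made up for illustration (in practice you would pipe the output of condor_q into the awk command instead):

```shell
# Hypothetical sample of the per-schedd summary lines printed by condor_q.
cat <<'EOF' > condor_q_sample.txt
Total for all users: 6 jobs; 0 completed, 0 removed, 3 idle, 1 running, 2 held, 0 suspended
Total for all users: 1425 jobs; 0 completed, 0 removed, 205 idle, 219 running, 1001 held, 0 suspended
EOF

# Strip the commas, then add up the number that precedes the word "idle"
# on every "Total for all users" line.
awk '/^Total for all users/ {
       gsub(",", "")
       for (i = 1; i < NF; i++)
           if ($(i+1) == "idle") sum += $i
     }
     END { print sum " idle jobs pool-wide" }' condor_q_sample.txt
```

With the sample above this prints "208 idle jobs pool-wide" (3 + 205). The same pattern works for the running or held columns by changing the keyword.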


View all jobs in a batch

You can use 'condor_q -nobatch <clusterid>' to view all jobs in a batch; omitting <clusterid> shows all of your jobs, one line per job:

condor_q -nobatch
[chbeyer@htc-it02]~/htcondor/testjobs% condor_q                      

-- Schedd: bird-htc-sched12.desy.de : <131.169.223.40:9618?... @ 01/02/19 14:08:33
OWNER   BATCH_NAME      SUBMITTED   DONE   RUN    IDLE  TOTAL JOB_IDS
chbeyer sleep_test     1/2  14:07      _     20      _     20 1612757.0-19
chbeyer sleep_test_1   1/2  14:07      _     20      _     20 1612760.0-19

Total for query: 40 jobs; 0 completed, 0 removed, 0 idle, 40 running, 0 held, 0 suspended 
Total for chbeyer: 40 jobs; 0 completed, 0 removed, 0 idle, 40 running, 0 held, 0 suspended 
Total for all users: 21630 jobs; 0 completed, 73 removed, 18605 idle, 2934 running, 18 held, 0 suspended

[chbeyer@htc-it02]~/htcondor/testjobs% condor_q -nobatch 1612757

-- Schedd: bird-htc-sched12.desy.de : <131.169.223.40:9618?... @ 01/02/19 14:08:52
 ID         OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
1612757.0   chbeyer         1/2  14:07   0+00:01:19 R  0    0.0 sleep_runtime.sh 600
1612757.1   chbeyer         1/2  14:07   0+00:01:20 R  0    0.0 sleep_runtime.sh 600
1612757.2   chbeyer         1/2  14:07   0+00:01:19 R  0    0.0 sleep_runtime.sh 600
1612757.3   chbeyer         1/2  14:07   0+00:01:19 R  0    0.0 sleep_runtime.sh 600
1612757.4   chbeyer         1/2  14:07   0+00:01:19 R  0    0.0 sleep_runtime.sh 600
1612757.5   chbeyer         1/2  14:07   0+00:01:19 R  0    0.0 sleep_runtime.sh 600
1612757.6   chbeyer         1/2  14:07   0+00:01:20 R  0    0.0 sleep_runtime.sh 600
1612757.7   chbeyer         1/2  14:07   0+00:01:19 R  0    0.0 sleep_runtime.sh 600
1612757.8   chbeyer         1/2  14:07   0+00:01:19 R  0    0.0 sleep_runtime.sh 600
1612757.9   chbeyer         1/2  14:07   0+00:01:19 R  0    0.0 sleep_runtime.sh 600
1612757.10  chbeyer         1/2  14:07   0+00:01:19 R  0    0.0 sleep_runtime.sh 600
1612757.11  chbeyer         1/2  14:07   0+00:01:19 R  0    0.0 sleep_runtime.sh 600
1612757.12  chbeyer         1/2  14:07   0+00:01:19 R  0    0.0 sleep_runtime.sh 600
1612757.13  chbeyer         1/2  14:07   0+00:00:00 R  0    0.0 sleep_runtime.sh 600
1612757.14  chbeyer         1/2  14:07   0+00:01:19 R  0    0.0 sleep_runtime.sh 600
1612757.15  chbeyer         1/2  14:07   0+00:01:20 R  0    0.0 sleep_runtime.sh 600
1612757.16  chbeyer         1/2  14:07   0+00:01:19 R  0    0.0 sleep_runtime.sh 600
1612757.17  chbeyer         1/2  14:07   0+00:01:19 R  0    0.0 sleep_runtime.sh 600
1612757.18  chbeyer         1/2  14:07   0+00:01:19 R  0    0.0 sleep_runtime.sh 600
1612757.19  chbeyer         1/2  14:07   0+00:01:19 R  0    0.0 sleep_runtime.sh 600

Total for query: 20 jobs; 0 completed, 0 removed, 0 idle, 20 running, 0 held, 0 suspended 
Total for all users: 21651 jobs; 0 completed, 73 removed, 18595 idle, 2965 running, 18 held, 0 suspended
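The one-line-per-job output of 'condor_q -nobatch' is convenient to post-process, for example to count your jobs per state (the ST column). A minimal sketch with awk follows; the sample file `nobatch_sample.txt` is invented for illustration, and in practice you would pipe `condor_q -nobatch` into the awk command:

```shell
# Hypothetical sample of condor_q -nobatch job lines (ST is the 6th column).
cat <<'EOF' > nobatch_sample.txt
1612757.0   chbeyer         1/2  14:07   0+00:01:19 R  0    0.0 sleep_runtime.sh 600
1612757.1   chbeyer         1/2  14:07   0+00:01:20 R  0    0.0 sleep_runtime.sh 600
1612757.2   chbeyer         1/2  14:07   0+00:00:00 I  0    0.0 sleep_runtime.sh 600
EOF

# Only lines whose first field looks like a job id (cluster.proc) are
# counted; header and summary lines are skipped automatically.
awk '$1 ~ /^[0-9]+\./ { count[$6]++ }
     END { for (s in count) print s, count[s] }' nobatch_sample.txt
```

On the sample data this reports 2 running (R) and 1 idle (I) job.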


View jobs from all users

By default, condor_q will just show you information about your own jobs. To get information about all jobs in the queue, use 'condor_q -all':

condor_q -all
-- Schedd: bird-htc-sched12.desy.de : <131.169.223.40:9618?... @ 01/02/19 14:16:58
OWNER    BATCH_NAME                                 SUBMITTED   DONE   RUN    IDLE   HOLD  TOTAL JOB_IDS
finnern  Job: 4all.juno.long                      11/22 15:32    537     13      _      _    550 28.335-512
zenaiev  mg_tt1j-cp3_P0_gg_ttxg_1_all_201_4_sub1  12/18 16:05      _      1      _      _      1 779624.0
tewsalex ID: 979166                               12/21 10:18      _      _      _      1      1 979166.0
tewsalex ID: 979177                               12/21 10:18      _      _      _      1      1 979177.0
tewsalex ID: 979178                               12/21 10:18      _      _      _      1      1 979178.0
tewsalex ID: 979179                               12/21 10:18      _      _      _      1      1 979179.0
tewsalex ID: 979180                               12/21 10:18      _      _      _      1      1 979180.0
tewsalex ID: 979217                               12/21 10:18      _      _      _      1      1 979217.0
tewsalex ID: 979219                               12/21 10:18      _      _      _      1      1 979219.0
tewsalex ID: 979220                               12/21 10:18      _      _      _      1      1 979220.0
tewsalex ID: 979221                               12/21 10:18      _      _      _      1      1 979221.0
tewsalex ID: 979292                               12/21 10:19      _      _      _      1      1 979292.0
tewsalex ID: 979293                               12/21 10:19      _      _      _      1      1 979293.0
tewsalex ID: 979536                               12/21 10:19      _      _      _      1      1 979536.0
tewsalex ID: 979537                               12/21 10:19      _      _      _      1      1 979537.0
tewsalex ID: 979804                               12/21 10:20      _      _      _      1      1 979804.0
tewsalex ID: 980424                               12/21 10:23      _      _      _      1      1 980424.0
tewsalex ID: 980698                               12/21 10:24      _      _      _      1      1 980698.0
tewsalex ID: 980700                               12/21 10:24      _      _      _      1      1 980700.0
jbechtel ID: 983838                               12/21 10:37     24      _      _      _     25 983838.16
jbechtel ID: 983841                               12/21 10:37     24      _      _      _     25 983841.22
jbechtel ID: 983843                               12/21 10:37     24      _      _      _     25 983843.21
jbechtel ID: 984764                               12/21 10:40     24      _      _      _     25 984764.4
jbechtel ID: 984845                               12/21 10:41     23      _      _      _     25 984845.1-11
zenaiev  mg_tt1j-cp8_P0_gg_ttxg_1_all_4_4_sub1    12/22 15:59      _      1      _      _      1 1044044.0
zenaiev  mg_tt1j-cp8_P0_gg_ttxg_1_all_10_4_sub1   12/22 16:00      _      1      _      _      1 1044186.0
zenaiev  mg_tt1j-cp8_P0_gg_ttxg_1_all_18_4_sub1   12/22 16:01      _      1      _      _      1 1044193.0
zenaiev  mg_tt1j-cp8_P0_gg_ttxg_1_all_21_4_sub1   12/22 16:02      _      1      _      _      1 1044196.0
<snip> 

Total for query: 21771 jobs; 0 completed, 73 removed, 18488 idle, 3192 running, 18 held, 0 suspended 
Total for all users: 21771 jobs; 0 completed, 73 removed, 18488 idle, 3192 running, 18 held, 0 suspended
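When the '-all' listing is long, a quick per-owner breakdown can help you see who holds how many batches. The sketch below counts batch lines per owner (the first column) with awk; the sample file `all_sample.txt` is invented for illustration, and in practice you would pipe `condor_q -all` into the awk command:

```shell
# Hypothetical sample of condor_q -all batch lines (owner is column 1).
cat <<'EOF' > all_sample.txt
tewsalex ID: 979166    12/21 10:18      _      _      _      1      1 979166.0
tewsalex ID: 979177    12/21 10:18      _      _      _      1      1 979177.0
jbechtel ID: 983838    12/21 10:37     24      _      _      _     25 983838.16
EOF

# Count how many batch lines each owner has.
awk '{ count[$1]++ }
     END { for (o in count) print o, count[o] }' all_sample.txt
```

On the sample data this reports 2 batches for tewsalex and 1 for jbechtel.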

Determine why jobs are on hold

Sometimes jobs cannot be run successfully by HTCondor and stay in the queue as 'held' jobs. Once the initial hold reason is understood and corrected, you can release these held jobs yourself using 'condor_release'.

'condor_q -hold' will show you the hold reasons for all of your jobs that are on hold; 'condor_q -hold <jobid>' will show you the hold reason for a specific job.

The hold reason is sometimes cut off; try the following to see the entire hold reason:

[chbeyer@htc-it02]~/htcondor/testjobs% condor_q -hold -af HoldReason
condor_q -hold
[chbeyer@htc-it02]~/htcondor/testjobs% condor_q -hold

-- Schedd: bird-htc-sched12.desy.de : <131.169.223.40:9618?... @ 01/02/19 15:22:24
 ID         OWNER          HELD_SINCE  HOLD_REASON
 979166.0   snip       12/21 10:59 Error from slot1@bird657.desy.de: Failed to open '/afs/desy.de/user/s/snip/CMSSW_8_0_22/src/shorttrack/TrackRefitting/bird/job_18-12-21-10-17-26_261.sh.o979166' as standard output: File too large (errno 27)
 979177.0   snip       12/21 11:00 Error from slot1@bird657.desy.de: Failed to open '/afs/desy.de/user/s/snip/CMSSW_8_0_22/src/shorttrack/TrackRefitting/bird/job_18-12-21-10-17-26_272.sh.o979177' as standard output: File too large (errno 27)
 979178.0   snip       12/21 11:00 Error from slot1@bird306.desy.de: Failed to open '/afs/desy.de/user/s/snip/CMSSW_8_0_22/src/shorttrack/TrackRefitting/bird/job_18-12-21-10-17-26_273.sh.o979178' as standard output: File too large (errno 27)
 979179.0   snip       12/21 11:00 Error from slot1@bird474.desy.de: Failed to open '/afs/desy.de/user/s/snip/CMSSW_8_0_22/src/shorttrack/TrackRefitting/bird/job_18-12-21-10-17-26_274.sh.o979179' as standard output: File too large (errno 27)
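If many jobs are held for the same underlying problem, it can be useful to collapse the hold reasons into distinct messages before deciding what to fix. A minimal sketch: strip the job-specific prefix (everything up to the last ': ') and count identical remainders. The sample file `holdreasons_sample.txt` and the paths in it are made up for illustration; in practice you would pipe the output of `condor_q -hold -af HoldReason` into the sed command:

```shell
# Hypothetical sample of full HoldReason strings as printed by
# condor_q -hold -af HoldReason.
cat <<'EOF' > holdreasons_sample.txt
Error from slot1@bird657.desy.de: Failed to open '/afs/desy.de/user/x/xuser/job_1.sh.o979166' as standard output: File too large (errno 27)
Error from slot1@bird306.desy.de: Failed to open '/afs/desy.de/user/x/xuser/job_2.sh.o979178' as standard output: File too large (errno 27)
EOF

# Keep only the text after the last ': ' and count identical reasons.
sed 's/.*: //' holdreasons_sample.txt | sort | uniq -c
```

On the sample data this reports the message "File too large (errno 27)" twice. Once the cause is fixed, the affected jobs can be released with 'condor_release <jobid>'.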

Detailed information for a job

Similarly, you can request a more detailed report on a job with 'condor_q -better-analyze', which returns information about HTCondor's matchmaking decisions, such as which requirements the job makes and which machines could have fulfilled them:

condor_q -better-analyze
> condor_q -better-analyze 1234567.0  ...
  Job 1234567.000 defines the following attributes:
  ...
  1234567.000: Job is held.
  Hold reason: Memory usage too high (> 3 x requested-memory)
  ...


Find out where jobs are running

To see which computers your jobs are running on, use:

condor_q -nobatch -run
[chbeyer@htc-it02]~/htcondor/testjobs% condor_q -nobatch -run

-- Schedd: bird-htc-sched12.desy.de : <131.169.223.40:9618?... @ 01/02/19 15:32:52
 ID         OWNER            SUBMITTED     RUN_TIME HOST(S)
1613982.0   chbeyer         1/2  15:32   0+00:00:15 slot1@bird451.desy.de
1613982.1   chbeyer         1/2  15:32   0+00:00:14 slot1@bird454.desy.de
1613982.2   chbeyer         1/2  15:32   0+00:00:14 slot1@bird445.desy.de
1613982.3   chbeyer         1/2  15:32   0+00:00:14 slot1@bird455.desy.de
1613982.4   chbeyer         1/2  15:32   0+00:00:14 slot1@bird458.desy.de
1613982.5   chbeyer         1/2  15:32   0+00:00:14 slot1@bird443.desy.de
1613982.6   chbeyer         1/2  15:32   0+00:00:14 slot1@bird623.desy.de
1613982.7   chbeyer         1/2  15:32   0+00:00:10 slot1@bird629.desy.de
1613982.8   chbeyer         1/2  15:32   0+00:00:14 slot1@bird654.desy.de
1613982.9   chbeyer         1/2  15:32   0+00:00:14 slot1@bird582.desy.de
1613982.10  chbeyer         1/2  15:32   0+00:00:14 slot1@bird649.desy.de
1613982.11  chbeyer         1/2  15:32   0+00:00:14 slot1@bird584.desy.de
1613982.12  chbeyer         1/2  15:32   0+00:00:14 slot1@bird526.desy.de
1613982.13  chbeyer         1/2  15:32   0+00:00:12 slot1@bird645.desy.de
1613982.14  chbeyer         1/2  15:32   0+00:00:10 slot1@bird523.desy.de
1613982.15  chbeyer         1/2  15:32   0+00:00:12 slot1@bird541.desy.de
1613982.16  chbeyer         1/2  15:32   0+00:00:12 slot1@bird562.desy.de
<snip>
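To see at a glance how your running jobs are distributed over the worker nodes, you can count jobs per host from the last column of the '-run' output. A minimal sketch follows; the sample file `run_sample.txt` is invented for illustration, and in practice you would pipe `condor_q -nobatch -run` into the awk command:

```shell
# Hypothetical sample of condor_q -nobatch -run lines (HOST is the last field).
cat <<'EOF' > run_sample.txt
1613982.0   chbeyer         1/2  15:32   0+00:00:15 slot1@bird451.desy.de
1613982.1   chbeyer         1/2  15:32   0+00:00:14 slot1@bird451.desy.de
1613982.2   chbeyer         1/2  15:32   0+00:00:14 slot1@bird445.desy.de
EOF

# Take the last field, drop the leading "slotN@" part, and count jobs
# per host, busiest host first.
awk '{ print $NF }' run_sample.txt \
    | sed 's/^slot[^@]*@//' \
    | sort | uniq -c | sort -rn
```

On the sample data this shows 2 jobs on bird451.desy.de and 1 on bird445.desy.de.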