Default condor_q output
By default 'condor_q' will show all of your jobs on all schedulers in the pool. This is meant to be helpful in case the hostname of your scheduler has changed due to administrative intervention, or if you submit through different schedulers.
Do not query all schedulers in the pool
You can use the command 'unalias condor_q' to change this default behaviour so that condor_q only shows the default scheduler of the submit host you are working on, for the current shell. The default output, querying all schedulers in the pool, looks like this:
[chbeyer@htc-it02]~/htcondor/testjobs% condor_q

-- Schedd: bird-htc-sched04.desy.de : <131.169.56.41:9618?... @ 01/02/19 13:54:59
OWNER   BATCH_NAME   SUBMITTED   DONE   RUN   IDLE   HOLD   TOTAL JOB_IDS

Total for query: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended
Total for chbeyer: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended
Total for all users: 6 jobs; 0 completed, 0 removed, 3 idle, 1 running, 2 held, 0 suspended

-- Schedd: bird-htc-sched14.desy.de : <131.169.223.42:9618?... @ 01/02/19 13:54:59
OWNER   BATCH_NAME   SUBMITTED   DONE   RUN   IDLE   HOLD   TOTAL JOB_IDS

Total for query: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended
Total for chbeyer: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended
Total for all users: 1425 jobs; 0 completed, 0 removed, 205 idle, 219 running, 1001 held, 0 suspended

-- Schedd: bird-htc-sched12.desy.de : <131.169.223.40:9618?... @ 01/02/19 13:54:59
OWNER   BATCH_NAME   SUBMITTED   DONE   RUN   IDLE   HOLD   TOTAL JOB_IDS
chbeyer sleep_test   1/2 13:54      _     _      8      _       8 1612637.0 ... 1612644.0

Total for query: 8 jobs; 0 completed, 0 removed, 8 idle, 0 running, 0 held, 0 suspended
Total for chbeyer: 8 jobs; 0 completed, 0 removed, 8 idle, 0 running, 0 held, 0 suspended
Total for all users: 16885 jobs; 0 completed, 73 removed, 13748 idle, 3046 running, 18 held, 0 suspended

-- Schedd: bird-htc-sched11.desy.de : <131.169.223.39:9618?... @ 01/02/19 13:54:59
OWNER   BATCH_NAME   SUBMITTED   DONE   RUN   IDLE   HOLD   TOTAL JOB_IDS

Total for query: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended
Total for chbeyer: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended
Total for all users: 245 jobs; 0 completed, 4 removed, 0 idle, 241 running, 0 held, 0 suspended

-- Schedd: bird-htc-sched02.desy.de : <131.169.56.95:9618?... @ 01/02/19 13:54:59
OWNER   BATCH_NAME   SUBMITTED   DONE   RUN   IDLE   HOLD   TOTAL JOB_IDS

Total for query: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended
Total for chbeyer: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended
Total for all users: 1 jobs; 0 completed, 1 removed, 0 idle, 0 running, 0 held, 0 suspended

-- Schedd: bird-htc-sched01.desy.de : <131.169.56.32:9618?... @ 01/02/19 13:54:59
OWNER   BATCH_NAME   SUBMITTED   DONE   RUN   IDLE   HOLD   TOTAL JOB_IDS

Total for query: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended
Total for chbeyer: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended
Total for all users: 2 jobs; 0 completed, 2 removed, 0 idle, 0 running, 0 held, 0 suspended
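A minimal sketch of the workflow described above, assuming the pool-wide query really is provided via a shell alias and not via your personal HTCondor configuration:

# drop the site-provided alias for this shell only
unalias condor_q

# subsequent calls query only the default scheduler of this submit host
condor_q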
View all jobs in a batch
You can use 'condor_q -nobatch <clusterid>' to view all jobs in a batch with one line per job; omitting <clusterid> shows all of your jobs, one line per job:
[chbeyer@htc-it02]~/htcondor/testjobs% condor_q

-- Schedd: bird-htc-sched12.desy.de : <131.169.223.40:9618?... @ 01/02/19 14:08:33
OWNER   BATCH_NAME     SUBMITTED   DONE   RUN   IDLE   TOTAL JOB_IDS
chbeyer sleep_test     1/2 14:07      _    20      _      20 1612757.0-19
chbeyer sleep_test_1   1/2 14:07      _    20      _      20 1612760.0-19

Total for query: 40 jobs; 0 completed, 0 removed, 0 idle, 40 running, 0 held, 0 suspended
Total for chbeyer: 40 jobs; 0 completed, 0 removed, 0 idle, 40 running, 0 held, 0 suspended
Total for all users: 21630 jobs; 0 completed, 73 removed, 18605 idle, 2934 running, 18 held, 0 suspended

[chbeyer@htc-it02]~/htcondor/testjobs% condor_q -nobatch 1612757

-- Schedd: bird-htc-sched12.desy.de : <131.169.223.40:9618?... @ 01/02/19 14:08:52
 ID          OWNER     SUBMITTED   RUN_TIME     ST PRI SIZE CMD
1612757.0    chbeyer   1/2 14:07   0+00:01:19   R  0   0.0  sleep_runtime.sh 600
1612757.1    chbeyer   1/2 14:07   0+00:01:20   R  0   0.0  sleep_runtime.sh 600
1612757.2    chbeyer   1/2 14:07   0+00:01:19   R  0   0.0  sleep_runtime.sh 600
1612757.3    chbeyer   1/2 14:07   0+00:01:19   R  0   0.0  sleep_runtime.sh 600
1612757.4    chbeyer   1/2 14:07   0+00:01:19   R  0   0.0  sleep_runtime.sh 600
1612757.5    chbeyer   1/2 14:07   0+00:01:19   R  0   0.0  sleep_runtime.sh 600
1612757.6    chbeyer   1/2 14:07   0+00:01:20   R  0   0.0  sleep_runtime.sh 600
1612757.7    chbeyer   1/2 14:07   0+00:01:19   R  0   0.0  sleep_runtime.sh 600
1612757.8    chbeyer   1/2 14:07   0+00:01:19   R  0   0.0  sleep_runtime.sh 600
1612757.9    chbeyer   1/2 14:07   0+00:01:19   R  0   0.0  sleep_runtime.sh 600
1612757.10   chbeyer   1/2 14:07   0+00:01:19   R  0   0.0  sleep_runtime.sh 600
1612757.11   chbeyer   1/2 14:07   0+00:01:19   R  0   0.0  sleep_runtime.sh 600
1612757.12   chbeyer   1/2 14:07   0+00:01:19   R  0   0.0  sleep_runtime.sh 600
1612757.13   chbeyer   1/2 14:07   0+00:00:00   R  0   0.0  sleep_runtime.sh 600
1612757.14   chbeyer   1/2 14:07   0+00:01:19   R  0   0.0  sleep_runtime.sh 600
1612757.15   chbeyer   1/2 14:07   0+00:01:20   R  0   0.0  sleep_runtime.sh 600
1612757.16   chbeyer   1/2 14:07   0+00:01:19   R  0   0.0  sleep_runtime.sh 600
1612757.17   chbeyer   1/2 14:07   0+00:01:19   R  0   0.0  sleep_runtime.sh 600
1612757.18   chbeyer   1/2 14:07   0+00:01:19   R  0   0.0  sleep_runtime.sh 600
1612757.19   chbeyer   1/2 14:07   0+00:01:19   R  0   0.0  sleep_runtime.sh 600

Total for query: 20 jobs; 0 completed, 0 removed, 0 idle, 20 running, 0 held, 0 suspended
Total for all users: 21651 jobs; 0 completed, 73 removed, 18595 idle, 2965 running, 18 held, 0 suspended
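If the default columns are not what you need, you can also print arbitrary job ClassAd attributes with 'condor_q -af' (autoformat), the same option used for hold reasons further below. A small sketch, reusing the cluster id from the example above; the chosen attributes are just an illustration:

# one line per job, showing only the selected ClassAd attributes
condor_q 1612757 -af ProcId JobStatus RemoteHost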
View jobs from all users
By default, condor_q will only show information about your own jobs. To get information about all jobs in the queue, use 'condor_q -all':
-- Schedd: bird-htc-sched12.desy.de : <131.169.223.40:9618?... @ 01/02/19 14:16:58
OWNER    BATCH_NAME                                SUBMITTED     DONE   RUN   IDLE   HOLD   TOTAL JOB_IDS
finnern  Job: 4all.juno.long                       11/22 15:32    537    13      _      _     550 28.335-512
zenaiev  mg_tt1j-cp3_P0_gg_ttxg_1_all_201_4_sub1   12/18 16:05      _     1      _      _       1 779624.0
tewsalex ID: 979166                                12/21 10:18      _     _      _      1       1 979166.0
tewsalex ID: 979177                                12/21 10:18      _     _      _      1       1 979177.0
tewsalex ID: 979178                                12/21 10:18      _     _      _      1       1 979178.0
tewsalex ID: 979179                                12/21 10:18      _     _      _      1       1 979179.0
tewsalex ID: 979180                                12/21 10:18      _     _      _      1       1 979180.0
tewsalex ID: 979217                                12/21 10:18      _     _      _      1       1 979217.0
tewsalex ID: 979219                                12/21 10:18      _     _      _      1       1 979219.0
tewsalex ID: 979220                                12/21 10:18      _     _      _      1       1 979220.0
tewsalex ID: 979221                                12/21 10:18      _     _      _      1       1 979221.0
tewsalex ID: 979292                                12/21 10:19      _     _      _      1       1 979292.0
tewsalex ID: 979293                                12/21 10:19      _     _      _      1       1 979293.0
tewsalex ID: 979536                                12/21 10:19      _     _      _      1       1 979536.0
tewsalex ID: 979537                                12/21 10:19      _     _      _      1       1 979537.0
tewsalex ID: 979804                                12/21 10:20      _     _      _      1       1 979804.0
tewsalex ID: 980424                                12/21 10:23      _     _      _      1       1 980424.0
tewsalex ID: 980698                                12/21 10:24      _     _      _      1       1 980698.0
tewsalex ID: 980700                                12/21 10:24      _     _      _      1       1 980700.0
jbechtel ID: 983838                                12/21 10:37     24     _      _      _      25 983838.16
jbechtel ID: 983841                                12/21 10:37     24     _      _      _      25 983841.22
jbechtel ID: 983843                                12/21 10:37     24     _      _      _      25 983843.21
jbechtel ID: 984764                                12/21 10:40     24     _      _      _      25 984764.4
jbechtel ID: 984845                                12/21 10:41     23     _      _      _      25 984845.1-11
zenaiev  mg_tt1j-cp8_P0_gg_ttxg_1_all_4_4_sub1     12/22 15:59      _     1      _      _       1 1044044.0
zenaiev  mg_tt1j-cp8_P0_gg_ttxg_1_all_10_4_sub1    12/22 16:00      _     1      _      _       1 1044186.0
zenaiev  mg_tt1j-cp8_P0_gg_ttxg_1_all_18_4_sub1    12/22 16:01      _     1      _      _       1 1044193.0
zenaiev  mg_tt1j-cp8_P0_gg_ttxg_1_all_21_4_sub1    12/22 16:02      _     1      _      _       1 1044196.0
<snip>

Total for query: 21771 jobs; 0 completed, 73 removed, 18488 idle, 3192 running, 18 held, 0 suspended
Total for all users: 21771 jobs; 0 completed, 73 removed, 18488 idle, 3192 running, 18 held, 0 suspended
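If the full listing is too long, you can combine '-all' with a constraint on the job ClassAds; a small sketch that restricts the output to one user (the username here is purely illustrative):

# show only the jobs of one particular user
condor_q -all -constraint 'Owner == "someuser"'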
Determine why jobs are on hold
Sometimes jobs cannot be run successfully by HTCondor and stay in the queue as 'held' jobs. You can release these held jobs yourself using 'condor_release' once the initial hold reason is understood and corrected.
'condor_q -hold' will show you the hold reasons for all of your jobs that are on hold; 'condor_q -hold <jobid>' will show you the hold reason for a specific job.
The hold reason is sometimes cut off; to see the entire hold reason, try the following:
[chbeyer@htc-it02]~/htcondor/testjobs% condor_q -hold -af HoldReason
[chbeyer@htc-it02]~/htcondor/testjobs% condor_q -hold

-- Schedd: bird-htc-sched12.desy.de : <131.169.223.40:9618?... @ 01/02/19 15:22:24
 ID         OWNER   HELD_SINCE    HOLD_REASON
979166.0    snip    12/21 10:59   Error from slot1@bird657.desy.de: Failed to open '/afs/desy.de/user/s/snip/CMSSW_8_0_22/src/shorttrack/TrackRefitting/bird/job_18-12-21-10-17-26_261.sh.o979166' as standard output: File too large (errno 27)
979177.0    snip    12/21 11:00   Error from slot1@bird657.desy.de: Failed to open '/afs/desy.de/user/s/snip/CMSSW_8_0_22/src/shorttrack/TrackRefitting/bird/job_18-12-21-10-17-26_272.sh.o979177' as standard output: File too large (errno 27)
979178.0    snip    12/21 11:00   Error from slot1@bird306.desy.de: Failed to open '/afs/desy.de/user/s/snip/CMSSW_8_0_22/src/shorttrack/TrackRefitting/bird/job_18-12-21-10-17-26_273.sh.o979178' as standard output: File too large (errno 27)
979179.0    snip    12/21 11:00   Error from slot1@bird474.desy.de: Failed to open '/afs/desy.de/user/s/snip/CMSSW_8_0_22/src/shorttrack/TrackRefitting/bird/job_18-12-21-10-17-26_274.sh.o979179' as standard output: File too large (errno 27)
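Once the underlying problem is fixed (in the example above: standard output files growing too large), the jobs can be put back into the queue with 'condor_release', either per job or for all of your held jobs at once:

# release a single held job (job id taken from the example above)
condor_release 979166.0

# or release all of your held jobs in one go
condor_release -all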
Detailed information for a job
Similarly, you can request a more detailed report on a job with 'condor_q -better-analyze <jobid>', which returns information about the matchmaking decisions made by HTCondor, such as which requirements the job defines and which nodes could have fulfilled them:
> condor_q -better-analyze 1234567.0
...
Job 1234567.000 defines the following attributes:
...
1234567.000: Job is held.
Hold reason: Memory usage too high (> 3 x requested-memory)
...
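For the hold reason shown above, one possible remedy (a sketch, assuming the job simply needs more memory than was originally requested) is to raise the memory request of the held job with 'condor_qedit' and then release it; the value of 4096 MiB is just an example:

# raise the memory request of the held job (RequestMemory is in MiB)
condor_qedit 1234567.0 RequestMemory 4096

# put the job back into the idle state so it can be matched again
condor_release 1234567.0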
Find out where jobs are running
To see which computers your jobs are running on, use:
[chbeyer@htc-it02]~/htcondor/testjobs% condor_q -nobatch -run

-- Schedd: bird-htc-sched12.desy.de : <131.169.223.40:9618?... @ 01/02/19 15:32:52
 ID          OWNER     SUBMITTED   RUN_TIME     HOST(S)
1613982.0    chbeyer   1/2 15:32   0+00:00:15   slot1@bird451.desy.de
1613982.1    chbeyer   1/2 15:32   0+00:00:14   slot1@bird454.desy.de
1613982.2    chbeyer   1/2 15:32   0+00:00:14   slot1@bird445.desy.de
1613982.3    chbeyer   1/2 15:32   0+00:00:14   slot1@bird455.desy.de
1613982.4    chbeyer   1/2 15:32   0+00:00:14   slot1@bird458.desy.de
1613982.5    chbeyer   1/2 15:32   0+00:00:14   slot1@bird443.desy.de
1613982.6    chbeyer   1/2 15:32   0+00:00:14   slot1@bird623.desy.de
1613982.7    chbeyer   1/2 15:32   0+00:00:10   slot1@bird629.desy.de
1613982.8    chbeyer   1/2 15:32   0+00:00:14   slot1@bird654.desy.de
1613982.9    chbeyer   1/2 15:32   0+00:00:14   slot1@bird582.desy.de
1613982.10   chbeyer   1/2 15:32   0+00:00:14   slot1@bird649.desy.de
1613982.11   chbeyer   1/2 15:32   0+00:00:14   slot1@bird584.desy.de
1613982.12   chbeyer   1/2 15:32   0+00:00:14   slot1@bird526.desy.de
1613982.13   chbeyer   1/2 15:32   0+00:00:12   slot1@bird645.desy.de
1613982.14   chbeyer   1/2 15:32   0+00:00:10   slot1@bird523.desy.de
1613982.15   chbeyer   1/2 15:32   0+00:00:12   slot1@bird541.desy.de
1613982.16   chbeyer   1/2 15:32   0+00:00:12   slot1@bird562.desy.de
<snip>
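To narrow this down to a single machine you can again use a constraint on the job ClassAds; a small sketch using one of the hosts from the example above:

# show only your running jobs on one particular execute node
condor_q -nobatch -run -constraint 'RemoteHost == "slot1@bird451.desy.de"'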