Computing : Something went wrong

This section summarizes some of the more common problems you may run into.


Job runs for a while, then goes on hold

A job can go into the hold state for different reasons. The 'hold_reason' usually gives you an idea of what is actually going on.

  • check which jobs are on hold and their hold_reason: 'condor_q -hold'
  • fix the cause named in the hold_reason and release the job using 'condor_release <jobid>'
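
The two steps above can be wrapped in small shell helpers. This is only a sketch: the function names 'list_held_jobs' and 'release_after_check' are made up for illustration, and it assumes that the job attributes 'JobStatus' (value 5 means held) and 'HoldReason' are available on your schedd, as they are in a standard HTCondor setup.

```shell
# Hypothetical helper: list every held job together with its hold reason.
# JobStatus == 5 is HTCondor's status code for "held".
list_held_jobs() {
    condor_q -constraint 'JobStatus == 5' -af ClusterId ProcId HoldReason
}

# Hypothetical helper: print the hold reason of one job, then release it.
# Call this only after the cause named in the hold reason has been fixed.
release_after_check() {
    jobid="$1"
    condor_q "$jobid" -af HoldReason
    condor_release "$jobid"
}
```

Calling 'release_after_check 10793359.0' would print the stored hold reason once more before the job is put back into the queue.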

Job goes on hold due to memory consumption

Every job has a memory reservation that is either set in your submit file or assigned automatically (default memory). To give your job some freedom, the system tolerates temporary memory overconsumption as long as the additional memory is available on the worker node at that point in time. For this kind of memory overcommitment to work, it has to stay within reasonable bounds. Currently we accept memory usage of up to 3 times the reserved memory. If the memory consumption rises above that, the job goes on hold with a corresponding hold_reason ('Memory usage too high (> 3 x requested-memory)').

You can either remove these jobs, reconfigure them, and rerun them, or adjust the memory request and release the held jobs. The requested amount of memory must be at least slightly more than 1/3 of the maximum memory your job needs:

Check the actual memory usage (in MB) of the held job:
[root@bird-htc-sched12 ~]# condor_q 10793359.0 -af MemoryUsage
48829
Set the memory request to more than 1/3 of the memory actually needed:
[root@bird-htc-sched12 ~]# condor_qedit 10793359.0 "RequestMemory = 17000"
Set attribute "RequestMemory" for 1 matching jobs.
Release the job:
[root@bird-htc-sched12 ~]# condor_release 10793359.0
Job 10793359.0 released

[root@bird-htc-sched12 ~]# condor_q 10793359.0
-- Schedd: bird-htc-sched12.desy.de : <131.169.223.40:28714> @ 08/10/20 10:12:54
OWNER  BATCH_NAME      SUBMITTED   DONE   RUN    IDLE  TOTAL JOB_IDS
avg-joe ID: 10793359   8/7  20:29   _      1      _      1 10793359.0
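
The arithmetic behind the new memory request can be sketched as a small shell function. The name 'suggest_request_memory' and the default headroom of 5 percent are assumptions made for illustration and are not part of the BIRD setup; only the factor 3 comes from the limit described above.

```shell
# Hypothetical helper: smallest RequestMemory (in MB) that keeps a job with the
# given observed MemoryUsage (in MB) within the 3x limit, plus headroom.
#   $1  observed memory usage in MB (e.g. from 'condor_q <jobid> -af MemoryUsage')
#   $2  headroom in percent (assumed default: 5)
suggest_request_memory() {
    peak_mb="$1"
    headroom_pct="${2:-5}"
    # Ceiling of peak_mb / 3 using integer arithmetic.
    min_mb=$(( (peak_mb + 2) / 3 ))
    # Add the headroom and round up to a whole MB.
    echo $(( (min_mb * (100 + headroom_pct) + 99) / 100 ))
}
```

For the job shown above, 'suggest_request_memory 48829 0' prints 16277, the bare minimum without headroom, which is why the example requests the slightly larger value 17000. The result could then be passed on with 'condor_qedit <jobid> RequestMemory <value>'.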

Job goes on hold because "Wrong or Unauthorized Project"

To submit to BIRD you need the registry resource 'BATCH'. Ask your local admin or the UCO to have it assigned to your account.