When your job is started by HTCondor in the NAF, its resource usage is tracked. If your job exceeds a specific resource limit, Condor can kill it and put it on hold. While some limits are strictly enforced, others are more generous and are only enforced when resources actually become tight.
For example, the job run time is strictly enforced. A default job runs as a lite-class job, which can run for up to three hours (this allows us to be more flexible with job scheduling and to route such jobs also to unused resources from other groups). If a lite job runs longer than 3h, it gets killed by Condor in any case - in that case you should rather run it with a dedicated job runtime requirement (called bide in our local terminology) asking for a longer run time.
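For example, a longer run time can be requested directly in the submit description file via the +RequestRuntime attribute described further below (the 6h value here is only an illustration):
+RequestRuntime = 21600    # ask for up to 6 hours (21600 seconds) instead of the 3h lite default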
If your job temporarily consumes more memory than the default 1500 MB, chances are good that Condor will keep it running. Normally, a node has more memory available than all its jobs need at that moment, so Condor can be a bit more relaxed about enforcing a job's memory limit. However, if such a job requires much more memory (for example due to a memory leak) or keeps using excessive memory over quite some time (while other jobs need the memory they requested), Condor will remove the job and, after too many such attempts, put it on hold.
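Similarly, if you already know that your job peaks at, say, 4 GB, it is better to request that amount up front in the submit file (the value is only an example):
RequestMemory = 4096    # request 4 GB instead of the 1.5 GB default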
When your job gets killed by Condor and put on hold, you can list all your currently held jobs with
condor_q -held USERNAME
To see what the reason for the hold was, how much memory a specific job of yours requested and actually used, and how long the job ran on a node, use
condor_q JOB.ID -af HoldReason RequestMemory MemoryUsage RequestRuntime RemoteWallClockTime JobCurrentStartDate CompletionDate
(The run time accounting is a bit more complex, as it takes into account how many cores a job has requested. E.g., a job requesting 3 cores is accounted with 3x the real time, so its wall clock time will be 3x the difference between the job's end and start time.)
(For all details on a job that is still in Condor as running or held, run 'condor_q -l JOB.ID'; for a historic/finished job, run 'condor_history -l JOB.ID'.)
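A typical inspection sequence could look like this, assuming a hypothetical job id 1234.0:
condor_q -held USERNAME                                                          # list all your held jobs with their hold reasons
condor_q 1234.0 -af HoldReason RequestMemory MemoryUsage RemoteWallClockTime     # requested vs. actually used resources of one held job
condor_history -l 1234.0                                                         # full details once the job has left the queue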
Editing an existing job on hold
When your job was put on hold because it exceeded its allocated resources or crashed a few times, you can change the on-hold job with
condor_qedit JOB.ID
and release it to be run again (now with updated requirements) with
condor_release JOB.ID
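For example, to raise the memory request of a held job and let it run again (the job id 1234.0 and the new value of 4 GB are purely illustrative):
condor_qedit 1234.0 RequestMemory 4096    # set a new memory request for the held job
condor_release 1234.0                     # release it back into the queue with the updated requirement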
Submitting jobs with specific requirements
To avoid jobs running unproductively and getting removed for exceeding limits, you should submit jobs with matching requirements if they do not fit into the default lite job scheme (1 core, 1.5 GB memory, less than 3h run time). Condor will also be better able to squeeze jobs into free resources if their requirements are accurate and close to reality, especially when they differ significantly from the defaults.
####################
# Run Time
+RequestRuntime = <seconds>   # request a runtime differing from the 3h default, up to 7 days are currently supported
####################
# basic resource types
#
RequestMemory = <MByte>       # request any amount of memory different from the 1.5 GB default in 'MiB', see also request_memory
RequestCPU = <1,2,3...>       # number of cores assigned to the job, default is 1, see also request_cpu
RequestDisk = <kByte>         # request an amount of disk different from the 3GB default, see also request_disk
#
####################
## Example
##
Requirements = OpSysAndVer == "OS-Name"   # CentOS 7 is the default, Scientific Linux 6 'SL6' is only supported until the end of November 2020!
# Requirements = ( OpSysAndVer == "CentOS7" || OpSysAndVer == "SL6")   # if you can run on both distributions. The SL6 requirement is kept here just for educational purposes as it is superfluous, since no SL6 resources are available anymore
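Putting it together, a complete submit description file with non-default requirements might look like the following sketch (the executable name, file names and resource values are placeholders, not recommendations):
# my_job.sub - illustrative submit file with non-default resource requirements
executable      = my_analysis.sh
arguments       = $(Process)
log             = job.$(Cluster).$(Process).log
output          = job.$(Cluster).$(Process).out
error           = job.$(Cluster).$(Process).err
RequestMemory   = 4096          # 4 GB instead of the 1.5 GB default
RequestCPU      = 2             # two cores instead of one
+RequestRuntime = 21600         # 6 hours instead of the 3h lite default
Requirements    = OpSysAndVer == "CentOS7"
queue 1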
Please note: Requesting more resources for a job than the default lite job scheme (1 core, 1.5 GB memory, < 3h run time) will make your job more expensive. Condor has to reserve more node resources such as CPU cores or memory for such a job, so your user priority will get worse as you use more resources - meaning that your waiting time for a free slot on a node will get longer in favour of colleagues who have used fewer resources recently.
Reducing job run times to very short intervals will also affect you negatively, as the overhead of allocating a free slot to a job might be longer than the actual run time - and thus waste resources.
In the end, all resource requirements end up in the 'Requirements' ClassAd variable, so that in an advanced use case one can define a complex requirement expression - see the Condor documentation for some examples.
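As an illustration (the node name is purely hypothetical), such a combined expression in the submit file could look like:
Requirements = ( OpSysAndVer == "CentOS7" ) && ( Machine != "badnode.desy.de" )   # pin the OS and avoid one specific node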
Additionally, you can express preferences for nodes with available resources with the rank option, for example to prefer the machines with the most free memory at the moment.
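A minimal sketch of such a preference, using the standard HTCondor rank submit command, would be:
Rank = Memory    # prefer the slots that currently advertise the most memory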