Computing : job run times and job bugs

Job run times can have a significant impact on job and overall cluster efficiency, and thus on all users of the cluster.

Background

In Condor's work model, a worker node asks a central negotiator service for new work, i.e. jobs – and, vice versa, a scheduler, to which jobs are submitted, asks for resources to run its jobs. The negotiator then tries to match these requests for resources and for work. As these negotiations are done in cycles, there is an organisational overhead for each brokering of a new job to a freed job slot on a worker node. If a job runs for less time than a negotiation cycle takes, the resources of the worker node sit idle for the rest of the cycle and are wasted.
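
As a rough illustration of the effect, the toy model below estimates how much of a slot's wall time a job of a given length actually uses if each brokered job costs the slot up to one negotiation cycle of idle time. The cycle length used here is an assumed example value, not the actual BIRD configuration.

    # Toy model: each brokered job costs the slot up to one negotiation
    # cycle of idle time before the next match arrives. The cycle length
    # below is an assumed example value, not the actual BIRD setting.
    def slot_efficiency(job_runtime_s: float, negotiation_cycle_s: float = 300.0) -> float:
        """Fraction of the slot's wall time spent actually running the job."""
        return job_runtime_s / (job_runtime_s + negotiation_cycle_s)

    for runtime in (10, 60, 300, 3600):
        print(f"{runtime:5d} s job -> slot busy {slot_efficiency(runtime):5.0%} of the time")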

As some users submit jobs that run for only a few seconds, this can cause a significant loss of efficiency. To cope with that, BIRD's Condor is tuned to keep a 'fast lane' open between a user and a brokered job slot on a worker node: once a scheduler has obtained a job slot on a worker node, it can send the next job of the same user to that slot as soon as the previous job has finished, without waiting for another negotiation cycle. This does not make short jobs efficient, but it reduces the waste somewhat.
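
Continuing the toy model above, the benefit of reusing a claimed slot can be sketched by letting a whole batch of a user's short jobs share a single negotiation wait. This is again only an illustration with assumed numbers, not the real scheduler behaviour.

    # Toy model of slot reuse: a batch of short jobs from the same user
    # shares one claimed slot, so the idle negotiation wait is paid once
    # per claim instead of once per job. Numbers are illustrative only.
    def slot_efficiency_with_reuse(job_runtime_s: float, jobs_per_claim: int,
                                   negotiation_cycle_s: float = 300.0) -> float:
        busy = jobs_per_claim * job_runtime_s
        return busy / (busy + negotiation_cycle_s)

    for n in (1, 10, 100):
        print(f"{n:3d} x 10 s jobs per claim -> slot busy {slot_efficiency_with_reuse(10, n):5.0%}")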

Conclusion

Ideally, job run times should be at least 5 minutes, so as not to waste resources.
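
One practical way to get above that threshold is to bundle many short tasks into one job, for example by passing a whole chunk of input files to a single job. The following is only a minimal sketch; the function and file handling are placeholders for your own workload.

    # Sketch of bundling: one Condor job works through a whole chunk of
    # input files, so its total run time comfortably exceeds a few minutes.
    # 'process_one_input' is a placeholder for your own per-task work; the
    # submit side would pass one chunk of file names as arguments per job.
    import sys

    def process_one_input(path: str) -> None:
        # placeholder: your actual per-file processing goes here
        print(f"processing {path}")

    def main() -> None:
        for path in sys.argv[1:]:
            process_one_input(path)

    if __name__ == "__main__":
        main()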

Job bugs

If you have not tested your jobs and they crash right after starting, things can get nasty because of the 'fast lane'. If you submit a list of faulty jobs, they will be sent one after the other along the 'fast lane' to a worker node, die immediately, and the next doomed job is started - which is obviously not very effective.

Conclusion

Before submitting a large set of jobs, ensure that they run stably.
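
A simple smoke test before a bulk submission could look like the following sketch, which runs the job executable locally (or on an interactive node) on one representative input and checks the exit code. The executable and input names are placeholders for your own job.

    # Smoke test: run the job executable on one representative input and
    # check the exit code before submitting the full set.
    # './my_job.sh' and 'test_input.dat' are placeholders.
    import subprocess
    import sys

    result = subprocess.run(["./my_job.sh", "test_input.dat"],
                            capture_output=True, text=True)
    if result.returncode != 0:
        print("test job failed - do not submit the full set yet:")
        print(result.stderr)
        sys.exit(1)
    print("test job finished cleanly")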

Memory leaks

If your jobs have memory leaks and you have trouble keeping them within the memory limits, do not simply reduce the number of events they process just to push them below some limit. Rather, try to fix the memory leaks, as they may point to other problems and can corrupt your analyses anyway.
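
To see whether a job is actually leaking rather than just large, it can help to log its memory footprint at regular points and check whether it keeps growing instead of levelling off. A minimal sketch using only the Python standard library; the chunked loop is a placeholder for your own event processing.

    # Log the job's peak memory footprint at regular points; a value that
    # keeps growing chunk after chunk, instead of levelling off, hints at
    # a leak. The loop below stands in for your own event processing.
    import resource
    import time

    def log_peak_memory(tag: str) -> None:
        # on Linux, ru_maxrss is reported in kilobytes
        peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        print(f"{tag}: peak RSS {peak_kb / 1024:.1f} MB", flush=True)

    for chunk in range(5):
        data = [0] * 1_000_000   # simulated per-chunk allocation
        time.sleep(0.1)
        log_peak_memory(f"after chunk {chunk}")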