The problems regarding the all and allgpu partitions are solved. If you still have issues with the scheduling of your batch jobs, please send a mail to maxwell.service@desy.de


Original Message:
After the SLURM update last week we see some problems regarding the "all" and "allgpu" partitions. Jobs from "privileged" partitions (exfl, cssb, upex, ...) are preempting (killing) jobs that were submitted to the all* partitions, even if the privileged jobs cannot use the preempted nodes afterwards due to constraints in the job definition (see https://confluence.desy.de/display/IS/Running+Jobs+on+Maxwell). The privileged job will "kill" a job in the all* partitions every 3 minutes until a matching node is found and the "privileged" job starts. Since this bug is only triggered by pending jobs in the privileged partitions with extra constraints, not all jobs in the all* queues will fail; for example, in the last 10 hours no job was preempted in the all* queues. We have filed a bug report with SchedMD (the company we have a SLURM support contract with) and are looking forward to a solution.
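For illustration, a "privileged" job with an extra node constraint might look like the sketch below. The partition, constraint feature, and job name are only example values, not taken from a specific affected job:

    #!/bin/bash
    #SBATCH --partition=exfl        # privileged partition (example)
    #SBATCH --constraint=V100       # extra node constraint (example feature name)
    #SBATCH --time=01:00:00
    #SBATCH --job-name=example-job

    # While a job like this is pending, the bug described above causes it to
    # preempt jobs in the all/allgpu partitions even on nodes that do not
    # satisfy the requested constraint.
    srun hostname

Such a job only runs once a node matching the constraint becomes free, but in the meantime it repeatedly triggers preemption in the all* partitions.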