On 08.10.2019, from 9:00 until 14:00, we had severe problems with the InfiniBand network on the Maxwell cluster. The home directories and several other GPFS storage systems were offline.
As a result, login to Maxwell was not possible and running jobs may have been disturbed.
For another bug fix, minor configuration changes, and the addition of a few extensions, we need to restart the JupyterHub (https://max-jhub.desy.de/) today at 19:00.
The restart will take only a few seconds, but it will most likely disconnect running kernels. In that case, you'd need to use the control panel to "Start My Server" and relaunch your notebook. The session (i.e. the Slurm job) will persist.
Apologies for the inconvenience.
On 19th Sep, from approximately 5:00 to 9:00, the home filesystem was not available on several nodes in the Maxwell cluster.
The problem has been solved. For further questions, send an email to email@example.com
With the update last week (2.9.2019) we removed all remaining python34 packages, because
Python 3.4 has reached end of life (https://www.python.org/downloads/release/python-3410/)
and the third-party repositories therefore no longer offer it.
DESY also provided some packages for 3.4, and not all of them have been
rebuilt for 3.6 yet. So if you are missing a package, please send us an email.
With the update last week to Slurm 19.05, the syntax in sbatch command
files changed: the parameter "--workdir" has been renamed to "--chdir",
in line with all other commands.
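As a sketch, an sbatch file that previously used "--workdir" now needs "--chdir" instead; the partition, path, and job names below are placeholders, not actual Maxwell settings:

```shell
#!/bin/bash
#SBATCH --partition=all
#SBATCH --time=01:00:00
#SBATCH --job-name=example
# Before Slurm 19.05 this line would have read: #SBATCH --workdir=...
#SBATCH --chdir=/home/myuser/run1

# The job now starts in the directory given by --chdir
./my_program
```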
We provide an updated Git client in the software section.
The problems regarding the "all" and "allgpu" partitions are solved. If you still have issues regarding the scheduling of your batch jobs, please send us a mail to firstname.lastname@example.org
After the Slurm update last week we saw some problems regarding the "all" and "allgpu" partitions: jobs from "privileged" partitions (exfl, cssb, upex, ...) were preempting (killing) jobs which had been submitted to the all* partitions, even when the privileged jobs could not use the preempted nodes afterwards due to constraints in the job definition (see https://confluence.desy.de/display/IS/Running+Jobs+on+Maxwell). The privileged job would "kill" a job in the all* partitions every 3 minutes until a matching node was found and the "privileged" job started. As this bug is only triggered by pending jobs in the privileged partitions with extra constraints, not all jobs in the all* queues fail; for example, in the last 10 hours no job was preempted in the all* queues. We have filed a bug report with SchedMD (the company we have a Slurm support contract with) and are looking forward to a solution.
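To illustrate the triggering condition: the bug only appears when a pending job in a privileged partition carries an extra node constraint. A submission of that shape might look like the following (partition and constraint names are illustrative, not a recipe to reproduce the bug):

```shell
# Job in a privileged partition, restricted to a specific node feature.
# While a job like this is pending, the buggy scheduler repeatedly
# preempts all*-partition jobs, even on nodes that do not match the
# requested constraint.
sbatch --partition=exfl --constraint=GPU job.sh
```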