The changes in favor of Petra4-computing tasks announced Aug. 17th have been reverted, and new nodes and partitions have been added accordingly:
- The maxwell partition has been restored with all nodes and a maximum job runtime of 7 days.
- The petra4 partition has been extended by 40 new AMD EPYC-7402 nodes.
- A new short partition has been created:
  - The short partition also contains the 40 new AMD EPYC-7402 nodes.
  - The maximum job runtime is 4 hours.
- Jobs in the petra4 partition are prioritized. Jobs in the short partition might be delayed but will never be terminated (preempted).
A short overview of the relevant changes:
|Partition||# of nodes||Nodes/Job||Max # of Jobs||Default Time||Maximum Time||Allowed Groups|
|petra4||66||no limit||no limit||1:00:00||14-00:00:00||max-petra4-sim-users|
For details of the available hardware, consult the hardware pages.
Some useful commands to get more information about the current setup:
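A minimal sketch of such commands, assuming the standard Slurm client tools `sinfo` and `scontrol` are available on the login node. The helper name `show_partitions` and the chosen output columns are illustrative assumptions.

```shell
# Hypothetical helper: print node count and time limits for each partition given.
show_partitions() {
    if ! command -v sinfo >/dev/null 2>&1; then
        echo "sinfo not found; run this on a Maxwell login node" >&2
        return 0
    fi
    for part in "$@"; do
        # %P partition, %D number of nodes, %L default time, %l maximum time
        sinfo --partition="$part" --format="%P %D %L %l"
    done
}

# Partition names as used in the announcement above.
show_partitions maxwell petra4 short

# Full settings of a single partition, including the allowed groups.
if command -v scontrol >/dev/null 2>&1; then
    scontrol show partition petra4
fi
```

The `scontrol show partition` output lists fields such as the default time, maximum time and allowed groups, which correspond to the columns of the overview table.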
A new version of Octave is available on Maxwell:
[sternber@max-wgs001]~% module load maxwell octave/5.2.0
[sternber@max-wgs001]~% octave --version
GNU Octave, version 5.2.0
For further information about Octave and the new version
The DESY directorate has decided to temporarily shift compute priorities on the Maxwell cluster in favor of urgent Petra4 computations.
As a consequence, we have to temporarily adjust the maxwell partition as follows:
- Starting Wednesday, August 19th, the maximum time limit of jobs in the maxwell partition will be reduced to 4 hours.
- Nodes in the maxwell partition will also become part of the petra4 partition, where petra4 jobs will take priority.
What happens to your jobs after the change?
- Jobs already running in the maxwell partition (or any other partition) will not be affected.
- Jobs with a requested runtime of more than 4 hours that are still waiting in the maxwell partition will have to be removed once the configuration changes are deployed next Wednesday, since they would never start.
- Jobs submitted to the maxwell partition with a proper time limit of 4 hours or less will run unaffected. Due to the prioritization of petra4, however, you might experience long queuing times. Please consider using the all partition as well as other resources that may be available to you and your group.
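To see which waiting jobs request more than 4 hours, the time limit reported by `squeue` can be compared against the new maximum. This is a minimal sketch, assuming limits are given in Slurm's usual `[days-]hours:minutes:seconds` notation. The function name `slurm_time_to_seconds` is hypothetical, and non-numeric limits such as UNLIMITED are not handled.

```shell
# Hypothetical helper: convert a Slurm time limit to seconds.
# Accepted forms: M, M:S, H:M:S, D-H, D-H:M, D-H:M:S
slurm_time_to_seconds() {
    echo "$1" | awk '{
        days = 0; rest = $0
        dash = index(rest, "-")
        if (dash > 0) { days = substr(rest, 1, dash - 1); rest = substr(rest, dash + 1) }
        n = split(rest, f, ":")
        if (dash > 0 || n == 3) secs = f[1] * 3600 + f[2] * 60 + f[3]  # hours first
        else                    secs = f[1] * 60 + f[2]                # minutes first
        print days * 86400 + secs
    }'
}

limit=$((4 * 3600))
if command -v squeue >/dev/null 2>&1; then
    # %i = job ID, %l = time limit; -h suppresses the header line
    squeue -u "$USER" -p maxwell -t PENDING -h -o "%i %l" |
    while read -r jobid tl; do
        if [ "$(slurm_time_to_seconds "$tl")" -gt "$limit" ]; then
            echo "job $jobid requests $tl (more than 4 hours)"
        fi
    done
fi
```

A job that is still pending can usually be given a shorter limit by its owner with `scontrol update JobId=<id> TimeLimit=04:00:00`, or be cancelled with `scancel <id>` and resubmitted.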
How temporary is temporary?
- We have already purchased 40 new compute nodes, which will arrive between mid and end of September. Once installed, the 40 nodes will become part of the petra4 partition.
- At this point, the maxwell partition will return to a normal schedule.
- In addition, the 40 nodes will also be made available for short-running jobs (2-4 hours) to users of the maxwell partition.
- After the petra4 compute campaign, the nodes will be fully integrated into the maxwell partition, more than doubling its core count.
We therefore expect that from the beginning of October (depending, of course, on timely delivery by our vendor) the maxwell partition will be fully available again, augmented by additional resources.
Rest assured that we are treating this matter with the highest urgency and are trying to minimize the temporary regression.
We understand that the temporary adjustment will affect some users rather harshly, and we hope for your understanding.
Please contact us (email@example.com) with any questions or comments, or if you have really urgent computational requests. Despite limited options, we will do our best to mitigate the effects.
Between 15:34 and 16:47, some Maxwell storage was disrupted, notably the Maxwell home directories.
The problem is resolved.
We've added the Julia programming language to the Maxwell software repository.
Julia is a high-level, high-performance, dynamic programming language. While it is a general purpose language and can be used to write any application, many of its features are well-suited for numerical analysis and computational science. (Wikipedia)
To use it, run "module load maxwell julia" or append "/software/julia/default/bin/" to your PATH.
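The PATH variant can be sketched as follows for a bash-compatible shell. The directory is the one given above. The guard that avoids adding it twice is an addition for illustration.

```shell
julia_bin="/software/julia/default/bin/"

# Append the directory only if PATH does not contain it yet.
case ":$PATH:" in
    *":$julia_bin:"*) ;;
    *) PATH="$PATH:$julia_bin"; export PATH ;;
esac
```

Placed in a shell startup file such as `~/.bashrc`, these lines would make the setting apply to every new shell.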
With the new OS version comes a small problem with the standard Open MPI
implementation. With "module load mpi/openmpi-x86_64" you use the standard Open MPI
from CentOS. If your job crashes with a segmentation fault ("SEGFAULT"), you have to add a
parameter to your mpirun call. If possible, use "ucx", as this is the new standard protocol in MPI.
There are two ways to achieve this:
- You can create a file in your home directory containing the single line "pml=ob1".
- You can add the parameter to your mpirun command, for example "mpirun --mca pml ob1 foo"
or "mpirun --mca pml ucx foo"
(foo := name of your program).
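Both variants can be sketched as follows. The file is not named above. Open MPI reads per-user MCA parameters from `$HOME/.openmpi/mca-params.conf`, and the sketch assumes that this is the file meant. The helper name `set_openmpi_pml` is hypothetical.

```shell
# Hypothetical helper: store the pml setting in an MCA parameter file.
# Default path: Open MPI's per-user file (assumed to be the file meant above).
set_openmpi_pml() {
    pml="$1"
    conf="${2:-$HOME/.openmpi/mca-params.conf}"
    mkdir -p "$(dirname "$conf")"
    touch "$conf"
    # Drop any existing pml line, then append the new value.
    grep -v '^pml[ =]' "$conf" > "$conf.tmp" || true
    echo "pml=$pml" >> "$conf.tmp"
    mv "$conf.tmp" "$conf"
}

# Per-job alternative on the command line (foo = name of your program):
#   mpirun --mca pml ucx foo
#   mpirun --mca pml ob1 foo
```

Calling `set_openmpi_pml ucx` would then write the setting to the default file, while the command-line form affects only the job it is used for.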
There is a local root exploit for GPFS commands like mmlsquota, see https://www.ibm.com/support/pages/node/6151701?myns=s033&mynp=OCSTXKQY&mync=E&cm_sp=s033-_-OCSTXKQY-_-E.
As a temporary workaround, the setuid flag has been removed from GPFS commands, which prevents regular users from running e.g. mmlsquota. Running mmlsquota now throws a rather misleading error:
That just means that $USER is not privileged to run the command, not that something is wrong with GPFS. This will be fixed with the next GPFS update on Maxwell.
A number of conda packages have been updated or added:
Due to conflicting dependencies, SuRVoS has been moved to a conda environment (survos):
The problem was resolved at ~10:30.
We are currently experiencing a serious problem in our GPFS infrastructure.
We are still analyzing which parts are affected and therefore cannot give any further details yet.
At the moment the Maxwell home directories are not available, so you cannot log in to the Maxwell login nodes. The software folder is also not accessible.
BeeGFS is not affected, so the InfiniBand network is clearly not the cause.
As soon as we get new information, we will update this post.
Docker had to be removed from the Maxwell login nodes due to severe security concerns. Running Docker on batch nodes is not affected. To build Docker images, please use your personal machine or one of the batch nodes.
From 98.10.2019 9:00 until 14:00 we had severe problems with the InfiniBand network on the Maxwell cluster. The home directories and several other GPFS offline storages were not available.
As a result, logging in to Maxwell was not possible and running jobs may have been disrupted.
For another bugfix, minor configuration changes and the addition of a few extensions, we need to restart JupyterHub (https://max-jhub.desy.de/) today at 19:00.
That will take only a few seconds, but it will most likely disconnect running kernels. In that case, you'd need to use the control panel to "Start My Server" and relaunch your notebook. The session (i.e. the slurm job) will persist.
Apologies for the inconvenience.