Blog

The changes in favor of Petra4-computing tasks announced Aug. 17th have been reverted, and new nodes and partitions have been added accordingly:

  • The maxwell partition has been restored with all nodes and a maximum job runtime of 7 days.
  • The petra4 partition has been extended by 40 new AMD EPYC-7402 nodes.
  • A new short partition has been created.
    • The short partition also contains the 40 new AMD EPYC-7402 nodes.
    • The maximum job runtime is 4 hours.
    • Jobs in the petra4 partition are prioritized. Jobs in the short partition might be delayed but will never be terminated (preempted).

A short overview of the relevant changes:

Partition  # of nodes  Nodes/Job  Max # of Jobs  Default Time  Maximum Time  Allowed Groups
maxwell    58          1-6        8              1:00:00       7-00:00:00    maxwell-users
petra4     66          no limit   no limit       1:00:00       14-00:00:00   max-petra4-sim-users
short      40          1-4        8              1:00:00       4:00:00       maxwell-users
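
As a rough illustration of how the limits above translate into a choice of partition for members of maxwell-users, the following sketch maps a requested runtime in whole hours to a partition name. The function name pick_partition is hypothetical and not part of the Maxwell tooling; the 4-hour and 7-day (168-hour) thresholds are taken from the overview above.

```shell
# Hypothetical helper: suggest a partition for a runtime given in whole hours.
# Thresholds follow the overview above: short <= 4 h, maxwell <= 7 days (168 h).
pick_partition() {
    if [ "$1" -le 4 ]; then
        echo "short"
    elif [ "$1" -le 168 ]; then
        echo "maxwell"
    else
        # No partition open to maxwell-users allows more than 7 days.
        echo "runtime of $1 h exceeds the 7-day maximum" >&2
        return 1
    fi
}
```

One possible use would be sbatch --partition="$(pick_partition 3)" --time=03:00:00 job.sh, where job.sh stands for your own batch script.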

For details of the available hardware, consult the hardware pages.

Some useful commands to get more information about the current setup:

/usr/local/bin/max-limits -a                       # show partitions and the limits applying 
/usr/local/bin/max-limits                          # show only partitions allowed
/usr/local/bin/my-partitions                       # list partitions indicating which ones can be used and which ones not

/usr/bin/sinfo                                     # show available nodes and partitions
/usr/bin/sinfo -p short -o '%20n %20f %10t %c %m'  # show nodes in the short partition, with state, features, cores...

/software/tools/bin/savail -p maxgpu               # show detailed information about available nodes taking into account preemptable jobs 

Hello!


You will find a new version of Octave on Maxwell:

[sternber@max-wgs001]~% module load maxwell octave/5.2.0
[sternber@max-wgs001]~% octave --version
GNU Octave, version 5.2.0
...

For further information about Octave and the new version, see:

https://www.gnu.org/software/octave/

https://www.gnu.org/software/octave/NEWS-5.1.html

Dear colleagues,

the DESY directorate has decided to temporarily shift compute priorities on the Maxwell cluster in favor of urgent Petra4 computations.
As a consequence we have to make temporary adjustments to the maxwell partition in the following way:

- Starting Wednesday August 19th the maximum time-limit of jobs in the maxwell partition will be reduced to _4_ HOURS.
- Nodes in the maxwell partition will also become part of the petra4 partition, and will be prioritized in the petra4 partition.

What happens to your jobs after the change?
- Jobs already running in the maxwell partition (or any other partition) will not be affected.
- Jobs that request a runtime of more than 4 hours and are still waiting in the maxwell partition will have to be removed once the configuration changes are deployed next Wednesday, because they would never start.
- Jobs submitted to the maxwell partition with a proper time-limit of 4h or less will run unaffected. Due to the prioritization of petra4 you might, however, experience long queuing times. Please consider using the all-partition as well as other resources possibly available to you and your group.
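
Whether a waiting job is affected can be checked by comparing its requested time limit with the new 4-hour maximum. The following is a minimal sketch, assuming time limits written as HH:MM:SS or D-HH:MM:SS (two of the formats Slurm accepts). The helper names slurm_time_to_seconds and fits_limit are made up for this example.

```shell
# Hypothetical helper: convert a time limit of the form [D-]HH:MM:SS to seconds.
# Other Slurm time formats are not handled; awk exits with status 1 for them.
slurm_time_to_seconds() {
    echo "$1" | awk -F'[-:]' '
        NF == 4 { print $1 * 86400 + $2 * 3600 + $3 * 60 + $4; next }
        NF == 3 { print $1 * 3600 + $2 * 60 + $3; next }
        { exit 1 }'
}

# Hypothetical helper: succeed if the job time ($1) does not exceed the limit ($2).
# Returns 2 if either value is not in a supported format.
fits_limit() {
    local job_secs limit_secs
    job_secs=$(slurm_time_to_seconds "$1") || return 2
    limit_secs=$(slurm_time_to_seconds "$2") || return 2
    [ "$job_secs" -le "$limit_secs" ]
}
```

For example, fits_limit 12:00:00 4:00:00 fails, so such a job would have to be resubmitted with a shorter time limit. The limits requested by your own pending jobs can be listed with squeue -u $USER -t PENDING -o '%i %l'.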

How temporary is temporary?
- We have already purchased 40 new compute nodes which will arrive mid- to end-September. Once installed, the 40 nodes will become part of the petra4 partition.
- At this point, the maxwell partition will return to a normal schedule.
- In addition, the 40 nodes will also be made available for short-running jobs (2-4 hours) to users of the maxwell partition.
- After the petra4 compute campaign, the nodes will be fully integrated into the maxwell partition, more than doubling its core count.

We therefore expect that by the beginning of October (depending, of course, on timely delivery by our vendor) the maxwell partition will be fully available again, and augmented by additional resources.
Rest assured that we are treating this matter with the highest urgency and are trying to minimize the temporary regression.

We understand that the temporary adjustment will affect some users rather harshly, and we hope for your understanding.

Please contact us (maxwell.service@desy.de) for any questions or comments, and in case you have really urgent computational requests. Despite limited options we will do our best to mitigate effects.

Between 15:34 and 16:47, some Maxwell storage was disturbed, notably the Maxwell home directories.
The problem is resolved.

We recently updated the following software:

julia 1.5 (https://docs.julialang.org/en/v1/NEWS/)
singularity 3.6 (https://github.com/hpcng/singularity/releases)

Julia on Maxwell

We've added the Julia programming language to the Maxwell software repository.

Julia is a high-level, high-performance, dynamic programming language. While it is a general purpose language and can be used to write any application, many of its features are well-suited for numerical analysis and computational science. (Wikipedia)

To use it, run "module load maxwell julia" or append "/software/julia/default/bin/" to your path.
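
If you prefer the second option, the directory can be appended to PATH without creating duplicate entries, for example from a shell startup file. This is only a sketch; the function name append_to_path is hypothetical.

```shell
# Hypothetical helper: append a directory to PATH unless it is already listed.
append_to_path() {
    case ":$PATH:" in
        *":$1:"*) ;;                # already present, nothing to do
        *) PATH="$PATH:$1" ;;
    esac
}

# Directory named in the announcement above (trailing slash omitted).
append_to_path /software/julia/default/bin
export PATH
```

Appending rather than prepending means that a julia found earlier in your PATH still takes precedence.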

Further documentation:
https://julialang.org/

OpenMPI problems

The new OS version brings a small problem with the standard OpenMPI
implementation. With "module load mpi/openmpi-x86_64" you use the standard OpenMPI
from CentOS. If your job crashes with a segmentation fault ("SEGFAULT"), you have to add a
parameter to your mpirun call. If possible, use "ucx", as this is the new standard protocol in MPI.

There are two ways to achieve this:

  1. Create a file in your home directory containing the single line "pml=ob1"
    or "pml=ucx":

    # mkdir ~/.openmpi
    # vi ~/.openmpi/mca-params.conf
    # cat ~/.openmpi/mca-params.conf
    pml=ob1
    
  2. Add the parameter to your mpirun command, for example "mpirun --mca pml ob1 foo"
    or "mpirun --mca pml ucx foo"
    (foo := name of your program).
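
The first option can also be scripted so that repeated runs do not leave several conflicting "pml=" lines in the file. This is a minimal sketch; the function name set_mca_param is hypothetical, and the file is written in the same "key=value" form shown above.

```shell
# Hypothetical helper: set one parameter in an Open MPI parameter file,
# replacing any earlier setting of the same key.
#   $1 = path to the file, $2 = parameter name, $3 = value
set_mca_param() {
    local conf="$1" key="$2" value="$3"
    mkdir -p "$(dirname "$conf")"
    touch "$conf"
    # Copy every line except an existing setting for this key
    # (grep exits non-zero when nothing is left, which is fine here).
    grep -v "^[[:space:]]*$key[[:space:]]*=" "$conf" > "$conf.tmp" || true
    echo "$key=$value" >> "$conf.tmp"
    mv "$conf.tmp" "$conf"
}
```

Calling set_mca_param "$HOME/.openmpi/mca-params.conf" pml ucx would then select ucx for all subsequent mpirun calls; pass ob1 instead if ucx does not work for your job.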


There is a local root exploit for GPFS commands like mmlsquota, see https://www.ibm.com/support/pages/node/6151701?myns=s033&mynp=OCSTXKQY&mync=E&cm_sp=s033-_-OCSTXKQY-_-E.

As a temporary workaround, the setuid flag has been removed from GPFS commands, which prevents regular users from running e.g. mmlsquota. Running mmlsquota now produces a rather misleading error:


mmlsquota -u $USER --block-size auto max-home
Failed to connect to file system daemon: No such process
mmlsquota: GPFS is down on this node.
mmlsquota: Command failed. Examine previous error messages to determine cause.

This only means that $USER is not privileged to run the command, not that something is wrong with GPFS. It will be fixed with the next GPFS update on Maxwell.

Anaconda3 updates

A number of conda packages have been updated or added:

cudatoolkit               10.0.130                      0    upgraded from cuda 9.0 to 10.0
cudnn                     7.6.5                cuda10.0_0    new
cupy                      7.4.0            py36h273e724_1    new
dask                      2.15.0                     py_0    upgraded
dask-core                 2.15.0                     py_0    upgraded
dask-glm                  0.2.0                      py_1    upgraded
dask-jobqueue             0.7.1                      py_0    upgraded
dask-labextension         1.0.3                      py_0    upgraded
dask-ml                   1.4.0                      py_0    new
dask-mpi                  1.0.3                    py36_0    new
distributed               2.15.2           py36h9f0ad1d_0    upgraded
extra-data                1.1.0                    pypi_0    upgraded
extra-geom                0.9.0                    pypi_0    upgraded
ipyslurm                  1.5.0                      py_0    new
libblas                   3.8.0                    14_mkl    upgraded
libcblas                  3.8.0                    14_mkl    upgraded
liblapack                 3.8.0                    14_mkl    upgraded
nccl                      2.6.4.1              hd6f8bf8_0    new
pyfai                     0.19.0           py36hb3f55d8_0    new. used to live in a conda env
pytorch                   1.4.0             cuda100py36_0    upgraded from 1.0 cuda 9.0
torchvision               0.2.1                    py36_0    new
xarray                    0.11.2                   pypi_0    upgraded


Due to conflicting dependencies, SuRVoS has been moved to a conda environment (survos):

@max-wgs:~$ conda env list | grep survos
survos                   /software/anaconda3/5.2/envs/survos

@max-wgs:~$ module load maxwell survos

@max-wgs:~$ which SuRVoS
/software/anaconda3/5.2/envs/survos/bin/SuRVoS


The problem was resolved at ~10:30


We are currently experiencing a serious problem in our GPFS infrastructure.

We are still analyzing which parts are involved and therefore cannot give any further details yet.

At the moment the Maxwell home directories are not available, so you cannot log in to the Maxwell login nodes. The software folder is also not accessible.

BeeGFS is not affected, so the InfiniBand network can be ruled out as the cause.

As soon as we have new information we will update this post.

Docker had to be removed from the Maxwell login nodes due to severe security concerns. Running Docker on batch nodes is not affected. To build Docker images, please use your personal machines or one of the batch nodes.

Software Updates

During the last weeks we've updated all workgroup servers and all compute nodes to CentOS 7.7. For details you may look at the release notes.

Additionally we've updated singularity to the latest version 3.5.

From 98.10.2019 9:00 until 14:00 we had severe problems with the InfiniBand network on the Maxwell cluster. The home directories and several other GPFS offline storage systems were not available.
As a result, logging in to Maxwell was not possible and running jobs may have been disturbed.


Jupyterhub interruption

For another bugfix, minor configuration changes and the addition of a few extensions, we need to restart the JupyterHub (https://max-jhub.desy.de/) today at 19:00.
That will take only a few seconds, but it will most likely disconnect running kernels. In that case, you'd need to use the control panel to "Start My Server" and relaunch your notebook. The session (i.e. the slurm job) will persist.
Apologies for the inconvenience.