The Maxwell-Cluster is a resource dedicated to parallel and multi-threaded applications that can make use of at least some of its specific characteristics. In addition to serving as a medium-scale High-Performance-Cluster, Maxwell incorporates resources for Photon Science data analysis and resources of CFEL, CSSB, Petra4, the European XFEL...
If you find the resource useful for your work, we would greatly appreciate learning about publications that have substantially benefited from the Maxwell-Cluster. Drop us a mail at maxwell.service@desy.de. An acknowledgement of the Maxwell resource would also be greatly appreciated and helps to foster the cluster, for example: "This research was supported in part through the Maxwell computational resources operated at Deutsches Elektronen-Synchrotron (DESY), Hamburg, Germany"
max-display3 nodes (max-display004,5) are currently being upgraded to Centos 8, serving as a testbed for future upgrades of the cluster.
- Currently max-display3.desy.de is not available for any FastX sessions; we are working on that
- Expect many (even basic) things NOT to work
The changes in favor of Petra4-computing tasks announced Aug. 17th have been reverted, and new nodes and partitions have been added accordingly:
- The maxwell partition has been restored with all nodes and a maximum job runtime of 7 days.
- The petra4 partition has been increased by 40 new AMD EPYC-7402 nodes
- A new short partition has been created:
- The short partition also contains the 40 new AMD EPYC-7402 nodes.
- The maximum job runtime in the short partition is 4 hours.
- Jobs in the petra4 partition are prioritized; jobs in the short partition might be delayed but will never be terminated (preempted).
A short overview of the relevant changes:
Partition | # of nodes | Nodes/Job | Max # of Jobs | Default Time | Maximum Time | Allowed Groups |
---|---|---|---|---|---|---|
maxwell | 58 | 1-6 | 8 | 1:00:00 | 7-00:00:00 | maxwell-users |
petra4 | 66 | no limit | no limit | 1:00:00 | 14-00:00:00 | max-petra4-sim-users |
short | 40 | 1-4 | 8 | 1:00:00 | 4:00:00 | maxwell-users |
For details on the available hardware consult the hardware pages.
Some useful commands to get more information about the current setup:
/usr/local/bin/max-limits -a                       # show partitions and the limits applying
/usr/local/bin/max-limits                          # show only partitions allowed
/usr/local/bin/my-partitions                       # list partitions indicating which ones can be used and which ones not
/usr/bin/sinfo                                     # show available nodes and partitions
/usr/bin/sinfo -p short -o '%20n %20f %10t %c %m'  # show nodes in the short partition, with state, features, cores...
/software/tools/bin/savail -p maxgpu               # show detailed information about available nodes taking into account preemptable jobs
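As a minimal sketch of how a job could be submitted to the new short partition (job name, script name and program name are placeholders, not part of the announcement):

#!/bin/bash
#SBATCH --partition=short
#SBATCH --time=02:00:00      # must stay within the 4 hour limit of the short partition
#SBATCH --nodes=1
#SBATCH --job-name=short-test
# my_program stands for your own application
./my_program

Submit the script with "sbatch myjob.sh".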
Hello!
You will find a new version of Octave on Maxwell:
[sternber@max-wgs001]~% module load maxwell octave/5.2.0
[sternber@max-wgs001]~% octave --version
GNU Octave, version 5.2.0
...
For further information about Octave and the new version:
Dear colleagues,
the DESY directorate has decided to temporarily shift compute priorities on the Maxwell cluster in favor of urgent Petra4 computations.
As a consequence we have to make temporary adjustments to the maxwell partition in the following way:
- Starting Wednesday August 19th the maximum time-limit of jobs in the maxwell partition will be reduced to _4_ HOURS.
- Nodes in the maxwell partition will also become part of the petra4 partition, and will be prioritized in the petra4 partition.
What happens to your jobs after the change?
- Jobs already running in the maxwell partition (or any other partition) will not be affected.
- Jobs with a runtime of more than 4 hours that are still waiting in the maxwell partition will have to be removed after the configuration changes are deployed next Wednesday; they would otherwise never execute.
- Jobs submitted to the maxwell partition with a time-limit of 4h or less run unaffected (see the sketch below for setting the time limit). Due to the prioritization of petra4 you might however experience long queuing times. Please consider using the all partition as well as other resources possibly available to you and your group.
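As a minimal sketch (the script name is a placeholder), the time limit can be set on the command line or inside the job script:

sbatch --partition=maxwell --time=03:00:00 myjob.sh   # request at most 3 hours at submission time
#SBATCH --time=03:00:00                                # or as a directive inside myjob.sh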
How temporary is temporary?
- We have already purchased 40 new compute nodes which will arrive mid- to end-September. Once installed, the 40 nodes will become part of the petra4 partition.
- At this point, the maxwell partition will return to a normal schedule.
- In addition, the 40 nodes will also be made available for short-running jobs (2-4 hours) for users of the maxwell partition.
- After the petra4 compute campaign, the nodes will be fully integrated into the maxwell partition and more than double the core-count.
So we expect that by the beginning of October (depending of course on timely delivery by our vendor) the maxwell partition will be fully available again, augmented by additional resources.
Be assured that we treat this matter with the highest urgency and are trying to minimize the temporary regression.
We understand that the temporary adjustment will affect some users rather harshly, and we hope for your understanding.
Please contact us (maxwell.service@desy.de) for any questions or comments, and in case you have really urgent computational requests. Despite limited options we will do our best to mitigate effects.
Between 15:34 and 16:47, some Maxwell storage was disturbed, notably the Maxwell home directories.
The problem is resolved.
We recently updated the following software:
julia 1.5 (https://docs.julialang.org/en/v1/NEWS/)
singularity 3.6 (https://github.com/hpcng/singularity/releases)
We've added the Julia programming language to the Maxwell software repository.
Julia is a high-level, high-performance, dynamic programming language. While it is a general purpose language and can be used to write any application, many of its features are well-suited for numerical analysis and computational science. (Wikipedia)
To use it, run "module load maxwell julia" or append "/software/julia/default/bin/" to your PATH.
Further documentation:
https://julialang.org/
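A minimal sketch of running a Julia script as a batch job, assuming the module load above (script and file names are placeholders):

#!/bin/bash
#SBATCH --partition=maxwell
#SBATCH --time=01:00:00
#SBATCH --nodes=1
module load maxwell julia
# myscript.jl stands for your own Julia script
julia myscript.jl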
Julia in the Jupyter Notebook:
To use Julia as a kernel in the Jupyter Notebook, first load Julia (see above) and then start the Julia interpreter by entering the command "julia" on the command line. Then enter the following commands to install IJulia and make the kernel available for Jupyter:
julia> using Pkg
julia> Pkg.add("IJulia")
Then just refresh the main Jupyter Notebook page and Julia should become available as a kernel choice in the notebook, alongside the various Anaconda Python variants installed on Maxwell.
The new OS version comes with a small problem in the standard openmpi implementation. With "module load mpi/openmpi-x86_64" you use the standard openmpi from CentOS. If your job crashes with a SEGFAULT, you have to add a parameter for your mpirun. If possible use "ucx", as this is the new standard protocol in MPI.
There are two ways to achieve this:
- You can create a file in your home directory containing one line, "pml=ob1" or "pml=ucx":
mkdir ~/.openmpi
vi ~/.openmpi/mca-params.conf
cat ~/.openmpi/mca-params.conf
pml=ob1
- You can add the parameter to your mpirun command, for example "mpirun --mca pml ob1 foo" or "mpirun --mca pml ucx foo" (foo := name of your program); a complete job script sketch follows below.
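A minimal sketch of a complete MPI job script using the ucx setting, assuming the CentOS openmpi module above (node count and the program name mpi_hello are placeholders):

#!/bin/bash
#SBATCH --partition=maxwell
#SBATCH --nodes=2
#SBATCH --time=01:00:00
module load mpi/openmpi-x86_64
# explicitly select the ucx PML to avoid the segfault described above
mpirun --mca pml ucx ./mpi_hello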
There is a local root exploit for GPFS commands like mmlsquota, see https://www.ibm.com/support/pages/node/6151701?myns=s033&mynp=OCSTXKQY&mync=E&cm_sp=s033-_-OCSTXKQY-_-E.
As a temporary workaround the setuid-flag has been removed from GPFS commands, preventing regular users from running e.g. mmlsquota. Running mmlsquota now throws a rather misleading error:
mmlsquota -u $USER --block-size auto max-home
Failed to connect to file system daemon: No such process
mmlsquota: GPFS is down on this node.
mmlsquota: Command failed. Examine previous error messages to determine cause.
That just means that $USER is not privileged to run the command, not that there is something wrong with GPFS. It will be fixed with the next GPFS update on Maxwell.
A number of conda packages have been updated or added:
cudatoolkit        10.0.130   0                upgraded from cuda 9.0 to 10.0
cudnn              7.6.5      cuda10.0_0       new
cupy               7.4.0      py36h273e724_1   new
dask               2.15.0     py_0             upgraded
dask-core          2.15.0     py_0             upgraded
dask-glm           0.2.0      py_1             upgraded
dask-jobqueue      0.7.1      py_0             upgraded
dask-labextension  1.0.3      py_0             upgraded
dask-ml            1.4.0      py_0             new
dask-mpi           1.0.3      py36_0           new
distributed        2.15.2     py36h9f0ad1d_0   upgraded
extra-data         1.1.0      pypi_0           upgraded
extra-geom         0.9.0      pypi_0           upgraded
ipyslurm           1.5.0      py_0             new
libblas            3.8.0      14_mkl           upgraded
libcblas           3.8.0      14_mkl           upgraded
liblapack          3.8.0      14_mkl           upgraded
nccl               2.6.4.1    hd6f8bf8_0       new
pyfai              0.19.0     py36hb3f55d8_0   new, used to live in a conda env
pytorch            1.4.0      cuda100py36_0    upgraded from 1.0 cuda 9.0
torchvision        0.2.1      py36_0           new
xarray             0.11.2     pypi_0           upgraded
Due to conflicting dependencies, SuRVoS has been moved to a conda environment (survos):
@max-wgs:~$ conda env list | grep survos
survos     /software/anaconda3/5.2/envs/survos
@max-wgs:~$ module load maxwell survos
@max-wgs:~$ which SuRVoS
/software/anaconda3/5.2/envs/survos/bin/SuRVoS
The problem was resolved at ~10:30
We are currently experiencing a serious problem in our GPFS infrastructure.
We are in the process of analyzing which parts are involved and therefore cannot give any further details at the moment.
At the moment the maxwell home directories are not available, so you cannot log in to the maxwell login nodes. The software folder is also not accessible.
BeeGFS is not involved, so it is clear that the problem is not the infiniband network.
As soon as we get new information, we will update this post.
Docker had to be removed from the maxwell login nodes due to severe security concerns. Running docker on batch nodes is not affected. To build docker images, please use your personal machines or one of the batch nodes.
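A minimal sketch of how one might get an interactive batch node for a docker build, assuming docker is available to your account there (partition, time limit, image name and build context are placeholder choices):

salloc --partition=maxwell --nodes=1 --time=02:00:00   # request a node interactively
srun --pty bash -i                                      # open a shell on the allocated node
docker build -t myimage .                               # myimage and the build context are placeholders
exit                                                    # leave the node; exit again to release the allocation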
During the last weeks we've updated all workgroup servers and all compute nodes to Centos7.7. For details you may look at the release notes.
Additionally we've updated singularity to the latest version 3.5.
From 98.10.2019 9:00 until 14:00 we had severe problems with the infiniband network on the Maxwell cluster. The home directories and several other GPFS storages were offline.
So logging in to Maxwell was not possible and running jobs may have been disturbed.
- slurm_magic
- jupyterlab_slurm
- dask-labextension
- jupyterlab-server-proxy
- dask-jobqueue
- enable pdf-export via texlive/2019
The Maxwell-Cluster is composed of a core partition (maxwell) and group specific partitions. All compute nodes are however available for everyone!
The Maxwell-Cluster is primarily intended for parallel computation making best use of the multi-core architectures, the infiniband low-latency network, fast storage and the available memory. The cluster is hence not suited for single-core computations or embarrassingly parallel jobs like Monte-Carlo productions. Use BIRD, Grid or your group's workgroup server (WGS) for these kinds of tasks.
The entire cluster is managed by the SLURM scheduler (with some notable exceptions). The SLURM scheduler essentially works on a first-come, first-served basis. The group-specific partitions however have slightly different rules: though everyone can run jobs on group-specific nodes, members of the group have a higher priority and can push non-group jobs off the partition (preemption). See Groups and Partitions on Maxwell for details.
- To get started, please have a look at the Getting Started page!
- The Maxwell Hardware page provides a list of currently available nodes & configurations.
- The Maxwell Partitions page provides a quick overview of the nodes, capacities, features and limits of the individual partitions.
- Read the documentation! It should cover at least the essentials. If you come across incorrect or outdated information: please let us know!
Contact
For any questions, problems, suggestions please contact: maxwell.service@desy.de
All Announcements will be sent via maxwell-user@desy.de. Users with the maxwell-resource are automatically subscribed.
We strongly recommend that all maxwell users without the maxwell resource self-subscribe, even if you are exclusively using group-specific resources.