What is the Maxwell cluster?
The Maxwell cluster is the unique high-performance computing resource at DESY Hamburg. It combines a large number of compute nodes, high memory per CPU core, and GPUs with very fast network storage. Everyone with a valid DESY account can use the resource under one condition: the application to be run on the Maxwell cluster must be capable of making use of many cores, preferably across multiple nodes. Even if your application is not of that type, you might still be entitled to use the cluster, provided your group owns resources in the Maxwell cluster. Check the Getting Started page to verify whether that's the case.
This page describes the rules applying to everyone, and in particular to the generic resources, namely the so-called maxwell and all partitions.
What is a batch job?
You might be used to starting an application interactively, by invoking a command or by clicking on a button, and waiting for the process to produce the results. A batch job is not fundamentally different: you invoke a command, and a scheduler (Slurm) queues the application and runs it on the most appropriate available machine. While waiting for the process you can continue working and run multiple applications at the same time - without interfering with your colleagues.
Though it's perfectly possible to run jobs interactively on the Maxwell cluster, batch mode has many advantages:
- you get exclusive access to one or multiple machines
- your jobs won't be affected by other users. Some applications are very efficient at crashing a machine, and on a shared machine all your processes would be lost in such a case.
- your jobs won't affect any other users. If your application puts a very high load on a machine or crashes it, nobody else will notice.
- it allows the most efficient use of the compute cluster
So even if running a batch job feels less convenient: please make an effort to design your jobs as batch rather than interactive jobs.
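To make the idea of a batch job concrete, here is a minimal sketch of a job script and its submission. The file name, job name, time limit and the echo command are illustrative assumptions, not values prescribed by the Maxwell documentation; only the sbatch command and the #SBATCH options are standard Slurm.

```shell
#!/bin/sh
# Write a minimal batch script. All values in it are illustrative.
cat > hello-maxwell.sh <<'EOF'
#!/bin/bash
# Partition to run in (maxwell is the default partition).
#SBATCH --partition=maxwell
# Number of nodes requested.
#SBATCH --nodes=1
# Wall-clock time limit (hh:mm:ss).
#SBATCH --time=00:10:00
# Job name shown in the queue.
#SBATCH --job-name=hello
# Output file; %j expands to the job ID.
#SBATCH --output=hello-%j.out

# The actual work: replace this line with your own application.
echo "Running on $(hostname)"
EOF

# Submit only where Slurm is available, i.e. on a submit host.
if command -v sbatch >/dev/null 2>&1; then
    sbatch hello-maxwell.sh
else
    echo "sbatch not found: run this on a Maxwell submit host"
fi
```

Once submitted, the job waits in the queue and runs without further interaction; its output ends up in the file named by the --output option.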
How do I connect to the Maxwell cluster?
You can't just submit a job or run an application on the Maxwell cluster from your desktop. You need to connect to a so-called submit host. These are dedicated machines for development, small interactive jobs, and submitting compute-intensive jobs to dedicated compute nodes. There are special login nodes for various groups; please check your group's page. Apart from that, there are a few login nodes available for pretty much everyone. As a rule of thumb:
- If you do not need a graphical desktop, log in to max-wgs.desy.de. Under Windows 10, use the built-in ssh command on the console. Under Linux or macOS, just use a terminal and ssh to max-wgs.desy.de. If you are connecting from outside the DESY network you might need to create an ssh tunnel.
- Everyone else should use max-display.desy.de. You could simply use a terminal and open an ssh connection. But the simplest way of connecting to the Maxwell cluster is by opening https://max-display.desy.de:3443/ in your browser. It's a very powerful system supporting load balancing and GPU hardware acceleration. However, the max-display nodes are exposed to the outside world. We therefore have to deploy security updates as soon as possible, which might mean that nodes are rebooted at very short notice. Please keep that in mind! Save your open files regularly and avoid long-running jobs.
- There is one exception: members of CFEL, FS or external Petra3/FLASH users without any Maxwell resource have to use the FS login nodes. See Maxwell for Photon Science users and see below how to verify which partitions to use.
- For more details please continue reading on the Remote Login page.
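The terminal-based login described above can be sketched as a small shell helper. The function name maxwell_login and the DESY_USER variable are hypothetical conveniences introduced here. The ssh tunnel needed from outside the DESY network is not covered, since the text does not describe its details.

```shell
# Open a shell on a Maxwell login node.
# $1: login host; defaults to the terminal-only node named above.
# DESY_USER (hypothetical variable): set it if your DESY account name
# differs from your local user name; otherwise ssh uses the local name.
maxwell_login() {
    ssh "${DESY_USER:+${DESY_USER}@}${1:-max-wgs.desy.de}"
}

# Usage (not executed here):
#   maxwell_login                        # session on max-wgs.desy.de
#   maxwell_login max-display.desy.de    # session on a max-display node
```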
Please note: the interactive nodes are monitored, see Monitoring on Maxwell. Using the login nodes like max-display for compute- or memory-intensive processes is not permitted. If you do so nevertheless, you will first get an email. If you don't react, your processes will be terminated. If you continue misusing the interactive nodes, we might block access.
What is a partition?
A Slurm partition is quite similar to a queue in a batch system. It defines a set of machines to be used by a certain group of users. The partition implies certain constraints like the maximum number of concurrent jobs, time limits and so on. Have a look at the page describing the limits per partition. Partitions can overlap, and on the Maxwell cluster they do. For example, the exfel partition and the upex partition contain an almost identical set of nodes and simply share them. So jobs in the exfel partition compete with jobs in the upex partition for the same resources. The all partition is a special case. It contains all nodes of the Maxwell cluster, but jobs in the all partition do not actually compete with jobs in other partitions. It's more of a parasitic utilisation of idle resources, see below for further details.
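Partitions and their limits can also be inspected from the command line with standard Slurm tools. The sketch below wraps sinfo and scontrol in two helper functions. The function names are made up for illustration, and the output depends on how the cluster is configured.

```shell
# Print a one-line summary per partition, or a hint when Slurm is absent.
show_partitions() {
    if command -v sinfo >/dev/null 2>&1; then
        sinfo --summarize
    else
        echo "sinfo not found: run this on a Maxwell login node"
    fi
}

# Print the configured settings (time limits, node list, ...) of one partition.
# $1: partition name, for example maxwell or all.
show_partition_limits() {
    if command -v scontrol >/dev/null 2>&1; then
        scontrol show partition "$1"
    else
        echo "scontrol not found: run this on a Maxwell login node"
    fi
}

show_partitions
show_partition_limits all
```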
How to use the maxwell partition?
The maxwell partition is the default partition on the Maxwell cluster. If no partition is specified, a job will be executed on the maxwell partition. If you don't have permission to run jobs on the maxwell partition, the job will be refused. To verify which partitions are enabled for your account, use the my-partitions scriptlet.
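A minimal sketch of calling the scriptlet, assuming it is on the PATH of the Maxwell login nodes and can be run without arguments (neither detail is stated on this page). The wrapper function name is hypothetical.

```shell
# Show which partitions are enabled for the current account.
# Assumption: my-partitions takes no arguments and exists only on Maxwell nodes.
check_my_partitions() {
    if command -v my-partitions >/dev/null 2>&1; then
        my-partitions
    else
        echo "my-partitions not found: it is expected only on Maxwell nodes"
    fi
}

check_my_partitions
```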
Everything else is explained under Running jobs on Maxwell.
How to use the all partition?
The all partition in Maxwell is special. It contains ALL nodes of the cluster, the generic resources as well as any of the group-owned resources. It allows for very large jobs, much larger than on any other partition (see Partitions on Maxwell for a detailed list of job limits). Basically you could run up to 20 concurrent jobs using all available machines for 14 days. There is however a downside: jobs on the all partition will be preempted as soon as a job in any other partition requests some of the nodes allocated to a job in the all partition. Preemption means: your job will be terminated effectively immediately. The process is actually more involved, see Preemption and Checkpointing for more details.
Selecting the all partition for a job is simple: specify it as the partition when submitting the job.
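With standard Slurm, the partition is chosen with the --partition option of sbatch, or with the equivalent "#SBATCH --partition=all" line inside the job script. The sketch below uses the command-line form. The function name and the script name my-job.sh are placeholders.

```shell
# Submit an existing batch script to the all partition.
# $1: path to the job script (my-job.sh below is a placeholder name).
submit_to_all() {
    if command -v sbatch >/dev/null 2>&1; then
        sbatch --partition=all "$1"
    else
        # Without Slurm, only show the command that would be run.
        echo "would run: sbatch --partition=all $1"
    fi
}

submit_to_all my-job.sh
```

Since such a job can be preempted at any time, it is best suited to work that can be restarted or resumed from saved state.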
Everything else is explained under Running jobs on Maxwell.
What you need to know about your home directory on Maxwell
The home directory on Maxwell is different from home directories anywhere else on the campus. The Maxwell home directory is hosted on GPFS, so it's fast. But in contrast to an AFS-based home directory, the Maxwell home directory
- has a hard limit of 20 GB per user. It's not extendable and there are no exceptions.
- is not backed up. There are snapshots to recover a very recent state, but don't rely on them. Assume that a deleted file or directory is simply gone.
- will not be archived. Once your account expires, all files might be deleted without further notice. If there is data that your group should be able to keep using, make sure to create a copy in a group-readable, persistent space such as dCache.
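Because the limit is hard, it can help to check occasionally how close you are to it. The following is a rough sketch using the generic du tool, not an official quota command. The function name is hypothetical, and the figure is only an estimate: the quota system may count usage differently, and du can be slow on directories with many files.

```shell
# Hard limit from the text above, expressed in kilobytes.
# Assumption: 1 GB is counted as 1024 * 1024 KB.
LIMIT_KB=$((20 * 1024 * 1024))

# Print how much of the limit a directory uses.
# $1: directory to measure; defaults to the home directory.
home_usage_report() {
    dir="${1:-$HOME}"
    # du -sk prints the total size in kilobytes followed by the path.
    used_kb=$(du -sk "$dir" 2>/dev/null | awk '{print $1}')
    used_kb="${used_kb:-0}"
    percent=$((used_kb * 100 / LIMIT_KB))
    echo "${dir}: ${used_kb} KB used, about ${percent}% of the 20 GB limit"
}

home_usage_report
```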