Basics

What is the Maxwell cluster?

The Maxwell cluster is DESY Hamburg's high-performance compute resource. It combines a large number of compute nodes, high memory per CPU core, and GPUs with very fast network storage. Everyone with a valid DESY account can use the resource under one condition: the applications to be run on the Maxwell cluster must be capable of making use of many cores, preferably across multiple nodes. Even if your application is not of that type, you might still be entitled to use the cluster, provided your group owns resources in the Maxwell cluster. Check the Getting Started page to verify if that's the case.

How to get access to the Maxwell cluster?

Consult the Getting Started page. Drop an email to maxwell.service@desy.de asking for the maxwell-resource, and please explain very briefly what kind of applications you intend to run.

What kind of applications are possible on the Maxwell cluster?

Preferably applications which can make use of the resources, namely the low-latency network (InfiniBand), the cluster file systems (GPFS, BeeGFS), the large memory and the multi-core architecture. Massively parallel MPI applications are typical examples; Monte Carlo simulations and similar applications are very much disfavored. However, if your group owns resources in Maxwell, it's entirely up to the group to grant access and define policies.

How can I stay informed about updates, downtimes and the like?

Please subscribe to the maxwell-user mailing list. maxwell-users are automatically subscribed. Self-subscription is possible, but is moderated and might take a moment. 

Can groups add their "own resources" to the Maxwell cluster?

Absolutely! We can support a spectrum of different operational modes. The hardware however needs to have specs similar to existing nodes. Get in touch with unix@desy.de.

Whom can I ask in case of problems, missing software, further support and the like?

Get in touch with maxwell.service@desy.de.

Interactive login

Where can I compile, test, debug and submit my jobs?

Use max-display if your work is not very CPU or memory demanding; otherwise log in to max-wgs. Groups with their own resources might use their group's login hosts. Check Interactive Login for group specifics.

Is graphical login available?

Use FastX for graphical login to max-display (or your group's display nodes). Display nodes are shared resources, so be gentle. From max-display, connect to WGS nodes like max-wgs, max-fsc, ... if needed.
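For example, from a terminal inside your FastX session on max-display you can simply hop on to a workgroup server:

ssh max-wgs     # general-purpose workgroup server
ssh max-fsc     # or another WGS node, depending on your group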

Jobs

What is a batch job?

You might be used to starting an application interactively, by invoking a command or clicking a button, and waiting for the process to produce results. A batch job is not fundamentally different: you invoke a command, and a scheduler (SLURM) queues the application on the most appropriate available machine. While the job is waiting or running you can continue working and run multiple applications at the same time - without interfering with your colleagues.

Though it's perfectly possible to run jobs interactively on the Maxwell cluster, batch mode has many advantages:

  • you get exclusive access to one or multiple machines
  • your jobs won't be affected by other users: some applications are very efficient at crashing a machine, and all your processes would be lost in such a case
  • your jobs won't affect other users: if your application puts a very high load on a machine or crashes it, nobody else will notice
  • it allows the most efficient use of the compute cluster

So even if running a batch job feels less convenient: please make an effort to design your jobs as batch jobs rather than interactive jobs.
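As an illustration, a minimal batch job script could look like the sketch below; the partition, time limit and job name are placeholders, so pick whatever matches your group's resources:

#!/bin/bash
#SBATCH --partition=maxcpu       # placeholder: any partition you are entitled to use
#SBATCH --nodes=1                # the node is entirely yours
#SBATCH --time=01:00:00          # expected run-time (hh:mm:ss)
#SBATCH --job-name=myjob         # arbitrary name shown by squeue

echo "running on $(hostname)"    # replace with your actual application

Submit it with sbatch myjob.sh and follow its state with squeue -u $USER.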

What is a partition?

A SLURM partition is quite similar to a queue in a batch system. It defines a set of machines to be used by a certain group of users and implies certain constraints like the maximum number of concurrent jobs, time limits and so on. Have a look at the page describing the limits per partition. Partitions can overlap, and on the Maxwell cluster they do. For example, the exfel partition and the upex partition contain an almost identical set of nodes and simply share them, so jobs in the exfel partition compete with jobs in the upex partition for the same resources. The all partition is a special case: it contains all nodes of the Maxwell cluster, but jobs in the all partition are actually not competing. It's more of a parasitic utilisation of idle resources; see below for further details.
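To see which partitions exist and how they overlap, the standard SLURM tools are sufficient (shown here for the partitions mentioned above; the output format is just an example):

sinfo -s                             # one-line summary per partition with node counts
sinfo -p exfel,upex -o "%P %D %N"    # exfel and upex list (almost) the same nodes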

What is an account? 

Not to be confused with a DESY account. In SLURM slang this is a kind of bank account: all members of an account (usually a group) share "compute credits". Once these are exhausted, all account members suffer from the same low priority. See also Introduction to SLURM jobs.
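To check which accounts you are a member of and how much of the shared credit has already been used, something like the following should work (the exact columns depend on the SLURM configuration):

sacctmgr show associations user=$USER format=account,user,qos   # your accounts and allowed QOS
sshare -U                                                        # your fair-share and usage within those accounts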

How can I allocate resources, run or submit jobs?

Check Introduction to SLURM jobs.
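In short, the three basic modes look roughly like this (partition and time limit are placeholders; see Introduction to SLURM jobs for the details):

sbatch --partition=maxcpu --time=01:00:00 myjob.sh    # submit a batch script
salloc --partition=maxcpu --nodes=1                   # allocate a node for interactive use
srun --partition=maxcpu --nodes=1 hostname            # run a single command as a job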

Settings & Limits

How many nodes can I use?

That depends very much on the partition and will change continuously. Have a look at Useful commands.

How long are jobs allowed to run?

That again depends on the partition; typically up to 7 days. Have a look at Useful commands.
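Both limits can also be queried directly from SLURM, for example:

sinfo -o "%P %l %D"              # time limit (%l) and number of nodes (%D) per partition
scontrol show partition maxcpu   # full limits (MaxTime, MaxNodes, ...) of a single partition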

Can I run different jobs in parallel on the same node?

Yes and no. If a job gets scheduled on a node, the node is - in most cases - entirely yours. Your job script can fork as many processes as there are cores available, but it's up to the job script to distribute the sub-processes across the node. Submitting another job or specifying multiple tasks are separate requests which cannot be scheduled concurrently - except for a few partitions with over-subscription. Note however: while you have a job running on a node, you can ssh into that node (which doesn't count as a job) and do whatever you like there. That is pretty much identical to salloc.
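A minimal sketch of a job script that spreads independent sub-processes over the cores of a single node (work_on_chunk is a purely hypothetical application):

#!/bin/bash
#SBATCH --partition=maxcpu
#SBATCH --nodes=1
#SBATCH --time=01:00:00

for i in $(seq 1 $(nproc)); do    # one background process per available core
    ./work_on_chunk $i &          # hypothetical application working on chunk $i
done
wait                              # wait for all sub-processes to finish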

Can I request a specific node? 

Yes, but don't unless it's absolutely unavoidable.
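If it really is unavoidable, a specific node can be requested with the --nodelist option; the node name below is purely an example:

#SBATCH --nodelist=max-wn001    # hypothetical node name: pins the job to exactly this node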

Can I specify the number of tasks?

Not really. Specifying a number of tasks effectively means asking for ntasks nodes, since by default SLURM schedules one task per node. Use --nodes or -N instead.
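For example, to get two complete nodes:

#SBATCH --nodes=2    # request two complete nodes (instead of "-n 2")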

Troubleshooting

My job is pending with reason QOSNotAllowed?

[@max-wgs001]~% sbatch --partition=cfel myjob.sh
Submitted batch job 1555
[@max-wgs001]~% squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1555 cfel myjob school00 PD 0:00 1 (QOSNotAllowed)

On the cfel partition only the cfel account (with its cfel QOS) is allowed. The default QOS, however, is the generic "normal" QOS, so the job stays pending.
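For new submissions the matching QOS or account can be given directly; for the example above that would look roughly like this (the exact QOS and account names depend on your group):

sbatch --partition=cfel --qos=cfel myjob.sh        # request the cfel QOS explicitly
sbatch --partition=cfel --account=cfel myjob.sh    # or charge the job to the cfel account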

My job is pending with reason Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions?

It just means that all nodes in that partition are busy (and some might not be available for the reasons given). Use /software/tools/bin/savail to find out the status of the partition.

Why can't I change the account of a pending job?

You can't. You can however alter the partition or the QOS:

[@max-wgs001]~% scontrol update JobId=1555 QOS=cfel           # will get it running on the cfel partition
[@max-wgs001]~% scontrol update JobId=1555 Partition=maxcpu   # will get it running on the maxcpu partition

My job is pending with reason PartitionNodeLimit? 

[@max-wgs001]~% sbatch --partition=maxcpu --nodes=6 myjob.sh
Submitted batch job 1575
[@max-wgs001]~% squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1575 maxcpu myjob school00 PD 0:00 6 (PartitionNodeLimit)

You exceeded the limit of nodes per job. Alter the number of nodes of the pending job:

[@max-wgs001]~% scontrol update JobId=1575 NumNodes=1


My job is rejected with reason "There are not enough slots available"? 

#SBATCH -n 2 # request resources for 2 tasks
#SBATCH -N 1 # ask for exactly 1 node
#SBATCH --nodefile=my-favorite-host # Use this host

The job is asking for one node, and the nodefile contains a single entry. However, -n 2 requests two tasks, and by default SLURM schedules one task per node, so the job effectively asks for two nodes while the nodefile provides only one - hence it gets rejected. Don't specify the number of tasks; just specify the partition, the number of nodes and the run-time.
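A corrected header for the example above would roughly look like this (the time limit is a placeholder):

#SBATCH --partition=maxcpu    # the partition to run in
#SBATCH --nodes=1             # exactly one node
#SBATCH --time=01:00:00       # expected run-time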

What are the LD_PRELOAD error messages about?

Submitting a job from max-display (or one of the other display nodes) generates warnings of the type:

ERROR: ld.so: object 'libdlfaker.so' from LD_PRELOAD can not be preloaded: ignored

LD_PRELOAD is set on max-display nodes to enable GPU support via VirtualGL for FastX (and for all graphical applications running inside FastX). A batch job inherits the local environment and (correctly) complains that these libraries are not available on the compute nodes. The error message is meaningless for job execution and can be ignored. To get rid of (most of) the LD_PRELOAD error messages, reset LD_PRELOAD in your job script or in the terminal used to submit the job.

You can also force slurm to ignore the local environment or to reset LD_PRELOAD. For example

#SBATCH --export=NONE             # use a clean environment
#SBATCH --export=LD_PRELOAD=""    # keep the environment but redefine LD_PRELOAD to an empty variable

Please note: when compiling and linking an application, LD_PRELOAD libraries might get linked into your binary, which would then not run on a compute node. Always unset LD_PRELOAD before compiling!
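For example, something along these lines on the terminal (or at the top of a job script) before configuring and compiling:

unset LD_PRELOAD    # drop the VirtualGL preload libraries inherited from max-display
make                # ... then configure/compile/link as usual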


Opening XFCE or other desktop environments fails?

Gnome: ** (process:182218): WARNING **: Could not make bus activated clients aware of XDG_CURRENT_DESKTOP=GNOME environment variable: Could not connect: Connection refused

KDE: Could not start D-Bus. Can you call qdbus?

XFCE: Unable to contact settings server. Failed to connect to socket /tmp/dbus-IsZBNH6nnH: Connection refused

If you see one of the error messages above, it's most likely because you added conda to your environment. conda comes with its own, incompatible DBus installation. As soon as you add conda to your PATH (for example by adding module load conda to your .bashrc), none of the window managers will work.

Solution:

  • Remove the conda setup from your login environment (or don't add it in the first place), OR
  • add auto_activate_base: false to ~/.condarc OR
  • run conda config --set auto_activate_base false once (which will just add auto_activate_base: false to ~/.condarc)