
FAQs: Basics - Jobs - Settings & Limits - Troubleshooting


Basics

(question) How do I get access to the Maxwell cluster?
Drop an email to maxwell.service@desy.de asking for the maxwell-resource. Please explain very briefly what kind of applications you intend to run.

(question) What kind of applications are possible on the Maxwell cluster?

Preferably applications which can make use of the cluster's resources, namely the low-latency network (InfiniBand), the cluster file systems (GPFS, BeeGFS), the large memory and the multi-core architecture. Massively parallel MPI applications are typical examples; Monte Carlo simulations and similar trivially parallel applications are rather disfavored.

(question) How can I stay informed about updates, downtimes and the like?
Please subscribe to the maxwell-user mailing list; maxwell-users are subscribed automatically. Self-subscription is possible, but it is moderated and might take a moment.

(question) Can groups add their "own resources" to the Maxwell cluster?
Absolutely! We can support a spectrum of different operational modes. The hardware, however, has to be more or less identical to existing nodes. Get in touch with unix@desy.de.

Jobs

(question) Where can I compile, test, debug and submit my jobs?
Log in to max-wgs or max-display. Groups with their own resources might use the group's login hosts. Check Groups and Partitions on Maxwell for group specifics.

(question) Is graphical login available?
Use FastX for graphical login to max-display.desy.de. Check Groups & Partitions for alternative and group-specific login options.

(question) How can I allocate resources, run or submit jobs?
Check Running Jobs on Maxwell.
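
As a quick-start illustration, a minimal batch script might look like the following sketch (the partition name, run time and payload command are placeholders, not a recommendation; Running Jobs on Maxwell has the authoritative options):

```shell
#!/bin/bash
#SBATCH --partition=maxwell    # placeholder: pick the partition of your group
#SBATCH --nodes=1              # jobs get full nodes on Maxwell
#SBATCH --time=0-01:00:00      # run time (days-hours:minutes:seconds)
#SBATCH --job-name=myjob

# your actual payload goes here
srun hostname
```

Submit it with sbatch myjob.sh and monitor it with squeue -u $USER.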

(question) What is an account? 
Not to be confused with a DESY account. In slurm terminology an account is a kind of bank account: all members of an account (usually a group) share "compute credits". Once these are exceeded, all account members suffer from the same low priority. See also Running Jobs on Maxwell.

(question) Who can I ask in case of problems, missing software, further support and the like?
Get in touch with maxwell.service@desy.de.

Settings & Limits

(question) How many nodes can I use?
That depends very much on the partition and changes continuously. Check the list of limits or use sinfo to see the current limits, for example sinfo -a -o "%10P %6D %8s %5c %10m %12l %12L".

(question) How long are jobs allowed to run?
Again, this depends on the partition; typically up to 7 days. Check the list of limits or use sinfo to see the current limits.
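
The run time is requested per job with the --time option; a minimal sketch of the corresponding job-script line (the value is just an example within the typical 7-day maximum):

```shell
#SBATCH --time=2-00:00:00   # request 2 days (format: days-hours:minutes:seconds)
```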

(question) Can I run different jobs in parallel on the same node?
Yes and no. If a job gets scheduled on a node, the node is entirely yours. Your job script can fork as many processes as there are cores available, but it is up to the job script to distribute the sub-processes on the node. Submitting another job or specifying multiple tasks are separate requests which cannot be scheduled concurrently.
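
The fan-out described above can be sketched in the job script itself; here the worker is just a placeholder echo, and SLURM_CPUS_ON_NODE (set by slurm inside a job) falls back to nproc for testing outside a job:

```shell
#!/bin/bash
# Fan out one background worker per available core, then wait for all of them.
ncores=${SLURM_CPUS_ON_NODE:-$(nproc)}
for i in $(seq 1 "$ncores"); do
    echo "worker $i starting" &   # placeholder: replace echo with real work
done
wait                              # do not exit before the workers finish
echo "all $ncores workers done"
```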

(question) Can I request a specific node? 
Yes, but don't unless it is absolutely unavoidable.

(question) Can I specify the number of tasks?
Usually there is no need to, since a job gets a full node anyway. Preferably don't.

Troubleshooting

(question) My job is pending with reason QOSNotAllowed?

[school00@max-wgs001]~% sbatch --partition=cfel myjob.sh
Submitted batch job 1555
[school00@max-wgs001]~% squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1555 cfel myjob school00 PD 0:00 1 (QOSNotAllowed)

Only the cfel account resp. the cfel QOS is allowed on the cfel partition. The default QOS, however, is the generic "normal" QOS.
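
To avoid the mismatch from the start, the job can request the matching account and QOS at submission time; a sketch, assuming your group's account and QOS are indeed both called cfel:

```shell
#SBATCH --partition=cfel
#SBATCH --account=cfel   # hypothetical account name; use your own
#SBATCH --qos=cfel       # match the QOS to the partition
```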



(question) I can't change the account of a pending job!
(error) scontrol update JobId=1555 Account=cfel # will not work

No, you can't. You can however alter the partition or the QOS:

[school00@max-wgs001]~% scontrol update JobId=1555 QOS=cfel           # will get it running on the cfel partition
[school00@max-wgs001]~% scontrol update JobId=1555 Partition=maxwell  # will get it running on the maxwell partition


(question) My job is pending with reason PartitionNodeLimit? 

[school00@max-wgs001]~% sbatch --partition=maxwell --nodes=6 myjob.sh
Submitted batch job 1575
[school00@max-wgs001]~% squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1575 maxwell myjob school00 PD 0:00 6 (PartitionNodeLimit)

You exceeded the limit of nodes per job. Alter the number of nodes of the pending job:

[school00@max-wgs001]~% scontrol update JobId=1575 NumNodes=1


(question) My job is rejected with reason "There are not enough slots available"? 

#SBATCH -n 2 # request resources for 2 tasks
#SBATCH -N 1 # ask for exactly 1 node
#SBATCH --nodefile=my-favorite-host # Use this host

The job asks for one node, and the nodefile contains a single entry. However, -n 2 requests two tasks, and by default slurm schedules one task per node, so the job effectively asks for two nodes while the nodefile provides only one; hence the job gets rejected. Don't specify the number of tasks; just specify the partition, the number of nodes and the run time.
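
Following that advice, a job header that only pins down the partition, the number of nodes and the run time could look like this sketch (values are examples):

```shell
#SBATCH --partition=maxwell
#SBATCH --nodes=1
#SBATCH --time=0-02:00:00   # example run time; adjust as needed
```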



(question) What are the LD_PRELOAD error messages about? 

Submitting a job from max-display (or one of the other display nodes) generates errors of the type:

ERROR: ld.so: object 'libdlfaker.so' from LD_PRELOAD can not be preloaded: ignored

LD_PRELOAD is set on max-display nodes to enable GPU support via VirtualGL for FastX (and all graphical applications running inside FastX). A batch job will inherit the local environment and (correctly) complain about the libraries not being available on the compute nodes. The error message is meaningless for job execution and can be ignored. To get rid of (most of) the LD_PRELOAD error messages, reset LD_PRELOAD in your job script or in the terminal used to submit the job.

You can also force slurm to ignore the local environment or to reset LD_PRELOAD. For example

#SBATCH --export=NONE             # use a clean environment
#SBATCH --export=ALL,LD_PRELOAD=""    # keep the full environment but override LD_PRELOAD with an empty value

(warning) Please note: When compiling and linking an application, LD_PRELOAD libraries might get linked into your binary, which would then not run on a compute node. Always reset LD_PRELOAD before compiling!
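
In the shell used for compiling, resetting the variable beforehand is sufficient; a minimal sketch:

```shell
# Clear LD_PRELOAD in the current shell before compiling or linking,
# so no VirtualGL libraries end up referenced by the binary.
unset LD_PRELOAD
echo "LD_PRELOAD is now '${LD_PRELOAD:-}'"   # prints: LD_PRELOAD is now ''
```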



(question) Opening XFCE or other desktop environments via FastX fails?

Gnome: ** (process:182218): WARNING **: Could not make bus activated clients aware of XDG_CURRENT_DESKTOP=GNOME environment variable: Could not connect: Connection refused

KDE: Could not start D-Bus. Can you call qdbus?

XFCE: Unable to contact settings server. Failed to connect to socket /tmp/dbus-IsZBNH6nnH: Connection refused

If you see one of the error messages above, it is most likely because you added anaconda to your environment. anaconda comes with its own, incompatible DBUS installation. As soon as you add anaconda to your PATH (for example by adding a module load anaconda to your .bashrc), none of the window managers will work.
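
If you use anaconda regularly, one workaround is to load it on demand instead of unconditionally in ~/.bashrc; a sketch (the module name anaconda is taken from the text above):

```shell
# In ~/.bashrc: define a function instead of loading anaconda unconditionally.
load_anaconda() {
    module load anaconda   # only changes PATH when explicitly called
}
```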

(warning) Don't add anaconda to your login environment!




(question) ... to be continued ?



