Blog

OpenMPI problems

After upgrading to CentOS 7.8 we are experiencing problems with OpenMPI code on some nodes.

At the moment we don't understand why only some nodes are failing. Our first guess was that the CPU type is the cause, but that does not seem to be the only factor. The network is also not involved, since the test code fails even when run locally.


A list of working nodes, a small test program, and lists of working and non-working CPU types (some CPU types appear in both lists!) are attached: mpi_problem.tar.gz

There is also a bug report for CentOS 8 which seems to be related to our problem:
https://bugs.centos.org/view.php?id=17417

WORKAROUND:
* Use MPICH instead of OpenMPI (see the sketch below)
* Only submit to hosts on the working list
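
A rough sketch of the first workaround; the MPICH module name and the file names are assumptions, so check "module avail" for the module actually installed:

# Find and load an MPICH module (the exact name differs per installation).
module avail 2>&1 | grep -i mpich
module load mpi/mpich-x86_64        # assumed module name
# Rebuild the test program against MPICH and run it locally.
mpicc -o mpi_test mpi_test.c
mpirun -np 4 ./mpi_test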

We will keep you informed.

There is a local root exploit for GPFS commands like mmlsquota, see https://www.ibm.com/support/pages/node/6151701?myns=s033&mynp=OCSTXKQY&mync=E&cm_sp=s033-_-OCSTXKQY-_-E.

As a temporary workaround the setuid flag has been removed from the GPFS commands, preventing regular users from running e.g. mmlsquota. Running mmlsquota now throws a rather misleading error:


mmlsquota -u $USER --block-size auto max-home
Failed to connect to file system daemon: No such process
mmlsquota: GPFS is down on this node.
mmlsquota: Command failed. Examine previous error messages to determine cause.

That just means that $USER is not privileged to run the command, not that something is wrong with GPFS. This will be fixed with the next GPFS update on Maxwell.
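
If in doubt whether you are seeing this workaround or a real GPFS outage, you can check the setuid bit on the binary; the path below is the usual GPFS installation directory and is only an assumption for Maxwell:

# /usr/lpp/mmfs/bin is the default GPFS binary location (assumed here).
ls -l /usr/lpp/mmfs/bin/mmlsquota
# While the workaround is active the mode shows no "s" bit, i.e. -rwxr-xr-x
# instead of the usual -rwsr-xr-x.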

Anaconda3 updates

A number of conda packages have been updated or added:

cudatoolkit               10.0.130                      0    upgraded from cuda 9.0 to 10.0
cudnn                     7.6.5                cuda10.0_0    new
cupy                      7.4.0            py36h273e724_1    new
dask                      2.15.0                     py_0    upgraded
dask-core                 2.15.0                     py_0    upgraded
dask-glm                  0.2.0                      py_1    upgraded
dask-jobqueue             0.7.1                      py_0    upgraded
dask-labextension         1.0.3                      py_0    upgraded
dask-ml                   1.4.0                      py_0    new
dask-mpi                  1.0.3                    py36_0    new
distributed               2.15.2           py36h9f0ad1d_0    upgraded
extra-data                1.1.0                    pypi_0    upgraded
extra-geom                0.9.0                    pypi_0    upgraded
ipyslurm                  1.5.0                      py_0    new
libblas                   3.8.0                    14_mkl    upgraded
libcblas                  3.8.0                    14_mkl    upgraded
liblapack                 3.8.0                    14_mkl    upgraded
nccl                      2.6.4.1              hd6f8bf8_0    new
pyfai                     0.19.0           py36hb3f55d8_0    new. used to live in a conda env
pytorch                   1.4.0             cuda100py36_0    upgraded from 1.0 cuda 9.0
torchvision               0.2.1                    py36_0    new
xarray                    0.11.2                   pypi_0    upgraded
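
To check which of these versions your session actually picks up, something along the following lines should work; the anaconda module name is an assumption, adjust it to whatever "module avail" shows:

# Load the central Anaconda installation (module name assumed) and inspect a few packages.
module load maxwell anaconda        # assumed module name
conda list "cudatoolkit|cudnn|pytorch|dask"
# Quick sanity check that PyTorch sees the new CUDA toolkit.
python -c "import torch; print(torch.__version__, torch.version.cuda)"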


Due to conflicting dependencies, SuRVoS has been moved to a conda environment (survos):

@max-wgs:~$ conda env list | grep survos
survos                   /software/anaconda3/5.2/envs/survos

@max-wgs:~$ module load maxwell survos

@max-wgs:~$ which SuRVoS
/software/anaconda3/5.2/envs/survos/bin/SuRVoS


Update: The problem was resolved at ~10:30.


We are currently experiencing a serious problem in our GPFS infrastructure.

We are still analyzing which parts are involved and therefore cannot give any further details at the moment.

At the moment the Maxwell home directories are not available, so you cannot log in to the Maxwell login nodes. The software folder is also not accessible.

BeeGFS is not affected, which tells us that the InfiniBand network is not the cause.

As soon as we have new information, we will update this post.

Docker had to be removed from the Maxwell login nodes due to severe security concerns. Running Docker on batch nodes is not affected. To build Docker images, please use your personal machine or one of the batch nodes.
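
A minimal sketch of building an image from an interactive session on a batch node; partition, time limit and image name are placeholders:

# Request an interactive shell on a batch node.
srun --partition=all --time=01:00:00 --pty bash -l
# On the allocated node, build the image as usual.
docker build -t myimage:latest .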

Software Updates

During the last weeks we have updated all workgroup servers and all compute nodes to CentOS 7.7. For details you may look at the release notes.

Additionally, we have updated Singularity to the latest version 3.5.
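
A quick way to verify which Singularity version a node has picked up, plus a small smoke test (the alpine image is only an example):

singularity --version                                       # should report 3.5.x
singularity exec docker://alpine:3.10 cat /etc/os-release   # pulls a tiny image and runs a command in it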

From 08.10.2019 9:00 till 14:00 we had severe problems with the InfiniBand network on the Maxwell cluster. The home directories and several other GPFS storage systems were not available.
So login to Maxwell was not possible and running jobs may have been disturbed.


Jupyterhub interruption

For another bugfix, minor configuration changes and the addition of a few extensions we need to restart the JupyterHub (https://max-jhub.desy.de/) today at 19:00.
That will take only a few seconds, but it will most likely disconnect running kernels. In that case, you'd need to use the control panel to "Start My Server" and relaunch your notebook. The session (i.e. the Slurm job) will persist.
Apologies for the inconvenience.

On 19th September, from approximately 5:00 to 9:00, the home filesystem was not available on several nodes of the Maxwell cluster.
The problem is solved. For further questions send an email to maxwell.service@desy.de

Python3 update

With the update last week (2.9.2019) we removed all remaining python34 packages, because Python 3.4 has reached end of life (https://www.python.org/downloads/release/python-3410/) and the third-party repositories no longer offer it either.

DESY also provided some packages for 3.4, and not all of them have been rebuilt for 3.6 yet. So if you are missing a package, send us an email
at maxwell.service@desy.de
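
A quick way to check whether a package has already been rebuilt for 3.6; "requests" is just an example package name:

# Check the locally installed Python 3 RPMs for the package.
rpm -qa 'python3*' | grep -i requests
# Or query the repositories directly.
yum list available 'python36*' | grep -i requests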

With the update last week to Slurm 19.05, the syntax in sbatch command files changed: the parameter "--workdir" has been renamed to "--chdir", as in all other Slurm commands.

For details:
https://slurm.schedmd.com/sbatch.html
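
A minimal sbatch file illustrating the renamed option; path, partition and program name are placeholders:

#!/bin/bash
#SBATCH --chdir=/path/to/workdir     # formerly: #SBATCH --workdir=/path/to/workdir
#SBATCH --partition=all              # placeholder partition
#SBATCH --time=00:10:00
srun ./my_program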

Updated GIT

We provide an updated Git client in the software section:

% git --version
git version 1.8.3.1
% module load maxwell
% module load git    
% git --version  
git version 2.23.0

The problems regarding the all and allgpu partitions are solved. If you still have issues regarding the scheduling of your batch jobs, please send us a mail at maxwell.service@desy.de


Original Message:
After the Slurm update last week we see some problems regarding the "all" and "allgpu" partitions. Jobs from "privileged" partitions (exfl, cssb, upex, ...) preempt (kill) jobs which were submitted to the all* partitions, even if the privileged jobs cannot use the preempted nodes afterwards due to constraints in the job definition (see https://confluence.desy.de/display/IS/Running+Jobs+on+Maxwell).

The privileged job will "kill" a job in the all* partitions every 3 minutes until a matching node is found and the "privileged" job starts. As this bug is only triggered by pending jobs in the privileged partitions with extra constraints, not all jobs in the all* queues will fail; for example, in the last 10 hours no job was preempted in the all* queues.

We have filed a bug report with SchedMD (the company we have a Slurm support contract with) and are looking forward to a solution.
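
For illustration, a submission of the kind that triggers the bug might look like this; the partition and constraint names are placeholders, see the Confluence page above for the constraints actually defined on Maxwell:

# A pending job in a privileged partition with an extra node constraint
# is what triggers the spurious preemption in the all* partitions.
sbatch --partition=exfl --constraint=GPU jobscript.sh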