Summary
Source: https://www.msg.chem.iastate.edu/gamess/index.html
License: Free but you must agree to the license terms: https://www.msg.chem.iastate.edu/gamess/License_Agreement.html
Path: /software/gamess
Documentation: https://www.msg.chem.iastate.edu/gamess/documentation.html
Citation: if results obtained with GAMESS are published in the scientific literature, please reference the program with the article: M.W. Schmidt, K.K. Baldridge, J.A. Boatz, S.T. Elbert, M.S. Gordon, J.H. Jensen, S. Koseki, N. Matsunaga, K.A. Nguyen, S.J. Su, T.L. Windus, M. Dupuis, J.A. Montgomery, J. Comput. Chem. 14, 1347-1363 (1993). Using specific methods included in GAMESS may require citing additional articles, as described in the manual; by using GAMESS you agree to honor the request to cite such additional papers, as appropriate.
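For convenience, a BibTeX entry assembled from the citation above might look like this (the title field is not part of the citation text above, so verify it against the journal record):

```bibtex
% Sketch assembled from the citation above; double-check the title field
% against the publisher's record before use.
@article{gamess1993,
  author  = {Schmidt, M. W. and Baldridge, K. K. and Boatz, J. A. and Elbert, S. T.
             and Gordon, M. S. and Jensen, J. H. and Koseki, S. and Matsunaga, N.
             and Nguyen, K. A. and Su, S. J. and Windus, T. L. and Dupuis, M.
             and Montgomery, J. A.},
  title   = {General Atomic and Molecular Electronic Structure System},
  journal = {J. Comput. Chem.},
  volume  = {14},
  pages   = {1347--1363},
  year    = {1993}
}
```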
The General Atomic and Molecular Electronic Structure System (GAMESS) is a general ab initio quantum chemistry package.
GAMESS is maintained by the members of the Gordon research group at Iowa State University.
Using gamess with an alternative launch script
The original rungms-dev script is an 1800-line C-shell script that invokes several other scripts on the way to executing the binary. To make handling easier and more transparent we offer a simplified bash script named rgms. For a description of the original script see below.
```
module load maxwell gamess rgms

rgms -i input [-c cores] [-t threads] [-l logn] [-h] [-v]

  -i|--input    name of input file, with or without .inp extension
  -c|--cores    cores (MPI-threads) used per node (default: all physical cores)
  -l|--logn     gamess logical nodes (default: 0)
  -t|--threads  threads used per node. Identical to setting OMP_NUM_THREADS (default: 1)
  -h|--help     display this
  -v|--verbose  display debug information

Defaults:
  number of nodes:  1 or number of SLURM nodes
  USERSCR:          /home/schluenz/gamess/restart
  TMPDIR:           /scratch/schluenz
  OMP_NUM_THREADS:  1
```
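For example, an interactive run on an allocated node could look like this (the input file name is a placeholder, a sketch only):

```bash
module load maxwell gamess

# run water.inp interactively on the current node, using all physical cores (the default),
# and print some extra debug information
rgms -i water -v
```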
Some notes:
There are two options to run gamess on Maxwell:
- You run it interactively. In this case, the script forces the run onto a single node, without slurm support. The script will by default use all physical cores.
- You run it as a slurm batch job. The script again uses all physical cores by default, on all nodes allocated to the job. It assumes that all nodes have an identical number of cores!
The number of threads is controlled by '-t' OR by OMP_NUM_THREADS. rgms assumes a single thread by default. Judging from some simple tests (see the benchmarks below), it appears best to use a smaller number of cores and to increase the number of threads instead.
Note, however, that GDDI with NGROUP>0 does not seem to work for OMP_NUM_THREADS>1. rgms does not enforce this, but if you run into MPI errors while running GDDI jobs with OMP_NUM_THREADS>1, try running with OMP_NUM_THREADS=1 instead; see the sketch below.
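A minimal sketch of the two equivalent ways to set the thread count, and of the OMP_NUM_THREADS=1 fallback for GDDI jobs (input names are placeholders):

```bash
# equivalent: 4 MPI ranks per node, 16 OMP threads each
rgms -i my-molecule -c 4 -t 16
OMP_NUM_THREADS=16 rgms -i my-molecule -c 4

# GDDI job (NGROUP>0 in the input): fall back to a single OMP thread per rank
rgms -i my-gddi-job -c 64 -t 1
```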
Changes in defaults:
- rungms-dev uses $HOME/restart as USERSCR per default; rgms uses $HOME/gamess/restart
- rungms-dev sets OMP_NUM_THREADS=8 per default; rgms sets OMP_NUM_THREADS=1
- rgms stores small temporary files in ~/.gamess to redistribute files with srun
- rgms is per default much quieter than rungms-dev, unless invoked with the verbose flag
- rgms prints some job information at the end of the run
Running on EPYC 75F3
Using all 128 cores as MPI-threads on EPYC 75F3 nodes fails with 'too many open files' error messages. Intel recommends using fewer cores per node. Even with 98 cores per node, Intel MPI complains about memory allocation on the IB HCAs. Simply limit the number of cores to the 64 physical cores or even fewer, and increase the number of threads instead; see the sketch below and the benchmarks.
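For example, on a 75F3 node one might stay at or below the 64 physical cores (a sketch only; the input name is a placeholder):

```bash
# avoid the 'too many open files' / IB HCA memory problems: at most 64 MPI ranks per node
rgms -i my-molecule -c 64 -t 2

# or, following the benchmarks below, fewer ranks and more threads
rgms -i my-molecule -c 4 -t 32
```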
Benchmarks
After a lot of trial and error it appears best to use a small number of MPI-threads (cores) and to scale up the number of OMP-threads instead. This may vary substantially depending on the input file!
In an initial benchmark I used 4 nodes of EPYC 75F3 or 4 nodes of EPYC 7402 and varied the number of MPI-threads and OMP-threads over a range of values:
| CPU | Nodes | Cores | MPI-T | OMP-T | C*T | Time (s) | CPU | Nodes | Cores | MPI-T | OMP-T | C*T | Time (s) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| E-7402 | 4 | 96 | 1 | | | | E-75F3 | 4 | 128 | 1 | 16 | 16 | 833 |
| | | | | 32 | 32 | 532 | | | | | 32 | 32 | 430 |
| | | | | 48 | 48 | 388 | | | | | 64 | 64 | 269 |
| | | | | 96 | 96 | 225 | | | | | 128 | 128 | 163 |
| E-7402 | 4 | 96 | 2 | | | | E-75F3 | 4 | 128 | 2 | 16 | 32 | 430 |
| | | | | 32 | 64 | 298 | | | | | 32 | 64 | 262 |
| | | | | 48 | 96 | 215 | | | | | 64 | 128 | 152 |
| | | | | 96 | 192 | 271 | | | | | 128 | 256 | 208 |
| E-7402 | 4 | 96 | 4 | 1 | 4 | 4286 | E-75F3 | 4 | 128 | 4 | 1 | 4 | 3493 |
| | | | | 2 | 8 | 2173 | | | | | 2 | 8 | 1770 |
| | | | | 4 | 16 | 1097 | | | | | 4 | 16 | 891 |
| | | | | 8 | 32 | 572 | | | | | 8 | 32 | 460 |
| | | | | 16 | 64 | 318 | | | | | 16 | 64 | 281 |
| | | | | 32 | 128 | 239 | | | | | 32 | 128 | 156 |
| E-7402 | 4 | 96 | 8 | 1 | 8 | 2464 | E-75F3 | 4 | 128 | 8 | 1 | 8 | 2020 |
| | | | | 2 | 16 | 1274 | | | | | 2 | 16 | 1031 |
| | | | | 4 | 32 | 658 | | | | | 4 | 32 | 528 |
| | | | | 8 | 64 | 369 | | | | | 8 | 64 | 314 |
| | | | | 16 | 128 | 263 | | | | | 16 | 128 | 175 |
| | | | | 32 | 256 | 264 | | | | | 32 | 256 | 180 |
| E-7402 | 4 | 96 | 16 | 1 | 16 | 1661 | E-75F3 | 4 | 128 | 16 | 1 | 16 | 1350 |
| | | | | 2 | 32 | 880 | | | | | 2 | 32 | 701 |
| | | | | 4 | 64 | 519 | | | | | 4 | 64 | 399 |
| | | | | 8 | 128 | 378 | | | | | 8 | 128 | 235 |
| | | | | 16 | 256 | 308 | | | | | 16 | 256 | 208 |
| | | | | 32 | 512 | 310 | | | | | 32 | 512 | - |
Conclusion: for this particular test it is clearly best to use 2 or 4 MPI-threads with a correspondingly large number of OMP-threads.
Based on the results of that benchmark I ran a few tests for various numbers of nodes, using 4 or 8 MPI-threads:
Runtime (s) for 1 to 9 nodes:

| CPU | Cores | MPI-T | OMP-T | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| E-75F3 | 128 | 4 | 32 | 521 | 272 | 195 | 156 | 132 | 117 | 106 | 98 | 95 |
| | | 8 | 16 | 526 | 294 | 214 | 178 | 151 | 134 | 125 | 120 | 120 |
| E-7402 | 96 | 4 | 24 | 783 | 404 | 287 | 228 | 192 | 169 | 153 | 141 | 140 |
| | | 8 | 12 | 802 | 445 | 326 | 265 | 227 | 201 | 188 | 182 | 182 |
Conclusion: the gain going from 4 to 8 nodes is about 38%, which is not bad, but in view of the limitations of the all partition it is not worth it. Going from 4 to 6 nodes (which is still possible in the maxcpu partition) reduces the runtime by 25%. For a single job, and assuming that 6 nodes are free (which is practically never the case), it might be worth it, but I'd recommend staying with a maximum of 4 nodes.
Setting GDDI NGROUP=i with i=0,1,2,3,4 had no effect whatsoever on the runtime. Not knowing much about gamess and GDDI, I have the impression that GDDI has advantages on real shared-memory machines but offers no advantage (rather disadvantages) on the Maxwell cluster. We would be grateful for any indication that this impression is misleading (maxwell.service@desy.de)!
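For reference, the GDDI setting is controlled via the $GDDI group of the GAMESS input; a minimal sketch of the relevant line (the rest of the input file is omitted, and the exact keyword usage should be checked against the GAMESS manual):

```
 $GDDI NGROUP=4 $END
```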
Recommended batch-script
In view of the benchmarks and the limitations of the partitions, I'd recommend using at most 4 nodes, 4 MPI-threads, and a corresponding number of OMP-threads. A simple batch script could then look like this:
```bash
#!/bin/bash
#SBATCH --partition=maxcpu
#SBATCH --constraint='[75F3|7402|(V4&E5-2698)|Gold-6240]'
#SBATCH --nodes=4
unset LD_PRELOAD
unset TMPDIR

# total number of available cores per node:
np=$(nproc)
# mpi-threads
cores=4
# omp-threads
threads=$(( $np / $cores ))

# input/target to work on. $target.inp has to exist
target=my-gamess-run

/software/gamess/2021.09/rgms -i $target -c $cores -t $threads > $target.log 2>&1
```
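Saved as, for example, gamess-job.sh (the file name is arbitrary), the script is submitted with:

```bash
sbatch gamess-job.sh
```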
Note: the Intel E5-2640 V4 nodes are too old. Jobs will fail with MPI errors when using this old hardware (IB HCAs).
Using gamess with the original launch script
To initialize the environment use the module command:
```
# intel oneapi 2021 compiled gamess supporting MPI:
[max]% module load maxwell gamess
[max]% which gamess
/software/gamess/2021.09/gamess

# older single-node versions compiled with gcc 8.2 supporting OPENMP:
[max]% module load maxwell gamess/2019.07

Provided by module(s) ...
module load maxwell gamess; which gamess: /software/gamess/gamess
Documentation: https://confluence.desy.de/display/IS/gamess
URL:           https://www.msg.chem.iastate.edu/gamess/index.html
Manual:        https://www.msg.chem.iastate.edu/gamess/documentation.html
License:       https://www.msg.chem.iastate.edu/gamess/License_Agreement.html
```
Sample batch script:
```bash
#!/bin/bash
#SBATCH --partition=maxcpu
#SBATCH --job-name=gamess
#SBATCH --output=gamess.out
#SBATCH --time=0-04:00:00
#SBATCH --constraint='[75F3|7402]'
#SBATCH --nodes=6
unset LD_PRELOAD
unset TMPDIR

export USERSCR=$PWD/restart
mkdir -p $USERSCR

job=gms.sample
input=$job.inp
np=$(( $(nproc) / 2 ))
nt=$(( $np * $SLURM_NNODES ))

/software/gamess/2020.09/gamess $input 00 $nt $np 1 > $job.log 2>&1

#------------------------------------------------------------------------------
# notes:

#SBATCH --constraint='[75F3|7402]'
#SBATCH --nodes=6
# picks 6 identical nodes, either of type EPYC 75F3 or EPYC 7402. Both are well suited for the task, I believe.

unset TMPDIR
# just in case. It forces gamess to use slurm's local TMPDIR. The rungms-dev wrapper takes care of creating input files on local disks.

export USERSCR=$PWD/restart
# per default gamess stores data files in ~/restart

np=$(( $(nproc) / 2 ))
nt=$(( $np * $SLURM_NNODES ))
# np is the number of physical cores per node. MPI jobs don't gain anything from hyperthreaded cores.
# nt is the total number of cores to be used by gamess.
# This seems the best way for gamess to distribute tasks.

/software/gamess/2021.09/rungms-dev $input 00 $nt $np 1 > $job.log 2>&1
# does the run. 2>&1 redirects errors to $job.log as well, which you might or might not prefer.
```