Summary
Source: https://www.msg.chem.iastate.edu/gamess/index.html
License: Free but you must agree to the license terms: https://www.msg.chem.iastate.edu/gamess/License_Agreement.html
Path: /software/gamess
Documentation: https://www.msg.chem.iastate.edu/gamess/documentation.html
Citation: if results obtained with GAMESS are published in the scientific literature, please reference the program with the article: M.W. Schmidt, K.K. Baldridge, J.A. Boatz, S.T. Elbert, M.S. Gordon, J.H. Jensen, S. Koseki, N. Matsunaga, K.A. Nguyen, S.J. Su, T.L. Windus, M. Dupuis, J.A. Montgomery, J. Comput. Chem. 14, 1347-1363 (1993). Using specific methods included in GAMESS may require citing additional articles, as described in the manual; by using GAMESS you agree to honor the request to cite such additional papers, as appropriate.
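For convenience, a BibTeX entry assembled from the citation above might look like this (the title field is not part of the citation text above, so verify it against the journal record):

```bibtex
% Sketch assembled from the citation above; double-check the title field
% against the publisher's record before use.
@article{gamess1993,
  author  = {Schmidt, M. W. and Baldridge, K. K. and Boatz, J. A. and Elbert, S. T.
             and Gordon, M. S. and Jensen, J. H. and Koseki, S. and Matsunaga, N.
             and Nguyen, K. A. and Su, S. J. and Windus, T. L. and Dupuis, M.
             and Montgomery, J. A.},
  title   = {General Atomic and Molecular Electronic Structure System},
  journal = {J. Comput. Chem.},
  volume  = {14},
  pages   = {1347--1363},
  year    = {1993}
}
```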
The General Atomic and Molecular Electronic Structure System (GAMESS) is a general ab initio quantum chemistry package.
GAMESS is maintained by the members of the Gordon research group at Iowa State University.
Using gamess with an alternative launch script
The original rungms-dev script is an 1800-line C-shell script that invokes several other scripts on the way to executing the binary. To make handling easier and more transparent we offer a simplified bash script named rgms. For a description of the original script see below.
```
module load maxwell gamess rgms

rgms -i input [-c cores] [-t threads] [-l logn] [-h] [-v]

  -i|--input    name of input file, with or without .inp extension
  -c|--cores    cores (MPI-threads) used per node (default: all physical cores)
  -l|--logn     gamess logical nodes (default: 0)
  -t|--threads  threads used per node. Identical to setting OMP_NUM_THREADS (default: 1)
  -h|--help     display this
  -v|--verbose  display debug information

Defaults:
  number of nodes:  1 or number of SLURM nodes
  USERSCR:          /home/schluenz/gamess/restart
  TMPDIR:           /scratch/schluenz
  OMP_NUM_THREADS:  1
```
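For example, an interactive run on an allocated node could look like this (the input file name is a placeholder, a sketch only):

```bash
module load maxwell gamess

# run water.inp interactively on the current node, using all physical cores (the default),
# and print some extra debug information
rgms -i water -v
```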
Some notes:
There are two options to run gamess on Maxwell:
- You run it interactively. In this case, the script forces the run onto a single node, without slurm support. The script will by default use all physical cores.
- You run it as a slurm batch job. The script again uses all physical cores by default, on all nodes allocated to the job. It assumes that all nodes have an identical number of cores!
The number of threads is controlled by '-t' OR by OMP_NUM_THREADS. rgms assumes a single thread by default. Judging from some simple tests (see the benchmarks below), it appears best to use a smaller number of cores and to increase the number of threads instead.
Note, however, that GDDI with NGROUP>0 does not seem to work for OMP_NUM_THREADS>1. rgms does not enforce this, but if you run into MPI errors while running GDDI jobs with OMP_NUM_THREADS>1, try running with OMP_NUM_THREADS=1 instead; see the sketch below.
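A minimal sketch of the two equivalent ways to set the thread count, and of the OMP_NUM_THREADS=1 fallback for GDDI jobs (input names are placeholders):

```bash
# equivalent: 4 MPI ranks per node, 16 OMP threads each
rgms -i my-molecule -c 4 -t 16
OMP_NUM_THREADS=16 rgms -i my-molecule -c 4

# GDDI job (NGROUP>0 in the input): fall back to a single OMP thread per rank
rgms -i my-gddi-job -c 64 -t 1
```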
Changes in defaults:
- rungms-dev uses $HOME/restart as USERSCR per default; rgms uses $HOME/gamess/restart
- rungms-dev sets OMP_NUM_THREADS=8 per default; rgms sets OMP_NUM_THREADS=1
- rgms stores small temporary files in ~/.gamess to redistribute files with srun
- rgms is per default much quieter than rungms-dev, unless invoked with the verbose flag
- rgms prints some job information at the end of the run
Running on EPYC 75F3
Using all 128 cores as MPI-threads on EPYC 75F3 nodes fails with 'too many open files' error messages. Intel recommends using fewer cores per node. Even with 98 cores per node, Intel MPI complains about memory allocation on the IB HCAs. Simply limit the number of cores to the 64 physical cores or even fewer, and increase the number of threads instead; see the sketch below and the benchmarks.
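For example, on a 75F3 node one might stay at or below the 64 physical cores (a sketch only; the input name is a placeholder):

```bash
# avoid the 'too many open files' / IB HCA memory problems: at most 64 MPI ranks per node
rgms -i my-molecule -c 64 -t 2

# or, following the benchmarks below, fewer ranks and more threads
rgms -i my-molecule -c 4 -t 32
```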
Benchmarks
After a lot of trial and error it appears best to use a small number of MPI-threads (cores) and to scale up the number of OMP-threads instead. This may vary substantially depending on the input file!
In an initial benchmark I used 4 nodes of EPYC 75F3 or 4 nodes of EPYC 7402 and varied the number of MPI-threads and OMP-threads over a range of values:
| CPU | Nodes | Cores | MPI-T | OMP-T | C*T | Time (s) | CPU | Nodes | Cores | MPI-T | OMP-T | C*T | Time (s) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| E-7402 | 4 | 96 | 1 | | | | E-75F3 | 4 | 128 | 1 | 16 | 16 | 833 |
| | | | | 32 | 32 | 532 | | | | | 32 | 32 | 430 |
| | | | | 48 | 48 | 388 | | | | | 64 | 64 | 269 |
| | | | | 96 | 96 | 225 | | | | | 128 | 128 | 163 |
| E-7402 | 4 | 96 | 2 | | | | E-75F3 | 4 | 128 | 2 | 16 | 32 | 430 |
| | | | | 32 | 64 | 298 | | | | | 32 | 64 | 262 |
| | | | | 48 | 96 | 215 | | | | | 64 | 128 | 152 |
| | | | | 96 | 192 | 271 | | | | | 128 | 256 | 208 |
| E-7402 | 4 | 96 | 4 | 1 | 4 | 4286 | E-75F3 | 4 | 128 | 4 | 1 | 4 | 3493 |
| | | | | 2 | 8 | 2173 | | | | | 2 | 8 | 1770 |
| | | | | 4 | 16 | 1097 | | | | | 4 | 16 | 891 |
| | | | | 8 | 32 | 572 | | | | | 8 | 32 | 460 |
| | | | | 16 | 64 | 318 | | | | | 16 | 64 | 281 |
| | | | | 32 | 128 | 239 | | | | | 32 | 128 | 156 |
| E-7402 | 4 | 96 | 8 | 1 | 8 | 2464 | E-75F3 | 4 | 128 | 8 | 1 | 8 | 2020 |
| | | | | 2 | 16 | 1274 | | | | | 2 | 16 | 1031 |
| | | | | 4 | 32 | 658 | | | | | 4 | 32 | 528 |
| | | | | 8 | 64 | 369 | | | | | 8 | 64 | 314 |
| | | | | 16 | 128 | 263 | | | | | 16 | 128 | 175 |
| | | | | 32 | 256 | 264 | | | | | 32 | 256 | 180 |
| E-7402 | 4 | 96 | 16 | 1 | 16 | 1661 | E-75F3 | 4 | 128 | 16 | 1 | 16 | 1350 |
| | | | | 2 | 32 | 880 | | | | | 2 | 32 | 701 |
| | | | | 4 | 64 | 519 | | | | | 4 | 64 | 399 |
| | | | | 8 | 128 | 378 | | | | | 8 | 128 | 235 |
| | | | | 16 | 256 | 308 | | | | | 16 | 256 | 208 |
| | | | | 32 | 512 | 310 | | | | | 32 | 512 | - |
Conclusion: for this particular test it is clearly best to use 2 or 4 MPI-threads with a correspondingly large number of OMP-threads.
Based on the results of that benchmark I ran a few tests for various numbers of nodes, using 4 or 8 MPI-threads:
Runtime (s) for 1 to 9 nodes:

| CPU | Cores | MPI-T | OMP-T | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| E-75F3 | 128 | 4 | 32 | 521 | 272 | 195 | 156 | 132 | 117 | 106 | 98 | 95 |
| | | 8 | 16 | 526 | 294 | 214 | 178 | 151 | 134 | 125 | 120 | 120 |
| E-7402 | 96 | 4 | 24 | 783 | 404 | 287 | 228 | 192 | 169 | 153 | 141 | 140 |
| | | 8 | 12 | 802 | 445 | 326 | 265 | 227 | 201 | 188 | 182 | 182 |
Conclusion: the gain going from 4 to 8 nodes is about 38%, which is not bad, but in view of the limitations of the all partition it is not worth it. Going from 4 to 6 nodes (which is still possible in the maxcpu partition) reduces the runtime by 25%. For a single job, and assuming that 6 nodes are free (which is practically never the case), it might be worth it, but I'd recommend staying with a maximum of 4 nodes.
Setting GDDI NGROUP=i with i=0,1,2,3,4 had no effect whatsoever on the runtime. Not knowing much about gamess and GDDI, I have the impression that GDDI has advantages on real shared-memory machines but offers no advantage (rather disadvantages) on the Maxwell cluster. We would be grateful for any indication that this impression is misleading (maxwell.service@desy.de)!
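For reference, the GDDI setting is controlled via the $GDDI group of the GAMESS input; a minimal sketch of the relevant line (the rest of the input file is omitted, and the exact keyword usage should be checked against the GAMESS manual):

```
 $GDDI NGROUP=4 $END
```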
Recommended batch-script
In view of the benchmarks and the limitations of the partitions, I'd recommend using at most 4 nodes, 4 MPI-threads, and a corresponding number of OMP-threads. A simple batch script could then look like this:
```bash
#!/bin/bash
#SBATCH --partition=maxcpu
#SBATCH --constraint='[75F3|7402|(V4&E5-2698)|Gold-6240]'
#SBATCH --nodes=4
unset LD_PRELOAD
unset TMPDIR

# total number of available cores per node:
np=$(nproc)
# mpi-threads
cores=4
# omp-threads
threads=$(( $np / $cores ))

# input/target to work on. $target.inp has to exist
target=my-gamess-run

/software/gamess/2021.09/rgms -i $target -c $cores -t $threads > $target.log 2>&1
```
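Saved as, for example, gamess-job.sh (the file name is arbitrary), the script is submitted with:

```bash
sbatch gamess-job.sh
```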
Note: the Intel E5-2640 V4 nodes are too old. Jobs will fail with MPI errors when using this old hardware (IB HCAs).
Using gamess with the original launch script
To initialize the environment use the module command:
```
# intel oneapi 2021 compiled gamess supporting MPI:
[max]% module load maxwell gamess
[max]% which gamess
/software/gamess/2021.09/gamess

# older single-node versions compiled with gcc 8.2 supporting OPENMP:
[max]% module load maxwell gamess/2019.07

Provided by module(s) ...
module load maxwell gamess; which gamess: /software/gamess/gamess
Documentation: https://confluence.desy.de/display/IS/gamess
URL:           https://www.msg.chem.iastate.edu/gamess/index.html
Manual:        https://www.msg.chem.iastate.edu/gamess/documentation.html
License:       https://www.msg.chem.iastate.edu/gamess/License_Agreement.html
```
Sample batch script:
```bash
#!/bin/bash
#SBATCH --partition=maxcpu
#SBATCH --job-name=gamess
#SBATCH --output=gamess.out
#SBATCH --time=0-04:00:00
#SBATCH --constraint='[75F3|7402]'
#SBATCH --nodes=6
unset LD_PRELOAD
unset TMPDIR

export USERSCR=$PWD/restart
mkdir -p $USERSCR

job=gms.sample
input=$job.inp
np=$(( $(nproc) / 2 ))
nt=$(( $np * $SLURM_NNODES ))

/software/gamess/2020.09/gamess $input 00 $nt $np 1 > $job.log 2>&1

#------------------------------------------------------------------------------
# notes:

#SBATCH --constraint='[75F3|7402]'
#SBATCH --nodes=6
# picks 6 identical nodes, either of type EPYC 75F3 or EPYC 7402. Both are well suited for the task, I believe.

unset TMPDIR
# just in case. It forces gamess to use slurm's local TMPDIR. The rungms-dev wrapper takes care of creating input files on local disks.

export USERSCR=$PWD/restart
# per default gamess stores data files in ~/restart

np=$(( $(nproc) / 2 ))
nt=$(( $np * $SLURM_NNODES ))
# np is the number of physical cores per node. MPI jobs don't gain anything from hyperthreaded cores.
# nt is the total number of cores to be used by gamess.
# This seems the best way for gamess to distribute tasks.

/software/gamess/2021.09/rungms-dev $input 00 $nt $np 1 > $job.log 2>&1
# does the run. 2>&1 redirects errors to $job.log as well, which you might or might not prefer.
```