Summary

Source: https://www.msg.chem.iastate.edu/gamess/index.html

License:  Free but you must agree to the license terms: https://www.msg.chem.iastate.edu/gamess/License_Agreement.html

Path:  /software/gamess

Documentation: https://www.msg.chem.iastate.edu/gamess/documentation.html

Citation: If results obtained with GAMESS are published in the scientific literature, reference the program by citing M.W. Schmidt, K.K. Baldridge, J.A. Boatz, S.T. Elbert, M.S. Gordon, J.H. Jensen, S. Koseki, N. Matsunaga, K.A. Nguyen, S.J. Su, T.L. Windus, M. Dupuis, J.A. Montgomery, J. Comput. Chem. 14, 1347-1363 (1993). Using specific methods included in GAMESS may require citing additional articles, as described in the manual; please honor these requests as appropriate.

The General Atomic and Molecular Electronic Structure System (GAMESS) is a general ab initio quantum chemistry package.

GAMESS is maintained by the members of the Gordon research group at Iowa State University.

Using gamess with an alternative launch script

The original rungms-dev script is an 1800-line C-shell beast that invokes a few other scripts on the way to executing the binary. To make handling a bit easier and more transparent, we offer a simplified bash script named rgms. For a description of the original script, see below.

module load maxwell gamess
rgms
   rgms -i input [-c cores] [-t threads] [-l logn] [-h] [-v]

       -i|--input   name of input file with or without .inp extension

       -c|--cores   cores (MPI-threads) used per node (default: all physical cores) 
       -l|--logn    gamess logical nodes (default: 0)
       -t|--threads threads used per node. Identical to setting OMP_NUM_THREADS (default: 1)

       -h|--help    display this
       -v|--verbose display debug information 

    Defaults:
        number of nodes: 1 or number of SLURM nodes
        USERSCR:         /home/schluenz/gamess/restart
        TMPDIR:          /scratch/schluenz
        OMP_NUM_THREADS: 1        
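A minimal example (the input name my-gamess-run is just a placeholder): load the module and start a run with a small number of MPI ranks and a matching number of OMP threads:

module load maxwell gamess
# run my-gamess-run.inp with 4 MPI ranks and 16 OMP threads per node
rgms -i my-gamess-run -c 4 -t 16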

Some notes:

There are two options to run gamess on Maxwell:

  1. You run it interactively. In this case, the script forces the run onto a single node, without SLURM support. By default the script uses all physical cores.
  2. You run it as a SLURM batch job. The script again uses all physical cores by default, on all nodes allocated to the job. It assumes that all nodes have an identical number of cores!

The number of threads is controlled by '-t' OR OMP_NUM_THREADS; by default rgms assumes a single thread. Judging from some simple tests (see the benchmarks below), it appears best to use a smaller number of cores and instead increase the number of threads.
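For example (my-input is a placeholder, and it is assumed that rgms picks up a pre-set OMP_NUM_THREADS, as the help text suggests), the following two invocations should be equivalent:

# set the thread count via the environment (assumption: rgms honors a pre-set OMP_NUM_THREADS)
export OMP_NUM_THREADS=16
rgms -i my-input -c 4

# or set it via the -t option
rgms -i my-input -c 4 -t 16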

Note, however, that GDDI with NGROUP>0 does not seem to work for OMP_NUM_THREADS>1. rgms does not enforce this, but if you run into MPI errors while running GDDI jobs with OMP_NUM_THREADS>1, try OMP_NUM_THREADS=1 instead.

Changes in defaults:

  • rungms-dev uses $HOME/restart as USERSCR by default; rgms uses $HOME/gamess/restart
  • rungms-dev sets OMP_NUM_THREADS=8 by default; rgms sets OMP_NUM_THREADS=1
  • rgms stores small temporary files in ~/.gamess to redistribute files with srun
  • rgms is by default much quieter than rungms-dev, unless invoked with the verbose flag
  • rgms prints some job information at the end of the run

Running on EPYC 75F3

Using all 128 cores as MPI-threads on EPYC 75F3 nodes fails with 'too many open files' error messages. Intel recommends using fewer cores per node. Even with 98 cores per node, Intel MPI complains about memory allocation on the IB HCAs. Simply limit the number of cores to the 64 physical cores or even fewer, and increase the number of threads instead. See below.
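For example, on a 75F3 node (64 physical cores, 128 hardware threads) something along these lines should avoid the problem (the input name is a placeholder):

# stay within the 64 physical cores and fill the remaining hardware threads with OMP threads
rgms -i my-input -c 64 -t 2
# or, following the benchmarks below, use far fewer MPI ranks and more threads
rgms -i my-input -c 4 -t 32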

Benchmarks

After a lot of trial and error, it appears best to use a small number of MPI-threads (cores) and instead scale up the number of OMP-threads. That may vary substantially depending on the input file!

In an initial benchmark I used 4 nodes of EPYC 75F3 or 4 nodes of EPYC 7402, and varied the number of MPI-threads and OMP-threads over a range of values:

CPU      Nodes   Cores   MPI-T   OMP-T   C*T    Time (s)
E-7402   4       96      1       -       -      -
E-75F3   4       128     1       16      16     833
E-7402   4       96      1       32      32     532
E-75F3   4       128     1       32      32     430
E-7402   4       96      1       48      48     388
E-75F3   4       128     1       64      64     269
E-7402   4       96      1       96      96     225
E-75F3   4       128     1       128     128    163
E-7402   4       96      2       -       -      -
E-75F3   4       128     2       16      32     430
E-7402   4       96      2       32      64     298
E-75F3   4       128     2       32      64     262
E-7402   4       96      2       48      96     215
E-75F3   4       128     2       64      128    152
E-7402   4       96      2       96      192    271
E-75F3   4       128     2       128     256    208
E-7402   4       96      4       1       4      4286
E-75F3   4       128     4       1       4      3493
E-7402   4       96      4       2       8      2173
E-75F3   4       128     4       2       8      1770
E-7402   4       96      4       4       16     1097
E-75F3   4       128     4       4       16     891
E-7402   4       96      4       8       32     572
E-75F3   4       128     4       8       32     460
E-7402   4       96      4       16      64     318
E-75F3   4       128     4       16      64     281
E-7402   4       96      4       32      128    239
E-75F3   4       128     4       32      128    156
E-7402   4       96      8       1       8      2464
E-75F3   4       128     8       1       8      2020
E-7402   4       96      8       2       16     1274
E-75F3   4       128     8       2       16     1031
E-7402   4       96      8       4       32     658
E-75F3   4       128     8       4       32     528
E-7402   4       96      8       8       64     369
E-75F3   4       128     8       8       64     314
E-7402   4       96      8       16      128    263
E-75F3   4       128     8       16      128    175
E-7402   4       96      8       32      256    264
E-75F3   4       128     8       32      256    180
E-7402   4       96      16      1       16     1661
E-75F3   4       128     16      1       16     1350
E-7402   4       96      16      2       32     880
E-75F3   4       128     16      2       32     701
E-7402   4       96      16      4       64     519
E-75F3   4       128     16      4       64     399
E-7402   4       96      16      8       128    378
E-75F3   4       128     16      8       128    235
E-7402   4       96      16      16      256    308
E-75F3   4       128     16      16      256    208
E-7402   4       96      16      32      512    310
E-75F3   4       128     16      32      512    -
  • Nodes: number of slurm nodes
  • Cores: number of cores per slurm node (incl. hyperthreading)
  • MPI-T: MPI threads (# of cores per node used for MPI)
  • OMP-T: OMP_NUM_THREADS
  • C*T: MPI-Threads*OMP-Threads. Best choice if C*T=Cores
  • Time (s): elapsed time to run the job

Conclusion: for this particular test it's obviously best to use 2 or 4 MPI-threads and a corresponding number of OMP-threads.
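A sketch of how such a scan could be scripted inside a single 4-node allocation; the values, constraint, and input name are illustrative and not the exact commands used for the benchmark:

#!/bin/bash
#SBATCH --partition=maxcpu
#SBATCH --constraint=75F3
#SBATCH --nodes=4
unset LD_PRELOAD
unset TMPDIR

target=my-gamess-run                 # $target.inp has to exist

# scan a few MPI-rank counts; derive the OMP thread count so that
# MPI-threads * OMP-threads fills the 128 hardware threads of a 75F3 node
for cores in 2 4 8 16; do
    threads=$(( 128 / cores ))
    /software/gamess/2021.09/rgms -i $target -c $cores -t $threads \
        > $target.c${cores}.t${threads}.log 2>&1
done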

Based on the results of that benchmark I ran a few tests for various numbers of nodes, using 4 or 8 MPI-threads:





Runtime (s) for # of nodes (columns 1-9):

CPU      Cores   MPI-T   OMP-T     1     2     3     4     5     6     7     8     9
E-75F3   128     4       32      521   272   195   156   132   117   106    98    95
E-75F3   128     8       16      526   294   214   178   151   134   125   120   120
E-7402   96      4       24      783   404   287   228   192   169   153   141   140
E-7402   96      8       12      802   445   326   265   227   201   188   182   182

Conclusion: the gain going from 4 to 8 nodes is about 38%, which is not bad, but in view of the limitations of the all partition it's absolutely not worth it. Going from 4 to 6 nodes (which is still possible in the maxcpu partition) reduces the runtime by 25%. For a single job - and assuming that 6 nodes are free, which is practically never the case - it might be worth it, but I'd recommend staying with a maximum of 4 nodes.
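To repeat such a node-scaling test, one option is to submit the same batch script (for example the one recommended below) several times and only override the node count on the command line; my-gamess-run.sh is a placeholder for your copy of the script:

# submit the batch script with 1 to 4 nodes;
# sbatch options given on the command line override the #SBATCH directives in the script
for n in 1 2 3 4; do
    sbatch --nodes=$n --job-name=gamess-${n}n my-gamess-run.sh
done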

Setting GDDI NGROUP=i with i=0,1,2,3,4 had no effect on the runtime whatsoever. Not knowing much about gamess and GDDI, I have the impression that GDDI has advantages on real shared-memory machines, but that there is no advantage (and rather disadvantages) on the Maxwell cluster. We would be grateful for any indication that this impression is misleading (maxwell.service@desy.de)!

Recommended batch-script

In view of the benchmarks and the limitations of the partitions, I'd recommend using at most 4 nodes, 4 MPI-threads, and a corresponding number of OMP-threads. A simple batch script would then look like this:

#!/bin/bash
#SBATCH --partition=maxcpu
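# pick 4 identical nodes of one of the listed CPU types (the square brackets mean all nodes get the same feature)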
#SBATCH --constraint='[75F3|7402|(V4&E5-2698)|Gold-6240]'
#SBATCH --nodes=4
unset LD_PRELOAD
unset TMPDIR

# total number of available cores per node:
np=$(nproc)

# mpi-threads 
cores=4

# omp-threads
threads=$(( $np / $cores ))
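# e.g. on an EPYC 75F3 node: nproc = 128, cores = 4  ->  threads = 32
#      on an EPYC 7402 node: nproc = 96,  cores = 4  ->  threads = 24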

# input/target to work on. $target.inp has to exist
target=my-gamess-run

/software/gamess/2021.09/rgms -i $target -c $cores -t $threads > $target.log 2>&1 


Note: Intel V4 E5-2640 nodes are too old. Jobs will fail with MPI errors when using this old hardware (IB HCAs).

Using gamess with the original launch script

To initialize the environment use the module command:

# intel oneapi 2021 compiled gamess supporting MPI:
[max]% module load maxwell gamess
[max]% which gamess
/software/gamess/2021.09/gamess

# older single-node versions compiled with gcc 8.2 supporting OPENMP:
[max]% module load maxwell gamess/2019.07
 
 Provided by module(s)
   ... module load maxwell gamess; which gamess: /software/gamess/gamess

 Documentation: https://confluence.desy.de/display/IS/gamess
 URL:           https://www.msg.chem.iastate.edu/gamess/index.html
 Manual:        https://www.msg.chem.iastate.edu/gamess/documentation.html
 License:       https://www.msg.chem.iastate.edu/gamess/License_Agreement.html

Sample batch script:

#!/bin/bash
#SBATCH --partition=maxcpu
#SBATCH --job-name=gamess
#SBATCH --output=gamess.out
#SBATCH --time=0-04:00:00  
#SBATCH --constraint='[75F3|7402]'
#SBATCH --nodes=6
unset LD_PRELOAD
unset TMPDIR
export USERSCR=$PWD/restart
mkdir -p $USERSCR

job=gms.sample
input=$job.inp

np=$(( $(nproc) / 2 ))
nt=$(( $np * $SLURM_NNODES ))

/software/gamess/2020.09/gamess $input 00 $nt $np 1 > $job.log 2>&1
#----------------------------------------------------------------------------------------------------------------------------------------
# notes:

#SBATCH --constraint='[75F3|7402]'
#SBATCH --nodes=6
# that picks 6 identical nodes, either of type EPYC 75F3 or EPYC 7402. Both are well suited for the task I believe.

unset TMPDIR
# just in case. It forces gamess to use slurm's local TMPDIR. The rungms-dev wrapper takes care of creating input files on local disks.

export USERSCR=$PWD/restart
# by default gamess stores data files in ~/restart

np=$(( $(nproc) / 2 ))
nt=$(( $np * $SLURM_NNODES ))

# np is the number of physical cores per node. mpi jobs don't gain anything from hyperthreaded cores.
# nt is the total number of cores to be used by gamess
# Seems the best way for gamess to distribute tasks.
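# Example: on a 6-node allocation of EPYC 7402 nodes (96 hardware threads each):
#   np = 96 / 2  = 48    physical cores per node
#   nt = 48 * 6  = 288   total number of compute processes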

/software/gamess/2021.09/rungms-dev $input 00 $nt $np 1 > $job.log 2>&1
# does the run. 2>&1 also redirects errors to $job.log, which you might or might not prefer.