Maxwell : namd benchmarks

To get a quick idea of NAMD performance on the different compute nodes, a few test runs based on the apoa1 benchmark (https://www.ks.uiuc.edu/Research/namd/utilities/) were done. The setup is not tuned at all, and the submission jobs are quite naive (and probably simply not well suited). The numbers are not very consistent, so you should rather tailor your own benchmarks to your computational problem. The verbs-smp version seems a good choice. On AMD, use all physical cores (without hyperthreading); on Intel, using all cores (incl. hyperthreading) seems preferable. For this particular benchmark, the use of GPUs does not add much.

CPU Job Templates

To iterate over the different CPU types available in the all partition, I used the following job-submission scriptlet:

benchmark-cpu.sh
#!/bin/bash
unset LD_PRELOAD
# iterate over number of nodes
for n in 1 ; do 
    # iterate over cpu constraints
    for c in 'EPYC&7402' 'EPYC&7642' Gold-6126 Gold-6140 Gold-6226 Gold-6230 Gold-6240 Silver-4114 'V4&E5-2640' 'V4&E5-2698' ; do
	   C=$(echo $c | sed 's|&|-|g')
	   DDIR="${C}-${n}"
	   mkdir -p "$DDIR"
	   perl -p -e "s|NUM_NODES|$n|g" namd_template.sh | perl -p -e "s|WD|$DDIR|g" | perl -p -e "s|CONSTRAINT|$C|g" > $DDIR/namd_$DDIR.sh
	   sbatch -p all -C "$c" $DDIR/namd_$DDIR.sh
    done
done
exit
#
# full set of cpu-types: 'EPYC&7402' 'EPYC&7642' Gold-6126 Gold-6140 Gold-6226 Gold-6230 Gold-6240 Silver-4114 'V4&E5-2640' 'V4&E5-2698'
# only few nodes for Gold-6226 Gold-6230 'EPYC&7642'
# reduced set of cpus: 'EPYC&7402' Gold-6126 Gold-6140 Gold-6240 Silver-4114 'V4&E5-2640' 'V4&E5-2698'
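
Which feature strings (constraints) are actually defined on the nodes can be checked beforehand; a sketch using standard sinfo format options (adjust the partition name as needed):

sinfo -p all -N --noheader --format="%N %f" | sort -u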

The template:

namd_template.sh
#!/bin/bash
#SBATCH --nodes=NUM_NODES
#SBATCH --time=01:00:00
#SBATCH --job-name=namd-WD
#SBATCH --output=WD.out
#SBATCH --chdir=WD
unset LD_PRELOAD

# get the sample input
tar xf /beegfs/desy/user/schluenz/namd/apoa1.tar.gz --strip-components=1

# clean modules
source /etc/profile.d/modules.sh
module purge

# create nodelist
NODELIST=nodelist.$SLURM_JOBID
rm -f $NODELIST
for n in $(scontrol show hostnames $SLURM_NODELIST); do
  echo "host $n" >> $NODELIST
done

# ssh options used by charmrun as remote shell
SSH="ssh -o PubkeyAcceptedKeyTypes=+ssh-dss -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o LogLevel=ERROR"

# use the cpu version: pick the verbs-smp build if the working directory name contains "smp", plain verbs otherwise
if [[ $PWD =~ smp ]]; then
    export PATH=/software/namd/NAMD_2.14_Linux-x86_64-verbs-smp:$PATH
else
    export PATH=/software/namd/NAMD_2.14_Linux-x86_64-verbs:$PATH
fi

# total number of hardware threads across all nodes (nproc also counts hyperthreads)
np=$(($(nproc) * NUM_NODES ))

# run namd with increasing thread counts: 8, 16, 32, all physical cores, all hw threads
for P in 8 16 32 $(( $np / 2 )) $np ; do
    PPN=$(( $P / NUM_NODES ))
    charmrun ++p $P ++ppn $PPN ++nodelist $NODELIST ++remote-shell "$SSH" $(which namd2) apoa1.namd > namd.$P.out

    speed=$(grep WallClock namd.$P.out)

    echo "Nodes: NUM_NODES  Procs: $P   Constraint: CONSTRAINT $speed"
    
done
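
The grep above captures the complete NAMD timing line (WallClock: <seconds> CPUTime: ... Memory: ...). If only the wallclock seconds are of interest, e.g. for tabulating results, an awk variant along these lines can be used instead (a sketch, assuming the standard NAMD output format):

speed=$(awk '/WallClock/ {print $2}' namd.$P.out)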

One of the resulting batch scripts:

batch script
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --time=01:00:00
#SBATCH --job-name=namd-Gold-6240-1
#SBATCH --output=Gold-6240-1.out
#SBATCH --chdir=Gold-6240-1
unset LD_PRELOAD

# get the sample input
tar xf /beegfs/desy/user/schluenz/namd/apoa1.tar.gz --strip-components=1

# avoid module
source /etc/profile.d/modules.sh
module purge

# create nodelist
NODELIST=nodelist.$SLURM_JOBID
rm -f $NODELIST
for n in $(scontrol show hostnames $SLURM_NODELIST); do
  echo "host $n" >> $NODELIST
done
 
np=$(($(nproc) * 1 ))

# run namd
for P in 8 16 32 $(( $np / 2 )) $np ; do
    SSH="ssh -o PubkeyAcceptedKeyTypes=+ssh-dss -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o LogLevel=ERROR"
    export PATH=/software/namd/NAMD_2.14_Linux-x86_64-verbs:$PATH
    charmrun ++p $P ++nodelist $NODELIST ++remote-shell "$SSH" $(which namd2) apoa1.namd > namd.$P.out

    speed=$(grep WallClock namd.$P.out)

    echo "Nodes: 1  Procs: $P   Constraint: Gold-6240 $speed"
    
done
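
Since every run echoes a summary line starting with "Nodes:", the results of all CPU jobs can be collected afterwards from the job output files, for example like this (a sketch; adjust the glob to wherever the *.out files end up relative to the submission directory):

grep -h "^Nodes:" */*.out | sort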


GPU Job Templates

Job templates for GPUs are very similar:

benchmark-gpu.sh
#!/bin/bash
unset LD_PRELOAD
# only run on single node
for n in 1 ; do 
    for c in 'Gold-5115&V100&GPUx2' 'Gold-6234&V100&GPUx2' 'Gold-6248&V100&GPUx4' 'Silver-4114&P100&GPUx1' 'Silver-4114&V100&GPUx1' 'Silver-4210&V100&GPUx1' 'Silver-4216&V100&GPUx1' 'Silver-4216&V100&GPUx2' 'V4&E5-2640&K40X&GPUx2' 'V4&E5-2640&P100&GPUx1' 'V4&E5-2640&P100&GPUx2' 'V4&E5-2640&P100&GPUx4' 'V4&E5-2640&V100&GPUx1' 'V4&E5-2695&K40X&GPUx1' 'V4&E5-2698&P100&GPUx2' ; do
	   C=$(echo $c | sed 's|&|-|g')
	   DDIR="${C}-${n}"
	   mkdir -p "$DDIR"
	   perl -p -e "s|NUM_NODES|$n|g" namd_template_gpu.sh | perl -p -e "s|WD|$DDIR|g" | perl -p -e "s|CONSTRAINT|$C|g" > $DDIR/namd_$DDIR.sh
	   sbatch -p allgpu -C "$c" $DDIR/namd_$DDIR.sh
    done
done
exit
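
As for the CPU case, the GPU-related features can be listed with sinfo beforehand; %G additionally shows the generic resources (GPUs) of each node:

sinfo -p allgpu -N --noheader --format="%N %f %G" | sort -u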

The GPU job template:

namd_template_gpu.sh
#!/bin/bash
#SBATCH --nodes=NUM_NODES
#SBATCH --time=01:00:00
#SBATCH --job-name=namd-WD
#SBATCH --output=WD.out
#SBATCH --chdir=WD
unset LD_PRELOAD

# get the sample input
tar xf /beegfs/desy/user/schluenz/namd/apoa1.tar.gz --strip-components=1

# clean modules
source /etc/profile.d/modules.sh
module purge

# use cuda version
export PATH=/software/namd/NAMD_2.14_Linux-x86_64-multicore-CUDA:$PATH

num_gpus=$(nvidia-smi -L | wc -l)

np=$(($(nproc) * NUM_NODES ))

# run namd

for p in 1 2 4 6 8; do
    P=$(( $p * $num_gpus ))
    charmrun +p $P ++local $(which namd2) apoa1.namd > namd.$P.out
    
    speed=$(grep WallClock namd.$P.out)

    echo "GPUs: $num_gpus  Procs: $P   Constraint: CONSTRAINT $speed"
    
done
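
Note that the multicore-CUDA build is a single-process shared-memory binary, so charmrun ++local is not strictly required; namd2 can also be started directly, and a subset of GPUs can be selected via +devices. A sketch (the device list 0,1 is just an example):

namd2 +p${P} +devices 0,1 apoa1.namd > namd.$P.out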


One of the resulting batch scripts:

batch script
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --time=01:00:00
#SBATCH --job-name=namd-Silver-4114-P100-GPUx1-1
#SBATCH --output=Silver-4114-P100-GPUx1-1.out
#SBATCH --chdir=Silver-4114-P100-GPUx1-1
unset LD_PRELOAD

# get the sample input
tar xf /beegfs/desy/user/schluenz/namd/apoa1.tar.gz --strip-components=1

# clean modules
source /etc/profile.d/modules.sh
module purge

# use cuda version
export PATH=/software/namd/NAMD_2.14_Linux-x86_64-multicore-CUDA:$PATH

num_gpus=$(nvidia-smi -L | wc -l)

np=$(($(nproc) * 1 ))

# run namd

for p in 1 2 4 6 8; do
    P=$(( $p * $num_gpus ))
    charmrun +p $P ++local $(which namd2) apoa1.namd > namd.$P.out
    
    speed=$(grep WallClock namd.$P.out)

    echo "GPUs: $num_gpus  Procs: $P   Constraint: Silver-4114-P100-GPUx1 $speed"
    
done


Results for CPU nodes

Wallclock (s) for # of threads (++p) per node; ncore = all physical cores, ncore+ht = all cores incl. hyperthreading.
verbs-smp runs use ++ppn = #threads/#nodes. Missing entries are marked "-".

CPU          #cores (ht)  #nodes      8     16     32  ncore  ncore+ht  version
EPYC-7402    48 (96)         1     80.0   41.9   28.9   14.5      37.7  verbs
                             1     79.5   41.6   22.8   11.4      17.8  verbs-smp
                             2     42.0   22.7   14.8    8.6         -  verbs
                             2        -      -      -   15.9       7.1  verbs-smp
                             4        -      -      -   14.6         -  verbs
                             4        -      -      -   27.3       7.1  verbs-smp
                             8        -      -      -   12.5         -  verbs
                             8        -      -      -   23.9      19.0  verbs-smp
EPYC-7642    96 (192)        1     81.4   42.6   23.1   16.6      10.9  verbs
Gold-6126    24 (48)         1     74.6   39.3   31.4   25.2      17.9  verbs
                             2     74.2   38.2   20.8   12.0      20.9  verbs
                             2        -      -      -   24.6      13.9  verbs-smp
                             4        -      -      -    9.4         -  verbs
                             4        -      -      -   16.7       7.7  verbs-smp
                             8        -      -      -    8.4         -  verbs
                             8        -      -      -   15.0       7.1  verbs-smp
Gold-6140    36 (72)         1     79.6   41.5   23.0   18.1      25.4  verbs
                             2     41.2   22.0      -   17.2      13.4  verbs
                             2        -      -      -   18.7       9.1  verbs-smp
                             4        -      -      -   14.4      11.6  verbs
                             4        -      -      -   15.5       6.8  verbs-smp
                             8        -      -      -   13.1      14.8  verbs
                             8        -      -      -   13.0       4.3  verbs-smp
Gold-6226    24 (48)         1     71.2   37.5   30.5   24.0      16.0  verbs
Gold-6230    -               1        -      -      -      -         -  -
Gold-6240    36 (72)         1     75.0   39.1   21.9   14.6      23.2  verbs
                             2        -      -      -   16.9      12.2  verbs
                             2        -      -      -   17.8       7.7  verbs-smp
                             4        -      -      -   15.4      30.0  verbs
                             4        -      -      -   15.5      16.0  verbs-smp
                             8        -      -      -   13.2      17.1  verbs
                             8        -      -      -   12.5       5.0  verbs-smp
Silver-4114  40 (80)         1     90.5   47.5   39.1   43.3      27.0  verbs
                             2     47.6   25.8      -   16.0      25.2  verbs
                             2        -      -      -   19.3      22.4  verbs-smp
                             4        -      -      -   16.9      11.3  verbs
                             4        -      -      -   21.1       9.8  verbs-smp
                             8        -      -      -      -         -  verbs
                             8        -      -      -   17.8       7.6  verbs-smp
V4-E5-2640   20 (40)         1     82.5   44.8   34.1   31.5      21.9  verbs
                             2        -      -      -   13.0      20.5  verbs
                             2        -      -      -   16.4      19.8  verbs-smp
                             4        -      -      -   15.9       8.7  verbs
                             4        -      -      -   18.9       7.3  verbs-smp
                             8        -      -      -   13.8       6.8  verbs
                             8        -      -      -   15.6       6.2  verbs-smp
V4-E5-2698   40 (80)         1     82.4   43.5   24.4   17.6      21.7  verbs
                             2        -      -      -   19.6      10.1  verbs
                             2        -      -      -   18.9       8.4  verbs-smp
                             4        -      -      -   14.3       9.1  verbs
                             4        -      -      -   15.2       6.6  verbs-smp
                             8        -      -      -   12.7       8.6  verbs
                             8        -      -      -   13.0       6.7  verbs-smp

Best single-node runs: EPYC-7402 with 48 threads, or EPYC-7642 with 192 threads.

Best 2-node runs: EPYC-7402 with 2*96 threads (incl. HT, verbs-smp).

Fastest time overall: Gold-6140 and Gold-6240 with all cores (incl. HT) on 8 nodes.

Results for GPU nodes

Wallclock (s) for # of threads (++p) per GPU; missing entries are marked "-".

CPU          GPU   #GPUs      1     2     4     6     8
Gold-5115    V100    2     30.9  13.4  20.3  11.8  19.6
Gold-6234    V100    2     27.9  11.9  19.4  10.9  18.7
Gold-6248    V100    4        -     -     -     -     -
Silver-4114  P100    1     36.4  16.4  14.1  13.4  21.8
Silver-4114  V100    1     37.4  17.5  15.2  14.5  22.6
Silver-4210  V100    1     34.6  15.3  13.1  12.4  20.0
Silver-4216  V100    1     34.2  15.1  13.0  12.2  19.9
Silver-4216  V100    2     30.7  13.4  20.3  11.8  19.6
V4-E5-2640   K40X    2     31.7  15.2  23.4  15.3  24.0
V4-E5-2640   P100    1     35.3  15.9  13.8  13.4  21.5
V4-E5-2640   P100    2     32.1  14.9  22.1  13.1  21.5
V4-E5-2640   P100    4        -     -     -     -     -
V4-E5-2640   V100    1        -     -     -     -     -
V4-E5-2695   K40X    1     36.4  20.5  18.8  19.6  28.4
V4-E5-2698   P100    2        -     -     -     -     -