Maxwell : Astra benchmarks - ompi 1.6

Running tests on AMD EPYC 7402

runtime (s) per MPI parameter set (columns 1-6; see the list below)

CPU       | IB HCA     | Cores    | Nodes | 1    | 2    | 3    | 4    | 5    | 6    | PML | ZSTOP | repetitions
EPYC 7402 | ConnectX-4 | 48/96/48 | 1     | 696  | 563  | 677  | 596  | 657  | 556  | ob1 | 1.0   | 3
EPYC 7402 | ConnectX-4 | 48/96/48 | 2     | 525  | 459  | 503  | 475  | 516  | 459  | ob1 | 1.0   | 3
EPYC 7402 | ConnectX-4 | 48/96/48 | 4     | 478  | 431  | 486  | 403  | 459  | 405  | ob1 | 1.0   | 3
EPYC 7402 | ConnectX-4 | 48/96/48 | 1     | 3834 | 4175 | 3750 | 3425 | 4260 | 3829 | ob1 | 7.0   | 1
EPYC 7402 | ConnectX-4 | 48/96/48 | 2     | 3096 | 3005 | 2969 | 2797 | 2886 | 2822 | ob1 | 7.0   | 1
EPYC 7402 | ConnectX-4 | 48/96/48 | 4     | 2664 | 2536 | 2633 | 2489 | 2699 | 2529 | ob1 | 7.0   | 1
# | mpi parameter
1 | -npernode 48 --mca btl self,openib --mca pml $PML --mca btl_openib_device_param_files /beegfs/desy/user/schluenz/ASTRA.bench/mca-btl-openib-device-params.ini
2 | -npernode 48 --mca pml $PML --mca btl_openib_device_param_files /beegfs/desy/user/schluenz/ASTRA.bench/mca-btl-openib-device-params.ini
3 | -npernode 48 --mca btl self,openib --mca btl_openib_device_param_files /beegfs/desy/user/schluenz/ASTRA.bench/mca-btl-openib-device-params.ini
4 | -npernode 48 --mca btl_openib_device_param_files /beegfs/desy/user/schluenz/ASTRA.bench/mca-btl-openib-device-params.ini
5 | -npernode 48 --mca btl self,openib --mca pml $PML
6 | -npernode 48
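As a worked example, parameter set 1 expands to a full command line along these lines; the PML is set via an environment variable, and the `./Astra run.in` binary and input file are placeholders, not taken from the measurements above:

```shell
# Assemble the mpirun call for parameter set 1; PML is ob1 (or ucx).
# "./Astra run.in" is a placeholder for the actual benchmark binary.
PML=ob1
ARGS="-npernode 48 --mca btl self,openib --mca pml $PML \
--mca btl_openib_device_param_files /beegfs/desy/user/schluenz/ASTRA.bench/mca-btl-openib-device-params.ini"
echo "mpirun $ARGS ./Astra run.in"
```

The other parameter sets differ only in which of the `--mca` options are dropped.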

Remarks:

  • Cores: physical/physical+logical/cores used
  • Multi-node jobs will NOT work on modern hardware (ConnectX-6); the OpenMPI version used is simply too old.
  • There is hardly a point in using multiple nodes anyway: the code does not scale well enough.
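A quick check of the scaling remark, using the best parameter set (6) at ZSTOP=1.0 from the table above:

```shell
# Speedup and parallel efficiency for parameter set 6 at ZSTOP=1.0:
# 556 s on 1 node, 459 s on 2 nodes, 405 s on 4 nodes.
awk 'BEGIN {
    t1 = 556; t2 = 459; t4 = 405
    printf "2 nodes: speedup %.2f, efficiency %.0f%%\n", t1 / t2, 100 * t1 / (2 * t2)
    printf "4 nodes: speedup %.2f, efficiency %.0f%%\n", t1 / t4, 100 * t1 / (4 * t4)
}'
```

Quadrupling the node count buys only a 1.37x speedup, i.e. roughly 34% parallel efficiency.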

Running tests on different hardware

CPU       | IB HCA     | Cores       | Nodes | 4   | 6   | PML | ZSTOP | #runs
EPYC 7402 | ConnectX-6 | 48/96/48    | 1     | 588 | 604 | ucx | 1.0   | 3
EPYC 7642 | ConnectX-6 | 96/192/96   | 1     | 599 | 603 | ucx | 1.0   | 3
EPYC 7542 | ConnectX-6 | 64/128/64   | 1     | 560 | 545 | ucx | 1.0   | 3
EPYC 7F52 | ConnectX-6 | 32/64/32    | 1     | 564 | 571 | ucx | 1.0   | 3
EPYC 7H12 | ConnectX-6 | 128/256/128 | 1     | 568 | 636 | ucx | 1.0   | 3
Gold 6140 | ConnectX-4 | 36/72/36    | 1     | 845 | 850 | ucx | 1.0   | 3
Gold 6240 | ConnectX-6 | 36/72/36    | 1     | 788 | 802 | ucx | 1.0   | 3
# | mpi parameter
4 | -npernode $(( $(nproc)/2 )) --mca btl_openib_device_param_files /beegfs/desy/user/schluenz/ASTRA.bench/mca-btl-openib-device-params.ini
6 | -npernode $(( $(nproc)/2 ))
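The `$(( $(nproc)/2 ))` in sets 4 and 6 relies on SMT being enabled: `nproc` then reports logical CPUs, and half of that is the physical core count. A minimal sketch:

```shell
# nproc counts logical CPUs; with 2-way SMT enabled, half of that
# equals the number of physical cores (the sweet spot per the tables).
LOGICAL=$(nproc)
PHYSICAL=$(( LOGICAL / 2 ))
echo "would start $PHYSICAL ranks per node (-npernode $PHYSICAL)"
```

Note that on a node with SMT disabled this would start ranks on only half the physical cores, so it is not a universal recipe.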

Remarks:

  • Cores: physical/physical+logical/cores used
  • Using a modified device parameter file seems to help a little in most cases, but is not essential.
  • Performance differences across the EPYC CPUs are marginal; the Intel Golds are considerably slower.

Running tests vs number of cores

runtime (s) vs number of cores used (ZSTOP=1.0), plus one run at ZSTOP=7.0 with #cores = #physical

CPU       | IB HCA     | Cores   | Nodes | 12   | 16   | 24  | 32  | 48  | 64  | 96  | 128 | 256 | #runs | ZSTOP=7.0 | #runs | PML
EPYC 7F52 | ConnectX-6 | 32/64   | 1     | 1388 | 1027 | 746 | 550 | 638 | 576 |     |     |     | 3     | 3654      | 1     | ob1
EPYC 7402 | ConnectX-4 | 48/96   | 1     | 1617 | 1322 | 928 | 744 | 540 | 673 | 658 |     |     | 3     | 3919      | 1     | ob1
EPYC 7542 | ConnectX-6 | 64/128  | 1     | 1637 | 1257 | 895 | 729 | 559 | 564 | 556 | 668 |     | 3     | 3419      | 1     | ob1
EPYC 7642 | ConnectX-6 | 96/192  | 1     |      |      |     | 944 | 827 | 612 | 586 |     |     | 1     |           | 1     | ob1
EPYC 7H12 | ConnectX-6 | 128/256 | 1     | 1747 | 1386 | 930 | 741 | 511 | 505 | 499 | 594 | -   | 3     | 3424      | 1     | ob1

Remarks:

  • Cores: physical/physical+logical cores
  • The numbers for the EPYC 7642 are not very reliable due to a lack of resources.
  • The numbers for ZSTOP=7.0 are also not very reliable; multiple runs take too long to survive in the all partition.
  • Choosing the number of physical cores is always the best choice.
  • Performance differences are marginal.
  • The old OpenMPI version is not capable of using 256 cores on a single node.

Concurrent processes on EPYC 7H12, 7542

CPU       | IB HCA     | Cores      | #concurrent jobs | runtime (s) | runtime (s) / #jobs | mpi pars | ZSTOP
EPYC 7H12:
EPYC 7H12 | ConnectX-6 | 128/256/32 | 1  | 741       | 741     | 6 | 1.0
EPYC 7H12 | ConnectX-6 | 128/256/32 | 2  | 733       | 367     | 6 | 1.0
EPYC 7H12 | ConnectX-6 | 128/256/32 | 4  | 1100      | 275     | 6 | 1.0
EPYC 7H12 | ConnectX-6 | 128/256/32 | 8  | 1460      | 182     | 6 | 1.0
EPYC 7H12 | ConnectX-6 | 128/256/64 | 1  | 505       | 505     | 6 | 1.0
EPYC 7H12 | ConnectX-6 | 128/256/64 | 2  | 765       | 383     | 6 | 1.0
EPYC 7H12 | ConnectX-6 | 128/256/64 | 4  | 1120      | 280     | 6 | 1.0
EPYC 7H12 | ConnectX-6 | 128/256/16 | 8  | 1840      | 230     | 6 | 1.0
EPYC 7H12 | ConnectX-6 | 128/256/16 | 16 | 5200-6400 | 325-400 | 6 | 1.0
EPYC 7542:
EPYC 7542 | ConnectX-6 | 64/128/16  | 1  | 1257      | 1257    | 6 | 1.0
EPYC 7542 | ConnectX-6 | 64/128/16  | 2  | 1270      | 635     | 6 | 1.0
EPYC 7542 | ConnectX-6 | 64/128/16  | 3  | 1364      | 455     | 6 | 1.0
EPYC 7542 | ConnectX-6 | 64/128/16  | 4  | 1464      | 366     | 6 | 1.0
EPYC 7542 | ConnectX-6 | 64/128/16  | 8  | 2040      | 255     | 6 | 1.0
EPYC 7542 | ConnectX-6 | 64/128/32  | 1  | 729       | 729     | 6 | 1.0
EPYC 7542 | ConnectX-6 | 64/128/32  | 2  | 920       | 460     | 6 | 1.0
EPYC 7542 | ConnectX-6 | 64/128/32  | 3  | 1020      | 340     | 6 | 1.0
EPYC 7542 | ConnectX-6 | 64/128/32  | 4  | 1086      | 272     | 6 | 1.0

Remarks:

  • Cores: physical/physical+logical cores/cores used
  • It is very inefficient to spread 128 cores over 4 concurrent jobs: this OpenMPI version is too old to properly identify sockets, so threads are presumably quickly competing ...
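The runtime (s) / #jobs column is simply the total runtime divided by the number of concurrent jobs; for the EPYC 7H12 rows with 32 cores per job:

```shell
# Per-job runtime for EPYC 7H12, 32 cores/job, from the table above:
# with 8 concurrent jobs the effective time per job drops to ~182 s.
awk 'BEGIN {
    n = split("1:741 2:733 4:1100 8:1460", rows, " ")
    for (i = 1; i <= n; i++) {
        split(rows[i], r, ":")   # r[1] = #jobs, r[2] = total runtime (s)
        printf "%s concurrent jobs: %.1f s per job\n", r[1], r[2] / r[1]
    }
}'
```

So even though each individual job slows down, aggregate throughput keeps improving up to 8 concurrent jobs on this CPU.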