Running tests on AMD EPYC 7402
Runtime (s) for MPI parameter sets 1 to 6 (defined in the second table of this section):

CPU | IB HCA | Cores | Nodes | 1 | 2 | 3 | 4 | 5 | 6 | PML | ZSTOP | repetitions |
---|---|---|---|---|---|---|---|---|---|---|---|---|
EPYC 7402 | ConnectX-4 | 48/96/48 | 1 | 696 | 563 | 677 | 596 | 657 | 556 | ob1 | 1.0 | 3 |
EPYC 7402 | ConnectX-4 | 48/96/48 | 2 | 525 | 459 | 503 | 475 | 516 | 459 | ob1 | 1.0 | 3 |
EPYC 7402 | ConnectX-4 | 48/96/48 | 4 | 478 | 431 | 486 | 403 | 459 | 405 | ob1 | 1.0 | 3 |
EPYC 7402 | ConnectX-4 | 48/96/48 | 1 | 3834 | 4175 | 3750 | 3425 | 4260 | 3829 | ob1 | 7.0 | 1 |
EPYC 7402 | ConnectX-4 | 48/96/48 | 2 | 3096 | 3005 | 2969 | 2797 | 2886 | 2822 | ob1 | 7.0 | 1 |
EPYC 7402 | ConnectX-4 | 48/96/48 | 4 | 2664 | 2536 | 2633 | 2489 | 2699 | 2529 | ob1 | 7.0 | 1 |
# | mpi parameter |
---|---|
1 | -npernode 48 --mca btl self,openib --mca pml $PML --mca btl_openib_device_param_files /beegfs/desy/user/schluenz/ASTRA.bench/mca-btl-openib-device-params.ini |
2 | -npernode 48 --mca pml $PML --mca btl_openib_device_param_files /beegfs/desy/user/schluenz/ASTRA.bench/mca-btl-openib-device-params.ini |
3 | -npernode 48 --mca btl self,openib --mca btl_openib_device_param_files /beegfs/desy/user/schluenz/ASTRA.bench/mca-btl-openib-device-params.ini |
4 | -npernode 48 --mca btl_openib_device_param_files /beegfs/desy/user/schluenz/ASTRA.bench/mca-btl-openib-device-params.ini |
5 | -npernode 48 --mca btl self,openib --mca pml $PML |
6 | -npernode 48 |
Remarks:
- Cores: physical/physical+logical/cores used
- Multi-node jobs will NOT work on modern hardware (ConnectX-6): the openmpi version used is simply too old.
- There is hardly any point in using multiple nodes; the code does not scale well enough.
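
The multi-node remark can be made quantitative with a small sketch. It takes the ZSTOP=1.0 runtimes of parameter set 6 from the table above and computes speedup and parallel efficiency relative to one node. The function name `scaling` is an illustrative choice, not part of the benchmark scripts.

```python
# Runtimes (s) for parameter set 6, ZSTOP=1.0, copied from the table above.
runtime_by_nodes = {1: 556, 2: 459, 4: 405}


def scaling(runtimes):
    """Return {nodes: (speedup, efficiency)} relative to the smallest node count."""
    base_nodes = min(runtimes)
    base_time = runtimes[base_nodes]
    result = {}
    for nodes, seconds in sorted(runtimes.items()):
        speedup = base_time / seconds
        # Efficiency: achieved speedup divided by the ideal (linear) speedup.
        efficiency = speedup / (nodes / base_nodes)
        result[nodes] = (speedup, efficiency)
    return result


for nodes, (speedup, efficiency) in scaling(runtime_by_nodes).items():
    print(f"{nodes} node(s): speedup {speedup:.2f}, efficiency {efficiency:.0%}")
```

For these numbers, four nodes give a speedup of roughly 1.4 over one node, which is consistent with the remark that adding nodes pays off poorly.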
Running tests on different hardware
CPU | IB HCA | Cores | Nodes | 4 | 6 | PML | ZSTOP | #runs |
---|---|---|---|---|---|---|---|---|
EPYC 7402 | ConnectX-6 | 48/96/48 | 1 | 588 | 604 | ucx | 1.0 | 3 |
EPYC 7642 | ConnectX-6 | 96/192/96 | 1 | 599 | 603 | ucx | 1.0 | 3 |
EPYC 7542 | ConnectX-6 | 64/128/64 | 1 | 560 | 545 | ucx | 1.0 | 3 |
EPYC 7F52 | ConnectX-6 | 32/64/32 | 1 | 564 | 571 | ucx | 1.0 | 3 |
EPYC 7H12 | ConnectX-6 | 128/256/128 | 1 | 568 | 636 | ucx | 1.0 | 3 |
Gold 6140 | ConnectX-4 | 36/72/36 | 1 | 845 | 850 | ucx | 1.0 | 3 |
Gold 6240 | ConnectX-6 | 36/72/36 | 1 | 788 | 802 | ucx | 1.0 | 3 |
# | mpi parameter |
---|---|
4 | -npernode $(( $(nproc)/2)) --mca btl_openib_device_param_files /beegfs/desy/user/schluenz/ASTRA.bench/mca-btl-openib-device-params.ini |
6 | -npernode $(( $(nproc)/2)) |
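
The expression `$(( $(nproc)/2 ))` in the parameters above halves the number of logical CPUs, which yields one MPI rank per physical core when two hardware threads per core are enabled. A minimal Python sketch of the same arithmetic follows. The helper name `ranks_per_node` and the default of two threads per core are assumptions for illustration, and `os.cpu_count()` ignores CPU affinity restrictions that `nproc` would honour.

```python
import os


def ranks_per_node(logical_cpus, threads_per_core=2):
    """One MPI rank per physical core, assuming `threads_per_core` threads per core."""
    return max(1, logical_cpus // threads_per_core)


# os.cpu_count() reports logical CPUs; it may return None on unusual platforms.
logical = os.cpu_count() or 1
mpi_args = ["-npernode", str(ranks_per_node(logical))]
print(mpi_args)
```

On a node with 96 logical CPUs this gives `-npernode 48`, matching the EPYC 7402 row.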
Remarks:
- Cores: physical/physical+logical/cores used
- Using a modified device parameter file seems to help a little in most cases, but is not essential.
- Performance differences across EPYC CPUs are marginal; the Intel Gold CPUs are considerably slower.
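
To put a number on "helps a little", the sketch below compares parameter set 4 (with the device parameter file) against set 6 (without) for each CPU in the table above. The function name `relative_gain` is illustrative. With only three runs per entry, differences of a percent or so may well be within run-to-run noise.

```python
# (set 4 with device parameter file, set 6 without), runtimes in s, ZSTOP=1.0.
runtimes = {
    "EPYC 7402": (588, 604),
    "EPYC 7642": (599, 603),
    "EPYC 7542": (560, 545),
    "EPYC 7F52": (564, 571),
    "EPYC 7H12": (568, 636),
    "Gold 6140": (845, 850),
    "Gold 6240": (788, 802),
}


def relative_gain(with_file, without_file):
    """Fraction of runtime saved by the parameter file (negative means slower)."""
    return (without_file - with_file) / without_file


gains = {cpu: relative_gain(*pair) for cpu, pair in runtimes.items()}
for cpu, gain in gains.items():
    print(f"{cpu}: {gain:+.1%}")

faster = sum(1 for gain in gains.values() if gain > 0)
print(f"parameter file faster on {faster} of {len(gains)} CPUs")
```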
Running tests vs number of cores
Runtime (s) vs. number of cores used. Columns 12 to 256 and the first #runs column refer to ZSTOP=1.0; the last runtime column and the second #runs column refer to ZSTOP=7.0 with #cores = #physical cores.

CPU | IB HCA | Cores | Nodes | 12 | 16 | 24 | 32 | 48 | 64 | 96 | 128 | 256 | #runs (ZSTOP=1.0) | runtime (s), ZSTOP=7.0, #cores=#physical | #runs (ZSTOP=7.0) | PML |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
EPYC 7F52 | ConnectX-6 | 32/64 | 1 | 1388 | 1027 | 746 | 550 | 638 | 576 | | | | 3 | 3654 | 1 | ob1 |
EPYC 7402 | ConnectX-4 | 48/96 | 1 | 1617 | 1322 | 928 | 744 | 540 | 673 | 658 | | | 3 | 3919 | 1 | ob1 |
EPYC 7542 | ConnectX-6 | 64/128 | 1 | 1637 | 1257 | 895 | 729 | 559 | 564 | 556 | 668 | | 3 | 3419 | 1 | ob1 |
EPYC 7642 | ConnectX-6 | 96/192 | 1 | 944 | 827 | 612 | 586 | 1 | 1 | ob1 | ||||||
EPYC 7H12 | ConnectX-6 | 128/256 | 1 | 1747 | 1386 | 930 | 741 | 511 | 505 | 499 | 594 | - | 3 | 3424 | 1 | ob1 |
Remarks:
- Cores: physical/physical+logical cores
- Numbers for EPYC 7642 are not very reliable due to a lack of resources.
- Numbers for ZSTOP=7.0 are also not very reliable: multiple runs take too long to survive in the all partition.
- Choosing the number of physical cores is always the best choice.
- Performance differences are marginal.
- The old openmpi version is not capable of using 256 cores on a single node
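
As a worked example for one row, the sketch below takes the EPYC 7402 runtimes (48 physical, 96 logical cores, ZSTOP=1.0) from the table above and reports the speedup relative to 12 cores, plus the core count with the lowest runtime. The function name `speedups` is illustrative.

```python
# EPYC 7402 (48 physical / 96 logical cores): {cores used: runtime in s}.
runtime_by_cores = {12: 1617, 16: 1322, 24: 928, 32: 744, 48: 540, 64: 673, 96: 658}


def speedups(runtimes):
    """Speedup of each core count relative to the smallest core count measured."""
    base_time = runtimes[min(runtimes)]
    return {cores: base_time / seconds for cores, seconds in sorted(runtimes.items())}


speedup_by_cores = speedups(runtime_by_cores)
for cores, speedup in speedup_by_cores.items():
    print(f"{cores:3d} cores: {runtime_by_cores[cores]:5d} s, speedup {speedup:.2f}")

# Core count with the lowest measured runtime for this CPU.
best = min(runtime_by_cores, key=runtime_by_cores.get)
print(f"fastest at {best} cores")
```

For this CPU the minimum falls at 48 cores, and going beyond the physical core count (64 or 96) is slower again.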
Concurrent processes on EPYC 7H12, 7542
CPU | IB HCA | Cores | #concurrent jobs | runtime (s) | runtime (s) / #jobs | mpi pars | ZSTOP |
---|---|---|---|---|---|---|---|
EPYC 7H12 | |||||||
EPYC 7H12 | ConnectX-6 | 128/256/32 | 1 | 741 | 741 | 6 | 1.0 |
EPYC 7H12 | ConnectX-6 | 128/256/32 | 2 | 733 | 367 | 6 | 1.0 |
EPYC 7H12 | ConnectX-6 | 128/256/32 | 4 | 1100 | 275 | 6 | 1.0 |
EPYC 7H12 | ConnectX-6 | 128/256/32 | 8 | 1460 | 182 | 6 | 1.0 |
EPYC 7H12 | ConnectX-6 | 128/256/64 | 1 | 505 | 505 | 6 | 1.0 |
EPYC 7H12 | ConnectX-6 | 128/256/64 | 2 | 765 | 383 | 6 | 1.0 |
EPYC 7H12 | ConnectX-6 | 128/256/64 | 4 | 1120 | 280 | 6 | 1.0 |
EPYC 7H12 | ConnectX-6 | 128/256/16 | 8 | 1840 | 230 | 6 | 1.0 |
EPYC 7H12 | ConnectX-6 | 128/256/16 | 16 | 5200-6400 | 325-400 | 6 | 1.0 |
EPYC 7542 | |||||||
EPYC 7542 | ConnectX-6 | 64/128/16 | 1 | 1257 | 1257 | 6 | 1.0 |
EPYC 7542 | ConnectX-6 | 64/128/16 | 2 | 1270 | 635 | 6 | 1.0 |
EPYC 7542 | ConnectX-6 | 64/128/16 | 3 | 1364 | 455 | 6 | 1.0 |
EPYC 7542 | ConnectX-6 | 64/128/16 | 4 | 1464 | 366 | 6 | 1.0 |
EPYC 7542 | ConnectX-6 | 64/128/16 | 8 | 2040 | 255 | 6 | 1.0 |
EPYC 7542 | ConnectX-6 | 64/128/32 | 1 | 729 | 729 | 6 | 1.0 |
EPYC 7542 | ConnectX-6 | 64/128/32 | 2 | 920 | 460 | 6 | 1.0 |
EPYC 7542 | ConnectX-6 | 64/128/32 | 3 | 1020 | 340 | 6 | 1.0 |
EPYC 7542 | ConnectX-6 | 64/128/32 | 4 | 1086 | 272 | 6 | 1.0 |
Remarks:
- Cores: physical/physical+logical cores/cores used
- It is very inefficient to use 128 cores spread over 4 processes. The openmpi version is too old to properly identify sockets, so threads are presumably quickly competing ...
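
The "runtime (s) / #jobs" column is the wall-clock runtime divided by the number of concurrent jobs, i.e. the effective time per completed job. A short sketch of that calculation for the EPYC 7H12 rows with 32 cores per job follows. The names `per_job` and `CORES_PER_JOB` are illustrative.

```python
# EPYC 7H12, 32 cores per job: {concurrent jobs: wall-clock runtime in s}.
runtime_by_jobs = {1: 741, 2: 733, 4: 1100, 8: 1460}
CORES_PER_JOB = 32


def per_job(runtimes):
    """Effective runtime per job: wall-clock time divided by concurrent jobs."""
    return {jobs: seconds / jobs for jobs, seconds in runtimes.items()}


for jobs, seconds in per_job(runtime_by_jobs).items():
    total_cores = jobs * CORES_PER_JOB
    print(f"{jobs} job(s), {total_cores} cores in total: {seconds:.1f} s per job")
```

The table rounds these values to whole seconds.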