Summary
Source: https://github.com/ochubar/SRW
License: Open Source https://github.com/ochubar/SRW/blob/master/COPYRIGHT.txt
Path: /software/oasys as part of the oasys installation
Documentation: https://wpg.readthedocs.io/en/latest/tutorials/2/Tutorial_case_2.html
Related: oasys
SRW (Synchrotron Radiation Workshop) is a physical optics computer code for calculating detailed characteristics of Synchrotron Radiation (SR) generated by relativistic electrons in magnetic fields of arbitrary configuration, and for simulating the propagation of the radiation wavefront through the optical systems of beamlines.
Using srw
srw is available as part of the oasys installation and is initialized with a custom oasys-mpi module:
```
[max]% module load maxwell oasys-mpi
[max]% mpiexec -np 16 --mca pml ucx python SRWLIB_Example12.py -m 100
```
To simplify batch jobs, a minimal batch script is available under /software/oasys/bin/srw.sh. Simple examples of running the script:
```
# This ensures that srw.sh is in your PATH.
# Only helpful for salloc; for sbatch one needs to specify the full path - or use $(which srw.sh)
[max]% module load maxwell oasys-mpi

# Run srw on the "local" Maxwell machine.
# On max-display and max-wgs the script sets the default number of cores to 4, but it can be overridden:
[max]% NP=2 srw.sh "SRWLIB_Example12.py -m 100"

# salloc can be used to run srw on batch nodes while still being kind of interactive, for example:
[max]% NP=48 salloc --partition=all --time=0-04:00:00 --constraint='EPYC&7402' --nodes=2 srw.sh "SRWLIB_Example12.py -m 100"
# NP=48: use 48 cores per node. EPYC 7402 nodes are equipped with 48 physical cores.
# --partition=all ...: options specified on the command line override the ones set in srw.sh (--partition=maxcpu --time=0-04:00)
# "SRWLIB_Example12.py -m 100": all parameters for srw have to be specified in a single string; the quotes are required.

# With sbatch all nodes are exclusive, so there is no need to specify the number of cores:
[max]% sbatch --partition=short --nodes=4 /software/oasys/bin/srw.sh "SRWLIB_Example12.py -m 10000"

# For partitions with mixed hardware it is helpful to ensure that all nodes used are identical, for example:
[max]% sbatch --partition=maxcpu,allcpu --nodes=4 --constraint="[Gold-6240|7402|E5-2640]" $(which srw.sh) "SRWLIB_Example12.py -m 10000"
```
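The wrapper itself is not reproduced here; the following is only a minimal sketch of what an srw.sh-style batch script could look like, assuming it is essentially an sbatch script that loads the oasys-mpi module and launches the mpiexec command shown above. The #SBATCH defaults, the NP fallback and the exact launch line are assumptions for illustration, not the actual content of /software/oasys/bin/srw.sh.

```bash
#!/bin/bash
# Illustrative sketch only -- the real script is /software/oasys/bin/srw.sh.
#SBATCH --partition=maxcpu
#SBATCH --time=0-04:00
#SBATCH --nodes=1

module load maxwell oasys-mpi

# NP = number of cores per node; fall back to all physical cores if unset
# (illustrative default, not the real srw.sh logic).
if [ -z "$NP" ]; then
    NP=$(lscpu -p=CORE | grep -v '^#' | sort -u | wc -l)
fi
NODES=${SLURM_JOB_NUM_NODES:-1}

# The SRW script and its arguments arrive as one quoted string, e.g. "SRWLIB_Example12.py -m 100";
# intentional word splitting of $1 separates the script name from its arguments.
mpiexec -np $((NP * NODES)) --mca pml ucx python $1
```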
Note: NP defines the number of cores per node used by srw.sh. The defaults are (see benchmarks below):
- For 1-4 nodes use all cores (physical+logical)
- For >4 nodes use only physical cores
- For interactive jobs use only 4 cores
The defaults can be overridden by setting NP=<your-favorite-number-of-cores>
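For example, to fix the per-node core count for a batch job (the value 32 is an arbitrary choice for illustration; this relies on sbatch's default behaviour of exporting the submission environment, so the NP variable reaches srw.sh):

```bash
# Override the per-node core count for a 8-node batch job (illustrative values).
[max]% NP=32 sbatch --partition=maxcpu --nodes=8 /software/oasys/bin/srw.sh "SRWLIB_Example12.py -m 10000"
```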
Benchmarking
```
# Submit jobs for 1, 2, 4, 8, 16 nodes.
# Give each output file a unique name.
# --dependency=singleton ensures that only one job is running at a time.
[max]% for nodes in 1 2 4 8 16 ; do
    sbatch --dependency=singleton --partition=all --output=srw.nodes-$nodes.%j.out \
           --nodes=$nodes --constraint=7402 /software/oasys/bin/srw.sh "SRWLIB_Example12.py -m 10000"
done

# Check the status of the jobs:
[max]% squeue -u $USER -n srw.sh

# Once all jobs are done, collect some information:
[max]% for out in srw*.out ; do
    x=($(echo $out | tr '\-.' ' '))
    echo "nodes: ${x[2]} time: $(sacct --noheader -X -j ${x[3]} --format=elapsed) jobid: ${x[3]}"
done
# sacct options:
#   --noheader        don't display a header
#   -X                don't show job steps, only the entire job
#   --format=elapsed  time used by the job

# Output:
nodes: 1 time: 00:26:44 jobid: 8499227
nodes: 2 time: 00:12:56 jobid: 8499228
nodes: 4 time: 00:06:51 jobid: 8499229
nodes: 8 time: 00:04:44 jobid: 8499230
```
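To judge the scaling behaviour, the elapsed times can be converted into speedup and parallel efficiency relative to the single-node run. A minimal sketch, using the EPYC 7402 times listed above (assumes bash 4+ and bc are available):

```bash
# Convert sacct elapsed times (HH:MM:SS) into speedup and efficiency relative to 1 node.
# The times are the EPYC 7402 results from above; adjust as needed.
declare -A elapsed=( [1]=00:26:44 [2]=00:12:56 [4]=00:06:51 [8]=00:04:44 )

# Convert HH:MM:SS to seconds (10# avoids octal interpretation of leading zeros).
to_sec() { IFS=: read -r h m s <<< "$1"; echo $((10#$h*3600 + 10#$m*60 + 10#$s)); }

t1=$(to_sec ${elapsed[1]})
for n in 1 2 4 8; do
    t=$(to_sec ${elapsed[$n]})
    printf "nodes: %2d  speedup: %5.2f  efficiency: %5.1f%%\n" \
        $n $(bc -l <<< "$t1/$t") $(bc -l <<< "100*$t1/($n*$t)")
done
```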
Benchmarking results
Elapsed times (MM:SS) as a function of the number of nodes:

| CPU type | Cores per node (used / available / physical) | 1 node | 2 nodes | 4 nodes | 8 nodes | 16 nodes |
|---|---|---|---|---|---|---|
| AMD EPYC 7402 | 48 / 96 / 48 | 26:44 | 12:56 | 06:51 | 04:44 | 02:34 |
| ... using hyperthreaded cores | 96 / 96 / 48 | 18:53 | 10:01 | 06:52 | 03:40 | 03:31 |
| AMD EPYC 7542 | 64 / 128 / 64 | 19:59 | 10:31 | 05:46 | 03:40 | 03:31 |
| ... using hyperthreaded cores | 128 / 128 / 64 | 14:30 | 08:05 | 05:11 | 04:33 | - |
| Intel Gold-6240 | 36 / 72 / 36 | 43:58 | 19:53 | 10:40 | 06:31 | 04:02 |
| ... using hyperthreaded cores | 72 / 72 / 36 | 33:46 | 17:14 | 10:56 | 06:50 | 05:10 |
| Intel E5-2640 | 20 / 40 / 20 | 83:24 | 41:14 | 20:59 | 10:04 | 05:38 |
| ... using hyperthreaded cores | 40 / 40 / 20 | 68:31 | 34:35 | 16:35 | * | 08:08 |
When running on more than 4 nodes, using hyperthreaded cores was almost always slower than running on physical cores only. The configuration marked with * tended to crash.