Occasionally one needs to run a large number of fairly identical single-threaded jobs. Running each of these jobs as a separate slurm job is - in our environment - very inefficient, since each job uses only a single core and leaves all other cores of the node idle. Generally, BIRD would be much better suited, but it is not always an option due to data access requirements.
GNU parallel offers handy possibilities to parallelize single-threaded processes. The documentation (https://www.gnu.org/software/parallel/parallel_tutorial.html) provides a number of examples. It's an amazingly powerful tool, and the following "real life" examples only scratch the surface of its capabilities in a rather primitive way.
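As a minimal illustration of the basic idea (the tool name and the data path are placeholders, not part of the examples that follow), parallel can keep all cores of a node busy with a list of single-threaded runs:

# run one single-threaded process per core; {} is replaced by each input line
find /beegfs/desy/user/$USER/data -name '*.cbf' | parallel -j $(nproc) my_single_threaded_tool {}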
Citation for GNU parallel
O. Tange (2011): GNU Parallel - The Command-Line Power Tool, ;login: The USENIX Magazine, February 2011:42-47.
Location of scriptlets: /beegfs/desy/user/schluenz/Process_with_GNU_parallel
parallel listening on fifo
It's possible to let parallel listen to a named pipe (FIFO). One can thus use it as a kind of daemon that watches for incoming events and passes the input on to a scriptlet. The input can be a single word, but it can also contain an entire input file. A most basic file watcher can be composed of a slurm batch script and a file-handling script. The slurm batch script:
#!/bin/bash
#SBATCH --time=0-00:20:00
unset LD_PRELOAD
#
# output file
#
rm -f /beegfs/desy/user/$USER/processing/processed.out
touch /beegfs/desy/user/$USER/processing/processed.out
#
# setup fifo and let parallel listen to it
#
mkdir -p /tmp/$USER
FIFO=/tmp/$USER/pifo
rm -f $FIFO
mkfifo $FIFO
while true; do cat $FIFO; done | parallel -j 60 --ungroup --gnu --eof "end_of_file" /beegfs/desy/user/$USER/processing/process_files.sh "{}" &
#
# watch for files.
# any new file will be sent to parallel for processing
# as soon as a file named X_DONE_X appears the job will terminate (after all processing is done)
#
IMG_DIR=/beegfs/desy/user/$USER/processing/Images
nfile=0
while true; do
   #
   # send all files existing at startup to parallel
   # and all newly occurring files
   # find is cheap as it doesn't stat files
   #
   if [[ ! -f /dev/shm/img.lst ]]; then
      find $IMG_DIR -type f | sort > /dev/shm/img.lst
      to_proc=$(cat /dev/shm/img.lst)
   else
      mv /dev/shm/img.lst /dev/shm/img.old
      find $IMG_DIR -type f | sort > /dev/shm/img.lst
      to_proc=$(comm -13 /dev/shm/img.old /dev/shm/img.lst)
   fi
   for new in $to_proc ; do
      echo $new > $FIFO
      nfile=$(($nfile+1))
   done
   #
   # X_DONE_X signals end of processing, but wait for processes to finish
   #
   if [[ "$to_proc" =~ "X_DONE_X" ]]; then
      # if X_DONE_X is a file it shouldn't be counted!
      nfile=$(($nfile-1))
      # parallel catches sigterm
      test=999
      nprocessed=$(cat /beegfs/desy/user/$USER/processing/processed.out | wc -l)
      while [[ $test -gt 1 || $nfile -gt $nprocessed ]]; do
         test=$(ps -alef | grep process_files | grep -v grep | wc -l)
         nprocessed=$(cat /beegfs/desy/user/$USER/processing/processed.out | wc -l)
         sleep 1
      done
      echo "stopping this JOB at $(date)"
      scancel $SLURM_JOB_ID
   fi
   sleep 0.1
done
The file-processing script:
#!/bin/bash
input=$1
# the DONE signal arrives as a full path, so match with =~
if [[ $input =~ "X_DONE_X" ]]; then
   echo "not processing DONE signal"
   #
   # could add a termination scriptlet here
   #
else
   echo new file $input at $(date)
   sleep $((1 + $RANDOM % 100))
fi
echo $input >> /beegfs/desy/user/$USER/processing/processed.out
One of the downsides of this primitive setup: it's not possible to send an "EOF" through the fifo to terminate parallel. So one needs some kind of signal to terminate the batch job (or run scancel once it's done).
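In the scripts above that signal is simply a file whose name contains X_DONE_X; once all images have been produced, the shutdown can be triggered with e.g.:

# create the sentinel file that the watcher loop is looking for
touch /beegfs/desy/user/$USER/processing/Images/X_DONE_X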
Note: parallel won't start any processing before it has received as many lines of input as the number of parallel jobs (60 in the example above)! If that's too slow, feed parallel with dummy jobs beforehand, for example as sketched below.
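A possible way to prime parallel inside the batch script (this assumes the processing script is taught to ignore a DUMMY token, which the scripts shown here do not do yet):

# send 60 dummy lines so that parallel fills its job slots and starts processing immediately
for i in $(seq 60); do
   echo "DUMMY" > $FIFO
done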
Running dozor over cbf files
Pretty much the same setup can be used to process cbf images generated by a grid scan. To simulate that behavior, a small scriptlet copies 1000 images into an empty folder:
#!/bin/bash
for file in /beegfs/desy/user/$USER/TEST/_*.cbf ; do
   cp -v $file /beegfs/desy/user/$USER/processing/Images
done
touch /beegfs/desy/user/$USER/processing/Images/X_DONE_X_000001.cbf
During the tests the file copy was started after the batch job was already running, but it basically doesn't matter. The "filewatcher" script looks very similar to the one above:
#!/bin/bash
#SBATCH --time=0-00:20:00
unset LD_PRELOAD
source /etc/profile.d/modules.sh
module load xray
#
# output file
#
procdir=/beegfs/desy/user/$USER/processing/
procfile=/beegfs/desy/user/$USER/processing/dozor.out
rm -f $procfile
touch $procfile
#
# setup fifo
#
mkdir -p /tmp/$USER
FIFO=/tmp/$USER/pifo
rm -f $FIFO
mkfifo $FIFO
# let parallel listen to fifo
nparallel=$(nproc)
while true; do cat $FIFO; done | parallel -j $nparallel --ungroup --gnu --eof "end_of_file" /beegfs/desy/user/$USER/processing/process_dozor.sh "{}" &
#
# watch for files.
# any new file will be sent to parallel for processing
# as soon as a file named X_DONE_X appears the job will terminate (after all processing is done)
#
IMG_DIR=/beegfs/desy/user/$USER/processing/Images
#
# for a test, I started with an empty dir and copied one cbf file per ~0.1s
# run $nparallel dozor jobs at a time
# for simplicity, the parameters are not extracted from a master file but written into a
# template which is passed on to the dozor runs
#
mkdir -p $procdir/dozor
cat << EOF > $procdir/template.dat
nx 2463
ny 2527
pixel 0.172
detector_distance 200.0
X-ray_wavelength 1.03315
orgx 1259
orgy 1240
ix_min 0
ix_max 0
iy_min 0
iy_max 0
first_image_number _START_
number_images _NIMAGES_
name_template_image $IMG_DIR/TEST_?????.cbf
library /opt/xray/xdsapp3.0/plugins/xds-zcbf.so
EOF
nfile=0
while true; do
   #
   # send all files existing at startup to parallel
   # and all newly occurring files
   # find is cheap as it doesn't stat files
   #
   if [[ ! -f /dev/shm/img.lst ]]; then
      find $IMG_DIR -type f | sort > /dev/shm/img.lst
      to_proc=$(cat /dev/shm/img.lst)
   else
      mv /dev/shm/img.lst /dev/shm/img.old
      find $IMG_DIR -type f | sort > /dev/shm/img.lst
      to_proc=$(comm -13 /dev/shm/img.old /dev/shm/img.lst)
   fi
   for new in $to_proc ; do
      NIMAGES=1
      basename=$(echo $new | rev | cut -d/ -f1 | rev)
      START=$(echo $basename | rev | cut -d_ -f1 | rev | cut -d. -f1 | bc)
      echo "$basename $START $NIMAGES $(date)"
      perl -p -e "s|_START_|$START|" $procdir/template.dat | perl -p -e "s|_NIMAGES_|$NIMAGES|" > $procdir/dozor/$basename.dat
      echo "$procdir/dozor/$basename.dat $procfile" > $FIFO
      nfile=$(($nfile+1))
   done
   #
   # a file named *X_DONE_X* signals end of processing
   # doesn't need to be a file, one could also simply send it to the FIFO
   #
   if [[ "$to_proc" =~ "X_DONE_X" ]]; then
      running=9999
      nprocessed=$(cat $procfile | wc -l)
      #
      # parallel catches sigterm. So when the slurm job is cancelled, it will keep running processes,
      # but not start new ones, which gives an extra time of up to 300s to finish processing.
      # but don't cancel the job while processes are still running
      #
      while [[ $running -gt 1 || $nfile -gt $nprocessed ]]; do
         running=$(ps -alef | grep process_dozor | grep -v grep | wc -l)
         nprocessed=$(cat $procfile | wc -l)
         sleep 1
      done
      echo "stopping this JOB at $(date)"
      scancel $SLURM_JOB_ID
   fi
   sleep 0.1
done
The dozor scriptlet doing the actual processing:
#!/bin/bash
input=($1)
if [[ ${input[0]} =~ "X_DONE_X" ]]; then
   echo "not processing DONE signal"
   #
   # could add a termination scriptlet here
   #
else
   echo running /opt/xray/bin/dozor ${input[0]} at $(date)
   /opt/xray/bin/dozor ${input[0]} >> ${input[1]}
fi
exit
The test run was done with 1000 cbf images. The time to "generate" the images was about 90 seconds, and the time to process them was also about 90 seconds, so it's - for this particular scenario - essentially realtime. When all images were already present at job start, processing took about 30s.
Feeding the FIFO from remote
In principle it's possible to feed the FIFO through ssh or netcat from a remote host. A hypothetical ssh variant is sketched right below; the recipe that follows uses netcat.
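An ssh-based sketch (untested here; the compute node name is a placeholder) could push a single input line into the FIFO on the compute node like this:

# write one line into the FIFO that parallel is listening to on the compute node
echo "some input line" | ssh $USER@<compute-node> 'cat > /tmp/$USER/pifo'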
The following script submits the batch script and feeds the remote listener with the processing information:
#!/bin/bash
#
# submit the batch script
#
j=$(sbatch listener_v2.sh)
jobid=$(echo $j | awk '{print $4}')
#
# wait for the job to run
#
while [[ $(sacct -j $jobid -X --noheader -o state | grep RUNNING | wc -l) -eq 0 ]]; do
   sleep 10
done
node=$(sacct -j $jobid -X --noheader -o nodelist | awk '{print $1}')
#
# sometimes the listener on the compute node isn't immediately ready, so wait a second
#
sleep 1
echo start submission for job $jobid to node $node at $(date)
#
# run 1000 jobs
#
export output=/beegfs/desy/user/$USER/processing/grid_scan.out
for i in {00001..01000} ; do
   pars="nx=2463 ny=2527 pixel=0.172 detector_distance=200.0 X-ray_wavelength=1.03315 orgx=1259 orgy=1240 ix_min=0 ix_max=0 iy_min=0 iy_max=0 first_image_number=$i number_images=1 name_template_image=/beegfs/desy/user/$USER/processing/Images/TEST_?????.cbf library=/opt/xray/xdsapp3.0/plugins/xds-zcbf.so"
   # send all input parameters and the name of the output file to the compute node
   echo "$output $pars" | ncat $node.desy.de 12345
done
exit
The actual batch job (listener_v2.sh) launches GNU parallel and a netcat listener which feeds the input to GNU parallel:
#!/bin/bash
#SBATCH --time=0-00:30:00
#SBATCH --job-name=dozor-listener
#SBATCH --partition=all
unset LD_PRELOAD
rm -f out $output
#
# create FIFO
#
mkdir -p /tmp/$USER
rm -f /tmp/$USER/loopFF
mkfifo /tmp/$USER/loopFF
#
# launch gnu parallel listening to the fifo
#
while true; do cat /tmp/$USER/loopFF ; done | parallel -j $(nproc) --keep-order ./process_listener_v2.sh "{}" >> out &
#
# start netcat listener on port 12345. netcat will write all input to the fifo
#
ncat -k -t -l -p 12345 > /tmp/$USER/loopFF
The final script (process_listener_v2.sh) does the actual dozor run for each individual image:
#!/bin/bash
input=($1)
output=${input[0]}
#
# create a temporary file serving as input to dozor
#
tmp_dozor=$(mktemp)
for var in ${input[@]:1} ; do
   echo $var | tr '=' ' ' >> $tmp_dozor
done
#
# run dozor
#
/opt/xray/bin/dozor $tmp_dozor | grep -a '|' | grep '[0-9]' >> $output
# clean up the temporary input file
rm -f $tmp_dozor
exit
This setup took a little less than 30s to process the 1000 images. The batch job won't terminate on its own, but it would be easy enough to send a signal through the same channel to terminate it once all processes are done.
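One conceivable way (a sketch, not part of the scripts above) is to send a sentinel token through netcat after the submission loop and let process_listener_v2.sh cancel the job when it sees it:

# on the submit node, after the last image parameters have been sent:
echo "X_DONE_X" | ncat $node.desy.de 12345

# additional branch near the top of process_listener_v2.sh:
if [[ ${input[0]} == "X_DONE_X" ]]; then
   # SLURM_JOB_ID is inherited from the batch job environment
   scancel $SLURM_JOB_ID
   exit
fi

Note that this cancels the job immediately; to wait for still-running dozor processes one would need a wait loop like in the filewatcher example above.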
Using the same procedure, it would also be possible to receive the output from the batch-job through netcat and FIFOs on the batch submit node. Haven't tried it though ...
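An untested sketch of that idea (port 12346 and the submit host name are placeholders): start a collecting listener on the submit node and let process_listener_v2.sh send its results there instead of appending to a shared file:

# on the submit node: collect all incoming results in one file
ncat -k -l -p 12346 >> /beegfs/desy/user/$USER/processing/grid_scan.out &

# in process_listener_v2.sh, replace the redirection to $output with:
/opt/xray/bin/dozor $tmp_dozor | grep -a '|' | grep '[0-9]' | ncat <submit-host>.desy.de 12346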