Computing : Prerequisites

NAF/BIRD is the central DESY facility for batch/high-throughput computing, using HTCondor as the job broker (migrated from the previous SGE setup in 2017; see Migration from SGE to HTCondor for hints if you are more used to the SGE world).

There is no need for any special setup; the environment will be prepared for you once you log in on a submit node. Before you do so, however, you need to:

  • get the registry resource 'BATCH' (speak to your local admin or UCO to get this done)
  • get the registry resource 'NAFGPU' if GPUs are what you are looking for (again, speak to your local admin to get this done)
  • NAF batch computing relies on Kerberos and AFS, so it is mandatory that both your Kerberos ticket and AFS token are valid; you can check this with the 'klist' and 'tokens' commands (see the sketch after this list)
  • read the basic rules for NAF computing in order to use the pool effectively and without blocking/annoying yourself or others
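
If you need to check or renew your credentials on the submit node, a minimal sketch looks like this ('kinit' and 'aklog' are the standard commands to renew the ticket and token, assuming the usual DESY Kerberos/AFS client setup):

    klist             # show your Kerberos ticket and its expiry time
    tokens            # show your AFS token and its expiry time
    kinit && aklog    # renew both if either one has expired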


Basic Rules for NAF batch usage

- Test first: Before running bigger chunks of jobs, please run a single test job or a handful of test jobs to catch typos, missing environments, etc. (see the sketch below).
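
A minimal sketch of such a test submission, assuming a hypothetical job script 'myjob.sh' in the current directory:

    # test.submit -- minimal submit file for a single test job
    executable = myjob.sh
    output     = test.out
    error      = test.err
    log        = test.log
    queue 1

    condor_submit test.submit
    condor_q                     # watch the single test job before scaling up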

- Make sure you understand the concept of shared filesystems: We have put some effort into providing the data sources typically used at DESY, such as AFS, DUST, CVMFS, PNFS and some workgroup servers. Hence your jobs should not need to ship large amounts of data around, because the data is already available on the worker nodes via NFS/AFS mounts and the like (see the sketch below).
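
As an illustration, a job script can read its input directly from the shared mounts instead of transferring files with the job. The path and program name below are hypothetical placeholders; adapt them to your group's DUST area:

    # inside your job script: read input straight from the shared DUST mount
    INPUT=/nfs/dust/<experiment>/user/<username>/data/input.root
    ./analyse --in "$INPUT" --out result.root    # './analyse' is a placeholder for your program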

- Try to start with an interactive test: use 'condor_submit -i' or 'condor_submit -i <submitfile>' to get an interactive session on a worker node (see the sketch below). Once you are connected, you can have a look at the data mount points or at your program startup. If you are in doubt, contact us before you decide to move around big chunks of data in the NAF.
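
A short sketch of such an interactive test; without a submit file you get a default interactive slot, with one the resource requests are taken from that file:

    condor_submit -i                   # request a default interactive slot
    condor_submit -i test.submit       # ...or take the resource requests from an existing submit file

    # once the shell on the worker node opens:
    df -h /nfs/dust /cvmfs /afs        # check that the expected mounts are visible
    ./myjob.sh                         # try your program startup by hand (hypothetical script)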

- Use job arrays to ease the work of the scheduler: HTCondor is much more forgiving when you submit bigger chunks of jobs using the 'queue <number>' statement in the submit file, instead of looping over your submit directory and spawning an individual 'condor_submit' process for every single job. Use $(PROCESS) to identify the sub-tasks of your array. If you want to submit more than 10,000 jobs at a time, use 'max_materialize = 1000' as an option in the submit file to keep the scheduler responsive for everybody (see the sketch below).
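
A sketch of such an array submission, reusing the hypothetical 'myjob.sh' from above; every task receives its index as an argument:

    # array.submit -- one cluster with 500 tasks instead of 500 condor_submit calls
    executable      = myjob.sh
    arguments       = $(PROCESS)               # task index 0..499
    output          = logs/job.$(PROCESS).out  # the logs/ directory must exist beforehand
    error           = logs/job.$(PROCESS).err
    log             = array.log
    max_materialize = 1000                     # only needed for very large arrays (>10,000 jobs)
    queue 500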

- Think about your job runtimes: The default runtime of a slot is 3 hours, but you can choose anything between a few minutes and seven days. 'Lite' jobs with the default runtime and less than 2 GByte of memory have a larger quota (when the cluster is not full) than jobs with larger resource requests. Please keep in mind that jobs running only a few seconds have a huge impact on the performance of the scheduler, and we are prepared to lock out users in case of problems.
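
A hedged sketch of how such requests can look in a submit file: 'request_memory' is standard HTCondor, while the runtime attribute shown here ('+RequestRuntime', in seconds) is assumed to be the NAF-specific convention, so please verify the attribute name in the current NAF documentation:

    # excerpt of a submit file asking for more than the defaults
    request_memory  = 4 GB        # more than the 2 GByte "lite" limit
    +RequestRuntime = 86400       # assumed NAF attribute: requested wall time in seconds (here 24 hours)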

- Be aware of the filesystem: AFS has a limit on the maximum number of files in a directory (64,000, depending on filename length) and a limit on the maximum file size (not bigger than your quota). Also note that HTCondor, like any other batch system, will happily create files for you but never directories; hence, if you rely on a directory structure for your output files, you need to create it yourself beforehand (see the example below).
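
For example, create the directory structure before submitting the array from the sketch above (the directory names are just examples):

    mkdir -p logs output          # directories your jobs will write into
    condor_submit array.submit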

- In case something goes wrong: If your jobs, or some of your jobs, do not run as you expected, please let us know and provide the job numbers and the submit hostname (!). We have a job turnaround of roughly 1 million jobs per 24 hours, so it is essential for us to get this information rather quickly to be able to reconstruct what was going on before the log files get rotated into backup, etc.
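
The following commands collect exactly that information before you contact us (the cluster ID below is a placeholder for your own job number):

    hostname -f                    # the submit hostname to include in your report
    condor_q -nobatch              # job IDs of jobs that are still idle or running
    condor_history <cluster-id>    # status of jobs that already finished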