Computing : Belle II Re-calibration on the NAF

General Re-calibration Patterns

How might Belle's re-calibration workflows look in the future? Or: how can we at the NAF integrate/consolidate them most efficiently with the other job cases?

Are the re-calibration tasks:

  • constant flow? I.e., changes in the number of jobs etc. on the order of days/weeks?
  • spike-like? I.e., waves of jobs with calm periods in between?
  • chaotic?

Access

How will re-calibration jobs probably be submitted in the long term?

In principle, a user could submit jobs like any other ordinary Condor job, or a job could enter through a CondorCE in a grid-like workflow (or a mix of both, or, in the longer term, just using tokens...).

Or, asked the other way round: which authentication/authorization mechanisms would be used?
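
As a rough illustration of the plain-Condor path, a direct submission via the HTCondor Python bindings could look like the following sketch (the executable and options are placeholder assumptions; a grid-like workflow would instead enter through a CondorCE with certificate- or token-based authentication):

    import htcondor

    # Minimal sketch: direct submission to the local schedd (placeholder values).
    sub = htcondor.Submit({
        "executable": "run_calibration.sh",   # hypothetical wrapper script
        "arguments":  "$(ProcId)",
        "output":     "calib.$(ProcId).out",
        "error":      "calib.$(ProcId).err",
        "log":        "calib.log",
    })

    schedd = htcondor.Schedd()
    result = schedd.submit(sub, count=10)     # queue 10 jobs of this kind
    print("submitted cluster", result.cluster())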

Re-calibration Job Requirements

What will re-calibration jobs probably look like? Presumably they will cover a range of different tasks.

Ideally, sufficiently accurate estimates of the job requirements (runtime, number of cores, memory) should be set during job submission. Otherwise the scheduler has to assume the worst case for the runtime, and scheduling may become inefficient (especially when draining/reserving slots for multi-core jobs).

The better a job is known a priori, the better the LRMS can reserve its resources and the faster the job can start; a minimal submit sketch follows the list below.

  • core count
  • max memory
  • maximum runtime
  • local disk space per job
    • are files staged from disk/tape onto the local job disk space?
    • are input files (stream) read over LAN and output written to local/remote?
  • minimum job start-up time? - more of a meta-parameter that correlates with the others
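
As a sketch of how such estimates could be passed to HTCondor at submission time (all values are placeholders, and +RequestRuntime in seconds is assumed here as the site-specific runtime attribute used on the NAF):

    import htcondor

    # Sketch: declare the resource requirements up front so the LRMS can plan its slots.
    sub = htcondor.Submit({
        "executable":      "recalibrate.sh",  # hypothetical job wrapper
        "request_cpus":    "4",               # core count
        "request_memory":  "8 GB",            # max memory
        "request_disk":    "20 GB",           # local scratch space per job
        "+RequestRuntime": "14400",           # max runtime in seconds (site-specific attribute)
    })

    htcondor.Schedd().submit(sub, count=1)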

Storage

What storage systems might be used for what in which way?

Probably files will be read (streamed? random access?) over the LAN from dCache or other storage, i.e. (see the sketch after this list):

  • which protocols will be used (probably correlates with authz)
  • which namespaces/mounts will be necessary (which probably reduces to the question whether protocols like NFS are assumed)
  • which storage systems are needed read/write and which might be fine as read-only?
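
As a rough illustration of the two access patterns in question, streaming over the LAN versus staging to local scratch first, assuming PyROOT and the xrdcp client are available (host, paths, and object name are placeholders):

    import subprocess
    import ROOT

    # Option 1: stream the file over the LAN via the XRootD protocol (no local copy).
    f = ROOT.TFile.Open("root://dcache-door.example.org//pnfs/belle/raw/run0001.root")  # placeholder URL
    tree = f.Get("tree")            # placeholder object name

    # Option 2: stage the file to the job's local scratch space first, then read it locally.
    subprocess.run(
        ["xrdcp", "root://dcache-door.example.org//pnfs/belle/raw/run0001.root", "./run0001.root"],
        check=True,
    )
    f_local = ROOT.TFile.Open("./run0001.root")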

A probably related question is how the re-calibration would be integrated with the data management.

I.e., are automated workflows planned in the mid/long term? E.g., a dataset is replicated with RUCIO to the dCache and a job is submitted automatically once the dataset transfer has completed.
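
A very rough sketch of such an automated trigger, assuming the Rucio Python client and the HTCondor bindings (rule ID, executable, and polling interval are placeholders; in practice this could also be driven by Rucio events/notifications instead of polling):

    import time
    import htcondor
    from rucio.client import Client

    rucio = Client()
    RULE_ID = "0123456789abcdef0123456789abcdef"   # placeholder replication rule ID

    # Poll the replication rule until the dataset has fully arrived on the target RSE.
    while True:
        rule = rucio.get_replication_rule(RULE_ID)
        if rule["state"] == "OK":                  # all replicas transferred
            break
        time.sleep(600)

    # Transfer complete: submit the re-calibration job automatically.
    sub = htcondor.Submit({"executable": "recalibrate.sh"})   # placeholder job
    htcondor.Schedd().submit(sub)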

Tape Stage Strategy

If tape is envisaged to be used actively in production, a well-planned staging strategy is advisable for an efficient data flow. Since space on the staging pools and the Belle SE is limited, data would need to be recalled in time before its reprocessing, without wasting space and tape-drive bandwidth by recalling unnecessary/not-to-be-processed data.

For example, ATLAS is working on its concept of a tape 'carousel', which envisages a processing cycle of reading data from tape, processing it, freeing the space for the next batch of data, and so on.

Since data on tape needs to be brought from nearline to online storage, particular limiting factors are the limited number of tape drives and their throughput. Ideally, related data should be kept close together in large files for efficient reading, as small, scattered files are highly inefficient to read back from tape.

While the dCache staging REST API, which was selected by WLCG as the preferred solution, is still in development, SRM remains the protocol of choice to control bringing files from one storage class to another.
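
A rough sketch of triggering such a recall via SRM with the gfal2 Python bindings (SURL, pin lifetime, and timeout are placeholders; the exact call signature should be checked against the installed gfal2 version):

    import gfal2

    ctx = gfal2.creat_context()
    surl = "srm://dcache-se.example.org/pnfs/belle/raw/run0001.root"   # placeholder SURL

    # Request staging from tape (nearline) to disk (online) and wait for it to finish:
    # pin the replica for 12 h, allow the request up to 1 h (synchronous call, async=False).
    ctx.bring_online(surl, 12 * 3600, 3600, False)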

Current Workflows on the NAF

Currently, we see various job/task workflows on the NAF and run our own specialized jobs. Depending on the re-calibration job patterns, such jobs could be integrated into the general NAF setup in various ways.

Lite Jobs

By default, each group has a dedicated share of the NAF resources. The drawback is that if a group does not utilize its share, the resources would stay idle, as no other user group would be allowed to use them.

To avoid such a scenario, default jobs run as lite jobs with a maximum runtime of 3 hours. Such < 3 h jobs can overflow onto unused resources of another group. Since their runtime is limited, in the worst case the donating group has to wait at most 3 h to get all of its nominal resources back (see the sketch below).
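
For illustration, a job would stay in the lite class by requesting a runtime below the 3 h limit, e.g. via the (assumed, NAF-specific) +RequestRuntime attribute:

    import htcondor

    # Sketch: request < 3 h of runtime so the job qualifies as a lite job
    # and may overflow onto another group's unused share.
    sub = htcondor.Submit({
        "executable":      "short_task.sh",   # placeholder
        "+RequestRuntime": "10000",           # ~2.8 h, below the 3 h lite-job limit
    })
    htcondor.Schedd().submit(sub)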

Jupyter Notebooks as Dedicated Condor Jobs

For example, Jupyter notebooks are started in dedicated Condor jobs. For these Condor jobs we reserve (hyper-threaded) cores on the nodes that we do not assign to general user jobs. Since notebook jobs are idle most of the time, the hyper-threaded cores can be utilized by normal jobs while a notebook process is idle on a core (and when running, a notebook does not affect the overall system performance significantly). Obviously, memory cannot reasonably be overbooked in the same way.


Depending on the re-calibration job/task flow, the NAF could be adapted in several ways:

  • as ordinary jobs
    • compete with user jobs within the Belle group
    • compete on the group level with the other groups' jobs
  • dedicated sub-cluster: Belle buys its own nodes that accept only specific jobs (limited users, groups, ...)
    • inefficient, wasting cores if not utilized 24/7