Staging Files vs. Files over Network File Systems

Files living on remote file systems such as

  • /nfs/...  (DUST, fast scratch space)
  • /pnfs/... (long term storage)
  • /afs/... (slow; don't use it for anything that needs performance or parallel access)

are easily available on the workgroup servers and compute nodes.

BUT

since all access goes over the network, latency, throughput, load, etc. depend heavily on your own network usage and on the overall usage by your colleagues.

Staging-out Files

If your workflows use files for temporary results, or your jobs perform many operations on files, it may be faster to work on such files in the job's local directory on the batch node and stage them in and out as needed.

  • For example, if your job creates a file and performs many operations on it, use the local directory on the batch node.
    When the actual work has finished, let the job copy the final file to the target directory on DUST.
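
A minimal Python sketch of this staging-out pattern, assuming the job's local directory is the one HTCondor exports as `_CONDOR_SCRATCH_DIR` (with a fallback to the system temp directory outside a batch job). The function name, file name, and target directory are hypothetical; the point is that the many writes hit the local disk and only the finished file crosses the network, once.

```python
import os
import shutil
import tempfile
from pathlib import Path


def local_work_dir() -> Path:
    # Assumption: HTCondor exports the job's local directory as
    # _CONDOR_SCRATCH_DIR; outside a batch job fall back to the system temp dir.
    return Path(os.environ.get("_CONDOR_SCRATCH_DIR", tempfile.gettempdir()))


def produce_and_stage_out(target_dir: Path, name: str = "result.txt") -> Path:
    """Build a file with many small writes locally, then copy it out once."""
    local_file = local_work_dir() / name
    # All the intermediate write operations go to the batch node's local disk.
    with open(local_file, "w") as f:
        for i in range(1000):
            f.write(f"line {i}\n")
    # One single copy of the finished file to the (network) target directory.
    target_dir.mkdir(parents=True, exist_ok=True)
    final = Path(shutil.copy2(local_file, target_dir / name))
    local_file.unlink()  # free the local space again
    return final
```

In a real job, `target_dir` would be your directory on DUST.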

Staging-in Files

If you want to stage a file into a job at the beginning, you can use Condor's 'should_transfer_files' feature (together with 'transfer_input_files') in the job submit file, which is fine for smaller files.
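
For smaller inputs, a submit description along the following lines lets HTCondor ship the files. To keep the examples in one language it is generated here by a small Python helper. `should_transfer_files`, `when_to_transfer_output` and `transfer_input_files` are standard HTCondor submit commands; the helper name, the executable and the input file names are hypothetical.

```python
def small_file_submit(executable: str, inputs: list) -> str:
    """Build an HTCondor submit description that lets Condor ship small inputs."""
    lines = [
        f"executable = {executable}",
        "should_transfer_files = YES",
        "when_to_transfer_output = ON_EXIT",
        # Comma-separated list of files copied into the job's working directory.
        f"transfer_input_files = {','.join(inputs)}",
        "queue",
    ]
    return "\n".join(lines) + "\n"


print(small_file_submit("run_analysis.sh", ["config.txt", "weights.json"]))
```

Every file listed in `transfer_input_files` passes through the submit side first, which is why this only suits small inputs.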

For large files, however, this means a lot of heavy copying, especially when the file is already on DUST: your submit node reads it over the network into its local scratch and then sends it over the network again to the batch node running the job.

  • not good: DUST → Submit Node → Batch Node → ...
  • better: let the job itself copy the large file to its local directory, do the actual work on it locally, and copy the results back to DUST at the end of the job
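
The 'better' variant can be sketched in Python as follows, assuming `_CONDOR_SCRATCH_DIR` points at the job's local directory. The helpers and the line-counting "work" are hypothetical stand-ins for a real analysis: the input is copied once from network storage, all reads happen on the local copy, and only the small result goes back.

```python
import os
import shutil
import tempfile
from pathlib import Path


def local_work_dir() -> Path:
    # Assumption: the job's local directory is exported as _CONDOR_SCRATCH_DIR.
    return Path(os.environ.get("_CONDOR_SCRATCH_DIR", tempfile.gettempdir()))


def stage_in(remote_file: Path) -> Path:
    """Copy the input once, sequentially, from network storage to local disk."""
    local_copy = local_work_dir() / remote_file.name
    shutil.copy2(remote_file, local_copy)
    return local_copy


def process_with_stage_in(remote_file: Path, result_dir: Path) -> Path:
    """Stage in, work locally, and write the small result back to result_dir."""
    local_copy = stage_in(remote_file)
    try:
        # Hypothetical stand-in for the real work: every read hits the local copy.
        with open(local_copy) as f:
            n_lines = sum(1 for _ in f)
        result = result_dir / (remote_file.stem + ".count")
        result.write_text(f"{n_lines}\n")  # only the result crosses the network
        return result
    finally:
        local_copy.unlink()  # clean up so the node keeps enough local space
```

The `finally` clause removes the local copy even if the work fails, so a crashed job does not leave large files behind on the batch node.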

Staging in files only makes sense if a file will be read more or less in full or is accessed many times. If you read just a few events from a large tuple and ignore most of the rest, copying the whole file is overkill.
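
One way to make this rule of thumb concrete is to compare the bytes each approach moves over the network. The sketch below uses a deliberately simplified model (an assumption, not a site policy): staging in costs one full read of the file, while direct access costs the bytes actually read, once per pass over the data.

```python
def should_stage_in(file_size: int, bytes_read_per_pass: int, passes: int = 1) -> bool:
    """Return True if copying the whole file once moves fewer bytes over the
    network than reading the needed parts directly, pass after pass.

    Simplified model: latency and per-request overhead are ignored, which for
    many small reads tends to understate the real cost of direct access.
    """
    if file_size < 0 or bytes_read_per_pass < 0 or passes < 0:
        raise ValueError("sizes and pass count must be non-negative")
    direct_cost = bytes_read_per_pass * passes
    return file_size < direct_cost


GB = 10**9
print(should_stage_in(20 * GB, 1 * GB))            # a few events from a big tuple
print(should_stage_in(2 * GB, 2 * GB, passes=3))   # whole file, read three times
```

Reading 1 GB out of a 20 GB tuple once gives `False`; reading a 2 GB file completely three times gives `True`.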


Obviously, all of this works only as long as there is sufficient local space on the batch node. If your files are 'too large' in total (~10 GB), there is a risk (averaged over all active jobs) that the node runs out of local space and has to kill overly greedy jobs.
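
A job can check the free local space before deciding to stage in, and fall back to reading over the network otherwise. This sketch uses Python's `shutil.disk_usage`; the 2 GiB safety margin and the helper names are hypothetical, and `_CONDOR_SCRATCH_DIR` is assumed to be the job's local directory.

```python
import os
import shutil
import tempfile

# Hypothetical safety margin: leave this much local disk free for other jobs.
DEFAULT_MARGIN = 2 * 1024**3  # 2 GiB


def local_scratch() -> str:
    # Assumption: HTCondor exports the job's local directory as _CONDOR_SCRATCH_DIR.
    return os.environ.get("_CONDOR_SCRATCH_DIR", tempfile.gettempdir())


def fits_locally(file_size: int, margin: int = DEFAULT_MARGIN) -> bool:
    """True if file_size bytes plus the safety margin fit into local scratch."""
    free = shutil.disk_usage(local_scratch()).free
    return file_size + margin <= free


def choose_input_path(remote_file: str, margin: int = DEFAULT_MARGIN) -> str:
    """Stage in when there is room, otherwise read directly over the network."""
    size = os.path.getsize(remote_file)
    if not fits_locally(size, margin):
        return remote_file  # too big for this node: avoid getting the job killed
    local = os.path.join(local_scratch(), os.path.basename(remote_file))
    shutil.copy2(remote_file, local)
    return local
```

The check only sees the space free at that moment; other jobs starting on the same node can still fill the disk afterwards, so the margin should stay generous.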