When storing files on the infrastructure, a few points should be considered to ensure optimal performance and efficient use of the resources.
Large Files:
Files should not be overly large, especially if they are to be archived via dCache to tape (this applies to files in the /asap3, /gpfs/exfel and at least parts of the /gpfs/cfel space, apart from the more obvious /pnfs/desy.de/.... direct dCache spaces).
In fact there is a hard limit in the archiver that doesn't allow files bigger than 1TiB.
- Avoid creating files larger than 1TiB. They cannot be archived, and their presence causes problems in the workflow that require manual intervention.
- If you need them temporarily during analysis, be sure to remove them before archival is triggered.
- Ideally files should be less than 300GiB in size.
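The limits above can be checked before archival is triggered. The following is a minimal sketch, not an official tool: the function names `classify_size` and `find_large_files` are made up for this illustration, and only the two thresholds (1TiB and 300GiB) come from the text above.

```python
import os
import stat

GIB = 1024 ** 3
TIB = 1024 ** 4
HARD_LIMIT = 1 * TIB          # archiver limit stated above
RECOMMENDED_MAX = 300 * GIB   # recommended upper bound stated above


def classify_size(size_bytes):
    """Return 'too_large', 'large' or 'ok' for a file size in bytes."""
    if size_bytes > HARD_LIMIT:
        return "too_large"    # cannot be archived
    if size_bytes >= RECOMMENDED_MAX:
        return "large"        # archivable, but above the recommendation
    return "ok"


def find_large_files(root):
    """Yield (path, size, verdict) for regular files under root that are not 'ok'."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)   # do not follow symlinks
            except OSError:
                continue              # file vanished or is unreadable
            if not stat.S_ISREG(st.st_mode):
                continue
            verdict = classify_size(st.st_size)
            if verdict != "ok":
                yield path, st.st_size, verdict
```

Calling `list(find_large_files(some_directory))` on an analysis directory (the directory is whatever you choose to scan) gives a list of candidates to split or remove.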
Small Files:
Small files are not a problem in themselves, only when they come in large numbers (tens of thousands), especially if they are all cluttered into one directory (the worst directory has 2,403,870 entries...).
Too many files make handling them slow. This particularly affects archival to and retrieval from tape, so if possible the information contained in many small files should be combined into a few bigger files.
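One possible way to combine many small files is to pack them into a single tar file. This is only a sketch under assumptions: tar is our choice for the illustration (a domain format such as HDF5 may suit your data better), and the helper names `count_entries` and `bundle_directory` are hypothetical.

```python
import os
import tarfile


def count_entries(directory):
    """Count the direct entries of a directory without building a full list."""
    with os.scandir(directory) as entries:
        return sum(1 for _ in entries)


def bundle_directory(directory, archive_path):
    """Pack all regular files directly inside `directory` into one uncompressed tar file.

    `archive_path` should point outside `directory`.
    """
    with tarfile.open(archive_path, "w") as tar:
        with os.scandir(directory) as entries:
            for entry in sorted(entries, key=lambda e: e.name):
                if entry.is_file(follow_symlinks=False):
                    # Store only the file name, not the full source path.
                    tar.add(entry.path, arcname=entry.name)
    return archive_path
```

After verifying the archive, the original small files can be removed so that only the bundle is archived. Keep the size recommendations for large files in mind when choosing how much to put into one bundle.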
File names:
Avoid bad patterns in file names. While it is technically possible to use any UTF-8 encoded string (without the '/' character) of up to 255 bytes in length as a file name, some names should be avoided. These include (examples of all of them can be found in our data, so this list is based on real-world findings, not theory):
- Names with control characters in them, like 'ub%01.mat' (where %01 is the URL-encoded rendering of the ASCII SOH character). This also includes file names with newlines or carriage returns in them.
- Names with spaces at the beginning or end. While spaces in filenames are not nice anyway, at the beginning or end of a name they are plain evil.
- Names with any of the more exotic Unicode items, such as noncharacters, code points from the private use areas, direction marks or invisible spaces.
- Names that are too long, best keep the names shorter than 200 bytes. 'dummy_00011___________________________________________________________________________________________________________________________________________________________________________________________________________________________________________m02.nxs' is not a good name!
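The patterns listed above can be detected automatically. The sketch below is an illustration only: `check_name`, `is_noncharacter` and the `INVISIBLE` set are names invented here, and the set of invisible and direction characters is an assumption that is not exhaustive.

```python
import unicodedata

MAX_BYTES = 200  # recommended limit from the list above (the hard limit is 255)

# Direction marks and invisible spaces flagged by this sketch (not exhaustive).
INVISIBLE = {
    "\u200b", "\u200c", "\u200d", "\u2060", "\ufeff",   # zero-width characters
    "\u200e", "\u200f", "\u061c",                       # direction marks
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",   # bidi embeddings/overrides
    "\u2066", "\u2067", "\u2068", "\u2069",             # bidi isolates
}


def is_noncharacter(ch):
    """True for Unicode noncharacters (U+FDD0..U+FDEF and every U+xxFFFE/U+xxFFFF)."""
    cp = ord(ch)
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def check_name(name):
    """Return a list of problems found in a single file name (empty if none)."""
    problems = []
    # 'surrogateescape' matches how Python decodes undecodable bytes in file names.
    if len(name.encode("utf-8", "surrogateescape")) >= MAX_BYTES:
        problems.append("too long")
    if name != name.strip(" "):
        problems.append("leading or trailing space")
    for ch in name:
        category = unicodedata.category(ch)
        if category == "Cc":          # includes SOH, newline, carriage return
            issue = "control character"
        elif category == "Co":
            issue = "private-use character"
        elif is_noncharacter(ch):
            issue = "noncharacter"
        elif ch in INVISIBLE:
            issue = "invisible or direction character"
        else:
            continue
        if issue not in problems:
            problems.append(issue)
    return problems
```

For example, `check_name("ub\x01.mat")` reports a control character, while a plain ASCII name of moderate length yields an empty list.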
If you encode a date in a file or folder name, use the correct endianness to allow proper sorting of the names, i.e. put the most significant part first: avoid names like 28082023 and use 20230828 instead (%Y%m%d in strftime terms).
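The effect on sorting can be seen in a short sketch. The helper `dated_name` and the prefix "scan" are hypothetical and only serve the illustration; the format string %Y%m%d is the one named above.

```python
from datetime import date


def dated_name(prefix, day):
    """Build a name with the date year-first, e.g. prefix_YYYYMMDD."""
    return f"{prefix}_{day.strftime('%Y%m%d')}"


days = [date(2023, 8, 28), date(2022, 12, 31), date(2023, 1, 15)]
chronological = sorted(days)

year_first = [d.strftime("%Y%m%d") for d in days]   # recommended
day_first = [d.strftime("%d%m%Y") for d in days]    # to be avoided

# Plain string sorting of year-first names matches chronological order.
year_first_sorted_ok = sorted(year_first) == [d.strftime("%Y%m%d") for d in chronological]

# The same sort on day-first names does not.
day_first_sorted_ok = sorted(day_first) == [d.strftime("%d%m%Y") for d in chronological]
```

With year-first names, a plain directory listing or shell glob already returns the entries in time order, without any date parsing.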