Analysing transfer and telemetry data collected at dCache installations at DESY


This summer-student project focuses on analysing transfer and telemetry data collected at all dCache installations at DESY. dCache is a distributed storage system designed for high-throughput data transfers and easy horizontal scaling. In total, about 100 PiB of scientific data are stored on the different installations, which serve all scientific communities on site and many off-site users. We record about 20 million data transfer operations each day and need to combine these records with the telemetry data collected on the storage nodes themselves to get a complete picture of the status of the installations. The candidates would have access to both data sets, and their task would be to learn and apply data analytics techniques, including machine learning, to this problem. The system in place uses state-of-the-art industry software such as Apache Kafka, Apache Spark, and the Elastic toolkit. These applications would be accessed through Jupyter Notebooks.

Suggested Qualifications

General Python knowledge, experience with Jupyter Notebooks, and experience with data visualisation in Python are an advantage.

