Enabling data-driven science with optimized high-throughput data processing combined with long-term storage

Prof. Manuel Delfino, Director, Port d'Informació Científica (PIC), Barcelona, Spain

The last 10 years have witnessed an acceleration in the generation of data across all sectors of our society. In the scientific sector, this acceleration can be exemplified by the fact that the plan to produce 10 PB of data per year at CERN's Large Hadron Collider was considered singular in 1999, whereas such data rates are common in 2013, with sources as diverse as robotized telescopes, genetic sequencers and supercomputers. The development of technologies and methodologies to do something useful with all these data is what we currently term "Big Data". Turning these "Big Science Data" into reliable scientific results requires the tight coupling of high-performance clusters to high-throughput storage systems in order to achieve good turnaround for analysis jobs. In addition, there are increasing requirements for sharing data among scientists scattered worldwide, for archiving data so that re-analysis remains possible many years after the first scientific results have been published, and for making scientific data produced with public funding accessible in an open but controlled manner. Addressing all of these requirements involves a range of techniques, from questioning deeply rooted concepts such as file systems and RAID arrays to re-discovering mainframe techniques of the 1970s, such as recall-optimized tape systems. But technology alone cannot solve the "Big Science Data" problem; it is therefore essential for scientists and computer specialists to work together on methodologies that ensure the technological solutions will be effective, efficient and economic.
Manuel Delfino is Professor of Physics at the Universitat Autònoma de Barcelona and Director of the Port d'Informació Científica (PIC), a scientific support center for data-driven science. He is an advisor on data processing to the Astroparticle Physics European Coordination consortium and to the European Gravitational Observatory. As head of the CERN Information Technology Division from 1999 to 2002, he contributed to the technological and methodological basis of the first worldwide, tightly integrated, multi-PB distributed data processing system, the Worldwide LHC Computing Grid. Throughout his career he has focused on the integration of computing technologies into particle physics experiments, working at CERN, the Supercomputing Computations Research Institute in Tallahassee, Florida, the Stanford Linear Accelerator Center and the University of Wisconsin. Prof. Delfino holds degrees in Applied Mathematics, Mechanical Engineering, Computer Science and Physics, all from the University of Wisconsin-Madison.

Revised 22/12/15