Section: Overall Objectives

Context: the need for scalable data management

A rapidly increasing number of application areas generate and process very large volumes of data on a regular basis. Such applications are called data-intensive. Governmental and commercial statistics, climate modeling, cosmology, genetics, bio-informatics and high-energy physics are just a few examples. In these fields, it is crucial to efficiently store and manipulate massive amounts of data, which are typically shared at a large scale and accessed concurrently. In all of them, the overall application performance depends heavily on the properties of the underlying data management service. With the emergence of recent infrastructures such as cloud computing platforms and post-Petascale architectures, achieving highly scalable data management has become a critical challenge.

The KerData project-team focuses on scalable data storage and processing on clouds and post-Petascale platforms, in line with the current needs and requirements of data-intensive applications. We are particularly interested in the applications of major international and industrial players in Cloud Computing and post-Petascale High-Performance Computing (HPC), which shape the longer-term agenda of the Cloud Computing and Exascale HPC research communities.

Our research activities focus on data-intensive high-performance applications that exhibit the need to handle:

  • massive data BLOBs (Binary Large OBjects), on the order of Terabytes,

  • stored on a large number of nodes (thousands to tens of thousands),

  • accessed under heavy concurrency by a large number of processes (thousands to tens of thousands at a time),

  • with a relatively fine access grain, on the order of Megabytes.
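
To make these orders of magnitude concrete, the following minimal sketch shows how a fine-grain access to a large BLOB could translate into requests to individual storage nodes. It is an illustration only, not a description of any particular storage system: the fixed chunk size, the round-robin placement of chunks on nodes, and the names `ChunkAccess` and `locate` are all assumptions introduced here.

```python
from dataclasses import dataclass

MiB = 1024 ** 2
TiB = 1024 ** 4


@dataclass(frozen=True)
class ChunkAccess:
    chunk_index: int  # position of the chunk within the BLOB
    node: int         # hypothetical node holding that chunk
    offset: int       # start offset inside the chunk, in bytes
    length: int       # number of bytes touched in the chunk


def locate(offset, length, chunk_size, num_nodes):
    """Map a byte range of a BLOB to the chunks and nodes it touches.

    Assumption: the BLOB is split into fixed-size chunks that are
    placed round-robin over the nodes (chunk i lives on node i % num_nodes).
    """
    if offset < 0 or length < 0 or chunk_size <= 0 or num_nodes <= 0:
        raise ValueError("invalid range or layout parameters")
    accesses = []
    end = offset + length
    pos = offset
    while pos < end:
        index = pos // chunk_size
        within = pos % chunk_size
        # Take bytes up to the end of the chunk or the end of the range.
        take = min(chunk_size - within, end - pos)
        accesses.append(ChunkAccess(index, index % num_nodes, within, take))
        pos += take
    return accesses


# A 1 TiB BLOB cut into 64 MiB chunks yields 16384 chunks.
chunks_per_blob = TiB // (64 * MiB)

# A 4 MiB read starting at 62 MiB straddles the first two chunks.
reads = locate(62 * MiB, 4 * MiB, chunk_size=64 * MiB, num_nodes=1000)
```

Under these assumptions, a single access of a few Megabytes touches only one or two nodes, so thousands of concurrent processes working on different ranges of the same BLOB can be served by different nodes in parallel.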

Examples of such applications are:

  • Massively parallel cloud data-mining applications (e.g., MapReduce-based data analysis).

  • Advanced Platform-as-a-Service (PaaS) cloud data services requiring efficient data sharing under heavy concurrency.

  • Advanced concurrency-optimized, versioning-oriented cloud services for virtual-machine-image storage and management at IaaS (Infrastructure-as-a-Service) level.

  • Scalable storage solutions for I/O-intensive HPC simulations on post-Petascale architectures.

  • Storage and I/O stacks for big data analysis in applications that manipulate structured scientific data (e.g., very large multi-dimensional arrays).