Section: Overall Objectives
General context and our focus
We are witnessing a rapidly increasing number of application areas that generate and process very large volumes of data on a regular basis. Such applications are called data-intensive. Governmental and commercial statistics, climate modeling, cosmology, genetics, bio-informatics, and high-energy physics are just a few examples. In these fields, it is crucial to store and manipulate massive data efficiently; such data are typically shared at large scale and accessed concurrently. In all these examples, overall application performance depends heavily on the properties of the underlying data management service. With the emergence of recent infrastructures such as cloud computing platforms and post-Petascale architectures, achieving highly scalable data management has become a critical challenge.
Our research activities focus on data-intensive high-performance applications that need to handle:
- massive unstructured data, BLOBs (Binary Large OBjects), on the order of terabytes,
- stored on a large number of nodes (thousands to tens of thousands),
- accessed under heavy concurrency by a large number of processes (thousands to tens of thousands at a time),
- with a relatively fine access grain, on the order of megabytes.
Examples of such applications are:
- Massively parallel cloud data-mining applications (e.g., MapReduce-based data analysis).
- Advanced Platform-as-a-Service (PaaS) cloud data services requiring efficient data sharing under heavy concurrency.
- Advanced concurrency-optimized, versioning-oriented cloud services for virtual machine image storage and management at the IaaS (Infrastructure-as-a-Service) level.
- Scalable storage solutions for I/O-intensive HPC simulations on post-Petascale architectures.
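To make the first example concrete, the MapReduce model behind such data-mining applications can be sketched as a toy single-process word count. This is only an illustrative sketch under assumed names (`map_phase`, `shuffle`, `reduce_phase`, `run_job` are hypothetical helpers, not the API of any system mentioned here); in a real deployment the input splits would be blocks of a large BLOB distributed over many storage nodes and processed by thousands of concurrent workers.

```python
from collections import defaultdict
from itertools import chain

def map_phase(split):
    # Each mapper independently emits (word, 1) pairs from its input split.
    return [(word, 1) for word in split.split()]

def shuffle(pairs):
    # The shuffle step groups all intermediate values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Each reducer aggregates the values for one key (here: a sum).
    return key, sum(values)

def run_job(splits):
    mapped = chain.from_iterable(map_phase(s) for s in splits)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())

# Two toy "splits" standing in for data blocks stored on different nodes.
counts = run_job(["big data big", "data storage"])
```

The access pattern this sketch mimics, many processes reading disjoint medium-grain pieces of one huge data object and writing back small aggregates concurrently, is exactly the workload profile listed in the requirements above.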