Section: Research Program

Our research agenda

Three examples of motivating application scenarios will be described in detail in the next section:

  • Joint genetic and neuroimaging data analysis on Azure clouds;

  • Structural protein analysis on Nimbus clouds;

  • I/O-intensive atmospheric simulations for the Blue Waters post-petascale machine.

They illustrate the above challenges in specific ways, and they all exhibit a common pattern: massively concurrent processes accessing massive data at a fine granularity, where data is shared and distributed at a large scale. To address these challenges efficiently, we are exploring two main approaches:

  • the BlobSeer approach, which is central to our research efforts in cloud storage for Big Data processing. It relies on the design and implementation of scalable distributed algorithms for data storage and access, combining advanced techniques for decentralized metadata and data management with versioning-based concurrency control to sustain application performance under heavy access concurrency.

  • the Damaris approach (fully independent of BlobSeer), which exploits multicore parallelism in post-petascale supercomputers to enable jitter-free, low-overhead I/O management and non-intrusive in situ visualization for large-scale simulations.
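The core idea behind the first approach, versioning-based concurrency control, can be illustrated with a minimal sketch. This is not the actual BlobSeer implementation (which is distributed and manages metadata across many nodes); the `VersionedBlob` class and its methods are hypothetical names chosen for illustration. The key property shown is that each write publishes a new immutable version, so readers never block writers and always observe a consistent snapshot:

```python
import threading

class VersionedBlob:
    """Toy sketch of versioning-based concurrency control.

    Each write publishes a new immutable version of the blob's chunks,
    so concurrent readers never block writers and always see a
    consistent snapshot identified by a version number.
    """

    def __init__(self, num_chunks, initial=b""):
        self._lock = threading.Lock()  # serializes version publication only
        self._versions = [tuple([initial] * num_chunks)]  # version 0

    def write(self, offset, chunks):
        """Create a new version with `chunks` written at `offset`;
        return the new version id."""
        with self._lock:
            base = list(self._versions[-1])
            base[offset:offset + len(chunks)] = chunks
            self._versions.append(tuple(base))  # publish immutable snapshot
            return len(self._versions) - 1

    def read(self, version=None):
        """Read a given version (latest by default); lock-free, since
        published versions are immutable."""
        versions = self._versions
        v = len(versions) - 1 if version is None else version
        return versions[v]
```

For example, a reader holding version 0 still sees the original data after a concurrent write has published version 1; real systems following this pattern additionally garbage-collect versions no reader can still reach.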

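The dedicated-core idea behind the second approach can likewise be sketched in a few lines. In Damaris proper, one or more cores per multicore node are dedicated to I/O and reached through shared memory; in this illustrative sketch (all names hypothetical), a background thread stands in for the dedicated core. Compute hands data off through a queue and continues immediately, so slow I/O no longer introduces jitter into the simulation loop:

```python
import queue
import threading

def run_simulation(num_steps=3, field_size=4):
    """Sketch of the dedicated-core idea: computation hands each step's
    output to an asynchronous worker and proceeds without blocking,
    overlapping I/O with the next compute phase."""
    outbox = queue.Queue()
    written = []

    def io_worker():
        # Stand-in for the dedicated I/O core: drains simulation
        # output in the background (here, a trivial reduction stands
        # in for writing to storage or feeding in situ visualization).
        while True:
            item = outbox.get()
            if item is None:  # shutdown sentinel
                break
            step, field = item
            written.append((step, sum(field)))

    writer = threading.Thread(target=io_worker)
    writer.start()
    for step in range(num_steps):
        field = [float(step)] * field_size  # stand-in for a simulation field
        outbox.put((step, field))           # non-blocking hand-off
        # ...computation for the next step proceeds here, overlapped with I/O...
    outbox.put(None)
    writer.join()
    return written
```

The design choice this illustrates is trading a small amount of compute capacity (the dedicated core) for predictable, jitter-free time steps, since the simulation never waits on the file system.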
Our short- and medium-term research plan addresses storage challenges in two main contexts: clouds and post-petascale HPC architectures. Consequently, it is split into two main themes, corresponding to their respective challenges. For each theme, we have initiated several actions through collaborative projects coordinated by KerData, which define our current research agenda.

Based on very promising results demonstrated by BlobSeer in preliminary experiments [34], we have initiated several collaborative projects in the area of cloud data management: the MapReduce ANR project (aiming to improve both the performance and the fault tolerance of the storage component of MapReduce processing frameworks, to better support highly concurrent data analytics applications); the A-Brain Microsoft-Inria project (leveraging these improvements on Microsoft Azure clouds for joint neuroimaging and genetics analysis); and the Z-CloudFlow Microsoft-Inria project (exploring how to efficiently manage metadata for geographically distributed workflows). These projects give us concrete and effective means to work in close connection with strong partners already well positioned in cloud computing research.

Similarly, Damaris is the fruit of a very successful collaboration within the Joint Inria-Illinois-ANL-BSC-JSC-RIKEN/AICS Laboratory for Extreme-Scale Computing (JLESC, formerly JLPC). It has become a reference framework illustrating the use of a dedicated-core approach for scalable I/O and non-intrusive in situ visualization on post-petascale HPC systems. It also led to the creation of the particularly active Data@Exascale Associate Team between Inria, ANL and UIUC, an excellent framework for broader research activity involving many young researchers and students from the KerData team and its partners. This Associate Team serves as a basis for extended research on our approaches (including Damaris and Omnisc'IO) beyond the frontiers of our team. Our team plays a leading role in the Big Data and I/O research activities of the JLESC lab, which facilitates high-quality collaborations and access to some of the most powerful supercomputers, an important asset that has already helped us produce and transfer results (e.g., Damaris).

Thanks to these projects, we now enjoy visible scientific positioning at the international level.