

Section: Application Domains

Core challenges for scalable data-intensive storage and processing

Although they come from different application areas, the three examples of data-intensive applications above illustrate common requirements for data storage and I/O processing. These requirements lead to several core challenges, discussed below.

Challenges related to cloud storage

In the area of cloud data management, a significant milestone is the emergence of the Map-Reduce [47] parallel programming paradigm. Today, it is successfully used on most cloud platforms, following the trend set by Amazon [28]. The key strength of this model is its inherently high degree of potential parallelism: it has been demonstrated to process Petabytes of data in a few hours on large clusters of several thousand nodes. At the core of Map-Reduce frameworks lies a key component, the storage layer, which must meet a series of specific requirements.
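To make the programming model concrete, the following is a minimal, purely illustrative word-count sketch in Python. It runs on a single machine; a real Map-Reduce framework distributes the map and reduce calls over thousands of nodes, which is possible because each call is independent.

```python
from collections import defaultdict
from itertools import chain

# Map phase: each input record (here, a line of text) is turned
# independently into (key, value) pairs; these calls share no state,
# so a framework can run them in parallel across many nodes.
def map_phase(line):
    return [(word, 1) for word in line.split()]

# Shuffle phase: group all intermediate values by key.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: each key is reduced independently, again in parallel.
def reduce_phase(key, values):
    return key, sum(values)

lines = ["the quick brown fox", "the lazy dog"]
intermediate = chain.from_iterable(map_phase(l) for l in lines)
counts = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
print(counts)  # {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```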

First, since data is stored in huge files, the computation must efficiently process small parts of these files concurrently. Thus, the storage layer is expected to provide efficient fine-grained access to the files. Second, the storage layer must be able to sustain a high throughput despite heavy concurrent access to the same file, as thousands of clients access data simultaneously, while meeting fault-tolerance and security requirements; the sketch below illustrates this access pattern.
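As a rough illustration of the fine-grained, concurrent access pattern that such a storage layer must serve, the following sketch has several workers read disjoint byte ranges of one large file in parallel. The file name, chunk size, and per-chunk computation are hypothetical placeholders, not part of any specific framework.

```python
import os
from concurrent.futures import ThreadPoolExecutor

PATH = "huge_input.dat"        # hypothetical large data file
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MiB, a typical Map-Reduce split size

def read_chunk(offset, length):
    # Each worker opens its own handle and reads only its byte range,
    # mimicking the fine-grained accesses issued by map tasks.
    with open(PATH, "rb") as f:
        f.seek(offset)
        return f.read(length)

def process_chunk(data):
    # Placeholder for the per-chunk computation of a map task.
    return len(data)

file_size = os.path.getsize(PATH)
offsets = range(0, file_size, CHUNK_SIZE)

# Thousands of such readers hitting the same file concurrently is exactly
# the load the storage layer must sustain without losing throughput.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(
        lambda off: process_chunk(read_chunk(off, CHUNK_SIZE)), offsets))

print(sum(results), "bytes processed")
```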

Our goal is precisely to address these challenges by proposing scalable data management techniques with these properties, to support Map-Reduce-based, data-intensive applications. Thanks to partnerships with leading teams in the area of cloud computing, both in academia (the Nimbus team at Argonne National Lab) and in industry (the Microsoft Azure team), we anticipate that our contributions can have a high impact.

Challenges related to data-intensive HPC applications

The requirements exhibited by the climate simulation described above highlight a major, more general research topic. It has been clearly identified by international panels of experts such as IESP [34] and EESI [30], in the context of HPC simulations running on post-Petascale supercomputers: how to store and analyze massive data outputs during and after the simulation without impacting overall performance.

A jump of one order of magnitude in the size of numerical simulations is required to address some of the fundamental questions in several communities, such as climate modeling, solid earth sciences, or astrophysics. Scientists, simulation codes, and computing infrastructure are largely ready for this leap, but the lack of data-intensive infrastructure and methodology for analyzing such huge simulations is a growing limiting factor. Our goal is to help remove this bottleneck through innovative data storage techniques.