The KerData Team was officially created on July 1st, 2009. It is a spinoff of the Paris Project-Team and corresponds to that team's former “Data management” activity.
More and more applications today generate and handle very large volumes of data on a regular basis. Such applications are called data-intensive. Governmental and commercial statistics, climate modeling, cosmology, genetics, bio-informatics and high-energy physics are just a few examples of fields where it becomes crucial to efficiently manipulate massive data, which are typically shared at a large scale. With the emergence of recent infrastructures (cloud computing platforms, post-Petascale architectures), achieving highly scalable data management is a critical challenge, as the overall application performance is highly dependent on the properties of the data management service.
On Infrastructure-as-a-Service (IaaS) cloud infrastructures, computing resources are exploited on a per-need basis: instead of buying and managing hardware, users rent virtual machines and storage space. One important issue is thus the support for storing and processing data on externalized, virtual storage resources. Such needs require simultaneous investigation of important aspects related to performance, scalability, security and quality of service. Moreover, the impact of physical resource sharing also needs careful consideration.
In parallel with the emergence of cloud infrastructures, considerable efforts are now under way to build Petascale computing systems, such as Blue Waters (http://
Our research activities address the area of distributed data management at challenging scales on various distributed systems, with a particular focus on clouds and Post-Petascale infrastructures. We target data-oriented high-performance applications that need to handle massive unstructured data, namely BLOBs (binary large objects, in the order of Terabytes), stored on a large number of nodes (thousands to tens of thousands) and accessed under heavy concurrency by a large number of clients (thousands to tens of thousands at a time) with a relatively fine access grain (in the order of Megabytes). Examples of such applications are:
Cloud data-mining applications (e.g., based on the MapReduce paradigm) handling massive data distributed at a large scale.
Advanced (e.g., concurrency-optimized, versioning-oriented) cloud services both for user-level data storage and for virtual machine image storage and management at IaaS level.
Distributed storage for Petaflop computing applications.
Data storage for desktop grid applications with high write throughput requirements.
The KerData Team has been led by G. Antoniu since July 2010. His mission is to submit a proposal for the team to become a fully-fledged Project-Team within the year.
A new project, led by G. Antoniu, has been accepted by the ANR ARPEGE 2010 Program on embedded systems and large infrastructures. This project is devoted to using the MapReduce programming paradigm on clouds and hybrid infrastructures.
A new project, led by G. Antoniu and B. Thirion (Parietal Project-Team, Inria Saclay – Île-de-France), has started in collaboration with Microsoft Research. This project, conducted within the framework of the Microsoft Research - INRIA Joint Research Center, involves Microsoft's Azure cloud computing platform.
A new Associate Team led by G. Antoniu (DataCloud@work, http://
We have set up a partnership with the INRIA-UIUC Joint Laboratory for Petascale Computing at Urbana-Champaign. Several mutual visits and internships were organized in this framework and numerous collaborations are on track in the context of the Blue Waters Project, expected to become one of the world's most powerful supercomputers when it comes online in 2011, with sustained Petaflop performance.
The TCPP Best PhD Student Poster Award was awarded to B. Nicolae at the IPDPS 2010 conference in Atlanta, GA, USA.
G. Antoniu and L. Bougé have initiated the proposal and creation of the ENS-INRIA Prize of excellence. It is targeted at Romanian high-school students who won the National Olympiad of Informatics. A first group of 5 students was hosted in Rennes and Paris in June 2010. They visited the two sites of ENS Cachan (Cachan and Bruz) and two INRIA centers (Inria Rennes – Bretagne Atlantique and Inria Paris – Rocquencourt).
The first PhD thesis of the KerData Team (B. Nicolae) was defended on 30 November 2010.
Managing data at large scales is paramount nowadays. Governmental and commercial statistics, climate modeling, cosmology, genetics, bio-informatics, etc. are just a few examples of fields routinely generating huge amounts of data. It becomes crucial to efficiently manipulate these data, which are typically shared at the global scale. In such a context, one important goal is to provide mechanisms that make it possible to manage massive data blocks (e.g., of several terabytes), while providing efficient fine-grain access to small parts of the data. Several application areas exhibit such a need for efficient scaling to huge data sizes: data mining applications, multimedia applications, database-oriented applications, bioinformatic applications, etc.
The management of massive data blocks naturally requires the use of data fragmentation and of distributed storage. Grid infrastructures, typically built by aggregating distributed resources that may belong to different administration domains, have been built over the last years with the goal of providing an appropriate solution. When considering the existing approaches to grid data management, we can notice that most of them heavily rely on explicit data localization and on explicit transfers of large amounts of data across the distributed architecture: GridFTP, LDR, Chirp, IBP, NeST, etc. Managing huge amounts of data in such an explicit way at a very large scale makes the design of grid applications much more complex. One key issue to be addressed is therefore transparency with respect to data localization and data movements. Such transparency is highly desirable, as it frees the user from the need to handle data localization and transfers.
Several approaches to grid data management acknowledge that providing a transparent data access model is important. They integrate this idea at the early stages of their design. Grid file systems, for instance, provide a familiar, file-oriented API that gives transparent access to physically distributed data through globally unique, logical file paths. The applications simply open and access such files as if they were stored on a local file system. A very large distributed storage space is thus made available to existing applications that usually use file storage, with no need for modifications. This approach has been taken by a few projects like GFarm, GridNFS, LegionFS, etc.
On the other hand, the transparent data access model is equally defended by the concept of grid data-sharing service, illustrated for instance by the JuxMem platform. Such a service provides grid applications with the abstraction of a globally shared memory, in which data can be easily stored and accessed through global identifiers. To meet this goal, the design of JuxMem leverages the strengths of several building blocks: consistency protocols inspired by Distributed Shared Memory (DSM) systems; algorithms for fault-tolerant distributed systems; protocols for scalability and volatility support from peer-to-peer (P2P) systems.
Studies show that more than 80% of the data in circulation globally is unstructured. In addition, data sizes increase at a dramatic pace, with more than 1 TB of data gathered per week in common scenarios for some production applications (e.g., medical experiments). Finally, on Post-Petascale HPC machines, the use of huge storage objects is also currently being considered as a promising alternative to today's dominant approaches to data management. Indeed, these approaches rely on very large numbers of small files, and using huge storage objects reduces the corresponding metadata overhead of the file system. Such huge unstructured data are stored as binary large objects (BLOBs) that may continuously be updated by applications. However, traditional databases or file systems can hardly cope efficiently with BLOBs that grow to huge sizes.
To address the scalability issue, specialized abstractions like MapReduce and Pig-Latin propose high-level data processing frameworks intended to hide the details of parallelization from the user. Such platforms are implemented on top of huge object storage platforms. They target high performance by optimizing the parallel execution of the computation. This leads to heavy access concurrency to the BLOBs, hence the need for the storage layer to offer support in this regard. Parallel and distributed file systems also consider using objects for low-level storage (see next subsection). In other application areas, huge BLOBs need to be used concurrently and directly at the highest-level layers of applications: high-energy physics, multimedia processing or astronomy.
When addressing the problem of storing and efficiently accessing very large unstructured data objects in a distributed environment, a challenging case is the one where data is mutable and potentially accessed by a very large number of concurrent, distributed processes. In this context, versioning is an important feature. Not only does it make it possible to roll back data changes when desired, but it also enables cheap branching (possibly recursively): the same computation may proceed independently on different versions of the BLOB. Versioning should obviously not impact access performance to the object significantly, given that objects are under constant heavy access concurrency. On the other hand, versioning leads to increased storage space usage, which becomes a major concern when the data size itself is huge. Versioning efficiency thus refers to both access performance under heavy load and a reasonably acceptable storage space overhead.
Recent research emphasizes a clear move currently in progress from a block-based interface to an object-based interface in storage architectures. The goal is to enable scalable, self-managed storage networks by moving low-level functionalities such as space management to storage devices or to storage servers, accessed through a standard object interface. This move has a direct impact on the design of today's distributed file systems: an object-based file system would store data as objects rather than as unstructured data blocks. Reportedly, this move may eliminate nearly 90% of the management workload, which was the major obstacle limiting the scalability and performance of file systems.
Two approaches exploit this idea. In the first approach, the data objects are stored and manipulated directly by a new type of storage device called object-based storage device (OSD). This approach requires an evolution of the hardware, in order to allow high-level object operations to be delegated to the storage device. The standard OSD interface was defined in the Storage Networking Industry Association (SNIA) OSD working group. The protocol is implemented over SCSI and defines a new set of SCSI commands. Recently, a second generation of the command set, Object-Based Storage Devices - 2 (OSD-2), has been defined. The distributed file systems taking the OSD approach assume that such OSDs will be available in the near future and currently rely on a software module simulating their behavior. Examples of parallel/distributed file systems following this approach are Lustre and Ceph. Recently, research efforts have explored the feasibility and the possible benefits of integrating OSDs into parallel file systems, such as PVFS.
The second approach does not rely on the presence of OSDs, but still tries to benefit from an object-based approach to improve performance and scalability: files are structured as a set of objects that are stored on storage servers. Google File System and HDFS (Hadoop File System) illustrate this approach.
During the last few years, research and development in the area of large-scale distributed computing led to the clear emergence of several types of physical execution infrastructures for large-scale distributed applications.
The cloud computing model is gaining serious interest from both industry and academia in the area of large-scale distributed computing. It provides a new paradigm for managing computing resources: instead of buying and managing hardware, users rent virtual machines and storage space. Various cloud software stacks have been proposed by leading industry companies, like Google, Amazon or Yahoo!. They aim at providing fully configurable virtual machines or virtual storage (IaaS: Infrastructure-as-a-Service), higher-level services including programming environments such as MapReduce (PaaS: Platform-as-a-Service) or community-specific applications (SaaS: Software-as-a-Service). On the academic side, two of the most visible projects in this area are Nimbus, from the Argonne National Lab (USA), and OpenNebula, which aim at providing a reference implementation for an IaaS. In parallel to these trends, other research efforts focused on the concept of grid operating system: a distributed operating system for large-scale wide-area dynamic infrastructure spanning multiple administrative domains. XtreemOS is such a grid operating system, which provides native support for virtual organizations. Since both the cloud approach and the grid operating system approach deal with resource management on large-scale distributed infrastructures, the relative positioning of these two approaches with respect to each other is currently subject to on-going investigation within the PARIS/MYRIADS Project-Team (http://
In the context of the emerging cloud infrastructures, some of the most critical open issues relate to data management. Providing users with the possibility to store and process data on externalized, virtual resources from the cloud requires simultaneously investigating important aspects related to security, efficiency and quality of service. To this end, it clearly becomes necessary to create mechanisms able to provide feedback about the state of the storage system along with the underlying physical infrastructure. The information thus monitored can further be fed back into the storage system and used by self-managing engines, in order to enable an autonomic behavior, possibly with several goals such as self-configuration, self-optimization, or self-healing. Exploring ways to address the main challenges raised by data storage and management on cloud infrastructures is the major factor that motivated the creation of the KerData research team at Inria Rennes – Bretagne Atlantique. These topics are at the heart of our involvement in several projects that we are leading in the area of cloud storage: MapReduce (see Section ), AzureBrain (see Section ), DataCloud@work (see Section ).
In 2011, a new NSF-funded Petascale computing system, Blue Waters, will go online at the University of Illinois. Blue Waters is expected to be the most powerful supercomputer in the world for open scientific research when it comes online. It will be the first system of its kind to sustain one-Petaflop performance on a range of science and engineering applications. The goal of this facility is to open up new possibilities in science and engineering by providing unprecedented computational capability. It makes it possible for investigators to tackle much larger and more complex research challenges across a wide spectrum of domains: predict the behavior of complex biological systems, understand how the cosmos evolved after the Big Bang, design new materials at the atomic level, predict the behavior of hurricanes and tornadoes, and simulate complex engineered systems like the power distribution system, airplanes and automobiles.
To reach sustained-Petascale performance, machines like Blue Waters rely on advanced, dedicated technologies at several levels: processor, memory subsystem, interconnect, operating system, programming environment, system administration tools. In this context, data management is again a critical issue that highly impacts the application behavior and its overall performance. Petascale supercomputers exhibit specific architectural features (e.g., a multi-level memory hierarchy scalable to tens to hundreds of thousands of cores) that need to be specifically taken into account. Providing scalable data throughput at such unprecedented scales is clearly an open challenge today. In this context, we are investigating techniques to achieve concurrency-optimized I/O in collaboration with teams from the National Center for Supercomputing Applications (NCSA/UIUC) in the framework of the Joint INRIA-UIUC Laboratory for Petascale Computing (see Section ).
In recent years, Desktop grids have been extensively investigated as an efficient way to build cheap, large-scale virtual supercomputers by gathering idle resources from a very large number of users. A possible approach is to rely on clusters of workstations belonging to institutions and interconnected through a dedicated, high-throughput wide-area interconnect, which is the typical physical infrastructure for Grid Computing. In contrast, Desktop grids rely on desktop computers provided by volunteer users and interconnected through the Internet. The initial, widespread usage of Desktop grids for parallel applications consisting of non-communicating tasks with small input/output parameters is a direct consequence of the physical infrastructure: volatile nodes and low bandwidth are not suitable for communication-intensive parallel applications with high input or output requirements. However, the increasing popularity of volunteer computing projects has progressively led to enlarging the set of application classes that might benefit from Desktop Grid infrastructures. If we consider distributed applications where tasks need very large input data, it is no longer feasible to rely on regular centralized server-based Desktop Grid architectures, in which the input data is typically embedded in the job description and sent to workers. Such a strategy could lead to significant bottlenecks as the central server gets overwhelmed by download requests. To cope with such data-intensive applications, alternative approaches based on P2P techniques and Content Distribution Networks have been proposed, with the goal of offloading the transfer of the input data from the central servers to the other nodes participating in the system, which have potentially under-used bandwidth.
In the general case, Desktop Grids rely on resources contributed by volunteers. Enterprise Desktop Grids are a particular case of Desktop Grids leveraging unused processing cycles and storage space available within the enterprise. The emergence of cloud infrastructures has opened new perspectives for the development of Desktop Grids, as new types of usage may benefit from a hybrid, simultaneous use of these two types of infrastructures. In a typical scenario of this kind, an enterprise would not use dedicated, on-site hardware resources for a particular need for data-intensive analysis, e.g., to process commercial statistics. It would rather rely on free unused internal resources using the Enterprise Desktop Grid model and, in addition, would rent resources from the cloud. Both architectures are suitable for massively parallel processing, which is why we intend to explore the potential advantages of using such hybrid infrastructures in the framework of the MapReduce project (see Section ).
MapReduce is a parallel programming paradigm successfully used by large Internet service providers to perform computations on massive amounts of data. A computation takes a set of input key/value pairs, and produces a set of output key/value pairs. The user of a MapReduce library expresses the computation as two functions: map, that processes a key/value pair to generate a set of intermediate key/value pairs, and reduce, that merges all intermediate values associated with the same intermediate key. The framework takes care of splitting the input data, scheduling the jobs' component tasks, monitoring them and re-executing the failed ones. After being strongly promoted by Google, it has also been implemented by the open source community through the Hadoop project, maintained by the Apache Foundation and supported by Yahoo! and even by Google itself. This model is currently getting more and more popular as a solution for rapid implementation of distributed data-intensive applications. The key strength of the MapReduce model is its inherently high degree of potential parallelism that should enable processing of Petabytes of data in a couple of hours on large clusters consisting of several thousand nodes.
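As an illustration of the programming model described above, the following minimal sketch expresses a word count as a map function and a reduce function. The driver `run_mapreduce` is a hypothetical, sequential stand-in for the framework (it performs no splitting, scheduling or re-execution), and all names are chosen for this example only.

```python
from collections import defaultdict


def map_fn(_key, line):
    """Map: emit one intermediate (word, 1) pair per word of the input line."""
    for word in line.split():
        yield word.lower(), 1


def reduce_fn(word, counts):
    """Reduce: merge all intermediate values that share the same key."""
    return word, sum(counts)


def run_mapreduce(records, map_fn, reduce_fn):
    """Sequential stand-in for the framework: map, group by key, then reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    return dict(reduce_fn(k, v) for k, v in sorted(groups.items()))


# Input key/value pairs: (line number, line content).
lines = [(0, "the quick brown fox"), (1, "the lazy dog"), (2, "The fox")]
word_counts = run_mapreduce(lines, map_fn, reduce_fn)
```

In a real framework the calls to `map_fn` run in parallel on different input splits, and the grouping step corresponds to the shuffle between the map and reduce phases.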
At the core of MapReduce frameworks lies a key component: the storage layer. To enable massively parallel data processing over a large number of nodes, the storage layer must meet a series of specific requirements. Firstly, since data is stored in huge files, the computation will have to efficiently process small parts of these huge files concurrently. Thus, the storage layer is expected to provide efficient fine-grain access to the files. Secondly, the storage layer must be able to sustain a high throughput in spite of heavy access concurrency to the same file, as thousands of clients simultaneously access data.
These critical needs of data-intensive distributed applications have not been addressed by classical, POSIX-compliant distributed file systems. Therefore, specialized file systems have been designed, such as HDFS, the default storage layer of Hadoop. HDFS, however, has some difficulties in sustaining a high throughput in the case of concurrent accesses to the same file. Amazon's cloud computing initiative, Elastic MapReduce, employs Hadoop on their Elastic Compute Cloud infrastructure (EC2) and inherits these limitations. The storage back-end used by Hadoop in this setting is Amazon's Simple Storage Service (S3), which provides limited support for concurrent accesses to shared data. Moreover, many desirable features are missing altogether, such as the support for versioning and for concurrent updates to the same file. Finally, another important requirement for the storage layer is its ability to expose an interface that enables the application to be data-location aware. This is critical in order to allow the scheduler to use this information to place computation tasks close to the data and thus reduce network traffic, contributing to a better global data throughput. These topics are at the core of KerData's contribution to the MapReduce ANR project and to the Hemera large wingspan project (both started in 2010, see Section ).
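To make the notion of fine-grain access more concrete, the sketch below shows one possible way to translate a small byte range of a huge file into the fixed-size chunks that hold it, so that only those chunks need to be fetched. The fixed chunk size, the round-robin placement and the function names are assumptions made for this example; they do not describe a particular storage system.

```python
def chunks_for_range(offset, size, chunk_size):
    """Return (chunk_index, start_in_chunk, length) for every chunk touched
    by the byte range [offset, offset + size)."""
    if offset < 0 or size < 0 or chunk_size <= 0:
        raise ValueError("invalid range or chunk size")
    pieces = []
    pos, end = offset, offset + size
    while pos < end:
        index = pos // chunk_size
        start = pos % chunk_size
        length = min(chunk_size - start, end - pos)
        pieces.append((index, start, length))
        pos += length
    return pieces


def provider_for_chunk(index, num_providers):
    """Hypothetical round-robin placement of chunks on storage nodes."""
    return index % num_providers


MB = 1 << 20
# A 2 MB read starting 512 bytes into chunk 5 of a file striped in 1 MB chunks.
pieces = chunks_for_range(offset=5 * MB + 512, size=2 * MB, chunk_size=MB)
providers = [provider_for_chunk(index, num_providers=4) for index, _, _ in pieces]
```

Because consecutive chunks land on different nodes, clients reading different parts of the same file tend to contact different nodes, which is one way to sustain throughput under concurrent access.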
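The benefit of a data-location-aware interface can be sketched as follows. Given the nodes that store each chunk, a scheduler can try to run each task where its input already resides. The greedy heuristic below is an assumption made for this example and is not the scheduling policy of Hadoop; `chunk_locations` stands for the information a storage layer would expose.

```python
def place_tasks(chunk_locations, free_slots):
    """Greedy placement: run each task on a node that stores its input chunk
    when such a node has a free slot, otherwise on any node with a free slot."""
    slots = dict(free_slots)
    placement = {}
    for chunk, replicas in chunk_locations.items():
        local = [node for node in replicas if slots.get(node, 0) > 0]
        candidates = local or [node for node, free in slots.items() if free > 0]
        if not candidates:
            raise RuntimeError("no free slot left for chunk %r" % chunk)
        # Prefer the candidate with the most free slots to balance the load.
        node = max(candidates, key=lambda n: slots[n])
        slots[node] -= 1
        placement[chunk] = node
    return placement


# Hypothetical layout: which nodes hold a replica of each chunk.
chunk_locations = {"c0": ["A", "B"], "c1": ["A"], "c2": ["C"], "c3": ["A"]}
free_slots = {"A": 2, "B": 1, "C": 1}

placement = place_tasks(chunk_locations, free_slots)
local_tasks = sum(placement[c] in chunk_locations[c] for c in placement)
```

In this run three of the four tasks read their input locally. The task for `c3` falls back to a remote node because node `A` has no slot left, which also shows that a greedy choice is not necessarily optimal.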
The research carried out within the KerData team targets applications that handle massive data that are fragmented, distributed, shared and accessed under heavy concurrency at a large scale.
Massively parallel data-mining applications (e.g., MapReduce-based data analysis).
Advanced PaaS-level cloud data services requiring efficient data sharing under heavy concurrency.
I/O-intensive scientific simulations for Post-Petascale infrastructures.
Desktop grid applications with high write throughput requirements.
In the current projects started in 2010 we specifically work on providing concurrency-optimized data storage and management for the following applications.
In the framework of the MapReduce ANR project led by KerData (started in October 2010) we will validate our techniques for concurrency-optimized data management with an application study from the bioinformatics field. It will focus on the SuMo application proposed by the Institute for Biology and Chemistry of the Proteins from Lyon (a partner of the MapReduce project). This application performs structural protein analysis by comparing a set of protein structures against a very large set of structures stored in a huge database. This is a typical data-intensive application that can leverage the MapReduce model for a scalable execution on large-scale distributed platforms.
If the results are convincing, they can immediately be applied to MED-SuMo, the derivative version of this application for drug design in an industrial context, managed by the MEDIT SME (also a partner of this project). For the pharmaceutical and biotech industries, such a scalable implementation run on a cloud computing facility opens new perspectives for drug design. Rather than searching for 3D similarity in biostructural data, it will become possible to classify the entire biostructural space and to periodically update all derivative predictive models with new experimental data. The applications of that complete chemo-proteomic vision include the identification of new druggable protein targets, the detection of new allosteric binding sites suitable to increase the selectivity of a drug compound, the generation of new drug candidates by a fragment-based approach over protein-ligand biostructural data, and other new protocols under development at MEDIT.
The AzureBrain Project started in October 2010 within the Microsoft Research-INRIA Joint Research Center. In this framework, we focus on a data-analysis application whose goal is to find statistically relevant correlations across two huge sets containing genetic data and neuroimaging data respectively, for large cohorts of subjects. In the genome dimension, genotyping DNA chips make it possible to record several hundreds of thousands of values per subject, whereas in the imaging dimension an fMRI volume may contain hundreds of thousands to millions of voxels. Finding the brain and genome regions that may be involved in this link entails a huge number of hypotheses, hence a drastic correction of the statistical significance of pairwise relationships, which in turn crucially reduces the sensitivity of the statistical procedures that aim at detecting the association.
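The order of magnitude of this correction can be illustrated with a short calculation. The sketch assumes a Bonferroni correction, which is only one possible correction and is not stated to be the one used in the project, and it uses hypothetical round figures within the ranges quoted above.

```python
# Hypothetical sizes, chosen within the ranges quoted in the text.
n_genetic_values = 500_000   # values recorded per subject by a DNA chip
n_voxels = 100_000           # voxels in one fMRI volume

# One hypothesis per (genetic value, voxel) pair.
n_tests = n_genetic_values * n_voxels

# Bonferroni correction (an assumption for this example): to keep the
# family-wise error rate at alpha, each test must pass alpha / n_tests.
alpha = 0.05
per_test_threshold = alpha / n_tests
```

With these figures there are 5 x 10^10 pairwise tests, so each individual association must reach a significance level of about 10^-12 to be retained, which explains the loss of sensitivity mentioned above.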
We collaborate with the PARIETAL team from Inria Saclay – Île-de-France, which works on such optimized techniques for joint genetic and neuroimaging analysis. We plan to redesign the application using a cloud-oriented programming model such as MapReduce, and then to adapt and evaluate the whole software stack (application, programming engine, BlobSeer-based storage components) on Microsoft's Azure platform. The input application data will be taken from the Imagen FP6 project (http://
The Blue Waters machine (http://
Such simulations usually need to be coupled with visualization tools. On supercomputers, previous studies already showed the need to adapt the I/O path from data generation to visualization. In the framework of the JLPC we started to investigate concurrency-optimized I/O techniques to achieve this goal. We focus on a particular tornado simulation called CM1, which is intended to be run on the Blue Waters machine. This simulation currently generates large amounts of data in many files, in a way that is not suited to later visualization. We started to explore the use of BlobSeer, a large-scale data management service designed by the KerData team, as an intermediate layer between the simulation, the file system and visualization tools. Concurrency control optimizations enabled by BlobSeer will be tuned to ensure efficient access to the files managed by the underlying file system. A preliminary study done by Matthieu Dorier (Master student at ENS Cachan - Brittany) during a 3-month internship at UIUC, co-advised by Marc Snir, Franck Cappello and G. Antoniu, has demonstrated the benefits of a new approach using dedicated I/O cores (see Section ).
Bogdan.Nicolae@inria.fr, Gabriel.Antoniu@inria.fr
GNU Lesser General Public License (LGPL) version 3.
This software is available on INRIA's forge. Registration of version 1.0 (released late 2010) with APP is in progress.
BlobSeer is a data storage service specifically designed to deal with the requirements of large-scale data-intensive distributed applications that abstract data as huge sequences of bytes, called BLOBs (Binary Large OBjects). It exports a simple, yet versatile versioning interface to manipulate BLOBs that enables reading, writing and appending to them. BlobSeer offers both scalability and performance with respect to a series of issues typically associated with the data-intensive context: scalable aggregation of storage space from the participating nodes with minimal overhead, ability to store huge data objects, efficient fine-grain access to data subsets, high throughput in spite of heavy access concurrency, as well as fault-tolerance.
Development started in January 2008. The implementation is built on top of the Boost collection of C++ libraries, Berkeley DB and libconfig. Additional scripting in Perl/Python handles deployment on Grid'5000, which is done through the OAR resource scheduler. Benchmarking so far has proven correctness and scalable performance with up to 400 nodes from 3 different sites.
The latest stable version of BlobSeer, v1.0, brings a large set of new features and improvements whose usefulness was experimentally validated during the course of 2010. Of particular importance to the user are two new features: (1) the support to efficiently clone BLOBs by using a new, dedicated primitive that was added to the access interface; and (2) a POSIX access interface to BLOBs (implemented over FUSE) that enables applications to access BLOBs using standard I/O calls, while retaining the ability to perform BLOB-specific manipulations (such as access to past versions and cloning) through ioctls.
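The idea of a versioning interface with reading, writing, appending and cloning can be illustrated by the toy, in-memory sketch below. It is not BlobSeer's actual API or implementation: the class name, the method signatures and the copy-on-write chunk tuples are assumptions made for this example. Each update publishes a new version that shares all untouched chunks with the previous one.

```python
class VersionedBlob:
    """Toy in-memory BLOB in which every update publishes a new version."""

    def __init__(self, chunk_size=4, versions=None):
        if chunk_size <= 0:
            raise ValueError("chunk_size must be positive")
        self.chunk_size = chunk_size
        # A version is an immutable tuple of chunks; version 0 is empty.
        self._versions = list(versions) if versions else [()]

    @property
    def latest(self):
        return len(self._versions) - 1

    def _chunks(self, version):
        return self._versions[self.latest if version is None else version]

    def size(self, version=None):
        return sum(len(chunk) for chunk in self._chunks(version))

    def read(self, offset, size, version=None):
        """Read a byte range from any version, touching only the needed chunks."""
        cs = self.chunk_size
        first = offset // cs
        last = (offset + size + cs - 1) // cs  # exclusive chunk index
        data = b"".join(self._chunks(version)[first:last])
        start = offset - first * cs
        return data[start:start + size]

    def write(self, offset, data):
        """Publish a new version; chunks outside the written range are shared."""
        old = self._versions[-1]
        if offset > self.size():
            raise ValueError("write would leave a hole in the BLOB")
        cs = self.chunk_size
        first = offset // cs
        end = offset + len(data)
        last = min(len(old), (end + cs - 1) // cs)  # exclusive chunk index
        region = b"".join(old[first:last])
        start = offset - first * cs
        region = region[:start] + data + region[start + len(data):]
        new_chunks = tuple(region[i:i + cs] for i in range(0, len(region), cs))
        self._versions.append(old[:first] + new_chunks + old[last:])
        return self.latest

    def append(self, data):
        return self.write(self.size(), data)

    def clone(self):
        """Cheap branch: the clone starts from the latest version and shares
        its chunks instead of copying them."""
        return VersionedBlob(self.chunk_size, [self._versions[-1]])


blob = VersionedBlob(chunk_size=4)
v1 = blob.append(b"aaaabbbbcccc")
v2 = blob.write(4, b"XXXX")      # overwrites the second chunk only
branch = blob.clone()
branch.append(b"dddd")           # evolves independently of blob
```

Readers of version `v1` are never disturbed by the write that produced `v2`, since published versions are immutable. This is the property that allows concurrent readers and writers to proceed without locking each other out in this sketch.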
Several contributions were achieved that relate directly to the core functionality of BlobSeer.
First, we refined the design principles behind BlobSeer and placed them in the context of scalable distributed storage systems: if combined together, these principles can help designers of distributed storage systems to meet the need for highly scalable data management. In particular, we focused on the potentially large benefits of using versioning to improve application data access performance under heavy concurrency. In this context, we extended the versioning-based access interface of BlobSeer with new primitives that further enhance the potential to exploit the inherent parallelism of data workflows efficiently.
Second, we proposed a generalization of a set of versioning algorithms for data management originally implemented in BlobSeer and published in previous years. We have introduced new data structures and redesigned several aspects to provide better decentralized metadata management, fine-grain access at arbitrary offsets, asynchrony and fault tolerance and, last but not least, to allow the user to explicitly control the layout of written data such that it is optimally distributed for reading.
Third, we extended the scope of our experimental evaluation and performed synthetic benchmarks that push the system to its limits. They demonstrated a high throughput under heavy access concurrency, even when metadata is replicated in order to provide fault tolerance. Furthermore, we extended the evaluation of BlobSeer as a storage back-end for Hadoop MapReduce and highlighted a series of improvements in the context of MapReduce data-intensive applications.
These contributions materialized in a reference publication about BlobSeer that provides a complete view over its design principles, algorithms, consistency and fault tolerance considerations, as well as experimental evaluations. A more compact overview of BlobSeer was also published in the PhD Forum of IPDPS'10, where the corresponding poster, presented during the conference, won the TCPP Best PhD Student Poster Award.
Complementary to these results, further work was undertaken to improve the usability of BlobSeer in the context of cloud computing. More specifically, we evaluated the trade-off resulting from transparently applying data compression to save storage space and bandwidth at the cost of a slight computational overhead. The aim is to reduce the storage space and bandwidth needs with minimal impact on I/O throughput under heavy access concurrency. To this end, we introduced a generic sampling-based compression technique that dynamically adapts to the heterogeneity of data and applied it to BlobSeer. It led to a significant improvement over the original implementation: almost no performance overhead when dealing with incompressible data, as well as significant savings in storage space and bandwidth for compressible data, with the added benefit of improved aggregated read throughput. These results were obtained through extensive experiments on the Grid'5000 testbed and have been published.
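The general idea of sampling-based compression can be sketched as follows: compress a small sample of each chunk first, and compress the whole chunk only if the sample shrinks enough. This is a minimal sketch under stated assumptions and not the technique as implemented in BlobSeer; the use of `zlib`, the prefix sample, the sample size and the ratio threshold are all choices made for this example.

```python
import os
import zlib


def should_compress(chunk, sample_size=1024, max_ratio=0.9):
    """Compress a small sample and predict whether the chunk is compressible."""
    if not chunk:
        return False
    sample = chunk[:sample_size]  # assumption: a prefix is representative
    ratio = len(zlib.compress(sample)) / len(sample)
    return ratio <= max_ratio


def store_chunk(chunk):
    """Return (compressed_flag, payload) as it would be sent to storage."""
    if should_compress(chunk):
        return True, zlib.compress(chunk)
    return False, chunk  # incompressible data is stored untouched


def load_chunk(compressed, payload):
    return zlib.decompress(payload) if compressed else payload


text = b"temperature=21.5;pressure=1013;" * 4096   # highly repetitive
noise = os.urandom(len(text))                      # incompressible

text_flag, text_payload = store_chunk(text)
noise_flag, noise_payload = store_chunk(noise)
```

The cost paid for incompressible data is limited to compressing the sample, which is why the overhead stays small in that case, while compressible chunks save both storage space and bandwidth.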
Finally, B. Nicolae successfully defended his PhD thesis on November 30, 2010. The thesis document details the contributions that relate to the core of BlobSeer since the beginning of the project.
The features exhibited by BlobSeer meet the storage needs of MapReduce applications. To evaluate the benefits of using BlobSeer as the storage back-end in such a context, we used Hadoop - Yahoo!'s implementation of the MapReduce framework. We replaced the original data storage layer of Hadoop, the Hadoop Distributed File System (HDFS), with our BlobSeer-based file system - BSFS. To measure the impact of our approach, we performed experiments both with synthetic microbenchmarks and with real MapReduce applications. The results showed that BSFS is capable of delivering a higher throughput than HDFS, and of sustaining it when the number of clients significantly increases. This work on integrating BlobSeer with Hadoop brought up various aspects of the Hadoop framework that could be improved .
One of these aspects concerns the append operation, for which HDFS does not offer support. In we show how the ability to concurrently append data to existing files can bring substantial benefits to MapReduce applications as well as to other classes of applications. Since BlobSeer efficiently supports concurrent appends, we modified the Hadoop MapReduce framework to use the append operation in the “reduce” phase of the application. Our experiments showed that massively concurrent append and read operations have a low impact on each other. Furthermore, measurements with an application shipped with Hadoop showed that support for concurrent appends to shared files comes at no extra cost, whereas the number of files managed by the MapReduce framework is substantially reduced.
We also addressed the problem of managing intermediate data, that is, data generated during MapReduce computations. In the original Hadoop MapReduce framework, intermediate data (data produced as output of the “map” phase and transferred as input to the “reduce” phase) is stored on the local file system of the machines executing the “map” function; in case of failure, the data is lost and the map computation is re-executed on another machine. Our approach was to store the intermediate data in a distributed file system, so that, when a failure occurs, the computation can resume on another machine. Moreover, by using BSFS as storage for intermediate data, the execution time is reduced thanks to the high throughput BSFS delivers. These issues have been explored within the Master thesis of Lan Trieu .
This work has been done in collaboration with Jing (Tylor) Cai, Master student at the City University of Hong Kong, and Mihaela-Camelia Vlad, Master student at the Polytechnic University of Bucharest. Both of them visited the KerData Team in 2009–2010 for several months, supported by the INRIA Internship program.
The goal of this research direction is to enable autonomic storage for BlobSeer-based cloud services. This work has been carried out in the framework of the DataCloud@work Associated Team between KerData and the Computer Science Department from Politehnica University of Bucharest - PUB ( http://
The first step towards an autonomic data-sharing system was to equip the BlobSeer platform with introspection capabilities, which can serve as input data for a self-adaptive engine deployed on top of the system, possibly with several goals such as self-configuration, self-optimization, self-healing or self-protection. This work has been published in .
Further, we implemented a distributed architecture for storing and processing monitoring data. Our solution was designed as a new BlobSeer component that does not interfere with its efficient data-access primitives. Instead, it builds a distributed user-activity history to obtain real-time information about the users in the system. Then we proposed a preliminary approach for enabling self-protection for the BlobSeer system, through a malicious client detection component, which analyzes protocol breaches specific to BlobSeer. These results have been published as INRIA research reports , .
We developed the self-protection direction within a generic security management framework allowing providers of Cloud data management systems to define and enforce complex security policies. In addition, we designed an expressive policy description language so as to be able to define a wide range of security attacks and to detect them in a security violation detection engine. We integrated our security framework with BlobSeer and we showed that we can provide a secure environment for data management systems without any significant overhead, while being able to define and detect complex attack scenarios. These results have been published in .
Moreover, we developed a specific security mechanism which assigns a trust level to each client by continually monitoring and analyzing the client activity and the state of the system to detect security threats, malicious activity or other kinds of intrusions. Additionally, we addressed the problem of securely running web services on top of BlobSeer. We implemented mechanisms that handle authentication and authorization of the users, as well as secure data transfers for web services that use BlobSeer as a storage back-end.
Another direction was to introduce self-management and self-adaptation facilities in BlobSeer. We enhanced BlobSeer with self-adaptive features by dynamically changing and maintaining the replication factors of the data. When a specific BLOB is under a heavy load (in terms of read operations), the system automatically increases its replication factor and handles all the necessary data transfers. In contrast, when some data is less (or never) used, its replication factor is transparently reduced. Moreover, we developed a component able to dynamically contract and expand the pool of storage providers based on the system's load, so as to adapt the resource usage to the needs of the clients accessing the data. Several Master research internships and Bachelor theses at PUB focused on these tasks.
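The replication policy described above can be sketched as a simple threshold rule. The bounds, thresholds and function name below are hypothetical and only illustrate the principle; they are not taken from BlobSeer, and the actual data transfers are left out.

```python
MIN_REPLICAS = 1    # hypothetical lower bound on the replication factor
MAX_REPLICAS = 8    # hypothetical upper bound on the replication factor
HOT_READS = 100     # hypothetical: reads per monitoring window above which a BLOB is "hot"
COLD_READS = 5      # hypothetical: reads per monitoring window below which a BLOB is "cold"


def adjust_replication(current: int, reads_in_window: int) -> int:
    """Return the new replication factor of one BLOB after one monitoring window."""
    if reads_in_window > HOT_READS:
        # Heavily read data gets one more replica, up to the upper bound.
        return min(current + 1, MAX_REPLICAS)
    if reads_in_window < COLD_READS:
        # Rarely read data loses one replica, down to the lower bound.
        return max(current - 1, MIN_REPLICAS)
    return current
```

Changing the factor by one step per window, rather than jumping directly to a target value, is a design choice of this sketch that limits the volume of data moved at once.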
This work has been done in collaboration with Matthieu Dorier, student at ENS Cachan, Brittany Campus, during his summer 2010 internship at the INRIA-UIUC Joint Laboratory for Petascale Computing (JLPC) at Urbana-Champaign.
High-performance concurrent I/O is a major requirement of data-intensive scientific applications, particularly those deployed on Petascale infrastructures. The larger the scale of the execution infrastructure, the more likely it is that the data input/output (I/O) layers become a performance bottleneck. We focused on specific scenarios that require efficient access to huge, shared data under heavily concurrent workloads. We identified two main issues that require closer consideration.
First, there is still a trade-off between high-performance data communication and atomic I/O capabilities of concurrent overlapped updates in the context of scientific applications. Current lock-based approaches mainly perform locking around the operations, imposing lock overhead and slowing down the overall performance. In this context, we aim to exploit the potential benefits of BlobSeer. By leveraging a versioning-based scheme, an atomic I/O operation is expected to be done in a lock-free manner, even when overlapped accesses occur. Following this direction, we conducted several experimental evaluations on Grid'5000 and obtained very promising results described in .
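The principle behind versioning-based atomic I/O can be shown with a toy model. This is a single-process sketch under strong assumptions, not BlobSeer's algorithm: the class and method names are hypothetical, and each version is a full copy, whereas a real system would presumably share unmodified data between versions. It only illustrates why readers need no locks: a published version is immutable, so a reader working on a given version never observes a partial or overlapping write.

```python
class VersionedBlob:
    """Toy versioned byte store: every write publishes a new immutable snapshot."""

    def __init__(self, size: int) -> None:
        self._versions = [bytes(size)]  # version 0 is filled with zero bytes

    def latest(self) -> int:
        """Return the number of the most recently published version."""
        return len(self._versions) - 1

    def write(self, offset: int, data: bytes) -> int:
        """Apply data at offset on top of the latest snapshot; return the new version."""
        base = self._versions[-1]
        if offset < 0 or offset + len(data) > len(base):
            raise ValueError("write outside BLOB bounds")
        snapshot = base[:offset] + data + base[offset + len(data):]
        self._versions.append(snapshot)  # the new version becomes visible in one step
        return self.latest()

    def read(self, version: int, offset: int, length: int) -> bytes:
        """Read from a fixed snapshot; later writes never change the result."""
        return self._versions[version][offset:offset + length]
```

For example, after two overlapping writes, a reader that holds the first version number still sees exactly the result of the first write.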
The second direction comes from the context of HPC and targets scientific simulations running on Petascale machines. The goal is to explore how to efficiently record and visualize data during the simulation without impacting the performance of the computation generating that data. The conventional practice of storing data on disk, moving it off-site, reading it into a workflow, and analyzing it to produce scientific data is becoming increasingly hard to apply, because large data volumes are generated at rates that are high compared to the limited back-end speeds. Therefore, scalable approaches to deal with these I/O limitations are of utmost importance. We propose to adapt concurrency control techniques introduced in BlobSeer in order to optimize the level of parallelization between visualization and simulation with respect to I/O. This would allow periodic data backup and online visualization to proceed without blocking computation, and vice versa.
A first step has been taken in this direction by studying the I/O behavior of a tornado simulation code called CM1, targeting the next IBM supercomputer, Blue Waters. This behavior induces large overheads due to the generation of many small files at the same time. We proposed a first solution using dedicated I/O cores as staging areas in order to overlap I/O with computation at the simulation level. This solution has proven capable of bringing a better balance in throughput, of avoiding overheads in I/O phases, and of performing data preprocessing efficiently. By coupling it with the BlobSeer approach, we intend to provide a full solution for efficiently coupling simulations with visualization tools at very large scales. This work was initiated during Matthieu Dorier's master internship at JLPC.
Providing efficient virtual machine image storage solutions is crucial in the context of Infrastructure-as-a-Service (IaaS) cloud computing, as users rent resources in terms of virtual machines that are instantiated from virtual machine images. One of the challenges in this context is the need to deploy a large number (hundreds or even thousands) of VM instances simultaneously. Once the VM instances are deployed, another challenge is to simultaneously take a snapshot of many images and transfer them to persistent storage to support management tasks, such as suspend-resume and migration.
During a 2-month visit at Argonne National Lab, USA, B. Nicolae adapted BlobSeer to address these needs. More specifically, a series of optimization techniques were proposed that minimize resource consumption (execution time, network traffic and storage space), which translates into lower end-user costs. While conventional approaches transfer the whole VM image contents between the persistent storage service and the computing nodes, we proposed a lazy transfer scheme based on object versioning that transfers only the needed content on demand: this greatly reduces execution time, network traffic and storage space. The benefits of this approach were demonstrated through extensive experiments operating on hundreds of nodes, showing improvements in time to boot virtual machines from a shared image by a factor of up to 25, while at the same time reducing storage and bandwidth usage by as much as 90% when compared with conventional approaches. This work is described in .
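The on-demand part of such a lazy transfer scheme can be sketched as follows. This is an illustration under assumptions, not the scheme evaluated in the experiments: the class name, the chunk size and the `fetch_chunk` callback (a stand-in for a call to the remote storage service) are hypothetical, and versioning of local modifications is left out. A chunk is transferred the first time it is read and served from a local cache afterwards.

```python
from typing import Callable, Dict

CHUNK_SIZE = 4  # hypothetical; kept tiny so the example is easy to follow


class LazyImage:
    """Local view of a remote VM image that fetches chunks only when they are read."""

    def __init__(self, size: int, fetch_chunk: Callable[[int], bytes]) -> None:
        self._size = size
        self._fetch_chunk = fetch_chunk      # stand-in for a remote storage call
        self._cache: Dict[int, bytes] = {}   # chunk index -> chunk content

    def read(self, offset: int, length: int) -> bytes:
        """Return up to length bytes starting at offset, fetching missing chunks."""
        end = min(offset + length, self._size)
        out = bytearray()
        pos = offset
        while pos < end:
            index = pos // CHUNK_SIZE
            if index not in self._cache:     # first touch: transfer the chunk
                self._cache[index] = self._fetch_chunk(index)
            chunk = self._cache[index]
            start = pos - index * CHUNK_SIZE
            take = min(CHUNK_SIZE - start, end - pos)
            out += chunk[start:start + take]
            pos += take
        return bytes(out)

    def chunks_transferred(self) -> int:
        """Number of distinct chunks fetched from remote storage so far."""
        return len(self._cache)
```

A VM that only touches a small part of its image during boot would thus trigger the transfer of only the corresponding chunks.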
Furthermore, cloud users need mechanisms to upload Virtual Machine (VM) images into a Cloud storage service before they are deployed to the physical nodes. We investigated this issue for the Nimbus Cloud environment by replacing its default repository with BlobSeer. This work has been published in .
In the context of the Associated Team between KerData and the Computer Science Department from Politehnica University of Bucharest, we made BlobSeer available as a storage service on the Cloud by integrating it within the Nimbus Cloud. We added mechanisms for bringing BlobSeer to a consistent state before stopping it, and for starting, stopping and restarting BlobSeer inside the Nimbus Cloud while preserving the data it stored during previous runs. Additionally, we investigated the advantages of using BlobSeer as a storage system for XtreemOS by conducting a series of performance evaluations targeted towards MapReduce applications. We experimented with Hadoop applications deployed on top of HDFS, BlobSeer and XtreemFS, the default file system of XtreemOS.
MapReduce is emerging as a highly scalable programming paradigm that enables high-throughput data-intensive processing as a cloud service. However, the associated performance is highly dependent on the underlying storage service, which is responsible for efficiently supporting massively parallel data accesses by guaranteeing a high throughput under heavy access concurrency. In this context, quality of service plays a crucial role: the storage service needs to sustain a stable throughput for each individual access, in addition to achieving a high aggregated throughput under concurrency.
We propose a technique to address this problem using component monitoring, application-side feedback and behavior pattern analysis. It makes it possible to automatically infer useful knowledge about the causes of poor quality of service, and to provide guidelines toward potential improvements. We apply our proposal to BlobSeer, as a representative data storage service specifically designed to achieve high aggregated throughputs. Through extensive experimentation, we demonstrated substantial improvements in the stability of individual data read accesses under MapReduce workloads. Within the SCALUS Marie-Curie project (see Section ) we plan to refine this work using OpenNebula as an IaaS cloud environment.
Joint genetic and neuroimaging data analysis on large cohorts of subjects is a new approach used to assess and understand the variability that exists between individuals. This approach has remained poorly understood so far and brings forward very significant challenges, as progress in this field can open pioneering directions in biology and medicine. As both neuroimaging- and genetic-domain observations represent a huge amount of variables (of the order of 10^6), performing statistically rigorous analyses on such amounts of data represents a computational challenge that cannot be addressed with conventional computational techniques. This project started in October 2010 for two years in the framework of the Microsoft Research - INRIA Joint Research Center and aims to explore cloud computing techniques to address the above computational challenge. The project will rely on Microsoft's Azure cloud platform and will leverage the complementary expertise of two INRIA teams: KerData (Rennes) in the area of scalable cloud data management and PARIETAL (Saclay) in the field of neuroimaging. For more details, see the official press release http://
The Brittany Regional Council provides half of the financial support for the PhD thesis of D. Moise (GRID5000BD project). This support amounts to a total of around 14,000 Euros/year. This support ends in September 2011.
KerData is leading the MapReduce project (October 2010 – March 2014) funded by the ANR ARPEGE 2010 Program on embedded systems and large infrastructures. This project is devoted to using the MapReduce programming paradigm on clouds and hybrid infrastructures. It started in October 2010 in partnership with Argonne National Lab (USA), the University of Illinois at Urbana-Champaign (USA), the UIUC-INRIA Joint Lab on Petascale Computing, IBM France, IBCP, MEDIT (SME) and the GRAAL INRIA project-team. In this project we explore advanced techniques for scalable, high-throughput, concurrency-optimized data and metadata management. Recent preliminary experiments with the BlobSeer storage platform designed by the KerData Team have shown substantial potential improvements in data throughput compared to Hadoop, which acts as today's reference MapReduce platform.
Hemera (
http://
The SCALUS Marie Curie Initial Training Network (
http://
Two parallel PhD theses funded by the SCALUS Project are co-advised by G. Antoniu (KerData) and María Pérez (Universidad Politécnica de Madrid, UPM). Both started in September 2010: Houssem-Eddine Chihoub, hosted by KerData, and Bunjamin Memishi, hosted at UPM. Both theses will explore ways to continue the preliminary joint work started by our teams involving BlobSeer and GloBeM (see Section ) in the framework of real cloud infrastructures, with real applications. Discussions and preliminary experiments are in progress on how the OpenNebula cloud toolkit developed at Universidad Complutense de Madrid could be used as a global framework for this work.
DataCloud@work was initiated in 2010 by G. Antoniu (KerData) as an Associate Team in partnership with Politehnica University of Bucharest (PUB) and the MYRIADS Team (Inria Rennes – Bretagne Atlantique). It aims to investigate ways to provide advanced, autonomic storage mechanisms for cloud services. More specifically, the goal is to explore how to build an efficient, secure and reliable storage service for data-intensive distributed applications running in cloud environments by enabling an autonomic behavior. A secondary goal is to leverage the grid operating system approach as a cloud technology (e.g., by relying on its OS support for virtual organizations). The project builds on preliminary prototypes: the BlobSeer data-sharing platform (designed by the KerData Team), the MonALISA monitoring framework (whose main technical contributor is the PUB Team), and the XtreemOS grid operating system (designed under the leadership of the MYRIADS Team). This work uses as a framework the Nimbus cloud toolkit from Argonne National Lab.
In 2010 we addressed the following topics: 1) Introduce self-adaptation capabilities in BlobSeer, based on the MonALISA monitoring framework; 2) Design and prototype an implementation of a generic security management framework for BlobSeer-based cloud storage; 3) Design mechanisms facilitating the deployment of BlobSeer on XtreemOS-enabled IaaS clouds based on Nimbus.
The main results achieved this year are described in detail at http://
B. Nicolae visited ANL (USA) thanks to the INRIA Explorateur Programme for 3 months (April to July 2010). This served as a preliminary step preparing the MapReduce ANR project started in October 2010 in partnership with ANL.
In 2010, 3 PhD students from PUB were hosted in Rennes for 3 months each (9 months overall). One PhD student from Rennes was hosted in Bucharest twice (two weeks overall).
In 2010, this work resulted in 3 joint publications involving at least 2 of the 3 partners of the Associate Team, 2 joint publications with Argonne National Lab, and a large number of Master and Bachelor theses. The results were presented at 3 internal workshops organized in Rennes.
In 2010, 2 PhD theses strongly related to the Associate Team were defended: B. Nicolae (KerData) in Rennes and Alexandru Costan (PUB) in Bucharest. The French and Romanian leaders of the Associate Team participated in both PhD committees.
Overall, 6 Bachelor theses locally carried out in Bucharest and 4 Master theses in Rennes were dedicated to subtasks derived from the scientific schedule of the DataCloud@work Associate Team. Out of these, 2 Master students from PUB were hosted by the KerData team through INRIA's Internship Programme (co-funded by KerData on its own resources).
MapReduce is an ANR project with international partners: Argonne National Lab (USA), the University of Illinois at Urbana-Champaign (UIUC, USA) and the Joint INRIA-UIUC Lab for Petascale Computing (JLPC). See Section for details.
This work has been done in collaboration with Matthieu Dorier, student at ENS Cachan, Brittany Campus, during his summer 2010 internship at the INRIA-UIUC Joint Laboratory for Petascale Computing at Urbana-Champaign.
Preliminary discussions have been held at the 2nd workshop of the INRIA-UIUC Joint Laboratory for Petascale Computing (JLPC, http://
G. Antoniu and B. Nicolae visited the National Center for Supercomputing Applications (NCSA) at UIUC in April 2010 to explore how the BlobSeer BLOB-based approach developed by KerData could be used to optimize the management of concurrent data I/O requests generated by massively parallel simulations that run simultaneously with parallel visualization tools. A preliminary study in this context was performed by Matthieu Dorier, Master student (M1) at ENS Cachan/Brittany, during a 3-month internship at NCSA/UIUC, in collaboration with several researchers at NCSA/UIUC involved in the JLPC (Marc Snir, Franck Cappello, Dave Semeraro). This study showed the benefit of a new approach using dedicated I/O cores.
We intend to extend this approach in two directions: 1) Compare the use of dedicated cores with the use of dedicated nodes, and model the performance of both approaches in order to select the best one according to the I/O characteristics of the applications and execution platforms; 2) Build a BlobSeer-based metadata software layer able to schedule I/O operations coming from the simulation. This work will continue during the master internship (M2) of Matthieu Dorier at KerData in 2011. It is expected to be pursued further during his PhD thesis in the KerData Team. This topic is also part of INRIA's proposed contribution in the framework of an IP European project proposal to be submitted in January 2011. This IP project will involve 2 INRIA teams: KerData in Rennes and GRAND-LARGE in Saclay (through the JLPC at Urbana-Champaign).
FP3C (Framework and Programming for Post-Petascale Computing) is a joint project co-funded by the French National Research Agency (ANR) and by the Japan Science and Technology Agency (JST). It started in September 2010 for 3 years. Its main goal is to develop a programming chain and associated runtime systems which will allow scientific end-users to efficiently execute their applications on Post-Petascale, highly hierarchical computing platforms making use of multi-core processors and accelerators. This project gathers major actors involved in HPC research in France (INRIA, CEA, CNRS) and Japan (University of Tsukuba, University of Tokyo, Tokyo Institute of Technology, University of Kyoto).
Within this framework, we collaborate with Osamu Tatebe from the University of Tsukuba in the area of large-scale data-sharing. The goal of this collaboration is to design, implement and validate an integrated architecture for a Petascale storage system by weaving the best properties of global file systems (transparency, standard access interface) and RAM-based, BLOB storage systems (versioning, access efficiency under heavy concurrency). More specifically, we intend to explore how a hierarchical approach can be used to build a BLOB-based file system.
While such an approach has been used in classical, non-distributed computer architecture to explore the combined usage of file storage and RAM storage, no convincing attempt has been made regarding Post-Petascale distributed storage systems. As a first step, our objective in 2011 is to specify the joint architecture for a BLOB-based file storage system.
Several informal discussions took place with Ruby Krishnaswamy from Orange Labs, Issy-les-Moulineaux on potential collaborations in the area of cloud storage. Orange Labs is interested in BlobSeer-based concurrency-optimized storage support for virtual machine images and cloud application data.
L. Bougé serves as a Vice-Chair of the Steering Committee of the Euro-Par annual conference series on parallel computing. G. Antoniu serves as a Local Chair for the Parallel and Distributed Data Management topic of Euro-Par 2011, to be held in Bordeaux.
G. Antoniu served as a Vice-Chair of the Program Committee for the storage track of the IEEE NAS international conference on Networking, Architecture, and Storage.
G. Antoniu serves as a coordinator for the MapReduce ANR project (ARPEGE 2010 call), started in October 2010 in collaboration with Argonne National Lab, the University of Illinois at Urbana Champaign, the UIUC/INRIA Joint Lab on Petascale Computing, IBM, IBCP, MEDIT and the GRAAL INRIA Project-Team.
G. Antoniu and B. Thirion (PARIETAL Project-Team, Inria Saclay – Île-de-France) co-lead the AzureBrain Microsoft-INRIA Project started in October 2010 in the framework of the Microsoft Research - INRIA Joint Center (2010-2012).
G. Antoniu serves as a coordinator for the DataCloud@work Associate Team, a project involving the KerData and MYRIADS INRIA Teams in Rennes and the Distributed Systems Group from Politehnica University of Bucharest (2010–2012).
G. Antoniu coordinates the involvement of the Inria Rennes – Bretagne Atlantique Research Center in the SCALUS Project of the Marie-Curie Initial Training Networks Programme (ITN), call FP7-PEOPLE-ITN-2008 (2009-2013).
G. Antoniu coordinates the involvement of the Inria Rennes – Bretagne Atlantique Research Center in the CoreGRID ERCIM Working Group.
L. Bougé serves as a Vice-Chair of the National Selection Committee for High-School Mathematics Teachers, Informatics Track.
is a member of the Editorial Advisory Board of the Scientific Programming Journal.
served in the Program Committees for the following conferences and workshops: CloudCom 2010, ICPADS 2010, 3PGCIC-2010, MapReduce 2010, MAPRED 2010, ADiS 2010, SRMPDS 2010, AINA 2011, CISIS 2011, PDP 2011, RenPar'19, RenPar'20.
served in the Program Committee for the following conferences: NPC 2010.
served as a member of the Selection Committee for the Gilles Kahn PhD Thesis Award 2010.
was the chair of the national evaluation committee for the 2010 Scientific Excellence Award (Prime d'excellence scientifique, PES) targeted to the researchers on an academic teaching position in France.
gave a keynote talk entitled “Autonomic cloud storage: challenges at stake” at the ADiS workshop held in February 2010 in Krakow, Poland.
gave an invited talk entitled “BlobSeer: Enabling Efficient Lock-Free, Versioning-Based Storage for Massive Data under Heavy Access Concurrency” at the Parallel@Illinois Special Event Series, University of Illinois at Urbana-Champaign, IL, USA, in April 2010.
gave a keynote talk entitled “Scalable MapReduce Data Processing on Clouds: the BlobSeer Approach” at the International Conference on High Performance Computing and Simulation (HPCS 2010) held in June 2010 in Caen, France.
gave a talk entitled “BlobSeer: Efficient, Versioning-Based Storage for Massive Data under Heavy Access Concurrency on Clouds” at the Microsoft Research - INRIA Workshop on Extreme Operating Systems held in November 2010 in Paris, France.
gave an invited talk entitled “Concurrency-optimized I/O for visualizing HPC simulations: An Approach Using Dedicated I/O cores” at the 4th workshop of the Joint Laboratory for Petascale Computing held in November 2010 at NCSA/UIUC, Urbana-Champaign, IL, USA.
Only the teaching contributions of project-team members on non-teaching positions are mentioned below.
gave lectures on peer-to-peer systems within the Peer-to-Peer Systems Module of the Master Program (2nd year), University Rennes 1. He gave lectures on Grid Data Management within the Distributed Architectures Module of the ALMA Master Program (2nd year) of the University of Nantes. He also taught a full course on Grid Computing for final year engineering students at the ESIEA Engineering School, Paris.
serves as the Scientific Correspondent for the International Relations Office of the Inria Rennes – Bretagne Atlantique Research Center.
serves as the Scientific Leader of the KerData research team.
chairs the Computer Science and Telecommunication Department (Département Informatique et Télécommunications, DIT) of the Brittany Extension of Ens Cachan. He leads the Master Program (Magistère) in Computer Science at the Brittany Extension of Ens Cachan.
is a member of the Scientific Committee of Inria Rennes – Bretagne Atlantique (Comité des projets), standing for the Ens Cachan partner.
is a member of the Scientific Committee of Inria Rennes – Bretagne Atlantique (Comité des projets), standing for the KerData research team.