Section: New Results

System support: Performance and dependability benchmarking

Participants: Amit Sangroya, Dàmian Serrano, Sara Bouchenak [correspondent].

MapReduce has become a popular programming model and runtime environment for developing and executing distributed data-intensive and compute-intensive applications. It offers developers a means to transparently handle data partitioning, replication, task scheduling and fault tolerance on a cluster of commodity computers. MapReduce supports a wide range of applications, such as log analysis, data mining, Web search engines, scientific computing, bioinformatics, decision support and business intelligence.

There has been a large amount of work on improving the performance and reliability of MapReduce. Several efforts have explored task scheduling policies, cost-based optimization techniques, and replication and partitioning policies. There has also been considerable interest in extending MapReduce with other fault tolerance models, or with techniques from database systems. However, there has been very little empirical evaluation comparing the different systems. Most evaluations of these systems have relied on microbenchmarks based on simple MapReduce programs. While microbenchmarks may be useful for targeting specific system features, they are not representative of full distributed applications, and they do not provide realistic multi-user workloads. Furthermore, as far as we know, no studies have investigated dependability benchmarking of MapReduce.

Thus, we provide MapReduce Benchmarking (MRB), a novel MapReduce benchmark suite that enables a thorough analysis of a wide range of features of MapReduce systems. MRB has the following features. First, it enables the empirical evaluation of the performance and dependability of MapReduce systems. This provides a means to analyze the effectiveness of scalability and fault tolerance, two key features of MapReduce. Second, it covers a variety of application domains and of workload and faultload characteristics, ranging from compute-oriented to data-oriented applications, and from batch applications to online real-time applications. While MapReduce frameworks were originally limited to offline batch processing, recent work explores the extension of MapReduce beyond batch processing. Moreover, in order to stress MapReduce dependability and performance, the benchmark suite enables different fault injection rates, workloads and concurrency levels. Finally, the benchmark suite is portable and easy to use on a wide range of platforms, covering different MapReduce frameworks and cloud infrastructures. This work has been submitted for publication.
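
As an illustration of how fault injection rates, workloads and concurrency levels can be combined into an experiment plan, the following minimal sketch enumerates every combination of the three dimensions. It is not taken from MRB: the names RunConfig and experiment_grid, the workload labels, and the interpretation of the fault rate as a fraction of failed tasks are all assumptions made for this example.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class RunConfig:
    """One benchmark run (hypothetical structure, not the MRB format)."""
    workload: str            # e.g. "compute-oriented" or "data-oriented"
    fault_rate: float        # assumed meaning: fraction of tasks made to fail
    concurrent_clients: int  # number of clients submitting jobs in parallel


def experiment_grid(workloads, fault_rates, concurrency_levels):
    """Return the full cross product of the three benchmark dimensions."""
    return [
        RunConfig(w, f, c)
        for w, f, c in product(workloads, fault_rates, concurrency_levels)
    ]


# Illustrative values only; a real campaign would choose its own.
grid = experiment_grid(
    workloads=["compute-oriented", "data-oriented"],
    fault_rates=[0.0, 0.05],
    concurrency_levels=[1, 10],
)

for cfg in grid:
    print(cfg)
```

With two values per dimension the plan contains eight runs. A full cross product grows multiplicatively with each added value, so a real campaign would likely sample the grid or fix some dimensions.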