The Roma team is a joint team of CNRS, ENS Lyon, UCBL, and Inria. It is part of the Laboratoire de l'Informatique du Parallélisme (LIP), UMR 5668 (ENS Lyon, CNRS, Inria, UCBL), and is located at the École normale supérieure de Lyon. The external collaborators are members of the APO team of the IRIT laboratory (UMR 5505), located at the ENSEEIHT site of IRIT.
The Roma project aims at designing models, algorithms, and scheduling strategies to optimize the execution of scientific applications.
Modern computing platforms provide huge amounts of computational power: the top supercomputers contain more than 100,000 cores, and volunteer computing grids gather millions of processors. Squeezing the most out of these platforms could enable scientists to solve problems that currently remain beyond reach. However, to reach such a goal, all platform resources must be efficiently used: computational units, communication capabilities, memory hierarchies, energy, etc. Such resource optimizations are quite difficult because modern platforms have new, hard-to-manage characteristics: they contain multicore processors and sometimes specialized processors such as GPGPUs (General-Purpose Graphics Processing Units); they may be distributed on a very large scale, which can significantly impact communications; they may be volatile and even unreliable; and their usage may be subject to conflicting objectives from the platform owner(s) and users. Therefore, harnessing the full power of modern computing platforms requires careful theoretical algorithmic studies of resource optimization problems. The goal of the Roma project is to perform such studies and to design efficient practical scheduling strategies and resource allocation algorithms.
Historically, the Roma team results from the merger of two of the three groups that composed the GRAAL project-team: (i) the group focusing on fundamental research on scheduling strategies and algorithm design for heterogeneous platforms; and (ii) the group working on direct solvers for sparse linear systems. The Roma project is organized around two main research themes, inherited from the focus of the former groups, and four transverse topics.
The two main research themes are:
Static algorithms for dynamic environments
Direct solvers for sparse linear systems
The four transverse topics are:
Memory-aware algorithms
Linear algebra on post-petascale multicore platforms
Multi-criteria optimization
Combinatorial scientific computing
Sparse direct (multifrontal) solvers in distributed-memory environments have a wide range of applications, as they are used at the heart of many numerical methods in simulation: whether a model uses finite elements or finite differences, or requires the optimization of a complex linear or nonlinear function, one often ends up solving a linear system of equations involving sparse matrices. There are therefore many application fields; among those cited by the users of our sparse direct solver Mumps (see Section ) are: structural mechanics, biomechanics, medical image processing, tomography, geophysics, ad hoc network modeling (e.g., Markovian processes), electromagnetics, fluid dynamics, econometric models, oil reservoir simulation, magneto-hydrodynamics, chemistry, acoustics, glaciology, astrophysics, and circuit simulation.
Mumps (for MUltifrontal Massively Parallel Solver, see http://)
Mumps implements a direct method, the multifrontal method; it is a parallel code capable of exploiting distributed-memory computers; its main originalities are its numerical robustness and the wide range of functionalities available.
The latest release is Mumps 4.10.0 (May 2011).
More information on Mumps is available at http://
In this work, we defined a unified model for several well-known checkpoint/restart protocols. The proposed model is generic enough to encompass both extremes of the checkpoint/restart space, from coordinated approaches to a variety of uncoordinated checkpoint strategies (with message logging). We identified a set of crucial parameters, instantiated them and compared the expected efficiency of the fault tolerant protocols, for a given application/platform pair. We then proposed a detailed analysis of several scenarios, including some of the most powerful currently available HPC platforms, as well as anticipated Exascale designs. The results of this analytical comparison are corroborated by a comprehensive set of simulations. Altogether, they outlined comparative behaviors of checkpoint strategies at very large scale, thereby providing insight that is hardly accessible to direct experimentation.
We dealt with the impact of fault prediction techniques on checkpointing strategies. We extended the classical analysis of Young and Daly to the presence of a fault prediction system, which is characterized by its recall and its precision, and which provides either exact or window-based time predictions. We succeeded in deriving the optimal value of the checkpointing period (thereby minimizing the waste of resource usage due to checkpoint overhead) in all scenarios. These results make it possible to analytically assess the key parameters that impact the performance of fault predictors at very large scale. In addition, the results of this analytical evaluation were nicely corroborated by a comprehensive set of simulations, thereby demonstrating the validity of the model and the accuracy of the results.
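The analyses above build on the classical first-order approximation of Young and Daly for the optimal checkpointing period. As an illustration (not the refined analysis of this work, which also accounts for prediction recall and precision), a minimal sketch with hypothetical parameter values:

```python
import math

def daly_period(C, mtbf):
    """First-order Young/Daly approximation of the optimal checkpointing
    period for a checkpoint cost C and a platform MTBF."""
    return math.sqrt(2 * C * mtbf)

def waste(T, C, mtbf, R=0.0):
    """Expected fraction of resources lost with period T: checkpoint
    overhead C/T, plus the expected re-execution after a failure
    (recovery R plus, on average, half a period of lost work)."""
    return C / T + (R + T / 2) / mtbf

# Hypothetical example: 10-minute checkpoints, one-day platform MTBF.
C, mtbf = 600.0, 86400.0
T_opt = daly_period(C, mtbf)   # roughly 10,000 seconds
```

On this example, both halving and doubling the period away from T_opt increase the waste, which is the behavior the optimal-period derivation captures.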
Processor failures in post-petascale settings are common occurrences. The traditional fault-tolerance solution, checkpoint-rollback, severely limits parallel efficiency. One solution is to replicate application processes so that a processor failure does not necessarily imply an application failure. Process replication, combined with checkpoint-rollback, has recently been advocated by Ferreira et al. We first identified an incorrect analogy made in their work between process replication and the birthday problem, and derived correct values for the Mean Number of Failures To Interruption and the Mean Time To Interruption for exponentially distributed failures. We then extended these results to arbitrary failure distributions, including closed-form solutions for Weibull distributions. Finally, we evaluated process replication using both synthetic and real-world failure traces. Our main findings are: (i) replication is less beneficial than claimed by Ferreira et al.; (ii) although the choice of the checkpointing period can have a high impact on application execution in the no-replication case, with process replication this choice is no longer critical.
This work dealt with the complexity of scheduling computational workflows in the presence of exponentially distributed failures. When such a failure occurs, rollback and recovery are used so that the execution can resume from the last checkpointed state. The goal is to minimize the expected execution time, and we have to decide in which order to execute the tasks, and whether or not to checkpoint after the completion of each task. We showed that this scheduling problem is strongly NP-complete, and proposed a (polynomial-time) dynamic programming algorithm for the case where the application graph is a linear chain. These results laid the theoretical foundations of the problem, and constitute a prerequisite before discussing scheduling strategies for arbitrary DAGs of moldable tasks subject to general failure distributions.
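For the linear-chain case, the dynamic program can be sketched as follows, assuming exponentially distributed failures of rate lam, a checkpoint cost C, and a recovery cost R paid before each retry; the expected-time expression for one checkpointed segment is the classical closed form for exponential failures. Task weights and parameter values below are illustrative:

```python
import math
from itertools import accumulate

def expected_time(w, lam, C, R):
    """Expected time to run work w followed by a checkpoint of cost C,
    under exponential failures of rate lam, paying a recovery R before
    each retry: (1/lam + R) * (exp(lam * (w + C)) - 1)."""
    return (1.0 / lam + R) * (math.exp(lam * (w + C)) - 1.0)

def best_checkpoint_placement(tasks, lam, C, R):
    """O(n^2) dynamic program over a linear chain: dp[i] is the minimal
    expected makespan of the first i tasks, with a checkpoint taken at
    the end of each chosen segment."""
    n = len(tasks)
    prefix = [0.0] + list(accumulate(tasks))
    dp = [0.0] + [math.inf] * n
    choice = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(i):
            cand = dp[j] + expected_time(prefix[i] - prefix[j], lam, C, R)
            if cand < dp[i]:
                dp[i], choice[i] = cand, j
    # Recover the positions after which a checkpoint is taken.
    cuts, i = [], n
    while i > 0:
        cuts.append(i)
        i = choice[i]
    return dp[n], sorted(cuts)
```

With a very low failure rate the program groups all tasks into a single checkpointed segment; as the rate grows, checkpoints become more frequent.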
We investigated the execution of tree-shaped task graphs using multiple processors. Each edge of such a tree represents a large I/O file. A task can only be executed if all its input and output files fit into memory, and a file can only be removed from memory after it has been consumed. Such trees arise, for instance, in the multifrontal method of sparse matrix factorization. The maximum amount of memory needed depends on the execution order of the tasks. With one processor, the objective of the tree traversal is to minimize the required memory. This problem has been well studied, and optimal polynomial-time algorithms have been proposed. We extended the problem by considering multiple processors, which is of obvious interest in the application area of matrix factorization. With multiple processors comes the additional objective of minimizing the time needed to traverse the tree, i.e., the makespan. Not surprisingly, this problem proves to be much harder than the sequential one. We studied its computational complexity and provided an inapproximability result, even for unit-weight trees. We proposed several heuristics, each with a different optimization focus, and analyzed them in an extensive experimental evaluation using realistic trees.
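The role of the traversal order in the sequential (one-processor) case can be illustrated with a small simulation of the file model described above: each edge carries a file, a task needs all its input files plus its own output file in memory, and inputs are freed once consumed. The tree and file sizes below are hypothetical:

```python
def peak_memory(children, out, order):
    """Simulate a postorder traversal of a task tree: out[v] is the size
    of the file produced by v, children[v] lists its children. A task
    needs its inputs plus its output resident; inputs are freed after
    the task completes. Returns the peak memory of this traversal."""
    in_mem = 0.0   # total size of files currently resident
    peak = 0.0
    held = {}      # node -> size of its output file, while resident
    for v in order:
        inputs = sum(held.pop(c) for c in children[v])
        in_mem += out[v]               # allocate the output file
        peak = max(peak, in_mem)       # inputs + output are all resident
        in_mem -= inputs               # free the consumed input files
        held[v] = out[v]
    return peak

# Hypothetical tree: root r has a leaf child a and a child b whose
# own child c produces a large file.
children = {'a': [], 'c': [], 'b': ['c'], 'r': ['a', 'b']}
out = {'a': 2, 'c': 5, 'b': 1, 'r': 1}
```

On this four-node example, processing the memory-hungry subtree first (order c, b, a, r) gives a peak of 6, whereas processing the small leaf first (a, c, b, r) gives 8: the traversal order alone changes the memory requirement.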
In this work, we studied the complexity of traversing workflows whose tasks require large I/O files. Such workflows arise in many scientific fields, such as image processing, genomics, or geophysical simulations. They usually exhibit some regularity, and most of them can be modeled as series-parallel graphs. We target a classical two-level memory system, where the main memory is faster but smaller than the secondary memory. A task in the workflow can be processed if all its predecessors have been processed and if its input and output files fit in the currently available main memory. The amount of available memory at a given time depends upon the order in which the tasks are executed. We focus on the problem of minimizing the amount of main memory needed to process the whole DAG.
We first concentrate on the parallel composition of task chains, or fork-join graphs. We adapt an algorithm designed for trees by Liu. We prove that an optimal schedule for a fork-join graph can be split into two optimal tree schedules, which are obtained using Liu's algorithm. We then move to series-parallel graphs and propose a recursive adaptation of the previous algorithm, which consists in serializing every parallel composition, starting from the innermost, using the fork-join algorithm. Simulations show that this algorithm always reaches the optimal performance, and we provide a sketch of the optimality proof. We also study compositions of complete bipartite graphs, which are another important class of DAGs arising in scientific workflows. We propose an optimal algorithm for a class of compositions that we name towers of complete bipartite graphs.
Divisible Load Theory (DLT) has received a lot of attention in the past decade. A divisible load is a perfectly parallel task that can be split arbitrarily and executed in parallel on a set of possibly heterogeneous resources. The success of DLT is strongly related to the existence of many optimal resource allocation and scheduling algorithms, which strongly differs from general scheduling theory. Moreover, close relationships have recently been underlined between DLT, which provides a fruitful theoretical framework for scheduling jobs on heterogeneous platforms, and MapReduce, which provides a simple and efficient programming framework to deploy applications on large-scale distributed platforms.
The success of both frameworks has suggested extending them to tasks of non-linear complexity. We show that both DLT and MapReduce are better suited to workloads with linear complexity. In particular, we prove that divisible load theory cannot be directly applied to quadratic workloads, as has recently been proposed. We precisely state the limits of classical DLT studies, and we review and propose solutions based on a careful preparation of the dataset and on clever data partitioning algorithms. In particular, through simulations, we show the possible impact of this approach on the volume of communications generated by MapReduce, in the context of matrix multiplication and outer product algorithms.
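The contrast between linear and quadratic workloads can be seen on a toy example: with linear cost and no communication, chunk sizes proportional to processor speeds make all processors finish simultaneously, which is the classical DLT closed form; the same split applied to a quadratic cost function no longer equalizes finish times. Speeds and workload values below are illustrative:

```python
def linear_split(W, speeds):
    """Optimal divisible-load split for a linear-cost workload with no
    communication: chunks proportional to speeds, so that all
    processors finish at the same time."""
    total = sum(speeds)
    return [W * s / total for s in speeds]

def finish_times(chunks, speeds, cost=lambda a: a):
    """Completion time of each processor when processing a chunk of
    cost cost(chunk) at its own speed."""
    return [cost(a) / s for a, s in zip(chunks, speeds)]

# Hypothetical platform: three processors of speeds 1, 2, 2.
chunks = linear_split(100.0, [1.0, 2.0, 2.0])
ft_lin = finish_times(chunks, [1.0, 2.0, 2.0])            # all equal
ft_quad = finish_times(chunks, [1.0, 2.0, 2.0],
                       cost=lambda a: a * a)              # unbalanced
```

The proportional split equalizes the linear finish times but leaves the quadratic ones widely apart, which is the core reason why the linear DLT closed forms do not carry over to quadratic workloads.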
We consider a task graph mapped onto a set of homogeneous processors. We aim at minimizing the energy consumption while enforcing two constraints: a prescribed bound on the execution time (or makespan), and a reliability threshold. Dynamic voltage and frequency scaling (DVFS) is an approach frequently used to reduce the energy consumption of a schedule, but slowing down the execution of a task to save energy decreases the reliability of the execution.
In this work, to improve the reliability of a schedule while reducing the energy consumption, we allow for the re-execution of some tasks. We assess the complexity of the tri-criteria scheduling problem (makespan, reliability, energy) of deciding which tasks to re-execute, and at which speed each execution of a task should be run, with two different speed models: either processors can take arbitrary speeds (continuous model), or a processor can run at a finite number of different speeds and change its speed during a computation (VDD model). We propose several novel tri-criteria scheduling heuristics under the continuous speed model and evaluate them through a set of simulations. The two best heuristics turn out to be very efficient and complementary.
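A common way to formalize this trade-off in the DVFS literature models the dynamic energy of a task of weight w run at speed s as w*s^2, and lets the transient-fault rate grow exponentially as the speed decreases. The sketch below uses this standard model with hypothetical parameter values (lam0, d, and the speed bounds are assumptions, not the values of this work):

```python
import math

def exec_time(w, s):
    return w / s

def energy(w, s):
    # dynamic energy: power s^3 during w/s time units
    return w * s ** 2

def reliability(w, s, lam0=1e-5, d=2.0, smax=1.0, smin=0.2):
    """Transient-fault model common in the DVFS literature: the fault
    rate increases exponentially as the speed decreases."""
    lam = lam0 * 10 ** (d * (smax - s) / (smax - smin))
    return math.exp(-lam * w / s)

def with_reexecution(w, s1, s2):
    """Reliability, worst-case time, and energy when a task run at s1
    is re-executed at s2 on failure (both executions accounted for)."""
    r = 1.0 - (1.0 - reliability(w, s1)) * (1.0 - reliability(w, s2))
    return r, exec_time(w, s1) + exec_time(w, s2), energy(w, s1) + energy(w, s2)
```

Under this model, two slow executions can consume less energy than one full-speed execution while achieving a higher reliability than a single slow one, which is exactly the degree of freedom the tri-criteria heuristics exploit.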
We consider the problem of scheduling an application on a parallel computational platform. The application is a particular task graph: either a linear chain of tasks, or a set of independent tasks. The platform is made of identical processors whose speed can be dynamically modified. It is also subject to failures: if a processor is slowed down to decrease the energy consumption, it has a higher chance of failing. Therefore, the scheduling problem requires tasks to be re-executed or replicated (i.e., the same task is executed twice, either on the same processor or on two distinct processors) in order to increase reliability. It is a tri-criteria problem: the goal is to minimize the energy consumption, while enforcing a bound on the total execution time (the makespan) and a constraint on the reliability of each task.
Our main contribution is to propose approximation algorithms for these particular classes of task graphs. For linear chains, we design a fully polynomial time approximation scheme. However, we show that there exists no constant factor approximation algorithm for independent tasks, unless P=NP, and we are able in this case to propose an approximation algorithm with a relaxation on the makespan constraint.
We study the problem of replica placement in tree networks subject to server capacity and distance constraints. The client requests are known beforehand, while the number and location of the servers are to be determined. The Single policy enforces that all requests of a client are served by a single server in the tree, while in the Multiple policy, the requests of a given client can be processed by multiple servers, thus distributing the processing of requests over the platform. For the Single policy, we prove that all instances of the problem are NP-hard, and we propose approximation algorithms. The problem with the Multiple policy was known to be NP-hard with distance constraints, but we provide a polynomial time optimal algorithm to solve the problem in the particular case of binary trees when no request exceeds the server capacity.
We tackle pipeline workflow applications that are executed on a distributed platform with setup times. In such applications, several computation stages are interconnected as a linear application graph; each stage holds a buffer of limited size where intermediate results are stored, and a processor setup time occurs when passing from one stage to another. The considered stage/processor mapping strategy is based on interval mappings, where an interval of consecutive stages is performed by the same processor, and the objective is throughput optimization. Typical examples of this kind of application are streaming applications, such as audio and video coding or decoding, or image processing using co-processing devices such as FPGAs. Even when setup times are neglected, the problem is NP-hard on heterogeneous platforms, and we therefore restrict ourselves to homogeneous resources. We provide an optimal algorithm for configurations with identical buffer capacities. When buffer sizes are not fixed, we deal with the problem of allocating the buffers in shared memory and present a
We study the problem of minimum makespan scheduling when tasks are restricted to subsets of the processors (resource constraints) and require either one or multiple distinct processors to be executed (parallel tasks). This problem is related to the minimum makespan scheduling problem on unrelated machines, as well as to the concurrent job shop problem, and it amounts to finding a semi-matching in bipartite graphs or hypergraphs. While the problem was known to be NP-complete for weighted bipartite graphs, but solvable in polynomial time for unweighted graphs (i.e., unit tasks), we prove that it is NP-complete for hypergraphs even in the unweighted case. We design several greedy algorithms of low complexity to solve two versions of the problem, and assess their performance through a set of exhaustive simulations. Even though these linear-time algorithms come with no approximation guarantee, they return solutions close to the optimal (or to a known lower bound) on average.
We present an iterative algorithm which asymptotically scales the
We discuss efficient shared-memory parallelization of sparse matrix computations whose main traits resemble those of the sparse matrix-vector multiply operation. Such computations are difficult to parallelize because of their relatively small computational granularity, characterized by a small number of operations per data access. Our main application is a sparse matrix scaling algorithm, which is even more memory-bound than the sparse matrix-vector multiplication operation. We take the application and parallelize it using standard OpenMP programming principles. Apart from the common race-condition-avoiding constructs, we do not reorganize the algorithm. Rather, we identify associated performance metrics and describe models to optimize them. Using these models, we implement parallel matrix scaling algorithms for two well-known sparse matrix storage formats. Experimental results show that simple parallelization attempts which leave data/work partitioning to the runtime scheduler can suffer from the overhead of avoiding race conditions, especially when the number of threads increases. The proposed algorithms perform better than these attempts by optimizing the identified performance metrics and reducing the overhead.
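The scaling kernel itself can be sketched, sequentially, as an alternating row/column normalization in the spirit of Sinkhorn-Knopp; this is a simplified stand-in for the actual algorithm, and the dense loops below are only for clarity (the real code works on sparse storage formats and is parallelized with OpenMP):

```python
def scale_doubly_stochastic(A, iters=50):
    """Alternately normalize the row and column sums of a nonnegative
    square matrix A toward 1, returning diagonal scaling factors r, c
    such that r[i] * A[i][j] * c[j] is (nearly) doubly stochastic."""
    n = len(A)
    r = [1.0] * n
    c = [1.0] * n
    for _ in range(iters):
        for i in range(n):                          # row sweep
            s = sum(A[i][j] * c[j] for j in range(n))
            r[i] = 1.0 / s if s else 0.0
        for j in range(n):                          # column sweep
            s = sum(r[i] * A[i][j] for i in range(n))
            c[j] = 1.0 / s if s else 0.0
    return r, c
```

Each sweep reads every nonzero once while doing a single multiply-add per entry, which is why the computation is memory-bound in the same way as a sparse matrix-vector product.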
In a technical report, we investigate the push-relabel algorithm for solving the problem of finding a maximum cardinality matching in a bipartite graph, in the context of the maximum transversal problem. We describe in detail an optimized yet easy-to-implement version of the algorithm and fine-tune its parameters. We also introduce new performance-enhancing techniques. On a wide range of real-world instances, we compare the push-relabel algorithm with state-of-the-art augmenting-path-based algorithms and the recently proposed pseudoflow approach. We conclude that a carefully tuned push-relabel algorithm is competitive with all known augmenting-path-based algorithms, and superior to the pseudoflow-based ones. We finalized this work by reporting the most important results in a journal article.
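For reference, the augmenting-path baseline that push-relabel is compared against can be sketched in a few lines (a simple Kuhn-style search, not the optimized implementations evaluated in this work):

```python
def max_bipartite_matching(adj, n_right):
    """Basic augmenting-path algorithm for maximum cardinality bipartite
    matching: adj[u] lists the right-vertices adjacent to left-vertex u.
    Returns the size of a maximum matching."""
    match_r = [-1] * n_right   # right-vertex -> matched left-vertex

    def try_augment(u, seen):
        # Depth-first search for an augmenting path starting at u.
        for v in adj[u]:
            if v in seen:
                continue
            seen.add(v)
            if match_r[v] == -1 or try_augment(match_r[v], seen):
                match_r[v] = u
                return True
        return False

    return sum(try_augment(u, set()) for u in range(len(adj)))
```

Each left vertex triggers one search, giving the O(V·E) bound that the tuned push-relabel and pseudoflow codes improve upon in practice.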
We investigate one-dimensional partitioning of sparse matrices under a given ordering of the rows/columns. The partitioning constraint is to have load balance across processors when different parts are assigned to different processors. The load is defined as the number of rows, columns, or nonzeros assigned to a processor. The partitioning objective is to optimize different functions, including the well-known total communication volume arising in a distributed-memory implementation of parallel sparse matrix-vector multiplication. The difference between the problem addressed in this work and the general sparse matrix partitioning problem is that the parts must correspond to disjoint intervals of the given order. Whereas the partitioning problem without the interval constraint corresponds to the NP-complete hypergraph partitioning problem, the restricted problem corresponds to a polynomial-time-solvable variant of it. We adapt an existing dynamic programming algorithm designed for graphs to solve two related partitioning problems in graphs. We then propose graph models for a given hypergraph and a partitioning objective function so that the standard cutsize definition in the graph model exactly corresponds to the hypergraph partitioning objective function. In extensive experiments, we show that the proposed algorithm is helpful in practice. It even demonstrates performance superior to standard hypergraph partitioners when the number of parts is high.
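The interval constraint is what makes the problem tractable: with the bottleneck (load-balance) objective, splitting an ordered sequence of row loads into p consecutive intervals admits a straightforward dynamic program, sketched below. The published algorithms are more refined; the loads here are illustrative:

```python
from itertools import accumulate

def min_max_interval_partition(loads, p):
    """Split the ordered sequence `loads` into p consecutive intervals
    minimizing the maximum interval load (the bottleneck processor).
    dp[k][i] = best bottleneck for the first i items in k intervals."""
    n = len(loads)
    prefix = [0] + list(accumulate(loads))
    INF = float("inf")
    dp = [[INF] * (n + 1) for _ in range(p + 1)]
    dp[0][0] = 0
    for k in range(1, p + 1):
        for i in range(1, n + 1):
            for j in range(i):
                # last interval covers items j..i-1
                cost = max(dp[k - 1][j], prefix[i] - prefix[j])
                if cost < dp[k][i]:
                    dp[k][i] = cost
    return dp[p][n]
```

For loads [2, 3, 4, 1] and p = 2, the best split is [2, 3 | 4, 1] with bottleneck 5; without the interval constraint the problem would be an NP-complete partitioning problem.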
The elimination tree model for sparse unsymmetric matrices and an algorithm for constructing it have recently been proposed.
The construction algorithm has a worst-case time complexity of
We study the adaptation of a parallel distributed-memory solver, MUMPS, into a shared-memory code, targeting multicore architectures. An advantage of adapting the code rather than starting with a new design is to fully benefit from its numerical kernels and functionalities. We show how one can take advantage of OpenMP directives and of existing libraries optimized for shared-memory environments, in our case BLAS libraries. We have also started to study approaches that take advantage of the specificities of NUMA architectures.
Matrices coming from elliptic PDEs have been shown to have a low-rank property. Although the dense internal data structures involved in a multifrontal method, the so-called frontal matrices or fronts, are full-rank, their off-diagonal blocks can be approximated by low-rank products. We have studied a low-rank format called Block Low-Rank (BLR) and explained how it can be used to reduce the memory footprint and complexity of both the factorization and solve phases, depending on the way variables are grouped. The proposed approach can be used either to accelerate the factorization and solution phases or to build a preconditioner. We have started the development of a version of MUMPS that exploits such properties. This work is in collaboration with EDF (contract funding for the Ph.D. thesis of C. Weisbecker at INPT) and C. Ashcraft (LSTC).
We have worked on the parallel computation of several entries of the inverse of a large sparse matrix. We assume that the matrix has already been factorized by a direct method and that the factors are distributed. Entries are efficiently computed by exploiting sparsity of the right-hand sides and the solution vectors in the triangular solution phase. We demonstrate that in this setting, parallelism and computational efficiency are two contrasting objectives. We develop an efficient approach and show its efficacy by runs using the MUMPS code that implements a parallel multifrontal method.
We have studied the memory scalability of the parallel multifrontal factorization of sparse matrices. In particular, we are interested in controlling the active memory specific to the multifrontal factorization. We illustrate why commonly used mapping strategies (e.g., proportional mapping) cannot achieve high memory efficiency. We propose a class of “memory-aware” algorithms that aim at maximizing performance under given memory constraints, and explain why they provide reliable memory estimates and thus a more robust solver. We study these issues in the context of the MUMPS solver, in which new experimental static scheduling strategies have been implemented and evaluated on large matrices.
The ANR White Project Rescue was launched in November 2010, for a duration of 48 months. It gathers three Inria partners (Roma, Grand-Large and Hiepacs) and is led by Roma. The main objective of the project is to develop new algorithmic techniques and software tools to solve the exascale resilience problem. Solving this problem implies a departure from current approaches, and calls for yet-to-be-discovered algorithms, protocols and software tools.
This proposed research follows three main research thrusts. The first thrust deals with novel checkpoint protocols. The second thrust entails the development of novel execution models, i.e., accurate stochastic models to predict (and, in turn, optimize) the expected performance (execution time or throughput) of large-scale parallel scientific applications. In the third thrust, we will develop novel parallel algorithms for scientific numerical kernels.
The Aloha associate-team is a joint project of the Roma team and of the Information and Computer Sciences Department of the University of Hawai`i (UH) at Mānoa, Honolulu, USA. Building on a vast array of theoretical techniques and expertise developed in the field of parallel and distributed computing, and more particularly in application scheduling, we tackle database questions from a fresh perspective. To this end, the project brings together:
a group that specializes in database systems research and that has both industrial and academic experience: the group of Lipyeow Lim (UH);
a group that specializes in practical aspects of scheduling problems and in simulation for emerging platforms and applications, and that has long experience of multidisciplinary research: the group of Henri Casanova (UH);
a group that specializes in the theoretical aspects of scheduling problems and resource management: the Roma team.
The research work focuses on the following three thrusts:
Online, multi-criteria query optimization
Fault-Tolerance for distributed databases
Query scheduling for distributed databases
Oliver Sinnen, senior lecturer at the Department of Electrical and Computer Engineering (ECE) of the University of Auckland, New Zealand, visited the Roma team for three months (April-June, 2012). He worked with Loris Marchal and Frédéric Vivien on scheduling tree-shaped task graphs to minimize both the peak memory usage and the makespan (see Section ).
The University of Pittsburgh (Rami Melhem), the Roma team (Yves Robert and Frédéric Vivien) and the University of Hawai'i at Manoa (Henri Casanova) have organized a workshop in Pittsburgh, on June 28-30, 2012. The workshop focused on scheduling and algorithms for large-scale systems. This was the seventh edition of this workshop series, after Aussois in August 2004, San Diego in November 2005, Aussois in May 2008, Knoxville in May 2009, Aussois in May 2010, and Aussois in May 2011. The next workshop will be held in Schloss Dagstuhl in September 2013.
is an associate editor of the Journal of Parallel and Distributed Computing (JPDC). She was program vice co-chair of IEEE Cluster 2012, program vice-chair of IEEE AINA 2012, a member of the organizing committee of SIAM PP 2012, and the organizer of a mini-symposium at SIAM PP12. She is workshops co-chair of ICPP 2013. She is or was a member of the program committees of the following conferences and workshops: CCGrid 2012, HPDC 2012, IPDPS 2012, IPCE 2013, CCGrid 2013, IPDPS 2013, CLOSER 2013, HCW 2013, IGCC 2013.
was a member of the program committees of Vecpar'12, Renpar'13.
was a member of the program committees of IPDPS 2012, ICPP 2012, and IPDPS 2013.
is an associate editor of IJHPCA, IJGUC and JOCS. He is Program Chair of ICPP 2013 (Int. Conference on Parallel Processing) and of HiPC 2013 (Int. Conference on High Performance Computing). He was Program vice-chair of HiPC 2012. He is a Steering Committee member of IPDPS and HCW. He is or was a member of the program committees of the following conferences and workshops: EduPar 2012, FTXS 2012, ISC 2012, ISCIS 2012, EduPar 2013, FTXS 2013, ICCS 2013, IGCC 2013, ISC 2013, and SC 2013.
was a member of the program committees of six conferences/workshops in 2012 (HiPC 2012, IEEE Cluster 2012, Euro-Par 2012, PCO'12, PMAA 2012, TCPP PhD Forum). He was an organizer of a mini-symposium at SIAM PP12. He is on the program committee of IPDPS 2013.
is an associate editor of Parallel Computing. He is program vice-chair, for the algorithms track, of IPDPS 2013. He is or was a member of the program committees of the following conferences and workshops: EuroPDP 2013, RenPar 2013, ROADEF 2013, SC 13, CCGrid 2012, Cluster 2012, ROADEF 2012, AHPAA 2012.
Loris Marchal, Ordonnancement, 36h, M2, École normale supérieure de Lyon, France.
Bora Uçar gave a mini-course at the Parallel Computing Group of the University of Murcia, on 28 and 29 November 2011.
Frédéric Vivien, Algorithmique et Programmation Parallèles, 36 h, M1, École normale supérieure de Lyon, France.
Frédéric Vivien, Ordonnancement, 3 h, M2, École normale supérieure de Lyon, France.
PhD: Paul Renaud-Goud, Energy-aware scheduling: complexity and algorithms, École Normale Supérieure de Lyon, July 5, 2012, Anne Benoit and Yves Robert.
PhD in progress: Guillaume Aupy, Multi-criteria scheduling on volatile platforms, September 1, 2011, Anne Benoit and Yves Robert.
PhD in progress: Dounia Zaidouni, Performance and execution models for exascale applications in failure-prone environments, October 1, 2011, Frédéric Vivien and Yves Robert.
PhD in progress: Mohamed Sid-Lakhdar, Exploitation of multicore architectures in the resolution of sparse linear systems by multifrontal methods, October 1, 2011, Jean-Yves L'Excellent and Frédéric Vivien.
PhD in progress: Julien Herrmann, Numerical algorithms for large-scale platforms, September 1, 2012, Loris Marchal and Yves Robert.