The Roma project aims at designing models, algorithms, and scheduling strategies to optimize the execution of scientific applications.

Scientists now have access to tremendous computing power. For instance, the four most powerful computing platforms in the TOP 500 list each includes more than 500,000 cores and deliver a sustained performance of more than 10 Peta FLOPS. The volunteer computing platform BOINC is another example with more than 440,000 enlisted computers and, on average, an aggregate performance of more than 9 Peta FLOPS. Furthermore, it had never been so easy for scientists to have access to parallel computing resources, either through the multitude of local clusters or through distant cloud computing platforms.

Because parallel computing resources are ubiquitous, and because the available computing power is so huge, one could believe that scientists no longer need to worry about finding computing resources, even less to optimize their usage. Nothing is farther from the truth. Institutions and government agencies keep building larger and more powerful computing platforms with a clear goal. These platforms must allow to solve problems in reasonable timescales, which were so far out of reach. They must also allow to solve problems more precisely where the existing solutions are not deemed to be sufficiently accurate. For those platforms to fulfill their purposes, their computing power must therefore be carefully exploited and not be wasted. This often requires an efficient management of all types of platform resources: computation, communication, memory, storage, energy, etc. This is often hard to achieve because of the characteristics of new and emerging platforms. Moreover, because of technological evolutions, new problems arise, and fully tried and tested solutions need to be thoroughly overhauled or simply discarded and replaced. Here are some of the difficulties that have, or will have, to be overcome:

computing platforms are hierarchical: a processor includes several cores, a node includes several processors, and the nodes themselves are gathered into clusters. Algorithms must take this hierarchical structure into account, in order to fully harness the available computing power;

the probability for a platform to suffer from a hardware fault automatically increases with the number of its components. Fault-tolerance techniques become unavoidable for large-scale platforms;

the ever increasing gap between the computing power of nodes and the bandwidths of memories and networks, in conjunction with the organization of memories in deep hierarchies, requires to take more and more care of the way algorithms use memory;

energy considerations are unavoidable nowadays. Design specifications for new computing platforms always include a maximal energy consumption. The energy bill of a supercomputer may represent a significant share of its cost over its lifespan. These issues must be taken into account at the algorithm-design level.

We are convinced that dramatic breakthroughs in algorithms and scheduling strategies are required for the scientific computing community to overcome all the challenges posed by new and emerging computing platforms. This is required for applications to be successfully deployed at very large scale, and hence for enabling the scientific computing community to push the frontiers of knowledge as far as possible. The Roma project-team aims at providing fundamental algorithms, scheduling strategies, protocols, and software packages to fulfill the needs encountered by a wide class of scientific computing applications, including domains as diverse as geophysics, structural mechanics, chemistry, electromagnetism, numerical optimization, or computational fluid dynamics, to quote a few. To fulfill this goal, the Roma project-team takes a special interest in dense and sparse linear algebra.

The work in the Roma team is organized along three research themes.

**Algorithms for probabilistic environments.** In this
theme, we consider problems where some of the platform
characteristics, or some of the application characteristics, are
described by probability distributions. This is in particular the case when
considering the resilience of applications in failure-prone
environments: the possibility of faults is modeled by probability distributions.

**Platform-aware scheduling strategies.** In this theme, we
focus on the design of scheduling strategies that finely take into
account some platform characteristics beyond the most classical
ones, namely the computing speed of processors and accelerators,
and the communication bandwidth of network links. In the scope of
this theme, when designing scheduling strategies, we focus
either on the energy consumption or on the memory behavior. All
optimization problems under study are multi-criteria.

**High-performance computing and linear algebra.** We work
on algorithms and tools for both sparse and dense linear algebra. In
sparse linear algebra, we work on most aspects of direct multifrontal
solvers for linear systems. In dense linear algebra, we focus on the
adaptation of factorization kernels to emerging and future
platforms. In addition, we also work on combinatorial scientific
computing, that is, on the design of combinatorial algorithms and
tools to solve combinatorial problems, such as those
encountered, for instance, in the preprocessing phases of solvers of
sparse linear systems.

There are two main research directions under this research theme. In the first one, we consider the problem of the efficient execution of applications in a failure-prone environment. Here, probability distributions are used to describe the potential behavior of computing platforms, namely when hardware components are subject to faults. In the second research direction, probability distributions are used to describe the characteristics and behavior of applications.

An application is resilient if it can successfully produce a correct result in spite of potential faults in the underlying system. Application resilience can involve a broad range of techniques, including fault prediction, error detection, error containment, error correction, checkpointing, replication, migration, recovery, etc. Faults are quite frequent in the most powerful existing supercomputers. The Jaguar platform, which ranked third in the TOP 500 list in November 2011 , had an average of 2.33 faults per day during the period from August 2008 to February 2010 . The mean-time between faults of a platform is inversely proportional to its number of components. Progresses will certainly be made in the coming years with respect to the reliability of individual components. However, designing and building high-reliability hardware components is far more expensive than using lower reliability top-of-the-shelf components. Furthermore, low-power components may not be available with high-reliability. Therefore, it is feared that the progresses in reliability will far from compensate the steady projected increase of the number of components in the largest supercomputers. Already, application failures have a huge computational cost. In 2008, the DARPA white paper on “System resilience at extreme scale” stated that high-end systems wasted 20% of their computing capacity on application failure and recovery.

In such a context, any application using a significant fraction of a supercomputer and running for a significant amount of time will have to use some fault-tolerance solution. It would indeed be unacceptable for an application failure to destroy centuries of CPU-time (some of the simulations run on the Blue Waters platform consumed more than 2,700 years of core computing time and lasted over 60 hours; the most time-consuming simulations of the US Department of Energy (DoE) run for weeks to months on the most powerful existing platforms ).

Our research on resilience follows two different directions. On the one hand we design new resilience solutions, either generic fault-tolerance solutions or algorithm-based solutions. On the other hand we model and theoretically analyze the performance of existing and future solutions, in order to tune their usage and help determine which solution to use in which context.

Static scheduling algorithms are algorithms where all decisions are taken before the start of the application execution. On the contrary, in non-static algorithms, decisions may depend on events that happen during the execution. Static scheduling algorithms are known to be superior to dynamic and system-oriented approaches in stable frameworks , , , , that is, when all characteristics of platforms and applications are perfectly known, known a priori, and do not evolve during the application execution. In practice, the prediction of application characteristics may be approximative or completely infeasible. For instance, the amount of computations and of communications required to solve a given problem in parallel may strongly depend on some input data that are hard to analyze (this is for instance the case when solving linear systems using full pivoting).

We plan to consider applications whose characteristics change dynamically and are subject to uncertainties. In order to benefit nonetheless from the power of static approaches, we plan to model application uncertainties and variations through probabilistic models, and to design for these applications scheduling strategies that are either static, or partially static and partially dynamic.

In this theme, we study and design scheduling strategies, focusing either on energy consumption or on memory behavior. In other words, when designing and evaluating these strategies, we do not limit our view to the most classical platform characteristics, that is, the computing speed of cores and accelerators, and the bandwidth of communication links.

In most existing studies, a single optimization objective is considered, and the target is some sort of absolute performance. For instance, most optimization problems aim at the minimization of the overall execution time of the application considered. Such an approach can lead to a very significant waste of resources, because it does not take into account any notion of efficiency nor of yield. For instance, it may not be meaningful to use twice as many resources just to decrease by 10% the execution time. In all our work, we plan to look only for algorithmic solutions that make a “clever” usage of resources. However, looking for the solution that optimizes a metric such as the efficiency, the energy consumption, or the memory-peak minimization, is doomed for the type of applications we consider. Indeed, in most cases, any optimal solution for such a metric is a sequential solution, and sequential solutions have prohibitive execution times. Therefore, it becomes mandatory to consider multi-criteria approaches where one looks for trade-offs between some user-oriented metrics that are typically related to notions of Quality of Service—execution time, response time, stretch, throughput, latency, reliability, etc.—and some system-oriented metrics that guarantee that resources are not wasted. In general, we will not look for the Pareto curve, that is, the set of all dominating solutions for the considered metrics. Instead, we will rather look for solutions that minimize some given objective while satisfying some bounds, or “budgets”, on all the other objectives.

Energy-aware scheduling has proven an important issue in the past decade, both for economical and environmental reasons. Energy issues are obvious for battery-powered systems. They are now also important for traditional computer systems. Indeed, the design specifications of any new computing platform now always include an upper bound on energy consumption. Furthermore, the energy bill of a supercomputer may represent a significant share of its cost over its lifespan.

Technically, a processor running at speed *reliability* due to the energy efficiency, different models have
been proposed for fault tolerance: (i) *re-execution* consists in
re-executing a task that does not meet the reliability constraint ; (ii) *replication* consists in executing the
same task on several processors simultaneously, in order to meet the
reliability constraints ; and (iii)
*checkpointing* consists in “saving” the work done at some
certain instants, hence reducing the amount of work lost
when a failure occurs .

Energy issues must be taken into account at all levels, including the algorithm-design level. We plan to both evaluate the energy consumption of existing algorithms and to design new algorithms that minimize energy consumption using tools such as resource selection, dynamic frequency and voltage scaling, or powering-down of hardware components.

For many years, the bandwidth between memories and processors
has increased more slowly than the computing power of processors, and the
latency of memory accesses has been improved at an even slower pace.
Therefore, in the time needed for a processor to perform a floating
point operation, the amount of data transferred between the memory and the
processor has been decreasing
with each passing year. The risk is for
an application to reach a point where the time needed to solve a
problem is no longer dictated by the processor computing power but by
the memory characteristics, comparable to the *memory wall* that
limits CPU performance. In such a case, processors would be greatly
under-utilized, and a large part of the computing power of the platform
would be wasted. Moreover, with the advent of multicore processors,
the amount of memory per core has started to stagnate, if not to
decrease. This is especially harmful to memory intensive
applications. The problems related to the sizes and the bandwidths of
memories are further exacerbated on modern computing platforms because
of their deep and highly heterogeneous hierarchies. Such a hierarchy
can extend from core private caches to shared memory within a CPU, to disk
storage and even tape-based storage systems, like in the Blue Waters
supercomputer . It may also be the case that
heterogeneous cores are used (such as hybrid CPU and GPU computing),
and that each of them has a limited memory.

Because of these trends, it is becoming more and more important to precisely take memory constraints into account when designing algorithms. One must not only take care of the amount of memory required to run an algorithm, but also of the way this memory is accessed. Indeed, in some cases, rather than to minimize the amount of memory required to solve the given problem, one will have to maximize data reuse and, especially, to minimize the amount of data transferred between the different levels of the memory hierarchy (minimization of the volume of memory inputs-outputs). This is, for instance, the case when a problem cannot be solved by just using the in-core memory and that any solution must be out-of-core, that is, must use disks as storage for temporary data.

It is worth noting that the cost of moving data has lead to the development of so called “communication-avoiding algorithms” . Our approach is orthogonal to these efforts: in communication-avoiding algorithms, the application is modified, in particular some redundant work is done, in order to get rid of some communication operations, whereas in our approach, we do not modify the application, which is provided as a task graph, but we minimize the needed memory peak only by carefully scheduling tasks.

Our work on high-performance computing and linear algebra is organized along three research directions. The first direction is devoted to direct solvers of sparse linear systems. The second direction is devoted to combinatorial scientific computing, that is, the design of combinatorial algorithms and tools that solve problems encountered in some of the other research themes, like the problems faced in the preprocessing phases of sparse direct solvers. The last direction deals with the adaptation of classical dense linear algebra kernels to the architecture of future computing platforms.

The solution of sparse systems of linear equations (symmetric or unsymmetric, often with an irregular structure, from a few hundred thousand to a few hundred million equations) is at the heart of many scientific applications arising in domains such as geophysics, structural mechanics, chemistry, electromagnetism, numerical optimization, or computational fluid dynamics, to cite a few. The importance and diversity of applications are a main motivation to pursue research on sparse linear solvers. Because of this wide range of applications, any significant progress on solvers will have a significant impact in the world of simulation. Research on sparse direct solvers in general is very active for the following main reasons:

many applications fields require large-scale simulations that are still too big or too complicated with respect to today's solution methods;

the current evolution of architectures with massive, hierarchical, multicore parallelism imposes to overhaul all existing solutions, which represents a major challenge for algorithm and software development;

the evolution of numerical needs and types of simulations increase the importance, frequency, and size of certain classes of matrices, which may benefit from a specialized processing (rather than resort to a generic one).

Our research in the field is strongly related to the software package Mumps (see Section ). Mumps is both an experimental platform for academics in the field of sparse linear algebra, and a software package that is widely used in both academia and industry. The software package Mumps enables us to (i) confront our research to the real world, (ii) develop contacts and collaborations, and (iii) receive continuous feedback from real-life applications, which is extremely critical to validate our research work. The feedback from a large user community also enables us to direct our long-term objectives towards meaningful directions.

In this context, we aim at designing parallel sparse direct methods that will scale to large modern platforms, and that are able to answer new challenges arising from applications, both efficiently—from a resource consumption point of view—and accurately—from a numerical point of view. For that, and even with increasing parallelism, we do not want to sacrifice in any manner numerical stability, based on threshold partial pivoting, one of the main originalities of our approach (our “trademark”) in the context of direct solvers for distributed-memory computers; although this makes the parallelization more complicated, applying the same pivoting strategy as in the serial case ensures numerical robustness of our approach, which we generally measure in terms of sparse backward error. In order to solve the hard problems resulting from the always-increasing demands in simulations, special attention must also necessarily be paid to memory usage (and not only execution time). This requires specific algorithmic choices and scheduling techniques. From a complementary point of view, it is also necessary to be aware of the functionality requirements from the applications and from the users, so that robust solutions can be proposed for a wide range of applications.

Among direct methods, we rely on the multifrontal method , , . This method usually exhibits a good data locality and hence is efficient in cache-based systems. The task graph associated with the multifrontal method is in the form of a tree whose characteristics should be exploited in a parallel implementation.

Our work is organized along two main research directions. In the first one we aim at efficiently addressing new architectures that include massive, hierarchical parallelism. In the second one, we aim at reducing the running time complexity and the memory requirements of direct solvers, while controlling accuracy.

Combinatorial scientific computing (CSC) is a recently coined term (circa 2002) for interdisciplinary research at the intersection of discrete mathematics, computer science, and scientific computing. In particular, it refers to the development, application, and analysis of combinatorial algorithms to enable scientific computing applications. CSC's deepest roots are in the realm of direct methods for solving sparse linear systems of equations where graph theoretical models have been central to the exploitation of sparsity, since the 1960s. The general approach is to identify performance issues in a scientific computing problem, such as memory use, parallel speed up, and/or the rate of convergence of a method, and to develop combinatorial algorithms and models to tackle those issues.

Our target scientific computing applications are (i) the preprocessing phases of direct methods (in particular MUMPS), iterative methods, and hybrid methods for solving linear systems of equations; and (ii) the mapping of tasks (mostly the sub-tasks of the mentioned solvers) onto modern computing platforms. We focus on the development and use of graph and hypergraph models, and related tools such as hypergraph partitioning algorithms, to solve problems of load balancing and task mapping. We also focus on bipartite graph matching and vertex ordering methods for reducing the memory overhead and computational requirements of solvers. Although we direct our attention on these models and algorithms through the lens of linear system solvers, our solutions are general enough to be applied to some other resource optimization problems.

The quest for efficient, yet portable, implementations of dense linear algebra kernels (QR, LU, Cholesky) has never stopped, fueled in part by each new technological evolution. First, the LAPACK library relied on BLAS level 3 kernels (Basic Linear Algebra Subroutines) that enable to fully harness the computing power of a single CPU. Then the ScaLAPACK library built upon LAPACK to provide a coarse-grain parallel version, where processors operate on large block-column panels. Inter-processor communications occur through highly tuned MPI send and receive primitives. The advent of multi-core processors has led to a major modification in these algorithms , , . Each processor runs several threads in parallel to keep all cores within that processor busy. Tiled versions of the algorithms have thus been designed: dividing large block-column panels into several tiles allows for a decrease in the granularity down to a level where many smaller-size tasks are spawned. In the current panel, the diagonal tile is used to eliminate all the lower tiles in the panel. Because the factorization of the whole panel is now broken into the elimination of several tiles, the update operations can also be partitioned at the tile level, which generates many tasks to feed all cores.

The number of cores per processor will keep increasing in the following years. It is projected that high-end processors will include at least a few hundreds of cores. This evolution will require to design new versions of libraries. Indeed, existing libraries rely on a static distribution of the work: before the beginning of the execution of a kernel, the location and time of the execution of all of its component is decided. In theory, static solutions enable to precisely optimize executions, by taking parameters like data locality into account. At run time, these solutions proceed at the pace of the slowest of the cores, and they thus require a perfect load-balancing. With a few hundreds, if not a thousand, cores per processor, some tiny differences between the computing times on the different cores (“jitter”) are unavoidable and irremediably condemn purely static solutions. Moreover, the increase in the number of cores per processor once again mandates to increase the number of tasks that can be executed in parallel.

We study solutions that are part-static part-dynamic, because such solutions have been shown to outperform purely dynamic ones . On the one hand, the distribution of work among the different nodes will still be statically defined. On the other hand, the mapping and the scheduling of tasks inside a processor will be dynamically defined. The main difficulty when building such a solution will be to design lightweight dynamic schedulers that are able to guarantee both an excellent load-balancing and a very efficient use of data locality.

Sparse direct (multifrontal) solvers in distributed-memory environments have a wide range of applications as they are used at the heart of many numerical methods in simulation: whether a model uses finite elements or finite differences, or requires the optimization of a complex linear or nonlinear function, one often ends up solving a linear system of equations involving sparse matrices. There are therefore a number of application fields, among which some of the ones cited by the users of our sparse direct solver Mumps (see Section ) are: structural mechanics, biomechanics, medical image processing, tomography, geophysics, electromagnetism, fluid dynamics, econometric models, oil reservoir simulation, magneto-hydro-dynamics, chemistry, acoustics, glaciology, astrophysics, circuit simulation, and work on hybrid direct-iterative methods.

Mumps (for *MUltifrontal Massively Parallel Solver*)
see
http://

The latest public release is Mumps 4.10.0 (May 2011); the new release is scheduled for February 2015 and will be under the Cecill-C licence, following an agreement between CERFACS, CNRS, ENS Lyon, INPT, Inria and University of Bordeaux.

Yves Robert was awarded the 2014 IEEE Technical Committee on Scalable Computing (TCSC) Award for Excellence.

In October 2014, CERFACS, ENS Lyon, INPT, Inria and University of
Bordeaux launched a consortium around the software package MUMPS
(see http://

Several applications process queries expressed as trees of Boolean operators applied to predicates on sensor data streams, e.g., mobile apps and automotive apps. Sensor data must be retrieved from the sensors, which incurs a cost, e.g., an energy expense that depletes the battery of a mobile device, a bandwidth usage. The objective is to determine the order in which predicates should be evaluated so as to shortcut part of the query evaluation and minimize the expected cost. This problem has been studied assuming that each data stream occurs at a single predicate. In this work , we study the case in which a data stream occurs in multiple predicates, either when each predicate references a single stream or when a predicate can reference multiple streams. In the single-stream case we give an optimal algorithm for a single-level tree and show that the problem is NP-complete for DNF trees. For DNF trees we show that there exists an optimal predicate evaluation order that is depth-first, which provides a basis for designing a range of heuristics. In the multi-stream case we show that the problem is NP-complete even for single-level trees. As in the single stream case, for DNF trees we show that there exists a depth-first leaf evaluation order that is optimal and we design efficient heuristics.

Errors have become a critical problem for high performance computing. Checkpointing protocols
are often used for error recovery after fail-stop failures.
However, silent errors cannot be ignored, and their peculiarity is that such errors
are identified only when the corrupted data is activated. To cope with silent errors,
we need a verification mechanism
to check whether the application state is correct. Checkpoints should be supplemented with
verifications to detect silent errors. When a verification is successful, only the last checkpoint
needs to be kept in memory because it is known to be correct.
In this work (RR UT-EECS-14-729), we analytically determine the best balance of verifications and checkpoints
so as to optimize platform throughput. We introduce a balanced algorithm using a pattern with

In this work (RR-Inria-8599), we combine the traditional checkpointing and rollback recovery strategies with verification mechanisms to address both fail-stop and silent errors. The objective is to minimize either makespan or energy consumption. While DVFS is a popular approach for reducing the energy consumption, using lower speeds/voltages can increase the number of errors, thereby complicating the problem. We consider an application workflow whose dependence graph is a chain of tasks, and we study three execution scenarios: (i) a single speed is used during the whole execution; (ii) a second, possibly higher speed is used for any potential re-execution; (iii) different pairs of speeds can be used throughout the execution. For each scenario, we determine the optimal checkpointing and verification locations (and the optimal speeds for the third scenario) to minimize either objective. The different execution scenarios are then assessed and compared through an extensive set of experiments.

A significant percentage of the computing capacity of large-scale platforms is wasted due to interferences incurred by multiple applications that access a shared parallel file system concurrently. One solution to handling I/O bursts in large-scale HPC systems is to absorb them at an intermediate storage layer consisting of burst buffers. However, our analysis of the Argonne's Mira system shows that burst buffers cannot prevent congestion at all times. As a consequence, I/O performance is dramatically degraded, showing in some cases a decrease in I/O throughput of 67%. In this work (RR-Inria-8519), we analyze the effects of interference on application I/O bandwidth, and propose several scheduling techniques to mitigate congestion. We show through extensive experiments that our global I/O scheduler is able to reduce the effects of congestion, even on systems where burst buffers are used, and can increase the overall system throughput up to 56%. We also show that it outperforms current Mira I/O schedulers.

In this work (RR-Inria-8474), we revisit the well-studied problem of replica placement in tree
networks. Rather than minimizing the number of servers needed to serve all client requests,
we aim at minimizing the total power consumed by these servers. In addition, we use the
most general (and powerful) server assignment policy, where the requests of a client can be
served by multiple servers located in the (unique) path from this client to the root of the tree.
We consider multi-modal servers that can operate at a set of discrete speeds, using the
dynamic voltage and frequency scaling (DVFS) technique. The optimization problem is to determine an
optimal location of the servers in the tree, as well as the speed at which each server is operated.
A major result is the NP-completeness of this problem, to be contrasted with the minimization of the number of servers,
which has polynomial complexity. Another important contribution is the formulation of
a Mixed Integer Linear Program (MILP)
for the problem, together with the design of several polynomial-time heuristics.
We assess the efficiency of these heuristics
by simulation. For mid-size instances (up to 30 nodes in the tree),
we evaluate their absolute performance by comparison with the optimal solution (obtained via the MILP). The most efficient heuristics provide satisfactory results, within

Here, we extend the problem by considering multiple processors, which is of obvious interest in the application area of matrix factorization. With multiple processors comes the additional objective to minimize the time needed to traverse the tree, i.e., to minimize the makespan. Not surprisingly, this problem proves to be much harder than the sequential one. We study the computational complexity of this problem and provide inapproximability results even for unit weight trees. We design a series of practical heuristics achieving different trade-offs between the minimization of peak memory usage and makespan. Some of these heuristics are able to process a tree while keeping the memory usage under a given memory limit. The different heuristics are evaluated in an extensive experimental evaluation using realistic trees.

Scientific workloads are often described as directed acyclic task
graphs. In this work , we focus on the
multifrontal factorization of sparse matrices, whose task graph is
structured as a tree of parallel tasks. Among the existing models
for parallel tasks, the concept of *malleable* tasks is
especially powerful as it allows each task to be processed on a
time-varying number of processors. Following the model advocated by
Prasanna and Musicus , for matrix
computations, we consider malleable tasks whose speedup is

We have studied the complexity of traversing tree-shaped workflows whose tasks require large I/O files. We target a heterogeneous architecture with two resource types, each with a different memory, such as a multicore node equipped with a dedicated accelerator (FPGA or GPU). The tasks in the workflow are colored according to their type and can be processed if all there input and output files can be stored in the corresponding memory. The amount of used memory of each type at a given execution step strongly depends upon the ordering in which the tasks are executed, and upon when communications between both memories are scheduled. The objective is to determine an efficient traversal that minimizes the maximum amount of memory of each type needed to traverse the whole tree. In this study , we establish the complexity of this two-memory scheduling problem, and provide inapproximability results. In addition, we design several heuristics, based on both post-order and general traversals, and we evaluate them on a comprehensive set of tree graphs, including random trees as well as assembly trees arising in the context of sparse matrix factorizations.

The tremendous increase in the size and heterogeneity of supercomputers makes it very difficult to predict the performance of a scheduling algorithm. Therefore, dynamic solutions, where scheduling decisions are made at runtime have overpassed static allocation strategies. The simplicity and efficiency of dynamic schedulers such as Hadoop are a key of the success of the MapReduce framework. Dynamic schedulers such as StarPU, PaRSEC or StarSs are also developed for more constrained computations, e.g. task graphs coming from linear algebra. To make their decisions, these runtime systems make use of some static information, such as the distance of tasks to the critical path or the affinity between tasks and computing resources (CPU, GPU,. . .) and of dynamic information, such as where input data are actually located. In this study , we concentrate on two elementary linear algebra kernels, namely the outer product and the matrix multiplication. For each problem, we propose several dynamic strategies that can be used at runtime and we provide an analytic study of their theoretical performance. We prove that the theoretical analysis provides very good estimate of the amount of communications induced by a dynamic strategy and can be used in order to efficiently determine thresholds used in dynamic scheduler, thus enabling to choose among them for a given problem and architecture.

The classical redistribution problem aims at optimally scheduling communications when reshuffling from an initial data distribution to a target data distribution. This target data distribution is usually chosen to optimise some objective for the algorithmic kernel under study (good computational balance or low communication volume or cost), and therefore to provide high efficiency for that kernel. However, the choice of a distribution minimizing the target objective is not unique. This leads to generalizing the redistribution problem as follows: find a re-mapping of data items onto processors such that the data redistribution cost is minimal, and the operation remains as efficient. This work studies the complexity of this generalized problem. We compute optimal solutions and evaluate, through simulations, their gain over classical redistribution. We also show the NP-hardness of the problem to find the optimal data partition and processor permutation (defined by new subsets) that minimize the cost of redistribution followed by a simple computational kernel. Finally, experimental validation of the new redistribution algorithms are conducted on a multicore cluster, for both a 1D-stencil kernel and a more compute-intensive dense linear algebra routine.

We study the hierarchically structured bin packing problem . In this problem, the items to be packed into bins are at the leaves of a tree. The objective of the packing is to minimize the total number of bins into which the descendants of an internal node are packed, summed over all internal nodes. We investigate an existing algorithm and make a correction to the analysis of its approximation ratio. Further results regarding the structure of an optimal solution and a strengthened inapproximability result are given.

We discuss the use of hypergraph partitioning based methods in fill-reducing orderings of sparse matrices for Cholesky, LU and QR factorizations .
For the Cholesky factorization, we investigate a recent result on pattern-wise decomposition of sparse matrices, generalize the result, and develop algorithmic tools to obtain more effective ordering methods.
The generalized results help us formulate the fill-reducing ordering problem for LU factorization as we do for the Cholesky case, without ever symmetrizing the given matrix

We investigate the problem of partitioning finite difference meshes in two dimensions among the processors of a parallel computer . The objective is to achieve a perfect load balance while minimizing the communication cost. There are well-known graph, hypergraph, and geometry-based partitioning algorithms for this problem. The known geometric algorithms have linear running time and obtain the best results for very special mesh sizes and processor numbers. We propose another geometric algorithm. The proposed algorithm is linear; is applicable to much more cases than some well-known alternatives; obtains better results than the graph partitioning algorithms; obtains better results than the hypergraph partitioning algorithms almost always. Our algorithm also obtains better results than a known asymptotically-optimal algorithm for some small number of processors. We also catalog related theoretical results.

We present an iterative algorithm which asymptotically scales the

In the context of the Mumps sparse direct solver (see Section ), we worked in 2014 on: block-low-rank solvers and shared-memory parallelism , , hybrid (shared-distributed) parallelism and efficient collective communications in asynchronous environments , and scheduling strategies to decrease the memory-usage of multifrontal solvers. Quite significant performance gains have been obtained on up to 2000 cores of a Bullx DLC system (CALMIP mesocentre), some of the corresponding developments will be made available in the next release of our solver. We also worked on setting up a consortium of industrial users to fund engineers working on MUMPS (see Section ). These activities were done in collaboration with INP Toulouse and with CERFACS, CNRS, ENS Lyon, Univ. Bordeaux, EDF, LSTC (Livermore, California) and EMGS (Norway).

Related to the evolutions and support of the Mumps solver (see Section), we worked on:

setting up a consortium of industrial users to fund the project. Four membership contracts were signed this year by Altair, EDF, LSTC and Michelin.

a contract with EMGS (Norway) related to low-rank compression for geophysics applications; the contract is managed by INP Toulouse.

The ANR White Project Rescue was launched in November 2010,
for a duration of 48 months (and was later extended for 6 additional
months).
It gathers three Inria partners (Roma, Grand-Large and Hiepacs) and is led by Roma.
The main objective of the project is to develop new algorithmic techniques and software
tools to solve the *exascale resilience problem*. Solving this
problem implies a departure from current approaches,
and calls for yet-to-be-discovered algorithms, protocols and software tools.

This proposed research follows three main research thrusts. The first thrust deals with
novel *checkpoint protocols*. The second
thrust entails the development of novel *execution models*, i.e.,
accurate stochastic models to predict (and, in turn, optimize) the
expected performance (execution time or throughput) of large-scale
parallel scientific applications. In the third thrust, we will develop
novel *parallel algorithms* for scientific numerical kernels.

The ANR Project Solhar was launched in November 2013, for a duration of 48 months. It gathers five academic partners (the HiePACS, Cepage, Roma and Runtime Inria project-teams, and CNRS-IRIT) and two industrial partners (CEA/CESTA and EADS-IW). This project aims at studying and designing algorithms and parallel programming models for implementing direct methods for the solution of sparse linear systems on emerging computers equipped with accelerators.

The proposed research is organized along three distinct research thrusts. The first objective deals with linear algebra kernels suitable for heterogeneous computing platforms. The second one focuses on runtime systems to provide efficient and robust implementation of dense linear algebra algorithms. The third one is concerned with scheduling this particular application on a heterogeneous and dynamic environment.

Since January 2013, the team is participating to the C2S@Exa
http://

Type: FP7

Defi: Future and Emerging Technologies

Instrument: Specific Targeted Research Project

Objectif: Challenging current Thinking

Duration: June 2013 - May 2016

Coordinator: Nikolaos Bellas

Partners: CERTH, Greece; EPFL, Switzerland; RWTH Aachen University, Germany; The Queen’s University of Belfast, UK; IMEC, Belgium

Inria contact: Frédéric Vivien

Abstract: A new computing paradigm that exploits uncertainty to design systems that are energy-efficient and scale gracefully under hardware errors by operating below the nominal operating point, in a controlled way, without inducing massive or fatal errors.

In 2014, the University of Illinois at Urbana-Champaign, Inria, the French national computer science institute, Argonne National Laboratory, Barcelona Supercomputing Center, and Jülich Supercomputing Centre formed the Joint Laboratory on Extreme Scale Computing (JLESC), a follow-up of the Inria-Illinois Joint Laboratory for Petascale Computing. The Joint Laboratory is based at Illinois and includes researchers from Inria, and the National Center for Supercomputing Applications, ANL, BSC, and JSC. It focuses on software challenges found in extreme scale high-performance computers.

Research areas include:

Scientific applications (big compute and big data) that are the drivers of the research in the other topics of the joint-laboratory.

Modeling and optimizing numerical libraries, which are at the heart of many scientific applications.

Novel programming models and runtime systems, which allow scientific applications to be updated or reimagined to take full advantage of extreme-scale supercomputers.

Resilience and Fault-tolerance research, which reduces the negative impact when processors, disk drives, or memory fail in supercomputers that have tens or hundreds of thousands of those components.

I/O and visualization, which are important part of parallel execution for numerical silulations and data analytics

HPC Clouds, that may execute a portion of the HPC workload in the near future.

Several members of the ROMA team are involved in the JLESC joint lab through their research on resilience. Yves Robert is the scientific representant of Inria in JLESC.

The Aloha associate-team is a joint project of the Roma team and of
the Information and Computer science Department of the University of
Hawai`i (UH) at Mānoa, Honolulu, USA. Building on a vast array of
theoretical techniques and expertise developed in the field of
parallel and distributed computing, and more particularly application
*scheduling*, we tackle database questions from a fresh
perspective. To this end, this proposal includes:

a group that specializes in database systems research and who has both industrial and academic experience, the group of Lipyeow Lim (UH);

a group that specializes in practical aspects of scheduling problems and in simulation for emerging platforms and applications, and who has a long experience of multidisciplinary research, the group of Henri Casanova (UH);

a group that specializes in the theoretical aspects of scheduling problems and resource management (the Roma team).

The research work focuses on the following three thrusts:

Online, multi-criteria query optimization

Fault-Tolerance for distributed databases

Query scheduling for distributed databases

Yves Robert has been appointed as a visiting scientist by the ICL laboratory (headed by Jack Dongarra) at the University of Tennessee Knoxville. He collaborates with several ICL researchers on high-performance linear algebra and resilience methods at scale.

Yves Robert is a member of the Scientifc Council of Maison de la Simulation, Saclay.

Yves Robert is a member of the Steering Committee of IPDPS, of HCW of HeteroPar, and of Edu-EuroPar.

Yves Robert and Frédéric Vivien organized the *Scheduling for
large-scale platforms* in Lyon, July 1-4, 2014. There were 55 participants.

Bora Uçar organized the CSC14 workshop, the *Sixth SIAM
Workshop on Combinatorial Scientific Computing* in Lyon,
July 21-23, 2014. There were 54 participants.

was vice-program chair, for the track
*Algorithms*, of IPDPS 2014, and workshops co-chair of ICPP
2014.

was program vice-chair for the track
*Algorithms* of SC'14.

was program vice-chair, for the algorithms track, of HiPC 2014, and was co-responsible of the stream “Algorithmes distribués, multi-agents et calcul parallèle” for ROADEF 2014.

was a member of the program committees of the following conferences and workshops: HCW 2014, PDP 2014, Ena-HPC 2014 and CCGrid 2014.

was a member of the program committees of ComPAS 2014, Vecpar 2014, CSC 2014.

was a member of the program committee of IPDPS'2014 and HeteroPar'2014.

was a member of the program committee of the following conferences and workshops: ISCIS'14, EduPar'14, ICCS'14, and FTXS'14.

was was a member of the program committee for MPP2014 (a workshop of SBAC-PAD2014), HiPC14, IPDPS 2014, PMAA14, and PCO 2014 (a workshop of IPDPS).

was a member of the program committee of the following conferences and workshops: SC'14, posters of SC'14, IPDPS 2014, ComPAS'2014, PDP 2014, and EduHPC'14.

Bora Uçar was elected as the Secretary of the SIAM activity group on Supercomputing. (1 January 2014–12 December 2015). He was also appointed as the member of the steering committee of Combinatorial Scientific Computing.

is an associate editor of the *Journal of
Parallel and Distributed Computing (JPDC)* and of the *Journal
of Sustainable Computing: Informatics and Systems (SUSCOM)*.

is an associate editor of *International
Journal of High Performance Computing Applications (IJHPCA)*,
*International Journal of Grid and Utility Computing (IJGUC)*,
and *Journal of Computational Science (JOCS)*.

is an associate editor of *Parallel Computing*.

Licence: Anne Benoit, Systèmes et Réseaux, 48h, L3, École normale supérieure de Lyon, France.

Licence: Yves Robert, Algorithmes, 48h, L3, École normale supérieure de Lyon, France.

Master: Frédéric Vivien, Algorithmique et Programmation Parallèles, 36 h, M1, École normale supérieure de Lyon, France.

Master: Frédéric Vivien, Algorithms for High-Performance Computing Platforms, 36 h, M2, École normale supérieure de Lyon, France.

Master: Bora Uçar, Combinatorial Scientific Computing, 36 h, M2, École normale supérieure de Lyon, France.

PhD: Guillaume Aupy, Resilient and energy-efficient scheduling algorithms at scale, École normale supérieure de Lyon, September 16, 2014, Anne Benoit and Yves Robert.

PhD: Dounia Zaidouni, Combining checkpointing and other resilience mechanisms for exascale systems, December 10, 2014, École normale supérieure de Lyon, Frédéric Vivien and Yves Robert.

PhD: Wissam M. Sid-Lakhdar, Scaling the solution of large sparse linear systems using multifrontal methods on hybrid shared-distributed memory architectures, December 1 2014, Jean-Yves L'Excellent.

PhD in progress: Aurélien Cavelan, Algorithms for detecting and correcting silent errors, September 1, 2014, Anne Benoît and Yves Robert.

PhD in progress: Oguz Kaya, Parallel and Distributed Sparse Tensor Computations, September 1, 2014, Yves Robert and Bora Uçar.

PhD in progress: Julien Herrmann, Numerical algorithms for large-scale platforms, September 1, 2012, Loris Marchal and Yves Robert.

Bora Uçar was a member of the PhD defense committee of Clément Vuchener as a “rapporteur”, Université de Bordeaux, École doctoral de mathématiques et d'Informatique.

Yves Robert was a member of the HdR defense committee of Frédéric Suter, ENS Lyon.