Over the last few decades, there have been innumerable science,
engineering and societal breakthroughs enabled by the development of
High Performance Computing (HPC) applications, algorithms and
architectures.
These powerful tools have provided researchers with the ability to
computationally find efficient solutions for some of the most
challenging scientific questions and problems in medicine and biology,
climatology, nanotechnology, energy and environment.
It is widely accepted today that numerical simulation is the third pillar
of scientific discovery, on the same level as
theory and experimentation.
Numerous reports and papers also confirm that very high performance
simulation will open new opportunities not only for research but also
for a large spectrum of industrial sectors.

An important force that has continued to drive HPC is the focus on frontier milestones: technical goals that symbolize the next stage of progress in the field. In the 1990s, the HPC community sought to achieve computing at a teraflop rate; exascale machines are now expected in the coming months or years.

For application codes to sustain petaflops and beyond in the next few years, hundreds of thousands of processor cores or more will be needed, regardless of processor technology. Currently, only a few HPC simulation codes scale easily to this regime, and major algorithm and code development efforts are critical to realize the potential of these new systems. Scaling to the exaflop level involves improving physical models, mathematical modeling and super-scalable algorithms, and will require particular attention to the acquisition, management and visualization of huge amounts of scientific data.

In this context, the purpose of the project is to contribute to the efficient execution of frontier simulations arising from challenging academic and industrial research. The solution of these challenging problems requires a multidisciplinary approach involving applied mathematics, computational science and computer science. In applied mathematics, it essentially involves advanced numerical schemes. In computational science, it involves massively parallel computing and the design of highly scalable algorithms and codes to be executed on emerging hierarchical many-core, possibly heterogeneous, platforms. Through this approach, the project intends to contribute to every step, from the design of new high-performance numerical schemes that are more scalable, robust and accurate, to the optimized implementation of the associated algorithms and codes on very high performance supercomputers. This research will be conducted in close collaboration with European and US initiatives, in particular in the framework of EuroHPC collaborative projects.

Currently, we have one major multiscale application that is in material physics.
We contribute to all steps of the design of the parallel simulation tool.
More precisely, our applied mathematics skills will contribute to the
modeling, and our advanced numerical schemes will help in the design
and efficient software implementation of very large parallel multiscale simulations.
Moreover, the robustness and efficiency of our algorithmic research in linear
algebra are validated through industrial and academic collaborations with
different partners involved in various application fields.
Finally, we are also involved in a few collaborative initiatives in various application domains in a
co-design like framework.
These research activities are conducted in a wider multidisciplinary context with colleagues in other
academic or industrial groups, where our contribution is related to our expertise.
Not only do these collaborations enable our expertise to have a stronger
impact in various application domains through the promotion of advanced algorithms,
methodologies or tools, but in return they also open new avenues for research in the continuity of our core research activities.

Thanks to two Inria collaborative agreements, with Airbus/Conseil Régional Grande Aquitaine and with CEA, we have joint research efforts in a co-design framework enabling efficient and effective technology transfer towards industrial R&D. Furthermore, thanks to the past associate team, we collaborate with world-leading groups at Berkeley National Lab and Stanford University on the design of fast numerical solvers and their parallel implementations.

Our high-performance software packages are integrated into several academic or industrial complex codes and are validated on very large scale simulations. For all our software developments, we use first the experimental platform , then the various large parallel platforms available through GENCI in France (the CCRT, CINES and IDRIS computing centers), and finally the high-end parallel platforms available via European and US initiatives or projects such as PRACE.

The methodological component of concerns expertise in the design, as well as the efficient and scalable implementation, of highly parallel numerical algorithms for performing frontier simulations. In order to address these computational challenges, a hierarchical organization of the research is considered. In this bottom-up approach, we first consider, in Section , generic topics concerning high-performance computational science. The activities described in this section are transversal to the overall project, and their outcome will support all the other research activities at various levels in order to ensure the parallel scalability of the algorithms. The aim of this activity is not to study general-purpose solutions, but rather to address these problems in close relation with specialists of the field, in order to adapt and tune advanced approaches in our algorithmic designs. The next activity, described in Section , is related to the study of parallel linear algebra techniques that currently appear as promising approaches to tackle huge problems on extreme-scale platforms. We highlight linear problems (linear systems or eigenproblems) because, in many large-scale applications, they are the most computationally intensive numerical kernels and often the main performance bottleneck. These parallel numerical techniques will be the basis of both academic and industrial collaborations, some of which are described in Section , but will also be closely related to some functionalities developed in the parallel fast multipole activity described in Section . Finally, as the accuracy of the physical models increases, there is a real need for efficient parallel algorithm implementations for multiphysics and multiscale modeling, in particular in the context of code coupling. The challenges associated with this activity will be addressed in the framework of the activity described in Section .

Currently, we have one major application (see Section ) that is in material physics. We will contribute to all steps of the design of the parallel simulation tool. More precisely, our applied mathematics skills will contribute to the modeling, and our advanced numerical schemes will help in the design and efficient software implementation of very large parallel simulations. We also participate in a few co-design actions in close collaboration with application groups. The objective of this activity is to apply our expertise in fields where it is critical for designing scalable simulation tools. We refer to Section for a detailed description of these activities.

The research directions proposed in

are strongly influenced by both the applications we are studying and the architectures that we target (i.e., massively parallel, heterogeneous, many-core architectures). Our main goal is to study the methodology needed to efficiently exploit the new generation of high-performance computers, with all the constraints it induces. To achieve high performance with complex applications, we have to study both algorithmic problems and the impact of the architectures on the algorithm design.

From the application point of view, the project will be interested in multiresolution, multiscale and hierarchical approaches, which lead to multi-level parallelism schemes. This hierarchical parallelism approach is necessary to achieve good performance and high scalability on modern massively parallel platforms. In this context, more specific algorithmic problems are very important to obtain high performance. Indeed, the kinds of applications we are interested in often rely on data redistribution (e.g., code coupling applications). This well-known issue becomes very challenging as both the number of computational nodes and the amount of data increase. Thus, we have both to design new algorithms and to adapt existing ones. In addition, issues such as task scheduling have to be revisited in this new context. It is important to note that the work developed in this area will be applied, for example, in the context of code coupling (see Section ).

Considering the complexity of modern architectures, such as massively parallel machines or new-generation heterogeneous multicore architectures, task scheduling becomes a challenging problem that is central to obtaining high efficiency. With the recent addition of colleagues from the scheduling community (O. Beaumont and L. Eyraud-Dubois), the team is better equipped than ever to design scheduling algorithms and models specifically tailored to our target problems. It is important to note that this topic is strongly linked to the underlying programming model. Indeed, for multicore and heterogeneous architectures, it has become clear over the last five years that the best programming model is an approach mixing multithreading within computational nodes and message passing between them. During that period, the high-performance computing community has devoted considerable work to understanding what is critical to efficiently exploiting the massively multicore platforms that will appear in the near future.

The first key to performance is the granularity of the computations: on such platforms, the granularity of the parallelism must be fine enough to feed all the computing units with a sufficient amount of work. It is thus crucial for us to design new high-performance tools for scientific computing in this context; this will be done in our solvers, for example, to adapt them to this new parallel scheme. Second, the larger the number of cores inside a node, the more complex the memory hierarchy, which impacts the behavior of the algorithms within the node. On this kind of platform, NUMA effects will become more and more problematic. Thus, it is very important to study and design data-aware algorithms that take into account the affinity between computational threads and the data they access. This is particularly important in the context of our high-performance tools.
Note that this work has to rely on an intelligent, cooperative underlying runtime (like the tools developed by the Inria Project-Team), which allows fine-grained management of data distribution within a node.

Another very important issue concerns high-performance computing
using "heterogeneous" resources within a computational
node. Indeed, with the deployment of GPUs and the use of
more specific co-processors, it is
important for our algorithms to efficiently exploit these new types
of architectures. To adapt our algorithms and tools to these
accelerators, we need to identify what can be done on the GPU,
for example, and what cannot. Note that recent results in the field
have shown the benefit of using both regular cores and GPUs to
perform computations. Note also that, in contrast to the fine
parallelism granularity needed by regular multicore
architectures, GPUs require coarser-grain parallelism. Thus,
making GPUs and regular cores work together leads
to two types of tasks in terms of granularity.
This represents a challenging problem, especially in terms of scheduling.
From this perspective, we investigate
new approaches for composing parallel applications within a runtime
system for heterogeneous platforms.

In the context of scaling up, and particularly of minimizing energy consumption, it is generally acknowledged that the solution lies in the use of heterogeneous architectures, where each resource is particularly suited to specific types of tasks, together with fine control, at the algorithmic level, of data movements and of the trade-offs to be made between computation and communication. In this context, we are particularly interested in optimizing the training phase of deep convolutional neural networks, which consumes a lot of memory and for which it is possible to trade additional computations for reduced data movements and memory occupancy. We are also interested in the complexity introduced by resource heterogeneity itself, both from a theoretical point of view, on the complexity of scheduling problems, and from a more practical point of view, on the implementation of specific kernels in dense or sparse linear algebra.
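This computation-for-memory trade-off can be sketched in a few lines of NumPy. The toy below (scalar tanh layers, hypothetical names; not the project's actual training pipeline) stores only every k-th activation during the forward pass and recomputes the missing ones from the nearest checkpoint during the backward pass.

```python
import numpy as np

def forward_full(x, ws):
    """Forward pass storing every activation (maximal memory)."""
    acts = [x]
    for w in ws:
        acts.append(np.tanh(w * acts[-1]))
    return acts

def backward_full(acts, ws):
    """d(output)/d(input) via backprop over the stored activations."""
    g = 1.0
    for i in range(len(ws) - 1, -1, -1):
        g *= ws[i] * (1.0 - np.tanh(ws[i] * acts[i]) ** 2)
    return g

def backward_checkpointed(x, ws, k):
    """Same gradient, but only every k-th activation is kept;
    the others are recomputed on demand (more compute, less memory)."""
    cps, a = {0: x}, x
    for i, w in enumerate(ws, 1):
        a = np.tanh(w * a)
        if i % k == 0:
            cps[i] = a

    def act(i):
        # recompute activation i from the nearest stored checkpoint
        j = max(c for c in cps if c <= i)
        a = cps[j]
        for w in ws[j:i]:
            a = np.tanh(w * a)
        return a

    g = 1.0
    for i in range(len(ws), 0, -1):
        g *= ws[i - 1] * (1.0 - np.tanh(ws[i - 1] * act(i - 1)) ** 2)
    return g

ws = [0.5, -1.2, 0.8, 2.0, -0.7]
g_full = backward_full(forward_full(0.3, ws), ws)
g_ckpt = backward_checkpointed(0.3, ws, 2)
```

The checkpointed variant stores roughly n/k activations instead of n, at the price of recomputing each segment during the backward pass; deep-learning frameworks apply the same idea at tensor granularity.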

In order to achieve advanced knowledge concerning the design of
efficient computational kernels to be used in our high-performance
algorithms and codes, we will develop research activities first on
regular frameworks before extending them to more irregular and complex
situations.
In particular, we will first work on optimized dense linear algebra
kernels and use them in our more complicated direct and hybrid
solvers for sparse linear algebra and in our fast multipole algorithms for
interaction computations.
In this context, we will participate in the development of those kernels
in collaboration with groups specialized in dense linear algebra.
In particular, we intend to develop a strong collaboration with the group of Jack Dongarra
at the University of Tennessee and collaborating research groups. The objective will be to
develop dense linear algebra algorithms and libraries for multicore
architectures in the context of the project,
and for GPU and hybrid multicore/GPU architectures in the context of the
project.
A new solver, Chameleon, has emerged from the associate team.
While and focus on multicore and GPU
architectures, respectively, Chameleon makes the most of
heterogeneous architectures thanks to task-based dynamic runtime systems.

A more prospective objective is to study resiliency in the context of large-scale scientific applications for massively parallel architectures. Indeed, with the increase in the number of computational cores per node, the probability of a hardware crash on a core or of a memory corruption is dramatically increased. This is a crucial problem that needs to be addressed; however, we will study it only at the algorithmic/application level, even if it relies on lower-level mechanisms (at the OS or even hardware level). This work could of course be performed at lower levels (at the operating-system level, for example), but we believe that handling faults at the application level provides more knowledge about what has to be done: at the application level, we know what is critical and what is not. The approach we will follow combines fault-tolerant implementations of the runtime environments we use (such as ) with an adaptation of our algorithms to manage these kinds of faults. This topic is a very long-term objective, which needs to be addressed to guarantee the robustness of our solvers and applications.
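As an illustration of application-level fault handling, here is a minimal sketch under our own assumptions (a toy Jacobi iteration, not the team's actual solvers): the iterate is checkpointed periodically, and a simulated fault triggers a rollback to the last checkpoint instead of a restart from scratch.

```python
import numpy as np

def jacobi_with_restart(A, b, iters=200, checkpoint_every=10, crash_at=None):
    """Jacobi iteration with application-level checkpoint/restart.
    A simulated fault at iteration `crash_at` causes a rollback to the
    last saved iterate rather than a restart from zero."""
    D = np.diag(A)
    R = A - np.diag(D)
    x = np.zeros_like(b)
    saved, saved_it = x.copy(), 0
    it = 0
    while it < iters:
        if crash_at is not None and it == crash_at:
            crash_at = None                   # the fault occurs only once
            x, it = saved.copy(), saved_it    # recover from the checkpoint
            continue
        x = (b - R @ x) / D                   # one Jacobi sweep
        it += 1
        if it % checkpoint_every == 0:
            saved, saved_it = x.copy(), it    # lightweight checkpoint
    return x

# diagonally dominant system; a fault is injected at iteration 57
A = np.array([[4.0, 1.0, 0.0], [1.0, 5.0, 2.0], [0.0, 2.0, 6.0]])
b = np.array([1.0, 2.0, 3.0])
x = jacobi_with_restart(A, b, crash_at=57)
```

Only the iterations since the last checkpoint are lost, which is the essential benefit over a full restart when faults become frequent at scale.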

Finally, it is important to note that the main goal of is to design tools and algorithms that will be used within complex simulation frameworks on next-generation parallel machines. Thus, we intend, with our partners, to use the proposed approaches in complex scientific codes and to validate them within very large scale simulations, as well as to design parallel solutions in co-design collaborations.

Starting with the development of basic linear algebra kernels tuned for
various classes of computers, significant knowledge of
the basic concepts for implementations on high-performance scientific computers has been accumulated.
Further knowledge has been acquired through the design of more sophisticated linear algebra algorithms
that fully exploit those basic, computationally intensive kernels.
In that context, we still follow the development of new computing platforms and their associated programming
tools.
This enables us to identify the possible bottlenecks of new computer architectures
(memory path, various levels of cache, interprocessor or internode network) and to propose
ways to overcome them in algorithmic design.
With the goal of designing efficient scalable linear algebra solvers for large scale applications, various
tracks will be followed in order to investigate different complementary approaches.
Sparse direct solvers have for years been the methods of choice for solving linear systems of equations;
however, it is nowadays acknowledged that classical approaches are scalable neither in computational complexity
nor in memory for large problems, such as those arising from the discretization of large 3D PDE problems.
We will continue to work on sparse direct solvers, on the one hand to make sure they fully benefit from the most advanced computing platforms,
and on the other hand to attempt to reduce their memory and computational costs for some classes of problems where
data-sparse ideas can be considered.
Furthermore, sparse direct solvers are key building blocks for the
design of some of our parallel algorithms, such as the hybrid solvers described in the sequel of this section.
Our activities in that context will mainly address preconditioned Krylov subspace methods; both components,
the preconditioner and the Krylov solver, will be investigated.
In this framework, and possibly in relation with the research activity on fast multipole, we intend to study how emerging
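As a minimal illustration of the preconditioned Krylov subspace methods mentioned above, here is a textbook conjugate gradient with a Jacobi (diagonal) preconditioner in NumPy; the toy 1D Laplacian and the function names are illustrative, not the project's solvers.

```python
import numpy as np

def pcg(A, b, M_inv, tol=1e-10, maxit=500):
    """Preconditioned conjugate gradient for SPD A: solve A x = b,
    where M_inv(r) applies an approximation of A^{-1} to the residual."""
    x = np.zeros_like(b)
    r = b - A @ x
    z = M_inv(r)
    p = z.copy()
    rz = r @ z
    for _ in range(maxit):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = M_inv(r)                 # apply the preconditioner
        rz_new = r @ z
        p = z + (rz_new / rz) * p    # update the search direction
        rz = rz_new
    return x

# 1D Laplacian (SPD) with a Jacobi preconditioner
n = 50
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
d = np.diag(A)
x = pcg(A, np.ones(n), lambda r: r / d)
```

The preconditioner is passed as a function applying an approximate inverse, which is exactly the role played by the hybrid and hierarchical preconditioners discussed in this section.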

For the solution of large sparse linear systems, we design numerical schemes and software packages for direct and hybrid parallel solvers. Sparse direct solvers are mandatory when the linear system is very ill-conditioned; such a situation is often encountered in structural mechanics codes, for example. Therefore, to obtain an industrial software tool that is robust and versatile, high-performance sparse direct solvers are mandatory, and parallelism is then necessary for reasons of memory capacity and acceptable solution time. Moreover, in order to efficiently solve 3D problems with more than 50 million unknowns, which is now a reachable challenge with new multicore supercomputers, we must achieve good time scalability and control the memory overhead. Solving a sparse linear system by a direct method is generally a highly irregular problem that induces some challenging algorithmic problems and requires a sophisticated implementation scheme in order to fully exploit the capabilities of modern supercomputers.

New supercomputers incorporate many microprocessors, each composed of one or several computational cores. These new architectures induce strongly hierarchical topologies, known as NUMA architectures. In the context of distributed NUMA architectures, in collaboration with the Inria team, we study optimization strategies to improve the scheduling of communications, threads and I/O. We have developed dynamic scheduling designed for NUMA architectures in the solver; the data structures of the solver, as well as its communication patterns, have been modified to meet the needs of these architectures and of dynamic scheduling. We are also interested in dynamically adapting the computation grain to efficiently use multicore architectures and shared memory. Experiments on several numerical test cases have been performed to demonstrate the efficiency of the approach on different architectures. Sparse direct solvers such as are currently limited by their memory requirements and computational cost. They are competitive for small matrices but, for large matrices, are often less efficient than iterative methods in terms of memory. We are currently accelerating the dense algebra components of direct solvers using block low-rank compression techniques.
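The block low-rank idea can be made concrete on a small, hypothetical example: under a 1/r kernel, the interaction block between two well-separated point clusters has rapidly decaying singular values, so a truncated SVD compresses it far below its dense size.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((200, 3))           # first cluster, in the unit cube
Y = rng.random((200, 3)) + 10.0    # second cluster, well separated

# dense interaction block for the 1/r kernel between the two clusters
K = 1.0 / np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)

# truncated SVD: keep only singular values above a relative tolerance
U, s, Vt = np.linalg.svd(K, full_matrices=False)
rank = int((s >= 1e-8 * s[0]).sum())
K_lr = (U[:, :rank] * s[:rank]) @ Vt[:rank]

rel_err = np.linalg.norm(K - K_lr) / np.linalg.norm(K)
```

In a block low-rank solver the same compression is applied to the admissible off-diagonal blocks of the factors, trading a controlled accuracy loss for large memory and flop savings.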

In collaboration with the ICL team from the University of Tennessee and the team from Inria, we are evaluating how to replace the embedded scheduling driver of the solver with one of the generic frameworks, or , to execute the task graph corresponding to a sparse factorization. The aim is to design algorithms and parallel programming models for implementing direct methods for the solution of sparse linear systems on emerging computers equipped with GPU accelerators. More generally, this work will be performed in the context of the ANR project, which aims at designing high-performance sparse direct solvers for modern heterogeneous systems. This ANR project involves several groups working either on the sparse linear solver aspects ( and from Inria, and APO from IRIT), on runtime systems ( from Inria) or on scheduling algorithms ( and from Inria). The results of these efforts will be validated in the applications provided by the industrial project members, namely CEA-CESTA and Airbus Central R & T.

One route to the parallel scalable solution of large sparse linear systems in parallel scientific computing is the use of hybrid methods that hierarchically combine direct and iterative methods. These techniques inherit the advantages of each approach, namely the limited amount of memory and natural parallelization for the iterative component and the numerical robustness of the direct part. The general underlying ideas are not new since they have been intensively used to design domain decomposition techniques; those approaches cover a fairly large range of computing techniques for the numerical solution of partial differential equations (PDEs) in time and space. Generally speaking, it refers to the splitting of the computational domain into sub-domains with or without overlap. The splitting strategy is generally governed by various constraints/objectives but the main one is to express parallelism. The numerical properties of the PDEs to be solved are usually intensively exploited at the continuous or discrete levels to design the numerical algorithms so that the resulting specialized technique will only work for the class of linear systems associated with the targeted PDE.

In that context, we continue our effort on the design of algebraic non-overlapping domain decomposition techniques that rely on the solution of a Schur complement system defined on the interface introduced by the partitioning of the adjacency graph of the sparse matrix associated with the linear system. Although it is better conditioned than the original system, the Schur complement needs to be preconditioned to be amenable to solution by a Krylov subspace method. Different hierarchical preconditioners will be considered, possibly multilevel, to improve the numerical behavior of the current approaches implemented in our software library . This activity will be developed further in the H2020 project. In addition to these numerical studies, advanced parallel implementations will be developed, involving close collaboration between the hybrid and sparse direct activities.

Preconditioning is the main focus of the two activities described above. They aim at speeding up the convergence of a Krylov subspace method, which is the complementary component involved in the solvers of interest to us. In that framework, we believe that various aspects deserve to be investigated; we will consider the following ones:

Many eigensolvers also rely on Krylov subspace techniques. Naturally, links exist between Krylov subspace linear solvers and Krylov subspace eigensolvers. We plan to study the computation of eigenvalue problems along the following two axes:

In this research project, we are interested in the design of new advanced techniques to solve large mixed dense/sparse linear systems, the extensive comparison of these new approaches to the existing ones, and the application of these innovative ideas on realistic industrial test cases in the domain of aeroacoustics (in collaboration with Airbus Central R & T).

In most scientific computing applications considered nowadays as computational challenges (like biological and material systems, astrophysics or electromagnetism), the introduction of hierarchical methods based on an octree structure has dramatically reduced the amount of computation needed to simulate those systems for a given accuracy. For instance, in the N-body problem arising from these application fields, we must compute all pairwise interactions among N objects (particles, lines, ...) at every timestep. Among these methods, the Fast Multipole Method (FMM), developed for gravitational potentials in astrophysics and for electrostatic (Coulombic) potentials in molecular simulations, solves this N-body problem for any given precision with O(N) runtime complexity, against O(N^2) for the direct computation.

The potential field is decomposed into a near-field part, computed directly, and a far-field part approximated thanks to multipole and local expansions. We introduced a matrix formulation of the FMM that exploits the cache hierarchy of a processor through the Basic Linear Algebra Subprograms (BLAS). Moreover, we developed a parallel adaptive version of the FMM algorithm for heterogeneous particle distributions, which is very efficient on parallel clusters of SMP nodes. Finally, on such computers, we developed the first hybrid MPI-thread algorithm, which achieves better parallel efficiency and better memory scalability. We plan to work on the following points in .
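The near/far decomposition can be illustrated with the lowest-order (monopole) far-field approximation; this is a hedged toy in NumPy, with invented point sets and names, not the team's FMM engine.

```python
import numpy as np

rng = np.random.default_rng(1)
near = rng.random((500, 3))          # sources surrounding the target
far = rng.random((500, 3)) + 20.0    # well-separated far cluster
q_near = rng.random(500)
q_far = rng.random(500)
target = np.array([0.5, 0.5, 0.5])

def direct_potential(pts, q, x):
    """Direct summation: O(N) per target, O(N^2) over all pairs."""
    return float(np.sum(q / np.linalg.norm(pts - x, axis=1)))

# near field: exact direct summation
phi = direct_potential(near, q_near, target)
# far field: order-0 multipole, i.e. total charge at the weighted center
center = (q_far[:, None] * far).sum(axis=0) / q_far.sum()
phi += q_far.sum() / np.linalg.norm(center - target)

phi_exact = direct_potential(near, q_near, target) + \
            direct_potential(far, q_far, target)
rel_err = abs(phi - phi_exact) / abs(phi_exact)
```

The error of the far-field term shrinks as the clusters become better separated; the FMM improves on this order-0 sketch with higher-order expansions organized in an octree.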

Nowadays, the high-performance computing community is examining
alternative architectures that address the limitations of modern
cache-based designs. GPUs (Graphics Processing Units) and the Cell
processor have thus already been used in astrophysics and in molecular
dynamics, and the Fast Multipole Method has also been implemented on GPUs.
We intend to examine the
potential of using these forthcoming processors as building blocks
for high-end parallel computing in N-body calculations. More
precisely, we want to take advantage of our specific underlying BLAS routines
to obtain an efficient and easily portable FMM for these new architectures.
Algorithmic issues such as dynamic load balancing among heterogeneous
cores will also have to be solved in order to harness all the available
computing power.
This research action will be conducted in close connection with the
activity described in
Section .

In many applications arising from material physics or astrophysics, the distribution of the data is highly non-uniform and the amount of data can grow between two time steps. As mentioned previously, we have proposed a hybrid MPI-thread algorithm to exploit the data locality within each node. We plan to further improve the load balancing for highly non-uniform particle distributions with a small computation grain, thanks to dynamic load balancing at the thread level and to a load balancing correction over several simulation time steps at the process level.

The engine that we develop will be extended to new potentials arising
from material physics, such as those used in dislocation
simulations. The interaction between dislocations is long-ranged
(it decays as 1/r, where r is the distance between two dislocations).

The boundary element method (BEM) is a well-known
approach to boundary value problems appearing in various fields of
physics. With this approach, we only have to solve an integral
equation on the boundary. This implies an interaction that decreases in space, but results
in the solution of a dense linear system whose size grows with the number of boundary unknowns.

Many important physical phenomena in material physics and climatology are inherently complex applications. They often use multi-physics or multi-scale approaches, which couple different models and codes. The key idea is to reuse available legacy codes through a coupling framework instead of merging them into a stand-alone application. There is typically one model per different scale or physics and each model is implemented by a parallel code.

For instance, to model a crack propagation, one uses a molecular dynamic code to represent the atomistic scale and an elasticity code using a finite element method to represent the continuum scale. Indeed, fully microscopic simulations of most domains of interest are not computationally feasible. Combining such different scales or physics is still a challenge to reach high performance and scalability.

Another prominent example is found in the field of aeronautic propulsion: the conjugate heat transfer simulation in complex geometries (as developed by the CFD team of CERFACS) requires coupling a fluid/convection solver (AVBP) with a solid/conduction solver (AVTP). As the AVBP code is much more CPU-consuming than the AVTP code, there is an important computational imbalance between the two solvers.

In this context, one crucial issue is undoubtedly the load balancing
of the whole coupled simulation, which remains an open question. The
goal here is to find the best data distribution for the whole coupled
simulation, and not only for each stand-alone code, as is usually
done. Indeed, the naive balancing of each code on its own can
lead to an important imbalance and to a communication bottleneck
during the coupling phase, which can drastically decrease the overall
performance. Therefore, we argue that it is necessary to model the
coupling itself in order to ensure good scalability, especially when
running on massively parallel architectures (tens of thousands of
processors/cores). In other words, one must develop new algorithms and
software implementations to perform a coupling-aware partitioning
of the whole application.
Another related problem is that of resource allocation. This is
particularly important for the global coupling efficiency and
scalability, because each code involved in the coupling can be more or
less computationally intensive, and there is a trade-off to find
between the resources assigned to each code, so as to avoid one of them
waiting for the other(s). Furthermore, what happens if the load of one code
changes dynamically relative to the other? In such a case, it could
be convenient to dynamically adapt the number of resources used
during the execution.

There are several open algorithmic problems that we investigate in the project-team. All these problems share a similar methodology based upon a graph model, and are expressed as variants of the classic graph partitioning problem with additional constraints or different objectives.

As a preliminary step related to the dynamic load balancing of coupled
codes, we focus on the problem of dynamic load balancing of a single
parallel code with a variable number of processors. Indeed, if the
workload varies drastically during the simulation, the load must be
redistributed regularly among the processors. Dynamic load balancing
is a well-studied subject, but most studies are limited to an initially
fixed number of processors. Adjusting the number of processors at
runtime allows one to preserve the parallel code efficiency or to keep
running the simulation when the current memory resources are
exceeded. We call this problem MxN graph repartitioning.

We propose methods based on graph repartitioning in order to re-balance the load while changing the number of processors. These methods are split into two main steps. First, we study the migration phase and build a "good" migration matrix, minimizing several metrics such as the migration volume or the number of exchanged messages. Second, we use graph partitioning heuristics to compute a new distribution optimizing the migration according to the results of the previous step.
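The two metrics involved can be made concrete on a toy graph; the graph and the two partitions below are invented for illustration, not output of our repartitioning heuristics.

```python
# undirected toy graph as adjacency lists, with unit vertex weights
graph = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3, 5], 5: [4]}
weight = {v: 1 for v in graph}

old_part = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}  # M = 2 processors
new_part = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2}  # N = 3 processors

# migration volume: total weight of vertices that change processor
migration = sum(weight[v] for v in graph if old_part[v] != new_part[v])

# edge-cut of the new partition: edges crossing processor boundaries
edge_cut = sum(1 for u in graph for v in graph[u]
               if u < v and new_part[u] != new_part[v])
```

A repartitioning heuristic has to balance these two objectives: a partition with a low edge-cut may still force most vertices to migrate, and vice versa.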

As stated above, the load balancing of coupled codes is a major
issue that determines the performance of the complex simulation, and
reaching high performance can be a great challenge. In this context,
we develop new graph partitioning techniques, called co-partitioning. They address the problem of load balancing for two
coupled codes: the key idea is to perform a "coupling-aware"
partitioning, instead of partitioning these codes independently, as
is classically done. More precisely, we propose to enrich the classic
graph model with inter-edges, which represent the coupled-code
interactions. We describe two new algorithms and compare them to the
naive approach. In preliminary experiments on
synthetically generated graphs, we observe that our algorithms succeed
in balancing the computational load in the coupling phase and, in some
cases, in reducing the coupling communication costs.
Surprisingly, we notice that our algorithms do not significantly
degrade the global graph edge-cut, despite the additional
constraints they impose.

Besides this, our co-partitioning technique requires the use of graph
partitioning with fixed vertices, which raises serious issues
with state-of-the-art software that is classically based on the
well-known recursive bisection (RB) paradigm. Indeed, in this setting
the RB method often fails to produce partitions of good quality. To
overcome this issue, we propose a new direct k-way partitioning
algorithm, which we validate on graphs from the DIMACS'10 collection.

Graph handling and partitioning play a central role in the activity described here, but also in other numerical techniques detailed in the sparse linear algebra section. Nested Dissection is a well-known heuristic for sparse matrix ordering, which both reduces the fill-in during numerical factorization and maximizes the number of independent computation tasks. Using the block data structure induced by the partition of separators of the original graph, very efficient parallel block solvers have been designed and implemented following supernodal or multifrontal approaches. For hybrid methods mixing direct and iterative solvers such as , obtaining a domain decomposition that balances both the size of the domain interiors and the size of the interfaces is a key point for load balancing and efficiency in a parallel context.
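
To make the idea concrete, here is a minimal, purely illustrative sketch of Nested Dissection ordering on a 1D path graph; real orderings operate on general graphs with multi-vertex separators, but the principle is the same.

```python
# Minimal Nested Dissection sketch on a path graph (vertices 0..n-1,
# edges between consecutive vertices): the middle vertex is a separator
# and is ordered last; the two halves are ordered recursively and can
# then be factorized as independent tasks, with no fill-in between them.

def nested_dissection(vertices):
    if len(vertices) <= 2:              # small enough: order directly
        return list(vertices)
    mid = len(vertices) // 2
    sep = [vertices[mid]]               # 1D separator: a single vertex
    left = nested_dissection(vertices[:mid])
    right = nested_dissection(vertices[mid + 1:])
    return left + right + sep           # separators eliminated last

order = nested_dissection(list(range(7)))
```

In the block solvers mentioned above, the separators found at each level induce the dense blocks on which supernodal or multifrontal kernels operate.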

We intend to revisit some well-known graph partitioning techniques in the light of these hybrid solvers, and to design new algorithms to be tested in the package.

Due to the increase in available computing power, new applications in nanoscience and physics appear, such as the study of the properties of new materials (photovoltaic materials, bio- and environmental sensors, ...), failure in materials, or nano-indentation. Chemists and physicists now commonly perform simulations in these fields. These computations simulate systems of up to billions of atoms in materials, over large time scales of up to several nanoseconds. The larger the simulation, the cheaper the computation of the potential driving the phenomena must be, which results in low-precision results. So, if we need to increase the precision, there are two ways to decrease the computational cost: the first is to improve the algorithms and their parallelization, and the second is to adopt a multiscale approach.

A domain of interest is the material aging for the nuclear industry. The materials are exposed to complex conditions due to the combination of thermo-mechanical loading, the effects of irradiation and the harsh operating environment. This operating regime makes experimentation extremely difficult and we must rely on multi-physics and multi-scale modeling for our understanding of how these materials behave in service. This fundamental understanding helps not only to ensure the longevity of existing nuclear reactors, but also to guide the development of new materials for 4th generation reactor programs and dedicated fusion reactors. For the study of crystalline materials, an important tool is dislocation dynamics (DD) modeling. This multiscale simulation method predicts the plastic response of a material from the underlying physics of dislocation motion. DD serves as a crucial link between the scale of molecular dynamics and macroscopic methods based on finite elements; it can be used to accurately describe the interactions of a small handful of dislocations, or equally well to investigate the global behavior of a massive collection of interacting defects.

To explore, i.e. to simulate, these new areas, we need to develop and/or significantly improve the models, schemes and solvers used in the classical codes. In the project, we want to accelerate the algorithms arising in those fields.

We focus on the following topics (in particular within the OPTIDIS project, currently under definition, in collaboration with CEA Saclay, CEA Île-de-France and the SIMaP laboratory in Grenoble), in connection with the research described in Sections and .

Scientific simulation for ITER tokamak modeling provides a natural bridge between theory and experimentation, and is also an essential tool for understanding and predicting plasma behavior. Recent progress in the numerical simulation of fine-scale turbulence and of the large-scale dynamics of magnetically confined plasma has been enabled by access to petascale supercomputers. This progress would have been out of reach without new computational methods and adapted reduced models. In particular, the plasma science community has developed codes whose runtime scales quite well with the number of processors, up to thousands of cores. The research activities of concerning the international ITER challenge started in the Inria Project Lab in collaboration with and were related to two complementary studies: a first one concerning the turbulence of plasma particles inside a tokamak (in the context of code) and a second one concerning the MHD instability of edge localized modes (in the context of code). The activity concerning was completed at the end of 2018.

Other numerical simulation tools designed for the ITER challenge aim at making significant progress in understanding active control methods for the plasma edge MHD instability known as Edge Localized Modes (ELMs), which represent a particular danger with respect to heat and particle loads for the Plasma Facing Components (PFC) of the tokamak. The goal is to improve the understanding of the related physics and to propose possible new strategies to improve the effectiveness of ELM control techniques. The simulation tool used ( code) performs nonlinear MHD modeling and is based on a fully implicit time evolution scheme that leads to large, very ill-conditioned 3D sparse linear systems to be solved at every time step. In this context, using the library to solve these large sparse problems efficiently by a direct method is a challenging issue.

This activity continues within the context of the project, in which the solver has been identified to enable the processing of much larger linear systems for the nuclear fusion code from . Contrary to the code, the problem to be treated corresponds to the complete 3D volume of the plasma torus. The objective is to be competitive, for complex geometries, with an Algebraic MultiGrid approach designed by one partner of .

Parallel and numerically scalable hybrid solvers based on a fully algebraic coarse space correction have been studied theoretically, and various advanced parallel implementations have been designed. Their parallel scalability was initially investigated on large-scale problems within the project, thanks to a close collaboration with BSC and the integration of within the Alya software. This activity will be developed further in the project. The performance has also been assessed on a PRACE Tier-0 machine, within a PRACE Project Access, through a collaboration with CERFACS and the Laboratoire de Physique des Plasmas at École Polytechnique for the calculation of plasma propulsion. A comparative parallel scalability study with the Algebraic MultiGrid solver from PETSc has been conducted in that framework.

This domain is addressed in the context of a long-term collaboration with Airbus Research Centers.
Wave propagation phenomena intervene in many different aspects of systems design at Airbus. They drive the level of acoustic vibrations that mechanical components have to sustain, a level that one may want to diminish for comfort reasons (in the case of aircraft passengers, for instance) or for safety reasons (to avoid damage in the case of a payload in a rocket fairing at take-off). Numerical simulation of these phenomena plays a central part in the upstream design phase of any such project. Airbus Central R&T has developed over the last decades an in-depth knowledge of the Boundary Element Method (BEM) for the simulation of wave propagation in homogeneous media and in the frequency domain. To tackle heterogeneous media (such as jet engine flows, in the case of acoustic simulation), these BEM approaches are coupled with volumic finite elements (FEM). We end up with the need to solve large (several million unknowns) linear systems of equations composed of a dense part (coming from the BEM domain) and a sparse part (coming from the FEM domain). Various parallel solution techniques are available today, mixing tools created by the academic world (such as the MUMPS and PaStiX sparse solvers) as well as parallel software tools developed in-house at Airbus (the dense solver SPIDO, a multipole solver, ...).

The training phase of Deep Convolutional Neural Networks nowadays represents a significant share of the computations performed on HPC supercomputers. It introduces several new problems concerning resource allocation and scheduling issues, because of the specific pattern of task graphs induced by stochastic gradient descent, and because memory consumption is particularly critical when performing training. As of today, the most classical parallelization methods consist in partitioning mini-batches, images, filters, etc., but all these methods induce high synchronization and communication costs, and only very partially resolve memory issues. Within the framework of the Inria IPL on HPC, Big Data and Learning convergence, we are working on rematerialization techniques and on the use of model parallelism, in particular to be able to build on the research that has been carried out in a more traditional HPC framework on the exploitation of resource heterogeneity and dynamic runtime scheduling.

We have published our work on Efficient Combination of Rematerialization and Offloading for Training DNNs in the NeurIPS conference. NeurIPS is the main and most prestigious conference of the machine learning community. This publication highlights our 3-year effort to use our scheduling expertise from the HPC environment to obtain meaningful contributions for the machine learning community. In addition, we expect this to boost the visibility of the associated Rotor software within the machine learning community.

This work is based on the seminar entitled "Resiliency in Numerical Algorithm Design for Extreme Scale Simulations", held March 1-6, 2020 at Schloss Dagstuhl, which was attended by all the authors. Advanced supercomputing is characterized by very high computation speeds, at the cost of an enormous amount of resources and energy. A typical large-scale computation running for 48 hours on a system consuming 20 MW, as predicted for exascale systems, would consume a million kWh, corresponding to about 100k Euro in energy cost, for executing 10^23 floating-point operations. It is clearly unacceptable to lose the whole computation if any of the several million parallel processes fails during the execution. Moreover, if a single operation suffers from a bit-flip error, should the whole computation be declared invalid? What about the notion of reproducibility itself: should this core paradigm of science be revised and refined for results that are obtained by large-scale simulation? Naive versions of conventional resilience techniques will not scale to the exascale regime: with a main memory footprint of tens of Petabytes, synchronously writing checkpoint data all the way to background storage at frequent intervals will create intolerable overheads in runtime and energy consumption. Forecasts show that the mean time between failures could be lower than the time to recover from such a checkpoint, so that large calculations at scale might not make any progress if robust alternatives are not investigated. More advanced resilience techniques must be devised. The key may lie in exploiting both advanced system features and specific application knowledge. Research will face two essential questions: (1) what are the reliability requirements for a particular computation, and (2) how do we best design the algorithms and software to meet these requirements?
While the analysis of use cases can help understand the particular reliability requirements, the construction of remedies is currently wide open. One avenue would be to refine and improve on system- or application-level checkpointing and rollback strategies in the case an error is detected. Developers might use fault notification interfaces and flexible runtime systems to respond to node failures in an application-dependent fashion. Novel numerical algorithms or more stochastic computational approaches may be required to meet accuracy requirements in the face of undetectable soft errors. These ideas constituted an essential topic of the seminar. The goal of this Dagstuhl Seminar was to bring together a diverse group of scientists with expertise in exascale computing to discuss novel ways to make applications resilient against detected and undetected faults. In particular, participants explored the role that algorithms and applications play in the holistic approach needed to tackle this challenge. This article gathers a broad range of perspectives on the role of algorithms, applications, and systems in achieving resilience for extreme scale simulations. The ultimate goal is to spark novel ideas and encourage the development of concrete solutions for achieving such resilience holistically.
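
One classical quantitative handle on the checkpointing trade-off discussed above is the Young/Daly formula for a near-optimal checkpoint interval; the sketch below uses illustrative numbers of our own, not measurements from the seminar.

```python
# Young/Daly rule of thumb: with checkpoint cost C and mean time between
# failures MTBF, checkpointing every T ≈ sqrt(2 * C * MTBF) approximately
# minimizes the expected time lost to checkpoint overhead plus re-execution
# after a failure. The numbers below are purely illustrative.

import math

def young_daly_interval(checkpoint_cost_s, mtbf_s):
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

# A platform with a 1-hour system MTBF and checkpoints costing 60 s:
interval = young_daly_interval(60.0, 3600.0)   # checkpoint every ~11 min
# Fraction of the run spent writing checkpoints alone:
overhead = 60.0 / interval
```

As the MTBF shrinks at scale while the checkpoint cost grows with the memory footprint, the interval approaches the recovery time itself, which is precisely the regime where the naive scheme stops making progress.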

For more information on this work we refer to .

Progress in the accuracy of numerical weather and climate prediction greatly depends on the growth of the available computing power. As the number of cores in top computing facilities pushes into the millions, the increased average frequency of hardware and software failures forces users to review their algorithms and systems in order to protect simulations from breakdown. This report surveys hardware, application-level and algorithm-level resilience approaches of particular relevance to time-critical numerical weather and climate prediction systems. A selection of applicable existing strategies is analysed, featuring interpolation-restart and compressed checkpointing for the numerical schemes, and in-memory checkpointing, user-level failure mitigation and backup-based methods for the systems. Numerical examples showcase the performance of the techniques in addressing faults, with particular emphasis on iterative solvers for linear systems, a staple of atmospheric fluid flow solvers. The potential impact of these strategies is discussed in relation to the current development of numerical weather prediction algorithms and systems towards the exascale. Trade-offs between performance, efficiency and effectiveness of resiliency strategies are analysed, and some recommendations are outlined for future developments.

For more information on this work we refer to .

To achieve high performance and high energy efficiency on near-future exascale computing systems, three key technology gaps need to be bridged: energy efficiency and thermal control; extreme computation efficiency via hardware acceleration and new arithmetics; and methods and tools for the seamless integration of reconfigurable accelerators in heterogeneous HPC multi-node platforms. TEXTAROSSA aims at tackling these gaps through a co-design approach to heterogeneous HPC solutions, supported by the integration and extension of hardware and software IPs, programming models and tools derived from European research.

More information on this project can be found in .

When the discretization of an aeroacoustic physical model is based on the application of both the Finite Element Method (FEM) and the Boundary Element Method (BEM), it leads to coupled FEM/BEM linear systems combining sparse and dense parts. In this work, we propose and compare a set of implementation schemes relying on the coupling of the open-source sparse direct solver MUMPS with the proprietary direct solvers from Airbus Central R&T, i.e. the ScaLAPACK-like dense solver SPIDO and the hierarchical low-rank solver.
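
The structure of such coupled systems can be illustrated by a toy Schur complement scheme, where dense numpy stands in for both the sparse direct solver and the dense solvers; this is only one possible implementation scheme, sketched under our own simplifications.

```python
# Sketch of a coupled FEM/BEM solve: the block system
#   [[A_ff, A_fb], [A_bf, A_bb]] x = rhs
# has a sparse FEM block A_ff and a dense BEM block A_bb. Eliminating
# the FEM unknowns yields the dense Schur complement
#   S = A_bb - A_bf A_ff^{-1} A_fb,
# handled by a dense solver. All matrices here are small random stand-ins.

import numpy as np

rng = np.random.default_rng(0)
nf, nb = 8, 4                                   # FEM / BEM unknown counts
A_ff = np.eye(nf) * 4 + rng.standard_normal((nf, nf)) * 0.1   # "sparse" FEM part
A_fb = rng.standard_normal((nf, nb)) * 0.1
A_bf = rng.standard_normal((nb, nf)) * 0.1
A_bb = np.eye(nb) * 2 + rng.standard_normal((nb, nb)) * 0.1   # dense BEM part
b_f, b_b = rng.standard_normal(nf), rng.standard_normal(nb)

# Schur complement elimination of the FEM block:
Y = np.linalg.solve(A_ff, A_fb)                 # A_ff^{-1} A_fb
S = A_bb - A_bf @ Y                             # dense Schur complement
x_b = np.linalg.solve(S, b_b - A_bf @ np.linalg.solve(A_ff, b_f))
x_f = np.linalg.solve(A_ff, b_f - A_fb @ x_b)   # back-substitution

# Check against a monolithic solve of the full coupled system.
A = np.block([[A_ff, A_fb], [A_bf, A_bb]])
x_ref = np.linalg.solve(A, np.concatenate([b_f, b_b]))
err = np.linalg.norm(np.concatenate([x_f, x_b]) - x_ref)
```

In an actual implementation, `A_ff` would be factorized by the sparse solver, and `S` assembled and factorized by the dense (or hierarchical low-rank) solver; the comparison between such schemes is precisely the subject of this work.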

We are interested in the quantification of uncertainties in discretized elliptic partial differential equations with a random coefficient field. In sampling-based approaches, this relies on solving large numbers of symmetric positive definite (SPD) linear systems with different matrices. In particular, we consider the case in which these operators are sampled by Markov chain Monte Carlo, which leads to sequences of correlated matrices. We investigate recycling Krylov subspace strategies for the iterative solution of sequences of linear systems formed with such matrices. The linear systems are solved using initialized conjugate gradient (Init-CG) methods, where approximate eigenvectors of the previously sampled operator are used to set an initial guess, and deflated conjugate gradient (Def-CG) methods, where the Krylov subspace is augmented with these vectors. The following aspects of eigenvector approximation, and their effect on deflation and initialization, are investigated in this context: (i) the projection technique, and (ii) the refreshing strategy of the eigen-search space. First, our numerical experiments show that, when no preconditioner is used, these aspects only impact the convergence behavior of Def-CG at the early stages of the sampling sequence. Second, unlike sequences with multiple right-hand sides and a constant operator, our experiments with multiple matrices show that, even for highly correlated matrices, Init-CG does not reproduce the convergence behavior of Def-CG. Finally, the limits of deflation used as a means to compensate for the inefficiency of block-Jacobi (bJ) preconditioners are investigated. For small systems, using a bJ preconditioner while deflating with at least as many approximate eigenvectors as the number of bJ blocks achieves convergence behavior similar to PCG with a constant algebraic multigrid (AMG) preconditioner.
For larger systems, although the effect of deflation is improved by using the right refreshing strategy of the eigen-search space, the combination of deflation with bJ preconditioners does not scale as well as PCG with a constant AMG preconditioner based on the median realization of the coefficient field.
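
The initialization strategy can be sketched as follows; the matrix, the recycled subspace and all parameters are synthetic, and for illustration we recycle exact eigenvectors of the current operator rather than approximate ones from a previously sampled matrix.

```python
# Sketch of Init-CG: a recycled subspace W (spanning approximate
# eigenvectors) supplies a Galerkin initial guess
#   x0 = W (W^T A W)^{-1} W^T b,
# after which plain CG runs on a residual that is orthogonal to W.
# Everything below is a synthetic illustration, not a library API.

import numpy as np

def cg(A, b, x0, tol=1e-10, maxit=500):
    """Plain conjugate gradient; returns the solution and iteration count."""
    x = x0.copy()
    r = b - A @ x
    p = r.copy()
    it = 0
    while np.linalg.norm(r) > tol * np.linalg.norm(b) and it < maxit:
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)
        x = x + alpha * p
        r_new = r - alpha * Ap
        p = r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
        it += 1
    return x, it

rng = np.random.default_rng(1)
n = 100
Q = np.linalg.qr(rng.standard_normal((n, n)))[0]
# SPD operator with 10 large outlying eigenvalues.
eigs = np.concatenate([np.linspace(1.0, 10.0, 90), np.full(10, 1000.0)])
A = Q @ np.diag(eigs) @ Q.T
b = rng.standard_normal(n)

# Recycled subspace: here the exact outlying eigenvectors, for illustration;
# in the sampling setting W would come from the previous operator.
W = Q[:, 90:]
x0 = W @ np.linalg.solve(W.T @ A @ W, W.T @ b)     # Galerkin initial guess

x_cold, it_cold = cg(A, b, np.zeros(n))
x_init, it_init = cg(A, b, x0)
```

With exact recycled eigenvectors the outlying part of the spectrum is removed from the initial residual, so Init-CG converges at least as fast as cold-started CG; the point made in the paper is that with merely approximate vectors from a correlated previous operator this no longer matches the behavior of Def-CG.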

For more information on this work we refer to .

We are concerned with the iterative solution of linear systems with multiple right-hand sides available one group after another, including the case where there is a massive number (tens of thousands, say) of right-hand sides associated with a single matrix, so that they cannot all be solved at once but rather need to be split into chunks of possibly variable sizes. For such sequences of linear systems with multiple left- and right-hand sides, we develop a new recycling block generalized conjugate residual method with inner orthogonalization and inexact breakdown (IB-BGCRO-DR), which combines the subspace recycling technique of GCRO-DR [SIAM J. Sci. Comput., 28(5) (2006), pp. 1651–1674] with the inexact breakdown mechanism of IB-BGMRES [Linear Algebra Appl., 419 (2006), pp. 265–285], guaranteeing that the new algorithm can reuse spectral information both for subsequent cycles and for the remaining linear systems to be solved. A related variant, IB-BFGCRO-DR, suited to flexible preconditioning, is devised to cope with the constraints of some applications, while also enabling mixed-precision computation, which provides advantages in speed and memory usage over double precision, as well as from the perspective of emerging computing units such as GPUs.

For more information on this work we refer to , which is a preliminary version of a paper recently accepted in SIMAX.

Low-rank compression techniques are very promising for reducing the memory footprint and execution time of a large spectrum of linear solvers. Sparse direct supernodal approaches are one of these techniques. However, despite providing very good scalability and reducing the memory footprint, they suffer from an important flop overhead in their unstructured low-rank updates. As a consequence, the execution time is not improved as much as expected. In this paper, we study a solution to improve low-rank compression techniques in sparse supernodal solvers. The proposed method tackles the extra cost of the low-rank updates by identifying the blocks that have poor compression rates. We show that the fill-in levels of the graph-based block incomplete LU factorization can be used in a new context to identify most of these non-compressible blocks at low cost. This identification makes it possible to postpone the low-rank compression step, trading a small amount of extra memory for a better time to solution. The solution is validated within the library on a large set of application matrices. It demonstrates sequential and multithreaded speedups of up to 8.5×, for a small memory overhead of less than 1.49× with respect to the original version.
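
The underlying compression decision can be illustrated with a plain truncated SVD; the rank criterion and thresholds below are simplifications of ours, not the solver's actual fill-in-level heuristic.

```python
# A dense block B is replaced by a rank-r product U @ V when the
# truncated SVD reaches the target accuracy with r small enough to save
# memory; otherwise the block is kept dense. Blocks where compression
# fails this test are the "non-compressible" ones that the fill-in-level
# criterion tries to predict cheaply, without computing any SVD.

import numpy as np

def compress_block(B, tol):
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    # Smallest rank r whose truncation error (spectral norm) is below tol:
    r = len(s) - int(np.searchsorted(s[::-1], tol))
    m, n = B.shape
    if r * (m + n) >= m * n:            # no memory gain: keep dense
        return None
    return U[:, :r] * s[:r], Vt[:r, :]  # rank-r factors

rng = np.random.default_rng(2)
m = n = 64
# A numerically low-rank block (exact rank 8)...
low = rng.standard_normal((m, 8)) @ rng.standard_normal((8, n))
# ...and a full-rank block that should be left uncompressed.
full = rng.standard_normal((m, n))

lr = compress_block(low, 1e-8)
dense = compress_block(full, 1e-8)
```

The point of the paper is to avoid paying the SVD (and the expensive unstructured low-rank updates) on blocks like `full` in the first place, by predicting them from the symbolic factorization.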

For more information on this work we refer to .

In high dimension, solving an eigenvalue problem raises several issues related to the curse of dimensionality, especially when only the eigenvalues in a given interval are desired. To overcome these difficulties, we consider the FEAST algorithm, which was developed for solving a Hermitian eigenproblem inside a given interval and is based on a contour integration that projects the matrix pencil onto the subspace associated with the eigenpairs in the desired interval. We also consider the tensor-train (TT) decomposition of the operators and the eigenvectors to overcome the curse of dimensionality and reduce the required storage. In this paper, we extend the FEAST algorithm to match the TT representation of operators and vectors, forming an algorithm that solves eigenvalue problems in an interval using operations in TT format. The proposed algorithm is applied to some high-dimensional problems, including the Laplacian operator and the vibrational Hamiltonian, and shows high efficiency and accuracy. We validate the results by comparison with an analytical solution for the Laplacian operator, and with an existing method in TT format for computing certain minimal eigenpairs.
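
A bare-bones dense FEAST iteration (without the tensor-train machinery) may help fix ideas; the quadrature rule, subspace size and test matrix are illustrative choices of ours, and the TT variant replaces every dense operation below by its TT-format counterpart.

```python
# FEAST sketch: the spectral projector onto the eigenvalues inside
# [lmin, lmax] is approximated by a quadrature of the resolvent
# (zI - A)^{-1} along a circle enclosing the interval; a Rayleigh-Ritz
# step then extracts the eigenpairs from the filtered subspace.

import numpy as np

def feast(A, lmin, lmax, m0=6, n_quad=16, iters=3):
    n = A.shape[0]
    c, rad = (lmin + lmax) / 2.0, (lmax - lmin) / 2.0
    rng = np.random.default_rng(3)
    Y = rng.standard_normal((n, m0))            # random starting subspace
    for _ in range(iters):
        Q = np.zeros((n, m0))
        for k in range(n_quad):                 # midpoint rule on the circle
            theta = 2 * np.pi * (k + 0.5) / n_quad
            z = c + rad * np.exp(1j * theta)
            Q += np.real(rad * np.exp(1j * theta)
                         * np.linalg.solve(z * np.eye(n) - A, Y)) / n_quad
        Q, _ = np.linalg.qr(Q)                  # orthonormalize the subspace
        Aq = Q.T @ A @ Q                        # Rayleigh-Ritz projection
        w, V = np.linalg.eigh(Aq)
        Y = Q @ V
    inside = (w >= lmin) & (w <= lmax)
    return w[inside], Y[:, inside]

A = np.diag(np.arange(1.0, 21.0))               # eigenvalues 1..20
vals, vecs = feast(A, 4.5, 8.5)                 # expect {5, 6, 7, 8}
```

In the TT extension, the inner solves with `zI - A` become TT linear solves and the subspace is kept in TT format, which is where the storage reduction comes from.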

The scientific report describing this work should be published in early 2022.

In this work, we studied the solution of linear systems defined in tensor spaces of different dimensions. Specifically, we considered nested subspace techniques in tensor format for the solution of 3d (PDEs in a 3-dimensional space), 4d (parametric PDEs, or 3D PDEs with multiple right-hand sides), 5d (time-dependent parametric PDEs) linear problems. In particular, we have derived bounds to evaluate the quality of the solutions computed in low-rank tensor format compared to their more classical linear algebra counterparts.

The scientific report describing this work should be published in early 2022.

We propose a new method to estimate plant biodiversity with the Rényi and Rao indexes, through the so-called High Order Singular Value Decomposition (HOSVD) of tensors. Starting from NASA multispectral images, we evaluate biodiversity and compare the original biodiversity estimates with those obtained via the HOSVD compression methods for big data. Our strategy turns out to be extremely powerful in terms of storage memory and precision of the outcome. The results obtained are promising enough to support the efficiency of our method in the ecological framework.
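
The HOSVD compression step can be sketched in a few lines; the tensor below is synthetic and the function names are ours.

```python
# Minimal HOSVD sketch: truncated SVDs of the mode unfoldings give the
# factor matrices, and the core is the multilinear projection of the
# original tensor onto them. This mirrors the compression applied to
# the multispectral image tensors, here on a small synthetic example.

import numpy as np

def unfold(T, mode):
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def hosvd(T, ranks):
    U = [np.linalg.svd(unfold(T, k), full_matrices=False)[0][:, :r]
         for k, r in enumerate(ranks)]
    core = T
    for k, Uk in enumerate(U):                  # project each mode
        core = np.moveaxis(
            np.tensordot(Uk.T, np.moveaxis(core, k, 0), axes=1), 0, k)
    return core, U

def reconstruct(core, U):
    T = core
    for k, Uk in enumerate(U):                  # expand each mode back
        T = np.moveaxis(
            np.tensordot(Uk, np.moveaxis(T, k, 0), axes=1), 0, k)
    return T

rng = np.random.default_rng(4)
# A 20x20x20 tensor with exact multilinear rank (3, 3, 3).
G = rng.standard_normal((3, 3, 3))
A, B, C = (rng.standard_normal((20, 3)) for _ in range(3))
T = np.einsum('abc,ia,jb,kc->ijk', G, A, B, C)

core, U = hosvd(T, (3, 3, 3))
err = np.linalg.norm(reconstruct(core, U) - T) / np.linalg.norm(T)
```

For a tensor of exact multilinear rank the reconstruction is exact; on real multispectral data the truncation ranks control the memory/precision trade-off mentioned above.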

For more information on this work we refer to .

In this work we propose an extension of Correspondence Analysis (CA) to tensors through High Order Singular Value Decomposition (HOSVD) from a geometric viewpoint. Correspondence analysis is a well-known tool, developed from principal component analysis, for studying contingency tables. Different algebraic extensions of CA to multi-way tables have been proposed over the years, nevertheless neglecting its geometric meaning. Relying on the Tucker model and the HOSVD, we propose a direct way to associate with each tensor mode a point cloud. We prove that the point clouds are related to each other. Specifically using the CA metrics we show that the barycentric relation is still true in the tensor framework. Finally two data sets are used to underline the advantages and the drawbacks of our strategy with respect to the classical matrix approaches.

For more information on this work we refer to .

In the context of a parallel plasma physics simulation code, we perform a qualitative performance study of two natural candidates for the parallel solution of 3D Poisson problems, namely multigrid and domain decomposition. We selected one representative of each of these numerical techniques, implemented in state-of-the-art parallel packages, and show that, depending on the regime used in terms of number of unknowns per computing core, the best alternative in terms of time to solution varies. These results show the interest of having both types of numerical solvers integrated in a simulation code that can be used in very different configurations, in terms of selected models, problem sizes and parallel computing platforms.

For more information on this work we refer to .

JOREK is a massively parallel fully implicit non-linear extended magneto-hydrodynamic (MHD) code for realistic tokamak X-point plasmas. It has become a widely used versatile simulation code for studying large-scale plasma instabilities and their control and is continuously developed in an international community with strong involvements in the European fusion research programme and ITER organization. This article gives a comprehensive overview of the physics models implemented, numerical methods applied for solving the equations and physics studies performed with the code.

For more information on this work we refer to .

The training phase of Deep Neural Networks has become an important source of computing resource usage, and the resulting volume of computation makes it crucial to perform it efficiently on parallel architectures. Data parallelism is the most widely used method, but it requires replicating the network weights on all processors and performing collective communications of the network weights. In this context, model parallelism is an attractive alternative, in which the different layers of the network are distributed over the computing processors. Indeed, it is expected to better distribute the weights (to cope with memory problems), and it eliminates the need for large collective communications, since only forward activations are communicated. However, to be efficient, it must be combined with a pipelined approach, which in turn induces new memory costs. We have thus worked to formalize pipelined model parallelism as a scheduling problem, to establish its complexity, and to analyze the consequences of the assumptions that are typically made in practical solutions such as Pipedream.

More information on this work can be found in .

Rematerialization and offloading are two well-known strategies to save memory during the training phase of deep neural networks, allowing data scientists to consider larger models, batch sizes or higher-resolution data. Rematerialization trades memory for computation time, whereas offloading trades memory for data movements. As these two resources are independent, it is appealing to consider the simultaneous combination of both strategies to save even more memory. We precisely model the costs and constraints corresponding to Deep Learning frameworks such as PyTorch or TensorFlow, we propose optimal algorithms to find a valid sequence of memory-constrained operations, and finally we evaluate the performance of the proposed algorithms on realistic networks and computation platforms. Our experiments show that the possibility to offload can remove one third of the overhead of rematerialization, and that together they can reduce the memory used for activations by a factor of 4 to 6, with an overhead below 20%.
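
A toy cost model conveys the rematerialization trade-off; the accounting below is a deliberate simplification of ours, with illustrative numbers, not the actual cost model optimized in the Rotor software.

```python
# Toy model of rematerialization on a chain of L layers: keeping every
# activation costs O(L) memory, whereas checkpointing every s layers
# keeps only L/s activations plus one live segment of s during the
# backward pass, at the price of recomputing each segment once.

import math

def chain_costs(L, s):
    peak_mem = math.ceil(L / s) + s     # checkpoints + one live segment
    recompute = L                       # each layer recomputed once
    return peak_mem, recompute

L = 100
store_all = (L, 0)                          # baseline: no recomputation
ckpt = chain_costs(L, int(math.sqrt(L)))    # classic s = sqrt(L) choice
```

With s = sqrt(L) the peak memory drops from L to about 2·sqrt(L) for a single extra forward pass; offloading attacks the same peak by moving activations to host memory instead of recomputing them, which is why combining both pays off.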

More information on this work can be found in .

We have proposed READYS, a reinforcement learning algorithm for the dynamic scheduling of computations modeled as Directed Acyclic Graphs (DAGs). Our goal is to develop a scheduling algorithm in which allocation and scheduling decisions are made at runtime, based on the state of the system, as is done in runtime systems such as StarPU or PaRSEC. Reinforcement learning is a natural candidate to achieve this task, since its general principle is to build step by step a strategy that, given the state of the system (in our case, the state of the resources and a view of the ready tasks and their successors), makes a decision that optimizes a global criterion. Moreover, the use of reinforcement learning is natural in a context where the durations of tasks (and communications) are stochastic. READYS combines Graph Convolutional Networks (GCN) with an Actor-Critic algorithm (A2C): it builds an adaptive representation of the scheduling problem on the fly and learns a scheduling strategy aiming at minimizing the makespan. A crucial point is that READYS builds a general scheduling strategy that is neither limited to one specific application or task graph nor to one particular problem size, and that can be used to schedule any DAG. We focus on different types of task graphs originating from linear algebra factorization kernels (CHOLESKY, LU, QR), and we consider heterogeneous platforms made of a few CPUs and GPUs. We first analyze the performance of READYS when learning is performed on a given (platform, kernel, problem size) combination. Using simulations, we show that the scheduling agent obtains performance very similar or even superior to algorithms from the literature, and that it is especially powerful when the scheduling environment contains a lot of uncertainty. We additionally demonstrate that our agent exhibits very promising generalization capabilities.
To the best of our knowledge, this is the first paper showing that reinforcement learning can really be used for dynamic DAG scheduling on heterogeneous resources.
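
For contrast with the learned policy, a static greedy baseline of the kind READYS is compared against can be sketched as follows; the task durations, the DAG and the platform are invented purely for illustration.

```python
# Greedy earliest-finish-time baseline for heterogeneous DAG scheduling:
# each ready task is placed on the resource (CPU or GPU) that would
# finish it earliest, given per-resource task durations and precedence
# constraints. All numbers are illustrative.

# tasks: name -> (cpu_time, gpu_time); deps: name -> set of predecessors
tasks = {"A": (4, 1), "B": (2, 2), "C": (4, 1), "D": (3, 1)}
deps = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}}

def greedy_schedule(tasks, deps, n_cpu=1, n_gpu=1):
    free = [0.0] * (n_cpu + n_gpu)              # resource availability
    kind = ["cpu"] * n_cpu + ["gpu"] * n_gpu
    finish, done = {}, set()
    while len(done) < len(tasks):
        ready = [t for t in tasks if t not in done and deps[t] <= done]
        # schedule the ready task that can start earliest
        t = min(ready, key=lambda t:
                max((finish[p] for p in deps[t]), default=0.0))
        start_min = max((finish[p] for p in deps[t]), default=0.0)
        # pick the resource giving the earliest finish time for this task
        best = min(range(len(free)), key=lambda r:
                   max(free[r], start_min)
                   + tasks[t][0 if kind[r] == "cpu" else 1])
        dur = tasks[t][0 if kind[best] == "cpu" else 1]
        finish[t] = max(free[best], start_min) + dur
        free[best] = finish[t]
        done.add(t)
    return max(finish.values())

makespan = greedy_schedule(tasks, deps)
```

Such a heuristic is myopic and assumes deterministic durations; the point of READYS is to learn, from the same state information, decisions that remain good when durations are stochastic.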

More information on this work can be found in .

Some of the ongoing PhD theses are developed within bilateral contracts with industry for PhD advisory:

Luc Giraud is a member of . The eleventh Gene Golub SIAM Summer School was entitled "Theory and Practice of Deep Learning".

SBAC-PAD'21 (O. Beaumont, Program Co-Chair)

IPDPS'21 (O. Beaumont, O. Coulaud, L. Eyraud-Dubois, M. Faverge, A. Guermouche), PDSEC'21 (O. Coulaud, L. Giraud), SC'21 (O. Beaumont)

Luc Giraud is a member of the board on Modeling, Simulation and Data Analysis of the . He is also a member of the scientific council of the (Laboratoire de Mathématiques Appliquées à l'Aéronautique et au Spatial).

He also acted as expert for the Czech Science Foundation.

Pierre Ramet has been a member of the CDT (Technological Development Commission) at Inria Bordeaux since 2015.