Over the last few decades, there have been innumerable science,
engineering and societal breakthroughs enabled by the development of
High Performance Computing (HPC) applications, algorithms and
architectures.
These powerful tools have provided researchers with the ability to
computationally find efficient solutions for some of the most
challenging scientific questions and problems in medicine and biology,
climatology, nanotechnology, energy and environment.
It is admitted today that numerical simulation is the third pillar
for the development of scientific discovery at the same level as
theory and experimentation.
Numerous reports and papers also confirm that very high performance
simulation will open new opportunities not only for research but also
for a large spectrum of industrial sectors.

An important force which has continued to drive HPC has been to focus on frontier milestones which consist in technical goals that symbolize the next stage of progress in the field. In the 1990s, the HPC community sought to achieve computing at a teraflop rate and and exascale machines are now expected in the next few months/years.

For application codes to sustain petaflops and more in the next few years, hundreds of thousands of processor cores or more are needed, regardless of processor technology. Currently, a few HPC simulation codes easily scale to this regime and major algorithms and codes development efforts are critical to achieve the potential of these new systems. Scaling to exaflop involves improving physical models, mathematical modeling, super scalable algorithms that will require paying particular attention to acquisition, management and visualization of huge amounts of scientific data.

In this context, the purpose of the HiePACS project is to contribute performing
efficiently frontier simulations arising from challenging academic and industrial research.
The solution of these challenging problems require a multidisciplinary approach
involving applied mathematics, computational and computer sciences.
In applied mathematics, it essentially involves advanced numerical
schemes.
In computational science, it involves massively parallel computing and the
design of highly scalable algorithms and codes to be executed on
emerging hierarchical many-core, possibly heterogeneous, platforms.
Through this approach, HiePACS intends to contribute to all steps that
go from the design of new high-performance more scalable, robust and more
accurate numerical schemes to the optimized implementations of the
associated algorithms and codes on very high performance
supercomputers. This research will be conduced on close collaboration
in particular with European and US initiatives
and in the framework of EuroHPC collaborative projects.

The methodological part of HiePACS covers several topics.
First, we address generic studies concerning massively parallel computing, the
design of high-end performance algorithms and software to be executed on
future extreme scale platforms.
Next, several research prospectives in scalable parallel linear algebra techniques
are addressed, ranging from dense direct, sparse direct, iterative and hybrid approaches for large linear systems. We are also interested in the general problem of minimizing memory consumption
and data movements, by changing algorithms and possibly performing extra computations,
in particular in the context of Deep Neural Networks.
Then we consider research on N-body interaction computations
based on efficient parallel fast multipole methods and finally, we
address research tracks related to the algorithmic challenges
for complex code couplings in multiscale/multiphysic simulations.

Currently, we have one major multiscale application that is in material physics.
We contribute to all steps of the design of the parallel simulation tool.
More precisely, our applied mathematics skill will contribute to the
modeling and our advanced numerical schemes will help in the design
and efficient software implementation for very large parallel multiscale simulations.
Moreover, the robustness and efficiency of our algorithmic research in linear
algebra are validated through industrial and academic collaborations with
different partners involved in various application fields.
Finally, we are also involved in a few collaborative initiatives in various application domains in a
co-design like framework.
These research activities are conducted in a wider multi-disciplinary context with colleagues in other
academic or industrial groups where our contribution is related to our expertises.
Not only these collaborations enable our expertise to have a stronger
impact in various application domains through the promotion of advanced algorithms,
methodologies or tools, but in return they open new avenues for research in the continuity of our core research activities.

Thanks to the two Inria collaborative agreements such as with Airbus/Conseil Régional Grande Aquitaine
and with CEA, we have joint research efforts in a co-design framework enabling efficient and effective
technological transfer towards industrial R&D.
Furthermore, thanks to the past associate team FastLA we contribute with world leading groups at Berkeley
National Lab and Stanford University to
the design of fast numerical solvers and their parallel implementations.

Our high performance software packages are
integrated in several academic or industrial complex
codes and are validated on very large scale simulations. For all our
software developments, we use first the experimental platform PlaFRIM, the various large
parallel platforms available through GENCI in France
(CCRT, CINES and IDRIS Computational Centers), and next the high-end
parallel platforms that will be available via European and US
initiatives or projects such as PRACE.

The methodological component of HiePACS concerns the expertise for
the design as well as the efficient and scalable implementation of
highly parallel numerical algorithms to perform frontier simulations.
In order to address these computational challenges a hierarchical
organization of the research is considered.
In this bottom-up approach, we first consider in
Section 3.2
generic topics concerning high performance computational science.
The activities described in this section are transversal to the
overall project and their outcome will support all the other research
activities at various levels in order to ensure the parallel
scalability of the algorithms.
The aim of this activity is not to study general purpose solution but
rather to address these problems in close relation with specialists of
the field in order to adapt and tune advanced approaches in our
algorithmic designs.
The next activity, described in
Section 3.3, is
related to the study of parallel linear algebra techniques that currently
appear as promising approaches to tackle huge problems on extreme scale platforms.
We highlight the linear problems (linear systems or eigenproblems)
because they are in many large scale applications the main
computational intensive numerical kernels and often the main
performance bottleneck.
These parallel numerical techniques will be the basis of both academic and industrial
collaborations, some are described in
Section 4.1,
but will also be closely related to some functionalities developed in the
parallel fast multipole activity described in
Section 3.4. Finally, as
the accuracy of the physical models increases, there is a real need
to go for parallel efficient algorithm implementation for
multiphysics and multiscale modeling in particular in the context of
code coupling.
The challenges associated with this activity will be addressed in the
framework of the activity described in
Section 3.5.

Currently, we have one major application (see Section 4.1) that is in material physics. We will collaborate to all steps of the design of the parallel simulation tool. More precisely, our applied mathematics skill will contribute to the modelling, our advanced numerical schemes will help in the design and efficient software implementation for very large parallel simulations. We also participate to a few co-design actions in close collaboration with some applicative groups. The objective of this activity is to instantiate our expertise in fields where they are critical for designing scalable simulation tools. We refer to Section 4.2 for a detailed description of these activities.

The research directions proposed in

are strongly influenced by both the applications we are studying and the architectures that we target (i.e., massively parallel heterogeneous many-core architectures, ...). Our main goal is to study the methodology needed to efficiently exploit the new generation of high-performance computers with all the constraints that it induces. To achieve this high-performance with complex applications we have to study both algorithmic problems and the impact of the architectures on the algorithm design.

From the application point of view, the project will be interested in multiresolution, multiscale and hierarchical approaches which lead to multi-level parallelism schemes. This hierarchical parallelism approach is necessary to achieve good performance and high-scalability on modern massively parallel platforms. In this context, more specific algorithmic problems are very important to obtain high performance. Indeed, the kind of applications we are interested in are often based on data redistribution for example (e.g., code coupling applications). This well-known issue becomes very challenging with the increase of both the number of computational nodes and the amount of data. Thus, we have both to study new algorithms and to adapt the existing ones. In addition, some issues like task scheduling have to be restudied in this new context. It is important to note that the work developed in this area will be applied for example in the context of code coupling (see Section 3.5).

Considering the complexity of modern architectures like
massively parallel architectures or
new generation heterogeneous multicore architectures, task
scheduling becomes a challenging problem which is central to
obtain a high efficiency. With the recent addition of colleagues
from the scheduling community (O. Beaumont and L. Eyraud-Dubois), the
team is better equipped than ever to design scheduling algorithms and
models specifically tailored to our target problems. It is important
to note that this
topic is strongly linked to the underlying programming
model. Indeed, considering multicore and heterogeneous architectures, it has appeared, in
the last five years, that the best programming model is an approach
mixing multi-threading within computational nodes and message
passing between them. In the last five years, a lot of work has been
developed in the high-performance computing community to understand
what is critic to efficiently exploit massively multicore
platforms that will appear in the near future. It appeared that the
key for the performance is firstly the granularity of the
computations. Indeed, in such platforms the granularity of the parallelism
must be small so that we can feed all the computing units with a
sufficient amount of work. It is thus very crucial for us to design new high
performance tools for scientific computing in this new context. This
will be developed in the context of our solvers, for example, to adapt to
this new parallel scheme. Secondly, the larger the number of cores
inside a node, the more complex the memory hierarchy. This remark
impacts the behavior of the algorithms within the node. Indeed, on
this kind of platforms, NUMA effects will be more and more
problematic. Thus, it is very important to study and design
data-aware algorithms which take into account the affinity between
computational threads and the data they access. This is particularly
important in the context of our high-performance tools. Note that
this work has to be based on an intelligent cooperative underlying
run-time (like the tools developed by the Inria
STORM Project-Team) which allows a fine management of data
distribution within a node.

Another very important issue concerns high-performance computing
using “heterogeneous” resources within a computational
node. Indeed, with the deployment of the GPU and the use of
more specific co-processors, it is
important for our algorithms to efficiently exploit these new type
of architectures. To adapt our algorithms and tools to these
accelerators, we need to identify what can be done on the GPU
for example and what cannot. Note that recent results in the field
have shown the interest of using both regular cores and GPU to
perform computations. Note also that in opposition to the case of
the parallelism granularity needed by regular multicore
architectures, GPU requires coarser grain parallelism. Thus,
making both GPU and regular cores work all together will lead
to two types of tasks in terms of granularity.
This represents a challenging problem especially in terms of scheduling.
From this perspective, we investigate
new approaches for composing parallel applications within a runtime
system for heterogeneous platforms.

In the context of scaling up, and particularly in the context of minimizing energy consumption, it is generally acknowledged that the solution lies in the use of heterogeneous architectures, where each resource is particularly suited to specific types of tasks, and in a fine control at the algorithmic level of data movements and the trade-offs to be made between computation and communication. In this context, we are particularly interested in the optimization of the training phase of deep convolutional neural networks which consumes a lot of memory and for which it is possible to exchange computations for data movements and memory occupation. We are also interested in the complexity introduced by resource heterogeneity itself, both from a theoretical point of view on the complexity of scheduling problems and from a more practical point of view on the implementation of specific kernels in dense or sparse linear algebra.

In order to achieve an advanced knowledge concerning the design of
efficient computational kernels to be used on our high performance
algorithms and codes, we will develop research activities first on
regular frameworks before extending them to more irregular and complex
situations.
In particular, we will work first on optimized dense linear algebra
kernels and we will use them in our more complicated direct and hybrid
solvers for sparse linear algebra and in our fast multipole algorithms for
interaction computations.
In this context, we will participate to the development of those kernels
in collaboration with groups specialized in dense linear algebra.
In particular, we intend develop a strong collaboration with the group of Jack Dongarra
at the University of Tennessee and collaborating research groups. The objectives will be to
develop dense linear algebra algorithms and libraries for multicore
architectures in the context the PLASMA project
and for GPU and hybrid multicore/GPU architectures in the context of the
MAGMA project.
A new solver has emerged from the associate team,
Chameleon. While PLASMA and MAGMA focus on multicore and GPU
architectures, respectively, Chameleon makes the most out of
heterogeneous architectures thanks to task-based dynamic runtime systems.

A more prospective objective is to study the resiliency in the
context of large-scale scientific applications for massively
parallel architectures. Indeed, with the increase of the number of
computational cores per node, the probability of a hardware crash on
a core or of a memory corruption is dramatically increased. This represents a crucial problem
that needs to be addressed. However, we will only study it at the
algorithmic/application level even if it needed lower-level
mechanisms (at OS level or even hardware level). Of course, this
work can be performed at lower levels (at operating system) level for
example but we do believe that handling faults at the application
level provides more knowledge about what has to be done (at
application level we know what is critical and what is not). The
approach that we will follow will be based on the use of a
combination of fault-tolerant implementations of the run-time
environments we use (like for example ULFM) and
an adaptation of our algorithms to try to manage this kind of
faults. This topic represents a very long range objective which
needs to be addressed to guaranty the robustness of our solvers and
applications.

Finally, it is important to note that the main goal of HiePACS is to
design tools and algorithms that will be used within
complex simulation frameworks on next-generation parallel
machines. Thus, we intend with our partners to use the proposed
approach in complex scientific codes and to validate them within
very large scale simulations as well as designing parallel solution in co-design collaborations.

Starting with the developments of basic linear algebra kernels tuned for
various classes of computers, a significant knowledge on
the basic concepts for implementations on high-performance scientific computers has been accumulated.
Further knowledge has been acquired through the design of more sophisticated linear algebra algorithms
fully exploiting those basic intensive computational kernels.
In that context, we still look at the development of new computing platforms and their associated programming
tools.
This enables us to identify the possible bottlenecks of new computer architectures
(memory path, various level of caches, inter processor or node network) and to propose
ways to overcome them in algorithmic design.
With the goal of designing efficient scalable linear algebra solvers for large scale applications, various
tracks will be followed in order to investigate different complementary approaches.
Sparse direct solvers have been for years the methods of choice for solving linear systems of equations,
it is nowadays admitted that classical approaches are not scalable neither from a computational complexity
nor from a memory view point for large problems such as those arising from the discretization of large 3D PDE problems.
We will continue to work on sparse direct solvers on the one hand to make sure they fully benefit from most advanced computing platforms
and on the other hand to attempt to reduce their memory and computational costs for some classes of problems where
data sparse ideas can be considered.
Furthermore, sparse direct solvers are a key building boxes for the
design of some of our parallel algorithms such as the hybrid solvers described in the sequel of this section.
Our activities in that context will mainly address preconditioned Krylov subspace methods; both components,
preconditioner and Krylov solvers, will be investigated.
In this framework, and possibly in relation with the research activity on fast multipole, we intend to study how emerging

For the solution of large sparse linear systems, we design numerical schemes and software packages for direct and hybrid parallel solvers. Sparse direct solvers are mandatory when the linear system is very ill-conditioned; such a situation is often encountered in structural mechanics codes, for example. Therefore, to obtain an industrial software tool that must be robust and versatile, high-performance sparse direct solvers are mandatory, and parallelism is then necessary for reasons of memory capability and acceptable solution time. Moreover, in order to solve efficiently 3D problems with more than 50 million unknowns, which is now a reachable challenge with new multicore supercomputers, we must achieve good scalability in time and control memory overhead. Solving a sparse linear system by a direct method is generally a highly irregular problem that induces some challenging algorithmic problems and requires a sophisticated implementation scheme in order to fully exploit the capabilities of modern supercomputers.

New supercomputers incorporate many microprocessors which are
composed of one or many computational cores. These new architectures
induce strongly hierarchical topologies. These are called NUMA
architectures. In the context of distributed NUMA architectures,
in collaboration with the Inria STORM team, we study
optimization strategies to improve the scheduling of
communications, threads and I/O.
We have developed dynamic scheduling designed for NUMA architectures in the
PaStiX solver. The data structures of the solver, as well as the
patterns of communication have been modified to meet the needs of
these architectures and dynamic scheduling. We are also interested in
the dynamic adaptation of the computation grain to use efficiently
multi-core architectures and shared memory. Experiments on several
numerical test cases have been performed to prove the efficiency of
the approach on different architectures.
Sparse direct solvers such as PaStiX are currently limited by their
memory requirements and computational cost. They are competitive for
small matrices but are often less efficient than iterative methods for
large matrices in terms of memory. We are currently accelerating the dense algebra
components of direct solvers using block low-rank compression techniques.

In collaboration with the ICL team from the University of Tennessee,
and the STORM team from Inria, we are evaluating the way to replace
the embedded scheduling driver of the PaStiX solver by one of the
generic frameworks, PaRSEC or StarPU, to execute the task
graph corresponding to a sparse factorization.
The aim is to
design algorithms and parallel programming models for implementing
direct methods for the solution of sparse linear systems on emerging
computer equipped with GPU accelerators. More generally, this work
will be performed in the context of
the ANR SOLHARIS project which
aims at designing high performance sparse direct solvers for modern
heterogeneous systems. This ANR project involves several groups working
either on the sparse linear solver aspects (HiePACS and ROMA from
Inria and APO from IRIT), on runtime systems (STORM from Inria) or
scheduling algorithms (HiePACS and ROMA from Inria). The results of
these efforts will be validated in the applications provided by the
industrial project members, namely CEA-CESTA and Airbus Central R & T.

One route to the parallel scalable solution of large sparse linear systems in parallel scientific computing is the use of hybrid methods that hierarchically combine direct and iterative methods. These techniques inherit the advantages of each approach, namely the limited amount of memory and natural parallelization for the iterative component and the numerical robustness of the direct part. The general underlying ideas are not new since they have been intensively used to design domain decomposition techniques; those approaches cover a fairly large range of computing techniques for the numerical solution of partial differential equations (PDEs) in time and space. Generally speaking, it refers to the splitting of the computational domain into sub-domains with or without overlap. The splitting strategy is generally governed by various constraints/objectives but the main one is to express parallelism. The numerical properties of the PDEs to be solved are usually intensively exploited at the continuous or discrete levels to design the numerical algorithms so that the resulting specialized technique will only work for the class of linear systems associated with the targeted PDE.

In that context, we continue our effort on the design of algebraic non-overlapping domain decomposition techniques
that rely on the solution of a Schur complement system defined on the interface introduced by the partitioning of the
adjacency graph of the sparse matrix associated with the linear system.
Although it is better conditioned than the original system the Schur complement needs to be precondition to be
amenable to a solution using a Krylov subspace method.
Different hierarchical preconditioners will be considered, possibly multilevel, to improve the numerical behaviour
of the current approaches implemented in our software library MaPHyS. This activity will be developed further developped in
the H2020 EoCoE2 project.
In addition to this numerical studies, advanced parallel implementation will be developed that will involve close
collaborations between the hybrid and sparse direct activities.

Preconditioning is the main focus of the two activities described above. They aim at speeding up the convergence of a Krylov subspace method that is the complementary component involved in the solvers of interest for us. In that framework, we believe that various aspects deserve to be investigated; we will consider the following ones:

Many eigensolvers also rely on Krylov subspace techniques. Naturally some links exist between the Krylov subspace linear solvers and the Krylov subspace eigensolvers. We plan to study the computation of eigenvalue problems with respect to the following two different axes:

In this research project, we are interested in the design of new advanced techniques to solve large mixed dense/sparse linear systems, the extensive comparison of these new approaches to the existing ones, and the application of these innovative ideas on realistic industrial test cases in the domain of aeroacoustics (in collaboration with Airbus Central R & T).

In most scientific computing applications considered nowadays as computational challenges (like biological and material systems, astrophysics or electromagnetism), the introduction of hierarchical methods based on an octree structure has dramatically reduced the amount of computation needed to simulate those systems for a given accuracy. For instance, in the N-body problem arising from these application fields, we must compute all pairwise interactions among N objects (particles, lines, ...) at every timestep. Among these methods, the Fast Multipole Method (FMM) developed for gravitational potentials in astrophysics and for electrostatic (coulombic) potentials in molecular simulations solves this N-body problem for any given precision with

runtime complexity against

for the direct computation.

The potential field is decomposed in a near field part, directly
computed, and a far field part approximated thanks to multipole and
local expansions.
We introduced a matrix formulation of the
FMM that exploits the cache hierarchy on a processor through the Basic
Linear Algebra Subprograms (BLAS). Moreover, we developed a parallel
adaptive version of the FMM algorithm for heterogeneous particle
distributions, which is very efficient on parallel clusters of SMP
nodes. Finally on such computers, we developed the first hybrid
MPI-thread algorithm, which enables to reach better parallel
efficiency and better memory scalability.
We plan to work on the following points in HiePACS .

Nowadays, the high performance computing community is examining
alternative architectures that address the limitations of modern
cache-based designs. GPU (Graphics Processing Units) and the Cell
processor have thus already been used in astrophysics and in molecular
dynamics. The Fast Mutipole Method has also been implemented on GPU.
We intend to examine the
potential of using these forthcoming processors as a building block
for high-end parallel computing in N-body calculations. More
precisely, we want to take advantage of our specific underlying BLAS routines
to obtain an efficient and easily portable FMM for these new architectures.
Algorithmic issues such as dynamic load balancing among heterogeneous
cores will also have to be solved in order to gather all the available
computation power.
This research action will be conduced on close connection with the
activity described in
Section 3.2.

In many applications arising from material physics or astrophysics, the distribution of the data is highly non uniform and the data can grow between two time steps. As mentioned previously, we have proposed a hybrid MPI-thread algorithm to exploit the data locality within each node. We plan to further improve the load balancing for highly non uniform particle distributions with small computation grain thanks to dynamic load balancing at the thread level and thanks to a load balancing correction over several simulation time steps at the process level.

The engine that we develop will be extended to new potentials arising
from material physics such as those used in dislocation
simulations. The interaction between dislocations is long ranged
(

The boundary element method (BEM) is a well known
solution of boundary value problems appearing in various fields of
physics. With this approach, we only have to solve an integral
equation on the boundary. This implies an interaction that decreases in space, but results
in the solution of a dense linear system with

Many important physical phenomena in material physics and climatology are inherently complex applications. They often use multi-physics or multi-scale approaches, which couple different models and codes. The key idea is to reuse available legacy codes through a coupling framework instead of merging them into a stand-alone application. There is typically one model per different scale or physics and each model is implemented by a parallel code.

For instance, to model a crack propagation, one uses a molecular dynamic code to represent the atomistic scale and an elasticity code using a finite element method to represent the continuum scale. Indeed, fully microscopic simulations of most domains of interest are not computationally feasible. Combining such different scales or physics is still a challenge to reach high performance and scalability.

Another prominent example is found in the field of aeronautic propulsion: the conjugate heat transfer simulation in complex geometries (as developed by the CFD team of CERFACS) requires to couple a fluid/convection solver (AVBP) with a solid/conduction solver (AVTP). As the AVBP code is much more CPU consuming than the AVTP code, there is an important computational imbalance between the two solvers.

In this context, one crucial issue is undoubtedly the load balancing
of the whole coupled simulation that remains an open question. The
goal here is to find the best data distribution for the whole coupled
simulation and not only for each stand-alone code, as it is most
usually done. Indeed, the naive balancing of each code on its own can
lead to an important imbalance and to a communication bottleneck
during the coupling phase, which can drastically decrease the overall
performance. Therefore, we argue that it is required to model the
coupling itself in order to ensure a good scalability, especially when
running on massively parallel architectures (tens of thousands of
processors/cores). In other words, one must develop new algorithms and
software implementation to perform a coupling-aware partitioning
of the whole application.
Another related problem is the problem of resource allocation. This is
particularly important for the global coupling efficiency and
scalability, because each code involved in the coupling can be more or
less computationally intensive, and there is a good trade-off to find
between resources assigned to each code to avoid that one of them
waits for the other(s). What does furthermore happen if the load of one code
dynamically changes relatively to the other one? In such a case, it could
be convenient to dynamically adapt the number of resources used
during the execution.

There are several open algorithmic problems that we
investigate in the HiePACS project-team.
All these problems uses a similar methodology based upon the graph
model and are expressed as variants of the classic graph partitioning
problem, using additional constraints or different objectives.

As a preliminary step related to the dynamic load balancing of coupled
codes, we focus on the problem of dynamic load balancing of a single
parallel code, with variable number of processors. Indeed, if the
workload varies drastically during the simulation, the load must be
redistributed regularly among the processors. Dynamic load balancing
is a well studied subject but most studies are limited to an initially
fixed number of processors. Adjusting the number of processors at
runtime allows one to preserve the parallel code efficiency or keep
running the simulation when the current memory resources are
exceeded. We call this problem, MxN graph repartitioning.

We propose some methods based on graph repartitioning in order to re-balance the load while changing the number of processors. These methods are split in two main steps. Firstly, we study the migration phase and we build a “good” migration matrix minimizing several metrics like the migration volume or the number of exchanged messages. Secondly, we use graph partitioning heuristics to compute a new distribution optimizing the migration according to the previous step results.

As stated above, the load balancing of coupled code is a major
issue, that determines the performance of the complex simulation, and
reaching high performance can be a great challenge. In this context,
we develop new graph partitioning techniques, called co-partitioning. They address the problem of load balancing for two
coupled codes: the key idea is to perform a "coupling-aware"
partitioning, instead of partitioning these codes independently, as it
is classically done. More precisely, we propose to enrich the classic
graph model with inter-edges, which represent the coupled code
interactions. We describe two new algorithms, and compare them to the
naive approach. In the preliminary experiments we perform on
synthetically-generated graphs, we notice that our algorithms succeed
to balance the computational load in the coupling phase and in some
cases they succeed to reduce the coupling communications costs.
Surprisingly, we notice that our algorithms do not degrade
significantly the global graph edge-cut, despite the additional
constraints that they impose.

Besides this, our co-partitioning technique requires to use graph
partitioning with fixed vertices, that raises serious issues
with state-of-the-art software, that are classically based on the
well-known recursive bisection paradigm (RB). Indeed, the RB method
often fails to produce partitions of good quality. To overcome this
issue, we propose a new direct Scotch, for real-life
graphs available from the popular DIMACS'10 collection.

Graph handling and partitioning play a central role in the activity
described here but also in other numerical techniques detailed in
sparse linear algebra Section.
The Nested Dissection is now a well-known heuristic for sparse matrix
ordering to both reduce the fill-in during numerical factorization and
to maximize the number of independent computation tasks. By using the
block data structure induced by the partition of separators of the
original graph, very efficient parallel block solvers have been
designed and implemented according to super-nodal or multi-frontal
approaches. Considering hybrid methods mixing both direct and
iterative solvers such as MaPHyS, obtaining a domain
decomposition leading to a good balancing of both the size of domain
interiors and the size of interfaces is a key point for load balancing
and efficiency in a parallel context.

We intend to revisit some well-known graph partitioning techniques in
the light of the hybrid solvers and design new algorithms to be tested
in the Scotch package.

Due to the increase of available computer power, new applications in nano science and physics appear such as study of properties of new materials (photovoltaic materials, bio- and environmental sensors, ...), failure in materials, nano-indentation. Chemists, physicists now commonly perform simulations in these fields. These computations simulate systems up to billion of atoms in materials, for large time scales up to several nanoseconds. The larger the simulation, the smaller the computational cost of the potential driving the phenomena, resulting in low precision results. So, if we need to increase the precision, there are two ways to decrease the computational cost. In the first approach, we improve algorithms and their parallelization and in the second way, we will consider a multiscale approach.

A domain of interest is the material aging for the nuclear industry. The materials are exposed to complex conditions due to the combination of thermo-mechanical loading, the effects of irradiation and the harsh operating environment. This operating regime makes experimentation extremely difficult and we must rely on multi-physics and multi-scale modeling for our understanding of how these materials behave in service. This fundamental understanding helps not only to ensure the longevity of existing nuclear reactors, but also to guide the development of new materials for 4th generation reactor programs and dedicated fusion reactors. For the study of crystalline materials, an important tool is dislocation dynamics (DD) modeling. This multiscale simulation method predicts the plastic response of a material from the underlying physics of dislocation motion. DD serves as a crucial link between the scale of molecular dynamics and macroscopic methods based on finite elements; it can be used to accurately describe the interactions of a small handful of dislocations, or equally well to investigate the global behavior of a massive collection of interacting defects.

To explore i.e. to simulate these new areas, we need to develop and/or to improve significantly models, schemes and solvers used in the classical codes. In the project, we want to accelerate algorithms arising in those fields.

We will focus on the following topics (in particular in the currently under definition OPTIDIS project in collaboration with CEA Saclay, CEA Ile-de-france and SIMaP Laboratory in Grenoble) in connection with research described at Sections 3.4 and 3.5.

Scientific simulation for ITER tokamak modeling provides a natural
bridge between theory and experimentation and is also an essential
tool for understanding and predicting plasma behavior.
Recent progresses in numerical simulation of fine-scale turbulence and
in large-scale dynamics of magnetically confined plasma have been
enabled by access to petascale supercomputers. These progresses would
have been unreachable without new computational methods and adapted
reduced models. In particular, the plasma science community has
developed codes for which computer runtime scales quite well with the
number of processors up to thousands cores.
The research activities of HiePACS concerning the international ITER
challenge have started in the Inria Project Lab C2S@Exa in
collaboration with CEA-IRFM and were related to two complementary
studies: a first one concerning the turbulence of plasma particles
inside a tokamak (in the context of GYSELA code) and a second one
concerning the MHD instability edge localized modes (in the context of
JOREK code). The activity concerning GYSELA was completed at the
end of 2018.

Other numerical simulation tools designed for the ITER challenge aim
at making a significant progress in understanding active control
methods of plasma edge MHD instability Edge Localized Modes (ELMs)
which represent a particular danger with respect to heat and particle
loads for Plasma Facing Components (PFC) in the tokamak.
The goal is to improve the understanding of the related physics and to
propose possible new strategies to improve effectiveness of ELM
control techniques.
The simulation tool used (JOREK code) is related to non linear MHD
modeling and is based on a fully implicit time evolution scheme that leads
to 3D large very badly conditioned sparse linear systems to be solved
at every time step. In this context, the use of PaStiX library to
solve efficiently these large sparse problems by a direct method is a
challenging issue.

This activity continues within the context of the EoCoE2 project, in which the PaStiX solver is identified to allow the processing of very larger linear systems for the nuclear fusion code TOKAM3X from CEA-IRFM.
Contrary to the JOREK code, the problem to be treated corresponds to the complete 3D volume of the plasma torus. The objective is to be competitive, for complex geometries, compared to an Algebraic MultiGrid approach designed by one partner of EoCoE2.

Parallel and numerically scalable hybrid solvers based on a fully algebraic coarse space correction have been theoretically studied
and various advanced parallel implementations have been designed.
Their parallel scalability has been initially investigated on large scale problems within the EoCoE project thanks to a close collaboration with the BSC
and the integration of MaPHyS within the Alya software. This activity will further develop in the EoCoE2 project.
The performance has also been assessed on PRACE Tier-0 machine within a PRACE Project Access through a collaboration with CERFACS and
Laboratoire de Physique des Plasmas at Ecole Polytechnique for the calculation of plasma propulsion. A comparative parallel scalability study with the
Algebraic MultiGrid from Petsc has been conducted in that framework.

This domains is in the context of a long term collaboration with Airbus Research Centers.
Wave propagation phenomena intervene in many different aspects of systems design at Airbus. They drive the level of acoustic vibrations that mechanical components have to sustain, a level that one may want to diminish for comfort reason (in the case of aircraft passengers, for instance) or for safety reason (to avoid damage in the case of a payload in a rocket fairing at take-off). Numerical simulations of these phenomena plays a central part in the upstream design phase of any such project. Airbus Central R & T has developed over the last decades an in-depth knowledge in the field of Boundary Element Method (BEM) for the simulation of wave propagation in homogeneous media and in frequency domain. To tackle heterogeneous media (such as the jet engine flows, in the case of acoustic simulation), these BEM approaches are coupled with volumic finite elements (FEM). We end up with the need to solve large (several millions unknowns) linear systems of equations composed of a dense part (coming for the BEM domain) and a sparse part (coming from the FEM domain). Various parallel solution techniques are available today, mixing tools created by the academic world (such as the Mumps and Pastix sparse solvers) as well as parallel software tools developed in-house at Airbus (dense solver SPIDO, multipole solver,

The training phase of Deep Convolutional Neural Networks represents nowadays a significant share of the computations performed on HPC supercomputers. It introduces several new problems concerning resource allocation and scheduling issues, because of the specific pattern of task graphs induced by the stochastic gradient descent and because memory consumption is particularly critical when performing training. As of today, the most classical parallelization methods consists in partitioning mini-batches, images, filters,... but all these methods induce high synchronization and communication costs, and only very partially resolve memory issues. Within the framework of the Inria IPL on HPC Big Data and Learning convergence, we are working on re-materialization techniques and on the use of model parallelism, in particular to be able to build on the research that has been carried out in a more traditional HPC framework on the exploitation of resource heterogeneity and dynamic runtime scheduling.

The year 2020 was marked by the covid crisis and its impact on society and its overall activity. The world of research was also greatly affected: Faculty members have seen their teaching load increase significantly; PhD students and post-docs have often had to deal with a worsening of their working conditions, as well as with reduced interactions with their supervisors and colleagues; most scientific collaborations have been greatly affected, with many international activities cancelled or postponed to dates still to be defined.

HiePACS members have participated to preparation of the following successful 2020 H2020 proposals:

A-VCI is a theoretical vibrational spectroscopy algorithm developed to effectively reduce the number of vibrational states used in the configuration-interaction (CI) process. It constructs a nested basis for the discretization of the Hamiltonian operator inside a large CI approximation space and uses an a-posteriori error estimator (residue) to select the most relevant directions to expand the discretization space.

The Hamiltonian operator consists of 3 operators: a harmonic oscillator sum, the potential energy surface operator and the Coriolis operators. In addition, the code can compute the intensity of eigenvectors.

The code can handle molecules up to 10 atoms, which corresponds to solving an eigenvalue problem in a 24-dimensional space.

Chameleon is part of the MORSE (Matrices Over Runtime Systems @ Exascale) project. The overall objective is to develop robust linear algebra libraries relying on innovative runtime systems that can fully benefit from the potential of those future large-scale complex machines.

We expect advances in three directions based first on strong and closed interactions between the runtime and numerical linear algebra communities. This initial activity will then naturally expand to more focused but still joint research in both fields.

1. Fine interaction between linear algebra and runtime systems. On parallel machines, HPC applications need to take care of data movement and consistency, which can be either explicitly managed at the level of the application itself or delegated to a runtime system. We adopt the latter approach in order to better keep up with hardware trends whose complexity is growing exponentially. One major task in this project is to define a proper interface between HPC applications and runtime systems in order to maximize productivity and expressivity. As mentioned in the next section, a widely used approach consists in abstracting the application as a DAG that the runtime system is in charge of scheduling. Scheduling such a DAG over a set of heterogeneous processing units introduces a lot of new challenges, such as predicting accurately the execution time of each type of task over each kind of unit, minimizing data transfers between memory banks, performing data prefetching, etc. Expected advances: In a nutshell, a new runtime system API will be designed to allow applications to provide scheduling hints to the runtime system and to get real-time feedback about the consequences of scheduling decisions.

2. Runtime systems. A runtime environment is an intermediate layer between the system and the application. It provides low-level functionality not provided by the system (such as scheduling or management of the heterogeneity) and high-level features (such as performance portability). In the framework of this proposal, we will work on the scalability of runtime environment. To achieve scalability it is required to avoid all centralization. Here, the main problem is the scheduling of the tasks. In many task-based runtime environments the scheduler is centralized and becomes a bottleneck as soon as too many cores are involved. It is therefore required to distribute the scheduling decision or to compute a data distribution that impose the mapping of task using, for instance the so-called “owner-compute” rule. Expected advances: We will design runtime systems that enable an efficient and scalable use of thousands of distributed multicore nodes enhanced with accelerators.

3. Linear algebra. Because of its central position in HPC and of the well understood structure of its algorithms, dense linear algebra has often pioneered new challenges that HPC had to face. Again, dense linear algebra has been in the vanguard of the new era of petascale computing with the design of new algorithms that can efficiently run on a multicore node with GPU accelerators. These algorithms are called “communication-avoiding” since they have been redesigned to limit the amount of communication between processing units (and between the different levels of memory hierarchy). They are expressed through Direct Acyclic Graphs (DAG) of fine-grained tasks that are dynamically scheduled. Expected advances: First, we plan to investigate the impact of these principles in the case of sparse applications (whose algorithms are slightly more complicated but often rely on dense kernels). Furthermore, both in the dense and sparse cases, the scalability on thousands of nodes is still limited, new numerical approaches need to be found. We will specifically design sparse hybrid direct/iterative methods that represent a promising approach.

Overall end point. The overall goal of the MORSE associate team is to enable advanced numerical algorithms to be executed on a scalable unified runtime system for exploiting the full potential of future exascale machines.

Chameleon includes the following features:

- BLAS 3, LAPACK one-sided and LAPACK norms tile algorithms - Support QUARK and StarPU runtime systems and PaRSEC since 2018 - Exploitation of homogeneous and heterogeneous platforms through the use of BLAS/LAPACK CPU kernels and cuBLAS/MAGMA CUDA kernels - Exploitation of clusters of interconnected nodes with distributed memory (using OpenMPI)

MaPHyS is a software package that implements a parallel linear solver coupling direct and iterative approaches. The underlying idea is to apply to general unstructured linear systems domain decomposition ideas developed for the solution of linear systems arising from PDEs. The interface problem, associated with the so called Schur complement system, is solved using a block preconditioner with overlap between the blocks that is referred to as Algebraic Additive Schwarz. A fully algebraic coarse space is available for symmetric positive definite problems, that insures the numerical scalability of the preconditioner.

The parallel implementation is based on MPI+thread. Maphys relies on state-of-the art sparse and dense direct solvers.

MaPHyS is essentially a preconditioner that can be used to speed-up the convergence of any Krylov subspace method and is coupled with the ones implemented in the Fabulous package.

PaStiX is a scientific library that provides a high performance parallel solver for very large sparse linear systems based on block direct and block ILU(k) methods. It can handle low-rank compression techniques to reduce the computation and the memory complexity. Numerical algorithms are implemented in single or double precision (real or complex) for LLt, LDLt and LU factorization with static pivoting (for non symmetric matrices having a symmetric pattern). The PaStiX library uses the graph partitioning and sparse matrix block ordering packages Scotch or Metis.

The PaStiX solver is suitable for any heterogeneous parallel/distributed architecture when its performance is predictable, such as clusters of multicore nodes with GPU accelerators or KNL processors. In particular, we provide a high-performance version with a low memory overhead for multicore node architectures, which fully exploits the advantage of shared memory by using an hybrid MPI-thread implementation.

The solver also provides some low-rank compression methods to reduce the memory footprint and/or the time-to-solution.

This software implements in PyTorch a new activation checkpointing method which allows to significantly decrease memory usage when training Deep Neural Networks with the back-propagation algorithm. Similarly to checkpointing techniques coming from the literature on Automatic Differentiation, it consists in dynamically selecting the forward activations that are saved during the training phase, and then automatically recomputing missing activations from those previously recorded. We propose an original computation model that combines two types of activation savings: either only storing the layer inputs, or recording the complete history of operations that produced the outputs (this uses more memory, but requires fewer recomputations in the backward phase), and we provide in https://hal.inria.fr/hal-02352969 an algorithm to compute the optimal computation sequence for this model.

Our PyTorch implementation processes the entire chain, dealing with any sequential DNN whose internal layers may be arbitrarily complex and automatically executing it according to the optimal checkpointing strategy computed given a memory limit. In https://hal.inria.fr/hal-02352969, through extensive experiments, we show that our implementation consistently outperforms existing checkpoint-ing approaches for a large class of networks, image sizes and batch sizes.

ScalFMM is a software library to simulate N-body interactions using the Fast Multipole Method. The library offers two methods to compute interactions between bodies when the potential decays like 1/r. The first method is the classical FMM based on spherical harmonic expansions and the second is the Black-Box method which is an independent kernel formulation (introduced by E. Darve @ Stanford). With this method, we can now easily add new non oscillatory kernels in our library. For the classical method, two approaches are used to decrease the complexity of the operators. We consider either matrix formulation that allows us to use BLAS routines or rotation matrix to speed up the M2L operator.

ScalFMM intends to offer all the functionalities needed to perform large parallel simulations while enabling an easy customization of the simulation components: kernels, particles and cells. It works in parallel in a shared/distributed memory model using OpenMP and MPI. The software architecture has been designed with two major objectives: being easy to maintain and easy to understand. There is two main parts:

the management of the octree and the parallelization of the method the kernels. This new architecture allow us to easily add new FMM algorithm or kernels and new paradigm of parallelization.

PlaFRIM is an experimental platform for research in modeling, simulations and high performance computing. This platform has been set up from 2009 under the leadership of Inria Bordeaux Sud-Ouest in collaboration with computer science and mathematics laboratories, respectively Labri and IMB with a strong support in the region Aquitaine.

It aggregates different kinds of computational resources for research and development purposes. The latest technologies in terms of processors, memories and architecture are added when they are available on the market. It is now more than 1,000 cores (excluding GPU and Xeon Phi ) that are available for all research teams of Inria Bordeaux, Labri and IMB. This computer is in particular used by all the engineers who work in HiePACS and are advised by F. Rue from the SED.

In the context of the Inria International Lab JLESC we have an ongoing collaboration with Argonne National Laboratory on the use of agnostic compression techniques to reduce the memory footprint of iterative linear solvers. Large scale applications running on HPC systems often require a substantial amount of memory and can have a large data movement overhead. Lossy data compression techniques can reduce the size of the data and associated communication cost, but the effect of the loss of accuracy on the numerical algorithm can be hard to predict. In this work we examine the FGMRES algorithm, which requires the storage of a basis for the search space and another one for the space spanned by the residuals. We show that the vectors spanning this search space can be compressed by looking at the combination of FGMRES and compression in the context of inexact Krylov subspace methods. This allows us to derive a bound on the normwise relative compression error in each iteration. We use this bound to formulate a number of different practical compression strategies, and validate and compare them through numerical experiments. We refer the reader to 34 for a detailed presentation of this work.

We are concerned with the iterative solution of linear systems with multiple right-hand sides available one group after another, including the case where there are massive number (like tens of thousands) of right-hand sides associated with a single matrix so that all of them cannot be solved at once but rather need to be split into chunks of possible variable sizes. For such sequences of linear systems with multiple left and right-hand sides, we develop a new recycling block generalized conjugate residual method with inner orthogonalization and inexact breakdown (IB-BGCRO-DR), which glues subspace recycling technique in GCRO-DR [SIAM J. Sci. Comput., 28(5) (2006), pp 1651,1674] and inexact breakdown mechanism in IB-BGMRES [Linear Algebra Appl., 419 (2006), pp. 265,285] to guarantee this new algorithm could reuse spectral information for subsequent cycles as well as for the remaining linear systems to be solved. Related variant IB-BFGCRO-DR that suits for flexible preconditioning is devised to cope with constraints on some applications while also enabling mixed-precision calculation, which provides advantages in speed and memory usage over double precision as well as in perspective of emerging computing units such as the GPUs. Besides, we also discuss why/ how many possible choices are available when building recycling subspace as well as how to exploit the inexact breakdown mechanism for realizing the flexibility of search space expansion policies and monitoring individual convergence thresholds for each single right-hand side. As a side effect one also illustrates that this method can be applied to constant/ slowly-varying left-hand side. Finally we demonstrate the numerical and computational benefits of combining these ideas in such algorithms on a set of simple academic examples with a sequence of right-hand sides.

A scientific report should appear shortly that will detailed this work. Furthermore, a C++ implementation in the Fabulous package is also an ongoing work.

In 24, we consider the design of static resource allocation problems for irregular distributed parallel applications. More precisely, we focus on two classical tiled linear algebra kernels: the Matrix Multiplication (MM) and the LU decomposition (LU) algorithms on large linear systems. In the context of parallel distributed platforms, data exchanges can dramatically degrade the performance of linear algebra kernels and in this context, compression techniques such as Block Low Rank (BLR) compression techniques are good candidates both for limiting data storage on each processor and data exchanges between processors. On the other hand, the use of BLR representation makes the static allocation problem of tiles to processors more complex. Indeed, the load associated to each tile depends on its compression factor, which induces an heterogeneous load balancing problem. In turn, solving this load balancing problem optimally might lead to complex allocation schemes, where the tiles allocated to a given processor are scattered all over the matrix. This in turn induces communication costs, since matrix multiplication and LU decompositions heavily rely on broadcasting operations along rows and columns of processors, so that the communication volume is minimized when the maximal number of different processors on any row and column is minimized. In the fully homogeneous case, 2D Block Cyclic (BC) allocation solves both load balancing and communication minimization issues simultaneously , but it might lead to bad load balancing in the heterogeneous case. Our goal in this paper is to propose data allocation schemes dedicated to BLR format and to prove that it is possible to obtain good performance on makespan when simultaneously balancing the load and minimizing the maximal number of different processor in any row or column.

The conjugate gradient (CG) method is the most widely used iterative scheme for the solution of large sparse systems of linear equations when the matrix is symmetric positive definite. Although more than sixty year old, it is still a serious candidate for extreme-scale computations on large computing platforms. On the technological side, the continuous shrinking of transistor geometry and the increasing complexity of these devices affect dramatically their sensitivity to natural radiation, and thus diminish their reliability. One of the most common effects produced by natural radiation is the single event upset which consists in a bit-flip in a memory cell producing unexpected results at application level. Consequently, the future extreme scale computing facilities will be more prone to errors of any kind, including bit-flips, during their calculations. These numerical and technological observations are the main motivations for this work, where we first investigate through extensive numerical experiments the sensitivity of CG to bit-flips in its main computationally intensive kernels, namely the matrix-vector product and the preconditionner application. We further propose numerical criteria to detect the occurrence of such soft errors and assess their robustness through extensive numerical experiments.

We have been working on the revision of this work that has been eventually published in 15; a complementary study on persistent faults is detailed in 35.

Recent works in sparse direct solvers have multiplied the solutions to exploit data-sparsity additionally to sparse structure in these solvers. Among these solution, the PaStiX solver exploits block low-rank structures to reduce both its memory consumption and time to solution. In this talk, we will discuss recent changes made to the solver in this context such as the study of suitable low-rank compression kernels, adapted ordering heuristics, or scheduling strategies through the use of runtime systems. Performance results of the latest release including the presented features will be shown and discussed.

This work has been presented at the international conference SIAM PP 2020 33.

Low-rank compression techniques are very promising for reducing memory footprint and execution time on a large spectrum of linear solvers. Sparse direct supernodal approaches are one these techniques. However, despite providing a very good scalability and reducing the memory footprint, they suffer from an important flops overhead in their unstructured low-rank updates. As a consequence, the execution time is not improved as expected.

In this paper, we study a solution to improve low-rank compression
techniques in sparse supernodal solvers. The proposed method
tackles the overprice of the low-rank updates by identifying the blocks
that have poor compression rates.
We show that block incomplete LU factorization, thanks to the block fill-in levels, allows us to identify most of
these non-compressible blocks at low cost.
This identification enables us to postpone the low-rank compression step to
trade small extra memory consumption for a better time to
solution. The solution is validated within the PaStiX library with a
large set of application matrices. It demonstrates sequential and
multi-threaded speedup up to

This work will be submitted at the international conference EuroPar 2021.

The Adaptive Vibrational Configuration Interaction (A-VCI) algorithm is an iterative process that computes the anharmonic spectrum of a molecule using nested bases to discretize the Hamiltonian operator in high dimension. For large molecular systems, the size of the discretization space, the computation time and the memory consumption become prohibitive especially when are looking eigenvalues in the middle of the spectra. It is therefore necessary to develop new methods to further limit the number of basis functions.

To overcome this issues, we have worked this year in two direction.

This year, we focused mainly on the work of comparing point clouds obtained by multidimensional scaling calculations. Our goal is to find a way to automatically compare these clouds without having to look at them manually in order to judge whether or not they have the same shape. To do this, we took into account Hausdorff's distance. However this metric had several problems like the presence of outliers i.e. few points that are very far away but affect the distance by lot, different orientations of the cloud shape. We have proposed an algorithm that overcomes these difficulties and gives very unsuccessful results. To improve these performances, we now compute distances in larger spaces. That led to issues with the cost of alignment which rise exponentially with the number of dimensions studied. We have proposed a heuristic that allows us to compare the shapes after passing them through a process that have the same output regardless of the symmetries present. A report is being written.

We are involved in the ADT Gordon ((partners: TADAAM (coordinator), STORM, HiePACS, PLEIADE). The objectives of this ADT is to scale our sollver stack on a PLEIADE dimensioning metabarcoding application (multidimensional scaling method). Our goal is to be able to handle a problem leading to a distance matrix around 100 million individuals. Our contribution concerns the the scalability of the multidimensional scaling method and more particulary the random projection methods to speed up the SVD solver.
Experiments on occigen plateform has
allowed us to show that the solver stack was able to solve
efficiently a large problem up to one million individuals
in less than 22 minutes on 100 nodes. This has highlighted that for these problem sizes the
management of I/O, inputs and outputs with the disks, becomes
critical and dominates calculation times.

Concerning the use of tensor formalism for data analysis, we worked on two aspects in collaboration with EPC PLEIADE (A. Franc). Firstly, a methodology was set up to analyze the structured data of the Malabar project. These data are genetic data collected in the Arcachon bay. The frequencies of these genetic sequences were classified according to their species, zone, time and position in the water column of collection. On the frequency tensor, the Tucker decomposition without or with the positivity constraint of the tensor was applied and made it possible to review the techniques for extracting the relevant information. In a second step, we were interested in the Correspondance Analysis and its geometric interpretation (of point clouds) from frequency matrices and their studies. The analysis of point clouds is possible thanks to a linear relationship existing between the point clouds. We have studied how to extend this relationship to three point clouds, by applying some properties developed in the context of Tucker's hierarchy. An article is in progress.

The JOREK extended magneto-hydrodynamic (MHD) code is a widely used simulation code for studying the non-linear dynamics of large-scale instabilities in divertor tokamak plasmas. Due to the large scale-separation intrinsic to these phenomena both in space and time, the computational costs for simulations in realistic geometry and with realistic parameters can be very high, motivating the investment of considerable effort for optimization. The code is usually run with a fully implicit time integration allowing to use large time steps independent of a CFL criterion. This is particularly important due to the fast time scales associated with MHD waves and fast parallel heat transport. For solving the resulting large sparse-matrix system iteratively in each time step, a physics-based preconditioner building on the assumption of a weak coupling between the toroidal harmonics is applied. The solution for each harmonic matrix is determined independently in this preconditioner using a direct sparse matrix library. In this article, a set of developments regarding the JOREK solver and preconditioner is described, which lead to overall significant benefits for large production simulations. This comprises in particular enhanced convergence in highly non-linear scenarios and a general reduction of memory consumption and computational costs. The developments include faster construction of preconditioner matrices, a domain decomposition of preconditioning matrices for solver libraries that can handle distributed matrices, interfaces for additional solver libraries, an option to use matrix compression methods, and the implementation of a complex solver interface for the preconditioner. The most significant development presented consists in a generalization of the physics based preconditioner to “mode groups”, which allows to account for the dominant interactions between toroidal Fourier modes in highly non-linear simulations. At the cost of a moderate increase of memory consumption, the technique can strongly enhance convergence in suitable cases allowing to use significantly larger time steps. For all developments, benchmarks based on typical simulation cases demonstrate the resulting improvements.

In the context of a parallel plasma physics simulation code, we perform a qualitative performance study between two natural candidates for the parallel solution of 3D Poisson problems that are multigrid and domain decomposition. We selected one representative of each of these numerical techniques implemented in state of the art parallel packages and show that depending on the regime used in terms of number of unknowns per computing cores the best alternative in terms of time to solution varies. Those results show the interest of having both types of numerical solvers integrated in a simulation code that can be used in very different configurations in terms of problem sizes and parallel computing platforms. More information on these results will be shortly available in an Inria scientific report.

Wave propagation phenomena intervene in many different aspects of systems design at Airbus. They drive the level of acoustic vibrations that mechanical components have to sustain, a level that one may want to diminish for comfort reason (in the case of aircraft passengers, for instance) or for safety reason (to avoid damage in the case of a payload in a rocket fairing at take-off). Numerical simulations of these phenomena plays a central part in the upstream design phase of any such project. Airbus Central R&T has developed over the last decades an in-depth knowledge in the field of Boundary Element Method (BEM) for the simulation of wave propagation in homogeneous media and in frequency domain. To tackle heterogeneous media (such as the jet engine flows, in the case of acoustic simulation), these BEM approaches are coupled with volumic finite elements (FEM). We end up with the need to solve large (several millions unknowns) linear systems of equations composed of a dense part (coming for the BEM domain) and a sparse part (coming from the FEM domain). Various parallel solution techniques are available today, mixing tools created by the academic world (such as the Mumps and Pastix sparse solvers) as well as parallel software tools developed in-house at Airbus (dense solver SPIDO, multipole solver , H-matrix solver with an open sequential version available online hmatoss). In the current state of knowledge and technologies, these methods do not permit to tackle the simulation of aeroacoustics problems at the highest acoustic frequencies (between 5 and 20 kHz, upper limits of human audition) while considering the whole complexity of geometries and phenomena involved (higher acoustic frequency implies smaller mesh sizes that lead to larger unknowns number, a number that grows like

It is common to accelerate the boundary element method by compression techniques (FMM,

Training Deep Neural Networks is known to be an expensive operation, both in terms of computational cost and memory load. Indeed, during training, all intermediate layer outputs (called activations) computed during the forward phase must be stored until the corresponding gradient has been computed in the backward phase. These memory requirements sometimes prevent to consider larger batch sizes and deeper networks, so that they can limit both convergence speed and accuracy.

Recent works have proposed to offload some of the computed forward activations from the memory of the GPU to the memory of the CPU. This requires to determine which activations should be offloaded and when these transfers from and to the memory of the GPU should take place. In 23, We prove that this problem is NP-hard in the strong sense, and we propose two heuristics based on relaxations of the problem. We perform extensive experimental evaluation on standard Deep Neural Networks. We compare the performance of our heuristics against previous approaches from the literature, showing that they achieve much better performance in a wide variety of situations.

In 38 we consider the complexity of the optimization problems induced by the use of pipelined model parallelism for the he training phase in Deep Neural Networks has become an important source of computing resource usage and because of the resulting volume of computation, it is crucial to perform it efficiently on parallel architectures. Even today, data parallelism is the most widely used method, but the associated requirement to replicate all the weights on the totality of computation resources poses problems of memory at the level of each node and of collective communications at the level of the platform. In this context, the model parallelism, which consists in distributing the different layers of the network over the computing nodes, is an attractive alternative. Indeed, it is expected to better distribute weights (to cope with memory problems) and it does not imply large collective communications since only forward activations are communicated. However, to be efficient, it must be combined with a pipelined / streaming approach, which leads in turn to new memory costs. The goal of this paper is to model these memory costs in detail, to analyze the complexity of the associated throughput optimization problem under memory constraints and to show that it is possible to formalize this optimization problem as an Integer Linear Program (ILP). In 37, we formalize in detail the optimization problem associated with the placement of DNN layers onto computation resources when using pipelined model parallelism, and we derive a dynamic programming based heuristic, MadPipe, that allows to significantly improve the performance of the parallel model approach compared to the literature.

In order to express parallelism, parallel sparse direct solvers take advantage of the elimination tree to exhibit tree-shaped task graphs, where nodes represent computational tasks and edges represent data dependencies. One of the pre-processing stages of sparse direct solvers consists of mapping computational resources (processors) to these tasks. The objective is to minimize the factorization time by exhibiting good data locality and load balancing. The proportional mapping technique is a widely used approach to solve this resource-allocation problem. It achieves good data locality by assigning the same processors to large parts of the elimination tree. However, it may limit load balancing in some cases. In this paper, we propose a dynamic mapping algorithm based on proportional mapping. This new approach, named Steal, relaxes the data locality criterion to improve load balancing. In order to validate the newly introduced method, we perform extensive experiments on the PaStiX sparse direct solver. It demonstrates that our algorithm enables better static scheduling of the numerical factorization while keeping good data locality.

This work has been presented at the international conference EuroPar 2020 29.

In distributed memory systems, it is paramount to develop strategies to overlap the data transfers between memory nodes with the computations in order to exploit their full potential. In 28, we consider the problem of determining the order of data transfers between two memory nodes for a set of independent tasks with the objective of minimizing the makespan. We prove that, with limited memory capacity, the problem of obtaining the optimal data transfer order is NP-complete. We propose several heuristics to determine this order and discuss the conditions that might be favorable to different heuristics. We analyze our heuristics on traces obtained by running two molecular chemistry kernels, namely, Hartree‚Fock (HF) and Coupled Cluster Singles Doubles (CCSD), on 10 nodes of an HPC system. Our results show that some of our heuristics achieve significant overlap for moderate memory capacities and resulting in makespans that are very close to the lower bound.

Concurrent kernel execution is a relatively new feature in modern GPUs, which was designed to improve hardware utilization and the overall system throughput. However, the decision on the simultaneous execution of tasks is performed by the hardware with a leftover policy, that assigns as many resources as possible for one task and then assigns the remaining resources to the next task. This can lead to unreasonable use of resources. In 40, 27, we tackle the problem of co-scheduling for GPUs with and without preemption, with the focus on determining the kernels submission order to reduce the number of preemptions and the kernels makespan, respectively. We propose a graph-based theoretical model to build preemptive and non-preemptive schedules. We show that the optimal preemptive makespan can be computed by solving a Linear Program in polynomial time, and we propose an algorithm based on this solution which minimizes the number of preemptions. We also propose an algorithm that transforms a preemptive solution of optimal makespan into a non-preemptive solution with the smallest possible preemption overhead. We show, however, that finding the minimal amount of preemptions among all preemptive solutions of optimal makespan is a NP-hard problem, and computing the optimal non-preemptive schedule is also NP-hard. In addition, we study the non-preemptive problem, without searching first for a good preemptive solution, and present a Mixed Integer Linear Program solution to this problem. We performed experiments on real-world GPU applications and our approach can achieve optimal makespan by preempting 6 to 9% of the tasks. Our non-preemptive approach, on the other side, obtains makespan within 2.5% of the optimal preemptive schedules, while previous approaches exceed the preemptive makespan by 5 to 12%.

We study in 17 the problem of executing an application represented by a precedence task graph on a parallel machine composed of standard computing cores and accelerators. Contrary to most existing approaches, we distinguish the allocation and the scheduling phases and we mainly focus on the allocation part of the problem: choose the most appropriate type of computing unit for each task. We address both off-line and on-line settings and design generic scheduling approaches. In the first case, we establish strong lower bounds on the worst-case performance of a known approach based on Linear Programming for solving the allocation problem. Then, we refine the scheduling phase and we replace the greedy List Scheduling policy used in this approach by a better ordering of the tasks. Although this modification leads to the same approximability guarantees, it performs much better in practice. We also extend this algorithm to more types of computing units, achieving an approximation ratio which depends on the number of different types. In the on-line case, we assume that the tasks arrive in any, not known in advance, order which respects the precedence relations and the scheduler has to take irrevocable decisions about their allocation and execution. In this setting, we propose the first on-line scheduling algorithm which takes into account precedences. Our algorithm is based on adequate rules for selecting the type of processor where to allocate the tasks and it achieves a constant factor approximation guarantee if the ratio of the number of CPUs over the number of GPUs is bounded. Finally, all the previous algorithms for hybrid architectures have been experimented , on a large set of common benchmarks that we made available for the community. These simulations assess the good practical behavior of the algorithms with respect to the state-of-the-art solutions, whenever these exist, or baseline algorithms.

Along a slightly different direction, we consider in 30 the use of reinforcement learning to help to design efficient dynamic schedulers to be used in runtime systems such as StarPU. In practice, it is quite common to face combinatorial optimization problems which contain uncertainty along with non-determinism and dynamicity. These three properties call for appropriate algorithms; reinforcement learning (RL) is dealing with them in a very natural way. Today, despite some efforts, most real-life combinatorial optimization problems remain out of the reach of reinforcement learning algorithms. In this paper, we propose a reinforcement learning approach to solve a realistic scheduling problem, and apply it to an algorithm commonly executed in the high performance computing community, the Cholesky factorization. On the contrary to static scheduling, where tasks are assigned to processors in a predetermined ordering before the beginning of the parallel execution, our method is dynamic: task allocations and their execution ordering are decided at runtime, based on the system state and unexpected events, which allows much more flexibility. To do so, our algorithm uses graph neural networks in combination with an actor-critic algorithm (A2C) to build an adaptive representation of the problem on the fly. We show that this approach is competitive with state-of-the-art heuristics used in high-performance computing runtime systems. Moreover, our algorithm does not require an explicit model of the environment, but we demonstrate that extra knowledge can easily be incorporated and improves performance. We also exhibit key properties provided by this RL approach, and study its transfer abilities to other instances.

We are also interested in the possibility of performing extra (redundant) computations in order to save messages. This possibility might be of interest, for instance, in the context of energy savings, and it is crucial in contexts such as on-line games. In 22, we consider the issue of guaranteeing properties in Distributed Virtual Environments (DVEs) without a server and without global knowledge of the system state and therefore only by exchanging messages. This issue is particularly relevant in the case of online games, that operate in a fully distributed framework and for which network resources such as bandwidth are the critical resources. In the context of games, players typically need to know the distance between their character and other characters, at least approximately. Players all share the same position estimation algorithm but, in general, do not know the current positions of others. We provide a synchronized distributed algorithm Alc to guarantee, at any time, that the estimated distance d est between any pair of characters A and B is always a (1 +

In 25, we consider the theoretical problem of deriving strong bounds on the running time of Cholesky factorization, performed on homogeneous (GPUs or CPUs) resources. Due to the advent of multicore architectures and massive parallelism, the tiled Cholesky factorization algorithm has recently received plenty of attention and is often referenced by practitioners as a case study. It is also implemented in mainstream dense linear algebra libraries and is used as a testbed for runtime systems. However, we note that theoretical study of the parallelism of this algorithm is currently lacking. In this paper, we present new theoretical results about the tiled Cholesky factorization in the context of a parallel homogeneous model without communication costs. Based on the relative costs of involved kernels, we prove that only two different situations must be considered, typically corresponding to CPUs and GPUs. By a careful analysis on the number of tasks of each type that run simultaneously in the ALAP (As Late As Possible) schedule without resource limitation, we are able to determine precisely the number of busy processors at any time (as degree 2 polynomials). We then use this information to find a closed form formula for the minimum time to schedule a tiled Cholesky factorization of size n on P processors. We show that this bound outperforms classical bounds from the literature. We also prove that ALAP(P), an ALAP-based schedule where the number of resources is limited to P , has a makespan extremely close to the lower bound, thus proving both the effectiveness of ALAP(P) schedule and of the lower bound on the makespan.

In 18, we consider the influence on data-locality of the replication of data files, as automatically performed by Distributed File Systems such as HDFS. Replication of data files, as automatically performed by Distributed File Systems such as HDFS, is known to have a crucial impact on data locality in addition to system fault tolerance. Indeed, intuitively, having more replicas of the same input file gives more opportunities for this task to be processed locally, i.e. without any input file transfer. Given the practical importance of this problem, a vast literature has been proposed to schedule tasks, based on a random placement of replicated input files. Our goal in this paper is to study the performance of these algorithms, both in terms of makespan minimization (minimize the completion time of the last task when non-local processing is forbidden) and communication minimization (minimize the number of non-local tasks when no idle time on resources is allowed). In the case of homogenous tasks, we are able to prove, using models based on "balls into bins" and "power of two choices" problems, that the well known good behavior of classical strategies can be theoretically grounded. Going further, we even establish that it is possible, using semi-matchings theory, to find the optimal solution in very small time. We also use known graph-orientation results to prove that this optimal solution is indeed near-perfect with strong probability. In the more general case of heterogeneous tasks, we propose heuristics solutions both in the clairvoyant and non-clairvoyant cases (i.e. task length is known in advance or not), and we evaluate them through simulations, using actual traces of a Hadoop cluster.

In the context of the french ICARUS project (FUI), which focuses the development of high-fidelity calculation tools for the design of hot engine parts (aeronautics & automotive), we are looking to develop new load-balancing algorithms to optimize the complex numerical simulations of our industrial and academic partners (Turbomeca, Siemens, Cerfacs, Onera, ...). Indeed, the efficient execution of large-scale coupled simulations on powerful computers is a real challenge, which requires revisiting traditional load-balancing algorithms based on graph partitioning. A thesis on this subject has already been conducted in the Inria HiePACS team in 2016, which has successfully developed a co-partitioning algorithm that balances the load of two coupled codes by taking into account the coupling interactions between these codes.

This work was initially integrated into the StarPart platform. The necessary extension of our algorithms to parallel & distributed (increasingly dynamic) versions has led to a complete redesign of StarPart, which has been the focus of our efforts this year (as in the previous year).

The StarPart framework provides the necessary building blocks to develop new graph algorithms in the context of HPC, such as those we are targeting. The strength of StarPart lies in the fact that it is a light runtime system applied to the issue of "graph computing". It provides a unified data model and a uniform programming interface that allows easy access to a dozen partitioning libraries, including Metis, Scotch, Zoltan, etc. Thus, it is possible, for example, to load a mesh from an industrial test case provided by our partners (or an academic graph collection as DIMACS'10) and to easily compare the results for the different partitioners integrated in StarPart.

Alongside this work, we are beginning to work on the application of learning techniques to the problem of graph partitioning. Recent work on GCNs (Graph Convolutional Networks) is an interesting approach that we will explore.

There is an ongoing reasearch activity with Argonne National Laboratory in the framework of the JLESC International Lab, through a postdoc funded by the DPI, namely Nick Schenkels, who work on data compression techniques in Krylov methods for the solution of large linear systems. The objective is to use agnostic compressor developped at Argonne to compress the basis involved in Krylov methods that have a large memory footprint. The challenge is to design algorithm that reduce the memory consumption, hence the energy, while preserving the numerical convergence of the numerical technique. We refer to 34 for a detailed description of the work.

Prace-6IP, the Partnership for Advanced Computing is the permanent pan-European High Performance Computing service providing world-class systems for world-class science. Systems at the highest performance level (Tier-0) are deployed by Germany, France, Italy, Spain and Switzerland, providing researchers with more than 17 billion core hours of compute time. HPC experts from 25 member states enabled users from academia and industry to ascertain leadership and remain competitive in the Global Race. Currently PRACE is finalizing the transition to PRACE 2, the successor of the initial five year period. The objectives of PRACE-6IP are to build on and seamlessly continue the successes of PRACE and start new innovative and collaborative activities proposed by the consortium. These include: assisting the development of PRACE 2; strengthening the internationally recognised PRACE brand; continuing and extend advanced training which so far provided more than 36 400 person·training days; preparing strategies and best practices towards Exascale computing, work on forward-looking SW solutions; coordinating and enhancing the operation of the multi-tier HPC systems and services; and supporting users to exploit massively parallel systems and novel architectures. A high level Service Catalogue is provided. The proven project structure will be used to achieve each of the objectives in 7 dedicated work packages. The activities are designed to increase Europe’s research and innovation potential especially through: seamless and efficient Tier-0 services and a pan-European HPC ecosystem including national capabilities; promoting take-up by industry and new communities and special offers to SMEs; assistance to PRACE 2 development; proposing strategies for deployment of leadership systems; collaborating with the ETP4HPC, CoEs and other European and international organisations on future architectures, training, application support and policies. This will be monitored through a set of KPIs.

HiePACS is involved in WP6 where it acts as an expert for a Preparatory Access entitled “Fast Multidimensional Methods” led by INRAe. HiePACS is also involved in a project part of WP8 entitled “Linear Algebra, Krylov-subspace methods, and multi-grid solvers for the discovery of New Physics” lead by CaSToRC (The Cyprus Institute) in collaboration with Leibniz Supercomputing Center.

SIAM Conference on Parallel Processing for Scientific Computing, SIAM PP 2020 https://www.siam.org/conferences/cm/conference/pp20 (O. Beaumont)

SBAC-PAD'21 (O. Beaumont, Program Co-Chair)

EuroPar'20 (O. Beaumont), AlgoCloud'20 (O. Beaumont), HPML'20 (O. Beaumont), MSPW'20 (O. Beaumont), IPDPS Workshops'20 (O. Beaumont), ISC'20 Workshops (E. Agullo), PDSEC'20 (O. Coulaud, L. Giraud), SC'20 (A. Guermouche), ICPP'20 (A. Guermouche), SC'21 (O. Beaumont)

The members of the HiePACS project have performed reviewing for the following list of journals:
Computer Methods in Applied Mechanics and Engineering,
SIAM J. Matrix Analysis and Applications,
SIAM J. Scientific Computing,
Elsevier Applied Numerical Mathematics,
Elsevier Journal of Computational Physics, Elsevier Journal of Parallel and Distributed Computing, Elsevier Journal of Systems and Software, Elsevier Journal of Scheduling, IEEE Transactions on Parallel and Distributed Systems.

In the framework of the "Digital Career Week", Olivier Beaumont met high school students (with the NSI option) from the Lycée de la Mer (Gujan Mestras) for about two hours.