Over the last few decades, there have been innumerable science,
engineering and societal breakthroughs enabled by the development of
high performance computing (HPC) applications, algorithms and
architectures.
These powerful tools have provided researchers with the ability to
computationally find efficient solutions for some of the most
challenging scientific questions and problems in medicine and biology,
climatology, nanotechnology, energy and environment.
It is widely acknowledged today that *numerical simulation is the third
pillar of scientific discovery, alongside theory and experimentation*.
Numerous reports and papers have also confirmed that very high performance
simulation will open new opportunities, not only for research but also
for a broad spectrum of industrial sectors (see for example the
documents available at
http://

An important force driving HPC has been the pursuit of frontier milestones: technical goals that symbolize the next stage of progress in the field. In the 1990s, the HPC community sought to achieve computing at a teraflop rate, and today the leading architectures compute at a petaflop rate. General-purpose petaflop supercomputers are available, and exaflop computers are foreseen in the early 2020s.

For application codes to sustain petaflops and beyond in the next few years, hundreds of thousands of processor cores or more will be needed, regardless of processor technology. Currently, only a few HPC simulation codes scale easily to this regime, and major algorithm and code development efforts are critical to realize the potential of these new systems. Scaling to a petaflop and beyond will involve improving physical models, mathematical modeling and super-scalable algorithms, and will require particular attention to the acquisition, management and visualization of huge amounts of scientific data.

In this context, the purpose of the HiePACS project is to contribute to the
efficient execution of frontier simulations arising from challenging academic and industrial
research, which are likely to be *multiscale* and *coupled* applications.
The solution of these challenging problems requires a multidisciplinary approach
involving applied mathematics, computational science and computer science.
In applied mathematics, it essentially involves advanced numerical
schemes.
In computational science, it involves massively parallel computing and the
design of highly scalable algorithms and codes to be executed on
emerging hierarchical many-core platforms.
Through this approach, HiePACS intends to contribute to all the steps that
go from the design of new, more scalable, more robust and more accurate
high-performance numerical schemes to the optimized implementation of the
associated algorithms and codes on very high performance
supercomputers. This research will be conducted in close collaboration
with European and US initiatives and projects such as
PRACE (Partnership for Advanced Computing in Europe) and
EESI-2 (European Exascale Software Initiative 2),
and likely within H2020 European collaborative projects.

The methodological part of HiePACS covers several topics. First, we address generic studies on massively parallel computing and the design of high-performance algorithms and software to be executed on future extreme-scale platforms. Next, several research directions in scalable parallel linear algebra are addressed, spanning dense direct, sparse direct, iterative and hybrid approaches for large linear systems. Then we consider research plans for N-body interaction computations based on efficient parallel fast multipole methods, and finally we address research tracks related to the algorithmic challenges of complex code coupling in multiscale/multiphysics simulations.

Currently, we have one major multiscale application that is in *material physics*.
We contribute to all steps of the design of the parallel simulation tool.
More precisely, our applied mathematics skills will contribute to the
modeling, and our advanced numerical schemes will help in the design
and efficient software implementation of very large parallel multiscale simulations.
Moreover, the robustness and efficiency of our algorithmic research in linear
algebra are validated through industrial and academic collaborations with
different partners involved in various application fields.
Finally, we are also involved in a few collaborative initiatives in various application domains in a
co-design-like framework.
These research activities are conducted in a wider multidisciplinary context with colleagues in other
academic or industrial groups, where our contribution relates to our areas of expertise.
Not only do these collaborations give our expertise a stronger
impact in various application domains through the promotion of advanced algorithms,
methodologies or tools, but in return they open new avenues for research in the continuity of our core research activities.

Thanks to two Inria collaborative agreements, with EADS-Astrium/Conseil Régional Aquitaine and with CEA, we pursue joint research efforts in a co-design framework enabling efficient and effective technology transfer towards industrial R&D. Furthermore, through two ongoing associated teams, namely MORSE and FastLA, we contribute with world-leading groups to the design of fast numerical solvers and their parallel implementations.

Our high performance software packages are integrated in several academic and industrial complex codes and are validated on very large scale simulations. For all our software developments, we use first the experimental platform PlaFRIM and the various large parallel platforms available through GENCI in France (the CCRT, CINES and IDRIS computing centers), and then the high-end parallel platforms that will be available via European and US initiatives or projects such as PRACE.

The methodological component of HiePACS concerns the expertise for the design, as well as the efficient and scalable implementation, of highly parallel numerical algorithms to perform frontier simulations. In order to address these computational challenges, a hierarchical organization of the research is considered. In this bottom-up approach, we first consider in Section generic topics concerning high performance computational science. The activities described in this section are transversal to the overall project, and their outcome will support all the other research activities at various levels in order to ensure the parallel scalability of the algorithms. The aim of this activity is not to study general-purpose solutions, but rather to address these problems in close relation with specialists of the field in order to adapt and tune advanced approaches in our algorithmic designs. The next activity, described in Section , is related to the study of parallel linear algebra techniques that currently appear as promising approaches to tackle huge problems on extreme-scale platforms. We highlight linear problems (linear systems or eigenproblems) because in many large-scale applications they are the main computationally intensive numerical kernels, and often the main performance bottleneck. These parallel numerical techniques, which are involved in the IPL C2S@Exa, will be the basis of both academic and industrial collaborations, some of which are described in Section , and will also be closely related to some functionalities developed in the parallel fast multipole activity described in Section . Finally, as the accuracy of the physical models increases, there is a real need for efficient parallel algorithm implementations for multiphysics and multiscale modeling, in particular in the context of code coupling. The challenges associated with this activity will be addressed in the framework of the activity described in Section .

Currently, we have one major application (see Section ) that is in material physics. We will contribute to all the steps of the design of the parallel simulation tool. More precisely, our applied mathematics skills will contribute to the modeling, and our advanced numerical schemes will help in the design and efficient software implementation of very large parallel multiscale simulations. We also participate in a few co-design actions in close collaboration with some application groups. The objective of this activity is to instantiate our expertise in fields where it is critical for designing scalable simulation tools. We refer to Section for a detailed description of these activities.


The research directions proposed in HiePACS are strongly influenced by both the applications we are studying and the architectures that we target (i.e., massively parallel many-core architectures). Our main goal is to study the methodology needed to efficiently exploit the new generation of high-performance computers, with all the constraints it induces. To achieve high performance with complex applications, we have to study both algorithmic problems and the impact of the architectures on algorithm design.

From the application point of view, the project is interested in multiresolution, multiscale and hierarchical approaches, which lead to multi-level parallelism schemes. This hierarchical parallelism is necessary to achieve good performance and high scalability on modern massively parallel platforms. In this context, more specific algorithmic problems are very important to obtain high performance. Indeed, the kinds of applications we are interested in are often based on data redistribution (e.g., code coupling applications). This well-known issue becomes very challenging with the increase of both the number of computational nodes and the amount of data. Thus, we have both to design new algorithms and to adapt existing ones. In addition, issues such as task scheduling have to be revisited in this new context. It is important to note that the work done in this area will be applied, for example, in the context of code coupling (see Section ).
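As a toy illustration of the redistribution issue, the following sketch (plain Python, with hypothetical array size and process counts) computes the communication pattern needed to move a block-distributed array between two process sets of different sizes; each message is the overlap of a source interval with a destination interval:

```python
def block_intervals(n, p):
    """Indices [lo, hi) owned by each of p ranks for a length-n block distribution."""
    base, rem = divmod(n, p)
    out, lo = [], 0
    for r in range(p):
        hi = lo + base + (1 if r < rem else 0)
        out.append((lo, hi))
        lo = hi
    return out

def redistribution_plan(n, p_src, p_dst):
    """Messages (src_rank, dst_rank, lo, hi) needed to move a block-distributed
    array of n entries from p_src to p_dst processes."""
    src, dst = block_intervals(n, p_src), block_intervals(n, p_dst)
    plan = []
    for s, (slo, shi) in enumerate(src):
        for d, (dlo, dhi) in enumerate(dst):
            lo, hi = max(slo, dlo), min(shi, dhi)
            if lo < hi:           # non-empty overlap: one message to send
                plan.append((s, d, lo, hi))
    return plan

# Redistribute 10 entries from 2 ranks to 3 ranks.
plan = redistribution_plan(10, 2, 3)
```

The number of messages grows with both process counts, which is precisely why this pattern becomes challenging at scale.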

Considering the complexity of modern architectures, such as massively parallel architectures or new-generation heterogeneous multicore architectures, task scheduling becomes a challenging problem that is central to obtaining high efficiency. Of course, this work requires the use or design of scheduling algorithms and models specifically tailored to our target problems. This has to be done in collaboration with our colleagues from the scheduling community, such as O. Beaumont (Inria REALOPT Project-Team). It is important to note that this topic is strongly linked to the underlying programming model. Indeed, for multicore architectures, it has become clear over the last five years that the best programming model is an approach mixing multi-threading within computational nodes and message passing between them. Over the same period, a lot of work has been done in the high-performance computing community to understand what is critical to efficiently exploit the massively multicore platforms that will appear in the near future. It turns out that the key to performance is first the grain of the computations: on such platforms the grain of parallelism must be small, so that all the processors can be fed with a sufficient amount of work. It is thus crucial for us to design new high performance tools for scientific computing in this new context; this will be done in our solvers, for example, to adapt them to this new parallel scheme. Second, the larger the number of cores inside a node, the more complex the memory hierarchy, which impacts the behaviour of the algorithms within the node. Indeed, on this kind of platform, NUMA effects will be more and more problematic, so it is very important to study and design data-aware algorithms that take into account the affinity between computational threads and the data they access. This is particularly important in the context of our high-performance tools.
Note that this work has to be based on an intelligent, cooperative underlying runtime system (like the tools developed by the Inria RUNTIME Project-Team) which allows fine management of data distribution within a node.
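The effect of task granularity on core utilization can be made concrete with a minimal greedy list-scheduling sketch (hypothetical task durations; this is an illustration, not the scheduling algorithms studied in the project):

```python
import heapq

def list_schedule(durations, n_workers):
    """Greedy list scheduling: assign each task, in order, to the worker
    that becomes free earliest; return the makespan."""
    ready = [0.0] * n_workers          # next-free time of each worker
    heapq.heapify(ready)
    for d in durations:
        t = heapq.heappop(ready)       # earliest-available worker
        heapq.heappush(ready, t + d)
    return max(ready)

# Same total work (24 units) on 4 workers, split coarsely vs finely:
coarse = list_schedule([8.0, 8.0, 8.0], 4)   # 3 big tasks: one core stays idle
fine = list_schedule([1.0] * 24, 4)          # 24 small tasks keep all cores busy
```

With three coarse tasks the makespan is 8 (one core never works), while the fine decomposition reaches the ideal makespan of 6: the small grain is what feeds all the processors.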

Another very important issue concerns high-performance computing
using “heterogeneous” resources within a computational
node. Indeed, with the emergence of the `GPU` and the use of
more specific co-processors, it is
important for our algorithms to efficiently exploit these new kinds
of architectures. To adapt our algorithms and tools to these
accelerators, we need to identify what can be done on the `GPU`,
for example, and what cannot. Note that recent results in the field
have shown the benefit of using both regular cores and `GPU` to
perform computations. Note also that, in contrast to
the parallelism granularity needed by regular multicore
architectures, the `GPU` requires coarser-grain parallelism. Thus,
making both the `GPU` and regular cores work together leads
to two types of tasks in terms of granularity.
This represents a challenging problem, especially in terms of scheduling.
From this perspective, we investigate
new approaches for composing parallel applications within a runtime
system for heterogeneous platforms.
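The two-granularity idea can be sketched with a deliberately simplified partitioning rule, under a hypothetical speed model in which the accelerator is efficient only on coarse tasks (all numbers are illustrative assumptions, not measurements):

```python
def assign_by_grain(tasks, gpu_speed=8.0, cpu_speed=1.0, n_cpu=4, cutoff=4.0):
    """Toy heterogeneous partitioning: tasks whose work is at least `cutoff`
    go to the GPU (fast on coarse grains), smaller ones are shared among the
    CPU cores; returns the estimated time spent on each side."""
    gpu = [w for w in tasks if w >= cutoff]
    cpu = [w for w in tasks if w < cutoff]
    gpu_time = sum(gpu) / gpu_speed
    cpu_time = sum(cpu) / (cpu_speed * n_cpu)   # ideal sharing among cores
    return gpu_time, cpu_time

# Two coarse tasks and four fine ones (hypothetical work units):
gpu_t, cpu_t = assign_by_grain([16.0, 16.0, 1.0, 1.0, 1.0, 1.0])
```

In a real runtime system such as `StarPU`, this split is of course decided dynamically from performance models rather than by a fixed cutoff.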

The SOLHAR project aims at studying and designing algorithms and
parallel programming models for implementing direct methods for the
solution of sparse linear systems on emerging computers equipped with
accelerators.
Several attempts have been made to accomplish the porting of these
methods on such architectures; the proposed approaches are mostly
based on a simple offloading of some computational tasks (the coarsest
grained ones) to the accelerators and rely on fine hand-tuning of the
code and accurate performance modeling to achieve efficiency.
SOLHAR proposes an innovative approach which relies on the efficiency
and portability of runtime systems, such as the `StarPU` tool developed
in the RUNTIME team. Although the SOLHAR project will focus
on heterogeneous computers equipped with GPUs due to their wide
availability and affordable cost, the research accomplished on
algorithms, methods and programming models will be readily applicable
to other accelerator devices.
Our final goal would be to have high performance solvers
and tools which can efficiently run on all these types of
complex architectures by exploiting all the resources of the
platform (even if they are heterogeneous).

In order to build advanced knowledge of the design of efficient
computational kernels for our high performance algorithms and codes,
we will first develop research activities on regular frameworks before
extending them to more irregular and complex situations.
In particular, we will first work on optimized dense linear algebra
kernels, and we will use them in our more complicated direct and hybrid
solvers for sparse linear algebra and in our fast multipole algorithms for
interaction computations.
In this context, we will participate in the development of these kernels
in collaboration with groups specialized in dense linear algebra.
In particular, we intend to develop a strong collaboration with the group of Jack Dongarra
at the University of Tennessee and collaborating research groups. The objective will be to
develop dense linear algebra algorithms and libraries for multicore
architectures in the context of the `PLASMA` project,
and for `GPU` and hybrid multicore/`GPU` architectures in the context of the
`MAGMA` project.
The framework that hosts all these research activities is the associate team
MORSE.
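The tile-based kernels underlying projects such as `PLASMA` can be illustrated with a blocked matrix product on plain Python lists (a sketch of the blocking idea only, not the actual library kernels, which operate on packed tiles with optimized BLAS):

```python
def tiled_matmul(A, B, nb=2):
    """Blocked (tiled) matrix product C = A*B: loop over nb-by-nb tiles so
    that each inner GEMM micro-kernel works on a cache-sized block."""
    n, m, k = len(A), len(B[0]), len(B)
    C = [[0.0] * m for _ in range(n)]
    for ii in range(0, n, nb):
        for jj in range(0, m, nb):
            for kk in range(0, k, nb):
                # micro-kernel on one (ii, jj, kk) tile triple
                for i in range(ii, min(ii + nb, n)):
                    for j in range(jj, min(jj + nb, m)):
                        s = C[i][j]
                        for p in range(kk, min(kk + nb, k)):
                            s += A[i][p] * B[p][j]
                        C[i][j] = s
    return C

C = tiled_matmul([[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]], nb=1)
# C == [[19.0, 22.0], [43.0, 50.0]]
```

The tile loop structure is also what exposes the task graph exploited by tile algorithms: each (ii, jj, kk) iteration is a task with explicit data dependencies.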

A more prospective objective is to study resiliency in the
context of large-scale scientific applications for massively
parallel architectures. Indeed, with the increase in the number of
computational cores per node, the probability of a hardware crash on
a core, or of a memory corruption, increases dramatically. This is a crucial problem
that needs to be addressed. We will study it at the
algorithmic/application level, even though it relies on lower-level
mechanisms (at the OS or even hardware level). Of course, this
work can also be performed at lower levels (at the operating-system level, for
example), but we believe that handling faults at the application
level provides more knowledge about what has to be done: at the
application level we know what is critical and what is not. The
approach we will follow is based on combining fault-tolerant
implementations of the runtime environments we use (such as `FT-MPI`) with
an adaptation of our algorithms to manage this kind of
fault. This topic is a very long range objective which
needs to be addressed to guarantee the robustness of our solvers and
applications.
In that respect, we are involved in an ANR-Blanc project entitled RESCUE, jointly with
two other Inria EPIs, namely ROMA and GRAND-LARGE, in the G8 ESC international initiative, as well as in the
EXA2CT FP7 project.
The main objective of the RESCUE project is to develop new algorithmic techniques and
software tools to solve the exascale resilience problem. Solving this problem implies
a departure from current approaches, and calls for yet-to-be-discovered algorithms, protocols and software tools.
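A minimal sketch of the application-level idea follows: an iterative kernel checkpoints its state periodically and, on a simulated fault, rolls back to the last checkpoint and resumes. This is illustrative only; a real code would write checkpoints to stable storage and rely on a fault-tolerant runtime such as `FT-MPI` to survive the crash itself.

```python
import copy

def run_with_checkpoint(n_iters, ckpt_every, fail_at=None):
    """Toy iterative kernel x <- x + 1 with periodic in-memory checkpoints;
    when a (simulated) fault occurs at iteration `fail_at`, the lost state
    is restored from the last checkpoint and the computation resumes."""
    state = {"it": 0, "x": 0.0}
    ckpt = copy.deepcopy(state)
    failed = False
    while state["it"] < n_iters:
        if state["it"] == fail_at and not failed:
            failed = True                   # simulated crash: current state lost,
            state = copy.deepcopy(ckpt)     # restart from the checkpoint
            continue
        state["x"] += 1.0
        state["it"] += 1
        if state["it"] % ckpt_every == 0:
            ckpt = copy.deepcopy(state)     # application-level checkpoint
    return state["x"]

result = run_with_checkpoint(10, ckpt_every=3, fail_at=7)
```

Only the iterations since the last checkpoint are redone, which is exactly the trade-off (checkpoint frequency versus recovery cost) that resilience research seeks to optimize.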

Finally, it is important to note that the main goal of HiePACS is to design tools and algorithms that will be used within complex simulation frameworks on next-generation parallel machines. Thus, we intend, with our partners, to use the proposed approaches in complex scientific codes, to validate them on very large scale simulations, and to design parallel solutions through co-design collaborations.

Starting with the developments of basic linear algebra kernels tuned for
various classes of computers, a significant knowledge on
the basic concepts for implementations on high-performance scientific computers has been accumulated.
Further knowledge has been acquired through the design of more sophisticated linear algebra algorithms
fully exploiting those basic intensive computational kernels.
In that context, we still look at the development of new computing platforms and their associated programming
tools.
This enables us to identify the possible bottlenecks of new computer architectures
(memory path, various level of caches, inter processor or node network) and to propose
ways to overcome them in algorithmic design.
With the goal of designing efficient scalable linear algebra solvers for large scale applications, various
tracks will be followed in order to investigate different complementary approaches.
Sparse direct solvers have for years been the methods of choice for solving linear systems of equations.
It is nowadays admitted, however, that classical approaches are not scalable, either in computational complexity
or in memory, for large problems such as those arising from the discretization of large 3D PDE problems.
We will continue to work on sparse direct solvers, on the one hand to make sure they fully benefit from the most advanced computing platforms,
and on the other hand to attempt to reduce their memory and computational costs for some classes of problems where
data-sparse ideas can be applied.
Furthermore, sparse direct solvers are key building blocks for the
design of some of our parallel algorithms, such as the hybrid solvers described in the sequel of this section.
Our activities in that context will mainly address preconditioned Krylov subspace methods; both components,
the preconditioner and the Krylov solver, will be investigated.
In this framework, and possibly in relation with the research activity on fast multipole methods, we intend to study how emerging

Solving large sparse systems

Sparse direct solvers are mandatory when the linear system is very ill-conditioned; such a situation is often encountered in structural mechanics codes, for example. Therefore, to obtain an industrial software tool that must be robust and versatile, high-performance sparse direct solvers are mandatory, and parallelism is then necessary for reasons of memory capability and acceptable solution time. Moreover, in order to solve efficiently 3D problems with more than 50 million unknowns, which is now a reachable challenge with new multicore supercomputers, we must achieve good scalability in time and control memory overhead. Solving a sparse linear system by a direct method is generally a highly irregular problem that induces some challenging algorithmic problems and requires a sophisticated implementation scheme in order to fully exploit the capabilities of modern supercomputers.
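To make the irregularity concrete, the following sketch computes the elimination tree of a symmetric sparse pattern (Liu's classical algorithm with path compression), the structure that drives dependencies, fill-in and parallelism in a sparse factorization. The small test matrix is a hypothetical arrow pattern chosen for illustration.

```python
def elimination_tree(rows):
    """Elimination tree of a symmetric sparse matrix. `rows[j]` lists the row
    indices i < j with a nonzero A[i][j] (upper-triangular pattern).
    Returns parent[], where parent[j] is the etree parent of column j
    (-1 for a root); uses the `ancestor` array for path compression."""
    n = len(rows)
    parent = [-1] * n
    ancestor = [-1] * n
    for j in range(n):
        for i in rows[j]:                   # nonzeros above the diagonal in column j
            r = i
            while ancestor[r] != -1 and ancestor[r] != j:
                nxt = ancestor[r]
                ancestor[r] = j             # compress the path toward j
                r = nxt
            if ancestor[r] == -1:
                ancestor[r] = j
                parent[r] = j
    return parent

# Arrow pattern: A[0][1] != 0 and A[i][4] != 0 for i = 0..3.
tree = elimination_tree([[], [0], [], [], [0, 1, 2, 3]])
```

Independent subtrees of this tree can be factorized in parallel, which is the basis of the task decomposition used by solvers such as `PaStiX`.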

New supercomputers incorporate many microprocessors which are
composed of one or many computational cores. These new architectures
induce strongly hierarchical topologies. These are called NUMA
architectures. In the context of distributed NUMA architectures,
in collaboration with the Inria RUNTIME team, we study
optimization strategies to improve the scheduling of
communications, threads and I/O.
We have developed dynamic scheduling designed for NUMA architectures in the
`PaStiX` solver. The data structures of the solver, as well as the
patterns of communication have been modified to meet the needs of
these architectures and dynamic scheduling. We are also interested in
the dynamic adaptation of the computation grain to use efficiently
multi-core architectures and shared memory. Experiments on several
numerical test cases have been performed to prove the efficiency of
the approach on different architectures.

In collaboration with the ICL team from the University of Tennessee,
and the RUNTIME team from Inria, we are evaluating the way to replace
the embedded scheduling driver of the `PaStiX` solver by one of the
generic frameworks, `PaRSEC` or `StarPU`, to execute the task
graph corresponding to a sparse factorization.
The aim is to
design algorithms and parallel programming models for implementing
direct methods for the solution of sparse linear systems on emerging
computers equipped with GPU accelerators. More generally, this work
will be performed in the context of the associate team MORSE and
the ANR SOLHAR project which
aims at designing high performance sparse direct solvers for modern
heterogeneous systems. This ANR project involves several groups working
either on the sparse linear solver aspects (HiePACS and ROMA from
Inria and APO from IRIT), on runtime systems (RUNTIME from Inria) or
scheduling algorithms (REALOPT and ROMA from Inria). The results of
these efforts will be validated in the applications provided by the
industrial project members, namely CEA-CESTA and Airbus Group Innovations.

On the numerical side, we are studying how the data sparseness that might exist in
some dense blocks appearing during the factorization can be exploited using different
compression techniques based on

One route to the parallel scalable solution of large sparse linear systems in parallel scientific computing is the use of hybrid methods that hierarchically combine direct and iterative methods. These techniques inherit the advantages of each approach, namely the limited amount of memory and natural parallelization for the iterative component and the numerical robustness of the direct part. The general underlying ideas are not new since they have been intensively used to design domain decomposition techniques; those approaches cover a fairly large range of computing techniques for the numerical solution of partial differential equations (PDEs) in time and space. Generally speaking, it refers to the splitting of the computational domain into sub-domains with or without overlap. The splitting strategy is generally governed by various constraints/objectives but the main one is to express parallelism. The numerical properties of the PDEs to be solved are usually intensively exploited at the continuous or discrete levels to design the numerical algorithms so that the resulting specialized technique will only work for the class of linear systems associated with the targeted PDE.

In that context, we intend to continue our effort on the design of algebraic non-overlapping domain decomposition techniques
that rely on the solution of a Schur complement system defined on the interface introduced by the partitioning of the
adjacency graph of the sparse matrix associated with the linear system.
Although it is better conditioned than the original system, the Schur complement needs to be
preconditioned to be amenable to solution by a Krylov subspace method.
Different hierarchical preconditioners will be considered, possibly multilevel, to improve the numerical behaviour
of the current approaches implemented in our software libraries `HIPS` and `MaPHyS`. This activity will be developed in the context of
the ANR DEDALES project.
In addition to these numerical studies, advanced parallel implementations will be developed, involving close
collaboration between the hybrid and sparse direct activities.
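The algebraic construction can be sketched on a dense toy problem with NumPy (a hypothetical two-subdomain splitting of a 1-D Laplacian; real solvers such as `MaPHyS` work on sparse matrices, many subdomains, and never form interior inverses explicitly):

```python
import numpy as np

def schur_solve(A11, A22, A1g, A2g, Agg, f1, f2, fg):
    """Two-subdomain solve: eliminate the interior unknowns of each subdomain,
    solve the interface Schur complement system, then back-substitute.
    The (symmetric) system is ordered [interior1, interior2, interface]."""
    Y1 = np.linalg.solve(A11, A1g)                 # A11^{-1} A1g
    Y2 = np.linalg.solve(A22, A2g)
    S = Agg - A1g.T @ Y1 - A2g.T @ Y2              # Schur complement on the interface
    g = fg - A1g.T @ np.linalg.solve(A11, f1) - A2g.T @ np.linalg.solve(A22, f2)
    xg = np.linalg.solve(S, g)                     # interface unknowns
    x1 = np.linalg.solve(A11, f1 - A1g @ xg)       # back-substitute the interiors
    x2 = np.linalg.solve(A22, f2 - A2g @ xg)
    return x1, x2, xg

# 1-D Laplacian on 5 unknowns; node 2 is taken as the interface.
A11 = np.array([[2.0, -1.0], [-1.0, 2.0]]); A22 = A11.copy()
A1g = np.array([[0.0], [-1.0]]); A2g = np.array([[-1.0], [0.0]])
Agg = np.array([[2.0]])
f1 = np.ones(2); f2 = np.ones(2); fg = np.ones(1)
x1, x2, xg = schur_solve(A11, A22, A1g, A2g, Agg, f1, f2, fg)
```

The two interior solves are independent, which is where the natural parallelism of the approach comes from; the iterative component of a hybrid solver replaces the direct solve on S by a preconditioned Krylov method.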

Preconditioning is the main focus of the two activities described above. They aim at speeding up the convergence of a Krylov subspace method that is the complementary component involved in the solvers of interest for us. In that framework, we believe that various aspects deserve to be investigated; we will consider the following ones:

preconditioned block Krylov solvers for multiple right-hand sides.
In many large scientific and industrial applications, one has to solve
a sequence of linear systems with several right-hand sides given simultaneously or in sequence
(radar cross section calculation in electromagnetism, various source locations in seismic, parametric studies
in general, ...).
For “simultaneous" right-hand sides, the solvers of choice have for years been based on matrix factorizations,
as the factorization is performed once and only simple, cheap block forward/backward substitutions are then performed.
In order to propose effective alternatives to such solvers, we need efficient preconditioned Krylov subspace
solvers.
In that framework, block Krylov approaches, where the Krylov spaces associated with each right-hand side
are shared to enlarge the search space will be considered.
They are not only attractive because of this numerical feature (larger search space), but also from an
implementation point of view.
Their block-structures exhibit nice features with respect to data locality and re-usability that comply
with the memory constraint of multicore architectures.
Following the initial work by J. Yan Fei during his post-doc in HiePACS, we will continue the numerical study of the block GMRES variant that combines
inexact-breakdown detection and deflation at restart.
In addition, special attention will be paid to situations where a massive number of right-hand sides is given, where
variants exploiting the possible sparseness (i.e., compression using

For right-hand sides available one after another, various strategies that exploit the information available in the sequence of Krylov spaces (e.g., spectral information) will be considered, including, for instance, techniques to perform incremental updates of the preconditioner or to build augmented Krylov subspaces.
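The shared-search-space idea behind block Krylov methods can be illustrated with a bare block Arnoldi iteration (a NumPy sketch on a hypothetical random operator, without the inexact-breakdown detection or deflation machinery discussed above):

```python
import numpy as np

def block_arnoldi(A, B, m):
    """Build an orthonormal basis of the block Krylov space
    span{B, A B, ..., A^(m-1) B}, shared by all right-hand sides in B.
    Simplified: assumes no rank deficiency (breakdown) occurs."""
    Q, _ = np.linalg.qr(B)
    blocks = [Q]
    for _ in range(m - 1):
        W = A @ blocks[-1]
        for _pass in range(2):              # re-orthogonalize for stability
            for Vj in blocks:               # block modified Gram-Schmidt
                W = W - Vj @ (Vj.T @ W)
        Q, _ = np.linalg.qr(W)              # orthonormalize the new block
        blocks.append(Q)
    return np.hstack(blocks)

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 20))
B = rng.standard_normal((20, 3))            # three simultaneous right-hand sides
V = block_arnoldi(A, B, 4)                  # 12-dimensional shared search space
```

Each iteration applies A to a whole block of vectors, which is the data-locality advantage mentioned above: one sweep over A serves all right-hand sides at once.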

Extension or modification of Krylov subspace algorithms for multicore architectures: finally, to follow the evolution of computer architectures as closely as possible and extract as much performance as possible, particular attention will be paid to adapting, extending or developing numerical schemes that comply with the efficiency constraints of the available computers. Nowadays, multicore architectures are becoming ubiquitous, and memory latency and bandwidth are the main bottlenecks; investigations of communication-avoiding techniques will be undertaken in the framework of preconditioned Krylov subspace solvers, as a general guideline for all the items mentioned above. This research activity will benefit from the FP7 EXA2CT project, led by HiePACS on behalf of the IPL C2S@Exa, which involves two other Inria projects, namely ALPINES and SAGE.

Many eigensolvers also rely on Krylov subspace techniques. Naturally some links exist between the Krylov subspace linear solvers and the Krylov subspace eigensolvers. We plan to study the computation of eigenvalue problems with respect to the following two different axes:

Exploiting the link between Krylov subspace methods for linear system solution and eigensolvers, we intend to develop advanced iterative linear methods based on Krylov subspace methods that use some spectral information to build part of a subspace to be recycled, either through space augmentation or through preconditioner updates. This spectral information may correspond to a certain part of the spectrum of the original large matrix, or to approximations of the eigenvalues obtained by solving a reduced eigenproblem. This technique will also be investigated in the framework of block Krylov subspace methods.

In the context of the calculation of the ground state of an atomistic system, eigenvalue computation is a critical step; more accurate and more efficient parallel and scalable eigensolvers are required.


In most scientific computing applications considered nowadays as
computational challenges (like biological and material systems,
astrophysics or electromagnetism), the introduction of hierarchical
methods based on an octree structure has dramatically reduced the
amount of computation needed to simulate those systems for a given
accuracy. For instance, in the N-body problem arising from
these application fields, we must compute all pairwise
interactions among N objects (particles, lines, ...) at every
timestep. Among these methods, the Fast Multipole
Method (FMM) developed for gravitational potentials in astrophysics
and for electrostatic (coulombic) potentials in molecular simulations
solves this N-body problem for any given precision with

The potential field is decomposed into a near-field part, computed directly, and a far-field part approximated thanks to multipole and local expansions. We introduced a matrix formulation of the FMM that exploits the cache hierarchy of a processor through the Basic Linear Algebra Subprograms (BLAS). Moreover, we developed a parallel adaptive version of the FMM algorithm for heterogeneous particle distributions, which is very efficient on parallel clusters of SMP nodes. Finally, on such computers, we developed the first hybrid MPI-thread algorithm, which enables us to reach better parallel efficiency and better memory scalability. We plan to work on the following points in HiePACS.
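The near/far decomposition can be illustrated in one dimension with a monopole-only approximation: neighbouring cells are summed directly, distant cells are replaced by their total mass at the centre of mass. This is a drastically simplified sketch (the real FMM uses a cell hierarchy and higher-order expansions); particle positions below are arbitrary illustrative values.

```python
def potentials(pos, mass, ncell=8):
    """Near/far split for the 1-D potential phi_i = sum_{j != i} m_j / |x_i - x_j|
    on [0, 1): interactions with a cell and its neighbours are computed directly,
    distant cells are approximated by their monopole."""
    cells = [[] for _ in range(ncell)]
    for j, x in enumerate(pos):
        cells[min(int(x * ncell), ncell - 1)].append(j)
    m_tot = [sum(mass[j] for j in c) for c in cells]
    x_com = [sum(mass[j] * pos[j] for j in c) / m if m > 0 else 0.0
             for c, m in zip(cells, m_tot)]
    phi = []
    for i, xi in enumerate(pos):
        ci = min(int(xi * ncell), ncell - 1)
        p = 0.0
        for c in range(ncell):
            if abs(c - ci) <= 1:               # near field: direct pairwise sum
                p += sum(mass[j] / abs(xi - pos[j]) for j in cells[c] if j != i)
            elif m_tot[c] > 0:                 # far field: monopole approximation
                p += m_tot[c] / abs(xi - x_com[c])
        phi.append(p)
    return phi

pos = [0.05, 0.1, 0.3, 0.55, 0.6, 0.9]
mass = [1.0] * 6
phi = potentials(pos, mass)
```

Even this crude monopole truncation stays within a few percent of the direct sum, while reducing the work on distant cells from one term per particle to one term per cell.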

Nowadays, the high performance computing community is examining
alternative architectures that address the limitations of modern
cache-based designs. `GPU` (Graphics Processing Units) and the Cell
processor have thus already been used in astrophysics and in molecular
dynamics. The Fast Multipole Method has also been implemented on `GPU`.
We intend to examine the
potential of using these forthcoming processors as a building block
for high-end parallel computing in N-body calculations. More
precisely, we want to take advantage of our specific underlying BLAS routines
to obtain an efficient and easily portable FMM for these new architectures.
Algorithmic issues such as dynamic load balancing among heterogeneous
cores will also have to be solved in order to gather all the available
computation power.
This research action will be conducted in close connection with the
activity described in
Section .

In many applications arising from material physics or astrophysics, the distribution of the data is highly non-uniform, and the data can grow between two time steps. As mentioned previously, we have proposed a hybrid MPI-thread algorithm to exploit the data locality within each node. We plan to further improve the load balancing for highly non-uniform particle distributions with small computation grain, through dynamic load balancing at the thread level and through a load-balancing correction over several simulation time steps at the process level.
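The cost-based correction can be sketched with a prefix-sum splitter that assigns contiguous, weighted particles to processes (a simple static heuristic with hypothetical costs, not the actual dynamic scheme):

```python
from itertools import accumulate
from bisect import bisect_left

def balanced_split(costs, nproc):
    """Split a sequence of per-particle costs into nproc contiguous chunks of
    roughly equal total cost using prefix sums: chunk p owns the index range
    [cuts[p], cuts[p+1])."""
    prefix = list(accumulate(costs))
    total = prefix[-1]
    cuts = [0]
    for p in range(1, nproc):
        # first index whose cumulative cost reaches the p-th fraction of total
        k = bisect_left(prefix, total * p / nproc) + 1
        cuts.append(min(k, len(costs)))
    cuts.append(len(costs))
    return cuts

cuts = balanced_split([2, 1, 3, 2, 1, 2, 3, 2], 2)   # cuts == [0, 4, 8]
```

Here each of the two chunks receives a total cost of 8; in the dynamic setting, the per-particle costs would be measured during previous time steps and the cuts corrected accordingly.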

The engine that we develop will be extended to new potentials arising
from material physics such as those used in dislocation
simulations. The interaction between dislocations is long ranged
(

The boundary element method (BEM) is a well-known approach for solving
boundary value problems appearing in various fields of
physics. With this approach, we only have to solve an integral
equation on the boundary. The interaction kernel decays in space, but the discretization results
in the solution of a dense linear system with

Many important physical phenomena in material physics and climatology give rise to inherently complex applications. They often use multi-physics or multi-scale approaches that couple different models and codes. The key idea is to reuse available legacy codes through a coupling framework instead of merging them into a standalone application. There is typically one model per scale or physics, and each model is implemented by a parallel code. For instance, to model a crack propagation, one uses a molecular dynamics code to represent the atomistic scale and an elasticity code using a finite element method to represent the continuum scale. Indeed, fully microscopic simulations of most domains of interest are not computationally feasible. Combining such different scales or physics is still a challenge for reaching high performance and scalability. While the modeling aspects are often well studied, several algorithmic problems remain open, and we plan to investigate them in the HiePACS project-team.

As mentioned previously, many important physical phenomena, such as material deformation and failure (see Section ), are inherently multiscale processes that cannot always be modeled via a continuum model. Fully microscopic simulations of most domains of interest are not computationally feasible. Therefore, researchers must look at multiscale methods that couple micro models and macro models. Combining different scales, such as quantum-atomistic or atomistic, mesoscale and continuum, is still a challenge for obtaining accurate schemes that efficiently and effectively exchange information between the different scales. We are currently involved in two national research projects that focus on multiscale schemes. More precisely, the models that we have started to study are the quantum to atomic coupling (QM/MM coupling) in the ANR NOSSI and the atomic to dislocation coupling in the ANR OPTIDIS.

In this context of code coupling, one crucial issue is undoubtedly the
load balancing of the whole coupled simulation, which remains an open
question. The goal here is to find the best data distribution for the
whole coupled simulation and not only for each standalone code, as is
usually done. Indeed, balancing each code naively on its
own can lead to a significant imbalance and to a communication
bottleneck during the coupling phase, which can drastically decrease
the overall performance. Therefore, we argue that the coupling itself
must be modeled in order to ensure good scalability,
especially when running on massively parallel architectures (tens of
thousands of processors/cores). In other words, one must develop new
algorithms and software implementations to perform a *coupling-aware* partitioning of the whole application.

Another related problem is that of resource allocation. This is particularly important for the global coupling efficiency and scalability, because each code involved in the coupling can be more or less computationally intensive, and there is a trade-off to find between the resources assigned to each code to prevent one of them from waiting for the other(s). And what happens if the load of one code changes dynamically relative to the other? In such a case, it could be convenient to adapt the number of resources used at runtime.

For instance, the conjugate heat transfer simulation in complex geometries (as developed by the CFD team of CERFACS) requires coupling a fluid/convection solver (AVBP) with a solid/conduction solver (AVTP). The AVBP code is much more CPU-consuming than the AVTP code, so there is an important computational imbalance between the two solvers. Using new algorithms to correctly load balance coupled simulations, with enhanced graph partitioning techniques, appears to be a promising way to reach better performance for coupled applications on massively parallel computers.
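The resource trade-off can be sketched as follows, under the simplifying assumption that both solvers scale perfectly with the number of processors; the workloads in the test are hypothetical, not AVBP/AVTP measurements.

```python
def split_resources(work_a, work_b, n_procs):
    """Choose the processor split that minimizes the time per coupled
    step when two solvers run side by side (perfect scaling assumed).
    Returns (time per coupled step, procs for A, procs for B)."""
    best = None
    for p_a in range(1, n_procs):
        p_b = n_procs - p_a
        # Each coupled step lasts as long as the slower of the two codes.
        step_time = max(work_a / p_a, work_b / p_b)
        if best is None or step_time < best[0]:
            best = (step_time, p_a, p_b)
    return best
```

With a 9:1 workload ratio, the split that avoids any waiting gives nine times more processors to the expensive solver; a dynamic variant would simply re-run this search when the measured workloads change.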

Graph handling and partitioning play a central role in the activity described here but also in other numerical techniques detailed in Section .

Nested dissection is now a well-known heuristic for sparse matrix
ordering, used both to reduce the fill-in during numerical factorization and
to maximize the number of independent computation tasks. By using the
block data structure induced by the partition of separators of the
original graph, very efficient parallel block solvers have been
designed and implemented according to supernodal or multifrontal
approaches. For hybrid methods mixing direct and
iterative solvers, such as `HIPS` or `MaPHyS`, obtaining a domain
decomposition that balances both the size of the domain
interiors and the size of the interfaces is a key point for load balancing
and efficiency in a parallel context.
We intend to revisit some well-known graph partitioning techniques in
the light of the hybrid solvers and design new algorithms to be tested
in the `Scotch` package.
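The idea behind nested dissection can be illustrated on the simplest possible graph, a path, where the middle vertex is a separator (a toy sketch, not the algorithms implemented in `Scotch`):

```python
def nested_dissection(vertices):
    """Toy nested dissection ordering of a path graph 0-1-2-...-(n-1).

    The middle vertex separates the path into two independent halves;
    the halves are ordered recursively and the separator is numbered
    last, which both limits fill-in during factorization and exposes
    independent elimination tasks."""
    if len(vertices) <= 2:
        return list(vertices)
    mid = len(vertices) // 2
    left = nested_dissection(vertices[:mid])
    right = nested_dissection(vertices[mid + 1:])
    return left + right + [vertices[mid]]
```

On a 7-vertex path, the two halves are ordered first and the top-level separator (vertex 3) is eliminated last; in a hybrid solver, that separator is precisely the interface handled iteratively.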


Due to the increase of available computer power, new applications appear in nanoscience and physics, such as the study of the properties of new materials (photovoltaic materials, bio- and environmental sensors, ...), failure in materials, or nano-indentation. Chemists and physicists now commonly perform simulations in these fields. These computations simulate systems of up to billions of atoms in materials, over large time scales of up to several nanoseconds. The larger the simulation, the cheaper the potential driving the phenomena must be, resulting in lower-precision results. So, if we need to increase the precision, there are two ways to keep the computational cost under control: the first is to improve the algorithms and their parallelization, and the second is to adopt a multiscale approach.

A domain of interest is the material aging for the nuclear industry. The materials are exposed to complex conditions due to the combination of thermo-mechanical loading, the effects of irradiation and the harsh operating environment. This operating regime makes experimentation extremely difficult and we must rely on multi-physics and multi-scale modeling for our understanding of how these materials behave in service. This fundamental understanding helps not only to ensure the longevity of existing nuclear reactors, but also to guide the development of new materials for 4th generation reactor programs and dedicated fusion reactors. For the study of crystalline materials, an important tool is dislocation dynamics (DD) modeling. This multiscale simulation method predicts the plastic response of a material from the underlying physics of dislocation motion. DD serves as a crucial link between the scale of molecular dynamics and macroscopic methods based on finite elements; it can be used to accurately describe the interactions of a small handful of dislocations, or equally well to investigate the global behavior of a massive collection of interacting defects.

To explore, i.e. to simulate, these new areas, we need to develop and/or significantly improve the models, schemes and solvers used in the classical codes. In this project, we want to accelerate the algorithms arising in those fields. We will focus on the following topics (in particular in the OPTIDIS project, currently under definition, in collaboration with CEA Saclay, CEA Ile-de-France and the SIMaP Laboratory in Grenoble) in connection with the research described in Sections and .

The interaction between dislocations is long ranged (

In such simulations, the number of dislocations grows as the phenomenon unfolds, and these dislocations are not uniformly distributed in the domain. This means that strategies to dynamically construct a good load balancing are crucial to achieve high performance.

From a physical and a simulation point of view, it will be interesting to couple a molecular dynamics model (atomistic model) with a dislocation one (mesoscale model). In such a three-dimensional coupling, the main difficulties are, first, to find and characterize a dislocation in the atomistic region and, second, to understand how to consistently transmit information between the micro and meso scales.


The research activities concerning the ITER challenge are involved in the Inria Project Lab (IPL) C2S@Exa.

The numerical simulation tools designed for the ITER challenge aim at making
significant progress in understanding active control
methods for plasma edge MHD instabilities, the Edge Localized Modes (ELMs),
which represent a particular danger with respect to heat and particle
loads for the Plasma Facing Components (PFC) in ITER. The project focuses
in particular on the numerical modeling of ELM control
methods such as Resonant Magnetic Perturbations (RMPs) and pellet ELM
pacing, both foreseen in ITER. The goals of the project are to improve
the understanding of the related physics and to propose possible new
strategies to improve the effectiveness of ELM control techniques. The
tool for nonlinear MHD modeling (the `JOREK` code) will be largely
developed within the present project to include the corresponding new
physical models, in conjunction with new developments in mathematics
and computer science, in order to progress towards urgently needed
solutions for ITER.

The fully implicit time evolution scheme in the
`JOREK` code leads to large sparse linear systems that have to be solved at
every time step. The MHD model leads to very badly conditioned
matrices. In principle the `PaStiX` library can solve these large
sparse problems using a direct method. However, for large 3D problems the CPU
time for the direct solver becomes too large. Iterative solution
methods require a preconditioner adapted to the problem. Many of the
commonly used preconditioners have been tested but no satisfactory
solution has been found.
The research activities presented in Section
will contribute to design
new solution techniques best suited for this context.
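To fix ideas, the sketch below shows a preconditioned conjugate gradient on a tiny SPD system; the `precond` argument is exactly the slot that a problem-adapted preconditioner would fill. This is illustrative only: the actual `JOREK` systems are non-symmetric, enormous and far worse conditioned, which is precisely why a satisfactory preconditioner is still missing.

```python
def pcg(A, b, precond, tol=1e-10, max_iter=200):
    """Preconditioned conjugate gradient on a small dense SPD system
    (lists of lists). `precond` applies an approximate inverse of A to
    a vector; the better it captures A, the fewer iterations needed."""
    n = len(b)

    def matvec(v):
        return [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]

    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))

    x = [0.0] * n
    r = list(b)                      # residual of the zero initial guess
    z = precond(r)
    p = list(z)
    rz = dot(r, z)
    for it in range(max_iter):
        Ap = matvec(p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * ai for ri, ai in zip(r, Ap)]
        if dot(r, r) ** 0.5 < tol:
            return x, it + 1
        z = precond(r)
        rz, rz_old = dot(r, z), rz
        beta = rz / rz_old
        p = [zi + beta * pi for zi, pi in zip(z, p)]
    return x, max_iter
```

On a well-conditioned model problem a simple Jacobi preconditioner suffices; on badly conditioned MHD matrices the iteration count explodes unless the preconditioner captures the problematic part of the spectrum.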

In the context of the ITER challenge, the `GYSELA` project aims at
simulating the turbulence of plasma particles inside a tokamak. Thanks
to a better comprehension of this phenomenon, it would be possible to
design a new kind of energy source based on nuclear fusion.
Currently, `GYSELA` is parallelized with MPI/OpenMP and can exploit
the power of the largest current supercomputers (e.g., Juqueen). To
faithfully simulate the plasma physics, `GYSELA` handles a huge amount
of data. In fact, memory consumption is a bottleneck for large
simulations (449 K cores). In the meantime, all the reports on the
future Exascale machines expect a decrease of the memory per core. In
this context, mastering the memory consumption of the code becomes critical
to consolidate its scalability and to enable the implementation of
new features that fully benefit from extreme scale architectures.

In addition to activities for designing advanced generic tools for
managing memory optimization, further algorithmic research will be
conducted to better predict and limit the memory peak in order to
reduce the memory footprint of `GYSELA`.
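As a hypothetical illustration of the kind of bookkeeping involved (not the `GYSELA` tooling itself), replaying an allocation schedule yields the memory peak, and reordering operations so that releases occur earlier directly lowers it:

```python
def memory_peak(schedule):
    """Replay a schedule of ('alloc', size) / ('free', size) events and
    return (peak, final) memory usage. Predicting the peak this way is
    the first step towards reordering operations to reduce it."""
    current = peak = 0
    for op, size in schedule:
        current += size if op == 'alloc' else -size
        peak = max(peak, current)
    return peak, current
```

For example, freeing a 4-unit buffer before (rather than after) allocating a 6-unit one lowers the peak from 10 to 8 units without changing the final state.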

As part of its activity, EDF R&D is developing a new nuclear core
simulation code named `COCAGNE` that relies on a Simplified PN (SPN) method to compute
the neutron flux inside the core for eigenvalue calculations. In order
to assess the accuracy of SPN results, a 3D Cartesian model of PWR
nuclear cores has been designed and a reference neutron flux inside
this core has been computed with a Monte Carlo transport code
from Oak Ridge National Lab. This kind of 3D whole core probabilistic
evaluation of the flux is computationally very demanding. An efficient
deterministic approach is therefore required to reduce the computation
effort dedicated to reference simulations.

In this collaboration, we work on the parallelization (for shared and
distributed memories) of the `DOMINO` code, a parallel 3D Cartesian SN
solver specialized for PWR core reactivity computations which is fully
integrated in the `COCAGNE` system.

For 20 years, ASTRIUM has developed the `FLUSEPA` code, which focuses on
unsteady phenomena with changing topology, such as stage separation or
rocket launch. The code is based on a finite volume formulation with
adaptive temporal integration and supports bodies in relative
motion.
The adaptive temporal integration classifies cells into several temporal
levels, level zero containing the slowest cells and each level being
twice as fast as the previous one. This distribution can evolve during
the computation, leading to load-balancing issues in a parallel
computation context.
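The classification can be sketched as follows (an illustrative reconstruction from the description above, not the `FLUSEPA` code):

```python
import math

def temporal_levels(dt_stable):
    """Classify cells into temporal levels: level 0 holds the slowest
    cells (largest stable time step) and each level advances twice as
    fast as the previous one. A level-k cell performs 2**k sub-steps
    per step of the level-0 cells, which is exactly where the parallel
    load imbalance comes from."""
    dt_max = max(dt_stable)
    levels = [max(0, math.ceil(math.log2(dt_max / dt))) for dt in dt_stable]
    substeps = [2 ** k for k in levels]
    return levels, substeps
```

Since the per-cell work is proportional to its sub-step count, a cell migrating to a deeper level doubles its cost, so any static partition degrades as the level distribution evolves.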
Bodies in relative motion are managed through a CHIMERA-like technique,
which builds a composite mesh by merging multiple meshes. The
meshes with the highest priorities cover those with lower priorities, and
at the boundaries of the covered mesh an intersection is computed. Unlike
the classical CHIMERA technique, no interpolation is performed, which allows
a conservative flow integration.
The main objective of this research is to design a scalable version of
`FLUSEPA` in order to run very large 3D simulations efficiently on
modern parallel architectures.

We describe in this section the software that we are developing. The first packages constitute the main milestones of our project. The other software developments will be conducted in collaboration with academic partners or with industrial partners in the context of their private R&D or production activities. For all these software developments, we will first use the various (very) large parallel platforms available through GENCI in France (CCRT, CINES and IDRIS computational centers), and then the high-end parallel platforms that will be available via European and US initiatives or projects such as PRACE.

`MaPHyS` (Massively Parallel Hybrid Solver) is a
software package
that implements a parallel linear solver coupling direct and iterative approaches. The underlying idea
is to apply to general unstructured linear systems the domain
decomposition ideas developed for the solution of linear systems
arising from PDEs. The interface problem, associated with the so-called
Schur complement system, is solved using a block preconditioner
with overlap between the blocks, referred to as Algebraic
Additive Schwarz.
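In miniature, and with a dense toy matrix instead of the sparse systems `MaPHyS` targets, the Schur complement on the interface unknowns is formed as follows:

```python
def schur_complement(A, interior, interface):
    """Form S = A_GG - A_GI * inv(A_II) * A_IG for a dense matrix given
    as a list of lists (G = interface, I = interior index sets), using
    Gaussian elimination for the inner solve. Toy illustration of the
    interface system that is then solved iteratively."""
    def solve(M, B):
        # Gaussian elimination with partial pivoting, multiple RHS columns.
        n = len(M)
        M = [row[:] for row in M]
        B = [row[:] for row in B]
        for c in range(n):
            piv = max(range(c, n), key=lambda r: abs(M[r][c]))
            M[c], M[piv] = M[piv], M[c]
            B[c], B[piv] = B[piv], B[c]
            for r in range(c + 1, n):
                f = M[r][c] / M[c][c]
                for k in range(c, n):
                    M[r][k] -= f * M[c][k]
                for k in range(len(B[0])):
                    B[r][k] -= f * B[c][k]
        for c in range(n - 1, -1, -1):
            for k in range(len(B[0])):
                B[c][k] = (B[c][k] - sum(M[c][j] * B[j][k]
                                         for j in range(c + 1, n))) / M[c][c]
        return B

    A_II = [[A[i][j] for j in interior] for i in interior]
    A_IG = [[A[i][j] for j in interface] for i in interior]
    A_GI = [[A[i][j] for j in interior] for i in interface]
    A_GG = [[A[i][j] for j in interface] for i in interface]
    X = solve(A_II, A_IG)                      # X = inv(A_II) * A_IG
    return [[A_GG[r][c] - sum(A_GI[r][j] * X[j][c]
                              for j in range(len(interior)))
             for c in range(len(interface))] for r in range(len(interface))]
```

In the actual solver the interiors are factorized by a sparse direct method on each subdomain, and the (never fully assembled) Schur complement is solved by a preconditioned Krylov method.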

The `MaPHyS` package is very much a first outcome of the research activity
described in Section .
Finally, `MaPHyS` is a preconditioner that can be used to speed up the convergence
of any Krylov subspace method.
We foresee either embedding some Krylov solvers in `MaPHyS` or releasing them as
standalone packages, in particular the block variants that will be an
outcome of the studies discussed in
Section .

`MaPHyS` can be found at
http://

Complete and incomplete supernodal sparse parallel factorizations.

`PaStiX` (Parallel Sparse matriX package) is a scientific library that provides
a high performance parallel solver for very large sparse linear
systems based on block direct and block ILU(k) iterative methods.
Numerical algorithms are implemented in single or double precision
(real or complex): LLt (Cholesky), LDLt (Crout) and LU with static
pivoting (for non-symmetric matrices having a symmetric pattern).

The `PaStiX` library uses the graph partitioning and sparse matrix
block ordering package `Scotch`.
`PaStiX` is based on an efficient static scheduler and memory
manager, in order to solve 3D problems with more than 50 million
unknowns. The mapping and scheduling algorithm handles a combination
of 1D and 2D block distributions. It computes an efficient
static scheduling of the block computations for our supernodal
parallel solver, which uses a local aggregation of contribution
blocks. This is done by precisely taking into account the
computational costs of the BLAS 3 primitives, the communication costs
and the cost of local aggregations. We also improved this static
computation and communication scheduling algorithm to anticipate the
sending of partially aggregated blocks, in order to free memory
dynamically. By doing this, we are able to reduce the aggregated memory overhead while keeping good performance.

Another important point is that our study is suitable for any heterogeneous parallel/distributed architecture whose performance is predictable, such as clusters of multicore nodes. In particular, we now offer a high performance version with a low memory overhead for multicore node architectures, which fully exploits the advantage of shared memory through a hybrid MPI-thread implementation.

Direct methods are numerically robust, but very large three-dimensional problems may lead to systems that require a huge amount of memory despite any memory optimization. One approach under study consists in defining an adaptive blockwise incomplete factorization that is much more accurate (and numerically more robust) than the scalar incomplete factorizations commonly used to precondition iterative solvers. Such an incomplete factorization can take advantage of the latest breakthroughs in sparse direct methods and should be very competitive in CPU time (effective use of processor power and good scalability) while avoiding the memory limitation encountered by direct methods.

`PaStiX` is publicly available at
http://

Multilevel method, domain decomposition, Schur complement, parallel iterative solver.

`HIPS` (Hierarchical Iterative Parallel Solver) is a scientific
library that provides an efficient parallel iterative solver for very
large sparse linear systems.

The key point of the methods implemented in `HIPS` is to define an
ordering and a partition of the unknowns that relies on a form of
nested dissection ordering in which cross points in the separators
play a special role (Hierarchical Interface Decomposition ordering).
The subgraphs obtained by nested dissection correspond to the
unknowns that are eliminated using a direct method, while the Schur
complement system on the remaining unknowns (which correspond to
the interface between the subgraphs viewed as subdomains) is solved
using an iterative method (GMRES or Conjugate Gradient at present).
This special ordering and partitioning allows the use of dense
block algorithms both in the direct and iterative parts of the solver
and provides a high degree of parallelism to these algorithms.
The code provides a hybrid method which blends direct and iterative solvers.
`HIPS` exploits the partitioning and multistage ILU techniques
to enable a highly parallel scheme
where several subdomains can be assigned to the same process. It also
provides a scalar preconditioner based on the multistage ILUT
factorization.

`HIPS` can be used as a standalone program that reads a sparse linear
system from a file; it also provides an interface that can be called from
any C, C++ or Fortran code.
It handles symmetric, unsymmetric, real or complex matrices. Thus, `HIPS` is a
software library that provides several methods to build an efficient
preconditioner in almost all situations.

`HIPS` is publicly available at
http://

`MetaPart` is a library that addresses the challenge of (dynamic) load
balancing for emerging complex parallel simulations, such as
multi-physics or multi-scale coupling applications. First, it offers a
uniform API over state-of-the-art (hyper)graph partitioning software
packages such as `Scotch`, `PaToH`, `METIS`, `Zoltan`, `Mondriaan`,
etc. Based upon this API, it provides a framework that facilitates the
development and the evaluation of high-level partitioning methods,
such as MxN repartitioning or coupling-aware partitioning
(co-partitioning).

The framework is publicly available at Inria Gforge:
http://

MPICPL (MPI CouPLing) is a software library dedicated to the coupling
of parallel legacy codes based on the well-known MPI
standard. It proposes a lightweight and comprehensive programming
interface that simplifies the coupling of several MPI codes (2, 3 or
more). MPICPL facilitates the deployment of these codes thanks to the
*mpicplrun* tool, and it interconnects them automatically through
standard MPI inter-communicators. Moreover, it generates the universe
communicator, which merges the world communicators of all
coupled codes. The coupling infrastructure is described by a simple
XML file that is simply loaded by the *mpicplrun* tool.

MPICPL was developed by HiePACS for the purpose
of the ANR NOSSI. It uses advanced features of the MPI2 standard. The
framework is publicly available at Inria Gforge:
http://

`ScalFMM` (Parallel Fast Multipole Library for Large Scale Simulations)
is a software library to simulate N-body interactions using the Fast
Multipole Method.

`ScalFMM` intends to offer all the functionalities needed to perform large parallel simulations while enabling an easy customization of the
simulation components: kernels, particles and cells.
It works in parallel in a shared/distributed memory model using OpenMP and MPI.
The software architecture has been designed with two major objectives:
being easy to maintain and easy to understand.
There are two main parts: 1) the management of the octree and the parallelization of the method; 2) the kernels. This architecture allows us to easily add new FMM algorithms or kernels and new parallelization paradigms.
The code is thoroughly documented and follows a consistent naming convention.
Driven by its user-oriented philosophy, `ScalFMM` uses CMake as its build and installation tool. Although `ScalFMM` is written in C++,
it will soon support C and Fortran APIs.
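As an illustration of how an octree can linearize space (a simplified sketch, not `ScalFMM`'s actual data structure), the Morton index of a cell is obtained by interleaving the bits of its integer coordinates:

```python
def morton3d(ix, iy, iz, level):
    """Interleave the bits of the 3D cell coordinates at a given octree
    level to produce the Morton (Z-order) index of the cell. Cells that
    are close in space tend to be close along this 1D curve, which is
    convenient both for locality and for splitting work among processes."""
    code = 0
    for bit in range(level):
        code |= ((ix >> bit) & 1) << (3 * bit)
        code |= ((iy >> bit) & 1) << (3 * bit + 1)
        code |= ((iz >> bit) & 1) << (3 * bit + 2)
    return code
```

Sorting cells by this index yields a one-dimensional ordering that can be cut into contiguous, roughly equal-cost chunks, one per MPI process.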

The library offers two methods to compute interactions between bodies when the potential decays like

The `ScalFMM` package is available at http://

Visualization, Execution trace

`ViTE` is a trace explorer, a tool made to visualize execution
traces of large parallel programs. It supports Pajé, a trace
format created by Inria Grenoble, as well as the OTF and OTF2 formats developed
by the University of Dresden, and offers the programmer a simple way
to analyse, debug and/or profile large parallel applications. It is an
open source software licensed under CeCILL-A.

The `ViTE` software is available at
http://

In the same context we also contribute to the EZTrace and GTG
libraries in collaboration with F. Trahay from Telecom SudParis.
EZTrace (http:// ) automatically generates execution traces that can be visualized with `ViTE`.

For the materials physics applications, a lot of development will be done in the context of ANR projects (NOSSI and OPTIDIS, see Section ) in collaboration with LaBRI, CPMOH, IPREM, EPFL and with CEA Saclay and Bruyère-le-Châtel.

**FAST**

`FAST` is a linear-response time-dependent density functional program for computing the electronic absorption spectrum of molecular systems. It uses an
O(
`FAST` works only with LDA, which, despite its limitations, has provided useful results on the systems to which the present authors have applied it.
The `FAST` library is available at
http://

**OptiDis**

OptiDis is a new code for large scale dislocation dynamics simulations. Its aim is to simulate real life dislocation densities (up until

The code is based on the Numodis code developed at CEA Saclay and on the `ScalFMM` library developed in our Inria project. The code is written in C++ and uses the latest features of C++11. One of its main aspects is the hybrid MPI/OpenMP parallelism, which gives the software the ability to scale on large clusters as the computational load rises. In order to achieve that, we use different levels of parallelism. First of all, the simulation box is spread over MPI processes; we then use a finer level for threads, dividing the domain using an octree representation. All these parts are driven by the `ScalFMM` library. At the last level, our data are stored in an adaptive structure that absorbs the dynamic behavior of this kind of simulation and lends itself well to task parallelism.

The two following packages are mainly designed and developed in the context of a US initiative led by ICL, with which we closely collaborate through the associate team MORSE.

**PLASMA**

The `PLASMA` (Parallel Linear Algebra for Scalable Multi-core Architectures)
project aims at addressing the critical and highly disruptive
situation that is facing the Linear Algebra and High Performance
Computing community due to the introduction of multi-core
architectures.

The ultimate goal of `PLASMA` is to create software frameworks that enable
programmers to simplify the process of developing applications that
can achieve both high performance and portability across a range of
new architectures.

The development of programming models that enforce asynchronous, out of order scheduling of operations is the concept used as the basis for the definition of a scalable yet highly efficient software framework for Computational Linear Algebra applications.

The `PLASMA` library is available at
http://

**PaRSEC/DPLASMA**

`PaRSEC` (Parallel Runtime Scheduling and Execution Controller) is a
generic framework for architecture-aware scheduling and management of
micro-tasks on distributed many-core heterogeneous
architectures. The applications we consider can be expressed as a Directed
Acyclic Graph (DAG) of tasks with labeled edges designating data
dependencies. DAGs are represented in a compact, problem-size-independent
format that can be queried on demand to discover data
dependencies in a totally distributed fashion. `PaRSEC` assigns
computation threads to the cores, overlaps communications and
computations, and uses a dynamic, fully-distributed scheduler based on
architectural features such as NUMA nodes and algorithmic features
such as data reuse.
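The execution model can be illustrated with a toy list scheduler over an explicit DAG (unlike `PaRSEC`, which never materializes the graph and discovers dependencies on demand from its compact representation):

```python
from collections import deque

def schedule(tasks, deps):
    """Toy dataflow scheduler: `deps` maps each task to the tasks whose
    outputs it consumes. A task becomes ready once all of its inputs
    have been produced; ready tasks are executed in discovery order."""
    remaining = {t: len(deps.get(t, [])) for t in tasks}
    consumers = {t: [] for t in tasks}
    for t, inputs in deps.items():
        for d in inputs:
            consumers[d].append(t)
    ready = deque(t for t in tasks if remaining[t] == 0)
    order = []
    while ready:
        t = ready.popleft()
        order.append(t)               # "execute" the task
        for c in consumers[t]:        # release its consumers
            remaining[c] -= 1
            if remaining[c] == 0:
                ready.append(c)
    return order
```

In a real runtime the `order.append` step is replaced by dispatching the task to a worker chosen for locality, and independent tasks (here the two middle ones) run concurrently.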

The framework includes libraries, a runtime system, and development tools to help application developers tackle the difficult task of porting their applications to highly heterogeneous and diverse environments.

`DPLASMA` (Distributed Parallel Linear Algebra Software for Multicore
Architectures) is the leading implementation of a dense linear algebra
package for distributed heterogeneous systems. It is designed to
deliver sustained performance for distributed systems where each node
features multiple sockets of multicore processors and, if available,
accelerators like GPUs or Intel Xeon Phi. `DPLASMA` achieves this
objective through the state-of-the-art `PaRSEC` runtime, porting the
`PLASMA` algorithms to the distributed memory realm.

The `PaRSEC` runtime and the `DPLASMA` library are
available at
http://

PlaFRIM is an experimental platform for research in modeling, simulations and high performance computing. This platform has been set up since 2009 under the leadership of Inria Bordeaux Sud-Ouest in collaboration with computer science and mathematics laboratories, respectively LaBRI and IMB, with strong support from the Aquitaine region.

It aggregates different kinds of computational resources for research and development purposes. The latest technologies in terms of processors, memories and architectures are added when they become available on the market. More than 1,000 cores (excluding GPU and Xeon Phi) are now available for all research teams of Inria Bordeaux, LaBRI and IMB. This computer is in particular used by all the engineers who work in HiePACS and are advised by F. Rue from the SED.

The PlaFRIM platform initiative is coordinated by O. Coulaud.

In the context of the HPC-PME initiative, we started a collaboration with ALGO'TECH INFORMATIQUE and organized one of the first PhD-consultant actions, implemented by Xavier Lacoste and led by Pierre Ramet. ALGO'TECH is one of the most innovative SMEs (small and medium sized enterprises) in the field of cabling embedded systems and, more broadly, automatic devices. The main target of the project is to validate the possibility of using the sparse linear solvers of our team in the area of electromagnetic simulation tools developed by ALGO'TECH. This collaboration will be developed next year in the context of the European project FORTISSIMO. The principal objective of FORTISSIMO is to enable European manufacturing, particularly SMEs, to benefit from the efficiency and competitive advantage inherent in the use of simulation.

As a conclusion of the OPTIDIS project, we organized the first International Workshop on Dislocation Dynamics Simulations, devoted to the latest developments realized worldwide in the field of discrete dislocation dynamics simulations. This international event was held from December 10th to 12th at the “Maison de la Simulation” in Saclay, France, and attracted 55 participants from many different countries including England, Germany, France, the USA, ... The workshop gathered most of the active researchers working on dislocation dynamics, from numerical simulations to experimentation. Thanks to the success of this workshop, a second one will be scheduled in England during 2016.

Enabling HPC applications to perform efficiently when invoking multiple parallel libraries simultaneously is a great challenge. Even if a uniform runtime system is used underneath, scheduling tasks or threads coming from different libraries over the same set of hardware resources introduces many issues, such as resource oversubscription, undesirable cache flushes or memory bus contention.

This work presents an extension of `StarPU`, a runtime system
specifically designed for heterogeneous architectures, that allows
multiple parallel codes to run concurrently with minimal
interference. Such parallel codes run within *scheduling
contexts* that provide confined execution environments which can be
used to partition computing resources. Scheduling contexts can be
dynamically resized to optimize the allocation of computing resources
among concurrently running libraries. We introduce a *hypervisor*
that automatically expands or shrinks contexts using feedback from the
runtime system (e.g. resource utilization). We demonstrate the
relevance of our approach using benchmarks invoking multiple high
performance linear algebra kernels simultaneously on top of
heterogeneous multicore machines. We show that our mechanism can
dramatically improve the overall application run time (-34%), most
notably by reducing the average cache miss ratio (-50%).

This work is developed in the framework of Andra Hugo's PhD. These contributions have been published in the International Journal of High Performance Computing Applications .

`StarPU`.
We have shown that our method leads to a highly efficient, fully pipelined computation on large real-world industrial test cases provided by Airbus Group.

This research activity has been conducted in the framework of the EADS-ASTRIUM, Inria, Conseil Régional initiative, in collaboration with the RUNTIME Inria project, and is part of Benoit Lize's PhD.

The Reverse Time Migration (RTM) technique produces underground images
using wave propagation. A discretization based on the Discontinuous
Galerkin (DG) method unleashes a massively parallel elastodynamics
simulation, an interesting feature for current and future
architectures. We have designed a task-based version of this scheme in
order to enable the use of manycore architectures. At this stage, we
have demonstrated the efficiency of the approach on homogeneous,
cache-coherent Non-Uniform Memory Access (ccNUMA) multicore platforms
(up to 160 cores) and designed a prototype of a distributed
memory version that can exploit multiple instances of such
architectures. This work has been conducted in the context of the DIP Inria-Total strategic action, in collaboration with the Magique3D Inria
project and thanks to the long-term visit of George Bosilca funded by TOTAL.
George's expertise ensured an optimal usage of the `PaRSEC` runtime system onto
which our task-based scheme has been ported.

This work was presented at the HPCC conference as well as at a TOTAL scientific event.

For the solution of systems of linear equations, various recovery-restart strategies have been investigated in the framework of Krylov subspace methods to address situations of core failures. The basic underlying idea is to recover the lost entries of the iterate via interpolation from the values still available on the neighboring cores. In that resilience framework, we have extended the recovery-restart ideas to the solution of linear eigenvalue problems. Contrary to the linear system case, not only the current iterate can be interpolated, but also part of the subspace where candidate eigenpairs are searched.
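
The linear-interpolation flavor of this recovery can be sketched as follows; the function name and index handling are illustrative, but the principle matches the text: the lost entries are recomputed from the rows of the matrix associated with the failed core and the surviving entries.

```python
import numpy as np

# Hedged sketch of interpolation-based recovery: the lost entries x[f]
# of the current iterate are recovered by solving a small local system
# built from the surviving entries x[ok].

def recover_lost_entries(A, b, x, failed):
    ok = np.setdiff1d(np.arange(len(b)), failed)
    # Solve A[f,f] x_f = b_f - A[f,ok] x_ok for the lost block
    rhs = b[failed] - A[np.ix_(failed, ok)] @ x[ok]
    x = x.copy()
    x[failed] = np.linalg.solve(A[np.ix_(failed, failed)], rhs)
    return x

# If x happens to be the exact solution, the recovery reproduces it exactly.
rng = np.random.default_rng(0)
A = rng.random((6, 6)) + 6 * np.eye(6)    # diagonally dominant
x_exact = rng.random(6)
b = A @ x_exact
x_damaged = x_exact.copy()
x_damaged[[2, 3]] = 0.0                    # entries lost with the failed core
x_rec = recover_lost_entries(A, b, x_damaged, np.array([2, 3]))
assert np.allclose(x_rec, x_exact)
```

For an approximate iterate the recovered entries are only an interpolation, which is why the Krylov method is restarted afterwards.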

This work is developed in the framework of Mawussi Zounon's PhD funded by the ANR RESCUE. These contributions have been presented in particular at the international SIAM workshop on Exascale Applied Mathematics Challenges and Opportunities in Chicago and at the Householder Symposium in Spa. Notice that these activities are also part of our contribution to the G8 ESC project (Enabling Climate Simulation at extreme scale).

Accelerator-enhanced computing platforms have drawn a lot of attention
due to their massive peak computational capacity. Despite significant
advances in the programming interfaces to such hybrid architectures,
traditional programming paradigms struggle to map the resulting
multi-dimensional heterogeneity onto the expression of algorithm
parallelism, resulting in sub-optimal effective
performance. Task-based programming paradigms have the capability to
alleviate some of the programming challenges on distributed hybrid
many-core architectures. In this work we take this concept a step
further by showing that the potential of task-based programming
paradigms can be greatly increased with minimal modification of the
underlying runtime combined with the right algorithmic changes. We
propose two novel recursive algorithmic variants for one-sided
factorizations and describe the changes to the `PaRSEC` task-scheduling
runtime to build a framework where the task granularity is dynamically
adjusted to adapt the degree of available parallelism and kernel
efficiency according to runtime conditions. Based on an extensive set
of results we show that, with one-sided factorizations, i.e. Cholesky
and QR, a carefully written algorithm, supported by an adaptive
task-based runtime, is capable of reaching a degree of performance
and scalability never achieved before in distributed hybrid
environments.
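
The dynamic-granularity idea can be illustrated with a toy scheduler; this is a deliberately simplified sketch, not the actual `PaRSEC` interface: when the number of ready tasks falls below the number of idle workers, large tasks are split recursively into finer-grained children to restore parallelism.

```python
# Illustrative sketch: tasks carry a "size"; whenever parallelism is
# scarce, the largest splittable ready task is divided in two.

def schedule(tasks, workers, min_size=64):
    executed = []
    ready = list(tasks)
    while ready:
        if len(ready) < workers:
            big = max(ready, key=lambda t: t["size"])
            if big["size"] > min_size:
                ready.remove(big)
                half = big["size"] // 2
                ready += [{"size": half}, {"size": big["size"] - half}]
                continue  # re-check parallelism before executing anything
        executed.append(ready.pop())
    return executed

done = schedule([{"size": 512}], workers=4)
# the single coarse task was recursively split into finer ones,
# and no work was lost in the process
assert sum(t["size"] for t in done) == 512
assert all(t["size"] <= 128 for t in done)
```

The real runtime additionally accounts for kernel efficiency: on a GPU a large tile is kept intact, while on a CPU it is split to feed more cores.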

These contributions will be presented at the international conference IPDPS 2015 in Hyderabad.

The ongoing hardware evolution exhibits an escalation in the number,
as well as in the heterogeneity, of the computing resources. The
pressure to maintain reasonable levels of performance and portability,
forces the application developers to leave the traditional programming
paradigms and explore alternative solutions. `PaStiX` is a parallel
sparse direct solver, based on a dynamic scheduler for modern
hierarchical architectures. In this paper, we study the replacement of
the highly specialized internal scheduler in `PaStiX` by two generic
runtime frameworks: `PaRSEC` and `StarPU`. The task graph of the
factorization step is made available to the two runtimes, providing
them with the opportunity to optimize it in order to maximize the
algorithm efficiency for a predefined execution environment. A
comparative study of the performance of the `PaStiX` solver with the
three schedulers - native `PaStiX`, `StarPU` and `PaRSEC` schedulers - on
different execution contexts is performed. The analysis highlights the
similarities, from a performance point of view, between the different
schedulers. These results demonstrate that these generic
DAG-based runtimes provide a uniform and portable programming
interface across heterogeneous environments, and are, therefore, a
sustainable solution for hybrid environments.

This work has been developed in the framework of Xavier Lacoste's PhD funded by the ANR ANEMOS. These contributions have been presented at the Heterogeneous Computing Workshop held jointly with the international conference IPDPS 2014. Xavier Lacoste will defend his PhD in February 2015.

In the framework of the hybrid direct/iterative `MaPHyS` solver, we have designed and implemented
a hybrid MPI-thread variant. More precisely, the implementation relies on the multi-threaded MKL library for all the dense linear algebra calculations and on the multi-threaded version of `PaStiX`.
Among the technical difficulties, one was to make sure that the two multi-threaded libraries do not interfere with each other.
The resulting software prototype is currently being evaluated to study the flexibility and the trade-off it offers between
parallel and numerical efficiency.
Parallel experiments have been conducted on the PlaFRIM platform as well as on a large scale machine located at the US DOE NERSC facility, which has a large number of CPU cores per socket.

This work is developed in the framework of the PhD thesis of Stojce Nakov funded by TOTAL.

New hybrid LU-QR algorithms for solving dense linear systems of the
form Ax = b have been designed and implemented within the `PaRSEC` software
to allow for dynamic choices during execution. Finally,
we analyze both stability and performance results compared to state-of-the-art linear solvers
on parallel distributed multicore platforms.

These contributions have been presented at the international conference IPDPS 2014 in Phoenix. An extended version has been submitted to JPDC journal.

Computing eigenpairs of a symmetric matrix is a problem arising in many industrial applications, including quantum physics and finite-element computations for automobiles. A classical approach is to reduce the matrix to tridiagonal form before computing the eigenpairs of the tridiagonal matrix. Then, a back-transformation allows one to obtain the final solution. Parallelism issues of the reduction stage have already been tackled in different shared-memory libraries. In this work, we focus on solving the tridiagonal eigenproblem, and we describe a novel implementation of the Divide and Conquer algorithm. The algorithm is expressed as a sequential task-flow, scheduled in an out-of-order fashion by a dynamic runtime which allows the programmer to tune the task granularity. The resulting implementation is between two and five times faster than the equivalent routine from the Intel MKL library, and outperforms the best MRRR implementation for many matrices.

These contributions will be presented at the international conference IPDPS 2015 in Hyderabad.

Last year we worked primarily on developing an efficient fast multipole method for heterogeneous architectures. Some of the accomplishments for this year include:

implementation of new features in the FMM library ScalFMM: adaptive variants of the Chebyshev and Lagrange interpolation-based FMM kernels, multiple right-hand sides, a generic tensorial near field, etc.

The parallelization and the FMM core parts rely on ScalFMM (OpenMP/MPI) which has been updated all year round.
Finally, ScalFMM offers two new shared memory parallelization strategies using OpenMP 4 and `StarPU`.

New fast algorithms for the computation of low-rank approximations of matrices were implemented in a (soon to be open-source) C++ library. These algorithms are based on randomized techniques combined with standard matrix decompositions (such as QR, Cholesky and SVD). The main contribution of this work is that we use the ScalFMM parallel library to power the large number of matrix-vector products involved in the algorithms. Applications to the fast generation of Gaussian random fields were addressed. Our methods compare well with the existing ones based on Cholesky or FFT, and can potentially outperform them for specific distributions. We are currently writing a paper on this topic. Extensions to fast Kalman filtering are now being considered. This work is done in collaboration with Eric Darve (Stanford, Mechanical Engineering) in the context of the associate team FastLA.
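
The randomized building block can be sketched as follows; in the actual library the block products `A @ Omega` would be delegated to ScalFMM, while here plain numpy stands in, and the function name is an assumption for illustration.

```python
import numpy as np

# Minimal sketch of a randomized range finder, the core of randomized
# low-rank approximation: multiply A by a random block, orthonormalize.

def randomized_range(A_matvec, n, rank, oversample=10, seed=None):
    rng = np.random.default_rng(seed)
    Omega = rng.standard_normal((n, rank + oversample))
    Y = A_matvec(Omega)          # the costly step: a block of matvecs
    Q, _ = np.linalg.qr(Y)       # orthonormal basis of the sampled range
    return Q

# Build an exactly rank-5 matrix and check that its range is captured.
rng = np.random.default_rng(1)
U = rng.standard_normal((200, 5))
V = rng.standard_normal((5, 200))
A = U @ V
Q = randomized_range(lambda X: A @ X, 200, rank=5, seed=2)
A_approx = Q @ (Q.T @ A)         # project A onto the captured range
assert np.linalg.norm(A - A_approx) < 1e-8 * np.linalg.norm(A)
```

Because the method only touches `A` through block matrix-vector products, any fast summation scheme (here the FMM) can replace the dense multiply.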

The Time-domain Boundary Element Method (TD-BEM) has not been widely studied but represents an interesting alternative to its frequency-domain counterpart. Implementations are usually based on an inefficient sparse matrix-vector product (SpMV), so we investigate other approaches in order to increase the sequential flop rate. We present a novel approach based on the re-ordering of the interaction matrices in slices. We end up with a custom multi-vectors/vector product operation and compute it using SIMD intrinsic functions. We take advantage of the new order of the computation to parallelize in shared and distributed memory. We demonstrate the performance of our system by studying the sequential flop rate and the parallel scalability, and provide results based on an industrial test case with up to 32 nodes. Since mid-2014, we have been working on the time-domain FMM for the BEM problem. A non-optimized version is able to solve the TD-BEM with the FMM on parallel distributed nodes. All implementations must meet high quality standards in the software engineering sense, since the resulting library is going to be used by industrial applications.
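
The slice idea can be illustrated as follows; numpy stands in for the SIMD intrinsics, and the data layout is a simplified assumption: rows sharing the same sparsity shape are grouped so that the kernel becomes a dense multi-vector/vector product over contiguous data.

```python
import numpy as np

# Scattered SpMV over (row, col, value) triplets: irregular memory access.
def spmv(rows, cols, vals, x, n):
    y = np.zeros(n)
    for r, c, v in zip(rows, cols, vals):
        y[r] += v * x[c]
    return y

# Slice-based product: each slice is a small dense block applied to a
# contiguous gather of x, a SIMD-friendly dense kernel.
def sliced_product(slices, x, n):
    y = np.zeros(n)
    for rows, cols, block in slices:   # block has shape (len(rows), len(cols))
        y[rows] += block @ x[cols]
    return y

# Two formats describing the same interaction matrix give the same result.
x = np.arange(1.0, 5.0)
block = np.array([[1.0, 2.0], [3.0, 4.0]])
slices = [(np.array([0, 1]), np.array([2, 3]), block)]
rows, cols, vals = [0, 0, 1, 1], [2, 3, 2, 3], [1.0, 2.0, 3.0, 4.0]
assert np.allclose(spmv(rows, cols, vals, x, 4), sliced_product(slices, x, 4))
```

The performance gain comes from replacing indirect per-entry accesses by dense block operations that vectorize well.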

This work is developed in the framework of Bérenger Bramas's PhD and contributes to the EADS-ASTRIUM, Inria, Conseil Régional initiative.

In the field of scientific computing, load balancing is a major issue
that determines the performance of parallel applications. Nowadays,
simulations of real-life problems are becoming more and more complex,
involving numerous coupled codes, representing different models. In
this context, reaching high performance can be a great challenge. In
the PhD of Maria Predari (started in October 2013), we develop new
graph partitioning techniques, called co-partitioning, that address
the problem of load balancing for two coupled codes: the key idea is
to perform a "coupling-aware" partitioning, instead of partitioning
these codes independently, as it is usually done.
More precisely, we propose to enrich the classic graph model with *interedges*, that represent the coupled code interactions. We
describe two new algorithms, called AWARE and PROJREPART, and compare
them to the currently used approach (called NAIVE). In recent
experimental results, we notice that both the AWARE and PROJREPART
algorithms succeed in balancing the computational load in the coupling
phase and, in some cases, succeed in reducing the coupling
communication costs. Surprisingly, we notice that our algorithms do
not degrade the global graph edgecut, despite the additional
constraints that they impose.
In future work, we aim at validating our results on real-life cases in
the field of aeronautic propulsion. In order to achieve that, we plan
to integrate our algorithms within the `Scotch` framework. Finally, our
algorithms should be implemented in parallel and should be extended in
order to manage more complex applications with more than two
interacting models.

Nested Dissection was introduced by A. George and is a very popular
heuristic for sparse matrix ordering before numerical factorization. It allows one to maximize
the number of parallel tasks, while reducing the fill-in and the operation count.
The basic standard idea is to build a "small separator" that splits the graph into balanced parts. For hybrid solvers such as `HIPS` and `MaPHyS`,
obtaining a domain decomposition leading to a good balancing of both the size of the
domain interiors and the size of the interfaces is a key point for load balancing and efficiency in a parallel context.
This leads to the same issue: balancing the halo vertices to get balanced interfaces.
For this purpose, we revisit the algorithm introduced by Lipton, Rose and Tarjan, which performs
the recursion of nested dissection in a different manner: at each level, we apply the method recursively to the sub-graphs,
but, for each sub-graph, we keep track of the halo vertices. We have implemented this in the `Scotch` framework,
and have studied its main algorithm to build a separator, called greedy graph growing.
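
The halo bookkeeping can be sketched as follows; the helper below is hypothetical (not the actual `Scotch` interface) and simply computes, for each sub-graph of a partition, the set of outside vertices adjacent to it.

```python
# For each part of a partition, the "halo" is the set of vertices that
# lie outside the part but are adjacent to it; balancing these halos is
# what yields balanced interfaces.

def halos(adjacency, part):
    out = {}
    for v, neighbors in adjacency.items():
        for w in neighbors:
            if part[w] != part[v]:
                out.setdefault(part[v], set()).add(w)
    return out

# 4-vertex path 0-1-2-3 split into two halves: each half sees exactly
# one halo vertex from the other side.
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
part = {0: 0, 1: 0, 2: 1, 3: 1}
h = halos(adjacency, part)
assert h == {0: {2}, 1: {1}}
```

In the revisited nested dissection, this halo set is propagated down the recursion so that each level can balance it alongside the interior vertices.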

This work is developed in the framework of Astrid Casadei's PhD. These contributions have been presented at the international conference HIPC 2014 in Goa.

Various optimizations have been performed in the dislocation dynamics code OptiDis for the long-ranged isotropic elastic force and energy models, using a Fast Fourier based Fast Multipole Method (also known as the Uniform FMM). Furthermore, the anisotropic elastic force model was implemented using spherical harmonics expansions of angular functions known as Stroh matrices. Optimizations with respect to the crystallographic symmetries were also considered. Once the corresponding semi-analytic formulae for the force field are derived, this method should compare well with existing approaches based on expanding the anisotropic elastic Green's function.

This year we have focused on improving the hybrid MPI-OpenMP parallelism of the OptiDis code. More precisely, we have continued the development of the cache-conscious data structure to efficiently manage large sets of data (segments and nodes) during all the steps of the algorithm. Moreover, we have tuned and improved our hybrid MPI-OpenMP parallelism to run simulations with a large number of radiation-induced defects forming our dislocation network. To obtain good scalability, we have introduced better load balancing at the thread level as well as at the process level. By combining an efficient data structure and hybrid parallelism, we obtained a speedup of 112 on 160 cores for a simulation of half a million segments.

These contributions have been presented in minisymposia at the 11th World Congress on Computational Mechanics, at the 7th MMM International Conference on Multiscale Materials Modeling, and at the International Workshop on Dislocation Dynamics simulations.

This work is developed in the framework of the ANR OPTIDIS.

The last contribution of Xavier Lacoste's thesis deals with the
integration of our work in `JOREK`, a
production code for controlled plasma fusion simulation from CEA
Cadarache. We described a generic finite-element oriented
distributed matrix assembly and solver management API. The goal of
this API is to optimize and simplify the construction of a distributed
matrix which, given as an input to `PaStiX`, can improve the memory
scaling of the application. Experiments show that using this API we
could reduce the memory consumption by moving to a distributed matrix
input and improve the performance of the factorized matrix assembly by
reducing the volume of communication.
All this study is related to
`PaStiX` integration inside `JOREK` but the same API could be used to
produce a distributed assembly for another solver and/or another
finite elements based simulation code.

Concerning the `GYSELA` global non-linear electrostatic code, the
efforts during the period have concentrated on predicting memory
requirement and on the gyroaverage operator.

The Gysela program uses a mesh of the 5-dimensional phase space (3
dimensions in configuration space and 2 dimensions in velocity
space). On the large cases, the memory consumption already reaches the
limit of the available memory on the supercomputers used in production
(typically Tier-1 and Tier-0 machines). Furthermore, to implement the next
features of Gysela (e.g. adding kinetic electrons in addition to
ions), the memory needs will dramatically increase: the main
unknown will represent hundreds of TB. In this context, two tools
were created to analyze and decrease the memory consumption.
The first one is a tool that plots the memory consumption of the code
during a run. This tool helps the developer to localize where the
memory peak is located. The second tool is a prediction tool to
compute the peak memory in offline mode (for production use
mainly).
A post-processing stage, combined with specific traces
generated on purpose during runtime, allows the analysis of the memory
consumption. Low-level primitives are called to generate these traces
and to model memory consumption: they are included in the libMTM
library (Modeling and Tracing Memory). Thanks to this work on memory
consumption modeling, we have decreased the memory peak of the `GYSELA` code by up to 50% on a large case using 32,768 cores, and memory
scalability improvements have been shown using these tools on up to 65k cores.
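
The tracing primitives can be sketched as follows; the class and method names are illustrative assumptions, not the actual libMTM interface, but they show how logging each allocation event lets a post-processing step locate the memory peak offline.

```python
# Minimal sketch of allocation tracing: each alloc/free event is logged
# with the running total, so the peak and its location can be recovered
# from the trace without re-running the simulation.

class MemoryTracer:
    def __init__(self):
        self.current = 0
        self.peak = 0
        self.trace = []

    def alloc(self, label, nbytes):
        self.current += nbytes
        self.peak = max(self.peak, self.current)
        self.trace.append(("alloc", label, nbytes, self.current))

    def free(self, label, nbytes):
        self.current -= nbytes
        self.trace.append(("free", label, nbytes, self.current))

tracer = MemoryTracer()
tracer.alloc("distribution_function", 800)
tracer.alloc("fields_buffer", 300)     # peak reached here: both arrays live
tracer.free("fields_buffer", 300)
tracer.alloc("diagnostics", 100)
assert tracer.peak == 1100
```

A prediction tool can replay such a trace with scaled array sizes to estimate the peak of a larger run before submitting it.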

The main unknown of Gysela is a distribution function that represents
either the density of the guiding centers or the density of the
particles in a tokamak (depending on the location in the code). The
switch between these two representations is done thanks to the
gyroaverage operator. In the current version of Gysela, the computation
of this operator is achieved through the so-called Padé
approximation. In order to improve the precision of the gyroaveraging,
a new implementation based on interpolation methods has been developed
(mainly by researchers from the Inria Tonus project-team and IPP
Garching). We have integrated this new
implementation in `GYSELA` and performed some parallel benchmarks. However,
the new gyroaverage operator is approximately 10 times slower than
the original one. Investigations and optimizations of this operator
are still a work in progress.

This work is carried on in the framework of Fabien Rozar's PhD in collaboration with CEA Cadarache.

High-fidelity nuclear power plant core simulations require solving the
Boltzmann transport equation. In discrete ordinate methods, the most
computationally demanding operation of this equation is the sweep
operation. Considering the evolution of computer architectures, we
propose in this work, as a first step toward heterogeneous
distributed architectures, a hybrid parallel implementation of the
sweep operation on top of the generic task-based runtime system:
`PaRSEC`. Such an implementation targets three nested levels of
parallelism: message passing, multi-threading, and vectorization. A
theoretical performance model was designed to validate the approach
and help the tuning of the multiple parameters involved in such an
approach. The proposed parallel implementation of the sweep achieves a
sustained performance of 6.1 Tflop/s, corresponding to 33.9% of the
peak performance of the targeted supercomputer. This implementation
compares favorably with state-of-the-art solvers such as PARTISN; and it
can therefore serve as a building block for a massively parallel
version of the neutron transport solver `DOMINO` developed at EDF.
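
The dependency pattern of the sweep can be illustrated with a toy 2D wavefront ordering; this is a deliberately simplified sketch, far from the actual `DOMINO`/`PaRSEC` implementation, but it shows why cells on the same anti-diagonal can be processed in parallel.

```python
# Toy 2D sweep: cell (i, j) depends on its upstream neighbors (i-1, j)
# and (i, j-1), so the cells of each anti-diagonal form one parallel wave.

def sweep_order(nx, ny):
    waves = []
    for d in range(nx + ny - 1):                     # anti-diagonal index
        waves.append([(i, d - i) for i in range(nx) if 0 <= d - i < ny])
    return waves

waves = sweep_order(3, 3)

# Check the ordering respects the dependencies: every cell appears only
# after both of its upstream neighbors.
seen = set()
for wave in waves:
    for (i, j) in wave:
        assert i == 0 or (i - 1, j) in seen
        assert j == 0 or (i, j - 1) in seen
    seen.update(wave)
assert len(seen) == 9
```

In the task-based version, each wave corresponds to a set of independent tasks whose dependencies are handed to the runtime, which overlaps them with communication and vectorized kernels.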

Preliminary results have been presented at the international HPCC workshop on HPC-CFD in Energy/Transport Domains in Paris. The main contribution will be presented at the international conference IPDPS 2015 in Hyderabad.

In the first part of our research work concerning the parallel
aerodynamic code `FLUSEPA`, a first OpenMP-MPI version based on the
previous one has been developed. By using a hybrid approach
based on a domain decomposition, we achieved a
faster version of the code, and the temporal adaptive method used
without bodies in relative motion has been tested successfully on
real complex 3D cases using up to 400 cores.
Moreover, an asynchronous strategy for computing bodies in relative
motion and mesh intersections has been developed and has been used for
actual 3D cases. A journal article (for JCP) summing up this part of the work is
being written, and a presentation at ISC at the "2nd International Workshop
on High Performance Computing Simulation in Energy/Transport Domains"
in July 2015 is scheduled.

This intermediate version exhibited synchronization problems for the
aerodynamic solver due to the time integration used by the code.
To tackle this issue, a task-based version over the runtime system
`StarPU` is currently under development and evaluation. This year was
mainly devoted to the realization of this version.
Task generation functions have been designed in order to maximize
asynchronism in execution. Those functions respect the data
access pattern of the code and led to the refactoring of the actual
kernels. A task-based version of the aerodynamic
solver is now available for both shared and distributed memory.
This work will be presented as a poster during the SIAM CSE'15 conference,
and we are in the process of submitting a paper to the Parallel CFD'15 conference.

The next steps will be to validate the correctness of this task-based version and to work on its performance on actual cases. Later, the task description should be extended to the motion and intersection operations.

This work is carried on in the framework of Jean-Marie Couteyen's PhD in collaboration with Airbus Defence and Space Les Mureaux.

Airbus Defence and Space research and development contract:

Design of a parallel version of the FLUSEPA software (Jean-Marie Couteyen (PhD); Pierre Brenner, Jean Roman).

CEA DPTA research and development contract:


CEA-CESTA research and development contract:

Performance analysis of the recent improvements in the PaStiX sparse direct solver for matrices coming from different applications developed at CEA-CESTA.

CEA Cadarache (ITER) research and development contract:

Peta and exaflop algorithms for turbulence simulations of fusion plasmas (Fabien Rozar (PhD); Guillaume Latu, Jean Roman).

EDF R & D - SINETICS research and development contract:

Design of a massively parallel version of the SN method for neutronic simulations (Moustapha Salli (PhD); Mathieu Faverge, Pierre Ramet, Jean Roman).

TOTAL research and development contracts:

Parallel hybrid solver for massively heterogeneous manycore platforms (Stojce Nakov (PhD); Emmanuel Agullo, Luc Giraud, Abdou Guermouche, Jean Roman).

Airbus Group Innovations research and development contract:

Design and implementation of FMM and block Krylov solver for BEM applications. The HiBox project is led by the SME IMACS and funded by the DGA Rapid programme.


**Grant:** Regional council

**Dates:** 2013 – 2015

**Partners:**
EPIs REALOPT, RUNTIME from Inria Bordeaux Sud-Ouest, CEA-CESTA and the Institut pluridisciplinaire de recherche sur l'environnement et les matériaux (IPREM).

**Overview:** Numerical simulation is now integrated into all the design levels and the scientific studies for both academic and industrial contexts. Given the increasing size and sophistication of the simulations carried out, the use of parallel computing is inescapable. The complexity of such achievements requires collaboration of multidisciplinary teams capable of mastering all the necessary scientific skills for each component constituting the chain of expertise.
In this project we consider each of these elements as well as efficient methods for parallel codes coupling.
All these works are intended to contribute to the design of large scale parallel multi-physics simulations.
In addition to these research activities, the regional council also supports some innovative computing equipment that will be embedded in the PlaFRIM experimental platform, a project led by O. Coulaud.

Since January 2013, the team has been participating in the C2S@Exa Inria Project Lab (IPL). This national initiative aims at the development of numerical modeling methodologies that fully exploit the processing capabilities of modern massively parallel architectures in the context of a number of selected applications related to important scientific and technological challenges for the quality and the security of life in our society. At the current state of the art in technologies and methodologies, a multidisciplinary approach is required to overcome the challenges raised by the development of highly scalable numerical simulation software that can exploit computing platforms offering several hundreds of thousands of cores. Hence, the main objective of C2S@Exa is the establishment of a continuum of expertise in the computer science and numerical mathematics domains, by gathering researchers from Inria project-teams whose research and development activities are tightly linked to high performance computing issues in these domains. More precisely, this collaborative effort involves computer scientists who are experts in programming models, environments and tools for harnessing massively parallel systems, algorithmists who propose algorithms and contribute to generic libraries and core solvers in order to benefit from all the parallelism levels with the main goal of optimal scaling on very large numbers of computing entities, and numerical mathematicians who study numerical schemes and scalable solvers for systems of partial differential equations in view of the simulation of very large-scale problems.


**Grant:** ANR-MONU

**Dates:** 2013 – 2017

**Partners:**
Inria (REALOPT, RUNTIME Bordeaux Sud-Ouest and ROMA Rhône-Alpes), IRIT/INPT, CEA-CESTA and Airbus Group Innovations.

**Overview:**

During the last five years, the interest of the scientific computing community towards accelerating devices has been rapidly growing. The reason for this interest lies in the massive computational power delivered by these devices. Several software libraries for dense linear algebra have been produced; the related algorithms are extremely rich in computation and exhibit a very regular pattern of access to data which makes them extremely good candidates for GPU execution. On the contrary, methods for the direct solution of sparse linear systems have irregular, indirect memory access patterns that adversely interact with typical GPU throughput optimizations.

This project aims at studying and designing algorithms and parallel programming models for implementing direct methods for the solution of sparse linear systems on emerging computers equipped with accelerators. The ultimate aim of this project is the implementation of a software package providing a solver based on direct methods for sparse linear systems of equations. To date, the approaches proposed to achieve this objective are mostly based on a simple offloading of some computational tasks to the accelerators and rely on fine hand-tuning of the code and accurate performance modeling to achieve efficiency. This project proposes an innovative approach which relies on the efficiency and portability of runtime systems. The development of a production-quality sparse direct solver requires a considerable research effort along three distinct axes:

linear algebra: algorithms have to be adapted or redesigned in order to exhibit properties that make their implementation and execution on heterogeneous computing platforms efficient and reliable. This may require the development of novel methods for defining data access patterns that are more suitable for the dynamic scheduling of computational tasks on processing units with considerably different capabilities as well as techniques for guaranteeing a reliable and robust behavior and accurate solutions. In addition, it will be necessary to develop novel and efficient accelerator implementations of the specific dense linear algebra kernels that are used within sparse, direct solvers;

runtime systems: tools such as the `StarPU` runtime system proved
to be extremely efficient and robust for the implementation of dense
linear algebra algorithms. Sparse linear algebra algorithms, however,
are commonly characterized by complicated data access patterns,
computational tasks with extremely variable granularity and complex
dependencies. Therefore, a substantial research effort is necessary
to design and implement features as well as interfaces to comply
with the needs formalized by the research activity on direct
methods;

scheduling: executing a heterogeneous workload with complex dependencies on a heterogeneous architecture is a very challenging problem that demands the development of effective scheduling algorithms. These will be confronted with possibly limited views of dependencies among tasks and multiple, and potentially conflicting objectives, such as minimizing the makespan, maximizing the locality of data or, where it applies, minimizing the memory consumption.

Given the wide availability of computing platforms equipped with accelerators and the numerical robustness of direct solution methods for sparse linear systems, it is reasonable to expect that the outcome of this project will have a considerable impact on both academic and industrial scientific computing. This project will moreover provide a substantial contribution to the computational science and high-performance computing communities, as it will deliver an unprecedented example of a complex numerical code whose parallelization completely relies on runtime scheduling systems and which is, therefore, extremely portable, maintainable and evolvable towards future computing architectures.


**Grant:** ANR 11 INFRA 13

**Dates:** 2011 – 2015

**Partners:**
Inria (Bordeaux Sud-Ouest, Nancy - Grand Est, Rhone-Alpes, Sophia Antipolis - Méditerranée), I3S, LSIIT

**Overview:**

The last decade has brought tremendous changes to the characteristics of large scale distributed computing platforms. Large grids processing terabytes of information a day and the peer-to-peer technology have become common even though understanding how to efficiently exploit such platforms still raises many challenges. As demonstrated by the USS SimGrid project funded by the ANR in 2008, simulation has proved to be a very effective approach for studying such platforms. Although even more challenging, we think the issues raised by petaflop/exaflop computers and emerging cloud infrastructures can be addressed using similar simulation methodology.

The goal of the SONGS project is to extend the applicability of the SimGrid simulation framework from Grids and Peer-to-Peer systems to Clouds and High Performance Computing systems. Each type of large-scale computing system will be addressed through a set of use cases and led by researchers recognized as experts in this area.

Any sound study of such systems through simulations relies on the following pillars of simulation methodology: Efficient simulation kernel; Sound and validated models; Simulation analysis tools; Campaign simulation management.


**Grant:** ANR-MN

**Dates:** 2012 – 2016

**Partners:**
Univ. Nice, CEA/IRFM, CNRS/MDS.

**Overview:**
The main goal of the project is to make significant progress in the understanding of
active control methods for plasma edge MHD
instabilities, namely Edge Localized Modes (ELMs), which represent a particular danger with respect to
heat and particle loads for the Plasma Facing Components (PFC) in
ITER. The project focuses in
particular on the numerical modelling of such ELM control methods as Resonant
Magnetic Perturbations (RMPs) and pellet ELM pacing, both foreseen in ITER. The goals of
the project are to improve the understanding of the related physics and to propose possible new
strategies to improve the effectiveness of ELM control techniques. The tool for the non-linear
MHD modeling is the `JOREK` code, which was essentially developed within the previous ANR
ASTER. `JOREK` will be largely extended within the present project to
include the corresponding new physical models, in conjunction with new developments in
mathematics and computer science. The present project will put the non-linear
MHD modeling of ELMs and ELM control on solid ground theoretically,
computationally, and application-wise, in order to progress towards urgently needed solutions for
ITER.

Regarding our contributions,
the `JOREK` code mainly consists of numerical computations on 3D data. The
toroidal dimension of the tokamak is treated in Fourier space, while the poloidal plane is
decomposed into Bezier patches. The numerical scheme involves, as the main computation
of each time step, a direct solver applied to a large sparse matrix. Two main costs are
clearly identified: the assembly of the sparse matrix, and the direct factorization and solve of
the system, which includes communications between all processors. The efficient parallelization
of `JOREK` is one of our main goals; to achieve it we will reconsider the data distribution,
the computation distribution and the GMRES implementation. The quality of the sparse solver is also
crucial, both in terms of performance and of accuracy. In the current release of `JOREK`, the
memory scaling is not satisfactory for the problems listed above: at present, as one
increases the number of processes for a given problem size, the memory footprint on each
process does not decrease as much as one would expect. In order to access finer meshes on
available supercomputers, memory savings have to be made throughout the code. Another key
point for improving the parallelization is to carefully profile the application to understand which
regions of the code do not scale well. Depending on the timings obtained, strategies to
reduce communication overheads will be evaluated and schemes that improve load
balancing will be initiated. `JOREK` uses the `PaStiX` sparse direct solver library
to solve its linear systems.
However, the large number of toroidal harmonics and the particularly thin structures to be resolved for
realistic plasma parameters and the ITER machine size still require more aggressive numerical
optimisation, addressing numerical stability, adaptive meshes, etc. Moreover, many of the
applications of the `JOREK` code proposed here, which address urgent ITER-relevant issues
related to ELM control by RMPs and pellets, remain to be carried out.
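The two dominant costs of a time step (sparse matrix assembly, then direct factorization and solve) can be sketched as follows. This is a minimal illustrative sketch, not `JOREK` code: the `assemble_matrix` routine below is a hypothetical stand-in (a simple finite-difference operator) for the actual Bezier/Fourier assembly.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def assemble_matrix(n):
    # Hypothetical stand-in for the Bezier/Fourier assembly:
    # a shifted 1D finite-difference Laplacian (sparse, SPD).
    main = 3.0 * np.ones(n)
    off = -1.0 * np.ones(n - 1)
    return sp.diags([off, main, off], [-1, 0, 1], format="csc")

def time_step(u):
    A = assemble_matrix(u.size)  # cost 1: sparse matrix assembly
    lu = spla.splu(A)            # cost 2: direct factorization...
    return lu.solve(u)           # ...and triangular solves

u = np.ones(100)
for _ in range(3):
    u = time_step(u)
```

In the parallel setting, both the per-tile assembly work and the factorization/solve are distributed, which is where the data-distribution and memory-scaling questions discussed above arise.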


**Grant:** ANR-COSINUS

**Dates:** 2010 – 2014

**Partners:**
CEA/DEN/DMN/SRMA (leader), SIMaP Grenoble INP and ICMPE / Paris-Est.

**Overview:**
Plastic deformation in crystalline materials is mainly accommodated
by dislocation glide. The behavior of a single
dislocation segment has been well understood since the 1960s, and analytical
formulations are available in the literature.
However, understanding the behavior of a large population of
dislocations (with the complex dislocation interactions it induces) and its
effect on plastic deformation requires massive numerical computation.
Since the 1990s, simulation codes have been developed by French researchers.
Among these codes, TRIDIS, developed by the SIMAP laboratory
in Grenoble, is the pioneering dislocation dynamics code.
In 2007, the NUMODIS project was set up as a
collaboration between SIMAP and SRMA CEA Saclay in order to
develop a new dislocation dynamics code using modern computer
architectures and advanced numerical methods.
The objective was to overcome the numerical and physical limits of the previous code, TRIDIS.
Version 1.0 of NUMODIS came out in December 2009, confirming the feasibility of the project.
The OPTIDIS project was initiated once NUMODIS was mature enough to consider parallel computation.
The objective of the project is to develop and validate the algorithms
needed to optimize the numerical efficiency and performance of the
NUMODIS code.
We aim at developing a code able to tackle realistic material problems, such as the interaction between dislocations
and irradiation defects in the plastic deformation of a grain after
irradiation.
Such studies, in which “local mechanisms" are correlated with
macroscopic behavior, are a key issue for the nuclear industry in order
to understand material aging under irradiation, and hence predict the safe service life of power plants.
To carry out such studies, massive numerical optimizations of NUMODIS are required.
They involve complex algorithms relying on advanced computational science methods.
The OPTIDIS project will proceed through joint collaborative studies
involving researchers specialized in dislocation dynamics and in
numerical methods.
The project is divided into 8 tasks over 4 years.
Two PhD theses will be directly funded by the project.
One will be dedicated to numerical development, validation of complex
algorithms and comparison with the performance of existing dislocation
dynamics codes.
The objective of the second is to carry out large scale simulations to
validate the performance of the numerical developments made in
OPTIDIS.
In both cases, these simulations will be compared with experimental
data obtained by experimentalists.


**Grant:** ANR-Blanc (computer science theme)

**Dates:** 2010 – 2015

**Partners:**
Inria EPI ROMA (leader) and GRAND LARGE.

**Overview:**
The advent of exascale machines will help solve new scientific
challenges only if the resilience of large scientific applications
deployed on these machines can be guaranteed.
With 10,000,000 processor cores or more, the time interval between
two consecutive failures is anticipated to be smaller than the typical duration
of a checkpoint, i.e., the time needed to save all necessary
application and system data. No effective progress can then be expected for a large-scale parallel
application, and current fault-tolerant techniques and tools can no longer be used.
The main objective of the RESCUE project is to develop new algorithmic techniques and software
tools to solve the exascale resilience problem. Solving this
problem implies a departure from current approaches,
and calls for yet-to-be-discovered algorithms, protocols and software tools.
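The tension between failure frequency and checkpoint duration described above is commonly quantified with the first-order Young/Daly formula for the optimal checkpoint period. The sketch below is illustrative only; the numbers are made up, not measurements from any platform.

```python
import math

def optimal_checkpoint_period(C, mtbf):
    """Young/Daly first-order optimal checkpoint period
    W_opt = sqrt(2 * C * MTBF), with C the checkpoint cost."""
    return math.sqrt(2.0 * C * mtbf)

def expected_waste(W, C, mtbf):
    # First-order fraction of time wasted: checkpoint overhead
    # plus expected re-execution after a failure.
    return C / W + (W + C) / (2.0 * mtbf)

C, mtbf = 60.0, 86400.0  # hypothetical: 1-min checkpoint, 1-day MTBF
W = optimal_checkpoint_period(C, mtbf)
```

When the MTBF of the full machine drops below the checkpoint cost `C`, the waste approaches 100% for any period `W`, which is precisely the exascale regime motivating the project.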

This proposed research follows three main research thrusts. The first thrust deals with novel checkpoint protocols. This thrust will include the classification of relevant fault categories and the development of a software package for fault injection into application execution at runtime. The main research activity will be the design and development of scalable and light-weight checkpoint and migration protocols, with on-the-fly storing of key data, distributed but coordinated decisions, etc. These protocols will be validated via a prototype implementation integrated with the public-domain MPICH project. The second thrust entails the development of novel execution models, i.e., accurate stochastic models to predict (and, in turn, optimize) the expected performance (execution time or throughput) of large-scale parallel scientific applications. In the third thrust, we will develop novel parallel algorithms for scientific numerical kernels. We will profile a representative set of key large-scale applications to assess their resilience characteristics (e.g., identify specific patterns to reduce checkpoint overhead). We will also analyze execution trade-offs based on the replication of crucial kernels and on decentralized ABFT (Algorithm-Based Fault Tolerant) techniques. Finally, we will develop new numerical methods and robust algorithms that still converge in the presence of multiple failures. These algorithms will be implemented as part of a software prototype, which will be evaluated when confronted with realistic faults generated via our fault injection techniques.
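One ingredient of the ABFT techniques mentioned in the third thrust can be illustrated with a checksum row. This is a toy sketch of the idea, not a production ABFT scheme; the function names are ours.

```python
import numpy as np

def add_checksum_row(A):
    # Append a column-wise checksum row; the rows become redundant,
    # so any single lost row can be rebuilt (the core ABFT idea).
    return np.vstack([A, A.sum(axis=0)])

def recover_row(Ac, lost):
    # Rebuild row `lost` as checksum row minus the surviving rows.
    surviving = [i for i in range(Ac.shape[0] - 1) if i != lost]
    return Ac[-1] - Ac[surviving].sum(axis=0)

A = np.arange(12.0).reshape(4, 3)
Ac = add_checksum_row(A)
rebuilt = recover_row(Ac, lost=2)
```

Because standard matrix kernels preserve such checksum relations, a failed process's data block can be reconstructed without rolling the whole application back to a checkpoint.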

We firmly believe that only the combination of these three thrusts (new checkpoint protocols, new execution models, and new parallel algorithms) can solve the exascale resilience problem. We hope to contribute to the solution of this critical problem by providing the community with new protocols, models and algorithms, as well as with a set of freely available public-domain software prototypes.


**Grant:** ANR-Blanc (applied math theme)

**Dates:** 2010 – 2014

**Partners:**
Institut de Mathématiques de Toulouse (leader);
Laboratoire d'Analyse, Topologie, Probabilités in Marseilles;
Institut de Recherche sur la Fusion Magnétique, CEA/IRFM
and HiePACS.

**Overview:**
This project concerns the study and development of a new class of
numerical methods to simulate natural or laboratory plasmas, and in
particular magnetic fusion processes. In this context, we aim at
contributing, from the mathematical, physical and algorithmic
points of view, to the ITER project.

The core of this project consists in the development, analysis, implementation and testing on real physical problems of the so-called Asymptotic-Preserving (AP) methods, which allow simulations over a large range of scales with the same model and numerical method. These methods represent a breakthrough with respect to the state of the art. They will be developed specifically to handle the various challenges related to the simulation of the ITER plasma. In parallel with this class of methodologies, we intend to design appropriate coupling techniques between macroscopic and microscopic models for all the cases in which a clear distinction between different regimes can be made. This will make it possible to describe different regimes in different regions of the machine with a strong gain in computational efficiency, without losing accuracy in the description of the problem. We will develop full 3-D solvers for the asymptotic-preserving fluid as well as kinetic models. The AP numerical strategy allows us to perform numerical simulations with very large time and mesh steps and leads to impressive computational savings. These advantages will be combined with the use of last-generation fast preconditioned linear solvers to produce software with very high performance for plasma simulation. For HiePACS this project provides in particular a testbed for our expertise in the parallel solution of large linear systems.
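As a toy illustration of the asymptotic-preserving idea (a minimal sketch of our own, not a method from the project), consider the stiff relaxation equation du/dt = -(u - u_eq)/eps: treating the stiff term implicitly keeps the step stable even when the time step dt is vastly larger than eps, and the discrete solution relaxes to the correct limit u_eq as eps tends to 0.

```python
def ap_step(u, u_eq, dt, eps):
    # Backward-Euler in the stiff relaxation term:
    # (u_new - u) / dt = -(u_new - u_eq) / eps
    return (u + (dt / eps) * u_eq) / (1.0 + dt / eps)

u, u_eq = 1.0, 0.2
for _ in range(10):
    u = ap_step(u, u_eq, dt=1.0, eps=1e-8)  # dt is 10^8 times eps

# An explicit step, u - (dt / eps) * (u - u_eq), would blow up here;
# the implicit step instead drives u towards the equilibrium u_eq.
```

This uniform-in-eps stability is what allows AP schemes to use time and mesh steps set by the physics of interest rather than by the smallest scale in the model.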


**Grant:** ANR-14‐CE23‐0005

**Dates:** 2014 – 2018

**Partners:**
Inria EPI Pomdapi (leader);
Université Paris 13 - Laboratoire Analyse, Géométrie et Applications;
Maison de la Simulation;
Andra.

**Overview:**
Project DEDALES aims at developing high performance software for the
simulation of two phase flow in porous media. The project will
specifically target parallel computers where each node is itself
composed of a large number of processing cores, such as are found in
new generation many-core architectures.
The project will be driven by an application to deep
geological disposal of radioactive waste. Its main feature is phenomenological complexity:
water-gas flow in a highly heterogeneous medium, with widely varying
space and time scales. The assessment of large-scale models is a major
issue for this application, and realistic geological
models have several million grid cells. Few, if any, software codes
provide the necessary physical features together with massively parallel
simulation capabilities. The aim of the DEDALES project is to study,
and experiment with, new approaches to develop effective simulation
tools with the capability to take advantage of modern computer
architectures and their hierarchical structure.
To achieve this goal, we will explore two complementary software
approaches that both match the hierarchical hardware architecture: on
the one hand, we will integrate a hybrid parallel linear solver into
an existing flow and transport code, and on the other hand, we will
explore a two level approach with the outer level using (space time)
domain decomposition, parallelized with a distributed memory approach,
and the inner level as a subdomain solver that will exploit thread
level parallelism.
Linear solvers have always been, and will continue to be, at the
center of simulation codes. However, parallelizing implicit methods on
unstructured meshes, such as are required to accurately represent the
fine geological details of the heterogeneous media considered, is
notoriously difficult. It has also been suggested that time level
parallelism could be a useful avenue to provide an extra degree of
parallelism, so as to exploit the very large number of computing
elements that will be part of these next generation computers. Project
DEDALES will show that space-time DD methods can provide this extra
level, and can usefully be combined with parallel linear solvers at
the subdomain level.
For all tasks, realistic test cases will be used to show the validity and
the parallel scalability of the chosen approach. The
most demanding models will be at the frontier of what is currently
feasible in terms of model size.
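The two-level design (outer domain decomposition, inner subdomain solves exploiting thread-level parallelism) can be caricatured with a block-Jacobi iteration. Everything below (the matrix, the block layout) is an illustrative assumption, with Python threads standing in for the node-level parallelism.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def block_jacobi(A, b, blocks, iters=50):
    # Outer level: one "subdomain" per block (block-Jacobi iteration);
    # inner level: each subdomain solve runs on its own thread,
    # mimicking the distributed-outer / threaded-inner split (sketch).
    x = np.zeros_like(b)
    with ThreadPoolExecutor(max_workers=len(blocks)) as pool:
        for _ in range(iters):
            r = b - A @ x
            updates = list(pool.map(
                lambda s: np.linalg.solve(A[s, s], r[s]), blocks))
            for s, dx in zip(blocks, updates):
                x[s] += dx
    return x

n = 8
A = 4.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)  # diag. dominant
b = np.ones(n)
x = block_jacobi(A, b, [slice(0, 4), slice(4, 8)])
```

In the project, the outer iteration would be a (space-time) domain decomposition over MPI processes, and each `np.linalg.solve` would be replaced by a multithreaded subdomain solver.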

Type: FP7

Challenge: Special action

Instrument: Specific Targeted Research Project

Objective: Exascale computing platforms, software and applications

Duration: September 2013 - August 2016

Coordinator: IMEC, Belgium

Partners: The particular specializations and experience of the partners are:

Applications:

NAG - long experience in consultancy for HPC applications

Intel France - collaboration with industry on the migration of software for future HPC systems

TS-SFR - long experience in consultancy for HPC applications in Aerospace and Oil & Gas

Algorithms – primarily numerical:

UA - broad experience in numerical solvers, with some taken up by the PETSc numerical library and other work published in high-ranking journals such as Science.

USI - expertise in parallel many-core algorithms for real-world applications on emerging architectures

Inria - expertise on large scale parallel numerical algorithms

IT4I - experience in the development of scalable solvers for large HPC systems (e.g. PRACE)

Programming Models & Runtime Environments:

Imec - leads the programming model research within the Flanders ExaScience Lab

UVSQ - specialized in code optimization and performance evaluation in the area of HPC

TS-SFR - leading the BMBF funded GASPI project

Fraunhofer - developed a GASPI runtime environment used in industrial applications

Hardware Optimization:

Intel France - investigates workloads for new hardware architectures within the context of the Exascale Computing Research centre

Inria contact: Luc Giraud

Abstract: The EXA2CT project brings together experts at the cutting edge of the development of solvers, related algorithmic techniques, and HPC software architects for programming models and communication. We will produce modular open source proto-applications that demonstrate the algorithms and programming techniques developed in the project, to help boot-strap the creation of genuine exascale codes.

Numerical simulation is a crucial part of science and industry in Europe. The advancement of simulation as a discipline relies on increasingly compute intensive models that require more computational resources to run. This is the driver for the evolution to exascale. Due to limits in the increase in single processor performance, exascale machines will rely on massive parallelism on and off chip, with a complex hierarchy of resources. The large number of components and the machine complexity introduce severe problems for reliability and programmability.

We are involved in the Inria@SiliconValley initiative through the associate team FASTLA described below.

Title: Matrices Over Runtime Systems @ Exascale

International Partner (Institution - Laboratory - Researcher):

KAUST Supercomputing Laboratory (Saudi Arabia)

Duration: 2014 - 2016

See also: http://

The goal of the Matrices Over Runtime Systems at Exascale (MORSE) project is to design dense and sparse linear algebra methods that achieve the fastest possible time to an accurate solution on large-scale multicore systems with GPU accelerators, using all the processing power that future high end systems can make available. To develop software that will perform well on petascale and exascale systems with thousands of nodes and millions of cores, several daunting challenges have to be overcome, both by the numerical linear algebra and the runtime system communities. By designing a research framework for describing linear algebra algorithms at a high level of abstraction, the MORSE team will enable the strong collaboration between research groups in linear algebra, runtime systems and scheduling needed to develop methods and libraries that fully benefit from the potential of future large-scale machines. Our project will take a pioneering step in the effort to bridge the immense software gap that has opened up in front of the High-Performance Computing (HPC) community.
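To make the task view concrete, here is a sequential sketch (our own, not MORSE code) of a tiled Cholesky factorization: each tile operation below (POTRF, TRSM, SYRK/GEMM) is exactly the kind of task a runtime system would schedule concurrently across CPU cores and GPUs.

```python
import numpy as np

def tiled_cholesky(A, nb):
    # In-place tiled Cholesky (lower factor). Each tile operation is
    # one "task"; here tasks run sequentially to expose the task graph.
    T = A.shape[0] // nb

    def t(i):
        return slice(i * nb, (i + 1) * nb)

    for k in range(T):
        A[t(k), t(k)] = np.linalg.cholesky(A[t(k), t(k)])      # POTRF
        for i in range(k + 1, T):                               # TRSM
            A[t(i), t(k)] = np.linalg.solve(A[t(k), t(k)],
                                            A[t(i), t(k)].T).T
        for i in range(k + 1, T):                               # SYRK /
            for j in range(k + 1, i + 1):                       # GEMM
                A[t(i), t(j)] -= A[t(i), t(k)] @ A[t(j), t(k)].T
    return np.tril(A)

M = np.random.default_rng(0).standard_normal((8, 8))
S = M @ M.T + 8.0 * np.eye(8)          # symmetric positive definite
L = tiled_cholesky(S.copy(), nb=4)
```

A runtime system tracks the data dependencies between these tile tasks (e.g. every TRSM on panel k depends on that panel's POTRF) and schedules independent tasks in parallel, which is the abstraction MORSE builds on.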

Title: Fast and Scalable Hierarchical Algorithms for Computational Linear Algebra

International Partner (Institution - Laboratory - Researcher):

Stanford University (USA)

Lawrence Berkeley National Laboratory (USA)

Duration: 2014 - 2016

See also: http://

In this project, we propose to study fast and scalable
hierarchical numerical kernels and their implementations on
heterogeneous manycore platforms, for two major computational
kernels that appear in many challenging, compute-intensive
simulations in computational science: fast
multipole methods (FMM) and sparse hybrid linear solvers.
Regarding the FMM, we plan to study novel generic
formulations based on

We are involved in the Inria-CNPq HOSCAR project led by Stéphane Lanteri.

The general objective of the project is to set up a multidisciplinary Brazil-France collaborative effort for taking full benefit of future high-performance massively parallel architectures. The targets are the very large-scale datasets and numerical simulations relevant to a selected set of applications in natural sciences: (i) resource prospection, (ii) reservoir simulation, (iii) ecological modeling, (iv) astronomy data management, and (v) simulation data management. The project involves computer scientists and numerical mathematicians divided into 3 fundamental research groups: (i) numerical schemes for PDE models (Group 1), (ii) scientific data management (Group 2), and (iii) high-performance software systems (Group 3).

An annual meeting was organized in Gramado, Brazil, in September 2014.

Title: Enabling Climate Simulations at Extreme Scale

Inria principal investigator: Luc Giraud

International Partners (Institution - Researcher):

Univ. Illinois at Urbana-Champaign & Argonne National Lab. - Franck Cappello,

Univ. Tennessee at Knoxville - George Bosilca,

German Research School for Simulation Sciences - Felix Wolf,

Univ. Victoria - Andrew Weaver,

Titech - Satoshi Matsuoka,

Univ. Tsukuba - Mitsuhisa Sato,

NCAR - Rich Loft,

Barcelona Supercomputing Center - Jesus Labarta.

Duration: 2011 - 2014

See also: G8 ESC-Enabling Climate Simulations at Extreme Scale

Exascale systems will allow unprecedented reduction of the uncertainties in climate change predictions via ultra-high resolution models, fewer simplifying assumptions, large climate ensembles and simulation at a scale needed to predict local effects. This is essential given the cost and consequences of inaction or wrong actions about climate change. To achieve this, we need careful co-design of future exascale systems and climate codes, to handle lower reliability, increased heterogeneity, and increased importance of locality. Our effort will initiate an international collaboration of climate and computer scientists that will identify the main roadblocks and analyze and test initial solutions for the execution of climate codes at extreme scale. This work will provide guidance to the future evolution of climate codes. We will pursue research projects to handle known roadblocks on resilience, scalability, and use of accelerators and organize international, interdisciplinary workshops to gather and disseminate information. The global nature of the climate challenge and the magnitude of the task strongly favor an international collaboration. The consortium gathers senior and early career researchers from USA, France, Germany, Spain, Japan and Canada and involves teams working on four major climate codes (CESM1, EC-EARTH, ECSM, NICAM).

Olivier Coulaud was a member of the organizing committee of the first International Workshop on Dislocation Dynamics Simulations, which was devoted to the latest developments realized worldwide in the field of Discrete Dislocation Dynamics simulations. This international event was held from December 10th to 12th at “Maison de la Simulation” in Saclay, France, and was attended by 55 participants.

Mathieu Faverge has been a member of the technical program committee of the international conference HiPC'14.

Luc Giraud has been a member of the scientific program committees of the international conferences HiPC'14, ICS'14, IPDPS'14, VecPar'14 and PDCN'14.

Jean Roman has been a member of the scientific program committee of the international conference IEEE PDP'14.

Luc Giraud has been involved in the first round of ANR evaluation and has performed reviewing for PRACE.

Furthermore, the HiePACS members have contributed to the reviewing process of several international conferences: IEEE HiPC 2014, CCGRID 2015, IEEE IPDPS 2015, IEEE PDP 2014, ....

Luc Giraud is a member of the editorial board of SIAM J. Matrix Analysis and Applications.

The HiePACS members have contributed to the reviewing process of several international journals (ACM Trans. on Mathematical Software, IEEE Trans. on Parallel and Distributed Systems, Journal of Engineering Mathematics, Parallel Computing, SIAM J. Matrix Analysis and Applications, SIAM J. Scientific Comp., ...).

Undergraduate level/Licence

A. Esnard: Operating system programming, 36h, University Bordeaux I; Using network, 23h, University Bordeaux I.

He is also in charge of the computer science certificate for Internet (C2i) at the University Bordeaux I.

M. Faverge: Programming Environment, 26h, L3; Numerical Algorithms, 30h, L3; C Projects, 20h, L3, ENSEIRB-MatMeca, France

P. Ramet: System programming, 24h; Databases, 32h; Object programming, 48h; Distributed programming, 32h; Cryptography, 32h, Bordeaux University.

Post graduate level/Master

O. Coulaud: Paradigms for parallel computing, 28h, ENSEIRB-MatMeca, Talence; Code coupling, 6h, ENSEIRB-MatMeca, Talence.

E. Agullo: Operating systems, 24h, University Bordeaux I; Dense linear algebra kernels, 8h, ENSEIRB-MatMeca; Numerical Algorithms, 30h, ENSEIRB-MatMeca, Talence.

A. Esnard: Network management, 27h, University Bordeaux I; Network security, 27h, University Bordeaux I; Programming distributed applications, 35h, ENSEIRB-MatMeca, Talence.

M. Faverge: System Programming, 74h, M1; Load Balancing and Scheduling, 19h, M2, ENSEIRB-MatMeca, Talence.

He is also in charge of the second year of Embedded Electronic Systems option at ENSEIRB-MatMeca, Talence.

P. Ramet: Scheduling, 8h; Numerical Algorithms, 30h; ENSEIRB-MatMeca, Talence.

He also gives classes on Cryptography, 30h, Ho Chi Minh City, Vietnam.

L. Giraud: Introduction to intensive computing and related programming tools, 20h, INSA Toulouse; Introduction to high performance computing and applications, 20h, ISAE-ENSICA; On mathematical tools for numerical simulations, 10h, ENSEEIHT Toulouse; Parallel sparse linear algebra, 11h, ENSEIRB-MatMeca, Talence.

A. Guermouche: Network management, 92h, University Bordeaux I; Network security, 64h, University Bordeaux I; Operating system, 24h, University Bordeaux I.

J. Roman: Parallel sparse linear algebra, 10h, ENSEIRB-MatMeca, Talence; Parallel algorithms, 22h, ENSEIRB-MatMeca, Talence.

Defended PhD thesis

Yohann Dudouit, *Scalable parallel elastodynamic
solver with local refinement in geophysics*,
defended on December 8^{th}, advisors: L. Giraud and S. Pernet (ONERA).

Andra Hugo, *Composabilité de codes parallèles
sur plateformes hétérogènes*,
defended on December 12^{th} 2014,
advisors: A. Guermouche, R. Namyst and P-A. Wacrenier.

Clément Vuchener, *Algorithmique de
l'équilibrage de charge pour des couplages de codes
complexes*,
defended on February 7^{th} 2014,
advisors: A. Esnard and J. Roman.

PhD in progress:

Pierre Blanchard, *Fast and accurate methods for
dislocation dynamics*, starting Oct. 2013, advisors:
O. Coulaud and E. Darve (Stanford Univ.).

Bérenger Bramas, *Optimization of time domain BEM
solvers*, starting Jan 2013, advisors: O. Coulaud and
G. Sylvand.

Astrid Casadei, *Scalabilité et robustesse
numérique des solveurs hybrides pour machines
massivement parallèles*, starting Oct. 2011, advisors:
F. Pellegrini and P. Ramet.

Jean-Marie Couteyen, *Parallélisation et passage à
l'échelle du code FLUSEPA*, starting Feb 2013, advisors:
P. Brenner (Airbus Defence and Space) and J. Roman.

Arnaud Etcheverry, *Toward large scale dynamic
dislocation simulation on petaflop computers*, starting
Oct. 2011, advisor: O. Coulaud.

Xavier Lacoste, *Scheduling and memory
optimizations for sparse direct solver on
multicore/multigpu cluster systems*, starting Jan. 2012,
advisors: F. Pellegrini and P. Ramet.

Alexis Praga, *Parallel atmospheric chemistry
and transport model solver for massively parallel platforms*,
starting Oct. 2011, advisors: D. Cariolle (CERFACS) and
L. Giraud.

Stojce Nakov, *Parallel hybrid solver for
heterogeneous manycores: application to
geophysics*, starting Oct. 2011, advisors:
E. Agullo and J. Roman.

Maria Predari, *Dynamic Load Balancing for
Massively Parallel Coupled Codes*, starting Oct. 2013,
advisors: A. Esnard and J. Roman.

Louis Poirel, *Two level hybrid linear solver*, starting Nov. 2014, advisors: E. Agullo, M. Faverge and L. Giraud.

Fabien Rozar, *Peta and exaflop algorithms for
turbulence simulations of fusion plasmas*, starting
Nov. 2012, advisors: G. Latu (CEA Cadarache) and J. Roman.

Moustapha Salli, *Design of a massively parallel
version of the SN method for neutronic simulations*, starting
Oct. 2012, advisors: L. Plagne (EDF), P. Ramet and J. Roman.

Mawussi Zounon, *Numerical resilient algorithms
for exascale*, starting Oct. 2011, advisors: E. Agullo
and L. Giraud.

HDR of B. Goglin (Université de Bordeaux) entitled “Vers des mécanismes génériques de communication et une meilleure maîtrise des affinités dans les grappes de calculateurs hiérarchiques" defended April 2014. J. Roman (examiner).

PhD of M. Dorier (Ecole Normale Supérieure de Rennes) entitled “Addressing the challenges of I/O variability in post-petascale HPC simulations" defended December 2014. J. Roman (external referee).

PhD of D. Genet (Université de Bordeaux) entitled “Design of generic modular solutions for PDE solvers for modern architectures" defended December 2014. J. Roman (examiner).

PhD of P. Jacq (Université de Bordeaux) entitled “Méthodes numériques de type volumes finis sur maillages non-structurés pour la résolution de la thermique anisotrope et des équations de Navier-Stokes compressibles" defended July 2014. J. Roman (examiner).

PhD of B. Lizé (Université Paris 13) entitled “Résolution Directe Rapide pour les Éléments Finis de Frontière en Électromagnétisme et
Acoustique :

PhD of P. Jolivet (Université de Grenoble et LJLL) entitled “Méthodes de décomposition de domaine. Application au calcul haute performance" defended October 2014. L. Giraud (examiner).

PhD of R. Kanna (Manchester University) entitled “Numerical linear algebra problems in structural analysis" defended October 2014. Jury: D. Silvester (internal referee), L. Giraud (external referee).

PhD of L. Boillot (Université de Pau et des Pays de l'Adour), entitled “Contributions à la modélisation mathématique et à l’algorithmique parallèle pour l’optimisation d’un propagateur d’ondes élastiques en milieu anisotrope” defended December 2014. E. Agullo (examiner).

In the context of the HPC-PME initiative, we started a collaboration with ALGO'TECH INFORMATIQUE and organised one of the first PhD-consultant actions, carried out by Xavier Lacoste and led by Pierre Ramet. ALGO’TECH is one of the most innovative SMEs (small and medium sized enterprises) in the field of cabling for embedded systems and, more broadly, automatic devices. The main target of the project is to validate the possibility of using our team's sparse linear solvers in the electromagnetic simulation tools developed by ALGO'TECH.

The HiePACS members have organized the PATC training session on Parallel Linear Algebra at “Maison de la Simulation" in Saclay from March 26th to March 28th.