## Section: Research Program

### Load balancing algorithms for complex simulations

Participants : Astrid Casadei, Olivier Coulaud, Aurélien Esnard, Maria Predari, Pierre Ramet, Jean Roman.

Many important physical phenomena in material physics and climatology are inherently complex applications. They often use multi-physics or multi-scale approaches, which couple different models and codes. The key idea is to reuse available legacy codes through a coupling framework instead of merging them into a stand-alone application. There is typically one model per different scale or physics and each model is implemented by a parallel code.

For instance, to model a crack propagation, one uses a molecular dynamic code to represent the atomistic scale and an elasticity code using a finite element method to represent the continuum scale. Indeed, fully microscopic simulations of most domains of interest are not computationally feasible. Combining such different scales or physics is still a challenge to reach high performance and scalability.

Another prominent example is found in the field of aeronautic propulsion: the conjugate heat transfer simulation in complex geometries (as developed by the CFD team of CERFACS) requires to couple a fluid/convection solver (AVBP) with a solid/conduction solver (AVTP). As the AVBP code is much more CPU consuming than the AVTP code, there is an important computational imbalance between the two solvers.

In this context, one crucial issue is undoubtedly the load balancing
of the whole coupled simulation that remains an open question. The
goal here is to find the best data distribution for the whole coupled
simulation and not only for each stand-alone code, as it is most
usually done. Indeed, the naive balancing of each code on its own can
lead to an important imbalance and to a communication bottleneck
during the coupling phase, which can drastically decrease the overall
performance. Therefore, we argue that it is required to model the
coupling itself in order to ensure a good scalability, especially when
running on massively parallel architectures (tens of thousands of
processors/cores). In other words, one must develop new algorithms and
software implementation to perform a *coupling-aware* partitioning
of the whole application.
Another related problem is the problem of resource allocation. This is
particularly important for the global coupling efficiency and
scalability, because each code involved in the coupling can be more or
less computationally intensive, and there is a good trade-off to find
between resources assigned to each code to avoid that one of them
waits for the other(s). What does furthermore happen if the load of one code
dynamically changes relatively to the other one? In such a case, it could
be convenient to dynamically adapt the number of resources used
during the execution.

There are several open algorithmic problems that we investigate in the HiePACS project-team. All these problems uses a similar methodology based upon the graph model and are expressed as variants of the classic graph partitioning problem, using additional constraints or different objectives.

#### Dynamic load-balancing with variable number of processors

As a preliminary step related to the dynamic load balancing of coupled
codes, we focus on the problem of dynamic load balancing of a single
parallel code, with variable number of processors. Indeed, if the
workload varies drastically during the simulation, the load must be
redistributed regularly among the processors. Dynamic load balancing
is a well studied subject but most studies are limited to an initially
fixed number of processors. Adjusting the number of processors at
runtime allows one to preserve the parallel code efficiency or keep
running the simulation when the current memory resources are
exceeded. We call this problem, *MxN graph repartitioning*.

We propose some methods based on graph repartitioning in order to re-balance the load while changing the number of processors. These methods are split in two main steps. Firstly, we study the migration phase and we build a “good” migration matrix minimizing several metrics like the migration volume or the number of exchanged messages. Secondly, we use graph partitioning heuristics to compute a new distribution optimizing the migration according to the previous step results.

#### Load balancing of coupled codes

As stated above, the load balancing of coupled code is a major
issue, that determines the performance of the complex simulation, and
reaching high performance can be a great challenge. In this context,
we develop new graph partitioning techniques, called *co-partitioning*. They address the problem of load balancing for two
coupled codes: the key idea is to perform a "coupling-aware"
partitioning, instead of partitioning these codes independently, as it
is classically done. More precisely, we propose to enrich the classic
graph model with *inter-edges*, which represent the coupled code
interactions. We describe two new algorithms, and compare them to the
naive approach. In the preliminary experiments we perform on
synthetically-generated graphs, we notice that our algorithms succeed
to balance the computational load in the coupling phase and in some
cases they succeed to reduce the coupling communications costs.
Surprisingly, we notice that our algorithms do not degrade
significantly the global graph edge-cut, despite the additional
constraints that they impose.

Besides this, our co-partitioning technique requires to use graph
partitioning with *fixed vertices*, that raises serious issues
with state-of-the-art software, that are classically based on the
well-known recursive bisection paradigm (RB). Indeed, the RB method
often fails to produce partitions of good quality. To overcome this
issue, we propose a *new* direct $k$-way greedy graph growing
algorithm, called KGGGP, that overcomes this issue and succeeds to
produce partition with better quality than RB while respecting the
constraint of fixed vertices. Experimental results compare KGGGP
against state-of-the-art methods, such as `Scotch`, for real-life
graphs available from the popular *DIMACS'10* collection.

#### Load balancing strategies for hybrid sparse linear solvers

Graph handling and partitioning play a central role in the activity
described here but also in other numerical techniques detailed in
sparse linear algebra Section.
The Nested Dissection is now a well-known heuristic for sparse matrix
ordering to both reduce the fill-in during numerical factorization and
to maximize the number of independent computation tasks. By using the
block data structure induced by the partition of separators of the
original graph, very efficient parallel block solvers have been
designed and implemented according to super-nodal or multi-frontal
approaches. Considering hybrid methods mixing both direct and
iterative solvers such as `HIPS` or `MaPHyS`, obtaining a domain
decomposition leading to a good balancing of both the size of domain
interiors and the size of interfaces is a key point for load balancing
and efficiency in a parallel context.

We intend to revisit some well-known graph partitioning techniques in
the light of the hybrid solvers and design new algorithms to be tested
in the `Scotch` package.