Over the last few decades, there have been innumerable science,
engineering and societal breakthroughs enabled by the development of
high performance computing (HPC) applications, algorithms and
architectures.
These powerful tools have provided researchers with the ability to
computationally find efficient solutions for some of the most
challenging scientific questions and problems in medicine and biology,
climatology, nanotechnology, energy and environment.
It is admitted today that numerical simulation is the third pillar
for the development of scientific discovery at the same level as
theory and experimentation.
Numerous reports and papers also confirmed that very high performance
simulation will open new opportunities not only for research but also
for a large spectrum of industrial sectors (see for example the
documents available on the web link
http://
An important force which has continued to drive HPC has been to focus on frontier milestones which consist in technical goals that symbolize the next stage of progress in the field. In the 1990s, the HPC community sought to achieve computing at a teraflop rate and currently we are able to compute on the first leading architectures at a petaflop rate. Generalist petaflop supercomputers are available and exaflop computers are foreseen in early 2020.
For application codes to sustain petaflops and more in the next few years, hundreds of thousands of processor cores or more will be needed, regardless of processor technology. Currently, a few HPC simulation codes easily scale to this regime and major algorithms and codes development efforts are critical to achieve the potential of these new systems. Scaling to a petaflop and more will involve improving physical models, mathematical modeling, super scalable algorithms that will require paying particular attention to acquisition, management and visualization of huge amounts of scientific data.
In this context, the purpose of the HiePACS project is to contribute performing efficiently frontier simulations arising from challenging academic and industrial research that are likely to be multiscale and coupled applications. The solution of these challenging problems require a multidisciplinary approach involving applied mathematics, computational and computer sciences. In applied mathematics, it essentially involves advanced numerical schemes. In computational science, it involves massively parallel computing and the design of highly scalable algorithms and codes to be executed on emerging hierarchical many-core platforms. Through this approach, HiePACS intends to contribute to all steps that go from the design of new high-performance more scalable, robust and more accurate numerical schemes to the optimized implementations of the associated algorithms and codes on very high performance supercomputers. This research will be conduced on close collaboration in particular with European and US initiatives or projects such as PRACE (Partnership for Advanced Computing in Europe) EESI-2 (European Exascale Software Initiative 2) and likely in the framework of H2020 European collaborative projects.
The methodological part of HiePACS covers several topics. First, we address generic studies concerning massively parallel computing, the design of high-end performance algorithms and software to be executed on future extreme scale platforms. Next, several research prospectives in scalable parallel linear algebra techniques are addressed, ranging from dense direct, sparse direct, iterative and hybrid approaches for large linear systems. Then we consider research plans for N-body interaction computations based on efficient parallel fast multipole methods and finally, we adress research tracks related to the algorithmic challenges for complex code couplings in multiscale/multiphysic simulations.
Currently, we have one major multiscale application that is in material physics. We contribute to all steps of the design of the parallel simulation tool. More precisely, our applied mathematics skill will contribute to the modeling and our advanced numerical schemes will help in the design and efficient software implementation for very large parallel multiscale simulations. Moreover, the robustness and efficiency of our algorithmic research in linear algebra are validated through industrial and academic collaborations with different partners involved in various application fields. Finally, we are also involved in a few collaborative intiatives in various application domains in a co-design like framework. These research activities are conducted in a wider multi-disciplinary context with collegues in other academic or industrial groups where our contribution is related to our expertises. Not only these collaborations enable our knowledges to have a stronger impact in various application domains through the promotion of advanced algorithms, methodologies or tools, but in return they open new avenues for research in the continuity of our core research activities.
Thanks to the two Inria collaborative agreements such as with EADS-Astrium/Conseil Régional Aquitaine and with CEA, we have joint research efforts in a co-design framework enabling efficient and effective technological transfer towards industrial R&D. Furthermore, thanks to two ongoing associated teams, namely MORSE and FastLA we contribute with world leading groups to the design of fast numerical solvers and their parallel implementations.
Our high performance software packages are integrated in several academic or industrial complex codes and are validated on very large scale simulations. For all our software developments, we use first the experimental platform PlaFRIM, the various large parallel platforms available through GENCI in France (CCRT, CINES and IDRIS Computational Centers), and next the high-end parallel platforms that will be available via European and US initiatives or projects such that PRACE.
The methodological component of HiePACS concerns the expertise for the design as well as the efficient and scalable implementation of highly parallel numerical algorithms to perform frontier simulations. In order to address these computational challenges a hierarchical organization of the research is considered. In this bottom-up approach, we first consider in Section generic topics concerning high performance computational science. The activities described in this section are transversal to the overall project and their outcome will support all the other research activities at various levels in order to ensure the parallel scalability of the algorithms. The aim of this activity is not to study general purpose solution but rather to address these problems in close relation with specialists of the field in order to adapt and tune advanced approaches in our algorithmic designs. The next activity, described in Section , is related to the study of parallel linear algebra techniques that currently appear as promising approaches to tackle huge problems on extreme scale platforms. We highlight the linear problems (linear systems or eigenproblems) because they are in many large scale applications the main computational intensive numerical kernels and often the main performance bottleneck. These parallel numerical techniques, which are involved in the IPL C2S@Exa, will be the basis of both academic and industrial collaborations, some are described in Section , but will also be closely related to some functionalities developed in the parallel fast multipole activity described in Section . Finally, as the accuracy of the physical models increases, there is a real need to go for parallel efficient algorithm implementation for multiphysics and multiscale modeling in particular in the context of code coupling. The challenges associated with this activity will be addressed in the framework of the activity described in Section .
Currently, we have one major application (see Section ) that is in material physics. We will contribute to all steps of the design of the parallel simulation tool. More precisely, our applied mathematics skill will contribute to the modelling, our advanced numerical schemes will help in the design and efficient software implementation for very large parallel multi-scale simulations. We also participate to a few co-design actions in close collaboration with some applicative groups. The objective of this activity is to instanciate our expertise in fields where they are critical for designing scalable simulation tools. We refer to Section for a detailed description of these activities.
.
The research directions proposed in HiePACS are strongly influenced by both the applications we are studying and the architectures that we target (i.e., massively parallel many-core architectures, ...). Our main goal is to study the methodology needed to efficiently exploit the new generation of high-performance computers with all the constraints that it induces. To achieve this high-performance with complex applications we have to study both algorithmic problems and the impact of the architectures on the algorithm design.
From the application point of view, the project will be interested in multiresolution, multiscale and hierarchical approaches which lead to multi-level parallelism schemes. This hierarchical parallelism approach is necessary to achieve good performance and high-scalability on modern massively parallel platforms. In this context, more specific algorithmic problems are very important to obtain high performance. Indeed, the kind of applications we are interested in are often based on data redistribution for example (e.g. code coupling applications). This well-known issue becomes very challenging with the increase of both the number of computational nodes and the amount of data. Thus, we have both to study new algorithms and to adapt the existing ones. In addition, some issues like task scheduling have to be restudied in this new context. It is important to note that the work done in this area will be applied for example in the context of code coupling (see Section ).
Considering the complexity of modern architectures like massively parallel architectures or new generation heterogeneous multicore architectures, task scheduling becomes a challenging problem which is central to obtain a high efficiency. Of course, this work requires the use/design of scheduling algorithms and models specifically to tackle our target problems. This has to be done in collaboration with our colleagues from the scheduling community like for example O. Beaumont (Inria REALOPT Project-Team). It is important to note that this topic is strongly linked to the underlying programming model. Indeed, considering multicore architectures, it has appeared, in the last five years, that the best programming model is an approach mixing multi-threading within computational nodes and message passing between them. In the last five years, a lot of work has been developed in the high-performance computing community to understand what is critic to efficiently exploit massively multicore platforms that will appear in the near future. It appeared that the key for the performance is firstly the grain of computations. Indeed, in such platforms the grain of the parallelism must be small so that we can feed all the processors with a sufficient amount of work. It is thus very crucial for us to design new high performance tools for scientific computing in this new context. This will be developed in the context of our solvers, for example, to adapt to this new parallel scheme. Secondly, the larger the number of cores inside a node, the more complex the memory hierarchy. This remark impacts the behaviour of the algorithms within the node. Indeed, on this kind of platforms, NUMA effects will be more and more problematic. Thus, it is very important to study and design data-aware algorithms which take into account the affinity between computational threads and the data they access. This is particularly important in the context of our high-performance tools. Note that this work has to be based on an intelligent cooperative underlying run-time (like the tools developed by the Inria RUNTIME Project-Team) which allows a fine management of data distribution within a node.
Another very important issue concerns high-performance computing using “heterogeneous” resources within a computational node. Indeed, with the emergence of the GPU and the use of more specific co-processors, it is important for our algorithms to efficiently exploit these new kind of architectures. To adapt our algorithms and tools to these accelerators, we need to identify what can be done on the GPU for example and what cannot. Note that recent results in the field have shown the interest of using both regular cores and GPU to perform computations. Note also that in opposition to the case of the parallelism granularity needed by regular multicore architectures, GPU requires coarser grain parallelism. Thus, making both GPU and regular cores work all together will lead to two types of tasks in terms of granularity. This represents a challenging problem especially in terms of scheduling. From this perspective, we investigate new approaches for composing parallel applications within a runtime system for heterogeneous platforms.
The SOLHAR project aims at studying and designing algorithms and parallel programming models for implementing direct methods for the solution of sparse linear systems on emerging computers equipped with accelerators. Several attempts have been made to accomplish the porting of these methods on such architectures; the proposed approaches are mostly based on a simple offloading of some computational tasks (the coarsest grained ones) to the accelerators and rely on fine hand-tuning of the code and accurate performance modeling to achieve efficiency. SOLHAR proposes an innovative approach which relies on the efficiency and portability of runtime systems, such as the StarPU tool developed in the RUNTIME team. Although the SOLHAR project will focus on heterogeneous computers equipped with GPUs due to their wide availability and affordable cost, the research accomplished on algorithms, methods and programming models will be readily applicable to other accelerator devices. Our final goal would be to have high performance solvers and tools which can efficiently run on all these types of complex architectures by exploiting all the resources of the platform (even if they are heterogeneous).
In order to achieve an advanced knowledge concerning the design of efficient computational kernels to be used on our high performance algorithms and codes, we will develop research activities first on regular frameworks before extending them to more irregular and complex situations. In particular, we will work first on optimized dense linear algebra kernels and we will use them in our more complicated direct and hybrid solvers for sparse linear algebra and in our fast multipole algorithms for interaction computations. In this context, we will participate to the development of those kernels in collaboration with groups specialized in dense linear algebra. In particular, we intend develop a strong collaboration with the group of Jack Dongarra at the University of Tennessee and collaborating research groups. The objectives will be to develop dense linear algebra algorithms and libraries for multicore architectures in the context the PLASMA project and for GPU and hybrid multicore/GPU architectures in the context of the MAGMA project. The framework that hosts all these research activities is the associate team MORSE.
A more prospective objective is to study the resiliency in the context of large-scale scientific applications for massively parallel architectures. Indeed, with the increase of the number of computational cores per node, the probability of a hardware crash on a core or of a memory corruption is dramatically increased. This represents a crucial problem that needs to be addressed. However, we will only study it at the algorithmic/application level even if it needed lower-level mechanisms (at OS level or even hardware level). Of course, this work can be performed at lower levels (at operating system) level for example but we do believe that handling faults at the application level provides more knowledge about what has to be done (at application level we know what is critical and what is not). The approach that we will follow will be based on the use of a combination of fault-tolerant implementations of the run-time environments we use (like for example FT-MPI) and an adaptation of our algorithms to try to manage this kind of faults. This topic represents a very long range objective which needs to be addressed to guaranty the robustness of our solvers and applications. In that respect, we are involved in a ANR-Blanc project entitles RESCUE jointly with two other Inria EPI, namely ROMA and GRAND-LARGE and the G8 ESC international initiative as well as in the EXA2CT FP7 project. The main objective of the RESCUE project is to develop new algorithmic techniques and software tools to solve the exascale resilience problem. Solving this problem implies a departure from current approaches, and calls for yet-to-be- discovered algorithms, protocols and software tools.
Finally, it is important to note that the main goal of HiePACS is to design tools and algorithms that will be used within complex simulation frameworks on next-generation parallel machines. Thus, we intend with our partners to use the proposed approach in complex scientific codes and to validate them within very large scale simulations as well as designing parallel solution in co-design collaborations.
Starting with the developments of basic linear algebra kernels tuned for
various classes of computers, a significant knowledge on
the basic concepts for implementations on high-performance scientific computers has been accumulated.
Further knowledge has been acquired through the design of more sophisticated linear algebra algorithms
fully exploiting those basic intensive computational kernels.
In that context, we still look at the development of new computing platforms and their associated programming
tools.
This enables us to identify the possible bottlenecks of new computer architectures
(memory path, various level of caches, inter processor or node network) and to propose
ways to overcome them in algorithmic design.
With the goal of designing efficient scalable linear algebra solvers for large scale applications, various
tracks will be followed in order to investigate different complementary approaches.
Sparse direct solvers have been for years the methods of choice for solving linear systems of equations,
it is nowadays admitted that classical approaches are not scalable neither from a computational complexity
nor from a memory view point for large problems such as those arising from the discretization of large 3D PDE problems.
We will continue to work on sparse direct solvers on the one hand to make sure they fully benefit from most advanced computing platforms
and on the other hand to attempt to reduce their memory and computational costs for some classes of problems where
data sparse ideas can be considered.
Furthermore, sparse direct solvers are a key building boxes for the
design of some of our parallel algorithms such as the hybrid solvers described in the sequel of this section.
Our activities in that context will mainly address preconditioned Krylov subspace methods; both components,
preconditioner and Krylov solvers, will be investigated.
In this framework, and possibly in relation with the research activity on fast multipole, we intend to study how emerging
Solving large sparse systems
Sparse direct solvers are mandatory when the linear system is very ill-conditioned; such a situation is often encountered in structural mechanics codes, for example. Therefore, to obtain an industrial software tool that must be robust and versatile, high-performance sparse direct solvers are mandatory, and parallelism is then necessary for reasons of memory capability and acceptable solution time. Moreover, in order to solve efficiently 3D problems with more than 50 million unknowns, which is now a reachable challenge with new multicore supercomputers, we must achieve good scalability in time and control memory overhead. Solving a sparse linear system by a direct method is generally a highly irregular problem that induces some challenging algorithmic problems and requires a sophisticated implementation scheme in order to fully exploit the capabilities of modern supercomputers.
New supercomputers incorporate many microprocessors which are composed of one or many computational cores. These new architectures induce strongly hierarchical topologies. These are called NUMA architectures. In the context of distributed NUMA architectures, in collaboration with the Inria RUNTIME team, we study optimization strategies to improve the scheduling of communications, threads and I/O. We have developed dynamic scheduling designed for NUMA architectures in the PaStiX solver. The data structures of the solver, as well as the patterns of communication have been modified to meet the needs of these architectures and dynamic scheduling. We are also interested in the dynamic adaptation of the computation grain to use efficiently multi-core architectures and shared memory. Experiments on several numerical test cases have been performed to prove the efficiency of the approach on different architectures.
In collaboration with the ICL team from the University of Tennessee, and the RUNTIME team from Inria, we are evaluating the way to replace the embedded scheduling driver of the PaStiX solver by one of the generic frameworks, PaRSEC or StarPU, to execute the task graph corresponding to a sparse factorization. The aim is to design algorithms and parallel programming models for implementing direct methods for the solution of sparse linear systems on emerging computer equipped with GPU accelerators. More generally, this work will be performed in the context of the associate team MORSE and the ANR SOLHAR project which aims at designing high performance sparse direct solvers for modern heterogeneous systems. This ANR project involves several groups working either on the sparse linear solver aspects (HiePACS and ROMA from Inria and APO from IRIT), on runtime systems (RUNTIME from Inria) or scheduling algorithms (REALOPT and ROMA from Inria). The results of these efforts will be validated in the applications provided by the industrial project members, namely CEA-CESTA and Airbus Group Innovations.
On the numerical side, we are studying how the data sparsness that might exist in
some dense blocks appearing during the factorization can be exploited using different
compression techniques based on
One route to the parallel scalable solution of large sparse linear systems in parallel scientific computing is the use of hybrid methods that hierarchically combine direct and iterative methods. These techniques inherit the advantages of each approach, namely the limited amount of memory and natural parallelization for the iterative component and the numerical robustness of the direct part. The general underlying ideas are not new since they have been intensively used to design domain decomposition techniques; those approaches cover a fairly large range of computing techniques for the numerical solution of partial differential equations (PDEs) in time and space. Generally speaking, it refers to the splitting of the computational domain into sub-domains with or without overlap. The splitting strategy is generally governed by various constraints/objectives but the main one is to express parallelism. The numerical properties of the PDEs to be solved are usually intensively exploited at the continuous or discrete levels to design the numerical algorithms so that the resulting specialized technique will only work for the class of linear systems associated with the targeted PDE.
In that context, we intend to continue our effort on the design of algebraic non-overlapping domain decomposition techniques that rely on the solution of a Schur complement system defined on the interface introduced by the partitioning of the adjacency graph of the sparse matrix associated with the linear system. Although it is better conditioned than the original system the Schur complement needs to be precondition to be amenable to a solution using a Krylov subspace method. Different hierarchical preconditioners will be considered, possibly multilevel, to improve the numerical behaviour of the current approaches implemented in our software libraries HIPS and MaPHyS. This activity will be developed in the context of the ANR DEDALES project. In addition to this numerical studies, advanced parallel implementation will be developed that will involve close collaborations between the hybrid and sparse direct activities.
Preconditioning is the main focus of the two activities described above. They aim at speeding up the convergence of a Krylov subspace method that is the complementary component involved in the solvers of interest for us. In that framework, we believe that various aspects deserve to be investigated; we will consider the following ones:
preconditioned block Krylov solvers for multiple right-hand sides.
In many large scientific and industrial applications, one has to solve
a sequence of linear systems with several right-hand sides given simultaneously or in sequence
(radar cross section calculation in electromagnetism, various source locations in seismic, parametric studies
in general, ...).
For “simultaneous" right-hand sides, the solvers of choice have been for years based on matrix factorizations
as the factorization is performed once and simple and cheap block forward/backward substitutions are then performed.
In order to effectively propose alternative to such solvers, we need to have efficient preconditioned Krylov subspace
solvers.
In that framework, block Krylov approaches, where the Krylov spaces associated with each right-hand side
are shared to enlarge the search space will be considered.
They are not only attractive because of this numerical feature (larger search space), but also from an
implementation point of view.
Their block-structures exhibit nice features with respect to data locality and re-usability that comply
with the memory constraint of multicore architectures.
Following the initial work by J. Yan Fei during his post-doc in HiePACS, we will continue the numerical study of the block GMRES variant that combines
inexact break-down detection and deflation at restart.
In addition a special attention will be paid to situations where a massive number of right-hand sides are given where
variants exploiting the possible sparsness (i.e., compression using
For right-hand sides available one after each other, various strategies that exploit the information available in the sequence of Krylov spaces (e.g. spectral information) will be considered that include for instance technique to perform incremental update of the preconditioner or to build augmented Krylov subspaces.
Extension or modification of Krylov subspace algorithms for multicore architectures: finally to match as much as possible to the computer architecture evolution and get as much as possible performance out of the computer, a particular attention will be paid to adapt, extend or develop numerical schemes that comply with the efficiency constraints associated with the available computers. Nowadays, multicore architectures seem to become widely used, where memory latency and bandwidth are the main bottlenecks; investigations on communication avoiding techniques will be undertaken in the framework of preconditioned Krylov subspace solvers as a general guideline for all the items mentioned above. This research activity will benefit from the FP7 EXA2CT project led by HiePACS on behalf of the IPL C2S@Exa that involves two other Inria projects namely ALPINES and SAGE.
Many eigensolvers also rely on Krylov subspace techniques. Naturally some links exist between the Krylov subspace linear solvers and the Krylov subspace eigensolvers. We plan to study the computation of eigenvalue problems with respect to the following two different axes:
Exploiting the link between Krylov subspace methods for linear system solution and eigensolvers, we intend to develop advanced iterative linear methods based on Krylov subspace methods that use some spectral information to build part of a subspace to be recycled, either though space augmentation or through preconditioner update. This spectral information may correspond to a certain part of the spectrum of the original large matrix or to some approximations of the eigenvalues obtained by solving a reduced eigenproblem. This technique will also be investigated in the framework of block Krylov subspace methods.
In the context of the calculation of the ground state of an atomistic system, eigenvalue computation is a critical step; more accurate and more efficient parallel and scalable eigensolvers are required.
.
In most scientific computing applications considered nowadays as
computational challenges (like biological and material systems,
astrophysics or electromagnetism), the introduction of hierarchical
methods based on an octree structure has dramatically reduced the
amount of computation needed to simulate those systems for a given
accuracy. For instance, in the N-body problem arising from
these application fields, we must compute all pairwise
interactions among N objects (particles, lines, ...) at every
timestep. Among these methods, the Fast Multipole
Method (FMM) developed for gravitational potentials in astrophysics
and for electrostatic (coulombic) potentials in molecular simulations
solves this N-body problem for any given precision with
The potential field is decomposed in a near field part, directly computed, and a far field part approximated thanks to multipole and local expansions. We introduced a matrix formulation of the FMM that exploits the cache hierarchy on a processor through the Basic Linear Algebra Subprograms (BLAS). Moreover, we developed a parallel adaptive version of the FMM algorithm for heterogeneous particle distributions, which is very efficient on parallel clusters of SMP nodes. Finally on such computers, we developed the first hybrid MPI-thread algorithm, which enables to reach better parallel efficiency and better memory scalability. We plan to work on the following points in HiePACS.
Nowadays, the high performance computing community is examining alternative architectures that address the limitations of modern cache-based designs. GPU (Graphics Processing Units) and the Cell processor have thus already been used in astrophysics and in molecular dynamics. The Fast Mutipole Method has also been implemented on GPU. We intend to examine the potential of using these forthcoming processors as a building block for high-end parallel computing in N-body calculations. More precisely, we want to take advantage of our specific underlying BLAS routines to obtain an efficient and easily portable FMM for these new architectures. Algorithmic issues such as dynamic load balancing among heterogeneous cores will also have to be solved in order to gather all the available computation power. This research action will be conduced on close connection with the activity described in Section .
In many applications arising from material physics or astrophysics, the distribution of the data is highly non uniform and the data can grow between two time steps. As mentioned previously, we have proposed a hybrid MPI-thread algorithm to exploit the data locality within each node. We plan to further improve the load balancing for highly non uniform particle distributions with small computation grain thanks to dynamic load balancing at the thread level and thanks to a load balancing correction over several simulation time steps at the process level.
The engine that we develop will be extended to new potentials arising
from material physics such as those used in dislocation
simulations. The interaction between dislocations is long ranged
(
The boundary element method (BEM) is a well known
solution of boundary value problems appearing in various fields of
physics. With this approach, we only have to solve an integral
equation on the boundary. This implies an interaction that decreases in space, but results
in the solution of a dense linear system with
Many important physical phenomena in material physics and climatology are inherently complex applications. They often use multi-physics or multi-scale approaches, that couple different models and codes. The key idea is to reuse available legacy codes through a coupling framework instead of merging them into a standalone application. There is typically one model per different scale or physics; and each model is implemented by a parallel code. For instance, to model a crack propagation, one uses a molecular dynamic code to represent the atomistic scale and an elasticity code using a finite element method to represent the continuum scale. Indeed, fully microscopic simulations of most domains of interest are not computationally feasible. Combining such different scales or physics is still a challenge to reach high performance and scalability. If the model aspects are often well studied, there are several open algorithmic problems, that we plan to investigate in the HiePACS project-team.
As mentioned previously, many important physical phenomena, such as material deformation and failure (see Section ), are inherently multiscale processes that cannot always be modeled via continuum model. Fully microscopic simulations of most domains of interest are not computationally feasible. Therefore, researchers must look at multiscale methods that couple micro models and macro models. Combining different scales such as quantum-atomistic or atomistic, mesoscale and continuum, are still a challenge to obtain efficient and accurate schemes that efficiently and effectively exchange information between the different scales. We are currently involved in two national research projects, that focus on multiscale schemes. More precisely, the models that we start to study are the quantum to atomic coupling (QM/MM coupling) in the ANR NOSSI and the atomic to dislocation coupling in the ANR OPTIDIS.
In this context of code coupling, one crucial issue is undoubtedly the load balancing of the whole coupled simulation that remains an open question. The goal here is to find the best data distribution for the whole coupled simulation and not only for each standalone code, as it is most usually done. Indeed, the naive balancing of each code on its own can lead to an important imbalance and to a communication bottleneck during the coupling phase, that can drastically decrease the overall performance. Therefore, one argues that it is required to model the coupling itself in order to ensure a good scalability, especially when running on massively parallel architectures (tens of thousands of processors/cores). In other words, one must develop new algorithms and software implementation to perform a coupling-aware partitioning of the whole application.
Another related problem is the problem of resource allocation. This is particularly important for the global coupling efficiency and scalabilty, because each code involved in the coupling can be more or less computationally intensive, and there is a good trade-off to find between resources assigned to each code to avoid that one of them waits for the other(s). And what happens if the load of one code dynamically changes relatively to the other? In such a case, it could be convenient to dynamically adapt the number of resources used at runtime.
For instance, the conjugate heat transfer simulation in complex geometries (as developed by the CFD team of CERFACS) requires to couple a fluid/convection solver (AVBP) with a solid/conduction solver (AVTP). The AVBP code is much more CPU consuming than the AVTP code. As a consequence, there is an important computational imbalance between the two solvers. The use of new algorithms to correctly load balance coupled simulations with enhanced graph partitioning techniques appears as a promising way to reach better performances of coupled application on massively parallel computers.
Graph handling and partitioning play a central role in the activity described here but also in other numerical techniques detailed in Section .
The Nested Dissection is now a well-known heuristic for sparse matrix ordering to both reduce the fill-in during numerical factorization and to maximize the number of independent computation tasks. By using the block data structure induced by the partition of separators of the original graph, very efficient parallel block solvers have been designed and implemented according to supernodal or multifrontal approaches. Considering hybrid methods mixing both direct and iterative solvers such as HIPS or MaPHyS, obtaining a domain decomposition leading to a good balancing of both the size of domain interiors and the size of interfaces is a key point for load balancing and efficiency in a parallel context. We intend to revisit some well-known graph partitioning techniques in the light of the hybrid solvers and design new algorithms to be tested in the Scotch package.
.
Due to the increase of available computer power, new applications in nano science and physics appear such as study of properties of new materials (photovoltaic materials, bio- and environmental sensors, ...), failure in materials, nano-indentation. Chemists, physicists now commonly perform simulations in these fields. These computations simulate systems up to billion of atoms in materials, for large time scales up to several nanoseconds. The larger the simulation, the smaller the computational cost of the potential driving the phenomena, resulting in low precision results. So, if we need to increase the precision, there are two ways to decrease the computational cost. In the first approach, we improve algorithms and their parallelization and in the second way, we will consider a multiscale approach.
A domain of interest is the material aging for the nuclear industry. The materials are exposed to complex conditions due to the combination of thermo-mechanical loading, the effects of irradiation and the harsh operating environment. This operating regime makes experimentation extremely difficult and we must rely on multi-physics and multi-scale modeling for our understanding of how these materials behave in service. This fundamental understanding helps not only to ensure the longevity of existing nuclear reactors, but also to guide the development of new materials for 4th generation reactor programs and dedicated fusion reactors. For the study of crystalline materials, an important tool is dislocation dynamics (DD) modeling. This multiscale simulation method predicts the plastic response of a material from the underlying physics of dislocation motion. DD serves as a crucial link between the scale of molecular dynamics and macroscopic methods based on finite elements; it can be used to accurately describe the interactions of a small handful of dislocations, or equally well to investigate the global behavior of a massive collection of interacting defects.
To explore i.e. to simulate these new areas, we need to develop and/or to improve significantly models, schemes and solvers used in the classical codes. In the project, we want to accelerate algorithms arising in those fields. We will focus on the following topics (in particular in the currently under definition OPTIDIS project in collaboration with CEA Saclay, CEA Ile-de-france and SIMaP Laboratory in Grenoble) in connection with research described at Sections and .
The interaction between dislocations is long ranged (
In such simulations, the number of dislocations grows while the phenomenon occurs and these dislocations are not uniformly distributed in the domain. This means that strategies to dynamically construct a good load balancing are crucial to acheive high performance.
From a physical and a simulation point of view, it will be interesting to couple a molecular dynamics model (atomistic model) with a dislocation one (mesoscale model). In such three-dimensional coupling, the main difficulties are firstly to find and characterize a dislocation in the atomistic region, secondly to understand how we can transmit with consistency the information between the two micro and meso scales.
.
The research activities concerning the ITER challenge are involved in the Inria Project Lab (IPL) C2S@Exa.
The numerical simulations tools designed for ITER challenges aim at making a significant progress in understanding active control methods of plasma edge MHD instabilities Edge Localized Modes (ELMs) which represent particular danger with respect to heat and particle loads for Plasma Facing Components (PFC) in ITER. Project is focused in particular on the numerical modeling study of such ELM control methods as Resonant Magnetic Perturbations (RMPs) and pellet ELM pacing both foreseen in ITER. The goals of the project are to improve understanding the related physics and propose possible new strategies to improve effectiveness of ELM control techniques. The tool for the nonlinear MHD modeling (code JOREK) will be largely developed within the present project to include corresponding new physical models in conjunction with new developments in mathematics and computer science strategy in order to progress in urgently needed solutions for ITER.
The fully implicit time evolution scheme in the JOREK code leads to large sparse linear systems that have to be solved at every time step. The MHD model leads to very badly conditioned matrices. In principle the PaStiX library can solve these large sparse problems using a direct method. However, for large 3D problems the CPU time for the direct solver becomes too large. Iterative solution methods require a preconditioner adapted to the problem. Many of the commonly used preconditioners have been tested but no satisfactory solution has been found. The research activities presented in Section will contribute to design new solution techniques best suited for this context.
In the context of the ITER challenge, the GYSELA project aims at simulating the turbulence of plasma particules inside a tokamak. Thanks to a better comprehension of this phenomenon, it would be possible to design a new kind of source of energy based of nuclear fusion. Currently, GYSELA is parallalized in a MPI/OpenMP way and can exploit the power of the current greatest supercomputers (e.g., Juqueen). To simulate faithfully the plasma physic, GYSELA handles a huge amount of data. In fact, the memory consumption is a bottleneck on large simulations (449 K cores). In the meantime all the reports on the future Exascale machines expect a decrease of the memory per core. In this context, mastering the memory consumption of the code becomes critical to consolidate its scalability and to enable the implementation of new features to fully benefit from the extreme scale architectures.
In addition to activities for designing advanced generic tools for managing the memory optimisation, further algorithmic research will be conduced to better predict and limit the memory peak in order to reduce the memory footprint of GYSELA.
As part of its activity, EDF R&D is developing a new nuclear core simulation code named COCAGNE that relies on a Simplified PN (SPN) method to compute the neutron flux inside the core for eigenvalue calculations. In order to assess the accuracy of SPN results, a 3D Cartesian model of PWR nuclear cores has been designed and a reference neutron flux inside this core has been computed with a Monte Carlo transport code from Oak Ridge National Lab. This kind of 3D whole core probabilistic evaluation of the flux is computationally very demanding. An efficient deterministic approach is therefore required to reduce the computation effort dedicated to reference simulations.
In this collaboration, we work on the parallelization (for shared and distributed memories) of the DOMINO code, a parallel 3D Cartesian SN solver specialized for PWR core reactivity computations which is fully integrated in the COCAGNE system.
ASTRIUM has developped for 20 years the FLUSEPA code which focuses on unsteady phenomenon with changing topology like stage separation or rocket launch. The code is based on a finite volume formulation with temporal adaptive time integration and supports bodies in relative motion. The temporal adaptive integration classifies cells in several temporal levels, zero being the level with the slowest cells and each level being twice as fast as the previous one. This repartition can evolve during the computation, leading to load-balancing issues in a parallel computation context. Bodies in relative motion are managed through a CHIMERA-like technique which allows building a composite mesh by merging multiple meshes. The meshes with the highest priorities recover the least ones, and at the boundaries of the covered mesh, an intersection is computed. Unlike classical CHIMERA technique, no interpolation is performed, allowing a conservative flow integration. The main objective of this research is to design a scalable version of FLUSEPA in order to run efficiently on modern parallel architectures very large 3D simulations.
We describe in this section the software that we are developing. The first list will be the main milestones of our project. The other software developments will be conducted in collaboration with academic partners or in collaboration with some industrial partners in the context of their private R&D or production activities. For all these software developments, we will use first the various (very) large parallel platforms available through GENCI in France (CCRT, CINES and IDRIS Computational Centers), and next the high-end parallel platforms that will be available via European and US initiatives or projects such that PRACE.
MaPHyS (Massivelly Parallel Hybrid Solver) is a software package that implements a parallel linear solver coupling direct and iterative approaches. The underlying idea is to apply to general unstructured linear systems domain decomposition ideas developed for the solution of linear systems arising from PDEs. The interface problem, associated with the so called Schur complement system, is solved using a block preconditioner with overlap between the blocks that is referred to as Algebraic Additive Schwarz.
The MaPHyS package is very much a first outcome of the research activity described in Section . Finally, MaPHyS is a preconditioner that can be used to speed-up the convergence of any Krylov subspace method. We forsee to either embed in MaPHyS some Krylov solvers or to release them as standalone packages, in particular for the block variants that will be some outcome of the studies discussed in Section .
MaPHyS can be found at
http://
Complete and incomplete supernodal sparse parallel factorizations.
PaStiX (Parallel Sparse matriX package) is a scientific library that provides a high performance parallel solver for very large sparse linear systems based on block direct and block ILU(k) iterative methods. Numerical algorithms are implemented in single or double precision (real or complex): LLt (Cholesky), LDLt (Crout) and LU with static pivoting (for non symmetric matrices having a symmetric pattern).
The PaStiX library uses the graph partitioning and sparse matrix block ordering package Scotch. PaStiX is based on an efficient static scheduling and memory manager, in order to solve 3D problems with more than 50 million of unknowns. The mapping and scheduling algorithm handles a combination of 1D and 2D block distributions. This algorithm computes an efficient static scheduling of the block computations for our supernodal parallel solver which uses a local aggregation of contribution blocks. This can be done by taking into account very precisely the computational costs of the BLAS 3 primitives, the communication costs and the cost of local aggregations. We also improved this static computation and communication scheduling algorithm to anticipate the sending of partially aggregated blocks, in order to free memory dynamically. By doing this, we are able to reduce the aggregated memory overhead, while keeping good performance.
Another important point is that our study is suitable for any heterogeneous parallel/distributed architecture when its performance is predictable, such as clusters of multicore nodes. In particular, we now offer a high performance version with a low memory overhead for multicore node architectures, which fully exploits the advantage of shared memory by using an hybrid MPI-thread implementation.
Direct methods are numerically robust methods, but the very large three dimensional problems may lead to systems that would require a huge amount of memory despite any memory optimization. A studied approach consists in defining an adaptive blockwise incomplete factorization that is much more accurate (and numerically more robust) than the scalar incomplete factorizations commonly used to precondition iterative solvers. Such incomplete factorization can take advantage of the latest breakthroughs in sparse direct methods and particularly should be very competitive in CPU time (effective power used from processors and good scalability) while avoiding the memory limitation encountered by direct methods.
PaStiX is publicly available at
http://
Multilevel method, domain decomposition, Schur complement, parallel iterative solver.
HIPS (Hierarchical Iterative Parallel Solver) is a scientific library that provides an efficient parallel iterative solver for very large sparse linear systems.
The key point of the methods implemented in HIPS is to define an ordering and a partition of the unknowns that relies on a form of nested dissection ordering in which cross points in the separators play a special role (Hierarchical Interface Decomposition ordering). The subgraphs obtained by nested dissection correspond to the unknowns that are eliminated using a direct method and the Schur complement system on the remaining of the unknowns (that correspond to the interface between the sub-graphs viewed as sub-domains) is solved using an iterative method (GMRES or Conjugate Gradient at the time being). This special ordering and partitioning allows for the use of dense block algorithms both in the direct and iterative part of the solver and provides a high degree of parallelism to these algorithms. The code provides a hybrid method which blends direct and iterative solvers. HIPS exploits the partitioning and multistage ILU techniques to enable a highly parallel scheme where several subdomains can be assigned to the same process. It also provides a scalar preconditioner based on the multistage ILUT factorization.
HIPS can be used as a standalone program that reads a sparse linear system from a file ; it also provides an interface to be called from any C, C++ or Fortran code. It handles symmetric, unsymmetric, real or complex matrices. Thus, HIPS is a software library that provides several methods to build an efficient preconditioner in almost all situations.
HIPS is publicly available at
http://
MetaPart is a library that adresses the challenge of (dynamic) load balancing for emerging complex parallel simulations, such as multi-physics or multi-scale coupling applications. First, it offers a uniform API over state-of-the-art (hyper-) graph partitioning software packages such as Scotch, PaToH, METIS, Zoltan, Mondriaan, etc. etc. Based upon this API, it provides a framework that facilitates the development and the evaluation of high-level partitioning methods, such as MxN repartitioning or coupling-aware partitionining (co-partitioning).
The framework is publicy available at Inria Gforge:
http://
MPICPL (MPI CouPLing) is a software library dedicated to the coupling of parallel legacy codes, that are based on the well-known MPI standard. It proposes a lightweight and comprehensive programing interface that simplifies the coupling of several MPI codes (2, 3 or more). MPICPL facilitates the deployment of these codes thanks to the mpicplrun tool and it interconnects them automatically through standard MPI inter-communicators. Moreover, it generates the universe communicator, that merges the world communicators of all coupled-codes. The coupling infrastructure is described by a simple XML file, that is just loaded by the mpicplrun tool.
MPICPL was developed by HiePACS for the purpose
of the ANR NOSSI. It uses advanced features of MPI2 standard. The
framework is publicy available at Inria Gforge:
http://
ScalFMM (Parallel Fast Multipole Library for Large Scale Simulations) is a software library to simulate N-body interactions using the Fast Multipole Method.
ScalFMM intends to offer all the functionalities needed to perform large parallel simulations while enabling an easy customization of the simulation components: kernels, particles and cells. It works in parallel in a shared/distributed memory model using OpenMP and MPI. The software architecture has been designed with two major objectives: being easy to maintain and easy to understand. There are two main parts: 1) the management of the octree and the parallelization of the method ; 2) the kernels. This new architecture allows us to easily add new FMM algorithm or kernels and new paradigm of parallelization. The code is extremely documented and the naming convention fully respected. Driven by its user-oriented philosophy, ScalFMM is using CMAKE as a compiler/installer tool. Even if ScalFMM is written in C++ it will support a C and fortran API soon.
The library offers two methods to compute interactions between bodies when the potential decays like
The ScalFMM package is available at http://
Visualization, Execution trace
ViTE is a trace explorer. It is a tool made to visualize execution traces of large parallel programs. It supports Pajé, a trace format created by Inria Grenoble, and OTF and OTF2 formats, developed by the University of Dresden and allows the programmer a simpler way to analyse, debug and/or profile large parallel applications. It is an open source software licenced under CeCILL-A.
The ViTE software is available at
http://
In the same context we also contribute to the EZtrace and GTG
libraries in collaboration with F. Trahay from Telecom SudParis.
EZTrace (http://
For the materials physics applications, a lot of development will be done in the context of ANR projects (NOSSI and OPTIDIS, see Section ) in collaboration with LaBRI, CPMOH, IPREM, EPFL and with CEA Saclay and Bruyère-le-Châtel.
FAST
FAST is a linear response time dependent density functional program for computing the electronic absorption spectrum of molecular systems. It uses an
O(
OptiDis
OptiDis is a new code for large scale dislocation dynamics simulations. Its aim is to simulate real life dislocation densities (up until
The code is based on Numodis code developed at CEA Saclay and the ScalFMM library developed in our Inria project. The code is written in C++ language and using the last features of C++11. One of the main aspects is the hybrid parallelism MPI/OpenMP that gives the software the ability to scale on large cluster while the computation load rises. In order to achieve that, we use different levels of parallelism. First of all, the simulation box is spread over MPI processes, we then use a thinner level for threads, dividing the domain using an Octree representation. All theses parts are driven by the ScalFMM library. On the last level our data are stored in an adaptive structure absorbing dynamic of this kind of simulation and handling well task parallelism.
The two following packages are mainly designed and developed in the context of a US initiative led by ICL and to which we closely collaborate through the associate team MORSE.
PLASMA
The PLASMA (Parallel Linear Algebra for Scalable Multi-core Architectures) project aims at addressing the critical and highly disruptive situation that is facing the Linear Algebra and High Performance Computing community due to the introduction of multi-core architectures.
The PLASMA ultimate goal is to create software frameworks that enable programmers to simplify the process of developing applications that can achieve both high performance and portability across a range of new architectures.
The development of programming models that enforce asynchronous, out of order scheduling of operations is the concept used as the basis for the definition of a scalable yet highly efficient software framework for Computational Linear Algebra applications.
The PLASMA library is available at
http://
PaRSEC/DPLASMA
PaRSEC Parallel Runtime Scheduling and Execution Controller, is a generic framework for architecture aware scheduling and management of micro-tasks on distributed many-core heterogeneous architectures. Applications we consider can be expressed as a Direct Acyclic Graph of tasks with labeled edges designating data dependencies. DAGs are represented in a compact problem-size independent format that can be queried on-demand to discover data dependencies in a totally distributed fashion. PaRSEC assigns computation threads to the cores, overlaps communications and computations and uses a dynamic, fully-distributed scheduler based on architectural features such as NUMA nodes and algorithmic features such as data reuse.
The framework includes libraries, a runtime system, and development tools to help application developers tackle the difficult task of porting their applications to highly heterogeneous and diverse environments.
DPLASMA (Distributed Parallel Linear Algebra Software for Multicore Architectures) is the leading implementation of a dense linear algebra package for distributed heterogeneous systems. It is designed to deliver sustained performance for distributed systems where each node featuring multiple sockets of multicore processors, and if available, accelerators like GPUs or Intel Xeon Phi. DPLASMA achieves this objective through the state of the art PaRSEC runtime, porting the PLASMA algorithms to the distributed memory realm.
The PaRSEC runtime and the DPLASMA library are
available at
http://
PlaFRIM is an experimental platform for research in modeling, simulations and high performance computing. This platform has been set up from 2009 under the leadership of Inria Bordeaux Sud-Ouest in collaboration with computer science and mathematics laboratories, respectively Labri and IMB with a strong support in the region Aquitaine.
It aggregates different kinds of computational resources for research and development purposes. The latest technologies in terms of processors, memories and architecture are added when they are available on the market. It is now more than 1,000 cores (excluding GPU and Xeon Phi ) that are available for all research teams of Inria Bordeaux, Labri and IMB. This computer is in particular used by all the engineers who work in HiePACS and are advised by F. Rue from the SED.
The PlaFRIM platform initiative is coordinated by O. Coulaud.
In the context of HPC-PME initiative, we started a collaboration with ALGO'TECH INFORMATIQUE and we have organised one of the first PhD-consultant action implemented by Xavier Lacoste led by Pierre Ramet. ALGO’TECH is one of the most innovative SMEs (small and medium sized enterprises) in the field of cabling embedded systems, and more broadly, automatic devices. The main target of the project is to validate the possibility to use the sparse linear solvers of our team in the area of electromagnetic simulation tools developped by ALGO'TECH. This collaboration will be developed next year in the context of the European project FORSTISSIMO. The principal objective of FORTISSIMO is to enable European manufacturing, particularly SMEs, to benefit from the efficiency and competitive advantage inherent in the use of simulation.
As a conclusion of the OPTIDIS project we organized the first International Workshop on Dislocation Dynamics Simulations that was devoted to the latest developments realized worldwide in the field of Discrete Dislocation Dynamics simulations. This international event held in December 10th to the 12th at “Maison de la Simulation” in Saclay, France and attracted 55 participants from many different countries including England, Germany, France, USA, ... The workshop gathered most of the active researchers working on dislocation dynamics from numerical simulations to experimentatios. Thanks to the success of this workshop, a second one will be scheduled in England during 2016.
Enabling HPC applications to perform efficiently when invoking multiple parallel libraries simultaneously is a great challenge. Even if a uniform runtime system is used underneath, scheduling tasks or threads coming from different libraries over the same set of hardware resources introduces many issues, such as resource oversubscription, undesirable cache flushes or memory bus contention.
This work presents an extension of StarPU, a runtime system specifically designed for heterogeneous architectures, that allows multiple parallel codes to run concurrently with minimal interference. Such parallel codes run within scheduling contexts that provide confined execution environments which can be used to partition computing resources. Scheduling contexts can be dynamically resized to optimize the allocation of computing resources among concurrently running libraries. We introduce a hypervisor that automatically expands or shrinks contexts using feedback from the runtime system (e.g. resource utilization). We demonstrate the relevance of our approach using benchmarks invoking multiple high performance linear algebra kernels simultaneously on top of heterogeneous multicore machines. We show that our mechanism can dramatically improve the overall application run time (-34%), most notably by reducing the average cache miss ratio (-50%).
This work is developed in the framework of Andra Hugo's PhD. These contributions have been published in the international journal of High Performance Computing Applications .
This research activity has been conduced in the framework of the EADS-ASTRIUM, Inria, Conseil Régional initiative in collaboration with the RUNTIME Inria project, and is part of Benoit Lize's PhD.
Reverse Time Migration (RTM) technique produces underground images using wave propagation. A discretization based on the Discontinuous Galerkin (DG) method unleashes a massively parallel elastodynamics simulation, an interesting feature for current and future architectures. We have designed a task-based version of this scheme in order to enable the use of manycore architectures. At this stage, we have demonstrated the efficiency of the approach on homogeneous and cache coherent Non Uniform Memory Access (ccNUMA) multicore platforms (up to 160 cores) and designed a prototype version of a distributed memory version that can exploit multiple instances of such architectures. This work has been conducted in the context of the DIP Inria-Total strategic action in collaboration with the Magique3D Inria project and thanks to the long-term visit of George Bosilca funded by TOTAL. Geroge's expertise ensured an optimum usage of the PaRSEC runtime system onto which our task-based scheme has been ported.
This work was presented during HPCC conference as well as during a TOTAL scientific event .
For the solution of systems of linear equations, various recovery-restart strategies have been investigated in the framework of Krylov subspace methods to address the situations of core failures. The basic underlying idea is to recover fault entries of the iterate via interpolation from existing values available on neighbor cores. In that resilience framework, we have extended the recovey-restart ideas to the solution of linear eigenvalue problems. Contrary to the linear system case, not only the current iterate can be interpolated but also part of the subspace where candidate eigenpairs are searched.
This work is developed in the framework of Mawussi Zounon's PhD funded by the ANR RESCUE. These contributions have been presented in particuler at the international SIAM workshop on Exascale Applied Mathematics Challenges and Opportunities in Chicago and the Householder symposium in Spa. Notice that theses activities are also part of our contribution to the G8 ESC (Enabling Climate Simulation at extreme scale).
Accelerator-enhanced computing platforms have drawn a lot of attention due to their massive peak com-putational capacity. Despite significant advances in the programming interfaces to such hybrid architectures, traditional programming paradigms struggle mapping the resulting multi-dimensional heterogeneity and the expression of algorithm parallelism, resulting in sub-optimal effective performance. Task-based programming paradigms have the capability to alleviate some of the programming challenges on distributed hybrid many-core architectures. In this work we take this concept a step further by showing that the potential of task-based programming paradigms can be greatly increased with minimal modification of the underlying runtime combined with the right algorithmic changes. We propose two novel recursive algorithmic variants for one-sided factorizations and describe the changes to the PaRSEC task-scheduling runtime to build a framework where the task granularity is dynamically adjusted to adapt the degree of available parallelism and kernel efficiency according to runtime conditions. Based on an extensive set of results we show that, with one-sided factorizations, i.e. Cholesky and QR, a carefully written algorithm, supported by an adaptive tasks-based runtime, is capable of reaching a degree of performance and scalability never achieved before in distributed hybrid environments.
These contributions will be presented at the international conference IPDPS 2015 in Hyderabad.
The ongoing hardware evolution exhibits an escalation in the number, as well as in the heterogeneity, of the computing resources. The pressure to maintain reasonable levels of performance and portability, forces the application developers to leave the traditional programming paradigms and explore alternative solutions. PaStiX is a parallel sparse direct solver, based on a dynamic scheduler for modern hierarchical architectures. In this paper, we study the replacement of the highly specialized internal scheduler in PaStiX by two generic runtime frameworks: PaRSEC and StarPU. The tasks graph of the factorization step is made available to the two runtimes, providing them with the opportunity to optimize it in order to maximize the algorithm efficiency for a predefined execution environment. A comparative study of the performance of the PaStiX solver with the three schedulers - native PaStiX, StarPU and PaRSEC schedulers - on different execution contexts is performed. The analysis highlights the similarities from a performance point of view between the different execution supports. These results demonstrate that these generic DAG-based runtimes provide a uniform and portable programming interface across heterogeneous environments, and are, therefore, a sustainable solution for hybrid environments.
This work has been developed in the framework of Xavier Lacoste's PhD funded by the ANR ANEMOS. These contributions have been presented at the Heterogeneous Computing Workshop held jointly with the international conference IPDPS 2014 . Xavier Lacoste will defend his PhD in February 2015.
In the framework of the hybrid direct/iterative MaPHyS solver, we have designed and implemented an hybrid MPI-thread variant. More precisely, the implementation relies on the multi-threaded MKL library for all the dense linear algebra calculations and the multi-threaded version of PaStiX. Among the technical difficulties, one was to make sure that the two multi-threaded libraries do not interfere with each other. The resulting software prototype is currently experimented to study its new capability to get flexibility and trade-off between the parallel and numerical efficiency. Parallel experiments have been conducted on the Plafrim plateform as well as on a large scale machine located at the USA DOE NERSC, which has a large number of CPU cores per socket.
This work is developed in the framework of the PhD thesis of Stojce Nakov funded by TOTAL.
New hybrid LU-QR algorithms for solving dense linear systems of the
form
These contributions have been presented at the international conference IPDPS 2014 in Phoenix. An extended version has been submitted to JPDC journal.
Computing eigenpairs of a symmetric matrix is a problem arising in many industrial applications, including quantum physics and finite-elements computation for automobiles. A classical approach is to reduce the matrix to tridiagonal form before computing eigenpairs of the tridiagonal matrix. Then, a back-transformation allows one to obtain the final solution. Parallelism issues of the reduction stage have already been tackled in different shared-memory libraries. In this work, we focus on solving the tridiagonal eigenproblem, and we describe a novel implementation of the Divide and Conquer algorithm. The algorithm is expressed as a sequential task-flow, scheduled in an out-of-order fashion by a dynamic runtime which allows the programmer to play with tasks granularity. The resulting implementation is between two and five times faster than the equivalent routine from the INTEL MKL library, and outperforms the best MRRR implementation for many matrices.
These contributions will be presented at the international conference IPDPS 2015 in Hyderabad.
Last year we have worked primarily on developing an efficient fast multipole method for heterogeneous architecture. Some of the accomplishments for this year include:
implementation of some new features in the FMM library ScalFMM: adaptive variants of the Chebyshev and Lagrange interpolation based FMM kernels, multiple right-hand sides, generic tensorial nearfield...
The parallelization and the FMM core parts rely on ScalFMM (OpenMP/MPI) which has been updated all year round. Finally, ScalFMM offers two new shared memory parallelization strategies using OpenMP 4 and StarPU.
New fast algorithms for the computation of low rank approximations of matrices were implemented in a -soon to be- open-source C++ library. These algorithms are based on randomized techniques combined with standard matrix decompositions (such as QR, Cholesky and SVD). The main contribution of this work is that we make use of ScalFMM parallel library in order to power the large amount of matrix to vector products involved in the algorithms. Applications to the fast generation of Gaussian random fields were adressed. Our methods compare good with the existing ones based on Cholesky or FFT and potentially outpass their performances for specific distributions. We are currently in the process of writing a paper on that topic. Extensions to fast Kalman filtering is now considered. This work is done in collaboration with Eric Darve (Stanford, Mechanical Engineering) in the context of the associate team FastLA.
The Time-domain Boundary Element Method (TD-BEM) has not been widely studied but represents an interesting alternative to its frequency counterpart. Usually based on inefficient Sparse Matrix Vector-product (SpMV), we investigate other approaches in order to increase the sequential flop-rate. We present a novel approach based on the re-ordering of the interaction matrices in slices. We end up with a custom multi-vectors/vector product operation and compute it using SIMD intrinsic functions. We take advantage of the new order of the computation to parallelize in shared and distributed memory. We demonstrate the performance of our system by studying the sequential Flop-rate and the parallel scalability, and provide results based on an industrial test-case with up to 32 nodes , . From the middle of year 2014, we started working on the TD FMM for the BEM problem. A non optimized version is able to solve the TD BEM with the FMM on parallel distributed nodes. All the implementations should be in high quality in the Software Engineering sense since the resulting library is going to be used by industrial applications.
This work is developed in the framework of Bérenger Bramas's PhD and contributes to the EADS-ASTRIUM, Inria, Conseil Régional initiative.
In the field of scientific computing, load balancing is a major issue that determines the performance of parallel applications. Nowadays, simulations of real-life problems are becoming more and more complex, involving numerous coupled codes, representing different models. In this context, reaching high performance can be a great challenge. In the PhD of Maria Predari (started in october 2013), we develop new graph partitioning techniques, called co-partitioning, that address the problem of load balancing for two coupled codes: the key idea is to perform a "coupling-aware" partitioning, instead of partitioning these codes independently, as it is usually done. More precisely, we propose to enrich the classic graph model with interedges, that represent the coupled code interactions. We describe two new algorithms, called AWARE and PROJREPART, and compare them to the currently used approach (called NAIVE). In recent experimental results, we notice that both AWARE and PROJREPART algorithms succeed to balance the computational load in the coupling phase and in some cases they succeed to reduce the coupling communications costs. Surprisingly we notice that our algorithms do not degrade the global graph edgecut, despite the additional constraints that they impose. In future work, we aim at validating our results on real-life cases in the field of aeronautic propulsion. In order to achieve that, we plan to integrate our algorithms within the Scotch framework. Finally, our algorithms should be implemented in parallel and should be extended in order to manage more complex applications with more than two interacting models.
Nested Dissection has been introduced by A. George and is a very popular
heuristic for sparse matrix ordering before numerical factorization. It allows to maximize
the number of parallel tasks, while reducing the fill-in and the operation count.
The basic standard idea is to build a "small separator"
This work is developed in the framework of Astrid Casadei's PhD. These contributions have been presented at the international conference HIPC 2014 in Goa.
Various optimizations have been performed in the Dislocation Dynamics code OptiDis for the long-ranged isotropic elastic force and energy models using a Fast Fourier based Fast Multipole Method (also known as Uniform FMM). Furthermore the anisotropic elastic force model was implemented using spherical harmonics expansions of angular functions known as Stroh matrices. Optimizations with respect to the crystallographic symmetries were also considered. Once the corresponding semi-analytic formulae for the force field are derived this method should compare well with existing approaches based on expanding the anisotropic elastic Green's function.
This year we have focused on the improvements of our hybrid MPI-OpenMP parallelism of the OptiDis code. More precisely, we have continued the development of the cache-conscious data structure to manage efficiently large set of data (segments and nodes) during all the steps of the algorithm. Moreover, we have tuned and improved our hybrid MPI-OpenMP parallelism to run simulations with large number of radiation induced defects forming our dislocation network. To obtain a good scalability, we have introduced a better load balancing at thread level as well as process level. By combining efficient data structure and hybrid parallelism we obtained a speedup of 112 on 160 cores for a simulation of half a million of segments.
These contributions have been presented in minisymposia at the 11th World Congress on Computational Mechanics , 7th MMM International Conference on Multiscale Materials Modeling , and at the International Workshop on DD simulations .
This work is developed in the framework of the ANR OPTIDIS.
The last contribution of Xavier Lacoste's thesis deals with the integration of our work in JOREK, a production controlled plasma fusion simulation code from CEA Cadarache. We described a generic finite element oriented distributed matrix assembly and solver management API. The goal of this API is to optimize and simplify the construction of a distributed matrix which, given as an input to PaStiX, can improve the memory scaling of the application. Experiments exhibit that using this API we could reduce the memory consumption by moving to a distributed matrix input and improve the performance of the factorized matrix assembly by reducing the volume of communication. All this study is related to PaStiX integration inside JOREK but the same API could be used to produce a distributed assembly for another solver or/and another finite elements based simulation code.
Concerning the GYSELA global non-linear electrostatic code, the efforts during the period have concentrated on predicting memory requirement and on the gyroaverage operator.
The Gysela program uses a mesh of 5 dimensions of the phase space (3 dimensions in configuration space and 2 dimensions in velocity space). On the large cases, the memory consumption already reaches the limit of the available memory on the supercomputers used in production (Tier-1 and Tier-0 typically). Furthermore, to implement the next features of Gysela (e.g. adding kinetic electrons in addition to ions), the needs of memory will dramatically increase, the main unknown will represents hundreds of TB. In this context, two tools were created to analyze and decrease the memory consumption. The first one is a tool that plots the memory consumption of the code during a run. This tool helps the developer to localize where the memory peak is located. The second tool is a prediction tool to compute the peak memory in offline mode (for production use mainly). A post processing stage combined with some specific traces generated on purpose during runtime allow the analysis of the memory consumption. Low-level primitives are called to generate these traces and to model memory consumption : they are included in the libMTM library (Modeling and Tracing Memory). Thanks to this work on memory consumption modeling, we have decreased the memory peak of the GYSELA code up to 50 % on a large case using 32,768 cores and memory scalability improvement has been shown using these tools up to 65k cores.
The main unknown of the Gysela is a distribution function that represents either the density of the guiding centers, either the density of the particles in a tokamak (depending of the location in the code). The switch between these two representations is done thanks to the gyroaverage operator. In the actual version of Gysela, the computation of this operator is achieved thanks to the so-called Padé approximation. In order to improve the precision of the gyroaveraging, a new implementation based on interpolation methods has been done (mainly by researchers from the Inria Tonus project-team and IPP Garching). We have performed the integration of this new implementation in GYSELA and also some parallel benchmarks. However, the new gyroaverage operator is approximatively 10 times slower than the original one. Investigations and optimizations on this operator are still a work in progress.
This work is carried on in the framework of Fabien Rozar's PhD in collaboration with CEA Cadarache.
High-fidelity nuclear power plant core simulations require solving the Boltzmann transport equation. In discrete ordinate methods, the most computationally demanding operation of this equation is the sweep operation. Considering the evolution of computer architectures, we propose in this work, as a first step toward heterogeneous distributed architectures, a hybrid parallel implementation of the sweep operation on top of the generic task-based runtime system: PaRSEC. Such an implementation targets three nested levels of parallelism: message passing, multi-threading, and vectorization. A theoretical performance model was designed to validate the approach and help the tuning of the multiple parameters involved in such an approach. The proposed parallel implementation of the Sweep achieves a sustained performance of 6.1 Tflop/s, corresponding to 33.9% of the peak performance of the targeted supercomputer. This implementation compares favorably with state-of-art solvers such as PARTISN; and it can therefore serve as a building block for a massively parallel version of the neutron transport solver DOMINO developed at EDF.
Preliminary results have been presented at the international HPCC workshop on HPC-CFD in Energy/Transport Domains in Paris. The main contribution will be presented at the international conference IPDPS 2015 in Hyderabad.
In the first part of our research work concerning the parallel aerodynamic code FLUSEPA, a first OpenMP-MPI version based on the previous one has been developped. By using an hybrid approach based on a domain decomposition, we achieved a faster version of the code and the temporal adaptive method used without bodies in relative motion has been tested successfully for real complex 3D-cases using up to 400 cores. Moreover, an asynchronous strategy for computing bodies in relative motion and mesh intersections has been developed and has been used for actual 3D-cases. A journal article (for JCP) to sum-up this part of the work is under redaction and a presentation at ISC at the "2nd International Workshop on High Performance Computing Simulation in Energy/Transport Domains" on July 2015 is scheduled.
This intermediate version exhibited synchronization problems for the aerodynamic solver due to the time integration used by the code. To tackle this issue, a task-based version over the runtime system StarPU is currently under development and evaluation. This year was mainly devoted to the realisation of this version. Task generation function have been designed in order to maximize asynchronism in execution. Those functions respect the data pattern access of the code and led to the refactorization of the actual kernels. A task-based version is now available for the aerodynamic solver and is available for both shared and distributed memory. This work will be presented as a poster during the SIAM CSE'15 conference and we are in the process to submit a paper in the Parallel CFD'15 conference.
The next steps will be to validate the correction of this task-based version and to work on the performance of this new version on actual cases. Later, the task description should be extended to the motion and intersection operations.
This work is carried on in the framework of Jean-Marie Couteyen's PhD in collaboration with Airbus Defence and Space Les Mureaux.
Airbus Defence and Space research and development contract:
Design of a parallel version of the FLUSEPA software (Jean-Marie Couteyen (PhD); Pierre Brenner, Jean Roman).
CEA DPTA research and development contract:
Olivier merci de compléter, lien avec Runtime
CEA-CESTA research and development contract:
Performance analysis of the recent improvements in PaStiX sparse direct solver for matrices coming from different applications developped at CEA-CESTA.
CEA Cadarache (ITER) research and development contract:
Peta and exaflop algorithms for turbulence simulations of fusion plasmas (Fabien Rozar (PhD); Guillaume Latu, Jean Roman).
EDF R & D - SINETICS research and development contract:
Design of a massively parallel version of the SN method for neutronic simulations (Moustapha Salli (PhD); Mathieu Faverge, Pierre Ramet, Jean Roman).
TOTAL research and development contracts:
Parallel hybrid solver for massivelly heterogeneoux manycore platforms (Stojce Nakov (PhD); Emmanuel Agullo, Luc Giraud, Abdou Guermouche, Jean Roman).
Airbus Group Innovations research and development contract:
Design and implementation of FMM and block Krylov solver for BEM applications. The HiBox project is led by the SME IMACS and funded by the DGA Rapid programme.
.
Grant: Regional council
Dates: 2013 – 2015
Partners: EPIs REALOPT, RUNTIME from Inria Bordeaux Sud-Ouest, CEA-CESTA and l’Institut pluridisciplinaire de recherche sur l'environnement et les matériaux (IPREM) .
Overview: Numerical simulation is now integrated into all the design levels and the scientific studies for both academic and industrial contexts. Given the increasing size and sophistication of the simulations carried out, the use of parallel computing is inescapable. The complexity of such achievements requires collaboration of multidisciplinary teams capable of mastering all the necessary scientific skills for each component constituting the chain of expertise. In this project we consider each of these elements as well as efficient methods for parallel codes coupling. All these works are intended to contribute to the design of large scale parallel multi-physics simulations. In addition to this research human activities the regional council also support some innovative computing equipment that will be embedded in the PlaFRIM experimental plateform, project led by O. Coulaud.
Since January 2013, the team is participating to the C2S@Exa Inria Project Lab (IPL). This national initiative aims at the development of numerical modeling methodologies that fully exploit the processing capabilities of modern massively parallel architectures in the context of a number of selected applications related to important scientific and technological challenges for the quality and the security of life in our society. At the current state of the art in technologies and methodologies, a multidisciplinary approach is required to overcome the challenges raised by the development of highly scalable numerical simulation software that can exploit computing platforms offering several hundreds of thousands of cores. Hence, the main objective of C2S@Exa is the establishment of a continuum of expertise in the computer science and numerical mathematics domains, by gathering researchers from Inria project-teams whose research and development activities are tightly linked to high performance computing issues in these domains. More precisely, this collaborative effort involves computer scientists that are experts of programming models, environments and tools for harnessing massively parallel systems, algorithmists that propose algorithms and contribute to generic libraries and core solvers in order to take benefit from all the parallelism levels with the main goal of optimal scaling on very large numbers of computing entities and, numerical mathematicians that are studying numerical schemes and scalable solvers for systems of partial differential equations in view of the simulation of very large-scale problems.
.
Grant: ANR-MONU
Dates: 2013 – 2017
Partners: Inria (REALOPT, RUNTIME Bordeaux Sud-Ouest et ROMA Rhone-Alpes), IRIT/INPT, CEA-CESTA et Airbus Group Innovations.
Overview:
During the last five years, the interest of the scientific computing community towards accelerating devices has been rapidly growing. The reason for this interest lies in the massive computational power delivered by these devices. Several software libraries for dense linear algebra have been produced; the related algorithms are extremely rich in computation and exhibit a very regular pattern of access to data which makes them extremely good candidates for GPU execution. On the contrary, methods for the direct solution of sparse linear systems have irregular, indirect memory access patterns that adversely interact with typical GPU throughput optimizations.
This project aims at studying and designing algorithms and parallel programming models for implementing direct methods for the solution of sparse linear systems on emerging computer equipped with accelerators. The ultimate aim of this project is to achieve the implementation of a software package providing a solver based on direct methods for sparse linear systems of equations. To date, the approaches proposed to achieve this objective are mostly based on a simple offloading of some computational tasks to the accelerators and rely on fine hand-tuning of the code and accurate performance modeling to achieve efficiency. This project proposes an innovative approach which relies on the efficiency and portability of runtime systems. The development of a production-quality, sparse direct solver requires a considerable research effort along three distinct axes:
linear algebra: algorithms have to be adapted or redesigned in order to exhibit properties that make their implementation and execution on heterogeneous computing platforms efficient and reliable. This may require the development of novel methods for defining data access patterns that are more suitable for the dynamic scheduling of computational tasks on processing units with considerably different capabilities as well as techniques for guaranteeing a reliable and robust behavior and accurate solutions. In addition, it will be necessary to develop novel and efficient accelerator implementations of the specific dense linear algebra kernels that are used within sparse, direct solvers;
runtime systems: tools such as the StarPU runtime system proved to be extremely efficient and robust for the implementation of dense linear algebra algorithms. Sparse linear algebra algorithms, however, are commonly characterized by complicated data access patterns, computational tasks with extremely variable granularity and complex dependencies. Therefore, a substantial research effort is necessary to design and implement features as well as interfaces to comply with the needs formalized by the research activity on direct methods;
scheduling: executing a heterogeneous workload with complex dependencies on a heterogeneous architecture is a very challenging problem that demands the development of effective scheduling algorithms. These will be confronted with possibly limited views of dependencies among tasks and multiple, and potentially conflicting objectives, such as minimizing the makespan, maximizing the locality of data or, where it applies, minimizing the memory consumption.
Given the wide availability of computing platforms equipped with accelerators and the numerical robustness of direct solution methods for sparse linear systems, it is reasonable to expect that the outcome of this project will have a considerable impact on both academic and industrial scientific computing. This project will moreover provide a substantial contribution to the computational science and high-performance computing communities, as it will deliver an unprecedented example of a complex numerical code whose parallelization completely relies on runtime scheduling systems and which is, therefore, extremely portable, maintainable and evolvable towards future computing architectures.
.
Grant: ANR 11 INFRA 13
Dates: 2011 – 2015
Partners: Inria (Bordeaux Sud-Ouest, Nancy - Grand Est, Rhone-Alpes, Sophia Antipolis - Méditerranée), I3S, LSIIT
Overview:
The last decade has brought tremendous changes to the characteristics of large scale distributed computing platforms. Large grids processing terabytes of information a day and the peer-to-peer technology have become common even though understanding how to efficiently exploit such platforms still raises many challenges. As demonstrated by the USS SimGrid project funded by the ANR in 2008, simulation has proved to be a very effective approach for studying such platforms. Although even more challenging, we think the issues raised by petaflop/exaflop computers and emerging cloud infrastructures can be addressed using similar simulation methodology.
The goal of the SONGS project is to extend the applicability of the SimGrid simulation framework from Grids and Peer-to-Peer systems to Clouds and High Performance Computation systems. Each type of large-scale computing system will be addressed through a set of use cases and lead by researchers recognized as experts in this area.
Any sound study of such systems through simulations relies on the following pillars of simulation methodology: Efficient simulation kernel; Sound and validated models; Simulation analysis tools; Campaign simulation management.
.
Grant: ANR-MN
Dates: 2012 – 2016
Partners: Univ. Nice, CEA/IRFM, CNRS/MDS.
Overview: The main goal of the project is to make a significant progress in understanding of active control methods of plasma edge MHD instabilities Edge Localized Modes (ELMs) wich represent particular danger with respect to heat and particle loads for Plasma Facing Components (PFC) in ITER. The project is focused in particular on the numerical modelling study of such ELM control methods as Resonant Magnetic Perturbations (RMPs) and pellet ELM pacing both foreseen in ITER. The goals of the project are to improve understanding of the related physics and propose possible new strategies to improve effectiveness of ELM control techniques. The tool for the non-linear MHD modeling is the JOREK code which was essentially developed within previous ANR ASTER. JOREK will be largerly developed within the present project to include corresponding new physical models in conjunction with new developments in mathematics and computer science strategy. The present project will put the non-linear MHD modeling of ELMs and ELM control on the solid ground theoretically, computationally, and applications-wise in order to progress in urgently needed solutions for ITER.
Regarding our contributions, the JOREK code is mainly composed of numerical computations on 3D data. The toroidal dimension of the tokamak is treated in Fourier space, while the poloidal plane is decomposed in Bezier patches. The numerical scheme used involves a direct solver on a large sparse matrix as a main computation of one time step. Two main costs are clearly identified: the assembly of the sparse matrix, and the direct factorization and solve of the system that includes communications between all processors. The efficient parallelization of JOREK is one of our main goals, to do so we will reconsider: data distribution, computation distribution or GMRES implementation. The quality of the sparse solver is also crucial, both in term of performance and accuracy. In the current release of JOREK, the memory scaling is not satisfactory to solve problems listed above , since at present as one increases the number of processes for a given problem size, the memory footprint on each process does not reduce as much as one can expect. In order to access finer meshes on available supercomputers, memory savings have to be done in the whole code. Another key point for improving parallelization is to carefully profile the application to understand the regions of the code that do not scale well. Depending on the timings obtained, strategies to diminish communication overheads will be evaluated and schemes that improve load balancing will be initiated. JOREK uses PaStiX sparse matrix library for matrix inversion. However, large number of toroidal harmonics and particular thin structures to resolve for realistic plasma parameters and ITER machine size still require more aggressive optimisation in numeric dealing with numerical stability, adaptive meshes etc. However many possible applications of JOREK code we proposed here which represent urgent ITER relevant issues related to ELM control by RMPs and pellets remain to be solved.
.
Grant: ANR-COSINUS
Dates: 2010 – 2014
Partners: CEA/DEN/DMN/SRMA (leader), SIMaP Grenoble INP and ICMPE / Paris-Est.
Overview: Plastic deformation is mainly accommodated by dislocations glide in the case of crystalline materials. The behavior of a single dislocation segment is perfectly understood since 1960 and analytical formulations are available in the literature. However, to understand the behavior of a large population of dislocations (inducing complex dislocations interactions) and its effect on plastic deformation, massive numerical computation is necessary. Since 1990, simulation codes have been developed by French researchers. Among these codes, the code TRIDIS developed by the SIMAP laboratory in Grenoble is the pioneer dynamic dislocation code. In 2007, the project called NUMODIS had been set up as team collaboration between the SIMAP and the SRMA CEA Saclay in order to develop a new dynamics dislocation code using modern computer architecture and advanced numerical methods. The objective was to overcome the numerical and physical limits of the previous code TRIDIS. The version NUMODIS 1.0 came out in December 2009, which confirms the feasibility of the project. The project OPTIDIS is initiated when the code NUMODIS is mature enough to consider parallel computation. The objective of the project is to develop and validate the algorithms in order to optimize the numerical and performance efficiency of the NUMODIS code. We are aiming at developing a code able to tackle realistic material problems such as the interaction between dislocations and irradiation defects in a grain plastic deformation after irradiation. These kinds of studies where “local mechanisms" are correlated with macroscopic behavior is a key issue for nuclear industry in order to understand material aging under irradiation, and hence predict power plant secured service life. To carry out such studies, massive numerical optimizations of NUMODIS are required. They involve complex algorithms lying on advanced computational science methods. The project OPTIDIS will develop through joint collaborative studies involving researchers specialized in dynamics dislocations and in numerical methods. This project is divided in 8 tasks over 4 years. Two PhD theses will be directly funded by the project. One will be dedicated to numerical development, validation of complex algorithms and comparison with the performance of existing dynamics dislocation codes. The objective of the second is to carry out large scale simulations to validate the performance of the numerical developments made in OPTIDIS. In both cases, these simulations will be compared with experimental data obtained by experimentalists.
.
Grant: ANR-Blanc (computer science theme)
Dates: 2010 – 2015
Partners: Inria EPI ROMA (leader) and GRAND LARGE.
Overview: The advent of exascale machines will help solve new scientific challenges only if the resilience of large scientific applications deployed on these machines can be guaranteed. With 10,000,000 core processors, or more, the time interval between two consecutive failures is anticipated to be smaller than the typical duration of a checkpoint, i.e., the time needed to save all necessary application and system data. No actual progress can then be expected for a large-scale parallel application. Current fault-tolerant techniques and tools can no longer be used. The main objective of the RESCUE project is to develop new algorithmic techniques and software tools to solve the exascale resilience problem. Solving this problem implies a departure from current approaches, and calls for yet-to-be-discovered algorithms, protocols and software tools.
This proposed research follows three main research thrusts. The first thrust deals with novel checkpoint protocols. This thrust will include the classification of relevant fault categories and the development of a software package for fault injection into application execution at runtime. The main research activity will be the design and development of scalable and light-weight checkpoint and migration protocols, with on-the-fly storing of key data, distributed but coordinated decisions, etc. These protocols will be validated via a prototype implementation integrated with the public-domain MPICH project. The second thrust entails the development of novel execution models, i.e., accurate stochastic models to predict (and, in turn, optimize) the expected performance (execution time or throughput) of large-scale parallel scientific applications. In the third thrust, we will develop novel parallel algorithms for scientific numerical kernels. We will profile a representative set of key large-scale applications to assess their resilience characteristics (e.g., identify specific patterns to reduce checkpoint overhead). We will also analyze execution trade-offs based on the replication of crucial kernels and on decentralized ABFT (Algorithm-Based Fault Tolerant) techniques. Finally, we will develop new numerical methods and robust algorithms that still converge in the presence of multiple failures. These algorithms will be implemented as part of a software prototype, which will be evaluated when confronted with realistic faults generated via our fault injection techniques.
We firmly believe that only the combination of these three thrusts (new checkpoint protocols, new execution models, and new parallel algorithms) can solve the exascale resilience problem. We hope to contribute to the solution of this critical problem by providing the community with new protocols, models and algorithms, as well as with a set of freely available public-domain software prototypes.
.
Grant: ANR-Blanc (applied math theme)
Dates: 2010 – 2014
Partners: Institut de Mathématiques de Toulouse (leader); Laboratoire d'Analyse, Topologie, Probabilités in Marseilles; Institut de Recherche sur la Fusion Magnétique, CEA/IRFM and HiePACS.
Overview: This project regards the study and the development of a new class of numerical methods to simulate natural or laboratory plasmas and in particular magnetic fusion processes. In this context, we aim at giving a contribution, from the mathematical, physical and algorithmic point of view, to the ITER project.
The core of this project consists in the development, the analysis, the implementation and the testing on real physical problems of the so-called Asymptotic-Preserving methods which allow simulations over a large range of scales with the same model and numerical method. These methods represent a breakthrough with respect to the state-of-the art. They will be developed specifically to handle the various challenges related to the simulation of the ITER plasma. In parallel with this class of methodologies, we intend to design appropriate coupling techniques between macroscopic and microscopic models for all the cases in which a net distinction between different regimes can be done. This will permit to describe different regimes in different regions of the machine with a strong gain in term of computational efficiency, without losing accuracy in the description of the problem. We will develop full 3-D solver for the asymptotic preserving fluid as well as kinetic model. The Asymptotic-Preserving (AP) numerical strategy allows us to perform numerical simulations with very large time and mesh steps and leads to impressive computational saving. These advantages will be combined with the utilization of the last generation preconditioned fast linear solvers to produce a software with very high performance for plasma simulation. For HiePACS this project provides in particular a testbed for our expertise in parallel solution of large linear systems.
.
Grant: ANR-14‐CE23‐0005
Dates: 2014 – 2018
Partners: Inria EPI Pomdapi (leader); Université Paris 13 - Laboratoire Analyse, Géométrie et Applications; Maison de la Simulation; Andra.
Overview: Project DEDALES aims at developing high performance software for the simulation of two phase flow in porous media. The project will specifically target parallel computers where each node is itself composed of a large number of processing cores, such as are found in new generation many-core architectures. The project will be driven by an application to radioactive waste deep geological disposal. Its main feature is phenomenological complexity: water-gas flow in highly heterogeneous medium, with widely varying space and time scales. The assessment of large scale model is of major importance and issue for this application, and realistic geological models have several million grid cells. Few, if at all, software codes provide the necessary physical features with massively parallel simulation capabilities. The aim of the DEDALES project is to study, and experiment with, new approaches to develop effective simulation tools with the capability to take advantage of modern computer architectures and their hierarchical structure. To achieve this goal, we will explore two complementary software approaches that both match the hierarchical hardware architecture: on the one hand, we will integrate a hybrid parallel linear solver into an existing flow and transport code, and on the other hand, we will explore a two level approach with the outer level using (space time) domain decomposition, parallelized with a distributed memory approach, and the inner level as a subdomain solver that will exploit thread level parallelism. Linear solvers have always been, and will continue to be, at the center of simulation codes. However, parallelizing implicit methods on unstructured meshes, such as are required to accurately represent the fine geological details of the heterogeneous media considered, is notoriously difficult. It has also been suggested that time level parallelism could be a useful avenue to provide an extra degree of parallelism, so as to exploit the very large number of computing elements that will be part of these next generation computers. Project DEDALES will show that space-time DD methods can provide this extra level, and can usefully be combined with parallel linear solvers at the subdomain level. For all tasks, realistic test cases will be used to show the validity and the parallel scalability of the chosen approach. The most demanding models will be at the frontier of what is currently feasible for the size of models.
Type: FP7
Defi: Special action
Instrument: Specific Targeted Research Project
Objectif: Exascale computing platforms, software and applications
Duration: September 2013 - August 2016
Coordinator: IMEC, Belgium
Partner: Particular specializations and experience of the partners are:
Applications:
NAG - long experience in consultancy for HPC applications
Intel France - collaboration with industry on the migration of software for future HPC systems
TS-SFR - long experience in consultancy for HPC applications in Aerospace and Oil & Gas
Algorithms – primarily numerical:
UA - broad experience in numerical solvers, with some taken up by the PETSc numerical library and other work published in high-ranking journals such as Science.
USI - expertise in parallel many-core algorithms for real-world applications on emerging architectures
Inria - expertise on large scale parallel numerical algorithms
IT4I - experience in the development of scalable solvers for large HPC systems (e.g. PRACE)
Programming Models & Runtime Environments:
Imec - leads the programming model research within the Flanders ExaScience Lab
UVSQ - specialized in code optimization and performance evaluation in the area of HPC
TS-SFR - leading the BMBF funded GASPI project
Fraunhofer - developed a GASPI runtime environment used in industrial applications
Hardware Optimization:
Intel France - investigates workloads for new hardware architectures within the context of the Exascale Computing Research centre
Inria contact: Luc Giraud
Abstract: The EXA2CT project brings together experts at the cutting edge of the development of solvers, related algorithmic techniques, and HPC software architects for programming models and communication. We will produce modular open source proto-applications that demonstrate the algorithms and programming techniques developed in the project, to help boot-strap the creation of genuine exascale codes.
Numerical simulation is a crucial part of science and industry in Europe. The advancement of simulation as a discipline relies on increasingly compute intensive models that require more computational resources to run. This is the driver for the evolution to exascale. Due to limits in the increase in single processor performance, exascale machines will rely on massive parallelism on and off chip, with a complex hierarchy of resources. The large number of components and the machine complexity introduce severe problems for reliability and programmability.
We are involved in the Inria@SiliconValley initiative through the associate team FASTLA described below.
Title: Matrices Over Runtime Systems @ Exascale
International Partner (Institution - Laboratory - Researcher):
KAUST Supercomputing Laboratory (ÉTATS-UNIS)
Duration: 2014 - 2016
See also: http://
The goal of Matrices Over Runtime Systems at Exascale (MORSE) project is to design dense and sparse linear algebra methods that achieve the fastest possible time to an accurate solution on large-scale multicore systems with GPU accelerators, using all the processing power that future high end systems can make available. To develop software that will perform well on petascale and exascale systems with thousands of nodes and millions of cores, several daunting challenges have to be overcome, both by the numerical linear algebra and the runtime system communities. By designing a research framework for describing linear algebra algorithms at a high level of abstraction,the MORSE team will enable the strong collaboration between research groups in linear algebra, runtime systems and scheduling needed to develop methods and libraries that fully benefit from the potential of future large-scale machines. Our project will take a pioneering step in the effort to bridge the immense software gap that has opened up in front of the High-Performance Computing (HPC) community.
Title: Fast and Scalable Hierarchical Algorithms for Computational Linear Algebra
International Partner (Institution - Laboratory - Researcher):
Stanford University (ÉTATS-UNIS)
Lawrence Berkeley National Laboratory (ÉTATS-UNIS)
Duration: 2014 - 2016
See also: http://
In this project, we propose to study fast and scalable
hierarchical numerical kernels and their implementations on
heterogeneous manycore platforms for two major computational
kernels in intensive challenging applications. Namely, fast
multipole methods (FMM) and sparse hybrid linear solvers, that
appear in many intensive numerical simulations in computational
sciences. Regarding the FMM we plan to study novel generic
formulations based on
We are involved in the Inria-CNPq HOSCAR project led by Stéphane Lanteri.
The general objective of the project is to setup a multidisciplinary Brazil-France collaborative effort for taking full benefits of future high-performance massively parallel architectures. The targets are the very large-scale datasets and numerical simulations relevant to a selected set of applications in natural sciences: (i) resource prospection, (ii) reservoir simulation, (iii) ecological modeling, (iv) astronomy data management, and (v) simulation data management. The project involves computer scientists and numerical mathematicians divided in 3 fundamental research groups: (i) numerical schemes for PDE models (Group 1), (ii) scientific data management (Group 2), and (iii) high-performance software systems (Group 3).
An annual meeting has been organized in Gramado, Brazil on September, 2014.
Title: Enabling Climate Simulations at Extreme Scale
Inria principal investigator: Luc Giraud
International Partners (Institution - Researcher):
Univ. Illinois at Urbanna Champaign & Argonne National Lab. - Franck Cappello,
Univ. Tennessee at Knoxville - George Bosilca,
German Research School for Simulation Sciences - Felix Wolf,
Univ. Victoria - Andrew Weaver,
Titech - Satoshi Matsuoka,
Univ. Tsukuba - Mitsuhisa Sato,
NCAR - Rich Loft,
Barcelona Supercomputing Center - Jesus Labarta.
Duration: 2011 - 2014
See also: G8 ESC-Enabling Climate Simulations at Extreme Scale
Exascale systems will allow unprecedented reduction of the uncertainties in climate change predictions via ultra-high resolution models, fewer simplifying assumptions, large climate ensembles and simulation at a scale needed to predict local effects. This is essential given the cost and consequences of inaction or wrong actions about climate change. To achieve this, we need careful co-design of future exascale systems and climate codes, to handle lower reliability, increased heterogeneity, and increased importance of locality. Our effort will initiate an international collaboration of climate and computer scientists that will identify the main roadblocks and analyze and test initial solutions for the execution of climate codes at extreme scale. This work will provide guidance to the future evolution of climate codes. We will pursue research projects to handle known roadblocks on resilience, scalability, and use of accelerators and organize international, interdisciplinary workshops to gather and disseminate information. The global nature of the climate challenge and the magnitude of the task strongly favor an international collaboration. The consortium gathers senior and early career researchers from USA, France, Germany, Spain, Japan and Canada and involves teams working on four major climate codes (CESM1, EC-EARTH, ECSM, NICAM).
Olivier Coulaud has been member of the organizing committee of the first International Workshop on Dislocation Dynamics Simulations that was devoted to the latest developments realized worldwide in the field of Discrete Dislocation Dynamics simulations. This international event held in December 10th to the 12th at “Maison de la Simulation” in Saclay, France and attended by 55 participants.
Mathieu Faverge has been member of the technical program committee of the international conference HiPC'14.
Luc Giraud has been member of the scientific program committee of the international conferences HiPC'14, ICS'14, IPDPS'14, VecPar'14, PDCN'14.
Jean Roman has been member of the scientific program committee of the international conference IEEE PDP'14.
Luc Giraud has been involved in the first round of ANR evaluation and has performed reviewing for PRACE.
Furthermore, the HiePACS members have contributed to the reviewing process of several international conferences: IEEE HiPC 2014, CCGRID 2015, IEEE IPDPS 2015, IEEE PDP 2014, ....
Luc Giraud is member of the SIAM J. Matrix Analysis and Applications.
The HiePACS members have contributed to the reviewing process of several international journals (ACM Trans. on Mathematical Software, IEEE Trans. on Parallel and Distributed Systems, Journal of Engineering Mathematics, Parallel Computing, SIAM J. Matrix Analysis and Applications , SIAM J. Scientific Comp., ...).
Undergraduate level/Licence
A. Esnard: Operating system programming, 36h, University Bordeaux I; Using network, 23h, University Bordeaux I.
He is also in charge of the computer science certificate for Internet (C2i) at the University Bordeaux I.
M. Faverge: Programming Environment, 26h, L3; Numerical Algorithmic, 30h, L3; C Projects, 20h, L3, ENSEIRB-MatMeca, France
P. Ramet: System programming 24h, Databases 32h, Objet programming 48h, Distributed programming 32h, Cryptography 32h at Bordeaux University.
Post graduate level/Master
O. Coulaud: Paradigms for parallel computing, 28h, ENSEIRB-MatMeca, Talence; Code coupling, 6h, ENSEIRB-MatMeca, Talence.
E. Agullo: Operating sysems, 24h, University Bordeaux I; Dense linear algebra kernels, 8h, ENSEIRB-MatMeca; Numerical Algorithms, 30h; ENSEIRB-MatMeca, Talence.
A. Esnard: Network management, 27h, University Bordeaux I; Network security, 27h, University Bordeaux I; Programming distributed applications, 35h, ENSEIRB-MatMeca, Talence.
M. Faverge: System Programming, 74h, M1; Load Balancing and Scheduling, 19h, M2, ENSEIRB-MatMeca, Talence.
He is also in charge of the second year of Embedded Electronic Systems option at ENSEIRB-MatMeca, Talence.
P. Ramet: Scheduling, 8h; Numerical Algorithmic, 30h; ENSEIRB-MatMeca, Talence.
He also give classes on Cryptography, 30h, Ho Chi Minh City, Vietnam.
L. Giraud: Introduction to intensive computing and related programming tools, 20h, INSA Toulouse; Introduction to high performance computing and applications, 20h, ISAE-ENSICA; On mathematical tools for numerical simulations, 10h, ENSEEIHT Toulouse; Parallel sparse linear algebra, 11h, ENSEIRB-MatMeca, Talence.
A. Guermouche: Network management, 92h, University Bordeaux I; Network security, 64h, University Bordeaux I; Operating system, 24h, University Bordeaux I.
J. Roman: Parallel sparse linear algebra, 10h, ENSEIRB-MatMeca, Talence; Parallel algorithms, 22h, ENSEIRB-MatMeca, Talence.
Defended PhD thesis
Yohann Dudouit, Scalable parallel elastodynamic solver with local refinment in geophysics, defended on December 8th, advisors: L. Giraud and S. Pernet (ONERA).
Andra Hugo Composabilité de codes parallèles sur plateformes hétérogènes, defended on December 12th 2014, advisors: A. Guermouche, R. Namyst and P-A. Wacrenier.
Clément Vuchener, Algorithmique de l'équilibrage de charge pour des couplages de codes complexes, defended on February 7th 2014. advisors: A. Esnard and J. Roman.
PhD in progress :
Pierre Blanchard, Fast and accurate methods for dislocation dynamics, starting Oct. 2013, advisors: O. Coulaud and E. Darve (Stanford Univ.).
Bérenger Bramas, Optimization of time domain BEM solvers, starting Jan 2013, advisors: O. Coulaud and G. Sylvand.
Astrid Casadei, Scalabilité et robustesse numérique des solveurs hybrides pour machines massivement parallèles, starting Oct. 2011, advisors: F. Pellegrini and P. Ramet.
Jean-Marie Couteyen, Parallélisation et passage à l'échelle du code FLUSEPA, starting Feb 2013, advisors : P. Brenner (Airbus Defence and Space) and J. Roman.
Arnaud Etcheverry, Toward large scale dynamic dislocation simulation on petaflop computers, starting Oct. 2011, advisor: O. Coulaud.
Xavier Lacoste, Scheduling and memory optimizations for sparse direct solver on multicore/multigpu cluster systems, starting Jan. 2012, advisors: F. Pellegrini and P. Ramet.
Alexis Praga, Parallel atmospheric chemistry and transport model solver for massivelly platforms, starting Oct. 2011, advisors: D. Cariolle (CERFACS) and L. Giraud.
Stojce Nakov, Parallel hybrid solver for heterogeneous manycores: application to geophysics, starting Oct. 2011, advisors: E. Agullo and J. Roman.
Maria Predari, Dynamic Load Balancing for Massively Parallel Coupled Codes, starting Oct. 2013, advisors: A. Esnard and J. Roman.
Louis Poirel, Two level hybrid linear solver, starting Nov. 2014, advisors: E. Agullo, M. Faverge and L. Giraud.
Fabien Rozar, Peta and exaflop algorithms for turbulence simulations of fusion plasmas, starting Nov. 2012, advisors: G. Latu (CEA Cadarache) and J. Roman.
Moustapha Salli, Design of a massively parallel version of the SN method for neutronic simulations, starting Oct. 2012, advisors: L. Plagne (EDF), P. Ramet and J. Roman.
Mawussi Zounon, Numerical resilient algorithms for exascale, starting Oct. 2011, advisors: E. Agullo and L. Giraud.
HDR of B. Goglin (Université de Bordeaux) entitled “Vers des mécanismes génériques de communication et une meilleure maîtrise des affinités dans les grappes de calculateurs hiérarchiques" defended April 2014. J. Roman (examinator).
PhD of M. Dorier (Ecole Normale Supérieure de Rennes) entitled “Addressing the challenges of I/O variability in post-petascale HPC simulations" defended December 2014. J. Roman (external referee).
PhD of D. Genet (Université de Bordeaux) entitled “Design of generic modular solutions for PDE solvers for modern architectures" defended December 2014. J. Roman (examinator).
PhD of P. Jacq (Université de Bordeaux) entitled “Méthodes numériques de type volumes finis sur maillages non-structurés pour la résolution de la thermique anisotrope et des équations de Navier-Stokes compressibles" defended July 2014. J. Roman (examinator).
PhD of B. Lizé (Université Paris 13) entitled “Résolution Directe Rapide pour les Éléments Finis de Frontière en Électromagnétisme et
Acoustique :
PhD of P. Jolivet (Université de Grenoble et LJLL) entitled “Méthodes de décomposition de domaine. Application au calcul haute performance" defended Ocotber 2104. L. Giraud (examinator).
PhD of R. Kanna (Manchester University) entitled “Numerical linear algebra problems in structural analysis" defended Octobre 2014. Jury: D. Silvester (internal referee) L. Giraud (external referee).
PhD of L. Boillot (Université de Pau et des Pays de l'Adour), entitled “Contributions à la modélisation mathématique et à l’algorithmique parallèle pour l’optimisation d’un propagateur d’ondes élastiques en milieu anisotrope” defended December 2014, E. Agullo (examinator).
In the context of HPC-PME initiative, we started a collaboration with ALGO'TECH INFORMATIQUE and we have organised one of the first PhD-consultant action implemented by Xavier Lacoste led by Pierre Ramet. ALGO’TECH is one of the most innovative SMEs (small and medium sized enterprises) in the field of cabling embedded systems, and more broadly, automatic devices. The main target of the project is to validate the possibility to use the sparse linear solvers of our team in the area of electromagnetic simulation tools developped by ALGO'TECH.
The HiePACS members have organized the PATC training session on Parallel Linear algebra at “Maison de la simulation" in Saclay from March 26th to March 28th.