In TADaaM, we propose a new approach in which the application explicitly expresses the resource needs of its execution. The application needs to express its behavior, but in a different way from the compute-centric approach, as the additional information is not necessarily focused on computation and on instruction execution, but follows a high-level semantics (need for large memory for some processes, start of a communication phase, need to refine the granularity, beginning of a storage access phase, description of data affinity, etc.). These needs will be expressed to a service layer through an API. The service layer will be system-wide (able to gather global knowledge) and stateful (able to take decisions based on the current request but also on previous ones). The API shall enable the application to access this service layer through a well-defined set of functions, based on carefully designed abstractions.
Hence, the goal of TADaaM is to design a stateful system-wide service layer for HPC systems, in order to optimize the execution of applications according to their needs.
This layer will abstract low-level details of the architecture and the software stack, and will allow applications to register their needs. Then, according to these requests and to the environment characteristics, this layer will feature an engine to optimize the execution of the applications at system-scale, taking into account the gathered global knowledge and previous requests.
This approach exhibits several key characteristics:
It is independent from the application parallelization, the programming model, the numerical scheme and, largely, from the data layout. Indeed, high-level semantic requests can easily be added to the application code after the problem has been modeled, parallelized, and most of the time after the data layout has been designed and optimized. Therefore, this approach is – to a large extent – orthogonal to other optimization mechanisms and does not require application developers to rewrite their code.
Application developers are the people who best know their code and therefore the needs of their application. They can easily (if the interface is well designed and the abstractions are correctly exposed) express the application needs in terms of resource usage and interaction with the whole environment.
Being stateful and shared by all the applications in the parallel environment, the proposed layer will therefore enable optimizations that:
cannot be performed statically but require information only known at launch- or run-time,
are incremental and require minimal changes to the application execution scheme,
deal with several parts of the environment at the same time (e.g., batch scheduler, I/O, process manager and storage),
take into account the needs of several applications at the same time and deal with their interaction. This will be useful, for instance, to handle network contention, storage access or any other shared resources.
Firstly, in order for applications to make the best possible use of the available resources, it is impossible to expose all the low-level details of the hardware to the program, as this would make portability impossible to achieve. Hence, the standard approach is to add intermediate layers (programming models, libraries, compilers, runtime systems, etc.) to the software stack so as to bridge the gap between the application and the hardware. With this approach, optimizing the application requires expressing its parallelism (within the imposed programming model), organizing the code, scheduling and load-balancing the computations, etc. In other words, in this approach, the way the code is written and the way it is executed and interpreted by the lower layers drive the optimization. In any case, this approach is centered on how computations are performed. Such an approach is no longer sufficient, as the way an application executes depends less and less on the organization of computation and more and more on the way its data is managed.
Secondly, modern large-scale parallel platforms comprise tens to hundreds of thousands of nodes.
Lastly, even if an application is running alone, each element of the software stack often performs its own optimization independently. For instance, when considering a hybrid MPI/OpenMP application, one may realize that threads are concurrently used within the OpenMP runtime system, within the MPI library for communication progression, and possibly within the computation library (BLAS) and even within the application itself (pthreads). However, none of these different classes of threads is aware of the existence of the others. Consequently, the way they are executed, scheduled and prioritized does not depend on their relative roles, on their locations in the software stack, or on the state of the application.
The above remarks show that in order to go beyond the state-of-the-art, it is necessary to design a new set of mechanisms allowing cross-layer and system-wide optimizations so as to optimize the way data is allocated, accessed and transferred by the application.
In TADaaM, we will tackle the problem of efficiently executing an application, at system-scale, on an HPC machine. We assume that the application is already optimized (efficient data layout, use of effective libraries, usage of state-of-the-art compilation techniques, etc.). Nevertheless, even a statically optimized application will not be able to be executed at scale without considering the following dynamic constraints: machine topology, allocated resources, data movement and contention, other running applications, access to storage, etc. Thanks to the proposed layer, we will provide a simple and efficient way for already existing applications, as well as new ones, to express their needs in terms of resource usage, locality and topology, using a high-level semantic.
It is important to note that we target the optimization of each application independently, but also of several applications at the same time and at system scale, taking into account their resource requirements, their network usage and their storage access. Furthermore, dealing with code-coupling applications is an intermediate use case that will also be considered.
Several issues have to be considered. The first one consists in providing relevant abstractions and models to describe the topology of the available resources and the application behavior.
Therefore, the first question we want to answer is: “How to build scalable models and efficient abstractions that make it possible to understand the impact of data movement, topology and locality on performance?” These models must be sufficiently precise to grasp the reality, tractable enough to enable efficient solutions and algorithms, and simple enough to remain usable by non-hardware experts. We will work on (1) better describing the memory hierarchy, considering new memory technologies; (2) providing an integrated view of the nodes, the network and the storage; (3) exhibiting qualitative knowledge; (4) providing ways to express the multi-scale properties of the machine. Concerning abstractions, we will work on providing general concepts to be integrated at the application or programming model layers. The goal is to offer means for the application to express its high-level requirements in terms of data access, locality and communication, by providing abstractions of the notions of hierarchy, mesh, affinity, traffic metrics, etc.
In addition to the abstractions and the aforementioned models we need to define a clean and expressive API in a scalable way, in order for applications to express their needs (memory usage, affinity, network, storage access, model refinement, etc.).
Therefore, the second question we need to answer is: “How to build a system-scale, stateful, shared layer that can gather application needs expressed with a high-level semantic?” This work will require not only defining a clean API through which applications will express their needs, but also defining how such a layer will be shared across applications and will scale on future systems. The API will provide a simple yet effective way to express different needs such as: memory usage of a given portion of the code; start of a compute-intensive part; phase where the network is accessed intensively; topology-aware affinity management; usage of storage (in read and/or write mode); change of the data layout after mesh refinement, etc. From an engineering point of view, the layer will have a hierarchical design matching the hardware hierarchy, so as to achieve scalability.
Once this has been done, the service layer will have all the information about the environment characteristics and application requirements. We therefore need to design a set of mechanisms to optimize the execution of applications: communication, mapping, thread scheduling, data partitioning/mapping/movement, etc.
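To make the notions of "stateful" and "shared" more concrete, here is a minimal sketch of what registering needs with such a layer could look like. Everything in it is hypothetical: the class ServiceLayer, the methods register_need and apps_in_phase, and the request kinds are illustrative names chosen for this example, not the API designed in the project.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class ServiceLayer:
    """Hypothetical stateful registry of application needs, shared by all applications."""
    history: dict = field(default_factory=lambda: defaultdict(list))

    def register_need(self, app: str, kind: str, **details) -> None:
        # Stateful: every request is kept, so later decisions can use earlier ones.
        self.history[app].append((kind, details))

    def apps_in_phase(self, kind: str) -> list:
        # System-wide view: applications whose most recent request is `kind`.
        return sorted(app for app, reqs in self.history.items()
                      if reqs and reqs[-1][0] == kind)


layer = ServiceLayer()
layer.register_need("cfd", "memory", megabytes=4096)
layer.register_need("cfd", "network_phase", intensity="high")
layer.register_need("climate", "network_phase", intensity="high")
layer.register_need("viz", "storage_phase", mode="write")

# Applications currently in a network-intensive phase: candidates for contention handling.
contending = layer.apps_in_phase("network_phase")
```

Because the layer sees the requests of all applications, it can detect that two of them enter a network-intensive phase at the same time, which no single application could know on its own.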
Hence, the last scientific question we will address is: “How to design fast and efficient algorithms, mechanisms and tools to enable execution of applications at system-scale, in a full HPC ecosystem, taking into account topology and locality?” A first set of research directions is related to thread and process placement according to the topology and the affinity. Another large field of study is related to data placement, allocation and partitioning: optimizing the way data is accessed and processed, especially for mesh-based applications. The issues of transferring data across the network will also be tackled, thanks to the global knowledge we have of the application behavior and the data layout. Concerning the interaction with other applications, several directions will be explored. Among these, we will deal with matching process placement with the resource allocation given by the batch scheduler or with the storage management: switching from a best-effort, application-centric strategy to a global optimization scheme.
TADaaM targets scientific simulation applications on large-scale systems, as these applications present huge challenges in terms of performance, locality, scalability, parallelism and data management. Many of these HPC applications use meshes as the basic model for their computation. For instance, PDE-based simulations using finite difference, finite volume, or finite element methods operate on meshes that describe the geometry and the physical properties of the simulated objects. This is the case for at least two thirds of the applications selected in the 9th PRACE call.
Mesh-based applications not only represent the majority of HPC applications running on existing supercomputing systems, but also feature properties that should be taken into account to achieve scalability and performance on future large-scale systems. These properties are the following:
Datasets are large: some meshes comprise hundreds of millions of elements, or even billions.
In many simulations, meshes are refined or coarsened at each time step, so as to account for the evolution of the physical simulation (moving parts, shockwaves, structural changes in the model resulting from collisions between mesh parts, etc.).
Many meshes are unstructured, and require advanced data structures so as to manage irregularity in data storage.
Due to their rooting in the physical world, meshes exhibit interesting topological properties (low dimensionality embedding, small maximum degree, large diameter, etc.). It is very important to take advantage of these properties when laying out mesh data on systems where communication locality matters.
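The topological properties listed above can be checked on a small example. The sketch below builds the vertex graph of a structured 20 by 20 quadrilateral mesh (a deliberately simple stand-in for a real, unstructured mesh) and measures its maximum degree and its diameter with breadth-first searches. The function names are chosen for this illustration only.

```python
from collections import deque


def grid_mesh(nx, ny):
    """Adjacency lists of the vertex graph of an nx-by-ny structured quadrilateral mesh."""
    adj = {}
    for x in range(nx):
        for y in range(ny):
            adj[(x, y)] = [(x + dx, y + dy)
                           for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
                           if 0 <= x + dx < nx and 0 <= y + dy < ny]
    return adj


def eccentricity(adj, source):
    """Largest breadth-first-search distance from source to any other vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values())


adj = grid_mesh(20, 20)
max_degree = max(len(neighbors) for neighbors in adj.values())
diameter = max(eccentricity(adj, v) for v in adj)
```

With 400 vertices, the maximum degree stays at 4 while the diameter reaches 38: neighbors in the mesh are few and mostly "close", which is what makes locality-aware data layout effective.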
All these features make mesh-based applications a very interesting and challenging use-case for the research we want to carry out in this project. Moreover, we believe that our proposed approach and solutions will contribute to enhance these applications and allow them to achieve the best possible usage of the available resources of future high-end systems.
The netloc (See Section ) tools have been run on one of the largest European supercomputers (the TGCC/Genci CURIE machine) and successfully modeled its 5200 nodes and its interconnection network (more than 800 switches). This is a joint work with CEA and the COLOC European project.
Network Locality
Functional Description
netloc (Network Locality) is a library that extends hwloc to network topology information by assembling hwloc knowledge of server internals within graphs of inter-node fabrics such as Infiniband, Intel OmniPath or Cray networks. netloc builds a software representation of the entire cluster so as to help applications properly place their tasks on the nodes. It may also help communication libraries optimize their strategies according to the wires and switches. netloc targets the same challenges as hwloc but focuses on a wider spectrum by enabling cluster-wide solutions such as process placement. netloc is distributed within hwloc releases starting with hwloc 2.0.
Participants: Cyril Bordage and Brice Goglin
Contact: Brice Goglin
Keywords: High-performance calculation - MPI communication
Functional Description
NewMadeleine is the fourth incarnation of the Madeleine communication library. The new architecture aims at enabling the use of a much wider range of communication flow optimization techniques. Its design is entirely modular: drivers and optimization strategies are dynamically loadable software components, allowing experimentations with multiple approaches or on multiple issues with regard to processing communication flows.
The optimizing scheduler SchedOpt targets applications with irregular, multi-flow communication schemes such as found in the increasingly common application conglomerates made of multiple programming environments and coupled pieces of code, for instance. SchedOpt itself is easily extensible through the concepts of optimization strategies (what to optimize for, what the optimization goal is) expressed in terms of tactics (how to optimize to reach the optimization goal). Tactics themselves are made of basic communication flows operations such as packet merging or reordering.
The communication library is fully multi-threaded through its close integration with PIOMan. It manages concurrent communication operations from multiple libraries and from multiple threads. Its MPI implementation Mad-MPI fully supports the MPI_THREAD_MULTIPLE multi-threading level.
Participants: Alexandre Denis, Nathalie Furmento, Raymond Namyst and Clement Foyer
Contact: Alexandre Denis
Parallel Mesh Partitioning and Adaptation
Keywords: Dynamic load balancing - Unstructured heterogeneous meshes - Parallel remeshing - Subdomain decomposition - Parallel numerical solvers
Scientific Description
PaMPA is a parallel library for handling, redistributing and remeshing unstructured meshes on distributed-memory architectures. PaMPA dramatically eases and speeds up the development of parallel numerical solvers for compact schemes. It provides solver writers with a distributed mesh abstraction and an API to:
describe unstructured and possibly heterogeneous meshes, in the form of a graph of interconnected entities of different kinds (e.g. elements, faces, edges, nodes);
attach values to the mesh entities;
distribute such meshes across processing elements, with an overlap of variable width;
perform synchronous or asynchronous data exchanges of values across processing elements;
describe numerical schemes by means of iterators over mesh entities and their connected neighbors of a given kind;
redistribute meshes so as to balance computational load;
perform parallel dynamic remeshing, by applying adequately a user-provided sequential remesher to relevant areas of the distributed mesh.
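As an illustration of what an "overlap of variable width" means in the list above, the following sketch computes the set of entities that a processing element must hold in addition to the ones it owns, for a given overlap width. It is a generic breadth-first expansion written for this example; the function halo and its arguments are not part of the PaMPA API.

```python
def halo(adj, owned, width):
    """Entities within `width` hops of the owned set, excluding the owned set itself."""
    owned = set(owned)
    frontier, seen = set(owned), set(owned)
    for _ in range(width):
        # Expand by one layer of neighbors that have not been reached yet.
        frontier = {v for u in frontier for v in adj[u]} - seen
        seen |= frontier
    return seen - owned


# A path of 10 entities, 0-1-...-9; this processing element owns the first five.
adj = {i: [j for j in (i - 1, i + 1) if 0 <= j < 10] for i in range(10)}
owned = {0, 1, 2, 3, 4}

overlap_1 = halo(adj, owned, 1)
overlap_2 = halo(adj, owned, 2)
```

A wider overlap lets a numerical scheme with a larger stencil run without extra communication, at the price of more replicated data.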
PaMPA concurrently runs multiple sequential remeshing tasks to perform dynamic parallel remeshing and redistribution of very large unstructured meshes. For example, it can remesh a tetrahedral mesh from 43 million elements to more than 1 billion elements on 280 Broadwell processors in 20 minutes.
Functional Description
Parallel library for handling, redistributing and remeshing unstructured, heterogeneous meshes on distributed-memory architectures. PaMPA dramatically eases and speeds up the development of parallel numerical solvers for compact schemes.
Participants: Cedric Lachat, François Pellegrini and Cécile Dobrzynski
Partners: CNRS - IPB - Université de Bordeaux
Contact: Cedric Lachat
Keywords: High-performance computing - Graph algorithms - Domain decomposition - Static mapping - Mesh partitioning - Sparse matrix ordering
Scientific Description
Scotch is a software package and libraries for sequential and parallel graph partitioning, static mapping and clustering; sequential mesh and hypergraph partitioning; and sequential and parallel sparse matrix block ordering.
Its main use is to subdivide a scientific problem, expressed as a graph, into a set of subproblems as independent as possible from each other (in terms of connecting edges).
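The independence of subproblems "in terms of connecting edges" is usually quantified by the edge cut, that is, the number of edges whose endpoints fall in different parts. The sketch below evaluates this quantity for two hand-written bipartitions of a small ring graph. It only illustrates the objective; it does not call Scotch and does not reproduce its algorithms.

```python
def edge_cut(edges, part):
    """Number of edges whose two endpoints lie in different parts."""
    return sum(1 for u, v in edges if part[u] != part[v])


# A ring of 8 vertices: 0-1-2-...-7-0.
edges = [(i, (i + 1) % 8) for i in range(8)]

contiguous = {v: 0 if v < 4 else 1 for v in range(8)}   # {0..3} and {4..7}
interleaved = {v: v % 2 for v in range(8)}              # even and odd vertices

cut_contiguous = edge_cut(edges, contiguous)
cut_interleaved = edge_cut(edges, interleaved)
```

Both partitions are perfectly balanced (four vertices per part), yet the contiguous one cuts 2 edges while the interleaved one cuts all 8.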
Functional Description
Scotch takes the form of a set of libraries, plus additional standalone programs. The sequential and parallel libraries provide a set of interfaces to describe centralized and distributed graphs to partition, the target architectures to map onto, the resulting centralized and distributed mapping and ordering structures, etc. Scotch takes advantage of POSIX threads, and its parallel version, PT-Scotch, uses the MPI interface.
Participants: François Pellegrini, Cédric Lachat, Rémi Barat and Cédric Chevalier
Partners: CNRS - IPB - Region Aquitaine
Contact: François Pellegrini
Keywords: Intensive parallel computing - High-Performance Computing - Hierarchical architecture - Placement
Scientific Description
TreeMatch provides a permutation of the processes to the processors/cores in order to minimize the communication cost of the application.
Important features are: the number of processors can be greater than the number of application processes; it assumes that the topology is a tree and does not require valuation of the topology (e.g. communication speeds); it implements different placement algorithms that are switched according to the input size.
Some core algorithms are parallel to speed up the execution.
TreeMatch is integrated into various software such as the Charm++ programming environment as well as in both major open-source MPI implementations: Open MPI and MPICH2.
Functional Description
TreeMatch is a library for performing process placement based on the topology of the machine and the communication pattern of the application.
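The placement objective can be illustrated on a toy case. The sketch below assumes a balanced binary tree with four leaves (two nodes of two cores each), measures the distance between two leaves as the number of levels to climb before they share an ancestor, and searches all permutations exhaustively. This brute-force search is only meant to show the cost being minimized; it is not the TreeMatch algorithm, and the function names are illustrative.

```python
from itertools import permutations


def tree_distance(a, b, arity=2):
    """Levels to climb in a balanced tree before leaves a and b share an ancestor."""
    d = 0
    while a != b:
        a //= arity
        b //= arity
        d += 1
    return d


def placement_cost(comm, sigma):
    """Sum over process pairs of traffic times tree distance; sigma[i] hosts process i."""
    n = len(comm)
    return sum(comm[i][j] * tree_distance(sigma[i], sigma[j])
               for i in range(n) for j in range(i + 1, n))


# Symmetric communication matrix (made-up volumes): 0 talks mostly to 2, and 1 to 3.
comm = [[0, 1, 100, 1],
        [1, 0, 1, 100],
        [100, 1, 0, 1],
        [1, 100, 1, 0]]

identity = (0, 1, 2, 3)
best = min(permutations(range(4)), key=lambda s: placement_cost(comm, s))

cost_identity = placement_cost(comm, identity)
cost_best = placement_cost(comm, best)
```

The default placement costs 406, while the best permutation, which puts heavily communicating processes under the same parent, costs 208. Exhaustive search is only viable for very small inputs, which is consistent with switching algorithms according to the input size.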
Participants: Emmanuel Jeannot, François Tessier, Adele Villiermet, Guillaume Mercier and Pierre Celor
Partners: CNRS - IPB - Université de Bordeaux
Contact: Emmanuel Jeannot
Hardware Locality
Keywords: HPC - Topology - Open MPI - Affinities - GPU - Multicore - NUMA - Locality
Functional Description
Hardware Locality (hwloc) is a library and set of tools aiming at discovering and exposing the topology of machines, including processors, cores, threads, shared caches, NUMA memory nodes and I/O devices. It builds a widely-portable abstraction of these resources and exposes it to applications so as to help them adapt their behavior to the hardware characteristics. They may consult the hierarchy of resources and their attributes, and bind tasks or memory to them.
hwloc targets many types of high-performance computing applications, from thread scheduling to placement of MPI processes. Most existing MPI implementations, several resource managers and task schedulers, and multiple other parallel libraries already use hwloc.
Participants: Brice Goglin and Samuel Thibault
Partners: AMD - Intel - Open MPI consortium
Contact: Brice Goglin
netloc (see Section ) is a tool in hwloc to discover the network topology. Our first work with netloc was to redesign it to be more efficient and better suited to the needs. The code was cleaned up and some dependencies were removed. We have added a display tool that shows a network topology in a web browser, where the user can interact with it. It ran on one of the largest European supercomputers (the TGCC/Genci CURIE machine) and successfully modeled its 5200 nodes and its interconnection network (more than 800 switches).
Moreover, it is now possible to interact with Scotch from netloc. The first feature is to export a network topology, or even the currently available topology given by the resource manager, into a Scotch architecture. Conversely, we can use Scotch tools in netloc for building a process mapping based on resources found by netloc and a process graph describing communications between processes. Tests conducted on a stencil mini-app have shown that the benefits are real, although more work is still needed.
To amortize the cost of communication in HPC applications, programmers want to overlap communication with computation. To do so, they assume that non-blocking MPI communications progress in the background. NewMadeleine, our communication library, is able to make communication progress in the background so that overlap actually happens. However, not all MPI implementations are able to overlap communication and computation.
We have proposed a benchmark to measure what really happens when trying to overlap non-blocking point-to-point communications with computation. The benchmark measures how much overlap happens in various cases: sender side, receiver side, datatypes likely to be offloaded onto the NIC or not, multi-threaded computation, multi-threaded communication or not. We have benchmarked a wide range of MPI libraries and hardware platforms and, thanks to low-level traces, explained the results.
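One common way to turn such measurements into a single number is sketched below. The formula is an assumption made for this illustration, not necessarily the metric used by the benchmark: it compares the time of communication alone, of computation alone, and of both issued together, and reports which fraction of the shorter activity was hidden behind the longer one.

```python
def overlap_ratio(t_comm, t_comp, t_both):
    """Fraction of the shorter activity hidden behind the longer one, clamped to [0, 1].

    t_comm: time of the communication alone
    t_comp: time of the computation alone
    t_both: time when the non-blocking communication and the computation are combined
    """
    hidden = t_comm + t_comp - t_both
    return max(0.0, min(1.0, hidden / min(t_comm, t_comp)))


# Made-up timings in seconds.
full_overlap = overlap_ratio(t_comm=2.0, t_comp=3.0, t_both=3.0)
no_overlap = overlap_ratio(t_comm=2.0, t_comp=3.0, t_both=5.0)
half_overlap = overlap_ratio(t_comm=2.0, t_comp=3.0, t_both=4.0)
```

A ratio of 1 means the combined run takes no longer than the longer of the two activities; a ratio of 0 means the library serialized them.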
A tool has been developed to abstract performance metrics and map them onto the hwloc (see Section ) topology model of the system. During the year 2016, the tool has been entirely rewritten to release a more meaningful and stable programming abstraction, with off-the-shelf performance abstraction plugins and a raw performance acquisition plugin. A special effort has been made on output presentation by turning the lstopo tool from hwloc into a library embedded in the monitoring tool, so as to display performance metrics on the system topology. Another backend using R has also been developed for post-mortem analysis and model extraction from abstract metrics of the topology.
The year 2016 marked the completion of our extension of the well-known Cache-Aware Roofline Model (CARM) and of the associated tool. This model targets deep platform and application analysis on multicore processors. It consists of a two-dimensional plane bounded by several machine ceilings and representative of scientific application workloads. Our extension validates the use of the CARM on emerging processors with heterogeneous memory subsystems, and extends the CARM methodology to encompass the interconnection network, thus enabling full modeling of shared-memory systems. This work is a collaboration with the INESC-ID research center under the NESUS project.
In the scope of the COLOC project, we worked on understanding scalability issues of the efield application on a large shared-memory system. Our analysis with the above-mentioned tools highlighted a potential bandwidth bottleneck. This problem can usually be tackled by mapping threads and data onto the machine cores and memories, respectively. Unfortunately, those techniques cannot be applied to this (closed-source) application, since the system does not allow monitoring memory accesses and traffic.
hwloc (see Section ) is the de facto standard tool for gathering information about parallel platform topologies. The advent of new memory architectures, with high-bandwidth and/or non-volatile memories, increases the complexity of the memory management subsystem. Indeed, besides taking care of allocating data buffers locally, developers also have to choose between different local memories with different performance and persistence characteristics. Moreover, operating systems still cannot expose the full details of these technologies to applications. We modified the hwloc tool to cope with these new needs in collaboration with Intel. This work led to the design of a new structural model for platforms with heterogeneous memories.
hwloc (see Section ) is used for gathering the topology of thousands of nodes in large clusters. Those nodes are now growing to hundreds of cores, making the overall amount of topology information non-negligible. We designed new ways to compress topologies, either lossless or lossy, for easier transfer between compute nodes and front nodes and more compact storage and manipulation. We also studied the overhead of topology discovery on the overall execution time and showed that the Linux kernel is a bottleneck on large nodes. This raised the need to use exported and/or abstracted topologies to factorize this overhead.
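The "machine ceilings" of a roofline plot follow a simple rule: attainable performance is the minimum of the compute peak and of the memory bandwidth multiplied by the arithmetic intensity. The sketch below applies this rule with made-up numbers for a hypothetical machine; it illustrates the basic roofline bound only, not the extension described above.

```python
def attainable_gflops(intensity, peak_gflops, bandwidth_gbs):
    """Roofline bound: min(compute ceiling, bandwidth ceiling * arithmetic intensity).

    intensity is expressed in floating-point operations per byte.
    """
    return min(peak_gflops, bandwidth_gbs * intensity)


# Hypothetical machine: 1000 GFLOP/s peak and two memory levels (bandwidths in GB/s).
PEAK = 1000.0
CEILINGS = {"L1": 2000.0, "DRAM": 100.0}

# Intensity above which DRAM bandwidth is no longer the limiting factor.
ridge_dram = PEAK / CEILINGS["DRAM"]

memory_bound = attainable_gflops(0.5, PEAK, CEILINGS["DRAM"])
compute_bound = attainable_gflops(20.0, PEAK, CEILINGS["DRAM"])
```

In the cache-aware variant, one such ceiling is drawn per memory level, so a kernel can be compared against the level that actually serves its data.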
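To give an intuition of why lossless compression of topologies can be very effective, the sketch below exploits the regularity of typical machines: runs of identical sibling subtrees are stored once with a repetition count. The tree encoding and the functions are assumptions made for this example; they are not the hwloc data structures, nor the compression scheme that was actually designed.

```python
def compress(node):
    """Collapse runs of identical sibling subtrees into (count, subtree) pairs."""
    kind, children = node
    runs = []
    for child in map(compress, children):
        if runs and runs[-1][1] == child:
            runs[-1][0] += 1
        else:
            runs.append([1, child])
    return (kind, tuple((count, sub) for count, sub in runs))


def expand(node):
    """Inverse of compress: rebuild the full tree."""
    kind, runs = node
    return (kind, tuple(expand(sub) for count, sub in runs for _ in range(count)))


def full_size(node):
    """Number of objects in an uncompressed tree."""
    return 1 + sum(full_size(child) for child in node[1])


def compressed_size(node):
    """Number of objects actually stored in a compressed tree."""
    return 1 + sum(compressed_size(sub) for _, sub in node[1])


# Hypothetical machine: 4 NUMA nodes of 16 cores each, as (kind, children) tuples.
machine = ("Machine", tuple(
    ("NUMANode", tuple(("Core", ()) for _ in range(16))) for _ in range(4)))

packed = compress(machine)
```

On this regular example, 69 objects shrink to 3 stored objects, and expand(packed) restores the original tree exactly.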
MPI one-sided operations, also known as Remote Memory Access (RMA), are direct read/write memory accesses to a remote node. Only one node (the origin) explicitly calls MPI operations, while communication progression is implicit for the other node (the target). These operations assume that the communication library is able to make communication progress in the background.
Since MadMPI, the MPI implementation of NewMadeleine (see Section ), extensively uses an event-driven mechanism to achieve asynchronous progression, we have taken advantage of this property to implement MPI RMA operations in the library. This implementation keeps the overlap properties by asynchronously handling the messages exchanged by the applications. The addition also supports MPI_THREAD_MULTIPLE, for both shared and distributed memory contexts.
The evolution of massively parallel supercomputers makes two issues particularly apparent: load imbalance and poor management of data locality in applications. With the increase in the number of cores and the drastic decrease in the amount of memory per core, high performance requires taking particular care of load balancing and, as much as possible, of data locality. One means of taking locality into account is the placement of the processing entities, and load-balancing techniques are relevant for improving application performance. With large-scale platforms in mind, we developed a hierarchical and distributed algorithm whose aim is to perform topology-aware load balancing tailored for Charm++ applications. This algorithm is based on both LibTopoMap for the network awareness aspects and on TreeMatch to determine a relevant placement of the processing entities. We show that the proposed algorithm improves the overall execution time for both real applications and a synthetic benchmark. For the latter, we show scalability up to one million processing entities.
Reading and writing data efficiently from storage systems is critical for high-performance data-centric applications. These I/O systems are increasingly characterized by complex topologies and deeper memory hierarchies. Effective parallel I/O solutions are needed to scale applications on current and future supercomputers. Data aggregation is an efficient approach that consists of electing some processes to be in charge of aggregating data from a set of neighbors and writing the aggregated data to storage. Thus, bandwidth use can be optimized while contention is reduced. We have taken the network topology into account for mapping aggregators, and we propose an optimized buffering system in order to reduce the aggregation cost. We have validated our approach using micro-benchmarks and the I/O kernel of a large-scale cosmology simulation. We have shown improvements of up to 15× for I/O operations compared to a standard implementation of MPI I/O.
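The idea of topology-aware aggregator mapping can be sketched as follows. The example assumes a 2D mesh network where the distance between two nodes is their Manhattan distance, and elects as aggregator the rank that minimizes the total cost of gathering data, counted as bytes times hops. Both the cost model and the function names are assumptions made for this illustration, not the method of the cited work.

```python
def hops(a, b):
    """Manhattan distance between two nodes of a 2D mesh network (an assumption)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def elect_aggregator(coords, nbytes):
    """Pick the rank minimizing the total bytes-times-hops cost of gathering the data."""
    def cost(candidate):
        return sum(nbytes[r] * hops(coords[r], coords[candidate]) for r in coords)
    # Ranks are scanned in increasing order, so ties go to the lowest rank.
    return min(sorted(coords), key=cost)


# Four ranks on one row of the mesh; rank 3 sits farther away.
coords = {0: (0, 0), 1: (1, 0), 2: (2, 0), 3: (5, 0)}

uniform = {0: 10, 1: 10, 2: 10, 3: 10}
skewed = {0: 10, 1: 10, 2: 10, 3: 100}

aggregator_uniform = elect_aggregator(coords, uniform)
aggregator_skewed = elect_aggregator(coords, skewed)
```

With equal data sizes a central rank is elected, whereas a rank holding most of the data becomes the aggregator itself, since moving its data would dominate the cost.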
Monitoring data exchanges is critical when it comes to optimizing process placement in a large-scale environment. We participated in adding to Open MPI, one of the major MPI implementations, a fine-grained point-to-point monitoring component that keeps track of message exchanges. Unlike implementations using PMPI operations, the layer in which this monitoring acts allows us to record the effective data communications at a lower level, for example after the covering tree has been calculated. This component has been enriched with complete coverage of collective, point-to-point and one-sided communications. This component also reports information about the distribution of message sizes. Monitored information can be accessed through the MPI_Tools interface, or by dumping data to files.
We released TreeMatch version 0.4 in August. The new features are: a new API, the handling of oversubscription (being able to map more processes than computing resources), fast exhaustive search (for small cases), K-partitioning in case of large arity of the tree, and an extensive set of tests.
SLURM is a Resource and Job Management System, a middleware in charge of delivering computing power to applications in HPC systems. Our goal is to take into account, in the SLURM placement process, not only the hardware topology but also the application communication pattern. We have a new selection option for the cons_res plugin in SLURM. In this case, the best_fit algorithm usually used to choose nodes is replaced by TreeMatch, an algorithm that finds the best placement among the list of free nodes in light of a given application communication matrix. We plan to release this work in the next SLURM release, 17.02.
Fragmentation in a cluster is one of the criteria that matter to administrators. Indeed, the way jobs are allocated impacts the global resource usage. It is usually observed through the utilization of a cluster for a fixed load rate, but no metrics dedicated to fragmentation exist in the literature. Hence, we constructed several metrics to measure it. Our goal is to study the impact of our selection algorithm on fragmentation in comparison with others.
MPI Non-Blocking Collectives (NBC) allow communication to overlap with computation. A good overlapping ratio is obtained when computation and communication run in parallel. To achieve this, some implementations use progress threads to manage communication tasks. These threads should be bound to different cores to maximize the overlap. Thus, we designed several thread placement algorithms. These algorithms have been implemented within the MPC framework, using the hwloc software to get a global view of the machine topology. We propose a thread placement algorithm taking into account the NUMA topology of the machine in order to improve the overlapping ratio of non-blocking collective communications.
MPI, in its current state, provides only a very limited set of functionalities allowing the programmer to effectively leverage the physical characteristics of the underlying hardware, such as the potentially complex memory hierarchy. Since the MPI philosophy is to be a hardware-agnostic interface, the challenge is to propose an interface extension that offers the programmer significant control over the hardware without delving too much into hardware details. We seek the right level of abstraction for this interface, and the goal is to push this proposal to the MPI Forum. This new interface is based on the concept of communicators, expands an already existing function available in the standard, and also introduces a couple of helper functions. We have prototyped and drafted our proposal for the 2017 meetings of the forum.
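The kind of information such a monitoring component produces can be pictured with a small post-processing sketch. It takes a list of recorded messages as (source, destination, size) triples, an input format assumed for this example, and builds the matrix of bytes exchanged together with a histogram of message sizes in power-of-two buckets. It does not use the Open MPI monitoring interface itself.

```python
from collections import Counter


def summarize(messages, nprocs):
    """Build a bytes-exchanged matrix and a power-of-two message size histogram."""
    matrix = [[0] * nprocs for _ in range(nprocs)]
    histogram = Counter()
    for src, dst, size in messages:
        matrix[src][dst] += size
        # Bucket k holds sizes in [2^k, 2^(k+1)); zero-byte messages go to bucket 0.
        histogram[max(size, 1).bit_length() - 1] += 1
    return matrix, histogram


# Made-up trace: (source rank, destination rank, size in bytes).
messages = [(0, 1, 1024), (0, 1, 1500), (1, 2, 8), (2, 0, 65536)]
matrix, histogram = summarize(messages, 3)
```

The resulting matrix is exactly the kind of input a placement algorithm needs to decide which processes should be mapped close to each other.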
Task-based models and runtimes are quite popular in the HPC community. They help to implement applications with a high level of abstraction while still applying different types of optimizations. An important optimization target is hardware affinity, which concerns to match application behavior (thread, communication, data) to the architecture topology (cores, caches, memory). In fact, realizing a well adapted placement of threads is a key to achieve performance and scalability, especially on NUMA-SMP machines. However, this type of optimization is difficult: architectures become increasingly complex and application behavior changes with implementations and input parameters, e.g problem size and number of thread. Thus, by themselves task based runtimes often deal badly with this optimization and leave a lot of fine-tuning to the user. In this work , , , we propose a fully automatic, abstracted and portable affinity module. It produces and implements an optimized affinity strategy that combines knowledge about application characteristics and the architecture's topology. Implemented in the backend of our task-based runtime ORWL, our approach was used to enhance the performance and the scalability of several unmodified ORWL-coded applications: matrix multiplication, a 2D stencil (Livermore Kernel 23), and a video tracking real world application. On two SGI SMP machines with quite different hardware characteristics, our tests show spectacular performance improvements for this unmodified application code due to a dramatic decrease of cache misses. A comparison to reference implementations using OpenMP confirms this performance gain of almost one order of magnitude.
A new set of algorithms has been designed to compute multi-criteria static mappings for the load balancing of multi-physics simulations. Multi-criteria graph partitioning is known to be NP-hard, and very few multi-criteria graph partitioners exist. Moreover, they focus on minimizing the edge cut instead of enforcing load balance. In practice, this strategy often leads to very unbalanced partitions, which are not useful for multi-physics simulations.
We have designed algorithms that focus on balancing several criteria at the same time, to ensure that our results always match all balance criteria. We have implemented a prototype in Python to test these different heuristics. One of them, called PIERE, obtained good results in terms of balance as well as communication costs. PIERE uses the classic multilevel framework, but implements a new initial partitioning algorithm, which finds a balanced partition of the graph. The partition is then refined by local optimization heuristics that ensure the balance is kept for all criteria. This allows us to return a partition respecting the balance constraints. We also compare against the well-known partitioners Scotch and METIS, and highlight that, for a small mesh, the results exhibit a high discrepancy: each tool lacks robustness.
PIERE outperformed the existing software METIS in our test cases, but there is room for improvement. We also verified the superiority of the hypergraph model over the graph model used by most partitioners. Meanwhile, we studied the source code of well-known partitioners, namely METIS and Scotch, and identified many algorithmic choices and internal parameters that are not described in their documentation. Carefully analyzing them helps us to clearly understand how the algorithms differ.
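The balance requirement discussed above can be stated precisely with a short sketch: each vertex carries one load per criterion (for instance, one per simulation phase), and a partition is acceptable only if every criterion is balanced. This is only a balance check under assumed conventions, not the PIERE algorithm; the function names and the `tolerance` parameter are hypothetical.

```python
def imbalance(loads, part_of, nparts):
    """Return, for each criterion, max part load divided by ideal load.

    `loads[v]` is the tuple of loads of vertex v (one per criterion) and
    `part_of[v]` is the part assigned to v. A value of 1.0 means the
    criterion is perfectly balanced.
    """
    ncrit = len(loads[0])
    totals = [[0.0] * ncrit for _ in range(nparts)]
    for v, part in enumerate(part_of):
        for c in range(ncrit):
            totals[part][c] += loads[v][c]
    result = []
    for c in range(ncrit):
        ideal = sum(vertex[c] for vertex in loads) / nparts
        result.append(max(part[c] for part in totals) / ideal)
    return result


def is_balanced(loads, part_of, nparts, tolerance):
    """True if every criterion is within (1 + tolerance) of the ideal."""
    return all(x <= 1.0 + tolerance
               for x in imbalance(loads, part_of, nparts))


# Four vertices, two criteria: all vertices cost 1 in the first phase,
# but only the first two vertices are active in the second phase.
loads = [(1, 3), (1, 3), (1, 0), (1, 0)]

single_criterion = [0, 0, 1, 1]  # balanced for the first criterion only
multi_criteria = [0, 1, 0, 1]    # balanced for both criteria
```

The first partition is perfect for the first criterion yet puts the whole second phase on one part (imbalance 2.0), which illustrates why balancing a single load, or minimizing the edge cut alone, is not enough for multi-physics simulations.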
In order to prepare for the inclusion of multi-criteria graph partitioning algorithms in Scotch, in the context of the PhD thesis of Rémi Barat, a new branch has been created in the Scotch repository. This new branch, labeled 6.1, is the basis for the next main release of Scotch. The sequential graph structure has been adapted to handle graphs with multiple loads per vertex, and all the related algorithms have been adapted to take multiple vertex loads into account. This resulted in minimal updates to the interface of Scotch, with full backward compatibility. All of these modifications have been performed so as not to significantly slow down the algorithms in the most common case of graphs with a single load per vertex.
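As an illustration of how a graph structure can carry several loads per vertex while keeping the single-load case unchanged for callers, here is a minimal sketch. It is not the Scotch data structure or interface; the class `MultiLoadGraph` and its field names are hypothetical, and the compressed adjacency layout is an assumption made for the example.

```python
class MultiLoadGraph:
    """Graph in compressed adjacency form with `nloads` loads per vertex.

    The neighbors of vertex v are adjncy[xadj[v]:xadj[v + 1]]. Loads are
    stored in one flat list, `nloads` consecutive values per vertex.
    """

    def __init__(self, xadj, adjncy, vloads=None, nloads=1):
        nverts = len(xadj) - 1
        if vloads is None:
            # Default: unit load for every vertex and every criterion.
            vloads = [1] * (nverts * nloads)
        if len(vloads) != nverts * nloads:
            raise ValueError("expected nloads values per vertex")
        self.xadj, self.adjncy = xadj, adjncy
        self.vloads, self.nloads = vloads, nloads
        self.nverts = nverts

    def neighbors(self, v):
        return self.adjncy[self.xadj[v]:self.xadj[v + 1]]

    def load(self, v, c=0):
        # With nloads == 1 this is a plain array lookup, as before.
        return self.vloads[v * self.nloads + c]

    def total_loads(self):
        return [sum(self.vloads[c::self.nloads])
                for c in range(self.nloads)]


# Path graph 0 - 1 - 2.
xadj, adjncy = [0, 1, 3, 4], [1, 0, 2, 1]

single = MultiLoadGraph(xadj, adjncy)  # existing callers: one load
multi = MultiLoadGraph(xadj, adjncy, vloads=[1, 5, 2, 0, 3, 4], nloads=2)
```

Because the number of loads defaults to one and the criterion index defaults to zero, code written for single-load graphs keeps working unchanged in this sketch, which mirrors the compatibility goal stated above.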
Parallel remeshing has been improved. PaMPA coupled with Mmg (v5) remeshed a tetrahedral mesh from 43 million elements to more than 1 billion elements on 280 Broadwell processors in 20 minutes. The resulting mesh, used by CERFACS, enabled one of the finest combustion simulations computed with LES (Large Eddy Simulation).
The scalability of PT-Scotch has been tested on the Curie cluster and compared to that of ParMETIS. These tests used DARI resources.
Most judges have very little, if any, knowledge of software development. This results in misconceptions and mistakes regarding the application of copyright/author's right (droit d'auteur) in court cases related to software. More generally, the concept of originality is misunderstood. While this criterion is meant in theory to separate works of the mind that are personal to an author (e.g., literary works) from creations of form that cannot, by nature, reflect the personality of their creator (e.g., mathematical tables), it is often used to qualify the degree of similarity between two different works, in the context of plagiarism. Also, the distinction between the realm of programs, that is, works of the mind, and that of algorithms, is not mastered. Algorithms belong to the fonds commun, a French term that has no equivalent in English and might be translated as “common pool”. In order to help judges and lawmakers understand and articulate these notions, we have proposed a methodology for ruling on software disputes. This methodology is solely based on the study of similarities in software code, since author's right exclusively pertains to the level of the form.
CEA is funding the PhD thesis of Hugo Taboada on specialized thread management in the context of multi programming models, and the PhD thesis of Rémi Barat on multi-criteria graph partitioning.
Bull/ATOS is funding the CIFRE PhD thesis of Nicolas Denoyelle on advanced memory hierarchies and new topologies.
Onera is funding the PhD thesis of Raphaël Blanchard on the parallelization and data distribution of discontinuous Galerkin methods for complex flow simulations.
EDF is funding the CIFRE PhD thesis of Benjamin Lorendeau on new programming models and optimization of Code Saturn.
Intel is granting $30k and providing information about future many-core platforms and memory architectures to ease the design and development of the hwloc software with early support for next generation hardware.
ANR MOEBUS Scheduling in HPC
(http://
ANR INFRA 2013, 10/2013 - 9/2017 (48 months)
Coordinator: Denis Trystram (Inria Rhône-Alpes)
Other partners: Inria Bordeaux Sud-Ouest, Bull/ATOS
Abstract: This project focuses on the efficient execution of parallel applications submitted by various users and sharing resources in large-scale high-performance computing environments.
ANR SATAS SAT as a Service.
AP générique 2015, 01/2016 - 12/2019 (48 months)
Coordinator: Laurent Simon (LaBRI)
Other partners: CRIL (Univ. Artois), Inria Lille (Spirals)
Abstract: The SATAS project aims to advance the state of the art in massively parallel SAT solving. The final goal of the project is to provide a “pay as you go” interface to SAT solving services. This will extend the reach of SAT solving technologies, used daily in many critical and industrial applications, to new application areas that were previously considered too hard, and will lower the cost of deploying massively parallel SAT solvers on the cloud.
MULTICORE - Large scale multicore virtualization for performance scaling and portability
Participants: Emmanuel Jeannot and Farouk Mansouri.
Multicore processors are becoming the norm in most computing systems. However, supporting them efficiently is still a scientific challenge. This large-scale initiative introduces a novel approach based on virtualization and dynamicity, in order to mask hardware heterogeneity and to let performance scale with the number and nature of cores. It aims to build collaborative virtualization mechanisms that achieve essential tasks related to parallel execution and data management. We want to unify the analysis and transformation processes of programs and accompanying data into one unique virtual machine. We hope to deliver a solution for compute-intensive applications running on general-purpose standard computers.
COLOC: the Concurrency and Locality Challenge (http://
Program: ITEA2
Project acronym: COLOC
Project title: The Concurrency and Locality Challenge
Duration: November 2014 - November 2017
Coordinator: BULL/ATOS
Other partners: BULL/ATOS (France); Dassault Aviation (France); Enfeild AB (Sweden); Scilab entreprise (France); Teratec (France); Inria (France); Swedish Defence Research Agency - FOI (Sweden); UVSQ (France).
Abstract: The COLOC project aims at providing new models, mechanisms and tools for improving applications performance and supercomputer resources usage taking into account data locality and concurrency.
NESUS: Network for Ultrascale Computing (http://
Program: COST
Project acronym: NESUS
Project title: Network for Ultrascale Computing
Duration: April 2014 - April 2018
Coordinator: University Carlos III de Madrid
Other partners: more than 35 countries
Abstract: Ultrascale systems are envisioned as large-scale complex systems joining parallel and distributed computing systems that will be two to three orders of magnitude larger than today’s systems. The EU is already funding large scale computing systems research, but it is not coordinated across researchers, leading to duplications and inefficiencies. The goal of the NESUS Action is to establish an open European research network targeting sustainable solutions for ultrascale computing aiming at cross fertilization among HPC, large scale distributed systems, and big data management. The network will contribute to glue disparate researchers working across different areas and provide a meeting ground for researchers in these separate areas to exchange ideas, to identify synergies, and to pursue common activities in research topics such as sustainable software solutions (applications and system software stack), data management, energy efficiency, and resilience. Some of the most active research groups of the world in this area are members of this proposal. This Action will increase the value of these groups at the European level by reducing duplication of efforts and providing a more holistic view to all researchers; it will promote the leadership of Europe, and it will increase their impact on science, economy, and society.
Partner 1: INESC-ID, Lisbon, (Portugal)
Subject 1: Application modeling for hierarchical memory systems
Partner 2: Argonne National Lab
Subject 2: Topology-aware data aggregation for I/O intensive application
Partner 3: BSC, Barcelona (Spain)
Subject 3: High-performance communication on new architectures; load-balancing and meshing: improve the distribution of data across the processors for a flow and particle simulation in the human nasal cavity.
Partner 4: University of Liege (Belgium), Université Catholique de Louvain (Belgium), Weierstrass Institute for Applied Analysis and Stochastics (WIAS) (Germany)
Subject 4: Coupling sequential remeshers with PaMPA began in 2016. The work is in progress and concerns Tetgen, developed by Hang Si, and Gmsh, developed by Christophe Geuzaine and Jean-François Remacle.
Joint-Lab on Extreme Scale Computing (JLESC):
Coordinators: Franck Cappello and Marc Snir.
Other partners: Argonne National Lab, University of Illinois at Urbana-Champaign, Tokyo Riken, Jülich Supercomputing Center, Barcelona Supercomputing Center.
Abstract: The Joint Laboratory is based at Illinois and includes researchers from Inria, and the National Center for Supercomputing Applications, ANL, Riken, Jülich, and BSC. It focuses on software challenges found in extreme scale high-performance computers.
Partner 1: AMD Research
Subject 1: Managing locality in the Heterogeneous System Architecture.
AMD provided hardware and details about its future architectures and programming models (HSA) to improve locality support for its products in the hwloc software.
Partner 1: ICL at University of Tennessee
Subject 1: Instrumenting MPI applications and modeling platforms (work on hwloc takes place in the context of the Open MPI consortium), and MPI and process placement
Partner 2: Cisco Systems
Subject 2: network topologies and platform models
Partner 3: University of Tokyo and RIKEN
Subject 3: Adaptation of MPI and runtime systems to lightweight kernels used on clusters of manycores. This action has been submitted as a JLESC project proposal, which is currently being evaluated.
Partner 4: Lawrence Livermore National Laboratory
Subject 4: Testing of the mapping features of Scotch on very large process graphs (more than two billion vertices) and very large target architectures (more than 200,000 parts).
Partner 5: Sandia National Lab
Subject 5: Topology-aware management and allocation of computing resources in runtime systems.
Balazs Gerofi from RIKEN visited us to present his work on micro-kernels for HPC. His visit led to a project proposal for JLESC.
Jose-Luiz Garcìa Zapata stayed for three months in the team to work on spectral partitioning and mapping. He implemented a spectral bipartitioning method in Scotch.
Guillaume Aupy was the Technical Program vice-chair of SC'17.
Emmanuel Jeannot is a member of the steering committee of Euro-Par and the Cluster international conference.
Guillaume Aupy was the co-chair of the Parallel and Distributed Algorithms track of ICA3PP'17.
Emmanuel Jeannot was the Program chair of the Heterogeneity in Computing Workshop (HCW'17).
Emmanuel Jeannot was the Program chair of the parallelism track of COMPAS 2016.
Alexandre Denis was a member of the program committee of Compas'16 and CCGrid 2017.
Brice Goglin was a member of the program committee of CCGrid 2016, Cluster 2016, EuroMPI 2017, HotInterconnect 24 and of the Exacomm workshop.
Emmanuel Jeannot was a member of the program committee of IPDPS 2017 and CCGRID 2017.
Guillaume Mercier was a member of the program committee of EuroMPI 2016 and EuroMPI 2017.
Cyril Bordage was a reviewer for Cluster 2016.
Alexandre Denis was a reviewer for Cluster 2016.
Brice Goglin was a reviewer for IEEE Micro.
Farouk Mansouri was a reviewer for Cluster 2016.
Guillaume Mercier was a reviewer for IPDPS 2017.
Emmanuel Jeannot is an associate editor of the International Journal of Parallel, Emergent & Distributed Systems (IJPEDS).
Guillaume Mercier is an editor of the EuroMPI 2016 Special Issue of the International Journal of High Performance Computing Applications (IJHPCA).
Guillaume Aupy was a reviewer for EURASIP Journal of Embedded Systems, Cluster Computing and Transactions on Parallel and Distributed Systems (TPDS).
Alexandre Denis was a reviewer for the Journal of Parallel and Distributed Computing (JPDC).
Emmanuel Jeannot was a reviewer for IEEE TPDS.
Guillaume Mercier was a reviewer for the EuroMPI 2016 Special Issue of the Parallel Computing journal.
François Pellegrini was a reviewer for SIAM Journal on Scientific Computing (SISC).
Brice Goglin gave a talk about managing hardware locality in HPC during an AMD Tech Talk at AMD Research (Austin, TX).
Emmanuel Jeannot gave a talk about topology-aware data management at the Workshop on Clusters, Clouds, and Data for Scientific Computing (CCDSC 2016).
Emmanuel Jeannot gave a talk about metrics and models for process placement at the Third Workshop on Programming Abstractions for Data Locality (PADAL'16).
François Pellegrini delivered a keynote speech on freedom in the digital age, during the annual congress of Société informatique de France, Strasbourg.
François Pellegrini gave a talk on software law at Université de Nice Sophia-Antipolis.
François Pellegrini participated in a round-table on Big data, compliance and personal data during the JInov meeting, Paris.
François Pellegrini gave a talk on Free software, a tool for sustainable development in countries of the South at the Colloque international sur le logiciel libre dans les pays du Sud, organized by Université Moulay Ismaïl & École nationale supérieure d'arts et métiers de Meknès.
François Pellegrini delivered a talk on freedom in the digital age, during the Defense Security Cyber summer school organized by Université de Bordeaux.
François Pellegrini delivered a talk on freedom and the ethics of informatics during the summer school for young researchers on the ethics of informatics, organized by CERNA and Allistene in Arcachon.
François Pellegrini participated in a round-table on the legal criteria for software originality in the colloquium on protection and infringement of software: the notion of digital common pool, organized by AFDIT at Conseil national des barreaux, Paris.
François Pellegrini delivered a talk on the issues of rights on immaterial goods for digital development, during the international training seminar for trainers on internet and information systems governance, organised by ITICC with the support of Organisation Internationale de la Francophonie and ARCEP-BF, in Ouagadougou.
François Pellegrini delivered the opening conference on the legal and economic bases of the digital economy, for a training seminar for Members of the Parliament of Benin on the issues of laws on digital matters, organized by Organisation Internationale de la Francophonie at Grand-Popo.
François Pellegrini gave a talk on the operational solutions to digital security issues, during the 4th NGO forum organized by the French embassy in Moscow.
François Pellegrini delivered a keynote speech on the governance of open and free innovation, at the invitation of the French ministry of Foreign affairs, during the workshop on open innovation which took place within the French-German inter-governmental conference on digital issues, in Berlin.
Adèle Villiermet has been invited to give a talk at the summer school of GDR RO.
Emmanuel Jeannot was a member of the hiring committee for an assistant professor position in informatics at Université de Bordeaux.
Brice Goglin was also a member of the hiring committee for Inria Bordeaux - Sud-Ouest research scientists.
François Pellegrini was a member of the hiring committee for a full professor position in informatics at Université de Nice Sophia-Antipolis (PR27-327). He also reviewed a PR1 promotion file at Université de Bordeaux.
TADaaM attends the MPI Forum meetings on behalf of Inria (where the MPI standard for communication in parallel applications is developed and maintained).
A proposal is currently under early discussion for submission to the forum.
Brice Goglin gave a tutorial about managing hardware affinities on hierarchical platforms with hwloc during a PRACE Advanced Training Center session.
François Pellegrini gave a “hands-on” tutorial on Scotch during a meeting of the European project COLOC.
Emmanuel Jeannot is a member of the scientific committee of the Labex IRMIA (Université de Strasbourg).
Emmanuel Jeannot is the head of the young researcher commission of Inria Bordeaux Sud-Ouest, in charge of supervising the hiring of the PhD students and post-docs of the center.
Members of the TADaaM project gave hundreds of hours of teaching at Université de Bordeaux and the Bordeaux INP engineering school, covering a wide range of topics from basic use of computers and C programming to advanced topics such as computer architecture, operating systems, parallel programming and high-performance runtime systems, as well as software law.
PhD in progress: Rémi Barat, multi-criteria graph partitioning, started in 2014. Advisor: François Pellegrini.
PhD in progress: Raphaël Blanchard, parallelization and data distribution of discontinuous Galerkin methods for complex flow simulations, started in 2013. Advisor: François Pellegrini.
PhD in progress: Nicolas Denoyelle, advanced memory hierarchies and new topologies, started in 2015. Advisors: Brice Goglin and Emmanuel Jeannot.
PhD in progress: Benjamin Lorendeau, new programming models and optimization of Code Saturn, started in 2015. Advisors: Yvan Fournier and Emmanuel Jeannot.
PhD in progress: Hugo Taboada, communication progression in runtime systems, started in 2015. Advisors: Alexandre Denis and Emmanuel Jeannot.
PhD in progress: Adèle Villiermet, topology-aware resource management, started in 2014. Advisors: Emmanuel Jeannot and Guillaume Mercier.
PhD stopped: Romain Prou, communication management based on remote memory access, student resigned in October 2016. Advisors: Alexandre Denis and Emmanuel Jeannot.
Brice Goglin was a member of the PhD defense committee of:
Mohamed Lamine Karaoui (Université Pierre et Marie Curie, Reviewer).
Emmanuel Jeannot was a member of the PhD defense committee of:
Loïc Thiébault (Université de Versailles Saint-Quentin, Reviewer).
François Pellegrini was a member of the PhD defense committee of:
Karl-Eduard Berger (Université de Versailles Saint-Quentin);
Alessandro Fanfarillo (Università degli Studi di Roma Tor Vergata, Reviewer);
Thomas Hume (Université de Bordeaux);
Sébastien Morais (Université Évry Val d'Essonne, Reviewer).
Brice Goglin is in charge of the diffusion of scientific culture for the Inria Research Center of Bordeaux. He organized several popularization activities in the center. He also gave several talks about computer architecture, high performance computing, and research careers to general public audiences, school students, teachers, and even non-expert Inria colleagues.
Brice Goglin was involved in the design of the section about fundamentals of computer science in the 2017 massive open online course that will help teachers of the new ICN section in schools (Informatique et Création Numérique). It was filmed as 10 video sequences (about an hour in total).
François Pellegrini was filmed during a 3-hour conference on author's rights, in the context of the MAPI'Days, to serve as an on-line training for personnel and students of Université de Bordeaux (https://
François Pellegrini is the author of an opinion piece on digital sovereignty in newspaper Le Monde (http://
François Pellegrini is the co-author of a booklet on free/libre software licenses edited by Pôle Systematic Paris Région & Pôle Aquinetic, which is now in its second edition (http://
In the context of the decree authorizing the TES (Titres Électroniques Sécurisés) file, François Pellegrini published a set of three blog posts (starting with http://
François Pellegrini delivered a talk on Freedom and the ethics of informatics during a seminar on Technologies, ethics and cognition organized by the Buddhist group Dhagpo Bordeaux, in partnership with Cap Sciences and Université de Bordeaux (http://
François Pellegrini was filmed, during an interview on Innovation and free/libre licenses, for the ULab Innov+ MOOC.