2025 Activity report
Project-Team TOPAL
RNSR: 202324391S
- Research center: Inria Centre at the University of Bordeaux
- In partnership with: Bordeaux INP, Université de Bordeaux, CNRS
- Team name: Tools and Optimization for high Performance Applications and Learning
- In collaboration with: Laboratoire Bordelais de Recherche en Informatique (LaBRI)
Creation of the Project-Team: 2023 March 01
Keywords
Computer Science and Digital Science
- A1.1.4. High performance computing
- A1.1.5. Exascale
- A1.1.9. Fault tolerant systems
- A1.2.10. Digital Communications
- A1.3. Distributed Systems
- A1.3.4. Peer to peer
- A1.3.5. Cloud
- A1.6. Green Computing
- A2.6.4. Resource management
- A6.2.5. Numerical Linear Algebra
- A6.2.7. HPC for machine learning
- A7.1. Algorithms
- A7.1.1. Distributed algorithms
- A7.1.2. Parallel algorithms
- A8.1. Discrete mathematics, combinatorics
- A8.2. Optimization
- A9.2. Machine learning
- A9.2.4. Optimization and learning
- A9.2.6. Neural networks
- A9.2.8. Deep learning
- A9.7. AI algorithmics
- A9.9. Distributed AI, Multi-agent
Other Research Topics and Application Domains
- B4.2.2. Fusion
- B9.5.1. Computer science
- B9.5.2. Mathematics
1 Team members, visitors, external collaborators
Research Scientists
- Olivier Beaumont [Team leader, INRIA, Senior Researcher, HDR]
- Lionel Eyraud Dubois [INRIA, Researcher]
- Yulia Gusak [INRIA, Researcher]
- Thomas Herault [INRIA, Senior Researcher, HDR]
- Laercio Lima Pilla [CNRS, Researcher]
Faculty Members
- Aurélien Esnard [UNIV BORDEAUX, Associate Professor]
- Mathieu Faverge [BORDEAUX INP, Associate Professor]
- Abdou Guermouche [UNIV BORDEAUX, Associate Professor, HDR]
- Pierre Ramet [UNIV BORDEAUX, Professor, HDR]
- Philippe Swartvagher [BORDEAUX INP, Associate Professor]
PhD Students
- Adrien Aguila–Multner [INRIA]
- Giorgio Bettonte [HIVE COMPUTING SERVICES SAS, CIFRE, from Oct 2025]
- Abel Anas Calluaud [CEA, CIFRE, until Oct 2025]
- Jean Conan [BULL, CIFRE]
- Jean Francois David [INRIA, until Feb 2025]
- Andrei Drozdov [DIABOLOCOM, CIFRE, from Oct 2025]
- Alan Lira Nunes [INRIA and UFF, Joint-doctorate (cotutelle) with UFF, Brazil]
- Alycia Lisito [BULL]
- Samuel Mendoza [INRIA, from Sep 2025]
- Brieuc Nicolas [INRIA]
- Hayfa Tayeb [INRIA, until Mar 2025]
- Dimitri Walther [CEA, CIFRE]
Technical Staff
- Pierre Estérie [INRIA, Engineer]
Interns and Apprentices
- Fares Boudjaoui [INRIA, Apprentice, from Dec 2025]
- Raphael Bourgouin [INRIA, from Oct 2025]
- Raphael Bourgouin [INRIA, Intern, from May 2025 until Aug 2025]
- Raphael Bourgouin [INRIA, until Apr 2025]
- Killian Chateau [INRIA, Intern, until Apr 2025]
- Enrique Galvez [INRIA, Intern, until Jan 2025]
- Theo Grandsart [INRIA, from Nov 2025]
- Theo Grandsart [INRIA, Intern, from May 2025 until Aug 2025]
- Mohamed Kherraz [INRIA, Intern, from Apr 2025 until Sep 2025]
- Matteo Marcos [INRIA, Intern, from Mar 2025 until Jul 2025]
- Samuel Mendoza [INRIA, from Apr 2025 until Aug 2025]
- Zhaniya Nurkhanova [INRIA, Intern, until Apr 2025]
- Joachim Robert [INRIA, Intern, from Apr 2025 until Aug 2025]
- Victor Lucas Rosada Canesin [INRIA, Intern, from Sep 2025 until Sep 2025]
- Victor Lucas Rosada Canesin [INRIA, Intern, from May 2025 until Aug 2025]
Administrative Assistants
- Catherine Cattaert Megrat [INRIA]
- Marie-Melissandre Roy [INRIA]
2 Overall objectives
The expertise of the team lies at the intersection of numerical simulation, training and HPC. In this context, the ability to use effectively the ever-increasing power of machines for numerical simulations (with the shift to exascale in the next few years) remains central. These new platforms are characterized by their huge size (in terms of number of cores) and the heterogeneity of computing resources, with most of the computational power provided by accelerators. We have largely anticipated these evolutions: in particular, the members of the team have been working for several years to promote the use of dynamic runtimes such as StarPU, through a long-running collaboration with the Storm project team. Runtime systems allow heterogeneous resources to be used transparently and allow some placement and scheduling decisions to be made dynamically, without the need for static planning in advance. Indeed, a fully static allocation would not be able to cope with the uncertainties of task and communication durations in increasingly complex environments and with increasingly shared resources. The question of scaling up these solutions, their use in (Neural Network) training and the effective management of large-scale distributed machines in particular, remains largely open.
As in many other fields, Machine Learning is changing the landscape at many levels. Training of large networks represents a new application for HPC because of the huge computational and memory needs it generates. Training has become a major source of use for converged HPC systems such as the Jean Zay supercomputer at IDRIS. If considered as an HPC workflow, it is an application that is quite different from traditional numerical simulation applications, because the calculations are tensor-based rather than matrix-based and because the nature of the dependencies makes parallelization more difficult and more intertwined with memory management issues.
On the other hand, ML plays a central role in the analysis of data, particularly data produced by large scientific instruments and large numerical simulations. In this context, it is important to bridge the data placement, resource allocation and computational scheduling strategies that are used to perform simulations and to perform data analysis. There again, we believe that dynamic runtime schedulers, coupled with static data placement strategies, are a relevant and promising tool. Finally, training represents a very important market and has a strong and growing influence on processor architectures, their accuracy and their arithmetic. This requires further adapting the algorithms, the management of ever-increasing heterogeneity and the control of computational accuracy, both for classical numerical kernels and for training deep neural networks.
Another major concern is the control of energy consumption and the minimization of the carbon footprint. HPC has not historically been an area of energy sobriety, but energy is a critical issue. Firstly, energy is a major subject because the race towards exascale has highlighted the difficulty of electrically powering all these resources, and the increasing presence of dark silicon in computing resources makes resource allocation and power management problems extremely difficult. Furthermore, the minimization of our carbon footprint is a major societal issue and must be an axis of evaluation for our research. In this context, we believe that the solution cannot lie only at the architecture and system levels, but that it is necessary to rethink parallel numerical kernels and algorithms in such a way as to allow prolonged use of the computing resources.
A new development in the team’s research is the explicit focus on communication efficiency and fault tolerance as central challenges of modern high-performance computing. As platforms continue to scale in size and heterogeneity, the cost of data movement increasingly dominates execution time, while hardware and software failures can no longer be considered exceptional. Addressing these issues requires approaches that integrate communication management and resilience directly into algorithms and runtime systems, rather than treating them as external concerns. This problem statement is particularly relevant to the team’s main application domains—linear algebra and machine learning—where large-scale data exchanges, iterative methods, and long-running computations make performance and robustness tightly coupled.
Overall, the objective of the project is to transfer our historical expertise in linear algebra, runtime systems and combinatorial optimization (resource allocation, scheduling) to new problems (decompositions and tensor algebra, training of DNNs) which require a change of scale and new algorithms for new computing platforms (with different number representations and an ever-increasing heterogeneity of computing resources). In addition, these new applications and new platforms require a central focus on data, since the gap between the costs (in energy and time) of storing and moving data and the costs of computation keeps growing. This encourages innovative solutions (compression, redundant computation) that can in turn contribute to increasing the duration of use of computing resources.
3 Research program
3.1 Objectives
We propose to structure our research around two main application fields (see Section 4): linear multi-dimensional algebra and solvers on the one hand, and training in particular of deep learning networks on the other hand. In these two domains, our contributions will be organized around three main research axes (see Section 3.3): the use of task based runtime systems (to provide robust solutions and to increase the portability in the context of heterogeneous large scale platforms), the use of compression (to limit memory footprint and data transfers) and the minimization of energy consumption and carbon impact (using an approach of rewriting algorithms and placement strategies to limit data movements). This matrix organization of our activities (see Section 3.4) is intended to maximize the interactions between the different researchers of the team and facilitate knowledge sharing and joint participation in projects.
Among these topics, the use of task-based runtime systems and the design of efficient linear algebra kernels and solvers belong to the historical expertise of the team and are shared by all team members, especially in the context of linear algebra kernels. Our goal is to build on this expertise to extend the use of task-based runtime systems to other types of applications such as training, and to use the precise knowledge of these linear algebra kernels to incorporate new criteria such as energy minimization. The application to training (and inference) in deep neural networks and data compression are subjects we have been interested in for a few years, typically during the last HiePACS evaluation period and within the Inria Challenge of AI, HPC and Big Data led by Bruno Raffin. The extension of the techniques developed in linear algebra to tensor algebra and tensor decompositions is natural, given the proximity of the fields and the practical importance of the subject, but it is more recent and reinforced by the arrival of Yulia Gusak, who is an expert in the field. Finally, the objective of energy and carbon footprint minimization, at the algorithmic and software levels rather than at the architecture level, is a field that we wish to emphasize in our research, both because of its own fundamental importance and because we believe that our expertise and the techniques that we have developed in recent years are well adapted to it and that the approach we propose is original.
3.2 Overall Positioning
The general positioning of the team is to produce tools for users, academic or industrial, in the form of algorithms and software libraries. These users can work either in numerical simulation or in training. Nevertheless, as our experiences in simulation and training have already demonstrated, this interaction cannot be carried out in the form of providing black boxes and it is crucial for us to work directly with the users of our software to understand their needs and adapt our algorithms and codes to the characteristics of their data. This interaction will be particularly critical to work on data representation and compression, which requires a strong interaction with numerical methods and machine learning in order to understand the application requirements and the characteristics of data, based on their significance.
At the other end of the spectrum, it is also essential for us to maintain close relationships with both the architecture and system communities. Indeed, the very rapid growth of machine learning applications has also renewed the landscape of computing resources with the emergence of very original solutions, at the architectural and arithmetic levels. Even if we cannot influence these evolutions, it is very important to propose solutions that make the best use of them. We also decided several years ago to rely on task-based runtime systems to implement our software developments. This decision has many implications for our developments and requires an extremely close collaboration with their designers. In this context, we have co-supervised several PhD theses related to StarPU with the Storm project team and we will pursue this strategy, which is crucial in particular to take into account the challenges ahead of us: the transition to exascale, the integration of energy, the extension to training applications and the ever-increasing heterogeneity of computing resources.
3.3 Research Axes
3.3.1 Use of Runtime systems
Participants: Olivier Beaumont, Aurélien Esnard, Lionel Eyraud Dubois, Mathieu Faverge, Abdou Guermouche, Thomas Herault, Laércio Lima Pilla, Philippe Swartvagher.
In previous work, our main goal was to study the methodology needed to efficiently exploit the new generation of high-performance computers with all the constraints that it induces (number of cores, heterogeneity, co-scheduling effects, etc.). To achieve this goal, we successfully proposed a methodology based on the use of modern task-based runtime systems to ensure both portability and performance portability (the ability to achieve high performance by tuning only a few parameters of the application). This work was done in the context of several projects (ANR Solhar, ANR SOLHARIS, Projet Région HPC Scalable Ecosystem, etc.). It mainly targeted single multicore nodes equipped with several accelerator devices. The extension of these techniques to the multi-node case will be the focus of our future work, especially with the arrival of Philippe Swartvagher in the team. Indeed, it has been observed that in the context of distributed nodes, the placement strategies of runtime systems are insufficient and generate too much communication. In this context, it is therefore crucial to develop efficient placement strategies 60, 53. The extension of these mixed (static/dynamic) strategies to the case of tensors is largely open.
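As a point of reference for the static part of such placement strategies, the sketch below shows the classical 2D block-cyclic mapping of matrix tiles onto a process grid. It is only an illustration of a standard baseline, not the strategies of references 60, 53; the function names and the grid shape are assumptions chosen for the example.

```python
from collections import Counter


def block_cyclic_owner(i, j, p, q):
    """Rank owning tile (i, j) on a p x q process grid (row-major ranks)."""
    return (i % p) * q + (j % q)


def tiles_per_process(nt, p, q):
    """Number of tiles held by each rank for an nt x nt tiled matrix."""
    return Counter(
        block_cyclic_owner(i, j, p, q) for i in range(nt) for j in range(nt)
    )


def row_owners(i, nt, p, q):
    """Ranks that hold at least one tile of tile-row i."""
    return {block_cyclic_owner(i, j, p, q) for j in range(nt)}


def column_owners(j, nt, p, q):
    """Ranks that hold at least one tile of tile-column j."""
    return {block_cyclic_owner(i, j, p, q) for i in range(nt)}


if __name__ == "__main__":
    # Hypothetical setting: 6 x 6 tiles on a 2 x 3 grid of processes.
    print(tiles_per_process(6, 2, 3))
    print(len(row_owners(0, 6, 2, 3)), len(column_owners(0, 6, 2, 3)))
```

With this mapping, a tile broadcast along a row reaches at most q ranks and a broadcast along a column at most p ranks, which is why the shape of the grid directly affects the communication volume.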
3.3.2 Design of compression techniques
Participants: Abdou Guermouche, Yulia Gusak, Mathieu Faverge, Pierre Ramet, Philippe Swartvagher.
The memory consumption of applications has been and will remain an important challenge for solving the larger problems that will lead to exascale computations. In recent years, we have demonstrated the value of data compression techniques in linear solvers, both to save space and to reduce computations. Increasingly complex compression schemes require programming models to evolve to properly express the parallelism of these formats and to accommodate the increasing irregularity of applications. In TOPAL, we propose to continue the study of data compression techniques (low-rank, mixed precision, ...) in the context of solvers, but also in the context of training and multi-linear algebra. This will be a very relevant field for the study of applications over runtime systems, because of the strong irregularities that make load balancing more complicated. At the same time, it is an original and promising approach for energy reduction. Representing convolutional and fully-connected weights in tensor formats is an effective way to reduce the number of parameters and FLOPs in neural networks. However, post-quantization (reduction of parameter precision, for example from float32 to int8) of networks with factorized weights yields a significant drop in accuracy. Due to the memory and power consumption limitations of real devices, the quantization step is necessary when pre-trained models are deployed. Therefore, our goal is to find algorithms that build tensorized neural networks whose weight factors directly contain elements in low-precision format. Efficient implementation of operations on tensors represented in low-bit format will be required, as well as the development of regularization techniques to tackle instability issues when training deep learning models with low-bit weights.
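To make the two ingredients above concrete, here is a minimal sketch combining the parameter count of a rank-r factorization of a dense layer with a symmetric per-tensor int8 quantization round trip. The layer sizes, the rank and the quantization scheme are assumptions chosen for illustration; they do not describe the algorithms the team intends to design.

```python
def dense_params(m, n):
    """Parameters of a dense m x n weight matrix."""
    return m * n


def low_rank_params(m, n, r):
    """Parameters of a rank-r factorization W ~ U V, U: m x r, V: r x n."""
    return r * (m + n)


def quantize_int8(values):
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map int8 codes back to floating-point values."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    # Hypothetical 1024 x 1024 layer factorized at rank 64.
    m, n, r = 1024, 1024, 64
    print("float32 dense bytes:", 4 * dense_params(m, n))
    print("int8 factor bytes:  ", 1 * low_rank_params(m, n, r))

    weights = [0.5, -1.27, 0.013, 1.0]
    codes, scale = quantize_int8(weights)
    print(codes, dequantize(codes, scale))
```

The round-trip error of each weight is bounded by half the quantization step, so small entries of a factor (such as 0.013 above) lose most of their relative precision, which gives some intuition for why quantizing factors after the fact can be harmful.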
3.3.3 Energy minimization
Participants: Olivier Beaumont, Lionel Eyraud Dubois, Mathieu Faverge, Abdou Guermouche, Yulia Gusak, Laércio Lima Pilla.
Running computations with resource frugality is an important challenge, both for the upcoming exascale shift and for generally reducing the carbon impact of scientific computing. In addition to the usual objective of making computations run faster, we thus intend to design and evaluate our techniques and algorithms with the purpose of limiting their carbon footprint. In particular, given the lasting trend that the time and energy costs of computing are becoming ever lower than the costs of accessing and communicating data, we want to explore the tradeoffs of trading more computation for less data movement. This can be achieved in several ways: compression techniques as described above, replication of some computations, or use of lower precision. We are planning to work on this issue from two points of view: more frugal numerical algorithms, and energy-aware scheduling techniques. As with the embedded architectures in phones, and also in the latest generation of laptops (Apple M1 Pro and Max chips), we are starting to see the emergence of Big-Little type technologies in the design of HPC-oriented chips. In general, thermal design power (TDP) constraints push architects to increase the diversity and number of energy-efficient circuits, even if they cannot all be powered simultaneously. While this hardware solution is very debatable from the point of view of carbon impact, it raises difficult and original questions about the optimization of computing performance under energy constraints. This kind of approach opens new perspectives, both for scheduling algorithms and for the design of computational kernels in linear algebra.
We are also seeing the emergence of new processors (ARM or RISC-V technologies, Rhea from the SiPearl company within the EPI consortium), which should seriously compete with the supremacy of x86 architectures (Intel and AMD) with Nvidia accelerator cards in the search for a compromise between pure performance and energy sobriety.
In the field of training, a complementary opportunity is available. Indeed, contrary to classical HPC, the renewal of computational resources is often linked to the need to run larger models (and, to a lesser extent, data with a better resolution), rather than to the acceleration of computations. In this context, the possibility offered by tools such as Rotor 7.1.8, Rockmate 7.1.7, ELF 7.1.2 to limit memory requirements contributes to limiting the carbon footprint. Our goal is to extend the scope of these techniques, including to fields of application other than training. Our collaboration with Qarnot Computing is consistent with this objective. The co-design environments of the TextaRossa and Eupex projects 10 are also great avenues to explore these questions.
3.3.4 Communication and Fault Tolerance
Participants: Olivier Beaumont, Lionel Eyraud Dubois, Thomas Herault, Philippe Swartvagher.
The new research axis on communication and fault tolerance represents an opportunity for the team to address a broader spectrum of challenges arising in modern high-performance computing platforms. As applications increasingly rely on large numbers of interconnected components, communication costs and failures have become central limitations to performance, scalability, and usability. This axis builds on the expertise brought by the arrival of Thomas Hérault as a research director, together with the team’s existing strengths in communication systems through Philippe Swartvagher, to explore new techniques spanning communication optimization, resilience mechanisms, and their interaction with runtime systems. By covering a wider range of problems and solution strategies, this axis naturally complements the existing research directions of the team and reinforces their applicability to the targeted application domains, enabling more scalable, efficient, and robust executions on current and future computing platforms.
3.4 Main Research Topics
The list of our contributions can be read at the intersection of the research domains described in Section 4 and research axes described in Section 3.3 as shown in the following table:
| | Axis 3.3.1 – Runtime | Axis 3.3.2 – Compression | Axis 3.3.3 – Energy | Axis 3.3.4 – Comm. & Fault Tol. |
| Domain 4.1 – Lin. Alg., Tensors | Topic 3.4.1 | Topic 3.4.2 | Topic 3.4.3 | Topic 3.4.7 |
| Domain 4.2 – Training | Topic 3.4.4 | Topic 3.4.5 | Topic 3.4.6 | Topic 3.4.8 |
3.4.1 Task-based Linear Algebra and Tensor Computations
Participants: Olivier Beaumont, Aurélien Esnard, Lionel Eyraud Dubois, Mathieu Faverge, Abdou Guermouche, Thomas Herault, Pierre Ramet, Philippe Swartvagher.
We plan to continue our activity on task-based linear algebra to find solutions for expressing high-level algorithms in an elegant way while ensuring high performance. First, we want to consider the expressivity of the algorithms for large-scale distributed architectures while considering the specific problems of scheduling, data and task mapping, and data granularity. This work will be done in tight collaboration with the Storm and Tadaam teams and is a key objective of the ANR SOLHARIS project. Moreover, the foundations of this topic date back to the HiePACS project. Thus, we plan to collaborate and exchange with the CONCACE team on topics which are of interest to both teams (mainly expressivity and scalability). Second, as mentioned above, we plan to study data compression techniques in linear algebra 70, 75, 79, which bring new algorithmic schemes that are outside the scope of the classical programming model used until now. As mid- and long-term objectives, we would like to find new ways to express these linear algebra algorithms to efficiently exploit large heterogeneous architectures. A second research topic focuses on the extension of the techniques developed in the framework of linear algebra, in particular with the Chameleon library, to multi-linear algebra and tensors. The idea is to build on the expertise we have in the field of compression and in the use of runtimes, in particular to exploit heterogeneous resources.
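To illustrate what expressing a linear algebra algorithm as tasks means, the sketch below enumerates the tasks of the well-known tiled Cholesky factorization and infers dependencies from the tiles each task reads and writes, in the spirit of a sequential task flow model. It is a pure-Python illustration and does not use the API of Chameleon or StarPU; all function names are assumptions.

```python
from collections import Counter


def tiled_cholesky_tasks(nt):
    """List (kernel, written_tile, read_tiles) in sequential submission order."""
    tasks = []
    for k in range(nt):
        tasks.append(("POTRF", (k, k), []))
        for i in range(k + 1, nt):
            tasks.append(("TRSM", (i, k), [(k, k)]))
        for i in range(k + 1, nt):
            tasks.append(("SYRK", (i, i), [(i, k)]))
            for j in range(k + 1, i):
                tasks.append(("GEMM", (i, j), [(i, k), (j, k)]))
    return tasks


def infer_dependencies(tasks):
    """For each task, indices of the last writers of the tiles it accesses.

    Only read-after-write and write-after-write dependencies are tracked,
    which is sufficient here because a tile is never overwritten after
    it has been read by a later step of this algorithm.
    """
    last_writer = {}
    deps = []
    for index, (_, written, read) in enumerate(tasks):
        accessed = [written, *read]
        deps.append(sorted({last_writer[t] for t in accessed if t in last_writer}))
        last_writer[written] = index
    return deps


if __name__ == "__main__":
    tasks = tiled_cholesky_tasks(4)
    print(Counter(kernel for kernel, _, _ in tasks))
    for task, pred in zip(tasks, infer_dependencies(tasks)):
        print(task[0], task[1], "depends on", pred)
```

The application only submits tasks in sequential order; the dependency graph, and therefore the available parallelism, is derived from the data accesses, which is what lets a runtime make placement and scheduling decisions dynamically.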
Another challenge would be to redesign the graph partitioning and matrix ordering algorithms on top of a task-based runtime, in order to facilitate the integration of this basic building block in modern task-based solvers. This work has already been initiated in the StarPart 7.1.5 project.
3.4.2 Multi-Linear Algebra and Tensor Decompositions
Participants: Olivier Beaumont, Lionel Eyraud Dubois, Mathieu Faverge, Abdou Guermouche, Yulia Gusak, Thomas Herault, Pierre Ramet.
Tensor decompositions can be viewed as a natural generalization of SVD-type matrix decompositions from linear algebra. In the tensor setting, several decomposition formats have been developed, each offering different trade-offs between expressiveness, computational cost, and compression efficiency. These methods play an important role in the analysis of large-scale data, as well as in the compression and inference/training acceleration of neural networks. The addition of Yulia Gusak to the project strengthens our expertise in this area 83, 65.
In addition to the basic kernels to be integrated in Chameleon proposed in Topic 3.4.1, we will propose distributed tensor decomposition and compression algorithms, focusing on low-order tensors with large mode dimensions, which are common in neural network models.
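The trade-off between formats can be made tangible by counting parameters. The sketch below compares a dense tensor with its CP and Tucker representations, using the standard storage formulas for these two formats; the tensor shape and ranks are hypothetical values chosen to resemble a convolutional kernel, a low-order tensor with two large modes.

```python
from math import prod


def dense_params(dims):
    """Entries of a dense tensor with mode sizes dims."""
    return prod(dims)


def cp_params(dims, rank):
    """CP format: one factor matrix of size n_k x rank per mode."""
    return rank * sum(dims)


def tucker_params(dims, ranks):
    """Tucker format: a core of size r_1 x ... x r_d plus one n_k x r_k factor per mode."""
    return prod(ranks) + sum(n * r for n, r in zip(dims, ranks))


if __name__ == "__main__":
    # Hypothetical 512 x 512 x 3 x 3 convolution kernel.
    dims = (512, 512, 3, 3)
    print("dense :", dense_params(dims))
    print("CP    :", cp_params(dims, rank=64))
    print("Tucker:", tucker_params(dims, ranks=(128, 128, 3, 3)))
```

Keeping the full rank on the two small spatial modes and compressing only the two large channel modes is a common choice in this setting; whether a given rank preserves accuracy is an empirical question that these counts do not answer.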
3.4.3 Energy Minimization in Linear Solvers
Participants: Mathieu Faverge, Abdou Guermouche.
We plan to investigate how to reduce the energy consumption of linear algebra libraries (either sparse or dense). To do so, we will rely on an algorithmic approach rather than a system approach. The idea, as a first step, is to consider several implementations of the same kernel and to select among them while taking energy consumption into account 51, 50, 52. For instance, a low-rank implementation of a given operation will be slower than a regular high-performance implementation but will tend to require less energy. In the longer term, we also plan to investigate how to design energy-efficient implementations of basic kernels. They will then be used within higher-level algorithms in order to find a better trade-off between energy consumption and high performance. In the context of developing linear algebra solvers using compression techniques, a research axis we would like to develop is the study of the energy consumption of these solvers: is it possible to provide computation kernels with different energy consumption levels that can be easily exchanged to lower the final energy consumption of the application while keeping the same numerical accuracy? Low-rank compression techniques, as well as mixed-precision solutions, are envisioned toward this objective.
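The first step described above, choosing among several implementations of the same kernel, can be sketched as a small selection rule: pick the least energy-hungry variant that still meets a time budget. The variant names and the time and energy figures below are placeholders for illustration, not measurements, and the selection rule itself is an assumption rather than the method of references 51, 50, 52.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KernelVariant:
    name: str
    time_s: float     # estimated execution time (placeholder value)
    energy_j: float   # estimated energy consumption (placeholder value)


def select_variant(variants, deadline_s):
    """Return the lowest-energy variant meeting the deadline.

    If no variant meets the deadline, fall back to the fastest one.
    """
    feasible = [v for v in variants if v.time_s <= deadline_s]
    if not feasible:
        return min(variants, key=lambda v: v.time_s)
    return min(feasible, key=lambda v: v.energy_j)


# Hypothetical variants: the low-rank one is slower but more frugal.
VARIANTS = [
    KernelVariant("dense", time_s=1.0, energy_j=300.0),
    KernelVariant("low_rank", time_s=1.4, energy_j=220.0),
]

if __name__ == "__main__":
    for deadline in (2.0, 1.2, 0.5):
        print(deadline, select_variant(VARIANTS, deadline).name)
```

In a task-based setting, such a rule could be applied per task, so that only tasks off the critical path are given the slower, more frugal variant.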
3.4.4 Task-based Approaches for Deep Learning
Participants: Olivier Beaumont, Lionel Eyraud Dubois, Mathieu Faverge, Abdou Guermouche, Yulia Gusak, Thomas Herault, Laércio Lima Pilla, Pierre Ramet, Philippe Swartvagher.
In popular Deep Learning frameworks like TensorFlow or PyTorch, the parallelization of the training process is performed with a large granularity, mostly relying on Data Parallelism. Specialized frameworks have been proposed to explore finer parallel schemes, like PipeDream for model parallelism 81. These implementations are however very static and require explicit and error-prone data management policies. We believe that our expertise in using task-based runtime systems can be used to provide much simpler approaches for finer-grained control over the execution of the corresponding task graphs and communication patterns, for both the training and inference phases. We plan to design a prototype implementation that would make it easy to use clever scheduling and optimization techniques to improve the performance of inference. In the longer term, we expect that this approach will provide better scalability and flexibility, and unlock new opportunities for optimization, for a wide range of deep learning applications.
3.4.5 Tensor Compression for Inference
Participants: Olivier Beaumont, Yulia Gusak.
We envision a research activity focused on the use of tensor compression for inference. Initially, the objective is to combine tensor compression techniques and quantization in order to enable inference under strict memory constraints or low-latency requirements 67, 83. These techniques can also be extended to the context of on-device training, which in particular requires memory-saving approaches 74. Finally, a more ambitious goal would be to combine these approaches with methods for designing neural networks that are inherently efficient in terms of memory usage 71.
3.4.6 Carbon Saving and Energy-Efficient Training
Participants: Olivier Beaumont, Lionel Eyraud Dubois, Yulia Gusak, Laércio Lima Pilla.
The training phase of Deep Neural Networks is notoriously resource-hungry, especially regarding its energy consumption. In recent years, we have proposed several algorithmic solutions (re-materialization 54, 5, offloading 57, their combination 55, pipelining 58, 36) to reduce the resource consumption of this training phase, with a focus on reducing the training time. We plan to broaden the scope of these studies by also taking energy usage into account. A heterogeneous context and a flexible runtime system, as planned in Topic 3.4.4, may also be an opportunity to reduce energy consumption by allocating some tasks, typically the non-critical ones, to the most efficient resources for them, or by selecting a different implementation with better energy efficiency. This can be seen as a generalization of mixed-precision techniques, which are also very popular in this context to help achieve better frugality. However, care must be taken not to degrade the convergence of the training phase. Moreover, the carbon footprint comes essentially from the manufacturing 82, 73 of the computing resources (GPUs), and the main goal is to avoid their renewal, as enabled by memory-saving techniques.
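As an illustration of re-materialization, the sketch below differentiates a chain of scalar layers twice: once storing every activation, and once keeping only periodic checkpoints and recomputing the missing activations segment by segment during the backward pass. It uses a fixed segment length, which is a simple textbook strategy and not the optimized schedules computed by the tools cited above; the layer function and all names are assumptions.

```python
import math


def layer(w, x):
    """Toy layer: y = tanh(w * x)."""
    return math.tanh(w * x)


def layer_grad(w, x):
    """Derivative of the toy layer with respect to its input x."""
    return w * (1.0 - math.tanh(w * x) ** 2)


def grad_store_all(weights, x0):
    """Gradient of the output w.r.t. x0, keeping every layer input."""
    inputs, x = [], x0
    for w in weights:
        inputs.append(x)
        x = layer(w, x)
    g = 1.0
    for w, xin in zip(reversed(weights), reversed(inputs)):
        g *= layer_grad(w, xin)
    return g, len(inputs)


def grad_checkpointed(weights, x0, segment):
    """Same gradient, keeping one checkpoint per segment of layers.

    Returns (gradient, peak number of stored activations, recomputed layers).
    """
    n = len(weights)
    checkpoints, x = {}, x0
    for i, w in enumerate(weights):
        if i % segment == 0:
            checkpoints[i] = x
        x = layer(w, x)
    g, peak, recomputed = 1.0, len(checkpoints), 0
    for start in sorted(checkpoints, reverse=True):
        end = min(start + segment, n)
        seg_inputs = [checkpoints[start]]
        for i in range(start, end - 1):
            seg_inputs.append(layer(weights[i], seg_inputs[-1]))
            recomputed += 1
        # Checkpoints plus the recomputed activations of the current segment.
        peak = max(peak, len(checkpoints) + len(seg_inputs) - 1)
        for i in range(end - 1, start - 1, -1):
            g *= layer_grad(weights[i], seg_inputs[i - start])
    return g, peak, recomputed


if __name__ == "__main__":
    weights = [0.5 + 0.1 * i for i in range(16)]
    print(grad_store_all(weights, 0.3))
    print(grad_checkpointed(weights, 0.3, segment=4))
```

On this 16-layer chain with segments of 4 layers, the checkpointed version stores at most 7 activations instead of 16, at the price of 12 recomputed layer evaluations, and returns the same gradient.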
3.4.7 Communication-Aware Resilience Patterns for Iterative Linear Algebra
Participants: Thomas Herault.
This year, our work in the Communication and Fault Tolerance axis addressed the growing need for efficient and resilient execution of large-scale iterative linear algebra algorithms on modern HPC platforms. At scale, such computations are increasingly limited by communication costs and are exposed to a wide range of faults, including process failures, silent data corruptions, and memory errors, which can no longer be treated as rare events. Our contributions explore communication-aware resilience strategies that integrate fault detection, recovery, and checkpointing directly into the algorithmic structure of iterative methods. By carefully controlling the frequency and granularity of verification, redundancy, and checkpointing mechanisms, we showed that it is possible to bound error propagation while significantly reducing overheads compared to classical replication-based approaches. A central outcome of this work is a set of analytical models and optimization techniques that guide the design of hierarchical and adaptive resilience patterns, balancing computation, communication, memory usage, and execution time under realistic system constraints such as bounded detection latency and fixed-time resource allocations. Although primarily evaluated on linear algebra solvers, these techniques are largely generic and directly applicable to other iterative workloads, including neural network training, where similar trade-offs between communication efficiency, redundancy, and robustness arise.
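For readers unfamiliar with how checkpointing frequency is usually reasoned about, the classical first-order Young/Daly approximation is a useful baseline: with checkpoint cost C and mean time between failures M, the fraction of time wasted with period T is approximately C/T + T/(2M), which is minimized at T = sqrt(2CM). The sketch below evaluates this baseline only; it is not the analytical model developed in the work described above, and the numerical values are placeholders.

```python
import math


def young_daly_period(checkpoint_cost_s, mtbf_s):
    """First-order optimal checkpointing period T = sqrt(2 * C * M)."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def expected_waste(period_s, checkpoint_cost_s, mtbf_s):
    """First-order fraction of time lost: checkpoint overhead plus re-execution."""
    return checkpoint_cost_s / period_s + period_s / (2.0 * mtbf_s)


if __name__ == "__main__":
    # Placeholder values: 60 s checkpoints, one failure per day on average.
    cost, mtbf = 60.0, 86400.0
    period = young_daly_period(cost, mtbf)
    for candidate in (period / 2, period, 2 * period):
        print(round(candidate), expected_waste(candidate, cost, mtbf))
```

Checkpointing more often than this period pays too much overhead, and checkpointing less often loses too much work per failure; the first-order formula assumes the checkpoint cost is small compared with the mean time between failures.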
3.4.8 Communication-Aware Resilience Patterns for Training and Inference
Participants: Olivier Beaumont, Fares Boudjaoui, Lionel Eyraud Dubois, Thomas Herault, Philippe Swartvagher.
The training and inference phases of modern machine learning workloads rely heavily on large-scale collective communications, which are traditionally designed under the assumption of stable, homogeneous, and reliable infrastructures. However, as execution platforms become increasingly heterogeneous and volatile, communication costs and failures have a growing impact on both performance and correctness. Building on our recent work on resilience patterns for iterative algorithms, we are initiating new research on communication-aware resilience mechanisms for distributed training and inference, with the goal of jointly addressing efficiency and robustness. This work, launched with the PhD of Farès Boudjaoui in the context of the Cupseli Inria challenge, explores adaptive communication schemes, fault-tolerant collectives, and dynamic reconfiguration strategies that can tolerate node unavailability, bandwidth variability, and network contention, while limiting synchronization and data movement overheads. By integrating resilience directly into communication patterns, rather than treating failures as exceptional events, we aim to support scalable and robust executions in both centralized HPC platforms and more decentralized environments, without degrading convergence or inference quality.
4 Application domains
4.1 Multi-Linear Algebra and Solvers
Participants: Olivier Beaumont, Aurélien Esnard, Lionel Eyraud Dubois, Mathieu Faverge, Abdou Guermouche, Yulia Gusak, Thomas Herault, Pierre Ramet, Philippe Swartvagher.
At the core of a large number of simulation tools, the resolution of large linear systems often represents the dominant part of the computing time. These linear solvers rely on a wide variety of numerical methods and algorithms. Massively parallel versions are required to support advances in multi-physics and multi-scale simulations, especially when targeting exascale platforms. The aim is therefore to address the major challenge of designing and building numerically robust solvers on top of runtime systems that can scale up and push back the limits of existing industrial codes by making full use of all computing resources such as CPUs, GPUs and other accelerator units. Following the ANR project SOLHARIS (and previously SOLHAR), we now have experience of the strong and weak scalability of sparse direct solvers on large-scale, distributed-memory, heterogeneous computers. These solvers already rely on asynchronous task-based parallelism 48, 49, 78, 47, rather than traditional and widely adopted message-passing and multithreading techniques. Indeed, modern runtime systems have proven to be good tools for the development of scientific computing applications 80, 62, 86, in particular in combination with compression 63, 85, 84, 59, 76 and communication-avoiding techniques 60, 53. This work can be extended naturally to multi-dimensional objects such as tensors. In the tensor case, we propose to extend the data distribution strategies to minimize communication and to use runtime systems to handle the variability and heterogeneity of computational resources. Finally, we have focused so far on minimizing the execution time, whereas energy efficiency is becoming a critical element. We therefore plan to revisit the algorithms and methods we developed in linear algebra, and those we propose to design for handling tensors, to allow the optimal use of the available hardware in order to guarantee the performance of the computations within a fixed energy budget.
4.2 Training and Inference for DNNs
Participants: Olivier Beaumont, Lionel Eyraud Dubois, Yulia Gusak, Thomas Herault, Laércio Lima Pilla, Pierre Ramet, Philippe Swartvagher.
The training phase of Deep Neural Networks has become an important source of HPC resource usage, and it is crucial to perform it efficiently on parallel architectures. To date, data parallelism has been the most widely used method, but the associated requirement to replicate all the weights on all computing resources causes memory issues at the level of each node and collective communication issues at the level of the platform.
In general, the overall shape of the dependency graphs associated with the feed forward training phase has characteristics (long dependencies) that generate a lot of memory needs and data exchange. However, there are multiple opportunities to address these problems by combining 55 re-computations 66, 54, 61, 77, 72, 5, offloading 57, compression and different parallelism strategies (image, filter, kernel, model parallelism 58, 81, 56, 74, 36). It is also promising to consider other more radical techniques to go beyond feed forward training, such as the use of multigrid reduction in time (MGRIT) 68, 69 that come from the field of numerical simulations and that we already address in other contexts.
Within this general framework, the minimization of the carbon footprint is obviously a major concern that must guide strategies. Tools to train complex and deep networks on otherwise obsolete hardware using memory-saving techniques are already a strong contribution in this direction, since they increase the lifetime of computing resources, and our goal is to extend these techniques in terms of efficiency and scope, even though they slightly increase the energy associated with the computations. As in the case of linear algebra, energy optimization also requires the use of heterogeneous computation resources (CPUs, GPUs, TPUs, FPGAs). Conversely, this heterogeneity hinders scalability because of difficulties in predicting task durations and makes the use of dynamic runtime schedulers necessary. Finally, the use of these dynamic runtimes also raises the question of what needs to be decided statically and what dynamically in terms of resource allocation and scheduling.
5 Social and environmental responsibility
5.1 Footprint of research activities
As part of our research activities, we use local computing resources such as PlaFRIM and the national computing resources of IDRIS and the TGCC.
The environmental impact of using these platforms is significant, whether for numerical simulation or training applications. However, given the positioning of the team, which produces simulation and training tools but does not directly perform simulations and training, our direct footprint is relatively limited. For example, in the case of training, we have so far concentrated on techniques that do not modify the architecture of the networks and the computations that are performed, so that the number of epochs and the final accuracy are not impacted. In this way, it is possible to validate our developments to accelerate training on a single batch (at full machine scale) and then to extrapolate the acceleration at the whole training scale. Similarly, the techniques developed in linear algebra in the team often do not depend (typically for dense approaches) on the numerical properties of the matrices, so that acceleration (for a given problem size) can be validated without heavy experimental campaigns, beyond what is necessary to obtain valid experimental results in complex environments where performance varies from one experiment to another.
In this context, the use of simulation as opposed to direct experimentation is also a tool that enables us to limit the impact of our research on power consumption, since simulation can save several orders of magnitude in power consumption compared with direct experimentation. It is therefore crucial to produce simulation tools that are as precise and generic as possible, and the team has been actively collaborating for many years in the development of simulation tools such as SimGrid.
Nevertheless, the tools we produce are used on a large scale in terms of computation resources and simulation/training time, and the associated energy consumption issue is therefore indirectly crucial. In this context, we are developing original solutions for reusing the heat dissipated by computation resources, in particular as part of the Inria-Qarnot Computing Pulse challenge (see Section 5.2). We have also added a research axis aimed at minimizing energy consumption for a given kernel (Section 3.3.3).
TOPAL has also signed the "Labos en transitions" Charter of Commitment for research facilities on the Bordeaux university site whose preamble states that "Faced with contemporary environmental and societal challenges, and the urgent need for systemic transformation to meet them, the academic world has a particular responsibility: to promote responsible research, aware of environmental issues and respectful of the people who produce it, which contributes to transitions and enables us to understand and guide current and future societal transformations". In exchange for this commitment, the establishments undertake to provide us with an estimate of the impact of our research activities (including the purchase of equipment and missions). At this stage, this information is difficult to aggregate at team level, but making it available will enable us to measure our progress and involvement.
5.2 Impact of research results
5.2.1 Carbon Impact of Cloud Platforms
To limit the environmental impact of cloud computing, Qarnot focuses on re-using the heat produced by computations in heat circuits or boilers. As part of the Pulse Inria challenge, we are working with Qarnot on algorithms for placing computations on their infrastructure, so as to maximize the use of reusable heat sources, depending on computation demand and task characteristics. The aim is to enable users of the Qarnot platform to specify their objective function on the (carbon footprint, time, cost) axes, and to be able to meet it.
Our activities with Hivenet, conducted within the framework of the Cupseli challenge, complement this approach. In the long term, one of Cupseli’s objectives is to enable the use of distributed computing resources—typically owned by gaming venues—to carry out inference and learning tasks. The aim is therefore to extend the lifespan and usage of these computing resources by providing them with practical utility and added value.
5.2.2 Democratization of Large Models Training
In the context of training, at one end of the spectrum we see the provision of computing resources, such as the Jean Zay supercomputer, whose efficient use requires large-scale parallel training algorithms and frameworks to optimize resource utilization and accelerate time to discovery. At the other end of the spectrum, we see the importance of enabling researchers from different communities to use the resources at their disposal (often just a few GPUs) to develop original models without being constrained by hardware limitations. In particular, recent transformer-based models are very heavy-weight, and techniques must be employed to run them on GPUs that are only a few years old, without compromising data quality, computational accuracy, or model size. The Topal team has been working for several years on memory-saving strategies to enable the training of large models on limited-capacity resources (re-materialization and offloading), and on software (Section 7) such as Rotor and Rockmate, which are recognized and visible in the AI applications community and enable researchers with access to limited-capacity resources to train large models. The recent ELF software (Section 7.1.2) has been developed to optimize multi-node, multi-GPU training using various types of parallelism and memory-saving techniques. While remaining user-friendly, it supports the easy integration of custom strategies and has been validated at large scale during the NVIDIA–OpenACC IDRIS'25 hackathon through the training of large language models and diffusion models.
6 Highlights of the year
Best Paper Award for “Scheduling Strategies for Partially-Replicable Task Chains on Two Types of Resources” 22 in Heterogeneity in Computing Workshop (HCW) - Participant: Laércio Lima Pilla.
The Inria/Hivenet Cupseli challenge is co-led by Olivier Beaumont (Topal) and Alexandru Dobrila (Hivenet) and brings together 11 Inria teams along with researchers from Hivenet, representing a total of around thirty permanent staff members. Over a four-year period, it plans for the recruitment of nine PhD students, two postdoctoral researchers, and three engineers. The project kickoff meeting took place on September 25, 2025.
7 Latest software developments, platforms, open data
7.1 Latest software developments
7.1.1 Chameleon
-
Keywords:
Runtime system, Task-based algorithm, Dense linear algebra, HPC, Task scheduling
-
Scientific Description:
Chameleon is part of the MORSE (Matrices Over Runtime Systems @ Exascale) project. The overall objective is to develop robust linear algebra libraries relying on innovative runtime systems that can fully benefit from the potential of those future large-scale complex machines.
We expect advances in three directions based first on strong and close interactions between the runtime and numerical linear algebra communities. This initial activity will then naturally expand to more focused but still joint research in both fields.
1. Fine interaction between linear algebra and runtime systems. On parallel machines, HPC applications need to take care of data movement and consistency, which can be either explicitly managed at the level of the application itself or delegated to a runtime system. We adopt the latter approach in order to better keep up with hardware trends whose complexity is growing exponentially. One major task in this project is to define a proper interface between HPC applications and runtime systems in order to maximize productivity and expressivity. As mentioned in the next section, a widely used approach consists in abstracting the application as a DAG that the runtime system is in charge of scheduling. Scheduling such a DAG over a set of heterogeneous processing units introduces a lot of new challenges, such as predicting accurately the execution time of each type of task over each kind of unit, minimizing data transfers between memory banks, performing data prefetching, etc. Expected advances: In a nutshell, a new runtime system API will be designed to allow applications to provide scheduling hints to the runtime system and to get real-time feedback about the consequences of scheduling decisions.
2. Runtime systems. A runtime environment is an intermediate layer between the system and the application. It provides low-level functionality not provided by the system (such as scheduling or management of the heterogeneity) and high-level features (such as performance portability). In the framework of this proposal, we will work on the scalability of runtime environments. To achieve scalability, all centralization must be avoided. Here, the main problem is the scheduling of the tasks. In many task-based runtime environments the scheduler is centralized and becomes a bottleneck as soon as too many cores are involved. It is therefore necessary to distribute the scheduling decisions or to compute a data distribution that imposes the mapping of tasks using, for instance, the so-called “owner-compute” rule. Expected advances: We will design runtime systems that enable an efficient and scalable use of thousands of distributed multicore nodes enhanced with accelerators.
3. Linear algebra. Because of its central position in HPC and of the well understood structure of its algorithms, dense linear algebra has often pioneered new challenges that HPC had to face. Again, dense linear algebra has been in the vanguard of the new era of petascale computing with the design of new algorithms that can efficiently run on a multicore node with GPU accelerators. These algorithms are called “communication-avoiding” since they have been redesigned to limit the amount of communication between processing units (and between the different levels of memory hierarchy). They are expressed through Directed Acyclic Graphs (DAG) of fine-grained tasks that are dynamically scheduled. Expected advances: First, we plan to investigate the impact of these principles in the case of sparse applications (whose algorithms are slightly more complicated but often rely on dense kernels). Furthermore, both in the dense and sparse cases, the scalability on thousands of nodes is still limited, and new numerical approaches need to be found. We will specifically design sparse hybrid direct/iterative methods that represent a promising approach.
Overall end point. The overall goal of the MORSE associate team is to enable advanced numerical algorithms to be executed on a scalable unified runtime system for exploiting the full potential of future exascale machines.
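The “owner-compute” rule mentioned in item 2 can be illustrated with a short sketch. It assumes a 2D block-cyclic distribution of tiles over a p x q process grid and maps each task to the process that owns the tile it writes; the function names are hypothetical and this is not the Chameleon or StarPU API.

```python
def owner(i: int, j: int, p: int, q: int) -> int:
    """Rank owning tile (i, j) in a 2D block-cyclic distribution over a p x q grid."""
    return (i % p) * q + (j % q)


def map_gemm_tasks(nt: int, p: int, q: int) -> dict[int, list[tuple[int, int, int]]]:
    """Map each update C[i][j] += A[i][k] * B[k][j] to the owner of the written tile C[i][j].

    No scheduling decision is needed at run time: the data distribution alone
    determines where every task executes.
    """
    mapping: dict[int, list[tuple[int, int, int]]] = {rank: [] for rank in range(p * q)}
    for i in range(nt):
        for j in range(nt):
            for k in range(nt):
                mapping[owner(i, j, p, q)].append((i, j, k))
    return mapping


if __name__ == "__main__":
    tasks_per_rank = map_gemm_tasks(nt=4, p=2, q=2)
    for rank, tasks in tasks_per_rank.items():
        print(f"rank {rank}: {len(tasks)} tasks")
```

With this rule the mapping is fully decentralized, at the price of a load balance that depends entirely on the chosen distribution.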
-
Functional Description:
Chameleon is a dense linear algebra software relying on sequential task-based algorithms where sub-tasks of the overall algorithms are submitted to a runtime system. A runtime system such as StarPU is able to automatically manage data transfers between non-shared memory areas (CPUs-GPUs, distributed nodes). This implementation paradigm makes it possible to design high-performance linear algebra algorithms on very different types of architecture: laptops, many-core nodes, CPUs-GPUs, multiple nodes. For example, Chameleon is able to perform a Cholesky factorization (double-precision) at 80 TFlop/s on a dense matrix of order 400 000 (i.e. 4 min 30 s).
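To make the sequential task-based formulation concrete, the sketch below enumerates, in submission order, the tasks of a right-looking tiled Cholesky factorization, and checks the order of magnitude of the quoted figure using the leading term n^3/3 of the Cholesky operation count. The kernel names follow BLAS/LAPACK conventions; the Python functions themselves are illustrative assumptions, not the Chameleon API.

```python
def cholesky_tasks(nt: int) -> list[tuple]:
    """Tasks of a tiled Cholesky (lower triangular part) on nt x nt tiles, in submission order.

    A runtime system would infer the dependencies from the tiles each task
    reads and writes; here we only list the tasks.
    """
    tasks = []
    for k in range(nt):
        tasks.append(("POTRF", k, k))            # factorize the diagonal tile
        for i in range(k + 1, nt):
            tasks.append(("TRSM", i, k))         # solve the panel tiles below it
        for i in range(k + 1, nt):
            tasks.append(("SYRK", i, k))         # update the diagonal tile (i, i)
            for j in range(k + 1, i):
                tasks.append(("GEMM", i, j, k))  # update the off-diagonal tile (i, j)
    return tasks


def cholesky_time_seconds(n: int, flops_per_second: float) -> float:
    """Time to perform n^3 / 3 floating-point operations at a sustained rate."""
    return (n ** 3 / 3.0) / flops_per_second


if __name__ == "__main__":
    print(len(cholesky_tasks(4)), "tasks for a 4 x 4 tile matrix")
    seconds = cholesky_time_seconds(400_000, 80e12)
    print(f"{seconds:.0f} s, i.e. about {seconds / 60:.1f} min")
```

At 80 TFlop/s, a matrix of order 400 000 requires roughly 267 s, which is consistent with the 4 min 30 s reported above.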
-
Release Contributions:
Chameleon includes the following features:
- BLAS 3, LAPACK one-sided and LAPACK norms tile algorithms
- Support QUARK and StarPU runtime systems and PaRSEC since 2018
- Exploitation of homogeneous and heterogeneous platforms through the use of BLAS/LAPACK CPU kernels and cuBLAS/MAGMA CUDA kernels
- Exploitation of clusters of interconnected nodes with distributed memory (using OpenMPI)
- URL:
- Publications:
-
Contact:
Mathieu Faverge
-
Participants:
Mathieu Faverge, Florent Pruvost, Emmanuel Agullo, Samuel Thibault
-
Partners:
Innovative Computing Laboratory (ICL), King Abdullah University of Science and Technology, University of Colorado Denver
7.1.2 ELF
-
Name:
Efficient Deep Learning Framework
-
Keywords:
Neural networks, Pytorch, Python, GPU, Deep learning, Automatic parallelization
-
Functional Description:
ELF is a deep learning framework designed for efficient and easy-to-launch multi-GPU training. It enables users to input a PyTorch model and train it on an HPC cluster by automatically handling data, model and other types of parallelization across multiple devices.
By optimizing the training schedule, minimizing communication overhead, and maximizing GPU utilization, ELF ensures highly optimized execution. Users don’t need to manually implement parallelization—ELF does it automatically while maintaining computational correctness throughout training iterations.
- Publication:
-
Contact:
Yulia Gusak
7.1.3 PaStiX
-
Name:
Parallel Sparse matriX package
-
Keywords:
Direct solvers, Parallel numerical solvers, Linear Systems Solver
-
Scientific Description:
PaStiX is based on an efficient static scheduling and memory manager, in order to solve 3D problems with more than 50 million unknowns. The mapping and scheduling algorithm handles a combination of 1D and 2D block distributions. Dynamic scheduling can also be applied to take care of NUMA architectures while taking into account very precisely the computational costs of the BLAS 3 primitives, the communication costs and the cost of local aggregations.
-
Functional Description:
PaStiX is a scientific library that provides a high performance parallel solver for very large sparse linear systems based on block direct and block ILU(k) methods. It can handle low-rank compression techniques to reduce the computation and the memory complexity. Numerical algorithms are implemented in single or double precision (real or complex) for LLt, LDLt and LU factorization with static pivoting (for non symmetric matrices having a symmetric pattern). The PaStiX library uses the graph partitioning and sparse matrix block ordering packages Scotch or Metis.
The PaStiX solver is suitable for any heterogeneous parallel/distributed architecture when its performance is predictable, such as clusters of multicore nodes with GPU accelerators or KNL processors. In particular, we provide a high-performance version with a low memory overhead for multicore node architectures, which fully exploits the advantage of shared memory by using a hybrid MPI-thread implementation.
The solver also provides some low-rank compression methods to reduce the memory footprint and/or the time-to-solution.
- URL:
- Publications:
-
Contact:
Pierre Ramet
-
Participants:
Alycia Lisito, Grégoire Pichon, Mathieu Faverge, Pierre Ramet
7.1.4 pmtool
-
Keywords:
Scheduling, Task scheduling, StarPU, Heterogeneity, GPGPU, Performance analysis
-
Functional Description:
Analyse the behavior of StarPU applications post-mortem. Provide lower bounds on makespan. Study the performance of different schedulers in a simple context. Provide implementations of many scheduling algorithms from the literature.
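As an illustration of what a makespan lower bound looks like, the sketch below computes the two textbook bounds for a task graph on identical workers: the critical path and the total work divided by the number of workers. This is a simplification under stated assumptions (homogeneous workers, no communication costs); bounds for heterogeneous platforms are more involved, and the function name and input format are hypothetical rather than those of pmtool.

```python
from functools import lru_cache


def makespan_lower_bound(durations: dict, successors: dict, num_workers: int) -> float:
    """Lower bound on the makespan of a DAG of tasks on identical workers.

    durations:  task -> execution time
    successors: task -> iterable of tasks that depend on it
    """

    @lru_cache(maxsize=None)
    def longest_path_from(task):
        # Length of the longest chain of dependent tasks starting at `task`.
        tail = max((longest_path_from(s) for s in successors.get(task, ())), default=0.0)
        return durations[task] + tail

    critical_path = max(longest_path_from(t) for t in durations)
    area = sum(durations.values()) / num_workers
    return max(critical_path, area)


if __name__ == "__main__":
    durations = {"a": 2.0, "b": 3.0, "c": 1.0, "d": 4.0}
    successors = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
    print(makespan_lower_bound(durations, successors, num_workers=2))
```

Comparing the makespan measured in a trace with such a bound indicates how much room is left for a better scheduler.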
- URL:
-
Publications:
hal-01386174, hal-01878606
-
Contact:
Lionel Eyraud Dubois
-
Participant:
an anonymous participant
7.1.5 StarPart
-
Keyword:
3-point-lighting technique
-
Functional Description:
StarPart is a flexible and extensible framework that integrates state-of-the-art methods for graph partitioning and sparse matrix ordering. More precisely, StarPart is a framework that offers a uniform API to manipulate graph, hypergraph and mesh structures. It is designed to be easily extensible by adding new methods and to plug all these methods into a comprehensive framework. It is initially designed to provide graph partitioning and sparse matrix ordering methods that come from state-of-the-art software such as Metis, Scotch, Patoh, Zoltan, etc. Besides, it provides some facilities for IO, diagnostic, benchmark, visualization (VTK, SVG, ...). StarPart is the core of the MetaPart project. It is built upon the LibGraph library.
- URL:
-
Contact:
Aurélien Esnard
-
Participant:
an anonymous participant
7.1.6 StarPU
-
Name:
The StarPU Runtime System
-
Keywords:
Runtime system, High performance computing
-
Scientific Description:
Traditional processors have reached architectural limits which heterogeneous multicore designs and hardware specialization (e.g. coprocessors, accelerators, ...) intend to address. However, exploiting such machines introduces numerous challenging issues at all levels, ranging from programming models and compilers to the design of scalable hardware solutions. The design of efficient runtime systems for these architectures is a critical issue. StarPU typically makes it much easier for high performance libraries or compiler environments to exploit heterogeneous multicore machines possibly equipped with GPGPUs or Cell processors: rather than handling low-level issues, programmers may concentrate on algorithmic concerns. Portability is obtained by means of a unified abstraction of the machine. StarPU offers a unified offloadable task abstraction named "codelet". Rather than rewriting the entire code, programmers can encapsulate existing functions within codelets. In case a codelet may run on heterogeneous architectures, it is possible to specify one function for each architecture (e.g. one function for CUDA and one function for CPUs). StarPU takes care of scheduling and executing those codelets as efficiently as possible over the entire machine. In order to relieve programmers from the burden of explicit data transfers, a high-level data management library enforces memory coherency over the machine: before a codelet starts (e.g. on an accelerator), all its data are transparently made available on the compute resource. Given its expressive interface and portable scheduling policies, StarPU obtains portable performance by efficiently (and easily) using all computing resources at the same time. StarPU also takes advantage of the heterogeneous nature of a machine, for instance by using scheduling strategies based on auto-tuned performance models.
StarPU is a task programming library for hybrid architectures.
The application provides algorithms and constraints:
- CPU/GPU implementations of tasks,
- A graph of tasks, using StarPU's rich C API.
StarPU handles run-time concerns:
- Task dependencies,
- Optimized heterogeneous scheduling,
- Optimized data transfers and replication between main memory and discrete memories,
- Optimized cluster communications.
Rather than handling low-level scheduling and optimizing issues, programmers can concentrate on algorithmic concerns!
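The codelet idea described above, one task abstraction with one implementation per architecture and a scheduler guided by performance models, can be mimicked in a few lines. The sketch below is a toy model in Python: the `Codelet` and `Worker` classes, the `schedule` and `run` functions and the timing values are all hypothetical and do not reproduce StarPU's C API or its scheduling policies.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Codelet:
    name: str
    implementations: Dict[str, Callable[..., object]]  # architecture -> function
    expected_time: Dict[str, float]                     # architecture -> seconds (assumed model)


@dataclass
class Worker:
    name: str
    arch: str
    available_at: float = 0.0  # time at which the worker becomes idle


def schedule(codelet: Codelet, workers: List[Worker]) -> Worker:
    """Pick the worker with the earliest expected completion time for this codelet."""
    candidates = [w for w in workers if w.arch in codelet.implementations]
    best = min(candidates, key=lambda w: w.available_at + codelet.expected_time[w.arch])
    best.available_at += codelet.expected_time[best.arch]
    return best


def run(codelet: Codelet, workers: List[Worker], *args) -> Tuple[str, object]:
    """Schedule one task and execute the implementation matching the chosen worker."""
    worker = schedule(codelet, workers)
    return worker.name, codelet.implementations[worker.arch](*args)


if __name__ == "__main__":
    workers = [Worker("cpu0", "cpu"), Worker("gpu0", "cuda")]
    scale = Codelet(
        "scale",
        implementations={"cpu": lambda x: 2 * x, "cuda": lambda x: 2 * x},
        expected_time={"cpu": 3.5, "cuda": 1.0},
    )
    for i in range(4):
        print(run(scale, workers, i))
```

In this toy run the fast worker receives tasks until its queue makes the slower one competitive, which is the kind of behavior that performance-model-based scheduling aims for.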
-
Functional Description:
StarPU is a runtime system that offers support for heterogeneous multicore machines. While many efforts are devoted to designing efficient computation kernels for those architectures (e.g. to implement BLAS kernels on GPUs), StarPU not only takes care of offloading such kernels (and implementing data coherency across the machine), but it also makes sure the kernels are executed as efficiently as possible.
- URL:
-
Publications:
tel-04213186, inria-00326917, inria-00378705, inria-00384363, inria-00411581, inria-00421333, inria-00467677, inria-00523937, inria-00547614, inria-00547616, inria-00547847, inria-00550877, inria-00590670, inria-00606195, inria-00606200, inria-00619654, hal-00643257, hal-00648480, hal-00654193, hal-00661320, hal-00697020, hal-00714858, hal-00725477, hal-00772742, hal-00773114, hal-00773571, hal-00773610, hal-00776610, tel-00777154, hal-00803304, hal-00807033, hal-00824514, hal-00851122, hal-00853423, hal-00858350, hal-00911856, hal-00920915, hal-00925017, hal-00926144, tel-00948309, hal-00966862, hal-00978364, hal-00978602, hal-00987094, hal-00992208, hal-01005765, hal-01011633, hal-01081974, hal-01101045, hal-01101054, hal-01120507, hal-01147997, tel-01162975, hal-01180272, hal-01181135, hal-01182746, hal-01223573, tel-01230876, hal-01283949, hal-01284004, hal-01284136, hal-01284235, hal-01316982, hal-01332774, hal-01353962, hal-01355385, hal-01361992, hal-01372022, hal-01386174, hal-01387482, hal-01409965, hal-01410103, hal-01473475, hal-01474556, tel-01483666, hal-01502749, hal-01507613, hal-01517153, tel-01538516, hal-01616632, hal-01618526, hal-01718280, tel-01816341, hal-01842038, tel-01959127, hal-02120736, hal-02275363, hal-02296118, hal-02403109, hal-02421327, hal-02872765, hal-02914793, hal-02933803, hal-02943753, hal-02970529, hal-02985721, hal-03144290, hal-03273509, hal-03290998, hal-03298021, hal-03318644, hal-03348787, hal-03552243, hal-03609275, hal-03623220, hal-03773486, hal-03773985, hal-03789625, hal-03936659, tel-03989856, hal-04005071, hal-04088833, hal-04115280, hal-04146714, hal-04236246, tel-04260094, tel-04316145, hal-04548787, hal-04646530, hal-04668550, hal-04690154, hal-05147860, hal-05199066, hal-05226796
-
Contact:
Nathalie Furmento
-
Participants:
Olivier Aumage, Nathalie Furmento, Samuel Thibault, 38 anonymous participants
7.1.7 rockmate
-
Name:
rockmate
-
Keywords:
Deep learning, Optimization, Python, Pytorch, GPU, Automatic differentiation
-
Scientific Description:
We propose Rockmate to control the memory requirements when training PyTorch DNN models. Rockmate is an automatic tool that starts from the model code and generates an equivalent model, using a predefined amount of memory for activations, at the cost of a few re-computations. Rockmate automatically detects the structure of computational and data dependencies and rewrites the initial model as a sequence of complex blocks. We show that such a structure is widespread and can be found in many models in the literature (Transformer-based models, ResNet, RegNets, ...). This structure allows us to solve the problem in a fast and efficient way, using an adaptation of Checkmate (too slow on the whole model but general) at the level of individual blocks and an adaptation of Rotor (fast but limited to sequential models) at the level of the sequence itself. We show through experiments on many models that Rockmate is as fast as Rotor and as efficient as Checkmate, and that in many cases it yields a significantly lower memory consumption for activations (by a factor of 2 to 5) for a rather negligible overhead (of the order of 10% to 20%). Rockmate is open source and available at https://github.com/topal-team/rockmate.
Complete paper: https://openreview.net/pdf?id=wLAMOoL0KD
-
Functional Description:
Given a PyTorch model, a sample input, and a GPU memory budget, Rockmate builds a new torch.nn.Module, which performs forward and backward pass while keeping the memory of activations under the given budget.
The new model produces the same outputs and gradients as the original one. Training the model with a lower memory than PyTorch Autodiff is achieved by re-computing some of the activations instead of storing them for gradient calculation. Based on the budget, Rockmate determines automatically which activations should be recomputed.
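The claim that re-computation leaves outputs and gradients unchanged can be checked on a toy example. The sketch below, in plain Python without PyTorch, differentiates a chain of scalar layers tanh(w * x) twice: once storing every activation, and once keeping only the input of every few layers and recomputing the others during the backward pass. The fixed periodic choice of saved activations is an assumption made for brevity; it is not how Rockmate selects what to recompute.

```python
import math


def forward_layer(w: float, x: float) -> float:
    return math.tanh(w * x)


def backward_layer(w: float, x: float, grad_out: float) -> tuple[float, float]:
    """Return (gradient w.r.t. input x, gradient w.r.t. weight w) of tanh(w * x)."""
    d = 1.0 - math.tanh(w * x) ** 2
    return grad_out * d * w, grad_out * d * x


def grads_store_all(weights, x0):
    """Reference: keep every activation for the backward pass."""
    acts = [x0]
    for w in weights:
        acts.append(forward_layer(w, acts[-1]))
    g, gw = 1.0, [0.0] * len(weights)
    for i in reversed(range(len(weights))):
        g, gw[i] = backward_layer(weights[i], acts[i], g)
    return acts[-1], g, gw


def grads_recompute(weights, x0, segment):
    """Keep only the input of every `segment`-th layer and recompute the rest."""
    n = len(weights)
    saved, x = {}, x0
    for i, w in enumerate(weights):
        if i % segment == 0:
            saved[i] = x
        x = forward_layer(w, x)
    out, g, gw = x, 1.0, [0.0] * n
    for start in reversed(range(0, n, segment)):
        end = min(start + segment, n)
        # Recompute the inputs of layers start..end-1 from the saved activation.
        acts = [saved[start]]
        for i in range(start, end - 1):
            acts.append(forward_layer(weights[i], acts[-1]))
        for i in reversed(range(start, end)):
            g, gw[i] = backward_layer(weights[i], acts[i - start], g)
    return out, g, gw, len(saved)


if __name__ == "__main__":
    weights = [0.5, -1.2, 0.8, 1.1, -0.3, 0.9, 0.7, -0.6, 1.3, 0.4]
    print(grads_store_all(weights, 0.25)[:2])
    print(grads_recompute(weights, 0.25, segment=3)[:2])
```

Both variants return the same output and gradients, while the second keeps 4 activations across the forward pass instead of 11, at the cost of re-executing part of the forward computation.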
- URL:
-
Contact:
Lionel Eyraud Dubois
-
Participants:
Lionel Eyraud Dubois, Yulia Gusak, Olivier Beaumont, Xunyi Zhao
7.1.8 rotor
-
Name:
Re-materializing Optimally with pyTORch
-
Keywords:
Deep learning, Optimization, Python, GPU, Automatic differentiation
-
Scientific Description:
This software implements in PyTorch a new activation checkpointing method which makes it possible to significantly decrease memory usage when training Deep Neural Networks with the back-propagation algorithm. Similarly to checkpointing techniques coming from the literature on Automatic Differentiation, it consists in dynamically selecting the forward activations that are saved during the training phase, and then automatically recomputing missing activations from those previously recorded. We propose an original computation model that combines two types of activation savings: either only storing the layer inputs, or recording the complete history of operations that produced the outputs (this uses more memory, but requires fewer recomputations in the backward phase), and we provide in https://hal.inria.fr/hal-02352969 an algorithm to compute the optimal computation sequence for this model.
Our PyTorch implementation processes the entire chain, dealing with any sequential DNN whose internal layers may be arbitrarily complex and automatically executing it according to the optimal checkpointing strategy computed given a memory limit. In https://hal.inria.fr/hal-02352969, through extensive experiments, we show that our implementation consistently outperforms existing checkpointing approaches for a large class of networks, image sizes and batch sizes.
-
Functional Description:
Makes it possible to train very large convolutional networks on limited memory by optimally selecting which activations should be kept and which should be recomputed. This code is meant to replace the checkpoint.py utility available in PyTorch, by providing more efficient rematerialization strategies. The algorithm is easier to tune: the only required parameter is the available memory, instead of the number of segments.
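To illustrate what "the only required parameter is the available memory" means, the sketch below solves a much simpler classical model from the automatic differentiation literature: a chain of unit-cost layers with unit-size activations, where the memory limit is a number of checkpoint slots (the slot holding the chain input included) and the cost is the number of forward steps executed. This is a simplified stand-in chosen for illustration; Rotor's actual model handles heterogeneous layers and two kinds of saved data, and the function name is hypothetical.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def min_forward_steps(length: int, slots: int) -> int:
    """Minimal number of forward steps needed to back-propagate through a chain.

    length: number of layers in the chain, whose input is already stored
    slots:  activations that can be stored at once, the chain input included
    """
    if slots < 1:
        raise ValueError("at least one slot is needed to hold the chain input")
    if length <= 1:
        return 0  # the stored input is enough to run the single backward step
    if slots == 1:
        # Only the input is stored: replay the chain from the start for each layer.
        return length * (length - 1) // 2
    # Advance j steps, store that activation, reverse the right part with one
    # slot fewer, then reverse the left part with all slots available again.
    return min(
        j + min_forward_steps(length - j, slots - 1) + min_forward_steps(j, slots)
        for j in range(1, length)
    )


if __name__ == "__main__":
    for slots in (1, 2, 3, 5, 10):
        print(f"{slots:2d} slots -> {min_forward_steps(10, slots)} forward steps")
```

The user only states how much memory is available, and the dynamic program decides where to place checkpoints; with enough slots the cost falls to a single forward pass, and with a single slot it grows quadratically.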
- URL:
- Publication:
-
Contact:
Lionel Eyraud Dubois
-
Participant:
5 anonymous participants
7.1.9 VITE
-
Name:
Visual Trace Explorer
-
Keywords:
Visualization, Execution trace
-
Functional Description:
ViTE is a trace explorer. It is a tool made to visualize execution traces of large parallel programs. It supports Pajé, a trace format created by Inria Grenoble, as well as the OTF and OTF2 formats, developed by the University of Dresden, and gives the programmer a simpler way to analyse, debug and/or profile large parallel applications.
- URL:
- Publications:
-
Contact:
Mathieu Faverge
-
Participants:
Mathieu Faverge, Philippe Swartvagher
8 New results
As explained in Section 3.4, our contributions can be read at the intersection of the research domains described in Section 4 and research axes described in Section 3.3 as shown in the following table:
| | Axis 3.3.1 – Runtime | Axis 3.3.2 – Compression | Axis 3.3.3 – Energy | Axis 3.3.4 – Comm. & Fault Tol. |
| Domain 4.1 – Lin. Alg., Tensors | Topic 3.4.1 | Topic 3.4.2 | Topic 3.4.3 | Topic 3.4.7 |
| Domain 4.2 – Training | Topic 3.4.4 | Topic 3.4.5 | Topic 3.4.6 | Topic 3.4.8 |
8.1 Scalable and portable LU factorization with partial pivoting on top of runtime systems (Topic 3.4.1)
Participants: Alycia Lisito, Mathieu Faverge, Pierre Ramet.
Task-based runtime systems have demonstrated efficiency in leveraging the capabilities of large, heterogeneous architectures. Many linear algebra algorithms and applications have been implemented on top of runtime systems to increase their performance. However, the High Performance Linpack (HPL) benchmark, used by the TOP500 to rank supercomputers, has not yet been successfully implemented using task-based runtime systems. In this paper, we explore solutions to implement efficient LU factorization with partial pivoting using the sequential task-flow programming model. We show that, due to the pivoting strategy, this algorithm generates a large number of very small tasks, which usually overload the runtime system and make it inefficient. We propose two solutions to improve the efficiency and reduce the number of tasks. First, we apply well-known blocking strategies in the context of task-based algorithms. Secondly, we explore batching techniques to reduce the number of tasks submitted to the runtime system. Moreover, in distributed architectures, partial pivoting generates many reductions on the critical path throughout the factorization, which need to be carefully handled to reach high performance. Two task-based reduction algorithms are proposed to express these operations and improve the runtime reactivity on the critical path. These proposals have been implemented in the dense linear algebra library CHAMELEON on top of the STARPU runtime system. Experiments conducted on our cluster with these optimizations show that our LU with partial pivoting asymptotically reaches the performance of the non-pivoting algorithm.
This work has been presented at IPDPS Conference, June 2025, Milan, Italy 20.
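To make the origin of the many small tasks concrete, here is a minimal, unblocked sketch of LU factorization with partial pivoting in plain Python. It is an illustration only, not the CHAMELEON implementation; the function name `lu_partial_pivoting` is an assumption. The pivot search at each step is the max-reduction that, in a distributed setting, ends up on the critical path.

```python
def lu_partial_pivoting(a):
    """Return (lu, perm) such that rows perm of `a` equal L*U.

    L (unit lower triangular) and U (upper triangular) are packed in `lu`.
    Hypothetical helper written for illustration only.
    """
    n = len(a)
    lu = [row[:] for row in a]   # work on a copy of the input
    perm = list(range(n))        # row permutation applied so far
    for k in range(n):
        # Pivot search: a max-reduction over the remaining part of column k.
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        if lu[p][k] == 0.0:
            raise ValueError("matrix is singular")
        if p != k:
            lu[k], lu[p] = lu[p], lu[k]
            perm[k], perm[p] = perm[p], perm[k]
        # Scale the column and update the trailing submatrix.
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]
            for j in range(k + 1, n):
                lu[i][j] -= lu[i][k] * lu[k][j]
    return lu, perm
```

In a tiled, task-based version, each pivot search and row swap touches every tile of the panel, which is where the large number of fine-grained tasks comes from.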
8.2 Batching the tasks of the LU factorization with partial pivoting on top of runtime systems (Topic 3.4.1)
Participants: Alycia Lisito, Mathieu Faverge, Florent Pruvost, Pierre Ramet.
Task-based runtime systems have demonstrated efficiency in leveraging the capabilities of large heterogeneous architectures. Many linear algebra algorithms and applications have been implemented on top of runtime systems to increase their performance. However, the LU factorization with partial pivoting has not yet been successfully implemented using task-based runtime systems. This operation is used to solve large dense linear systems in numerical simulations, such as the Maxwell equations in electromagnetism. This factorization is a major part of the High Performance Linpack (HPL) benchmark used in the TOP500 to evaluate and rank supercomputers. We explore solutions to implement efficient LU factorization with partial pivoting using the sequential task-flow programming model. These solutions have been implemented in the dense linear algebra library Chameleon on top of the StarPU runtime system. We showed that, due to the pivoting strategy, this algorithm generates a large number of very small tasks, which usually overloads the runtime system and makes it inefficient. With a naive task batching strategy, we improved the efficiency and reduced the number of tasks. We propose solutions to adapt the batch size to the granularity of the tasks. To do so, we first distinguish two types of tasks and set an adapted batch size for each. Then, we introduce a heuristic based on the number of operations per task to adapt the batch size to the computational complexity of the tasks during the factorization. Experiments conducted on our cluster with these optimizations show that our LU factorization with partial pivoting asymptotically reaches about 96% of the performance of the non-pivoting algorithm. Thanks to the adaptive batch size mechanism, the performance peak is reached even faster.
This work has been presented at COMPAS Conference, June 2025, Bordeaux, France 27.
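The idea of tying the batch size to the number of operations per task can be sketched as follows. This is a hypothetical heuristic written for illustration, not the one implemented in Chameleon: the names `adaptive_batch_size` and `make_batches`, the target amount of work per batch, and the cap `max_batch` are all assumptions.

```python
import math


def adaptive_batch_size(flops_per_task, target_flops_per_batch, max_batch):
    """Pick a batch size so that one batch carries roughly the target work.

    Small tasks are grouped into large batches; large tasks are left alone.
    (Illustrative heuristic; all parameters are assumptions.)
    """
    if flops_per_task <= 0:
        raise ValueError("flops_per_task must be positive")
    size = math.ceil(target_flops_per_batch / flops_per_task)
    return max(1, min(max_batch, size))


def make_batches(tasks, batch_size):
    """Group consecutive tasks into batches of at most `batch_size` tasks."""
    return [tasks[i:i + batch_size] for i in range(0, len(tasks), batch_size)]
```

With such a rule, the batch size shrinks automatically as the per-task work grows during the factorization, so the runtime sees fewer but comparably sized units of work.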
8.3 Toward an algebraic multigrid method for the indefinite Helmholtz equation (Topic 3.4.2)
Participants: Clement Richefort, Pierre Ramet.
It is well known that multigrid methods are very competitive in solving a wide range of SPD problems. However, achieving such performance for non-SPD matrices remains an open problem. In particular, three main issues may arise when solving a Helmholtz problem: some eigenvalues may be negative or even complex, requiring the choice of an adapted smoother for capturing them, and because the near-kernel space is oscillatory, the geometric smoothness assumption cannot be used to build efficient interpolation rules. Moreover, the coarse correction is not equivalent to a projection method since the indefinite matrix does not define a norm. We present some investigations about designing a method that converges in a constant number of iterations with respect to the wavenumber. The method builds on an ideal reduction-based framework and related theory for SPD matrices to improve an initial least squares minimization coarse selection operator formed from a set of smoothed random vectors. A new coarse correction is proposed to minimize the residual in an appropriate norm for indefinite problems. We also present numerical results at the end of the paper.
This paper has been published in SIAM SISC 11.
8.4 Hierarchical partitioning for the numerical simulation of complex 3D objects (Topic 3.4.2)
Participants: Dimitri Walther, Mathieu Faverge, Pierre Ramet.
The Boundary Element Method (BEM) offers numerous advantages for simulating complex physical phenomena. By placing the unknowns (or degrees of freedom) on the interfaces between different media, it becomes possible to model problems with distant boundary conditions (such as fluid flow around an object, acoustic or electromagnetic wave diffraction, radiative heat transfer, etc.). However, this approach results in a fully coupled system with a dense matrix. When this dense matrix can be decomposed into low-rank sub-blocks, it is possible to construct a hierarchical matrix (H-matrix) that approximates the original system to a desired level of accuracy. In favorable scenarios, this approximation reduces spatial complexity from $O(n^2)$ to $O(n \log n)$ by compressing the matrix sub-blocks. This work investigates the relationship between the partitioning of degrees of freedom and the compression rate of the H-matrix. A new hierarchical partitioning technique, specifically designed to optimize H-matrix compression, is introduced. Unlike existing algorithms based on geometric information (such as Median cut, Cobblestone, or Space-filling curves), this new method relies on the construction of a connectivity graph of the degrees of freedom. This graph is built in quasi-linear time ($O(n \log n)$) from the mesh of the studied object and partitioned in log-quadratic time ($O(n \log^2 n)$) using a multi-level partitioning approach. An additional constraint is imposed to balance the partition loads, facilitating optimization on task-based execution environments. Numerical experiments are conducted on a variety of test cases from electromagnetic simulations.
This work has been presented at COMPAS Conference, June 2025, Bordeaux, France 46. It was awarded the prize for best poster.
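The storage gain of a low-rank sub-block can be checked with simple arithmetic. The sketch below assumes a block of size m × n stored as the product of an m × r factor and an r × n factor; the function names are hypothetical and the numbers in the usage note are illustrative, not results from this work.

```python
def dense_storage(m, n):
    """Number of entries needed to store an m x n block densely."""
    return m * n


def low_rank_storage(m, n, rank):
    """Number of entries for a rank-`rank` factorization U (m x r) times V (r x n)."""
    return rank * (m + n)


def compression_ratio(m, n, rank):
    """Fraction of the dense storage kept after compression (lower is better)."""
    return low_rank_storage(m, n, rank) / dense_storage(m, n)
```

For example, a 1000 × 1000 block of rank 10 needs 20,000 entries instead of 1,000,000. The partitioning of the degrees of freedom matters because it determines which blocks are admissible for such a representation and with which ranks.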
8.5 Optimal scheduling algorithms for software-defined radio pipelined and replicated task chains on multicore architecture (Axis 3.3.1)
Participants: Laércio Lima Pilla.
Software-Defined Radio (SDR) represents a move from dedicated hardware to software implementations of digital communication standards. This approach offers flexibility, shorter time to market, maintainability, and lower costs, but it requires an optimized distribution of tasks in order to meet performance requirements. Thus, we studied the problem of scheduling SDR linear task chains of stateless and stateful tasks for streaming processing. We modeled this problem as a pipelined workflow scheduling problem based on pipelined and replicated parallelism on homogeneous resources. We proposed an optimal dynamic programming solution and an optimal greedy algorithm named OTAC for maximizing throughput while also minimizing resource utilization. Moreover, the optimality of the proposed scheduling algorithm was proved. We evaluated our solutions and compared their execution times and schedules to other algorithms using synthetic task chains and an implementation of the DVB-S2 communication standard on the AFF3CT SDR Domain Specific Language. Our results demonstrated how OTAC quickly finds optimal schedules, leading consistently to better results than other algorithms, or equivalent results with fewer resources.
This paper has been published in the Journal of Parallel and Distributed Computing in October 2025 13.
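The objective being optimized can be illustrated by evaluating a given schedule. The sketch below is not OTAC; it only computes the period of a schedule under a common simplified model, assumed here: the chain is cut into consecutive stages, a stage made only of stateless (replicable) tasks may be replicated on several cores, and the pipeline period is the largest stage period. All names are hypothetical.

```python
def stage_period(weights, replicable, cores):
    """Period of one stage: total work divided by the number of replicas.

    A stage containing a stateful task cannot be replicated (assumed model).
    """
    if cores < 1:
        raise ValueError("a stage needs at least one core")
    if cores > 1 and not all(replicable):
        raise ValueError("a stage with a stateful task cannot be replicated")
    return sum(weights) / cores


def pipeline_period(weights, replicable, stages):
    """Period of a schedule given as (start, end, cores) triples, end exclusive."""
    return max(
        stage_period(weights[s:e], replicable[s:e], c) for s, e, c in stages
    )


def cores_used(stages):
    """Total number of cores consumed by the schedule."""
    return sum(c for _, _, c in stages)
```

The throughput is the inverse of the period, so a scheduler in this setting looks for the cut and replication counts that minimize `pipeline_period` and, among those, the one with the smallest `cores_used`.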
8.6 Task-Based HPC in the Cloud: Price-Performance Analysis of N-Body Simulations with StarPU (Axis 3.3.1)
Participants: Laércio Lima Pilla.
Public cloud environments present significant challenges for traditional High Performance Computing (HPC) applications due to infrastructure limitations that differ substantially from dedicated HPC systems. Unlike traditional HPC clusters optimized for tightly coupled parallel workloads, cloud platforms were designed primarily for web services and data processing applications. Key obstacles include high-latency networks, hardware virtualization overhead, and limited availability of specialized accelerators, all of which can severely impact the performance of compute-intensive applications such as physics simulations. This study investigated the feasibility of running HPC workloads on public cloud infrastructure using standard and cost-effective instance configurations rather than expensive specialized “HPC” offerings. We deployed heterogeneous clusters on Amazon Web Services using the HPC@Cloud Toolkit, incorporating various instance types, including GPU-accelerated nodes with different computational capabilities. Our evaluation focused on N-body simulations implemented using a task-based parallel programming model, leveraging the StarPU runtime system to dynamically schedule computational tasks across various processing units. Our experimental results demonstrated three key findings: (1) smaller GPU-equipped instances (g6.2xlarge) achieve performance comparable to larger instances while costing approximately one-sixth the price, challenging conventional scaling assumptions for cloud-based HPC; (2) strategic GPU utilization yields up to performance improvements over CPU-only configurations while reducing total execution costs by ; and (3) while task-based programming models effectively address network limitations through dynamic scheduling, complex tree-based algorithms like TBFMM face significant optimization challenges in cloud environments due to load balancing issues and expensive parameter tuning requirements. 
These findings provide practical guidance for researchers and practitioners seeking cost-effective cloud HPC deployments, demonstrating that commodity cloud infrastructures can be viable for regular computational workloads but require careful algorithmic-resource matching for optimal efficiency.
This work has been published in IEEE International Conference on Cloud Engineering, September 2025, Rennes, France 25.
8.7 Contraction orderings for scalar products in the tensor-train format (Topic 3.4.2)
Participants: Laércio Lima Pilla.
Tensor-train (TT) decomposition has garnered tremendous popularity for its efficiency in handling high-dimensional data arising in scientific and quantum computing as well as machine learning applications. It provides a compact representation for matrices and vectors with a Kronecker product-like low-rank structure and enables efficient matrix-vector operations in this compressed form. The vector scalar product is among such key operations, comprising a series of tensor contractions in a specific tensor network topology whose order significantly impacts the computational cost. In this work, we proposed efficient algorithms for finding near-optimal contraction orderings for tensor networks representing scalar products in the TT format. We showed that our algorithms outperform all existing contraction ordering methods for general tensor networks where the best existing method incurs up to 15% higher cost for , twice the cost for , and ten times higher cost for scalar products where and are vectors and matrices expressed in the TT format, respectively.
This work has been published in the European Conference on Parallel Processing, August 2025, Dresden, Germany 24.
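Why the contraction order matters can be seen on a single step of the scalar product of two TT vectors. The sketch below counts multiply-add operations under assumed conventions: the k-th cores have shapes (r_prev, n, r_next) and (s_prev, n, s_next), and the partial result so far is a matrix M of shape (r_prev, s_prev). It is a cost model for illustration, not one of the ordering algorithms proposed in this work; the function name is hypothetical.

```python
def step_costs(r_prev, s_prev, n, r_next, s_next):
    """Multiply-add counts for absorbing one pair of TT cores into M.

    Three possible orders for the same step (assumed index conventions):
    - contract M with the first core, then with the second core;
    - contract M with the second core, then with the first core;
    - contract the two cores together first, then with M.
    """
    return {
        "M-X-Y": r_prev * s_prev * n * r_next + s_prev * n * r_next * s_next,
        "M-Y-X": r_prev * s_prev * n * s_next + r_prev * n * s_next * r_next,
        "XY-M": r_prev * r_next * n * s_prev * s_next
        + r_prev * s_prev * r_next * s_next,
    }
```

With ranks (2, 3) on the left, mode size 4, and ranks (5, 6) on the right, the three orders cost 480, 384, and 900 multiply-adds respectively, so even one step can differ by more than a factor of two.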
8.8 MetaCS-FL: A Metaheuristic-Based Framework for Client Selection in Federated Learning Systems (Topic 3.4.6)
Participants: Alan Lira Nunes, Laércio Lima Pilla.
Federated Learning (FL) enables the collaborative training of distributed machine learning models, with each participant (client) using their own local private data. In Cross-Device FL systems, clients usually include unreliable and heterogeneous mobile and edge devices with highly imbalanced and small local datasets. Given these characteristics, the selection of clients to participate in the training plays an essential role in the efficacy of these systems, as a poor selection can lead to long execution times, high energy consumption, and low accuracy. In this work, we proposed MetaCS-FL, a client selection framework built to support different metaheuristics, initial solution methods, and user-defined triggers for new client selections. It also employs client profiling and historical and current performance data to produce more efficient selections of clients and the volume of data they should use for training locally. We evaluated our framework in an extensive series of experiments, including comparisons with state-of-the-art algorithms, revealing the effectiveness of our approach. Having FedAvg as the baseline for comparisons, MetaCS-FL reduced total time (resp. energy consumption) by up to 64.83% (resp. 56.79%) for CIFAR-10, and by up to 67.59% (resp. 60.87%) for Fashion-MNIST while reaching the target testing accuracy.
This report has been published in HAL in July 2025 40, and its paper is currently under evaluation.
8.9 Approximation Algorithms for Scheduling With/Without Deadline Constraints Where Rejection Costs are Proportional to Processing Times (Axis 3.3.3)
Participants: Olivier Beaumont, Lionel Eyraud-Dubois, Laércio Lima Pilla.
We studied two offline job scheduling problems where tasks can be processed on a limited number of energy-efficient edge machines or offloaded to an unlimited supply of energy-inefficient cloud machines (in which case they are said to be rejected). The objective was to minimize total energy consumption. First, we considered scheduling without deadlines, formulating it as a scheduling problem with rejection, where rejection costs are proportional to processing times. We proposed a novel -approximation algorithm, BEKP, by relating it to a Multiple Subset Sum problem, improving upon the existing -approximation for arbitrary rejection costs. Next, we addressed scheduling with deadlines, aiming to minimize the weighted number of rejected jobs. We positioned this problem within the literature and introduced a new -approximation algorithm, MDP, inspired by an interval selection algorithm with a -approximation for arbitrary rejection costs. Experimental results demonstrate that BEKP and MDP obtain better results (lower costs or higher profits) than other state-of-the-art algorithms while maintaining a competitive or better time complexity.
This work was developed in the context of the Challenge PULSE, and the paper has been published in IEEE Transactions on Parallel and Distributed Systems in December 2025 9.
8.10 Energy-Aware Scheduling Strategies for Partially-Replicable Task Chains on Heterogeneous Processors (Axis 3.3.3)
Participants: Laércio Lima Pilla.
The arrival of heterogeneous (or hybrid) multicore architectures has brought new performance trade-offs for applications, and efficiency opportunities to systems. They have also increased the challenges related to thread scheduling, as tasks' execution times will vary depending on whether they are placed on big (performance) cores or little (efficient) ones. In this work, we focused on the challenges heterogeneous multicore processors bring to partially-replicable task chains, such as the ones that implement digital communication standards in Software-Defined Radio (SDR). Our objective was to maximize the throughput of these task chains while also minimizing their power consumption. We modeled this problem as a pipelined workflow scheduling problem using pipelined and replicated parallelism on two types of resources whose objectives were to minimize the period and to use as many little cores as necessary. We proposed two greedy heuristics (FERTAC and 2CATAC) and one optimal dynamic programming (HeRAD) solution to the problem. We evaluated our solutions and compared the quality of their schedules (in period and resource utilization) and their execution times using synthetic task chains. We also studied an open source implementation of the DVB-S2 communication standard based on the StreamPU runtime. Leading processor vendors were covered with ARM, Apple, AMD, and Intel platforms. Both the achieved throughput and the energy consumption were evaluated. Our results demonstrated the benefits and drawbacks of the different proposed solutions.
This work has been published in Heterogeneity in Computing Workshop, June 2025, Milan, Italy 22, and its extended version 39 is currently under evaluation.
8.11 HiRemate: Hierarchical Approach for Efficient Re-materialization of Large Neural Networks (Domain 4.2)
Participants: Olivier Beaumont, Lionel Eyraud-Dubois, Yulia Gusak.
Training modern neural networks poses a significant memory challenge, as storing intermediate results during the forward and backward passes requires considerable memory resources. To address this issue without affecting model accuracy, re-materialization techniques have been introduced to recompute selected intermediate results instead of storing them, thus fulfilling the memory size constraint. The main algorithmic problem is to compute a re-materialization schedule that minimizes the computational overhead within a given memory budget. Our proposed HiRemate framework is based on a new hierarchical approach that provides generality and quality: we can handle any class of network graphs and satisfy the memory constraint with a low computational overhead during training. The framework exhibits low algorithmic complexity, making it possible to scale up and handle very large models. The framework automatically builds a dataflow graph from a PyTorch model, decomposes the graph hierarchically, and then builds an nn.Module that executes forward and backward passes within the given memory budget.
This work has been published in the Forty-Second International Conference on Machine Learning (ICML 2025), July 2025, Vancouver, Canada 19.
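The memory/recomputation trade-off behind re-materialization can be illustrated on the simplest possible case. The sketch below is not HiRemate: it assumes a linear chain of layers with unit-size activations and a periodic strategy that keeps one checkpoint per segment and recomputes the activations of one segment at a time during the backward pass. The function names are hypothetical.

```python
import math


def peak_activations(n_layers, segment):
    """Peak number of stored activations with one checkpoint per segment.

    Simplified model: all checkpoints stay in memory, plus the activations
    of the single segment currently being recomputed for the backward pass.
    """
    if not 1 <= segment <= n_layers:
        raise ValueError("segment must be between 1 and n_layers")
    checkpoints = math.ceil(n_layers / segment)
    return checkpoints + segment


def best_segment(n_layers):
    """Segment length minimizing the peak in this simplified model."""
    return min(
        range(1, n_layers + 1), key=lambda s: peak_activations(n_layers, s)
    )
```

In this toy model, a 100-layer chain needs about 20 stored activations with segments of length 10, instead of 100 without recomputation, at the price of re-running the forward pass of each segment. General network graphs and non-uniform costs are what make the actual scheduling problem hard.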
8.12 Fault-tolerant numerical iterative algorithms at scale (Topic 3.4.7)
Participants: Thomas Herault.
This year, we developed a coherent set of models and strategies to make large-scale iterative computations (with a strong emphasis on linear algebra kernels and solvers) both more resilient to errors and more efficient in their use of communication and storage. A first contribution revisits protection against silent data corruptions (SDCs), where errors may remain undetected for several iterations. Instead of relying on costly full replication, we studied the use of partial detectors whose detection latency is bounded, and we derived optimal execution schemes (segment lengths, and how many in-memory checkpoints must be kept) that guarantee correctness while reducing overhead. The analysis and Monte-Carlo results show that, across a broad range of parameters, partial detection can significantly outperform replication, sometimes yielding substantial speedups for realistic error rates 14.
8.13 Partial Detectors Versus Replication To Cope With Silent Errors (Axis 3.3.4)
Participants: Thomas Herault.
We proposed a holistic fault-tolerance methodology for numerical iterative algorithms that jointly addresses the three dominant error sources at scale: fail-stop failures, computation silent errors, and memory bit flips. The key idea is a hierarchical periodic pattern that interleaves mechanisms at different frequencies: (i) frequent computation verifications (“chunks”), (ii) less frequent memory verification + in-memory checkpoints (“segments”), and (iii) even less frequent global checkpoints to tolerate fail-stop failures (“patterns”). We provide an analytical framework to derive the optimal pattern minimizing the expected time per iteration. We instantiated and evaluated this approach on Preconditioned Conjugate Gradient, illustrating scenarios where the optimal pattern can dramatically reduce resilience overheads compared to more naïve strategies 16, 38.
8.14 Fixed-Work vs. Fixed-Time Checkpointing on Large-Scale Failure-Prone Platforms (Axis 3.3.4)
Participants: Thomas Herault.
We addressed a very practical systems constraint that directly impacts large-scale linear algebra runs: the prevalence of fixed-length reservations on HPC systems. We studied checkpointing not only in the classical “fixed-work” setting, but also in the dual fixed-time setting, where the goal is to maximize the expected progress achieved within a reservation. We show that fixed-time checkpointing is surprisingly harder than fixed-work checkpointing, propose dynamic threshold-based heuristics that perform well for short/medium reservations, and derive a (discretized) optimal dynamic-programming strategy, including extensions to stochastic checkpoint durations. These results provide actionable guidance for running iterative solvers robustly under real scheduling constraints 8.
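For reference, the classical fixed-work setting has a well-known first-order answer, the Young/Daly period, which the sketch below computes. This is textbook background rather than a result of this work, and it does not cover the fixed-time setting; the function names are hypothetical.

```python
import math


def young_daly_period(checkpoint_cost, mtbf):
    """First-order optimal work between checkpoints: sqrt(2 * C * MTBF)."""
    if checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("checkpoint_cost and mtbf must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def first_order_waste(period, checkpoint_cost, mtbf):
    """Fraction of time lost: checkpoint overhead plus expected re-execution."""
    return checkpoint_cost / period + period / (2.0 * mtbf)
```

With a checkpoint cost of 50 s and a platform MTBF of 10,000 s, the period is 1,000 s and the first-order waste is about 10%. In a fixed-length reservation the remaining time also matters, which is one reason why a single static period is no longer sufficient.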
8.15 PaRSEC: Scalability, flexibility, and hybrid architecture support for task-based applications in ECP (Axis 3.3.1)
Participants: Thomas Herault.
The work conducted during the Exascale Computing Project (ECP) provided several key lessons on the role and design of task-based runtime systems for future large-scale platforms. In particular, ECP confirmed that data movement, rather than raw computation, is the dominant performance limiter on heterogeneous and accelerated systems, making it essential for the runtime to manage communication, data placement, and the overlap of computation and transfers. The diversity of ECP target architectures also showed that performance portability cannot be achieved through static programming models, but instead requires runtimes that dynamically adapt scheduling, task granularity, and resource usage based on runtime information. In addition, the coexistence of legacy MPI-based components with task-based execution emphasized the importance of interoperability, while the scale and duration of ECP runs highlighted the need to treat resilience as a first-class runtime concern, naturally enabled by dataflow-based execution models. These lessons directly inform the objectives of the NumPEx project, the French counterpart to ECP, in which the TOPAL team is actively involved. By building on the experience gained in ECP, our participation in NumPEx aims to transfer and extend these concepts to the French exascale ecosystem, contributing runtime-level solutions for communication efficiency, adaptability, and fault tolerance on next-generation European supercomputing platforms 10.
8.16 Tensor Contractions on Top of Runtime Systems: Application to the Coupled-Cluster Method (Topic 3.4.1)
Participants: Thomas Herault.
This year, we investigated how the benefits of task-based and distributed runtime systems, well established for dense linear algebra, can be extended to tensor computations, which play a central role in modern high-performance computing. Our work focused on tensor contractions arising in computational quantum chemistry, in particular in coupled-cluster methods, where tensors have a small number of dimensions but very large sizes. We extended the Chameleon dense linear algebra library to support tensor contractions by expressing them as sequences of optimized matrix operations, combined with flexible tensor permutations. To address the challenges of data layout and memory footprint, we identified a set of elementary and composable tensor transformations and implemented them on top of the StarPU runtime system. We validated this approach on the computation of coupled-cluster residuals with density fitting, demonstrating both its efficiency and its scalability on modern heterogeneous platforms 28.
8.17 Scalable Block-Sparse Matrix Multiplication Using Template Task Graphs (Topic 3.4.1)
Participants: Thomas Herault.
This year, we advanced the use of task-based runtime systems for sparse linear algebra by addressing communication scalability issues in distributed block-sparse matrix multiplication. Building on the Template Task Graph (TTG) programming model, we introduced application-defined scheduling constraints that allow the runtime to control task eligibility without resorting to ad-hoc control flow. Applied to block-sparse matrix multiplication, these constraints make it possible to throttle and structure communication, limiting the number of concurrent broadcasts and reducing network contention while preserving overlap between communication and computation. Experimental results demonstrate that this approach significantly improves scalability on large problem sizes and highlights the importance of exposing high-level execution constraints to the runtime. This work reinforces the role of expressive task-based runtimes as a key enabler for scalable and communication-efficient linear algebra on modern distributed systems 23.
8.18 Comparing and Contrasting User and Runtime Directed Data Placement Strategies for Owner-Compute, Multi-accelerator Distributed Task Based Scheduling (Topic 3.4.1)
Participants: Thomas Herault.
This work explores data placement strategies in task-based runtime systems for linear algebra applications on multi-accelerator, distributed platforms. Using the PaRSEC runtime, we compared runtime-directed heuristics and user-directed placement strategies in the context of owner-compute scheduling, focusing on dense matrix multiplication and Cholesky factorization. The results show that while automated strategies can significantly improve locality and outperform naïve approaches, they remain consistently outperformed by carefully designed user-directed placements, particularly at scale. The study also highlights the limitations of relying on unified virtual memory and demonstrates the importance of explicitly managing data received from the network, especially on modern systems where network interfaces are directly attached to accelerators. Overall, this work emphasizes that runtime systems must expose flexible mechanisms for expressing data placement policies, allowing expert users to guide execution while preserving a clear separation between algorithmic expression and performance tuning 17.
8.19 Optimizing Parallel Heterogeneous System Efficiency: Dynamic Task Graph Adaptation with Recursive Tasks (Topic 3.4.1)
Participants: Abdou Guermouche.
Task-based programming models are currently a major trend for leveraging heterogeneous parallel systems in a productive way (OpenACC, Kokkos, Legion, OmpSs, PaRSEC, StarPU, XKaapi, ...). Among these models, the Sequential Task Flow (STF) model is widely embraced (PaRSEC's DTD, OmpSs, StarPU) since it allows task graphs to be expressed naturally through a sequential-looking submission of tasks, with task dependencies inferred automatically. However, STF is limited to task graphs with task sizes that are fixed at submission, posing a challenge in determining the optimal task granularity. Notably, in heterogeneous systems, the optimal task size varies across different processing units, so a single task size would not fit all units. StarPU's recursive tasks allow graphs with several task granularities by turning some tasks into sub-graphs dynamically at runtime. The decision to transform these tasks into sub-graphs is made by a StarPU component called the Splitter. After deciding to transform some tasks, classical scheduling approaches are used, making this component generic and orthogonal to the scheduler. In this paper, we propose a new policy for the Splitter, designed for heterogeneous platforms, that relies on linear programming aimed at minimizing execution time and maximizing resource utilization. This results in a dynamic, well-balanced set comprising both small tasks to fill multiple CPU cores and large tasks for efficient execution on accelerators like GPU devices. We then present an experimental evaluation showing that just-in-time adaptations of the task graph lead to improved performance across various dense linear algebra algorithms 12.
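How an STF runtime infers dependencies from a sequential submission can be sketched in a few lines. The code below is a toy model, not StarPU's implementation: each task declares the data it reads and writes, and edges are added for read-after-write, write-after-write, and write-after-read conflicts. The function name and the task tuples are assumptions.

```python
def infer_dependencies(tasks):
    """Return the set of (predecessor, successor) edges of the task graph.

    `tasks` is a list of (name, reads, writes) in submission order.
    """
    last_writer = {}           # data -> last task that wrote it
    readers_since_write = {}   # data -> tasks that read it since that write
    edges = set()
    for name, reads, writes in tasks:
        for d in reads:
            if d in last_writer:
                edges.add((last_writer[d], name))      # read-after-write
        for d in writes:
            if d in last_writer:
                edges.add((last_writer[d], name))      # write-after-write
            for reader in readers_since_write.get(d, []):
                edges.add((reader, name))              # write-after-read
        for d in reads:
            if d not in writes:
                readers_since_write.setdefault(d, []).append(name)
        for d in writes:
            last_writer[d] = name
            readers_since_write[d] = []
    return edges
```

Submitting the first steps of a tiled Cholesky factorization (one POTRF, two TRSM, one SYRK) yields a graph in which the two TRSM tasks are independent of each other, without the programmer ever stating a dependency explicitly. Since the data handles are fixed when a task is submitted, so is its granularity, which is the limitation that recursive tasks address.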
8.20 Improving energy efficiency of HPC applications using unbalanced GPU power capping (Topic 3.4.3)
Participants: Abdou Guermouche, Hayfa Tayeb.
Energy efficiency represents a significant challenge in the domain of High-Performance Computing (HPC). One potential key parameter to improve energy efficiency is the use of power capping, a technique for controlling the power limits of a device, such as a CPU or GPU. In this paper, we examine the impact of GPU power capping in the context of HPC applications using heterogeneous computing systems. To this end, we first conduct an extensive study of the impact of GPU power capping on a compute-intensive kernel, namely the matrix multiplication kernel (GEMM), on different Nvidia GPU architectures. Interestingly, such compute-intensive kernels are up to 30% more energy efficient when the GPU is set to 55-70% of its Thermal Design Power (TDP). Using the best power capping configuration provided by this study, we investigate how setting different power caps for GPU devices of a heterogeneous computing node can improve the energy efficiency of the running application. We consider dense linear algebra task-based operations, namely matrix multiplication and Cholesky factorization. We show how the underlying runtime system scheduler can then automatically adapt its decisions to take advantage of the heterogeneous performance capability of each GPU. The results show that for a given platform equipped with four GPU devices, applying a power cap on all GPUs improves the energy efficiency for matrix multiplication up to 24.3% (resp. 33.78%) for double (resp. single) precision 26.
8.21 Sparse Matrix Ordering for Fine Grain Parallel Triangular Solve Using SIMD (Topic 3.4.1)
Participants: Abdou Guermouche.
The evolution of processor hardware increasingly supports fine grain parallelism through SIMD (Single Instruction, Multiple Data) vector instruction sets and hardware threading. For instance, the new ARM SVE instruction set allows hardware implementations with SIMD vectors of up to 32 double-precision values per hardware thread. In this work, we focus on vectorization of the triangular solves required in BiCGStab preconditioned with ILU(0), a combination that is particularly effective numerically for IFPEN applications. In our context, parallelism can be exposed by changing the sparse structure of the matrices through a reordering of the unknowns, which can be recast in terms of graph ordering and coloring. We use a graph coloring method named ColorRCM to exhibit fine grain parallelism to feed the SIMD computing units while improving the convergence of the Krylov solver compared to the classical greedy graph coloring method. We first evaluate the performance of SIMD-SpTRSV using the permutation provided by ColorRCM and achieve a speedup between 1.7 and 6 with AVX2 compared to Intel MKL 21.4. Then we examine the impact of ColorRCM ordering on ILU(0)-BiCGStab performance on 201 matrices, including matrices from the SuiteSparse Matrix Collection (formerly the University of Florida Sparse Matrix Collection) and from the IFPEN porous media flow simulations. The solver configuration that uses the ColorRCM ordering and is vectorized with AVX2 instructions showed the best convergence times in two thirds of the tests 21.
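The link between coloring and parallelism in the triangular solve can be illustrated with the classical greedy coloring used as the comparison baseline. The sketch below is that baseline, not ColorRCM; the function names and the adjacency-dictionary input format are assumptions.

```python
def greedy_coloring(adjacency):
    """Assign each vertex the smallest color unused by its colored neighbors.

    `adjacency` maps each vertex to the list of its neighbors in the
    (symmetrized) graph of the sparse matrix.
    """
    colors = {}
    for v in sorted(adjacency):
        used = {colors[u] for u in adjacency[v] if u in colors}
        c = 0
        while c in used:
            c += 1
        colors[v] = c
    return colors


def color_classes(colors):
    """Group vertices by color; unknowns of one class are mutually independent."""
    classes = {}
    for v, c in colors.items():
        classes.setdefault(c, []).append(v)
    return classes
```

After permuting the unknowns color by color, all rows within one class can be processed together in the triangular solve, which is what feeds the SIMD lanes. The trade-off is that the ordering also changes the ILU(0) factors and hence the convergence of the Krylov solver.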
8.22 Mind Bubbles and Memory: Bounds on Scheduling Pipeline Parallelism with Rematerialization (Domain 4.2)
Participants: Adrien Aguila-Multner, Yulia Gusak, Olivier Beaumont, Lionel Eyraud-Dubois.
Training large neural networks, especially Transformer-based Large Language Models (LLMs), requires massive high-performance computing (HPC) resources. Within each microbatch, computations follow a strictly sequential flow through a stack of transformer blocks: a forward pass to compute the loss, and a backward pass to propagate gradients. This sequential structure limits intrinsic parallelism. To improve performance, several complementary strategies have been developed: data, tensor, sequence, and pipeline parallelism, typically combined to achieve scalability over tens of thousands of GPUs.
In 35, we present a formal analysis of pipeline parallelism (PP) for large-scale training. In PP, the model is partitioned into multiple stages, and microbatches are injected into the pipeline to overlap computation. The main challenge is to minimize idle periods (pipeline bubbles) while managing memory usage, since each GPU must store intermediate activations from multiple in-flight microbatches. Existing scheduling algorithms such as GPIPE, 1F1B, HANAYO, and MEGATRON reduce idle time but lack formal lower bounds or explicit modeling of memory constraints.
We develop a unified analytical approach for PP scheduling, deriving lower bounds on completion time for both single-wave and multi-wave regimes. Our analysis explicitly incorporates a memory constraint K, denoting the number of activations that can be stored per GPU. Exact results are provided for two extreme cases (minimal memory () and large memory ()), while general lower bounds are established for intermediate configurations. Our analysis highlights the intrinsic coupling between pipeline utilization and memory footprint, providing a foundation for evaluating and comparing pipeline scheduling algorithms under realistic memory constraints.
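As a point of comparison for such bounds, the idle time of the simplest schedule can be computed in closed form. The sketch below uses the classical GPipe-style estimate under assumed uniform stage times (every stage takes the same forward and backward time per microbatch); it is background material, not one of the bounds derived in this work, and the function names are hypothetical.

```python
def gpipe_makespan(stages, microbatches, t_forward, t_backward):
    """Completion time of one training step with uniform stage times."""
    return (microbatches + stages - 1) * (t_forward + t_backward)


def bubble_fraction(stages, microbatches):
    """Fraction of each GPU's time spent idle in that schedule."""
    return (stages - 1) / (microbatches + stages - 1)
```

With 4 stages and 12 microbatches, each GPU is idle 20% of the time. Increasing the number of microbatches shrinks the bubble but raises the number of activations in flight, which is exactly the coupling with the memory constraint K discussed above.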
8.23 Optimized Forward-Backward Rematerialization for Memory-Efficient Pipeline Parallel Training (Topic 3.4.4)
Participants: Adrien Aguila–Multner, Olivier Beaumont, Lionel Eyraud Dubois, Yulia Gusak.
Pipeline parallelism is a key technique for scaling deep network training across multiple devices. Recent works have significantly reduced pipeline idle time by improving scheduling efficiency. Decoupling the computation of gradients with respect to weights and activations led to the development of schedules with almost no idle time. However, these methods still require substantial memory, limiting their applicability on resource-constrained hardware.
In 36, our first contribution is to introduce recomputation to the backward pass, extending rematerialization beyond the forward pass. This enables executing schedules with decoupled gradient computations under much tighter memory constraints. Our second contribution is a unified optimization approach that, given a model and hardware memory constraints, formulates and solves an Integer Linear Programming (ILP) problem to determine the optimal per-microbatch, per-GPU rematerialization strategy for a given schedule, applicable to both one-wave and multi-wave pipeline schedules. Our third contribution shows that, as device memory constraints vary, the relative advantages of different pipeline schedules also change in the presence of rematerialization. We provide corresponding insights and a PyTorch framework that enables finding and executing the optimal combination of pipeline scheduling and rematerialization strategies. Experiments demonstrate the effectiveness of all three contributions, showing that our approach enables efficient training of larger models under tight memory budgets, adapts optimally to varying memory capacities, and reduces recomputation overhead compared to existing recomputation solutions.
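As a rough illustration of the kind of decision the ILP in 36 makes, the following toy restricts the problem to a single device and a single microbatch: choose which activations to keep in memory and which to recompute, so that the extra recomputation time is minimal under a memory budget. This is a simplified stand-in, solved here by exhaustive enumeration rather than an ILP solver; the function name, cost lists and budget are hypothetical.

```python
from itertools import product


def best_remat_plan(mem_cost, recompute_cost, budget):
    """Return (extra_time, keep_flags) minimising recomputation time.

    mem_cost[i]       memory needed to keep activation i (hypothetical units)
    recompute_cost[i] time to recompute activation i if it is discarded
    budget            memory available for stored activations
    """
    best = None
    # Enumerate every keep/discard assignment; fine for a handful of layers only.
    for keep in product((False, True), repeat=len(mem_cost)):
        used = sum(m for m, k in zip(mem_cost, keep) if k)
        if used > budget:
            continue  # violates the memory constraint
        extra = sum(c for c, k in zip(recompute_cost, keep) if not k)
        if best is None or extra < best[0]:
            best = (extra, keep)
    return best


if __name__ == "__main__":
    plan = best_remat_plan(mem_cost=[4, 2, 3], recompute_cost=[5, 1, 4], budget=7)
    print(plan)
```

The enumeration grows exponentially with the number of activations, which is one reason an ILP formulation is attractive for realistic per-microbatch, per-GPU instances.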
8.24 Leveraging Expert Usage to Speed up LLM Inference with Expert Parallelism (Topic 3.4.4)
Participants: Olivier Beaumont, Raphael Bourgouin.
Large language models have become indispensable for many text-processing applications. Their inference, i.e. their use to generate text, is a time-consuming task since tokens have to be generated one after the other, even if the computational load has been reduced by model sparsification, e.g. by using Mixture of Experts (MoE) models. In the MoE context, a subset of experts is selected at each stage. Note that not all subsets of experts (pairs of experts in most cases) in a given layer have the same probability of being selected. When experts are mapped to different GPUs, there is a risk of load imbalance if the selected experts end up on a small number of GPUs. In 15, we propose to leverage this heterogeneity in expert usage to map experts of popular subsets onto distinct GPUs, allowing them to be processed in parallel and thus reducing the time needed for inference. Even though this mapping problem is NP-complete, it is possible to design simple greedy strategies that significantly reduce the need for sequential expert processing. Our proof-of-concept confirms that our mapping strategies effectively reduce inference time on the Mixtral model.
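One possible greedy strategy of this kind can be sketched as follows. It is an illustrative assumption, not necessarily one of the strategies evaluated in the paper: experts are placed one by one, most frequently used first, on the GPU where they conflict least with already placed experts, under a balanced capacity. The function names and the pair-frequency input are hypothetical.

```python
from collections import defaultdict


def greedy_expert_mapping(pair_freq, num_experts, num_gpus):
    """Place experts on GPUs so that frequently co-selected experts are separated.

    pair_freq maps an expert pair (a, b) to how often the pair is selected together.
    """
    capacity = -(-num_experts // num_gpus)  # ceiling division: balanced GPUs
    weight = defaultdict(float)
    for (a, b), freq in pair_freq.items():
        weight[a, b] += freq
        weight[b, a] += freq

    # Place the most "conflicting" experts first.
    load = [sum(weight[e, o] for o in range(num_experts)) for e in range(num_experts)]
    order = sorted(range(num_experts), key=lambda e: -load[e])

    gpus = [[] for _ in range(num_gpus)]
    for e in order:
        candidates = [g for g in range(num_gpus) if len(gpus[g]) < capacity]
        # Pick the GPU with the least co-selection weight, then the emptiest one.
        best = min(candidates,
                   key=lambda g: (sum(weight[e, o] for o in gpus[g]), len(gpus[g])))
        gpus[best].append(e)
    return gpus


def colocated_weight(gpus, pair_freq):
    """Total frequency of pairs whose two experts share a GPU (processed sequentially)."""
    where = {e: g for g, members in enumerate(gpus) for e in members}
    return sum(f for (a, b), f in pair_freq.items() if where[a] == where[b])


if __name__ == "__main__":
    freq = {(0, 1): 50, (2, 3): 30, (0, 2): 10, (1, 3): 10}
    mapping = greedy_expert_mapping(freq, num_experts=4, num_gpus=2)
    print(mapping, colocated_weight(mapping, freq))
    print("naive:", colocated_weight([[0, 1], [2, 3]], freq))
```

The `colocated_weight` helper gives a simple proxy for how often two selected experts would have to be processed one after the other on the same GPU.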
8.25 Pallas: a generic trace format for large HPC trace analysis (Axis 3.3.1)
Participant: Philippe Swartvagher.
Identifying performance bottlenecks in a parallel application is tedious, especially because it requires analyzing the behaviour of various software components, as bottlenecks may have several causes and symptoms. For example, a load imbalance may cause long MPI waiting times, or contention on disk may degrade the performance of I/O operations. Detecting a performance problem means investigating the execution of an application and applying several performance analysis techniques. To do so, one can use a tracing tool to collect information describing the behaviour of the application. At the end of the execution, a trace file in a specific format is available to the application user, which can be used to conduct a complete post-mortem investigation. Several challenges emerge from the generation and use of traces. Tracing may alter the performance of the application, and can create thousands of heavy trace files, especially at large scale. Most importantly, the post-mortem analysis needs to load these thousands of trace files in memory and process them. This quickly becomes impractical for large-scale applications, as memory gets exhausted and the number of opened files exceeds the system capacity. In this paper, we propose Pallas 18, a generic trace format tailored for conducting various post-mortem performance analyses of traces describing large executions of HPC applications. During the execution of the application, Pallas collects events and detects their repetitions on-the-fly. When storing the trace to disk, Pallas groups the data from similar events or groups of events together in order to later speed up trace reading. We demonstrate that the Pallas online detection of the program structure does not significantly degrade the performance of the applications. Moreover, the Pallas format allows faster trace analysis compared to other evaluated trace formats.
Overall, the Pallas trace format allows an interactive analysis of a trace that is required when a user investigates a performance problem.
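The idea of detecting repetitions while events are being recorded can be illustrated with a small toy. It does not reproduce the actual Pallas algorithm or its on-disk format: it only folds flat, immediately repeated patterns of bounded length, and the function names and the `max_len` parameter are hypothetical.

```python
def fold_repetitions(events, max_len=3):
    """Fold immediate repetitions of short event patterns while reading a stream.

    Returns a list whose items are either a raw event (str) or a loop
    [pattern, count], where pattern is a tuple of events repeated count times.
    """
    out = []
    for ev in events:
        out.append(ev)
        for length in range(1, max_len + 1):
            tail = out[-length:]
            if len(tail) < length or not all(isinstance(t, str) for t in tail):
                break  # not enough raw events at the end of the trace
            pattern = tuple(tail)
            prev = out[-length - 1] if len(out) > length else None
            if isinstance(prev, list) and prev[0] == pattern:
                # The tail is one more iteration of the loop just before it.
                del out[-length:]
                prev[1] += 1
                break
            if len(out) >= 2 * length and tuple(out[-2 * length:-length]) == pattern:
                # Two identical consecutive blocks: start a new loop.
                del out[-2 * length:]
                out.append([pattern, 2])
                break
    return out


def expand(folded):
    """Rebuild the original event list, showing that the folding is lossless."""
    flat = []
    for item in folded:
        if isinstance(item, list):
            pattern, count = item
            flat.extend(pattern * count)
        else:
            flat.append(item)
    return flat


if __name__ == "__main__":
    trace = ["MPI_Send", "MPI_Recv"] * 3 + ["MPI_Barrier"]
    folded = fold_repetitions(trace)
    print(folded)
    print(expand(folded) == trace)
```

Storing a pattern once together with its iteration count is what keeps the representation small for the loop-heavy executions typical of HPC applications.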
9 Bilateral contracts and grants with industry
9.1 Bilateral Grants with Industry
Participants: Olivier Beaumont, Lionel Eyraud-Dubois, Mathieu Faverge, Abdou Guermouche, Yulia Gusak, Pierre Ramet.
Some of the ongoing PhD theses are carried out within bilateral contracts with industry for PhD advisory:
- Airbus (2022-). This collaboration concerns the parallelization and optimization of the Flusepa application, which models the separation of boosters for space launchers at Airbus Safran Launchers. Flusepa combines computational fluid mechanics, adaptive mesh refinement (AMR) algorithms and task-based parallelism based on the StarPU runtime system. We are involved in the supervision of the PhD of Alice Lasserre in this context.
- CEA-Cesta for the PhD of Abel Calluaud. A direct solver developed at CEA relies on the approximation by hierarchical matrices to reduce both computational and memory costs. Although these developments have met a growing demand for increased simulation accuracy, there are still open problems to pursue these research efforts in an HPC context. In this thesis, we propose to develop and compare several approaches to adapt the granularity of hierarchical tasks and extract parallelism to exploit the multicore computational nodes associated with massively parallel architectures such as GPUs.
- CEA-Cesta for the PhD of Dimitri Walther. In the context of numerical simulation of electromagnetism, integral methods are among the most widely used because of their power. These methods lead to the solution of dense linear problems and are therefore very expensive. For this reason, hierarchical compression methods have been developed that drastically reduce the cost associated with these matrices. They are based on a hierarchical partitioning of the matrix, and therefore of the mesh, and the efficiency of the compression depends on this partitioning. In this context, the aim of the thesis is to develop efficient and scalable hierarchical partitioners to optimise the compression of the matrix.
- Eviden for the PhD of Alycia Lisito. For over three years, we have been collaborating with Eviden on the development of an HPL benchmark on top of runtime systems. This work is continued as part of Alycia Lisito's thesis funded by a CIFRE contract. To guarantee a high level of flexibility and portability, it is possible to use a task-based implementation on top of a runtime system. This programming model has already proved its effectiveness in the implementation of various parallel algorithms, in particular for dense linear algebra (LU decomposition, Cholesky decomposition, QR, etc.). In this thesis, we will use Inria's existing software stack, through the dense linear algebra library Chameleon and the StarPU runtime system. These reference libraries for runtime-based linear algebra will be studied to enable the scaling up of more complex algorithms such as HPL.
- Eviden for the PhD of Jean Conan. Within the framework of High-Performance Computing (HPC) tenders, Atos Bull must provide contractual performance guarantees for future supercomputers. However, direct measurement is often impossible during the bidding phase, either because the hardware components (processors, accelerators) are not yet commercially available or because the scale of the proposed system exceeds the testing resources available internally. Performance prediction has become a critical tool, not only for meeting client requirements but also for upstream architecture sizing (such as network topology) and optimizing massively parallel software. The transition to exascale computing introduces unprecedented complexity, driven by the increasing heterogeneity of compute nodes and the intricate structure of high-speed networks. The objective of the thesis is thus to explore novel methodologies for performance prediction, with a primary focus on simulation techniques, to reduce optimization overhead by minimizing the number of large-scale physical runs required to determine optimal execution parameters, and finally to accurately model the impact of node heterogeneity and network architecture on overall system performance.
- Diabolocom and Inria are now in the final stage of contract negotiations to start a PhD thesis co-supervised by Yulia Gusak and Olivier Beaumont on Optimization of Multi-Stage Generative Model Pipelines for Cost-Efficient, Scalable Inference.
10 Partnerships and cooperations
10.1 International initiatives
10.1.1 Associate Teams in the framework of an Inria International Lab or in the framework of an Inria International Program
ELF Associate Team on Efficient deep Learning Frameworks.
Partners
- TOPAL
- California Institute of Technology (Caltech)
Nowadays, Deep Learning (DL) and Artificial Intelligence (AI) technologies are incorporated in more and more areas to solve various problems of video, audio, natural language processing, content generation, etc. Frameworks based on neural networks, which are core modules of deep learning models, have been already successfully used for action recognition, weather forecasting, robotic surgery and other inspiring applications [24, 44, 48]. The drawbacks of modern neural networks are that they usually require a significant amount of data and a lot of GPU devices to be trained, which makes them expensive in terms of energy and money costs, and harmful in terms of air emissions [27]. The general question we are going to address during the work of the associate team is: given your application and your computation platform, how to perform the model training efficiently in terms of time/energy?
10.2 International research visitors
10.2.1 Visits to international teams
Research stays abroad
Olivier Beaumont visited Loris Marchal and Pablo Piantanida for ten days in July 2025 at ETS (École de technologie supérieure) to work on inference optimization using speculative decoding. This collaboration led to the submission of an international cooperation project, which is currently under evaluation.
10.3 European initiatives
10.3.1 H2020 projects
EUPEX
Participant: Olivier Beaumont.
EUPEX project on cordis.europa.eu
- Title: EUROPEAN PILOT FOR EXASCALE
- Duration: From January 1, 2022 to December 31, 2025
- Partners:
- INSTITUT NATIONAL DE RECHERCHE EN INFORMATIQUE ET AUTOMATIQUE (INRIA), France
- GRAND EQUIPEMENT NATIONAL DE CALCUL INTENSIF (GENCI), France
- VSB - TECHNICAL UNIVERSITY OF OSTRAVA (VSB - TU Ostrava), Czechia
- FORSCHUNGSZENTRUM JULICH GMBH (FZJ), Germany
- COMMISSARIAT A L ENERGIE ATOMIQUE ET AUX ENERGIES ALTERNATIVES (CEA), France
- IDRYMA TECHNOLOGIAS KAI EREVNAS (FOUNDATION FOR RESEARCH AND TECHNOLOGY HELLAS), Greece
- SVEUCILISTE U ZAGREBU FAKULTET ELEKTROTEHNIKE I RACUNARSTVA (UNIVERSITY OF ZAGREB FACULTY OF ELECTRICAL ENGINEERING AND COMPUTING), Croatia
- UNIVERSITA DEGLI STUDI DI TORINO (UNITO), Italy
- CYBELETECH (Cybeletech), France
- UNIVERSITA DI PISA (UNIPI), Italy
- GRAN SASSO SCIENCE INSTITUTE (GSSI), Italy
- ISTITUTO NAZIONALE DI ASTROFISICA (INAF), Italy
- UNIVERSITA DEGLI STUDI DEL MOLISE, Italy
- E4 COMPUTER ENGINEERING SPA (E4), Italy
- UNIVERSITA DEGLI STUDI DELL'AQUILA (UNIVAQ), Italy
- CONSIGLIO NAZIONALE DELLE RICERCHE (CNR), Italy
- JOHANN WOLFGANG GOETHE-UNIVERSITAET FRANKFURT AM MAIN (GUF), Germany
- EUROPEAN CENTRE FOR MEDIUM-RANGE WEATHER FORECASTS (ECMWF), United Kingdom
- BULL SAS (BULL), France
- POLITECNICO DI MILANO (POLIMI), Italy
- EXASCALE PERFORMANCE SYSTEMS - EXAPSYS IKE, Greece
- ALMA MATER STUDIORUM - UNIVERSITA DI BOLOGNA (UNIBO), Italy
- PARTEC AG (PARTEC), Germany
- ISTITUTO NAZIONALE DI GEOFISICA E VULCANOLOGIA, Italy
- CINECA CONSORZIO INTERUNIVERSITARIO (CINECA), Italy
- SECO SPA (SECO SRL), Italy
- CONSORZIO INTERUNIVERSITARIO NAZIONALE PER L'INFORMATICA (CINI), Italy
- Inria contact: Olivier Beaumont
- Coordinator: Jean-Robert Bacou (Eviden)
- Summary:
The EUPEX consortium aims to design, build, and validate the first EU platform for HPC, covering end-to-end the spectrum of required technologies with European assets: from the architecture, processor, system software, development tools to the applications. The EUPEX prototype will be designed to be open, scalable and flexible, including the modular OpenSequana-compliant platform and the corresponding HPC software ecosystem for the Modular Supercomputing Architecture. Scientifically, EUPEX is a vehicle to prepare HPC, AI, and Big Data processing communities for upcoming European Exascale systems and technologies. The hardware platform is sized to be large enough for relevant application preparation and scalability forecast, and a proof of concept for a modular architecture relying on European technologies in general and on European Processor Technology (EPI) in particular. In this context, a strong emphasis is put on the system software stack and the applications.
Being the first of its kind, EUPEX sets the ambitious challenge of gathering, distilling and integrating European technologies that the scientific and industrial partners use to build a production-grade prototype. EUPEX will lay the foundations for Europe's future digital sovereignty. It has the potential for the creation of a sustainable European scientific and industrial HPC ecosystem and should stimulate science and technology more than any national strategy (for numerical simulation, machine learning and AI, Big Data processing).
The EUPEX consortium – constituted of key actors on the European HPC scene – has the capacity and the will to provide a fundamental contribution to the consolidation of European supercomputing ecosystem. EUPEX aims to directly support an emerging and vibrant European entrepreneurial ecosystem in AI and Big Data processing that will leverage HPC as a main enabling technology.
DARE
Participants: Olivier Beaumont, Lionel Eyraud-Dubois, Mathieu Faverge, Pierre Ramet, Florent Pruvost.
- Title: A new era for supercomputing in Europe
- Duration: From March 1, 2025 to March 1, 2026
- Partners (partial list):
- BARCELONA SUPERCOMPUTING CENTER (BSC)
- CODASIP GMBH (CODA-DE)
- AXELERA AI SRL (AXE-IT)
- OPENCHIP SOFTWARE TECHNOLOGIES SL (OCT)
- INTERUNIVERSITAIR MICRO-ELECTRONICA CENTRUM (IMEC)
- FORSCHUNGSZENTRUM JUELICH GMBH (JSC)
- CINECA CONSORZIO INTERUNIVERSITARIO (CINECA)
- E4 COMPUTER ENGINEERING SPA (E4)
- CHALMERS TEKNISKA HOGSKOLA AB (CHALMERS)
- POLITECNICO DI MILANO (POLIMI)
- UNIVERSIDAD COMPLUTENSE DE MADRID (UCM)
- UNIVERSITAT POLITECNICA DE VALENCIA (UPV)
- INSTITUT NATIONAL DE RECHERCHE EN INFORMATIQUE ET AUTOMATIQUE (INRIA)
- THALES (TRT)
- TECHNISCHE UNIVERSITAET MUENCHEN (TUM)
- BULL SAS (BULL)
- Inria contact: Olivier Sentieys
- Coordinator: Osman Unsal (BSC)
- Summary:
DARE explores new paths toward greater European autonomy in HPC and AI by advancing open technologies and fostering homegrown innovation. The project aims to reduce strategic dependencies and strengthen Europe’s ability to shape its digital future.
DARE’s technologies will power future European supercomputers, enabling breakthroughs in science, industry, and AI. By strengthening Europe’s HPC supply chain and IP portfolio, DARE creates long-term economic, technological, and societal benefits across critical sectors.
DARE sets out to lay the technological foundations for European digital autonomy in HPC and AI. By combining open RISC-V architectures, chiplet technologies, and a co-designed software ecosystem, DARE aims to deliver working prototypes, shape the EU HPC roadmap, and boost Europe’s ability to build and sustain its own supercomputing value chain.
10.4 National initiatives
10.4.1 Inria Challenge
Challenge Cupseli: Collaborative Unified Platform for a Scalable and Efficient Learning Infrastructure
- Duration: 2025 – 2029
- Coordinator: Olivier Beaumont (Inria) and Alexandru Dobrila (Hivenet)
- Local contact: Olivier Beaumont, Lionel Eyraud Dubois, Yulia Gusak, Thomas Herault and Philippe Swartvagher
- Partners: Hivenet
- Inria teams:
- ARGO and MIMOVE, Inria Paris
- COAST, Inria Nancy – Grand Est
- MAGELLAN, STACK and WIDE, Inria Centre at Rennes University
- OCKHAM, Inria Centre of Lyon
- COATI and NEO, Inria Centre at Université Côte d’Azur
- TADAAM and TOPAL, Inria Centre at the University of Bordeaux
- Summary:
The Cupseli challenge aims to demonstrate that it is possible to run complex applications (particularly in the field of machine learning) on heterogeneous, distributed, and volatile resources, while achieving strong parallel efficiency and preserving both accuracy and confidentiality. Building on the combined expertise of Hivenet and Inria in storage technologies illustrated in Alvearium, this strategic partnership explores algorithmic and system solutions to optimize computation, memory, and communications, while ensuring security and fault tolerance. The work is organized around three axes: Frugality (adapting training and inference to limited and dynamic resources), Security and Confidentiality (protecting data and models through encryption, secure enclaves, and defenses against attacks), and Volatility (ensuring robustness and performance despite the unpredictable arrival and departure of resources). The shared goal is to offer a green and sovereign alternative to data centers, by leveraging already-existing resources for the benefit of AI and Big Data applications.
Challenge PULSE: Pushing low-carbon services towards the Edge
- Duration: 2022 – 2026
- Coordinator: Romain Rouvoy
- Local contact: Olivier Beaumont and Lionel Eyraud Dubois
- Partners: Qarnot Computing, ADEME
- Inria teams:
- Avalon
- Ctrl-A
- Spirals
- Stack
- Storm
- Topal
- Summary:
The Pulse challenge aims to develop and promote best practices in geo-distributed hardware and software infrastructures for more environmentally friendly intensive computing. The idea is to analyze which solutions are the most relevant, and which levers need to be focused on, to reduce the impact of infrastructures while maximizing the usefulness of their emissions. To this end, the challenge is structured around two complementary research axes to address this technological and environmental issue: holistic analysis of the environmental impact of intensive computing, and implementing more virtuous edge services.
10.5 Public policy support
Olivier Beaumont conducted an expert assessment, in collaboration with IRD and other Inria colleagues, on the current state and future development prospects of high-performance computing in Africa. This study was commissioned by the French Development Agency (Agence Française de Développement).
11 Dissemination
11.1 Promoting scientific activities
11.1.1 Scientific events: organisation
General chair, scientific chair
Member of the organizing committees
Philippe Swartvagher and Emmanuel Agullo (Concace team) were organizing chairs of Compas 2025, the French Conference on Parallelism, Architecture and System.
11.1.2 Scientific events: selection
Chair of conference program committees
- Abdou Guermouche was chair of the track System Software and Cloud Computing of the SC25 (International Conference for High Performance Computing, Networking, Storage, and Analysis) international conference.
- Thomas Herault was chair of the Programming Environments and System Software track of the ISC High Performance 2026 international conference.
Member of the conference program committees
- Olivier Beaumont was involved in the following program committees: SC25 (Algorithms), HPDC25, IPDPS26 (Algorithms).
- Lionel Eyraud Dubois was involved in the program committee of EuroPar 2025.
- Philippe Swartvagher was involved in the following program committees: Cluster 2025, PMBS 2025 workshop, and reproducibility committee of SC 25 (41, 42).
- Abdou Guermouche was involved in the program committee of HCW 2025 (Heterogeneity in Computing Workshop).
- Mathieu Faverge was involved in the program committee of SBAC-PAD 2025 (Parallel Applications and Algorithms).
- Yulia Gusak was involved in the following program committees: ICML 2025, NeurIPS 2025, AAAI 2026, ICLR 2026.
- Laércio Lima Pilla was involved in the following program committees: ESSA 2025, HCW 2025, IC2E 2025, and SC25 (Algorithms).
- Thomas Herault was involved in the following program committees: ISC High Performance 2025, ICPP 2025, SC25 (Algorithms), HiPC'25 (System Software), and the Workshop on Asynchronous Many-Tasks Applications 2025 30.
Reviewer
The members of the TOPAL project have also reviewed for the following conferences: IPDPS'25, SC25, HiPC'25.
11.1.3 Journal
Member of the editorial boards
- Olivier Beaumont is Associate Editor in Chief for the Journal of Parallel and Distributed Computing (Elsevier JPDC).
- Olivier Beaumont is Guest Editor for a Special Issue of IEEE Internet Computing with Shadi Ibrahim et al. on Serverless Computing.
- Thomas Herault is Associate Editor for Algorithms of The Journal of Parallel and Distributed Computing (JPDC).
Reviewer - reviewing activities
The members of the TOPAL project have performed reviewing for Journal of Parallel and Distributed Computing (Lionel Eyraud Dubois, Abdou Guermouche), ACM Transactions on Mathematical Software (Pierre Ramet, Abdou Guermouche), IEEE Transactions on Parallel and Distributed Systems (Lionel Eyraud Dubois, Abdou Guermouche, Mathieu Faverge), SoftwareX (Abdou Guermouche), Parallel Computing (Laércio Lima Pilla, Abdou Guermouche), 4OR - A Quarterly Journal of Operations Research (Lionel Eyraud Dubois).
11.1.4 Invited talks
- Yulia Gusak gave a talk at Sharp+Foundary @ COLT workshop entitled “Training Neural Networks Under Memory Constraints”.
- Yulia Gusak gave a talk at AI4Industry'25 entitled “Efficient Training of Neural Networks”.
- Yulia Gusak gave a talk at the 18th Scheduling for large-scale systems workshop, entitled “Optimizing neural networks training using different types of parallelisms (data/tensor/model/pipeline) and re-materialization”.
- Laércio Lima Pilla gave a talk at the 18th Scheduling for large-scale systems workshop, entitled “Exploring scheduling solutions for Federated Learning training”.
- Olivier Beaumont gave a talk at the 18th Scheduling for large-scale systems workshop, entitled “Optimized Forward-Backward Rematerialization for Memory-Efficient Pipeline Parallel Training”.
11.1.5 Leadership within the scientific community
- Olivier Beaumont is a member of the IEEE CS Babbage Award selection committee.
11.1.6 Scientific expertise
- Olivier Beaumont conducted an expert assessment, in collaboration with IRD and other Inria colleagues, on the current state and future development prospects of high-performance computing in Africa. This study was commissioned by the French Development Agency (Agence Française de Développement).
- Olivier Beaumont acted as external evaluator for several EuroHPC calls: Inno4Scale, Energy, FFPlus
- Pierre Ramet is Scientific Advisor at the CEA-DAM CESTA.
- Pierre Ramet participated in the HCERES evaluation committee of the IRFU (Institut de recherche sur les lois fondamentales de l'Univers) at CEA Saclay. The final report was published in March 2025.
- Abdou Guermouche acted as external evaluator for one ANRT proposal.
11.1.7 Research administration
- Pierre Ramet is the head of the CNRS Satanas department.
- Pierre Ramet is a member of the Scientific Committee of LaBRI.
- Philippe Swartvagher is the communication referent for the NumPEx/Exa-SofT project.
- Philippe Swartvagher is the point of contact in Bordeaux for Grid5000/SLICES-FR infrastructure.
- Philippe Swartvagher is the representative of the TOPAL team at the Bordeaux CUMI.
- Philippe Swartvagher is an elected member of the Center Committee of Inria Bordeaux.
- Abdou Guermouche is the scientific lead of the numerical library work package of the ExaSoft project (PEPR NumPEx).
- Abdou Guermouche is a member of the Scientific Committee of LaBRI.
- Yulia Gusak is a PI of the ELF associate team between Topal and Caltech.
- Laércio Lima Pilla is a member of the societal challenges commission at the LaBRI.
- Laércio Lima Pilla is a member of the committee on gender equality and equal opportunities of the Inria Research center at the University of Bordeaux.
- Laércio Lima Pilla is a member of the National Gender Equality and Equal Opportunities Committee at Inria.
11.2 Teaching - Supervision - Juries - Educational and pedagogical outreach
- Undergraduate level/Licence:
- Aurélien Esnard : Network (54h), Software technologies (80h) at Bordeaux University.
- Pierre Ramet : System programming 24h, Databases 32h, Object programming 48h, Distributed programming 16h, Cryptography 16h, Introduction to unsupervised learning 16h at Bordeaux University.
- Philippe Swartvagher : C Programming (46h), Web Programming (36h), Tools for Programming and C project (30h) at Bordeaux INP (Enseirb-MatMeca).
- Abdou Guermouche : System programming 36h at Bordeaux University.
- Mathieu Faverge : Programming environment (26h), Numerical algorithmic (25h), C projects (25h) at Bordeaux INP (Enseirb-MatMeca).
- Post graduate level/Master:
- Aurélien Esnard : Network management (24h), Network security (24h) at Bordeaux University.
- Lionel Eyraud Dubois : Graphs and Algorithms (20h), Complexity and Approximation (20h) at Bordeaux University.
- Olivier Beaumont : Parallel Algorithms, 20h at Bordeaux INP.
- Pierre Ramet : Cryptography 20h and Numerical algorithms 40h at Bordeaux INP (Enseirb-MatMeca).
- Philippe Swartvagher : Parallel Algorithms (17h), Project of network and system programming (25h), Operating Systems (15h) at Bordeaux INP (Enseirb-MatMeca).
- Abdou Guermouche : Network management 92h, Network security 64h, Operating system 24h at Bordeaux University.
- Mathieu Faverge : System programming: lecture, practice and project (54h), Linear Algebra for high Performance Computing (9h) at Bordeaux INP (Enseirb-MatMeca). He is also in charge of the master 2 internship for the Computer Science department at Bordeaux INP (Enseirb-MatMeca) and he is in charge, with Abdou Guermouche, of the High Performance Computing - High Performance Data Analytics specialty at Enseirb-MatMeca. This is a common training curriculum between the Computer Science and the MatMeca departments at Bordeaux INP and with the Bordeaux University in the context of the Computer Science Research Master.
- Yulia Gusak : Efficient Deep Learning (Outils pour l'apprentissage) (19h) at Bordeaux INP (Enseirb-MatMeca).
- Laércio Lima Pilla : Algorithms for High-Performance Computing Platforms (16h) at Bordeaux INP (Enseirb-MatMeca) and Bordeaux University, Reading articles and scientific documentation (3h) at Bordeaux University.
- Thomas Herault : Introduction to tensor algebra for the Engineer in Computer Science (9h) at Bordeaux INP (Enseirb-MatMeca); Open MP programming (8h) at Bordeaux INP (Enseirb-MatMeca).
11.2.1 Supervision
- PhD in progress: Brieuc Nicolas; Scalable tensor algebra on top of runtime system; started Oct 2024; advisors Thomas Herault, Mathieu Faverge, Abdou Guermouche.
- PhD in progress: Nicolas Ducarton; Fault tolerance and task-based programming for large-scale systems; started April 2025; advisors Thomas Herault, Samuel Thibault, Amina Guermouche.
- PhD in progress: Abel Calluaud; Combined compiler and runtime approach for a direct hierarchical solver; started Nov. 2022; advisors Pierre Ramet, Mathieu Faverge.
- PhD in progress: Jean-François David; Dynamic Scheduling for Inference in Deep Neural Networks; advisors Olivier Beaumont, Lionel Eyraud Dubois.
- PhD in progress: Alycia Lisito; Design and implementation of a portable linear algebra benchmark on runtime systems for performance evaluation of heterogeneous Exascale architectures; started Nov. 2023; advisors Pierre Ramet, Mathieu Faverge, Matthieu Kuhn (Eviden).
- PhD in progress: Dimitri Walther; started Nov. 2024; advisors Pierre Ramet, Mathieu Faverge, M. Lecouvez (CEA Cesta).
- PhD defended: Hayfa Tayeb; Optimization of high-performance applications on heterogeneous computing nodes; started Nov. 2021; defended March 25th, 2025; advisors A. Guermouche, B. Bramas, M. Faverge.
- PhD in progress: Albert D'Aviau de Piolant; Energy aware scheduling for exascale architectures; started October 2023; advisors Abdou Guermouche and Amina Guermouche.
- PhD in progress: Thomas Morin; Scheduling recursive task graphs; started October 2023; advisors Abdou Guermouche, Samuel Thibault, Pierre-André Wacrenier.
- PhD in progress: Alice Lasserre; Optimization of a task-based simulation code on a distributed supercomputer; started Oct. 2022; advisors Jean-Marie Couteyen-Carpaye, Raymond Namyst and Abdou Guermouche.
- PhD in progress: Samuel Mendoza; On the Scalability of sparse linear system solvers using the task-based paradigm; started Sept. 2025; advisors Abdou Guermouche, Emmanuel Agullo and Alfredo Buttari.
- PhD in progress: Jean Conan; Simulation-based performance prediction of scientific computing applications on exascale supercomputers; started March 2025; advisors Abdou Guermouche, Louis Poirel and Arnaud Legrand.
- PhD in progress: Adrien Aguila–Multner; Efficient Training of Neural Networks 36, 35; started October 2024; advisors Yulia Gusak, Olivier Beaumont.
- PhD defended: Diane Orhan; Modeling and dynamic optimization of software radio chains on heterogeneous architectures; defended in December 2025; advisors Denis Barthou, Christophe Jégo, and Laércio Lima Pilla.
- PhD in progress: Alan Lira Nunes; Scheduling algorithms for the optimization of distributed machine learning models on heterogeneous resources; started in August 2022; advisors Cristina Boeres, Lúcia Drummond, and Laércio Lima Pilla.
- PhD in progress: Vanderlei Munhoz Pereira Filho; Scheduling of task-based parallel applications on heterogeneous Cloud computing environments; started in February 2025; advisors Olivier Aumage, Márcio Castro, and Laércio Lima Pilla.
- PhD in progress: Giorgio Bettonte; Large-Scale Artificial Intelligence Inference Optimization in Distributed Cloud Environments; started in October 2025; advisors Olivier Beaumont, Thomas Lambert, and Laércio Lima Pilla.
- PhD in progress: Tristan Riehs; Integrated scheduling of asynchronous network communications and tasks; started in October 2025; advisors Samuel Thibault (Storm team), Alexandre Denis (Tadaam team), and Philippe Swartvagher.
- Lionel Eyraud-Dubois and Philippe Swartvagher supervised the internship of Theo Grandsart about the use of task-based runtime systems to implement LLMs.
- Philippe Swartvagher, with Alexandre Denis (Tadaam team) and Samuel Thibault (Storm team), supervised the internship of Tanguy Chatelain, about the anticipation of communications in task-based parallelism 64.
- Thomas Herault and Philippe Swartvagher supervised the internship of Joachim Robert about communications for AI applications in a heterogeneous and geo-distributed network 45.
- Thomas Herault and Philippe Swartvagher supervised the pre-PhD period of Fares Boudjaoui about the scheduling of communications in a heterogeneous network.
- Internship on task-based systems for efficient deep learning (Enrique Galves). Supervised by Yulia Gusak and Olivier Beaumont.
- Internship on diffusion model inference speed-up via parallelization within solver steps and solver composition (Victor Lucas Rosada Canesin). Supervised by Yulia Gusak.
- Internship on efficient teacher–student pipeline-parallel training, with application to Reinforcement Learning from human feedback (Mohamed Kherraz). Supervised by Yulia Gusak.
11.2.2 Juries
- Pierre Ramet: chair of the PhD jury of Lise Jolicoeur.
- Olivier Beaumont: chair of the PhD juries of Luis Lopes Marques and Diane Orhan.
- Lionel Eyraud-Dubois acted as "opponent" for the defense of Pirah Noor Soomro at Chalmers University of Technology.
- Thomas Herault: chair of the HDR jury of Francieli Boito; examiner on the jury for the PhD defense of Atte Torri; examiner on the jury for the PhD defense of Abdessalam Benhari.
- Yulia Gusak: member of the PhD jury of Yannick Malot on Quantized DNN learning algorithms with limited hardware overhead for Edge implementation.
- Yulia Gusak: member of the PhD monitoring committee (comité de suivi) of Méline Trochon on Adaptive Checkpoint-Restart System with Knowledge of the Network Load.
- Yulia Gusak: member of the PhD monitoring committee of Rafael Silva on Artificial Intelligence for Cardiac Monitoring: Portable Multimodal Cardiac Function Analysis.
11.3 Popularization
11.3.1 Participation in Live events
- As part of the "Circuit Scientifique Bordelais", Philippe Swartvagher explained to high school pupils from the Lycée Stendhal in Aiguillon what research in computer science is and how to become a researcher.
- As part of the "Fête de la Science", Olivier Beaumont presented HPC to students at Lycée Gaston Crampe, Aire-sur-l'Adour (Landes).
- Olivier Beaumont participated in several internal events (closed-door sessions, etc.) to present the activities of the Inria Bordeaux center teams at the interface between HPC and AI.
- As part of Maths en Jeans, Olivier Beaumont worked with groups of students from Andernos high school on combinatorial problems linked to training.
- On several occasions, we welcomed students from the French 3ème and 2nde classes (last year of middle school and first year of high school) into the team, with the participation of Topal's PhD students, for periods ranging from 2 hours to half a day.
12 Scientific production
12.1 Major publications
- 1 inproceedings: Symmetric Block-Cyclic Distribution: Fewer Communications Leads to Faster Dense Cholesky Factorization. SC 2022 - Supercomputing, Dallas, Texas, United States, November 2022.
- 2 inproceedings: I/O-Optimal Algorithms for Symmetric Linear Algebra Kernels. ACM Symposium on Parallelism in Algorithms and Architectures, Philadelphia, United States, July 2022.
- 3 article: Toward an Algebraic Multigrid Method for the Indefinite Helmholtz Equation. SIAM Journal on Scientific Computing, August 2025, S285-S310.
- 4 article: Programming Heterogeneous Architectures Using Hierarchical Tasks. Concurrency and Computation: Practice and Experience, 35(25), 2023.
- 5 inproceedings: HiRemate: Hierarchical Approach for Efficient Re-materialization of Large Neural Networks. Proceedings of the 42nd International Conference on Machine Learning (ICML 2025), vol. 267, Vancouver, Canada, 2025.
- 6 inproceedings: Scalable and portable LU factorization with partial pivoting on top of runtime systems. IPDPS25 - 39th IEEE International Parallel and Distributed Processing Symposium, Milan, Italy, June 2025.
- 7 inproceedings: Rockmate: an Efficient, Fast, Automatic and Generic Tool for Re-materialization in PyTorch. ICML 2023, Honolulu (HI), United States, July 2023.
12.2 Publications of the year
International journals
International peer-reviewed conferences
Conferences without proceedings
Edition (books, proceedings, special issue of a journal)
Doctoral dissertations and habilitation theses
Reports & preprints
Other scientific publications
12.3 Cited publications
- 47 article: Achieving High Performance on Supercomputers with a Sequential Task-based Programming Model. IEEE Transactions on Parallel and Distributed Systems, 2017, 1-1.
- 48 article: Implementing Multifrontal Sparse Solvers for Multicore Architectures with Sequential Task Flow Runtime Systems. ACM Trans. Math. Softw., 43(2), August 2016, 13:1-13:22.
- 49 inproceedings: Task-Based Multifrontal QR Solver for GPU-Accelerated Multicore Architectures. HiPC (Best paper award), IEEE Computer Society, 2015, 54-63.
- 50 inproceedings: Reducing Energy Consumption of Dense Linear Algebra Operations on Hybrid CPU-GPU Platforms. 2012 IEEE 10th International Symposium on Parallel and Distributed Processing with Applications, 2012, 56-62.
- 51 article: Modeling power and energy consumption of dense matrix factorizations on multicore processors. Concurrency and Computation: Practice and Experience, 26(17), 2014, 2743-2757. URL: https://onlinelibrary.wiley.com/doi/abs/10.1002/cpe.3162
- 52 inproceedings: Adaptive Precision Solvers for Sparse Linear Systems. Proceedings of the 3rd International Workshop on Energy Efficient Supercomputing (E2SC '15), Austin, Texas. Association for Computing Machinery, New York, NY, USA, 2015. URL: https://doi.org/10.1145/2834800.2834802
- 53 inproceedings: Symmetric Block-Cyclic Distribution: Fewer Communications leads to Faster Dense Cholesky Factorization. SC'22: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (best paper, Algorithm Track), IEEE and ACM, 2022.
- 54 techreport: Optimal checkpointing for heterogeneous chains: how to train deep neural networks with limited memory. RR-9302, Inria Bordeaux Sud-Ouest, November 2019.
- 55 inproceedings: Efficient Combination of Rematerialization and Offloading for Training DNNs. NeurIPS 2021 - Thirty-fifth Conference on Neural Information Processing Systems, Virtual-only Conference, December 2021.
- 56 inproceedings: MadPipe: Memory Aware Dynamic Programming Algorithm for Pipelined Model Parallelism. 2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), IEEE, 2022.
- 57 inproceedings: Optimal GPU-CPU Offloading Strategies for Deep Neural Network Training. Euro-Par 2020: Parallel Processing, Cham, Springer International Publishing, 2020, 151-166.
- 58 inproceedings: Pipelined Model Parallelism: Complexity Results and Memory Considerations. Proceedings of Europar 2021, Lisbon, Portugal, Springer, August 2021.
- 59 inproceedings: 2D Static Resource Allocation for Compressed Linear Algebra and Communication Constraints. 2020 IEEE 27th International Conference on High Performance Computing, Data, and Analytics (HiPC), IEEE, 2020, 181-191.
- 60 inproceedings: I/O-Optimal Algorithms for Symmetric Linear Algebra Kernels. ACM Symposium on Parallelism in Algorithms and Architectures, Association for Computing Machinery: SIGACT, SIGARCH, Philadelphia, United States, July 2022.
- 61 article: Optimal memory-aware backpropagation of deep join networks. Philosophical Transactions of the Royal Society A, 378(2166), 2020, 20190049.
- 62 inproceedings: Exploiting Generic Tiled Algorithms Toward Scalable H-Matrices Factorizations on Top of Runtime Systems. SIAM PP20 - SIAM Conference on Parallel Processing for Scientific Computing, 2020.
- 63 inproceedings: Tiled Algorithms for Efficient Task-Parallel H-Matrix Solvers. 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), IEEE, 2020, 757-766.
- 64 mastersthesis: Anticipation des communications réseau grâce à la connaissance du futur dans le parallélisme à tâche. MA Thesis, Enseirb-Matmeca, September 2025.
- 65 inproceedings: Efficient gpt model pre-training using tensor train matrix representation. Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation, 2023, 600-608.
- 66 article: Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174, 2016.
- 67 article: Quantization aware factorization for deep neural network compression. Journal of Artificial Intelligence Research, 81, 2024, 973-988.
- 68 article: Parallel time integration with multigrid. SIAM Journal on Scientific Computing, 36(6), 2014, C635-C661.
- 69 article: Analysis of the parareal time-parallel time-integration method. SIAM Journal on Scientific Computing, 29(2), 2007, 556-578.
- 70 article: An Efficient Multicore Implementation of a Novel HSS-Structured Multifrontal Solver Using Randomized Sampling. SIAM Journal on Scientific Computing, 38(5), 2016, S358-S384.
- 71 inproceedings: The reversible residual network: Backpropagation without storing activations. Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017, 2211-2221.
- 72 inproceedings: Memory-efficient backpropagation through time. Advances in Neural Information Processing Systems, 2016, 4125-4133.
- 73 inproceedings: Chasing Carbon: The Elusive Environmental Footprint of Computing. 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), IEEE, 2021, 854-867.
- 74 inproceedings: Survey on Large Scale Neural Network Training. The 31st International Joint Conference on Artificial Intelligence (IJCAI), 2022.
- 75 article: Parallel Hierarchical Matrices with Adaptive Cross Approximation on Symmetric Multiprocessing Clusters. Journal of Information Processing, 22(4), 2014, 642-650.
- 76 techreport: Deciding Non-Compressible Blocks in Sparse Direct Solvers using Incomplete Factorization. RR-9396, Inria Bordeaux - Sud Ouest, 2021, 16.
- 77 inproceedings: Training on the Edge: The why and the how. 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), IEEE, 2019, 899-903.
- 78 inproceedings: Taking Advantage of Hybrid Systems for Sparse Direct Solvers via Task-Based Runtimes. 2014 IEEE International Parallel & Distributed Processing Symposium Workshops, Phoenix, AZ, USA, May 19-23, 2014, IEEE Computer Society, 2014, 29-38. URL: https://doi.org/10.1109/IPDPSW.2014.9
- 79 book: Block Low-Rank multifrontal solvers: complexity, performance and scalability. Ph.D. Dissertation, Université Toulouse 3 Paul Sabatier, 2017.
- 80 article: Efficient Parallel Solution of the 3D Stationary Boltzmann Transport Equation for Diffusive Problems. Journal of Computational Physics, March 2019.
- 81 inproceedings: PipeDream: generalized pipeline parallelism for DNN training. Proceedings of the 27th ACM Symposium on Operating Systems Principles, 2019, 1-15.
- 82 article: Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350, 2021.
- 83 inproceedings: Stable Low-rank Tensor Decomposition for Compression of Convolutional Neural Network. European Conference on Computer Vision (ECCV), Springer, 2020, 522-539.
- 84 article: Sparse supernodal solver using block low-rank compression: Design, performance and analysis. International Journal of Computational Science and Engineering, 27, July 2018, 255-270.
- 85 inproceedings: Recent Developments Around the Block Low-Rank PaStiX Solver. SIAM Conference on Parallel Processing for Scientific Computing (SIAM PP 2020), 2020.
- 86 inproceedings: Leveraging Task-Based Polar Decomposition Using PARSEC on Massively Parallel Systems. 2019 IEEE International Conference on Cluster Computing (CLUSTER), IEEE, 2019, 1-12.