TOPAL

TOPAL - 2025

2025 Activity report - Project-Team TOPAL

RNSR: 202324391S

Research center: Inria Centre at the University of Bordeaux
In partnership with: Bordeaux INP, Université de Bordeaux, CNRS
Team name: Tools and Optimization for high Performance Applications and Learning
In collaboration with: Laboratoire Bordelais de Recherche en Informatique (LaBRI)

Creation of the‌ Project-Team: 2023 March 01

Keywords

Computer Science‌‌ and Digital Science

A1.1.4. High performance computing
A1.1.5.‌ Exascale
A1.1.9. Fault tolerant‌ systems
A1.2.10. Digital Communications‌‌
A1.3. Distributed Systems
A1.3.4. Peer to peer
A1.3.5.‌ Cloud
A1.6. Green Computing‌
A2.6.4. Resource management
A6.2.5.‌‌ Numerical Linear Algebra
A6.2.7. HPC for machine learning‌
A7.1. Algorithms
A7.1.1. Distributed‌ algorithms
A7.1.2. Parallel algorithms‌‌
A8.1. Discrete mathematics, combinatorics
A8.2. Optimization
A9.2. Machine‌ learning
A9.2.4. Optimization and‌ learning
A9.2.6. Neural networks‌‌
A9.2.8. Deep learning
A9.7. AI algorithmics
A9.9. Distributed‌ AI, Multi-agent

Other Research‌ Topics and Application Domains‌‌

B4.2.2. Fusion
B9.5.1. Computer science
B9.5.2. Mathematics

1‌ Team members, visitors, external‌ collaborators

Research Scientists

Olivier‌‌ Beaumont [Team leader, INRIA, Senior‌ Researcher, HDR]‌
Lionel Eyraud Dubois [‌‌INRIA, Researcher]
Yulia Gusak [INRIA‌, Researcher]
Thomas‌ Herault [INRIA,‌‌ Senior Researcher, HDR]
Laercio Lima Pilla‌ [CNRS, Researcher‌]

Faculty Members

Aurélien‌‌ Esnard [UNIV BORDEAUX, Associate Professor]‌
Mathieu Faverge [BORDEAUX‌ INP, Associate Professor‌‌]
Abdou Guermouche [UNIV BORDEAUX, Associate‌ Professor, HDR]‌
Pierre Ramet [UNIV‌‌ BORDEAUX, Professor, HDR]
Philippe Swartvagher‌ [BORDEAUX INP,‌ Associate Professor]

PhD‌‌ Students

Adrien Aguila–Multner [INRIA]
Giorgio Bettonte‌ [HIVE COMPUTING SERVICES‌ SAS, CIFRE,‌‌ from Oct 2025]
Abel Anas Calluaud [‌CEA, CIFRE,‌ until Oct 2025]‌‌
Jean Conan [BULL, CIFRE]
Jean‌ Francois David [INRIA‌, until Feb 2025‌‌]
Andrei Drozdov [DIABOLOCOM, CIFRE,‌ from Oct 2025]‌
Alan Lira Nunes [‌‌INRIA and UFF, Joint-doctorate (cotutelle) with UFF,‌ Brazil]
Alycia Lisito‌ [BULL]
Samuel‌‌ Mendoza [INRIA, from Sep 2025]‌
Brieuc Nicolas [INRIA‌]
Hayfa Tayeb [‌‌INRIA, until Mar 2025]
Dimitri Walther‌ [CEA, CIFRE‌]

Technical Staff

Pierre‌‌ Estérie [INRIA, Engineer]

Interns and‌ Apprentices

Fares Boudjaoui [‌INRIA, Apprentice,‌‌ from Dec 2025]
Raphael Bourgouin [INRIA‌, from Oct 2025‌]
Raphael Bourgouin [‌‌INRIA, Intern, from May 2025 until‌ Aug 2025]
Raphael‌ Bourgouin [INRIA,‌‌ until Apr 2025]
Killian Chateau [INRIA‌, Intern, until‌ Apr 2025]
Enrique‌‌ Galvez [INRIA, Intern, until Jan‌ 2025]
Theo Grandsart‌ [INRIA, from‌‌ Nov 2025]
Theo Grandsart [INRIA,‌ Intern, from May‌ 2025 until Aug 2025‌‌]
Mohamed Kherraz [INRIA, Intern,‌ from Apr 2025 until‌ Sep 2025]
Matteo‌‌ Marcos [INRIA, Intern, from Mar‌ 2025 until Jul 2025‌]
Samuel Mendoza [‌‌INRIA, from Apr 2025 until Aug 2025‌]
Zhaniya Nurkhanova [‌INRIA, Intern,‌‌ until Apr 2025]‌
Joachim Robert [INRIA, Intern, from‌ Apr 2025 until Aug 2025]
Victor Lucas‌ Rosada Canesin [INRIA, Intern, from‌ Sep 2025 until Sep 2025]
Victor Lucas‌ Rosada Canesin [INRIA, Intern, from‌ May 2025 until Aug 2025]

Administrative Assistants‌

Catherine Cattaert Megrat [INRIA]
Marie-Melissandre Roy‌ [INRIA]

2 Overall objectives

The team's expertise lies at the intersection of numerical simulation, training and HPC. In this context, the ability to use the ever-increasing power of machines effectively for numerical simulations (with the shift to exascale expected over the next few years) remains central. These new platforms are characterized by their huge size (in terms of number of cores) and by the heterogeneity of their computing resources, with most of the computational power provided by accelerators. We have largely anticipated these evolutions: in particular, the members of the team have worked for several years to promote the use of dynamic runtimes such as StarPU, through a long-running collaboration with the Storm project team. Runtime systems allow heterogeneous resources to be used transparently and allow some placement and scheduling decisions to be made dynamically, without requiring a static plan to be computed in advance. Indeed, a fully static allocation would not be able to cope with the uncertainties of task and communication durations in increasingly complex environments and with increasingly shared resources. The questions of how to scale up these solutions, how to use them for (neural network) training, and how to manage large-scale distributed machines effectively remain largely open.

As in‌ many other fields, Machine Learning is changing the‌ landscape at many levels. Training of large networks‌ represents a new application for HPC because of‌ the huge computational and memory needs it generates.‌ Training has become a major source of use‌ for converged HPC systems such as the Jean‌ Zay supercomputer at IDRIS. If considered as an‌ HPC workflow, it is an application that is‌ quite different from traditional numerical simulation applications, because‌ the calculations are tensor-based rather than matrix-based and‌ because the nature of the dependencies makes parallelization‌ more difficult and more intertwined with memory management‌ issues.

On the other hand, ML plays a central role in the analysis of data, particularly data produced by large scientific instruments and large numerical simulations. In this context, it is important to bridge the data placement, resource allocation and computational scheduling strategies that are used to perform simulations and to perform data analysis. There again, we believe that dynamic runtime schedulers, coupled with static data placement strategies, are a relevant and promising tool. Finally, training represents a very important market and has a strong and growing influence on processor architectures, their accuracy and their arithmetic. This requires further adapting the algorithms, the management of ever-increasing heterogeneity and the control of computational accuracy, both for classical numerical kernels and for training deep neural networks.

Another major concern is the minimization of energy consumption and carbon footprint. HPC has not historically been a field of energy sobriety, yet energy is a critical issue. Firstly, energy is a major subject because the race towards exascale has highlighted the difficulty of electrically powering all these resources, and the increasing presence of dark silicon in computing resources makes resource allocation and power management problems extremely difficult. Furthermore, the minimization of our carbon footprint is a major societal issue and must be an axis of evaluation for our research. In this context, we believe that the solution cannot be found only at the architecture and system levels, and that it is necessary to rethink parallel numerical kernels and algorithms in such a way as to allow prolonged use of the computing resources.

A new development in‌‌ the team’s research is the explicit focus on‌ communication efficiency and fault‌ tolerance as central challenges‌‌ of modern high-performance computing. As platforms continue to‌ scale in size and‌ heterogeneity, the cost of‌‌ data movement increasingly dominates execution time, while hardware‌ and software failures can‌ no longer be considered‌‌ exceptional. Addressing these issues requires approaches that integrate‌ communication management and resilience‌ directly into algorithms and‌‌ runtime systems, rather than treating them as external‌ concerns. This problem statement‌ is particularly relevant to‌‌ the team’s main application domains—linear algebra and machine‌ learning—where large-scale data exchanges,‌ iterative methods, and long-running‌‌ computations make performance and robustness tightly coupled.

Overall,‌ the objective of the‌ project is to transfer‌‌ our historical expertise in linear algebra, runtime systems‌ and combinatorial optimization (resource‌ allocation, scheduling) to new‌‌ problems (decompositions and tensor algebra, training in DNNs)‌ which require a change‌ of scale and new‌‌ algorithms for new computing platforms (with different number‌ representations and an ever‌ increasing heterogeneity of computing‌‌ resources). In addition, these new applications and new‌ platforms require a central‌ focus on data, since‌‌ the gap between the costs (in energy and‌ time) of storing and‌ moving data compared to‌‌ the costs of computation is always growing, which‌ encourages innovative solutions (compression,‌ redundant computation) that can‌‌ in turn contribute to increasing the duration of‌ use of computing resources.‌

3 Research program

3.1‌‌ Objectives

We propose to structure our research around‌ two main application fields‌ (see Section 4):‌‌ linear multi-dimensional algebra and solvers on the one‌ hand, and training in‌ particular of deep learning‌‌ networks on the other hand. In these two‌ domains, our contributions will‌ be organized around three‌‌ main research axes (see Section 3.3): the‌ use of task based‌ runtime systems (to provide‌‌ robust solutions and to increase the portability in‌ the context of heterogeneous‌ large scale platforms), the‌‌ use of compression (to limit memory footprint and‌ data transfers) and the‌ minimization of energy consumption‌‌ and carbon impact (using an approach of rewriting‌ algorithms and placement strategies‌ to limit data movements).‌‌ This matrix organization of‌ our activities (see Section 3.4) is intended‌ to maximize the interactions between the different researchers‌ of the team and facilitate knowledge sharing and‌ joint participation in projects.

Among these topics, the use of task-based runtime systems and the design of efficient linear algebra kernels and solvers belong to the historical expertise of the team and are shared by all team members, especially as regards linear algebra kernels. Our goal is to build on this expertise to extend the use of task-based runtime systems to other types of applications, such as training, and to use our precise knowledge of these linear algebra kernels to incorporate new criteria such as energy minimization. The application to training (and inference) in deep neural networks and data compression are subjects we have been interested in for a few years, typically during the last HiePACS evaluation period and within the Inria Challenge of AI, HPC and Big Data led by Bruno Raffin. The extension of the techniques developed in linear algebra to tensor algebra and tensor decompositions is natural, given the proximity of the fields and the practical importance of the subject, but it is more recent and reinforced by the arrival of Yulia Gusak, who is an expert in the field. Finally, the objective of energy and carbon footprint minimization, at the algorithmic and software levels rather than at the architecture level, is a field that we wish to emphasize in our research, both because of its own fundamental importance and because we believe that our expertise and the techniques that we have developed in recent years are well adapted to it and that the approach we propose is original.

3.2 Overall Positioning

The general positioning of the team is to produce tools for users, academic or industrial, in the form of algorithms and software libraries. These users can work either in numerical simulation or in training. Nevertheless, as our experience in simulation and training has already demonstrated, this interaction cannot be reduced to providing black boxes: it is crucial for us to work directly with the users of our software to understand their needs and adapt our algorithms and codes to the characteristics of their data. This interaction will be particularly critical for work on data representation and compression, which requires close interaction with numerical methods and machine learning in order to understand the application requirements and the characteristics of the data, based on their significance.

At the other end of the spectrum, it is also essential for us to maintain close relationships with both the architecture and system communities. Indeed, the very rapid growth of machine learning applications has also renewed the landscape of computing resources, with the emergence of very original solutions at the architectural and arithmetic levels. Even if we cannot influence these evolutions, it is very important to propose solutions that make the best use of them. We also decided several years ago to rely on task-based runtime systems to implement our software developments. This decision has many implications for our developments and requires an extremely close collaboration with their designers. In this context, we have co-supervised several PhD theses related to StarPU with the Storm project team and we will pursue this strategy, which is crucial in particular to address the challenges ahead of us: the transition to exascale, the integration of energy concerns, the extension to training applications and the ever-increasing heterogeneity of computing resources.

3.3 Research‌ Axes

3.3.1 Use of‌‌ Runtime systems

Participants: Olivier Beaumont, Aurélien Esnard‌, Lionel Eyraud Dubois‌, Mathieu Faverge,‌‌ Abdou Guermouche, Thomas Herault, Laércio Lima‌ Pilla, Philippe Swartvagher‌.

In previous work, our main goal was to study the methodology needed to efficiently exploit the new generation of high-performance computers with all the constraints that it induces (number of cores, heterogeneity, co-scheduling effects, etc.). To achieve this goal, we successfully proposed a methodology based on the use of modern task-based runtime systems to ensure both portability and performance portability (the ability to achieve high performance by tuning only a few parameters of the application). This work was done in the context of several projects (ANR SOLHAR, ANR SOLHARIS, Projet Région HPC Scalable Ecosystem, etc.). It mainly targeted single multicore nodes equipped with several accelerator devices; the extension of these techniques to the multi-node case will be the focus of our future work, especially with the arrival of Philippe Swartvagher in the team. Indeed, it has been observed that, in the context of distributed nodes, the placement strategies of runtime systems are insufficient and generate too much communication. In this context, it is therefore crucial to develop efficient placement strategies [60, 53]. The extension of these mixed (static/dynamic) strategies to the case of tensors is largely open.
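As an illustration of the dynamic placement decisions discussed above, the following sketch places each task, at the moment it becomes ready, on the worker that would finish it earliest. It is a toy model under stated assumptions, not the StarPU API or any scheduler used by the team: the `Task` class, the `schedule` function, the worker names and all durations are hypothetical, and data transfer costs are ignored.

```python
import heapq
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    cost: dict                      # duration per resource type (hypothetical values)
    deps: list = field(default_factory=list)


def schedule(tasks, workers):
    """Greedy dynamic scheduling on heterogeneous workers.

    workers maps a worker name to its resource type ("cpu" or "gpu").
    Returns {task name: (worker, start, end)}.
    """
    by_name = {t.name: t for t in tasks}
    remaining = {t.name: len(t.deps) for t in tasks}
    children = {t.name: [] for t in tasks}
    for t in tasks:
        for d in t.deps:
            children[d].append(t.name)

    free_at = {w: 0.0 for w in workers}          # time at which each worker is idle
    ready_time = {}                              # latest end time among predecessors
    ready = [(0.0, t.name) for t in tasks if not t.deps]
    heapq.heapify(ready)
    plan = {}

    while ready:
        t_ready, name = heapq.heappop(ready)
        task = by_name[name]
        # Decision taken only now, with up-to-date knowledge of worker availability.
        best = min(workers,
                   key=lambda w: max(free_at[w], t_ready) + task.cost[workers[w]])
        start = max(free_at[best], t_ready)
        end = start + task.cost[workers[best]]
        free_at[best] = end
        plan[name] = (best, start, end)
        for child in children[name]:
            remaining[child] -= 1
            ready_time[child] = max(ready_time.get(child, 0.0), end)
            if remaining[child] == 0:            # all dependencies satisfied
                heapq.heappush(ready, (ready_time[child], child))
    return plan


tasks = [
    Task("A", {"cpu": 2.0, "gpu": 1.0}),
    Task("B", {"cpu": 3.0, "gpu": 1.0}, deps=["A"]),
    Task("C", {"cpu": 1.5, "gpu": 1.0}, deps=["A"]),
    Task("D", {"cpu": 2.0, "gpu": 0.5}, deps=["B", "C"]),
]
workers = {"cpu0": "cpu", "gpu0": "gpu"}
plan = schedule(tasks, workers)
makespan = max(end for _, _, end in plan.values())
```

In this example task C runs on the CPU because the GPU is busy with B, even though C is faster on the GPU in isolation. A static plan computed from nominal durations could not adapt in this way if B took longer than expected.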

3.3.2 Design of compression‌‌ techniques

Participants: Abdou Guermouche, Yulia Gusak,‌ Mathieu Faverge, Pierre‌ Ramet, Philippe Swartvagher‌‌.

The memory consumption of applications has been and will remain an important challenge for solving the larger problems that will lead to exascale computations. In recent years, we have demonstrated the value of data compression techniques in linear solvers, both to save memory and to reduce computations. Increasingly complex compression schemes require programming models to evolve in order to properly express the parallelism of these formats and to accommodate the increasing irregularity of applications. In TOPAL, we propose to continue the study of data compression techniques (low-rank, mixed precision, ...) in the context of solvers, but also in the context of training and multi-linear algebra. This will be a very relevant test case for the study of applications on top of runtime systems, because the strong irregularities make load balancing more complicated. At the same time, it is an original and promising approach for energy reduction.

Representing convolutional or fully-connected weights in tensor formats is an effective way to reduce the number of parameters and FLOPs in neural networks. However, post-quantization (reduction of parameter precision, for example from float32 to int8) of networks with factorized weights yields a significant drop in accuracy. Because of the memory and power consumption limitations of real devices, the quantization step is necessary when pre-trained models are deployed. Therefore, our goal is to find algorithms that build tensorized neural networks in which the weight factors directly contain elements in low-precision format. Efficient implementations of operations on tensors represented in low-bit formats will be required, as well as the development of regularization techniques to tackle instability issues when training deep learning models with low-bit weights.
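To make the post-quantization step mentioned above concrete, the sketch below applies a symmetric, per-tensor int8 quantization to a small list of weights and measures the rounding error. It is a minimal illustration under stated assumptions: the function names and the weight values are hypothetical, Python floats stand in for float32, and real frameworks typically use finer-grained (for example per-channel) schemes.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: floats -> integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map the stored integers back to approximate floating-point weights."""
    return [x * scale for x in q]


# Hypothetical entries of one weight factor.
weights = [0.02, -0.5, 0.31, 1.27, -1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding to the nearest level bounds the error by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

When a weight is stored as a product of several factors, each factor carries its own rounding error of this kind, which is one intuition for why quantizing factorized networks after training is delicate.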

3.3.3 Energy minimization

Participants: Olivier Beaumont, Lionel‌ Eyraud Dubois, Mathieu Faverge, Abdou Guermouche‌, Yulia Gusak, Laércio Lima Pilla.‌

Running computations with resource frugality is an important challenge, both for the upcoming exascale shift and for generally reducing the carbon impact of scientific computing. In addition to the usual objective of making computations run faster, we thus intend to design and evaluate our techniques and algorithms with the purpose of limiting their carbon footprint. In particular, given the lasting trend that the time and energy costs of computing keep decreasing relative to the costs of accessing and communicating data, we want to explore the tradeoffs of trading more computation for less data movement. This can be achieved in several ways: compression techniques as described above, replication of some computations, or use of lower precision. We are planning to work on this issue from two points of view: more frugal numerical algorithms, and energy-aware scheduling techniques. As with the embedded architectures found in phones, and also in the latest generation of laptops (Apple M1 Pro and Max chips), we are starting to see the emergence of Big-Little type technologies in the design of HPC-oriented chips. In general, thermal design power (TDP) constraints push architects to increase the diversity and number of energy-efficient circuits, even if they cannot all be powered simultaneously. While this hardware solution is very debatable from the point of view of carbon impact, it raises difficult and original questions about the optimization of computing performance under energy constraints. This kind of approach opens new perspectives, both for scheduling algorithms and for the design of computational kernels in linear algebra. We are also seeing the emergence of new processors (ARM or RISC-V technologies, Rhea from the SiPearl company within the EPI consortium), which should seriously compete with the supremacy of x86 architectures (Intel and AMD) combined with Nvidia accelerator cards in the search for a compromise between pure performance and energy sobriety.

In the field of training, a complementary opportunity is available. Indeed, contrary to classical HPC, the renewal of computational resources is often driven by the need to run larger models (and, to a lesser extent, data with a better resolution) rather than by the acceleration of computations. In this context, the possibility offered by tools such as Rotor (Section 7.1.8), Rockmate (Section 7.1.7) and ELF (Section 7.1.2) to limit memory requirements contributes to limiting the carbon footprint. Our goal is to extend the scope of these techniques, including to fields of application other than training. Our collaboration with Qarnot Computing is consistent with this objective. The co-design environments of the TextaRossa and Eupex projects (Section 10) are also great avenues to explore these questions.

3.3.4 Communication and‌‌ Fault Tolerance

Participants: Olivier Beaumont, Lionel Eyraud‌ Dubois, Thomas Herault‌, Philippe Swartvagher.‌‌

The new research axis on communication and fault‌ tolerance represents an opportunity‌ for the team to‌‌ address a broader spectrum of challenges arising in‌ modern high-performance computing platforms.‌ As applications increasingly rely‌‌ on large numbers of interconnected components, communication costs‌ and failures have become‌ central limitations to performance,‌‌ scalability, and usability. This axis builds on the‌ expertise brought by the‌ arrival of Thomas Hérault‌‌ as a research director, together with the team’s‌ existing strengths in communication‌ systems through Philippe Swartvagher,‌‌ to explore new techniques spanning communication optimization, resilience‌ mechanisms, and their interaction‌ with runtime systems. By‌‌ covering a wider range of problems and solution‌ strategies, this axis naturally‌ complements the existing research‌‌ directions of the team and reinforces their applicability‌ to the targeted application‌ domains, enabling more scalable,‌‌ efficient, and robust executions on current and future‌ computing platforms.

3.4 Main‌ Research Topics

The list‌‌ of our contributions can be read at the‌ intersection of the research‌ domains described in Section‌‌ 4 and research axes described in Section 3.3‌ as shown in the‌ following table:

	Axis 3.3.1 – Runtime	Axis 3.3.2 – Compression	Axis 3.3.3 – Energy	Axis 3.3.4 – Comm. & Fault Tol.
Domain 4.1 – Lin. Alg., Tensors	Topic 3.4.1	Topic 3.4.2	Topic 3.4.3	Topic 3.4.7
Domain 4.2 – Training	Topic 3.4.4	Topic 3.4.5	Topic 3.4.6	Topic 3.4.8

3.4.1‌ Task-based Linear Algebra and‌‌ Tensor Computations

Participants: Olivier Beaumont, Aurélien Esnard‌, Lionel Eyraud Dubois‌, Mathieu Faverge,‌‌ Abdou Guermouche, Thomas Herault, Pierre Ramet‌, Philippe Swartvagher.‌

We plan to continue our activity on task-based linear algebra to find solutions for expressing high-level algorithms in an elegant way while ensuring high performance. First, we want to consider the expressivity of the algorithms for large-scale distributed architectures while considering the specific problems of scheduling, data and task mapping, and data granularity. This work will be done in close collaboration with the Storm and Tadaam teams and is a key objective of the ANR SOLHARIS project. Moreover, the foundations of this topic date back to the HiePACS project. Thus, we plan to collaborate and exchange with the CONCACE team on topics which are of interest to both teams (mainly expressivity and scalability). Second, as mentioned above, we plan to study data compression techniques in linear algebra [70, 75, 79], which bring new algorithmic schemes that are outside the scope of the classical programming model used until now. As mid- and long-term objectives, we would like to find new ways to express these linear algebra algorithms to efficiently exploit large heterogeneous architectures. A second research topic focuses on the extension of the techniques developed in the framework of linear algebra, in particular with the Chameleon library, to multi-linear algebra and tensors. The idea is to build on the expertise we have in the field of compression and in the use of runtimes, in particular to exploit heterogeneous resources.

Another challenge would be to redesign the graph partitioning and matrix ordering algorithms within a task-based runtime, in order to facilitate the integration of this basic building block in modern task-based solvers. This work has already been initiated in the StarPart project (Section 7.1.5).

3.4.2 Multi-Linear‌ Algebra and Tensor Decompositions

Participants: Olivier Beaumont,‌ Lionel Eyraud Dubois, Mathieu Faverge, Abdou‌ Guermouche, Yulia Gusak, Thomas Herault,‌ Pierre Ramet.

Tensor decompositions can be viewed as a natural generalization of SVD-type matrix decompositions from linear algebra. In the tensor setting, several decomposition formats have been developed, each offering different trade-offs between expressiveness, computational cost, and compression efficiency. These methods play an important role in the analysis of large-scale data, as well as in the compression and inference/training acceleration of neural networks. The addition of Yulia Gusak to the project strengthens our expertise in this area [83, 65].

In addition to the basic kernels to be integrated in Chameleon, as proposed in Topic 3.4.1, we will propose distributed tensor decomposition and compression algorithms, focusing on low-order tensors with large mode dimensions, which are common in neural network models.
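The compression trade-off between decomposition formats can be illustrated by counting stored parameters. The sketch below compares a dense tensor with its CP and Tucker representations, using the standard storage costs of these two formats. The tensor shape and the ranks are hypothetical values chosen for illustration (a low-order tensor with two large modes, in the spirit of a convolution kernel), and the accuracy reached at a given rank is not modelled.

```python
from math import prod


def dense_params(shape):
    """Number of entries of the full tensor."""
    return prod(shape)


def cp_params(shape, rank):
    """CP format: one factor matrix of size n_k x rank per mode."""
    return rank * sum(shape)


def tucker_params(shape, ranks):
    """Tucker format: a core of size r_1 x ... x r_d plus one n_k x r_k factor per mode."""
    return prod(ranks) + sum(n * r for n, r in zip(shape, ranks))


# Hypothetical 4-way tensor: (output channels, input channels, height, width).
shape = (512, 512, 3, 3)

n_dense = dense_params(shape)
n_cp = cp_params(shape, rank=64)
n_tucker = tucker_params(shape, ranks=(128, 128, 3, 3))

print(f"dense:  {n_dense}")
print(f"CP:     {n_cp} ({n_dense / n_cp:.1f}x smaller)")
print(f"Tucker: {n_tucker} ({n_dense / n_tucker:.1f}x smaller)")
```

The storage of the CP format grows with the sum of the mode dimensions instead of their product, which is why large mode dimensions make these formats attractive.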

3.4.3 Energy Minimization in Linear‌ Solvers

Participants: Mathieu Faverge, Abdou Guermouche.‌

We plan to investigate how to reduce the energy consumption of linear algebra libraries (either sparse or dense). To do so, we will rely on an algorithmic approach rather than a system approach. The idea, as a first step, is to consider several implementations of the same kernel and to select one of them while taking energy consumption into account [51, 50, 52]. For instance, a low-rank implementation of a given operation will be slower than a regular high-performance implementation, but it will tend to require less energy. In the longer term, we also plan to investigate how to design energy-efficient implementations of basic kernels. They will then be used within higher-level algorithms in order to find a better trade-off between energy consumption and high performance. In the context of developing linear algebra solvers using compression techniques, a research direction we would like to develop is the study of the energy consumption of these solvers: is it possible to provide computation kernels with different energy consumption levels that can be easily exchanged to lower the final energy consumption of the application while keeping the same numerical accuracy? Low-rank compression techniques, as well as mixed-precision solutions, are envisioned toward this objective.
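The idea of choosing among several implementations of the same kernel can be sketched as a small selection rule: among the variants that meet a time budget, pick the one with the lowest energy. This is an illustrative assumption about how such a choice could be expressed, not the method of the cited works. The `Variant` class, the `pick_variant` function and all time and energy figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    name: str
    time_s: float     # measured or modelled execution time (hypothetical)
    energy_j: float   # measured or modelled energy (hypothetical)


def pick_variant(variants, time_budget_s):
    """Least-energy variant within the time budget; fastest one if none fits."""
    feasible = [v for v in variants if v.time_s <= time_budget_s]
    if not feasible:
        return min(variants, key=lambda v: v.time_s)
    return min(feasible, key=lambda v: v.energy_j)


variants = [
    Variant("dense_fp64", time_s=1.0, energy_j=300.0),
    Variant("low_rank", time_s=1.4, energy_j=180.0),
    Variant("mixed_precision", time_s=0.8, energy_j=220.0),
]

relaxed = pick_variant(variants, time_budget_s=1.5)   # slack available
tight = pick_variant(variants, time_budget_s=1.0)     # task on the critical path
```

With slack available, the slower but more frugal variant is selected; under a tight budget, the rule falls back to the most frugal variant that is still fast enough.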

3.4.4 Task-based Approaches‌‌ for Deep Learning

Participants: Olivier Beaumont, Lionel‌ Eyraud Dubois, Mathieu‌ Faverge, Abdou Guermouche‌‌, Yulia Gusak, Thomas Herault, Laércio‌ Lima Pilla, Pierre‌ Ramet, Philippe Swartvagher‌‌.

In popular Deep Learning frameworks like TensorFlow or PyTorch, the parallelization of the training process is performed with a coarse granularity, mostly relying on Data Parallelism. Specialized frameworks have been proposed to explore finer parallel schemes, like PipeDream for model parallelism [81]. These implementations are however very static and require explicit and error-prone data management policies. We believe that our expertise in using task-based runtime systems can be used to provide much simpler approaches for finer-grained control over the execution of the corresponding task graphs and communication patterns, for both training and inference phases. We plan to design a prototype implementation that would make it easy to use clever scheduling and optimization techniques to improve the performance of inference. In the longer term, we expect that this approach will provide better scalability and flexibility, and unlock new opportunities for optimization, for a wide range of deep learning applications.

3.4.5‌ Tensor Compression for Inference‌

Participants: Olivier Beaumont,‌‌ Yulia Gusak.

We envision a research activity focused on the use of tensor compression for inference. Initially, the objective is to combine tensor compression techniques and quantization in order to enable inference under strict memory constraints or low-latency requirements [67, 83]. These techniques can also be extended to the context of on-device training, which in particular requires memory-saving approaches [74]. Finally, a more ambitious goal would be to combine these approaches with methods for designing neural networks that are inherently efficient in terms of memory usage [71].

3.4.6 Carbon Saving and‌‌ Energy-Efficient Training

Participants: Olivier Beaumont, Lionel Eyraud‌ Dubois, Yulia Gusak‌, Laércio Lima Pilla‌‌.

The training phase of Deep Neural Networks is notoriously resource-hungry, especially regarding its energy consumption. In recent years, we have proposed several algorithmic solutions (re-materialization [54, 5], offloading [57], their combination [55], pipelining [58, 36]) to reduce the resource consumption of this training phase, with a focus on reducing the training time. We plan to broaden the scope of these studies by also taking energy usage into account. A heterogeneous context and a flexible runtime system, as planned in Topic 3.4.4, may also be an opportunity to reduce energy consumption by allocating some tasks, typically the non-critical ones, to the resources that are most efficient for them, or by selecting a different implementation with better energy efficiency. This can be seen as a generalization of mixed-precision techniques, which are also very popular in this context to help achieve better frugality. However, care must be taken not to degrade the convergence of the training phase. Moreover, the carbon footprint comes essentially from the manufacturing [82, 73] of the computing resources (GPUs), and the main goal is to make it possible to avoid renewing them, as enabled by memory-saving techniques.
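The memory saving brought by re-materialization can be illustrated with a deliberately simplified model. Assume a chain of identical layers, each producing one unit of activation, where one activation is kept every `segment` layers during the forward pass and each segment is recomputed from its checkpoint during the backward pass. This is not the algorithm implemented in the tools cited in this report; the function names and the uniform-cost assumption are illustrative only.

```python
from math import ceil


def peak_activations(n_layers, segment):
    """Checkpoints kept during the forward pass, plus the activations
    rematerialized inside one segment (its checkpoint is already counted)."""
    checkpoints = ceil(n_layers / segment)
    return checkpoints + segment - 1


def extra_forward(n_layers, segment):
    """Layer evaluations recomputed during the backward pass."""
    return n_layers - ceil(n_layers / segment)


def best_segment(n_layers):
    """Segment length minimizing the peak number of stored activations."""
    return min(range(1, n_layers + 1),
               key=lambda s: peak_activations(n_layers, s))


n = 100
store_all = peak_activations(n, 1)   # segment = 1: every activation is kept
s_opt = best_segment(n)
with_remat = peak_activations(n, s_opt)
recomputed = extra_forward(n, s_opt)
```

In this model, a segment length close to the square root of the number of layers reduces peak activation memory from 100 units to 19, at the price of recomputing 90 layer evaluations, that is, less than one additional forward pass.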

3.4.7 Communication-Aware Resilience Patterns for Iterative Linear Algebra‌

Participants: Thomas Herault.

This year, our work‌ in the Communication and Fault Tolerance axis addressed‌ the growing need for efficient and resilient execution‌ of large-scale iterative linear algebra algorithms on modern‌ HPC platforms. At scale, such computations are increasingly‌ limited by communication costs and are exposed to‌ a wide range of faults, including process failures,‌ silent data corruptions, and memory errors, which can‌ no longer be treated as rare events. Our‌ contributions explore communication-aware resilience strategies that integrate fault‌ detection, recovery, and checkpointing directly into the algorithmic‌ structure of iterative methods. By carefully controlling the‌ frequency and granularity of verification, redundancy, and checkpointing‌ mechanisms, we showed that it is possible to‌ bound error propagation while significantly reducing overheads compared‌ to classical replication-based approaches. A central outcome of‌ this work is a set of analytical models‌ and optimization techniques that guide the design of‌ hierarchical and adaptive resilience patterns, balancing computation, communication,‌ memory usage, and execution time under realistic system‌ constraints such as bounded detection latency and fixed-time‌ resource allocations. Although primarily evaluated on linear algebra‌ solvers, these techniques are largely generic and directly‌ applicable to other iterative workloads, including neural network‌ training, where similar trade-offs between communication efficiency, redundancy,‌ and robustness arise.
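The trade-off governing checkpoint frequency can be illustrated with the classical first-order approximation attributed to Young and Daly, in which the period between checkpoints is the square root of twice the checkpoint cost times the mean time between failures. The sketch below is given for background only: it is not the analytical model developed in the work described above, and the checkpoint cost and failure rate are hypothetical values.

```python
from math import sqrt


def young_daly_period(checkpoint_cost_s, mtbf_s):
    """First-order optimal time between two checkpoints."""
    return sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def waste(period_s, checkpoint_cost_s, mtbf_s):
    """First-order fraction of time lost: checkpointing overhead plus
    expected re-execution of half a period after a failure."""
    return checkpoint_cost_s / period_s + period_s / (2.0 * mtbf_s)


# Hypothetical platform: 60 s per checkpoint, one failure per day on average.
C, MTBF = 60.0, 86400.0
period = young_daly_period(C, MTBF)
lost = waste(period, C, MTBF)

print(f"checkpoint every {period:.0f} s, about {100 * lost:.1f}% of time lost")
```

Checkpointing more often than this period increases the overhead term, while checkpointing less often increases the expected amount of work lost at each failure.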

3.4.8 Communication-Aware Resilience Patterns for‌ Training and Inference

Participants: Olivier Beaumont, Fares‌ Boudjaoui, Lionel Eyraud Dubois, Thomas Herault‌, Philippe Swartvagher.

The training and inference‌ phases of modern machine learning workloads rely heavily‌ on large-scale collective communications, which are traditionally designed‌ under the assumption of stable, homogeneous, and reliable‌ infrastructures. However, as execution platforms become increasingly heterogeneous‌ and volatile, communication costs and failures have a‌ growing impact on both performance and correctness. Building‌ on our recent work on resilience patterns for‌ iterative algorithms, we are initiating new research on‌ communication-aware resilience mechanisms for distributed training and inference,‌ with the goal of jointly addressing efficiency and‌ robustness. This work, launched with the PhD of‌ Farès Boudjaoui in the context of the Cupseli‌ Inria challenge, explores adaptive communication schemes, fault-tolerant collectives,‌ and dynamic reconfiguration strategies that can tolerate node‌ unavailability, bandwidth variability, and network contention, while limiting‌ synchronization and data movement overheads. By integrating resilience‌ directly into communication patterns, rather than treating failures‌ as exceptional events, we aim to support scalable‌ and robust executions in both centralized HPC platforms‌ and more decentralized environments, without degrading convergence or inference quality.

4 Application‌ domains

4.1 Multi-Linear Algebra‌ and Solvers

Participants: Olivier‌‌ Beaumont, Aurélien Esnard, Lionel Eyraud Dubois‌, Mathieu Faverge,‌ Abdou Guermouche, Yulia‌‌ Gusak, Thomas Herault, Pierre Ramet,‌ Philippe Swartvagher.

At the core of a large number of simulation tools, the resolution of large linear systems often represents the dominant part of the computing time. These linear solvers rely on a wide variety of numerical methods and algorithms. Massively parallel versions are required to support advances in multi-physics and multi-scale simulations, especially when targeting exascale platforms. The aim is therefore to address the major challenge of designing and building numerically robust solvers on top of runtime systems that can scale up and push back the limits of existing industrial codes by making full use of all computing resources such as CPUs, GPUs and other accelerator units. Following the ANR project SOLHARIS (and previously SOLHAR), we now have experience of strong/weak scalability of sparse direct solvers on large-scale, distributed-memory, heterogeneous computers. These solvers already rely on asynchronous task-based parallelism 48, 49, 78, 47, rather than traditional and widely adopted message-passing and multithreading techniques. Indeed, modern runtime systems have proven to be good tools for the development of scientific computing applications 80, 62, 86, in particular in combination with compression 63, 85, 84, 59, 76 and communication-avoiding techniques 60, 53. This work can be extended naturally to multi-dimensional objects such as tensors. In the tensor case, we propose to extend the data distribution strategies to minimize communication, and to use runtime systems to handle the variability and heterogeneity of computational resources. Finally, we have focused so far on minimizing the execution time, whereas energy efficiency is becoming a critical element. We therefore plan to revisit the algorithms and methods we developed in linear algebra, and those we propose to design for handling tensors, so as to make the best use of the available hardware and guarantee the performance of the computations within a fixed energy budget.

4.2 Training and‌ Inference for DNNs

Participants:‌ Olivier Beaumont, Lionel‌‌ Eyraud Dubois, Yulia Gusak, Thomas Herault‌, Laércio Lima Pilla‌, Pierre Ramet,‌‌ Philippe Swartvagher.

The training phase of Deep Neural Networks has become an important source of HPC resource usage, and it is crucial to perform it efficiently on parallel architectures. To date, data parallelism remains the most widely used method, but the associated requirement to replicate all the weights on all computing resources causes memory issues at the level of each node and collective communication issues at the level of the platform.

In general, the overall shape of the dependency graphs associated with the feed forward training phase has characteristics (long dependencies) that generate large memory needs and data exchanges. However, there are multiple opportunities to address these problems by combining 55 re-computations 66, 54, 61, 77, 72, 5, offloading 57, compression and different parallelism strategies (image, filter, kernel, model parallelism 58, 81, 56, 74, 36). It is also promising to consider other more radical techniques to go beyond feed forward training, such as the use of multigrid reduction in time (MGRIT) 68, 69, which comes from the field of numerical simulation and which we already address in other contexts.

Within this general framework, the minimization of the carbon footprint is obviously a major concern that must guide strategies. Tools that train complex and deep networks on otherwise obsolete hardware using memory saving techniques are already a strong contribution in this direction, since they increase the lifetime of computing resources, and our goal is to extend these techniques in terms of efficiency and scope, even though they consume slightly more energy for the computations themselves. As in the case of linear algebra, energy optimization also requires the use of heterogeneous computation resources (CPUs, GPUs, TPUs, FPGAs). However, this heterogeneity hinders scalability because task durations are difficult to predict, and it makes the use of dynamic runtime schedulers necessary. Finally, the use of these dynamic runtimes also raises the question of what needs to be decided statically and what dynamically in terms of resource allocation and scheduling.

5 Social and environmental responsibility

5.1 Footprint‌ of research activities

As part of our research‌ activities, we use local computing resources such as‌ PlaFRIM and the national computing resources of IDRIS‌ and the TGCC.

The environmental impact of using these platforms is significant, whether for numerical simulation or training applications. However, because the team produces simulation and training tools but does not directly perform simulations and training, its own direct impact is relatively limited. For example, in the case of training, we have so far concentrated on techniques that do not modify the architecture of the networks or the computations that are performed, so that the number of epochs and the final accuracy are not impacted. In this way, it is possible to validate our developments to accelerate training on a single batch (at full machine scale) and then to extrapolate the acceleration to the scale of the whole training. Similarly, the techniques developed in linear algebra in the team often do not depend (typically for dense approaches) on the numerical properties of the matrices, so that acceleration (for a given problem size) can be validated without heavy experimental campaigns, beyond what is necessary to obtain valid experimental results in complex environments where performance varies from one experiment to another.

The use of simulation as opposed to direct experimentation is also a tool that enables us to limit the impact of our research on power consumption, since simulation can save several orders of magnitude in power consumption compared with direct experimentation. It is therefore crucial to produce simulation tools that are as precise and generic as possible, and the team has been actively collaborating for many years in the development of simulation tools such as SimGRID.

Nevertheless,‌ the tools we produce‌ are used on a‌‌ large scale in terms of computation resources and‌ simulation/training time, and the‌ associated energy consumption issue‌‌ is therefore indirectly crucial. In this context, we‌ are developing original solutions‌ for reusing the heat‌‌ dissipated by computation resources, in particular as part‌ of the Inria-Qarnot Computing‌ Pulse challenge (see Section‌‌ 5.2). We have also added a research‌ axis aimed at minimizing‌ energy consumption for a‌‌ given kernel (Section 3.3.3).

TOPAL has also‌ signed the "Labos en‌ transitions" Charter of Commitment‌‌ for research facilities on the Bordeaux university site‌ whose preamble states that‌ "Faced with contemporary environmental‌‌ and societal challenges, and the urgent need for‌ systemic transformation to meet‌ them, the academic world‌‌ has a particular responsibility: to promote responsible research,‌ aware of environmental issues‌ and respectful of the‌‌ people who produce it, which contributes to transitions‌ and enables us to‌ understand and guide current‌‌ and future societal transformations". In exchange for this‌ commitment, the establishments undertake‌ to provide us with‌‌ an estimate of the impact of our research‌ activities (including the purchase‌ of equipment and missions).‌‌ At this stage, this information is difficult to‌ aggregate at team level,‌ but making it available‌‌ will enable us to measure our progress and‌ involvement.

5.2 Impact of‌ research results

5.2.1 Carbon‌‌ Impact of Cloud Platforms

To limit the environmental‌ impact of cloud computing,‌ Qarnot focuses on re-using‌‌ the heat produced by computations in heat circuits‌ or boilers. As part‌ of the Pulse Inria‌‌ challenge, we are working with Qarnot on algorithms‌ for placing computations on‌ their infrastructure, so as‌‌ to maximize the use of reusable heat sources,‌ depending on computation demand‌ and task characteristics. The‌‌ aim is to enable users of the Qarnot‌ platform to specify their‌ objective function on the‌‌ (carbon footprint, time, cost) axes, and to be‌ able to meet it.‌

Our activities with Hivenet‌‌, conducted within the framework of the Cupseli‌ challenge, complement this approach.‌ In the long term,‌‌ one of Cupseli’s objectives is to enable the‌ use of distributed computing‌ resources—typically owned by gaming‌‌ venues—to carry out inference and learning tasks. The‌ aim is therefore to‌ extend the lifespan and‌‌ usage of these computing resources by providing them‌ with practical utility and‌ added value.

5.2.2 Democratization‌‌ of Large Models Training

In the context of training, at one end of the spectrum we see the provision of computing resources, such as the Jean Zay supercomputer, whose efficient use requires large-scale parallel training algorithms and frameworks to optimize resource utilization and accelerate time to discovery. At the other end of the spectrum, we see the importance of enabling researchers from different communities to use the resources at their disposal (often just a few GPUs) to develop original models without being constrained by hardware limitations. In particular, recent transformer-based models are very heavy-weight, and techniques must be employed to run them on GPUs that are only a few years old, without compromising data quality, computational accuracy, or model size. The Topal team has been working for several years on memory-saving strategies to enable the training of large models on limited-capacity resources (re-materialization and offloading), and on software (Section 7) such as Rotor and Rockmate, which are recognized and visible in the AI applications community. More recently, the ELF software (Section 7.1.2) has been developed to optimize multi-node, multi-GPU training using various types of parallelism and memory-saving techniques. While remaining user-friendly, it supports the easy integration of custom strategies and has been validated at large scale during the NVIDIA–OpenACC IDRIS'25 hackathon through the training of large language models and diffusion models.

6 Highlights of‌ the year

Best Paper Award for “Scheduling‌ Strategies for Partially-Replicable Task Chains on Two Types‌ of Resources” 22 in Heterogeneity in Computing‌ Workshop (HCW) - Participant: Laércio Lima Pilla.

The‌ Inria/Hivenet Cupseli challenge is co-led by Olivier Beaumont‌ (Topal) and Alexandru Dobrila (Hivenet) and brings together‌ 11 Inria teams along with researchers from Hivenet,‌ representing a total of around thirty permanent staff‌ members. Over a four-year period, it plans for‌ the recruitment of nine PhD students, two postdoctoral‌ researchers, and three engineers. The project kickoff meeting‌ took place on September 25, 2025.

7 Latest‌ software developments, platforms, open data

7.1 Latest software‌ developments

7.1.1 Chameleon

Keywords:
Runtime system, Task-based algorithm,‌ Dense linear algebra, HPC, Task scheduling
Scientific Description:‌

Chameleon is part of the MORSE (Matrices Over‌ Runtime Systems @ Exascale) project. The overall objective‌ is to develop robust linear algebra libraries relying‌ on innovative runtime systems that can fully benefit‌ from the potential of those future large-scale complex‌ machines.

We expect advances in three directions, based first on strong and close interactions between the runtime and numerical linear algebra communities. This initial activity will then naturally expand to more focused but still joint research in both fields.

1.‌ Fine interaction between linear algebra and runtime systems.‌ On parallel machines, HPC applications need to take‌ care of data movement and consistency, which can‌ be either explicitly managed at the level of‌ the application itself or delegated to a runtime‌ system. We adopt the latter approach in order‌ to better keep up with hardware trends whose‌ complexity is growing exponentially. One major task in‌ this project is to define a proper interface‌ between HPC applications and runtime systems in order‌ to maximize productivity and expressivity. As mentioned in‌ the next section, a widely used approach consists in abstracting the application‌ as a DAG that‌ the runtime system is‌‌ in charge of scheduling. Scheduling such a DAG‌ over a set of‌ heterogeneous processing units introduces‌‌ a lot of new challenges, such as predicting‌ accurately the execution time‌ of each type of‌‌ task over each kind of unit, minimizing data‌ transfers between memory banks,‌ performing data prefetching, etc.‌‌ Expected advances: In a nutshell, a new runtime‌ system API will be‌ designed to allow applications‌‌ to provide scheduling hints to the runtime system‌ and to get real-time‌ feedback about the consequences‌‌ of scheduling decisions.

2. Runtime systems. A runtime environment is an intermediate layer between the system and the application. It provides low-level functionality not provided by the system (such as scheduling or management of the heterogeneity) and high-level features (such as performance portability). In the framework of this proposal, we will work on the scalability of runtime environments. Achieving scalability requires avoiding all centralization. Here, the main problem is the scheduling of the tasks. In many task-based runtime environments the scheduler is centralized and becomes a bottleneck as soon as too many cores are involved. It is therefore required to distribute the scheduling decision or to compute a data distribution that imposes the mapping of tasks using, for instance, the so-called “owner-compute” rule. Expected advances: We will design runtime systems that enable an efficient and scalable use of thousands of distributed multicore nodes enhanced with accelerators.

3. Linear algebra. Because of its central position in HPC and of the well understood structure of its algorithms, dense linear algebra has often pioneered new challenges that HPC had to face. Again, dense linear algebra has been in the vanguard of the new era of petascale computing with the design of new algorithms that can efficiently run on a multicore node with GPU accelerators. These algorithms are called “communication-avoiding” since they have been redesigned to limit the amount of communication between processing units (and between the different levels of memory hierarchy). They are expressed through Directed Acyclic Graphs (DAG) of fine-grained tasks that are dynamically scheduled. Expected advances: First, we plan to investigate the impact of these principles in the case of sparse applications (whose algorithms are slightly more complicated but often rely on dense kernels). Furthermore, both in the dense and sparse cases, the scalability on thousands of nodes is still limited, and new numerical approaches need to be found. We will specifically design sparse hybrid direct/iterative methods that represent a promising approach.
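
To make the “owner-compute” rule mentioned in item 2 concrete, here is a minimal sketch under stated assumptions: tiles are spread over a p × q process grid with a 2D block-cyclic distribution, and each update task runs on the process that owns the tile it writes. The function names and the task generator (trailing-matrix updates of a tiled factorization) are illustrative and are not taken from Chameleon or from any runtime system.

```python
from collections import Counter


def owner(i: int, j: int, p: int, q: int) -> int:
    """Rank owning tile (i, j) in a 2D block-cyclic distribution on a p x q grid."""
    return (i % p) * q + (j % q)


def update_tasks(nt: int):
    """Yield (i, j, k): at step k, the update task that writes tile (i, j)."""
    for k in range(nt):
        for i in range(k + 1, nt):
            for j in range(k + 1, nt):
                yield (i, j, k)


def map_tasks(nt: int, p: int, q: int) -> Counter:
    """Apply the owner-compute rule and count the tasks mapped to each rank."""
    load = Counter()
    for i, j, _k in update_tasks(nt):
        # The task runs where its output tile lives, so no scheduler
        # decision is needed: the data distribution fixes the mapping.
        load[owner(i, j, p, q)] += 1
    return load
```

With this rule the mapping is decided entirely by the distribution, which removes the centralized scheduling decision but can leave the load unbalanced: on a 4 × 4 tile matrix and a 2 × 2 grid, `map_tasks(4, 2, 2)` assigns 2, 3, 3 and 6 tasks to ranks 0 to 3.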

Overall end point. The‌ overall goal of the‌ MORSE associate team is‌‌ to enable advanced numerical algorithms to be executed‌ on a scalable unified‌ runtime system for exploiting‌‌ the full potential of future exascale machines.
Functional‌ Description:
Chameleon is a dense linear algebra software relying on sequential task-based algorithms where sub-tasks of the overall algorithms are submitted to a runtime system. A runtime system such as StarPU is able to automatically manage data transfers between non-shared memory areas (CPUs-GPUs, distributed nodes). This implementation paradigm makes it possible to design high-performance linear algebra algorithms on very different types of architecture: laptops, many-core nodes, CPUs-GPUs, multiple nodes. For example, Chameleon is able to perform a Cholesky factorization (double-precision) at 80 TFlop/s on a dense matrix of order 400 000 (i.e. 4 min 30 s).
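
The sequential task-based model described above can be illustrated with a short sketch. Tasks of a tile Cholesky factorization are submitted in plain sequential order, each declaring the tiles it reads and the tiles it updates in place, and the dependency graph is inferred from these data accesses. This is a simplified model written for illustration. It does not use the Chameleon or StarPU APIs, and the function names are assumptions.

```python
def cholesky_tasks(nt):
    """Tile Cholesky tasks in submission order, as (name, reads, writes)."""
    tasks = []
    for k in range(nt):
        tasks.append((f"POTRF({k})", [], [(k, k)]))
        for i in range(k + 1, nt):
            tasks.append((f"TRSM({i},{k})", [(k, k)], [(i, k)]))
        for i in range(k + 1, nt):
            tasks.append((f"SYRK({i},{k})", [(i, k)], [(i, i)]))
            for j in range(k + 1, i):
                tasks.append((f"GEMM({i},{j},{k})", [(i, k), (j, k)], [(i, j)]))
    return tasks


def infer_dependencies(tasks):
    """Derive DAG edges from data accesses, in submission order."""
    last_writer = {}   # tile -> last task that updated it
    readers = {}       # tile -> tasks that read it since its last update
    edges = set()
    for name, reads, writes in tasks:
        for tile in reads:
            if tile in last_writer:
                edges.add((last_writer[tile], name))   # read-after-write
            readers.setdefault(tile, []).append(name)
        for tile in writes:
            if tile in last_writer:
                edges.add((last_writer[tile], name))   # in-place update
            for reader in readers.get(tile, []):
                if reader != name:
                    edges.add((reader, name))          # write-after-read
            last_writer[tile] = name
            readers[tile] = []
    return edges
```

The application code stays sequential. Parallelism comes from the runtime executing any task whose predecessors in the inferred graph have completed.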
Release Contributions:‌

Chameleon includes the following features:

- BLAS 3, LAPACK one-sided and LAPACK norms tile algorithms
- Support QUARK and StarPU runtime systems and PaRSEC since 2018
- Exploitation of homogeneous and heterogeneous platforms through the use of BLAS/LAPACK CPU kernels and cuBLAS/MAGMA CUDA kernels
- Exploitation of clusters of interconnected nodes with distributed memory (using OpenMPI)
URL:
https://gitlab.inria.fr/solverstack/chameleon
Publications:
hal-04984070, hal-05137639, hal-05148627‌, hal-04867213, hal-01618526, hal-03609275, hal-03149953‌, hal-02513433, hal-04088833, hal-04498634, hal-01332774‌, hal-03763824, hal-01387575, hal-03789625, hal-03773486‌, hal-01585079, hal-04883872, inria-00547614
Contact:
Mathieu‌ Faverge
Participants:
Mathieu Faverge, Florent Pruvost, Emmanuel Agullo,‌ Samuel Thibault
Partners:
Innovative Computing Laboratory (ICL), King Abdullah University of Science and Technology, University of Colorado Denver

7.1.2 ELF

Name:
Efficient Deep Learning‌ Framework
Keywords:
Neural networks, Pytorch, Python, GPU, Deep‌ learning, Automatic parallelization
Functional Description:

ELF is a‌ deep learning framework designed for efficient and easy-to-launch‌ multi-GPU training. It enables users to input a‌ PyTorch model and train it on an HPC‌ cluster by automatically handling data, model and other‌ types of parallelization across multiple devices.

By optimizing‌ the training schedule, minimizing communication overhead, and maximizing‌ GPU utilization, ELF ensures highly optimized execution. Users‌ don’t need to manually implement parallelization—ELF does it‌ automatically while maintaining computational correctness throughout training iterations.‌
Publication:
hal-05151601v2
Contact:
Yulia Gusak

7.1.3 PaStiX

Name:‌
Parallel Sparse matriX package
Keywords:
Direct solvers, Parallel‌ numerical solvers, Linear Systems Solver
Scientific Description:
PaStiX is based on an efficient static scheduling and memory manager, in order to solve 3D problems with more than 50 million unknowns. The mapping and scheduling algorithm handles a combination of 1D and 2D block distributions. Dynamic scheduling can also be applied to take care of NUMA architectures while taking into account very precisely the computational costs of the BLAS 3 primitives, the communication costs and the cost of local aggregations.
Functional Description:

PaStiX is a scientific library‌ that provides a high performance parallel solver for‌ very large sparse linear systems based on block‌ direct and block ILU(k) methods. It can handle‌ low-rank compression techniques to reduce the computation and‌ the memory complexity. Numerical algorithms are implemented in‌ single or double precision (real or complex) for‌ LLt, LDLt and LU factorization with static pivoting‌ (for non symmetric matrices having a symmetric pattern).‌ The PaStiX library uses the graph partitioning and‌ sparse matrix block ordering packages Scotch or Metis.‌

The PaStiX solver is suitable for any heterogeneous parallel/distributed architecture when its‌ performance is predictable, such‌ as clusters of multicore‌‌ nodes with GPU accelerators or KNL processors. In‌ particular, we provide a‌ high-performance version with a‌‌ low memory overhead for multicore node architectures, which‌ fully exploits the advantage‌ of shared memory by‌‌ using a hybrid MPI-thread implementation.

The solver also‌ provides some low-rank compression‌ methods to reduce the‌‌ memory footprint and/or the time-to-solution.
URL:
https://gitlab.inria.fr/solverstack/pastix
Publications:‌
inria-00346017, inria-00346018,‌ hal-01485507, hal-01824275,‌‌ hal-03361299, hal-04527103
Contact:
Pierre Ramet
Participants:
Alycia‌ Lisito, Grégoire Pichon, Mathieu‌ Faverge, Pierre Ramet

7.1.4‌‌ pmtool

Keywords:
Scheduling, Task scheduling, StarPU, Heterogeneity, GPGPU,‌ Performance analysis
Functional Description:‌
Analyzes the behavior of StarPU applications post-mortem. Provides lower bounds on makespan. Studies the performance of different schedulers in a simple context. Provides implementations of many scheduling algorithms from the literature.
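
As an illustration of what a lower bound on makespan can look like, the sketch below computes two simple bounds for a task graph on CPUs and GPUs: a critical-path bound in which every task runs on its fastest resource type, and a work bound that divides the total minimum work by the number of resources. These are elementary bounds chosen for the example. They are not necessarily the ones implemented in pmtool, and the data layout and function names are assumptions.

```python
def critical_path(durations, preds):
    """Longest path length when each task takes its fastest duration.

    durations: task -> (cpu_time, gpu_time)
    preds:     task -> iterable of predecessor tasks
    """
    finish = {}

    def visit(task):
        if task not in finish:
            start = max((visit(p) for p in preds.get(task, ())), default=0.0)
            finish[task] = start + min(durations[task])
        return finish[task]

    return max(visit(task) for task in durations)


def work_bound(durations, n_resources):
    """Total work, at the fastest duration of each task, shared by all resources."""
    return sum(min(d) for d in durations.values()) / n_resources


def makespan_lower_bound(durations, preds, n_cpu, n_gpu):
    """No schedule can finish earlier than the larger of the two bounds."""
    return max(critical_path(durations, preds),
               work_bound(durations, n_cpu + n_gpu))
```

Comparing the makespan observed in a trace with such a bound gives a quick indication of how much room a scheduler leaves for improvement.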
URL:‌
https://gitlab.inria.fr/eyrauddu/pmtool
Publications:
hal-01386174,‌‌ hal-01878606
Contact:
Lionel Eyraud Dubois
Participant:
an anonymous‌ participant

7.1.5 StarPart

Keyword:‌
3-point-lighting technique
Functional Description:‌‌
StarPart is a flexible and extensible framework that integrates state-of-the-art methods for graph partitioning and sparse matrix ordering. More precisely, StarPart is a framework that offers a uniform API to manipulate graph, hypergraph and mesh structures. It is designed to be easily extensible by adding new methods and to plug all these methods into a comprehensive framework. It was initially designed to provide graph partitioning and sparse matrix ordering methods that come from state-of-the-art software such as Metis, Scotch, Patoh, Zoltan, etc. Besides, it provides some facilities for IO, diagnostic, benchmark, visualization (VTK, SVG, ...). StarPart is the core of the MetaPart project. It is built upon the LibGraph library.
URL:
https://gitlab.inria.fr/metapart/starpart‌‌
Contact:
Aurélien Esnard
Participant:
an anonymous participant

7.1.6‌ StarPU

Name:
The StarPU‌ Runtime System
Keywords:
Runtime‌‌ system, High performance computing
Scientific Description:

Traditional processors have reached architectural limits which heterogeneous multicore designs and hardware specialization (e.g. coprocessors, accelerators, ...) intend to address. However, exploiting such machines introduces numerous challenging issues at all levels, ranging from programming models and compilers to the design of scalable hardware solutions. The design of efficient runtime systems for these architectures is a critical issue. StarPU typically makes it much easier for high performance libraries or compiler environments to exploit heterogeneous multicore machines possibly equipped with GPGPUs or Cell processors: rather than handling low-level issues, programmers may concentrate on algorithmic concerns. Portability is obtained by means of a unified abstraction of the machine. StarPU offers a unified offloadable task abstraction named "codelet". Rather than rewriting the entire code, programmers can encapsulate existing functions within codelets. If a codelet may run on heterogeneous architectures, it is possible to specify one function for each architecture (e.g. one function for CUDA and one function for CPUs). StarPU takes care of scheduling and executing those codelets as efficiently as possible over the entire machine. In order to relieve programmers from the burden of explicit data transfers, a high-level data management library enforces memory coherency over the machine: before a codelet starts (e.g. on an accelerator), all its data are transparently made available on the compute resource. Given its expressive interface and portable scheduling policies, StarPU achieves portable performance by efficiently (and easily) using all computing resources at the same time. StarPU also takes advantage of the heterogeneous nature of a machine, for instance by using scheduling strategies based on auto-tuned performance models.

StarPU is‌ a task programming library for hybrid architectures.

The application provides algorithms and constraints:
- CPU/GPU implementations of tasks,
- A graph of tasks, using StarPU's rich C API.

StarPU handles run-time concerns:
- Task dependencies,
- Optimized heterogeneous scheduling,
- Optimized data transfers and replication between main memory and discrete memories,
- Optimized cluster communications.

Rather‌ than handling low-level scheduling and optimizing issues, programmers‌ can concentrate on algorithmic concerns!
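
The idea of scheduling tasks that have several implementations, guided by performance models, can be sketched as follows. Each task carries a predicted duration per kind of processing unit, and a greedy scheduler places it on the worker where it is expected to finish earliest. This is a deliberately simplified model written in Python for illustration. It does not use the StarPU C API, it ignores data transfer costs (which a real runtime would have to account for), and all names are assumptions.

```python
def schedule(tasks, preds, workers):
    """Greedy earliest-finish-time placement.

    tasks:   list of (name, {kind: predicted_duration}) in submission order;
             a kind missing from the dict means "no implementation for it"
    preds:   task name -> iterable of predecessor names
    workers: worker name -> kind of processing unit (e.g. "cpu", "cuda")
    Returns (placement, makespan).
    """
    free_at = {w: 0.0 for w in workers}   # when each worker becomes idle
    finish = {}                           # completion time of each task
    placement = {}
    for name, model in tasks:
        ready = max((finish[p] for p in preds.get(name, ())), default=0.0)
        best = None
        for worker, kind in workers.items():
            if kind not in model:
                continue
            end = max(ready, free_at[worker]) + model[kind]
            if best is None or end < best[0]:
                best = (end, worker)
        if best is None:
            raise ValueError(f"no worker can run task {name!r}")
        end, worker = best
        free_at[worker] = end
        finish[name] = end
        placement[name] = worker
    return placement, max(finish.values())
```

In this model, a task that only has a CPU implementation stays on a CPU worker, while tasks with both implementations go wherever the predicted finish time is lowest. The quality of the decisions depends directly on the accuracy of the predicted durations.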
Functional Description:
StarPU is a runtime system that offers support for heterogeneous multicore machines. While many efforts are devoted to designing efficient computation kernels for those architectures (e.g. to implement BLAS kernels on GPUs), StarPU not only takes care of offloading such kernels (and implementing data coherency across the machine), but it also makes sure the kernels are executed as efficiently as possible.
URL:
https://starpu.gitlabpages.inria.fr/
Publications:
tel-04213186,‌ inria-00326917, inria-00378705, inria-00384363, inria-00411581,‌ inria-00421333, inria-00467677, inria-00523937, inria-00547614,‌ inria-00547616, inria-00547847, inria-00550877, inria-00590670,‌ inria-00606195, inria-00606200, inria-00619654, hal-00643257,‌ hal-00648480, hal-00654193, hal-00661320, hal-00697020,‌ hal-00714858, hal-00725477, hal-00772742, hal-00773114,‌ hal-00773571, hal-00773610, hal-00776610, tel-00777154,‌ hal-00803304, hal-00807033, hal-00824514, hal-00851122,‌ hal-00853423, hal-00858350, hal-00911856, hal-00920915,‌ hal-00925017, hal-00926144, tel-00948309, hal-00966862,‌ hal-00978364, hal-00978602, hal-00987094, hal-00992208,‌ hal-01005765, hal-01011633, hal-01081974, hal-01101045,‌ hal-01101054, hal-01120507, hal-01147997, tel-01162975,‌ hal-01180272, hal-01181135, hal-01182746, hal-01223573,‌ tel-01230876, hal-01283949, hal-01284004, hal-01284136,‌ hal-01284235, hal-01316982, hal-01332774, hal-01353962,‌ hal-01355385, hal-01361992, hal-01372022, hal-01386174,‌ hal-01387482, hal-01409965, hal-01410103, hal-01473475,‌ hal-01474556, tel-01483666, hal-01502749, hal-01507613,‌ hal-01517153, tel-01538516, hal-01616632, hal-01618526,‌ hal-01718280, tel-01816341, hal-01842038, tel-01959127,‌ hal-02120736, hal-02275363, hal-02296118, hal-02403109,‌ hal-02421327, hal-02872765, hal-02914793, hal-02933803,‌ hal-02943753, hal-02970529, hal-02985721, hal-03144290,‌ hal-03273509, hal-03290998, hal-03298021, hal-03318644,‌ hal-03348787, hal-03552243, hal-03609275, hal-03623220, hal-03773486, hal-03773985,‌ hal-03789625, hal-03936659,‌ tel-03989856, hal-04005071,‌‌ hal-04088833, hal-04115280, hal-04146714, hal-04236246,‌ tel-04260094, tel-04316145,‌ hal-04548787, hal-04646530,‌‌ hal-04668550, hal-04690154, hal-05147860, hal-05199066,‌ hal-05226796
Contact:
Nathalie Furmento‌
Participants:
Olivier Aumage, Nathalie‌‌ Furmento, Samuel Thibault, 38 anonymous participants

7.1.7 rockmate‌

Name:
rockmate
Keywords:
Deep‌ learning, Optimization, Python, Pytorch,‌‌ GPU, Automatic differentiation
Scientific Description:

We propose Rockmate to control the memory requirements when training PyTorch DNN models. Rockmate is an automatic tool that starts from the model code and generates an equivalent model, using a predefined amount of memory for activations, at the cost of a few re-computations. Rockmate automatically detects the structure of computational and data dependencies and rewrites the initial model as a sequence of complex blocks. We show that such a structure is widespread and can be found in many models in the literature (Transformer based models, ResNet, RegNets, ...). This structure allows us to solve the problem in a fast and efficient way, using an adaptation of Checkmate (too slow on the whole model but general) at the level of individual blocks and an adaptation of Rotor (fast but limited to sequential models) at the level of the sequence itself. We show through experiments on many models that Rockmate is as fast as Rotor and as efficient as Checkmate, and that in many cases it achieves significantly lower memory consumption for activations (by a factor of 2 to 5) for a rather negligible overhead (of the order of 10% to 20%). Rockmate is open source and available at https://github.com/topal-team/rockmate.

Complete paper:‌‌ https://openreview.net/pdf?id=wLAMOoL0KD
Functional Description:

Given a PyTorch model, a‌ sample input, and a‌ GPU memory budget, Rockmate‌‌ builds a new torch.nn.Module, which performs forward and‌ backward pass while keeping‌ the memory of activations‌‌ under the given budget.

The new model produces‌ the same outputs and‌ gradients as the original‌‌ one. Training the model with a lower memory‌ than PyTorch Autodiff is‌ achieved by re-computing some‌‌ of the activations instead of storing them for‌ gradient calculation. Based on‌ the budget, Rockmate determines‌‌ automatically which activations should be recomputed.
URL:
https://github.com/topal-team/rockmate‌
Contact:
Lionel Eyraud Dubois‌
Participants:
Lionel Eyraud Dubois,‌‌ Yulia Gusak, Olivier Beaumont, Xunyi Zhao

7.1.8 rotor‌

Name:
Re-materializing Optimally with‌ pyTORch
Keywords:
Deep learning,‌‌ Optimization, Python, GPU, Automatic differentiation
Scientific Description:

This software implements in PyTorch a new activation checkpointing method which significantly decreases memory usage when training Deep Neural Networks with the back-propagation algorithm. Similarly to checkpointing techniques coming from the literature on Automatic Differentiation, it consists in dynamically selecting the forward activations that are saved during the training phase, and then automatically recomputing missing activations from those previously recorded. We propose an original computation model that combines two types of activation savings: either only storing the layer inputs, or recording the complete history of operations that produced the outputs (this uses more memory, but requires fewer recomputations in the backward phase), and we provide in https://hal.inria.fr/hal-02352969 an algorithm to compute the optimal computation sequence for this model.

Our PyTorch implementation processes the entire chain, dealing with any sequential DNN whose internal layers may be arbitrarily complex and automatically executing it according to the optimal checkpointing strategy computed given a memory limit. In https://hal.inria.fr/hal-02352969, through extensive experiments, we show that our implementation consistently outperforms existing checkpointing approaches for a large class of networks, image sizes and batch sizes.
Functional‌ Description:
Allows training very large convolutional networks on limited memory by optimally selecting which activations should be kept and which should be recomputed. This code is meant to replace the checkpoint.py utility available in PyTorch by providing more efficient rematerialization strategies. The algorithm is also easier to tune: the only required parameter is the available memory, instead of the number of segments.
URL:
https://gitlab.inria.fr/hiepacs/rotor‌
Publication:
hal-02352969
Contact:
Lionel Eyraud Dubois
Participant:
5‌ anonymous participants
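
To give an idea of how an optimal checkpointing sequence can be computed from a memory limit, here is a minimal sketch of the classical dynamic program for a homogeneous chain (every forward step costs 1, every stored state has the same size). This is a simplification chosen for illustration, not rotor's algorithm, which handles heterogeneous layers and two kinds of savings; the function name and the cost model are assumptions.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def min_forward_steps(length, slots):
    """Minimum number of forward-step executions needed to backpropagate
    through a chain of `length` steps when at most `slots` states can be
    stored at once (the chain input occupies one of these slots).
    Backpropagating step i requires the state that enters step i."""
    if slots < 1:
        raise ValueError("at least one slot is needed to hold the chain input")
    if length <= 1:
        return 0
    if slots == 1:
        # Only the input is kept: recompute from the start before each backward step.
        return length * (length - 1) // 2
    # Advance j steps, store that state, reverse the right part with one slot
    # less, then release the stored state and reverse the left part.
    return min(
        j + min_forward_steps(length - j, slots - 1) + min_forward_steps(j, slots)
        for j in range(1, length)
    )
```

With enough slots the result is `length - 1` (each intermediate state is computed once); the difference between the returned value and `length - 1` is the re-computation overhead implied by the memory limit.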

7.1.9 VITE

Name:
Visual Trace Explorer‌
Keywords:
Visualization, Execution trace
Functional Description:
ViTE is a trace explorer: a tool made to visualize execution traces of large parallel programs. It supports Pajé, a trace format created by Inria Grenoble, as well as the OTF and OTF2 formats developed by the University of Dresden, and gives programmers a simpler way to analyse, debug and/or profile large parallel applications.
URL:
https://solverstack.gitlabpages.inria.fr/vite/
Publications:
hal-00707236‌, hal-04725983
Contact:
Mathieu Faverge
Participants:
Mathieu Faverge,‌ Philippe Swartvagher

8 New results

As explained in‌ Section 3.4, our contributions can be read‌ at the intersection of the research domains described‌ in Section 4 and research axes described in‌ Section 3.3 as shown in the following table:‌

	Axis 3.3.1 – Runtime	Axis 3.3.2 – Compression	Axis 3.3.3 – Energy	Axis 3.3.4 – Comm. & Fault Tol.
Domain 4.1 – Lin. Alg., Tensors	Topic 3.4.1	Topic 3.4.2	Topic 3.4.3	Topic 3.4.7
Domain 4.2 – Training	Topic 3.4.4	Topic 3.4.5	Topic 3.4.6	Topic 3.4.8

8.1 Scalable and‌ portable LU factorization with partial pivoting on top‌ of runtime systems (Topic 3.4.1)

Participants: Alycia‌ Lisito, Mathieu Faverge, Pierre Ramet.‌

Task-based runtime systems have demonstrated efficiency in leveraging the capabilities of large, heterogeneous architectures. Many linear algebra algorithms and applications have been implemented on top of runtime systems to increase their performance. However, the High Performance Linpack (HPL) benchmark, used by the TOP500 to rank supercomputers, has not yet been successfully implemented using task-based runtime systems. In this paper, we explore solutions to implement efficient LU factorization with partial pivoting using the sequential task-flow programming model. We show that, due to the pivoting strategy, this algorithm generates a large number of very small tasks, which usually overload the runtime system and make it inefficient. We propose two solutions to improve the efficiency and reduce the number of tasks. First, we apply well-known blocking strategies in the context of task-based algorithms. Second, we explore batching techniques to reduce the number of tasks submitted to the runtime system. Moreover, in distributed architectures, partial pivoting generates many reductions on the critical path throughout the factorization, which need to be carefully handled to reach high performance. Two task-based reduction algorithms are proposed to express these operations and improve the runtime reactivity on the critical path. These proposals have been implemented in the dense linear algebra library CHAMELEON on top of the STARPU runtime system. Experiments conducted on our cluster with these optimizations show that our LU with partial pivoting asymptotically reaches the performance of the non-pivoting algorithm.

This work has‌‌ been presented at IPDPS Conference, June 2025, Milan,‌ Italy 20.
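
For readers unfamiliar with the operation, the following minimal sketch shows a textbook, unblocked, sequential LU factorization with partial pivoting in plain Python. It is not the CHAMELEON implementation; it only makes visible the per-column pivot search (a max-reduction over the column) and the many small row updates that the abstract refers to. The function name and the list-of-lists storage are assumptions.

```python
def lu_partial_pivoting(a):
    """Return (perm, lu) such that row perm[i] of `a` equals row i of L @ U,
    where L (unit lower triangular) and U (upper triangular) are packed in `lu`."""
    n = len(a)
    lu = [row[:] for row in a]
    perm = list(range(n))
    for k in range(n):
        # Pivot search: a max-reduction over the remaining part of column k.
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        if lu[p][k] == 0.0:
            raise ValueError("matrix is singular")
        if p != k:
            lu[k], lu[p] = lu[p], lu[k]
            perm[k], perm[p] = perm[p], perm[k]
        # Elimination: each row update is a small, independent piece of work.
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]
            for j in range(k + 1, n):
                lu[i][j] -= lu[i][k] * lu[k][j]
    return perm, lu
```

In a distributed setting the pivot search of each column spans several nodes, which is why it appears as a reduction on the critical path.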

8.2‌ Batching the tasks of‌‌ the LU factorization with partial pivoting on top‌ of runtime systems (Topic‌ 3.4.1)

Participants: Alycia‌‌ Lisito, Mathieu Faverge, Florent Pruvost,‌ Pierre Ramet.

Task-based runtime systems have demonstrated efficiency in leveraging the capabilities of large heterogeneous architectures. Many linear algebra algorithms and applications have been implemented on top of runtime systems to increase their performance. However, the LU factorization with partial pivoting has not yet been successfully implemented using task-based runtime systems. This operation is used to solve large dense linear systems in numerical simulations, such as the Maxwell equations in electromagnetism. This factorization is a major part of the High Performance Linpack (HPL) benchmark used in the TOP500 to evaluate and rank supercomputers. We explore solutions to implement efficient LU factorization with partial pivoting using the sequential task-flow programming model. These solutions have been implemented in the dense linear algebra library Chameleon on top of the StarPU runtime system. We showed that, due to the pivoting strategy, this algorithm generates a large number of very small tasks, which usually overloads the runtime system and makes it inefficient. With a naive task batching strategy, we improved the efficiency and reduced the number of tasks. We propose solutions to adapt the batch size to the granularity of the tasks. To do so, we first distinguish two types of tasks and set an adapted batch size for each. Then, we introduce a heuristic based on the number of operations per task to adapt the batch size to the computational complexity of the tasks during the factorization. Experiments conducted on our cluster with these optimizations show that our LU factorization with partial pivoting asymptotically reaches about 96% of the performance of the non-pivoting algorithm. Thanks to the adaptive batch size mechanism, the performance peak is reached even faster.

This work has‌‌ been presented at COMPAS Conference, June 2025, Bordeaux,‌ France 27.
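
The idea of adapting the batch size to the number of operations per task can be sketched as follows. This is a hypothetical heuristic written for illustration, not the one implemented in Chameleon: the target amount of work per batch, the cap on the batch size, and both function names are assumptions.

```python
import math

def adaptive_batch_size(flops_per_task, target_flops_per_batch, max_batch):
    """Choose how many tasks to group so that one batch carries roughly
    `target_flops_per_batch` operations, within [1, max_batch]."""
    if flops_per_task <= 0:
        raise ValueError("flops_per_task must be positive")
    size = math.ceil(target_flops_per_batch / flops_per_task)
    return max(1, min(max_batch, size))

def make_batches(task_ids, batch_size):
    """Group consecutive task identifiers into batches of at most `batch_size`."""
    return [task_ids[i:i + batch_size] for i in range(0, len(task_ids), batch_size)]
```

Under this sketch, very small tasks are grouped up to the cap, while tasks that are already large enough are submitted one by one, so the number of tasks seen by the runtime shrinks only where the granularity is too fine.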

8.3‌ Toward an algebraic multigrid‌‌ method for the indefinite Helmholtz equation (Topic 3.4.2‌)

Participants: Clement Richefort‌, Pierre Ramet.‌‌

It is well known that multigrid methods are very competitive in solving a wide range of symmetric positive definite (SPD) problems. However, achieving such performance for non-SPD matrices remains an open problem. In particular, three main issues may arise when solving a Helmholtz problem. First, some eigenvalues may be negative or even complex, requiring the choice of an adapted smoother to capture them. Second, because the near-kernel space is oscillatory, the geometric smoothness assumption cannot be used to build efficient interpolation rules. Third, the coarse correction is not equivalent to a projection method since the indefinite matrix does not define a norm. We present some investigations about designing a method that converges in a constant number of iterations with respect to the wavenumber. The method builds on an ideal reduction-based framework and related theory for SPD matrices to improve an initial least squares minimization coarse selection operator formed from a set of smoothed random vectors. A new coarse correction is proposed to minimize the residual in an appropriate norm for indefinite problems. We also present numerical results at the end of the paper.

This paper has been published in SIAM SISC‌ 11.

8.4 Hierarchical partitioning for the numerical‌ simulation of complex 3D objects (Topic 3.4.2)‌

Participants: Dimitri Walther, Mathieu Faverge, Pierre‌ Ramet.

The Boundary Element Method (BEM) offers numerous advantages for simulating complex physical phenomena. By placing the unknowns (or degrees of freedom) on the interfaces between different media, it becomes possible to model problems with distant boundary conditions (such as fluid flow around an object, acoustic or electromagnetic wave diffraction, radiative heat transfer, etc.). However, this approach results in a fully coupled system with a dense matrix. When this dense matrix can be decomposed into low-rank sub-blocks, it is possible to construct a hierarchical matrix (H-matrix) that approximates the original system to a desired level of accuracy. In favorable scenarios, this approximation reduces spatial complexity from $O(n^{2})$ to $O(n \log n)$ by compressing the matrix sub-blocks. This work investigates the relationship between the partitioning of degrees of freedom and the compression rate of the H-matrix. A new hierarchical partitioning technique, specifically designed to optimize H-matrix compression, is introduced. Unlike existing algorithms based on geometric information (such as Median cut, Cobblestone, or Space-filling curves), this new method relies on the construction of a connectivity graph of the degrees of freedom. This graph is built in quasi-linear time ($O(n \log n)$) from the mesh of the studied object and partitioned in log-quadratic time ($O(n^{2} \log n)$) using a multi-level partitioning approach. An additional constraint is imposed to balance the partition loads, facilitating optimization on task-based execution environments. Numerical experiments are conducted on a variety of test cases from electromagnetic simulations.

This work has been presented at COMPAS‌ Conference, June 2025, Bordeaux, France 46. It‌ was awarded the prize for best poster.
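
The storage reduction from $O(n^{2})$ to $O(n \log n)$ can be checked on the simplest hierarchical layout. The sketch below counts stored entries for a HODLR-style matrix, an assumption chosen for illustration: the matrix is split in two recursively, both off-diagonal blocks are stored with a fixed rank, and small diagonal blocks are kept dense. Actual H-matrices depend on the partitioning of the degrees of freedom studied in this work, which the sketch does not model.

```python
def hodlr_storage(n, rank, leaf_size):
    """Number of stored entries for an n x n matrix in a HODLR-style layout:
    dense diagonal leaves, rank-`rank` factors for every off-diagonal block."""
    if n <= leaf_size:
        return n * n
    half = n // 2
    # A (half x (n - half)) block of rank k is stored as two factors holding
    # k * half and k * (n - half) entries; there are two such blocks per level.
    off_diagonal = 2 * rank * (half + (n - half))
    return (
        hodlr_storage(half, rank, leaf_size)
        + hodlr_storage(n - half, rank, leaf_size)
        + off_diagonal
    )
```

For `n = 1024`, `rank = 8` and `leaf_size = 64`, this layout stores 131072 entries instead of the 1048576 entries of the dense matrix.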

8.5‌ Optimal scheduling algorithms for software-defined radio pipelined and replicated task chains on‌ multicore architecture (Axis 3.3.1‌)

Participants: Laércio Lima‌‌ Pilla.

Software-Defined Radio (SDR) represents a move from dedicated hardware to software implementations of digital communication standards. This approach offers flexibility, shorter time to market, maintainability, and lower costs, but it requires an optimized distribution of tasks in order to meet performance requirements. Thus, we studied the problem of scheduling SDR linear task chains of stateless and stateful tasks for streaming processing. We modeled this problem as a pipelined workflow scheduling problem based on pipelined and replicated parallelism on homogeneous resources. We proposed an optimal dynamic programming solution and an optimal greedy algorithm named OTAC for maximizing throughput while also minimizing resource utilization. Moreover, the optimality of the proposed scheduling algorithm was proved. We evaluated our solutions and compared their execution times and schedules to other algorithms using synthetic task chains and an implementation of the DVB-S2 communication standard on the AFF3CT SDR Domain Specific Language. Our results demonstrated how OTAC quickly finds optimal schedules, leading consistently to better results than other algorithms, or equivalent results with fewer resources.

This paper has‌ been published in the‌‌ Journal of Parallel and Distributed Computing in October‌ 2025 13.
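
To make the objective concrete, here is a minimal sketch of how the period (the inverse of the throughput) of a given pipelined and replicated schedule can be evaluated. It is not OTAC: it only scores a schedule that is already given. The model is an assumption stated here for illustration: a stage made only of stateless tasks can be replicated, which divides its period by the number of cores, while a stage containing a stateful task runs on a single core.

```python
def stage_period(weights, stateful, cores):
    """Period of one pipeline stage given its task weights, a stateful flag
    per task, and the number of cores assigned to the stage."""
    work = sum(weights)
    if any(stateful):
        # A stateful stage cannot be replicated: extra cores do not help.
        return work
    return work / cores

def schedule_period(stages):
    """Period of the whole pipeline: the slowest stage sets the pace.
    `stages` is a list of (weights, stateful_flags, cores) tuples."""
    return max(stage_period(w, s, c) for (w, s, c) in stages)
```

For example, three stages `([2.0, 1.0], [False, False], 3)`, `([4.0], [True], 1)` and `([6.0], [False], 2)` have periods 1.0, 4.0 and 3.0, so the stateful stage bounds the throughput and adding cores to the other stages would only waste resources.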

8.6‌ Task-Based HPC in the‌‌ Cloud: Price-Performance Analysis of N-Body Simulations with StarPU‌ (Axis 3.3.1)

Participants:‌ Laércio Lima Pilla.‌‌

Public cloud environments present significant challenges for traditional High Performance Computing (HPC) applications due to infrastructure limitations that differ substantially from dedicated HPC systems. Unlike traditional HPC clusters optimized for tightly coupled parallel workloads, cloud platforms were designed primarily for web services and data processing applications. Key obstacles include high-latency networks, hardware virtualization overhead, and limited availability of specialized accelerators, all of which can severely impact the performance of compute-intensive applications such as physics simulations. This study investigated the feasibility of running HPC workloads on public cloud infrastructure using standard and cost-effective instance configurations rather than expensive specialized “HPC” offerings. We deployed heterogeneous clusters on Amazon Web Services using the HPC@Cloud Toolkit, incorporating various instance types, including GPU-accelerated nodes with different computational capabilities. Our evaluation focused on N-body simulations implemented using a task-based parallel programming model, leveraging the StarPU runtime system to dynamically schedule computational tasks across various processing units. Our experimental results demonstrated three key findings: (1) smaller GPU-equipped instances (g6.2xlarge) achieve performance comparable to larger instances while costing approximately one-sixth the price, challenging conventional scaling assumptions for cloud-based HPC; (2) strategic GPU utilization yields up to $8.2\times$ performance improvements over CPU-only configurations while reducing total execution costs by $24.4\times$; and (3) while task-based programming models effectively address network limitations through dynamic scheduling, complex tree-based algorithms like TBFMM face significant optimization challenges in cloud environments due to load balancing issues and expensive parameter tuning requirements. These findings provide practical guidance for researchers and practitioners seeking cost-effective cloud HPC deployments, demonstrating that commodity cloud infrastructures can be viable for regular computational workloads but require careful algorithmic-resource matching for optimal efficiency.

This work has been published‌ in IEEE International Conference on Cloud Engineering, September‌ 2025, Rennes, France 25.

8.7 Task-Based HPC‌ in the Cloud: Price-Performance Analysis of N-Body Simulations‌ with StarPU (Topic 3.4.2)

Participants: Laércio Lima‌ Pilla.

Tensor-train (TT) decomposition has garnered tremendous popularity for its efficiency in handling high-dimensional data arising in scientific and quantum computing as well as machine learning applications. It provides a compact representation for matrices and vectors with a Kronecker product-like low-rank structure and enables efficient matrix-vector operations in this compressed form. The vector scalar product is among such key operations, comprising a series of tensor contractions in a specific tensor network topology whose order significantly impacts the computational cost. In this work, we proposed efficient algorithms for finding near-optimal contraction orderings for tensor networks representing scalar products in the TT format. We showed that our algorithms outperform all existing contraction ordering methods for general tensor networks: the best existing method incurs up to 15% higher cost for $x^{T} y$, twice the cost for $x^{T} A y$, and ten times higher cost for $x^{T} A B y$ scalar products, where $x, y$ and $A, B$ are vectors and matrices expressed in the TT format, respectively.

This work has‌ been published in the European Conference on Parallel‌ Processing, August 2025, Dresden, Germany 24.
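
A small cost model shows why the contraction order matters even for the simplest case $x^{T} y$. The sketch below sweeps over the TT cores from left to right and compares two ways of performing each step: contracting the running matrix with both cores at once, or with one core after the other. It counts inner-loop iterations only and is an illustrative assumption, not one of the ordering algorithms proposed in this work; the function name and argument layout are hypothetical.

```python
def sweep_costs(ranks_x, ranks_y, modes):
    """Compare two contraction orders for x^T y in TT format.
    `modes[k]` is the size of mode k; `ranks_x` and `ranks_y` hold the TT ranks
    (len(modes) + 1 entries each, boundary ranks equal to 1).
    Returns (fused_cost, two_step_cost) in inner-loop iterations."""
    fused = 0
    two_step = 0
    for k, n in enumerate(modes):
        rx, rx_next = ranks_x[k], ranks_x[k + 1]
        ry, ry_next = ranks_y[k], ranks_y[k + 1]
        # Fused: M'[a', b'] = sum over (a, b, i) of M[a, b] X[a, i, a'] Y[b, i, b'].
        fused += rx * ry * n * rx_next * ry_next
        # Two-step: T[b, i, a'] = sum_a M[a, b] X[a, i, a'], then
        #           M'[a', b'] = sum over (b, i) of T[b, i, a'] Y[b, i, b'].
        two_step += rx * ry * n * rx_next + ry * n * rx_next * ry_next
    return fused, two_step
```

With four cores of mode size 10 and internal ranks 5, the fused order needs 13000 iterations against 5600 for the two-step order, and the gap widens as the ranks grow.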

8.8‌ MetaCS-FL: A Metaheuristic-Based Framework for Client Selection in‌ Federated Learning Systems (Topic 3.4.6)

Participants: Alan‌ Lira Nunes, Laércio Lima Pilla.

Federated Learning (FL) enables the collaborative training of distributed machine learning models, with each participant (client) using their own local private data. In Cross-Device FL systems, clients usually include unreliable and heterogeneous mobile and edge devices with highly imbalanced and small local datasets. Given these characteristics, the selection of clients to participate in the training plays an essential role in the efficacy of these systems, as a poor selection can lead to long execution times, high energy consumption, and low accuracy. In this work, we proposed MetaCS-FL, a client selection framework built to support different metaheuristics, initial solution methods, and user-defined triggers for new client selections. It also employs client profiling and historical and current performance data to produce more efficient selections of clients and of the volume of data they should use for training locally. We evaluated our framework in an extensive series of experiments, including comparisons with state-of-the-art algorithms, revealing the effectiveness of our approach. With FedAvg as the baseline for comparisons, MetaCS-FL reduced total time (resp. energy consumption) by up to 64.83% (resp. 56.79%) for CIFAR-10, and by up to 67.59% (resp. 60.87%) for Fashion-MNIST while reaching the target testing accuracy.

This report has been published in HAL‌ in July 2025 40, and its paper is currently under evaluation.‌

8.9 Approximation Algorithms for‌ Scheduling With/Without Deadline Constraints‌‌ Where Rejection Costs are Proportional to Processing Times‌ (Axis 3.3.3)

Participants:‌ Olivier Beaumont, Lionel‌‌ Eyraud-Dubois, Laércio Lima Pilla.

We studied two offline job scheduling problems where tasks can be processed on a limited number of energy-efficient edge machines or offloaded to an unlimited supply of energy-inefficient cloud machines (called rejected). The objective was to minimize total energy consumption. First, we considered scheduling without deadlines, formulating it as a scheduling problem with rejection, where rejection costs are proportional to processing times. We proposed a novel $\frac{5}{4}(1 + \epsilon)$-approximation algorithm, BEKP, by relating it to a Multiple Subset Sum problem, improving upon the existing $\left(\frac{3}{2} - \frac{1}{2m}\right)$-approximation for arbitrary rejection costs. Next, we addressed scheduling with deadlines, aiming to minimize the weighted number of rejected jobs. We positioned this problem within the literature and introduced a new $\left(1 - \frac{(m - 1)^{m}}{m^{m}}\right)$-approximation algorithm, MDP, inspired by an interval selection algorithm with a $\left(1 - \frac{m^{m}}{(m + 1)^{m}}\right)$-approximation for arbitrary rejection costs. Experimental results demonstrate that BEKP and MDP obtain better results (lower costs or higher profits) than other state-of-the-art algorithms while maintaining a competitive or better time complexity.

This work was‌ developed in the context‌‌ of the Challenge PULSE, and the paper has‌ been published in IEEE‌ Transactions on Parallel and‌‌ Distributed Systems in December 2025 9.
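
The two ratios given for the deadline-constrained problem can be compared numerically. The sketch below simply evaluates both expressions as functions of the number of machines $m$; the function names are hypothetical, and reading a larger ratio as a stronger guarantee assumes the usual convention for ratios below 1 (a guaranteed fraction of the optimal profit).

```python
import math

def ratio_mdp(m):
    """Ratio stated for MDP: 1 - ((m - 1) / m) ** m."""
    return 1.0 - ((m - 1) / m) ** m

def ratio_interval_selection(m):
    """Ratio stated for the interval selection algorithm: 1 - (m / (m + 1)) ** m."""
    return 1.0 - (m / (m + 1)) ** m

# Both expressions tend to this value as m grows, from opposite sides.
LIMIT = 1.0 - math.exp(-1.0)
```

For $m = 2$ the expressions evaluate to $0.75$ and $5/9 \approx 0.556$; as $m$ grows, the first decreases and the second increases towards $1 - 1/e \approx 0.632$.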

8.10‌ Energy-Aware Scheduling Strategies for‌ Partially-Replicable Task Chains on‌‌ Heterogeneous Processors (Axis 3.3.3)

Participants: Laércio Lima‌ Pilla.

The arrival of heterogeneous (or hybrid) multicore architectures has brought new performance trade-offs for applications, and efficiency opportunities to systems. They have also increased the challenges related to thread scheduling, as tasks' execution times vary depending on whether they are placed on big (performance) cores or little (efficient) ones. In this work, we focused on the challenges heterogeneous multicore processors bring to partially-replicable task chains, such as the ones that implement digital communication standards in Software-Defined Radio (SDR). Our objective was to maximize the throughput of these task chains while also minimizing their power consumption. We modeled this problem as a pipelined workflow scheduling problem using pipelined and replicated parallelism on two types of resources, whose objectives were to minimize the period and to use as many little cores as necessary. We proposed two greedy heuristics (FERTAC and 2CATAC) and one optimal dynamic programming solution (HeRAD) to the problem. We evaluated our solutions and compared the quality of their schedules (in period and resource utilization) and their execution times using synthetic task chains. We also studied an open source implementation of the DVB-S2 communication standard based on the StreamPU runtime. Leading processor vendors were covered with ARM, Apple, AMD, and Intel platforms. Both the achieved throughput and the energy consumption were evaluated. Our results demonstrated the benefits and drawbacks of the different proposed solutions.

This work‌ has been published in Heterogeneity in Computing Workshop,‌ June 2025, Milan, Italy 22, and its‌ extended version 39 is currently under evaluation.

8.11‌ HiRemate: Hierarchical Approach for Efficient Re-materialization of Large‌ Neural Networks (Domain 4.2)

Participants: Olivier Beaumont‌, Lionel Eyraud Dubois, Yulia Gusak.‌

Training modern neural networks poses a significant memory‌ challenge, as storing intermediate results during the forward‌ and backward passes requires considerable memory resources. To‌ address this issue without affecting model accuracy, re-materialization‌ techniques have been introduced to recompute selected intermediate‌ results instead of storing them, thus fulfilling the‌ memory size constraint. The main algorithmic problem is‌ to compute a re-materialization schedule that minimizes the‌ computational overhead within a given memory budget. Our‌ proposed HiRemate framework is based on a new‌ hierarchical approach that provides generality and quality: we‌ can handle any class of network graphs and‌ satisfy the memory constraint with a low computational‌ overhead during training. The framework exhibits low algorithmic‌ complexity, making it possible to scale up and‌ handle very large models. The framework automatically builds‌ a dataflow graph from a PyTorch model, decomposes‌ the graph hierarchically, and then builds an nn.Module‌ that executes forward and backward passes within the‌ given memory budget.

This work has been published‌ in the Forty-Second International Conference on Machine Learning‌ (ICML 2025), July 2025, Vancouver, Canada 19.‌

8.12 Fault-tolerant numerical iterative algorithms at scale (Topic‌ 3.4.7)

Participants: Thomas Herault.

This year, we developed a coherent set of models and strategies to make large-scale iterative computations (with a strong emphasis on linear algebra kernels and solvers) both more resilient to errors and more efficient in their use of communication and storage. A first contribution revisits protection against silent data corruptions (SDCs), where errors may remain undetected for several iterations. Instead of relying on costly full replication, we studied the use of partial detectors whose detection latency is bounded, and we derived optimal execution schemes (segment lengths, and how many in-memory checkpoints must be kept) that guarantee correctness while reducing overhead. The analysis and Monte-Carlo results show that, across a broad range of parameters, partial detection can significantly outperform replication, sometimes yielding substantial speedups for realistic error rates 14.

8.13 Partial Detectors Versus Replication To Cope‌ With Silent Errors (Axis 3.3.4)

Participants: Thomas‌ Herault.

We proposed a holistic fault-tolerance methodology for numerical iterative algorithms that jointly addresses the three dominant error sources at scale: fail-stop failures, computation silent errors, and memory bit flips. The key idea is a hierarchical periodic pattern that interleaves mechanisms at different frequencies: (i) frequent computation verifications (“chunks”), (ii) less frequent memory verifications combined with in-memory checkpoints (“segments”), and (iii) even less frequent global checkpoints to tolerate fail-stop failures (“patterns”). We provide an analytical framework to derive the optimal pattern minimizing the expected time per iteration. We instantiated and evaluated this approach on Preconditioned Conjugate Gradient, illustrating scenarios where the optimal pattern can dramatically reduce resilience overheads compared to more naïve strategies 16, 38.
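
The three nested frequencies can be pictured by listing the operations of one pattern. The sketch below is only an illustration of that nesting: the operation names are hypothetical, and the numbers of chunks and segments are free parameters here, whereas the analytical framework of this work derives their optimal values.

```python
def build_pattern(segments, chunks_per_segment):
    """List the operations of one hierarchical pattern: every chunk ends with a
    computation verification, every segment with a memory verification and an
    in-memory checkpoint, and the pattern with a global checkpoint."""
    ops = []
    for _ in range(segments):
        for _ in range(chunks_per_segment):
            ops.append("compute_chunk")
            ops.append("verify_computation")
        ops.append("verify_memory")
        ops.append("memory_checkpoint")
    ops.append("global_checkpoint")
    return ops
```

For instance, `build_pattern(2, 3)` contains six computation verifications, two in-memory checkpoints and a single global checkpoint, reflecting the decreasing frequency of the more expensive mechanisms.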

8.14 Fixed-Work vs. Fixed-Time‌ Checkpointing on Large-Scale Failure-Prone‌ Platforms (Axis 3.3.4)‌‌

Participants: Thomas Herault.

We addressed a very practical systems constraint that directly impacts large-scale linear algebra runs: the prevalence of fixed-length reservations on HPC systems. We studied checkpointing not only in the classical “fixed-work” setting, but also in the dual fixed-time setting, where the goal is to maximize the expected progress achieved within a reservation. We show that fixed-time checkpointing is surprisingly harder than fixed-work checkpointing, propose dynamic threshold-based heuristics that perform well for short/medium reservations, and derive a (discretized) optimal dynamic-programming strategy, including extensions to stochastic checkpoint durations. These results provide actionable guidance for running iterative solvers robustly under real scheduling constraints 8.
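
As background for the classical fixed-work setting, the well-known Young/Daly first-order approximation gives the checkpointing period that minimizes the expected overhead. The sketch below implements that textbook formula only; it is not the fixed-time strategy studied in this work, and the function names and the first-order waste model (checkpoint cost per period plus half a period lost per failure on average) are assumptions of the sketch.

```python
import math

def young_daly_period(mtbf, checkpoint_cost):
    """First-order optimal time between checkpoints: sqrt(2 * MTBF * C)."""
    return math.sqrt(2.0 * mtbf * checkpoint_cost)

def first_order_waste(period, mtbf, checkpoint_cost):
    """Approximate fraction of time lost: one checkpoint per period, plus on
    average half a period of re-execution for each failure."""
    return checkpoint_cost / period + period / (2.0 * mtbf)
```

With a mean time between failures of one day (86400 s) and a 60 s checkpoint, the period is about 3220 s; halving or doubling it increases the first-order waste.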

8.15 PaRSEC:‌ Scalability, flexibility, and hybrid‌‌ architecture support for task-based applications in ECP (Axis‌ 3.3.1)

Participants: Thomas‌ Herault.

The work‌‌ conducted during the Exascale Computing Project (ECP) provided‌ several key lessons on‌ the role and design‌‌ of task-based runtime systems for future large-scale platforms.‌ In particular, ECP confirmed‌ that data movement, rather‌‌ than raw computation, is the dominant performance limiter‌ on heterogeneous and accelerated‌ systems, making it essential‌‌ for the runtime to manage communication, data placement,‌ and the overlap of‌ computation and transfers. The‌‌ diversity of ECP target architectures also showed that‌ performance portability cannot be‌ achieved through static programming‌‌ models, but instead requires runtimes that dynamically adapt‌ scheduling, task granularity, and‌ resource usage based on‌‌ runtime information. In addition, the coexistence of legacy‌ MPI-based components with task-based‌ execution emphasized the importance‌‌ of interoperability, while the scale and duration of‌ ECP runs highlighted the‌ need to treat resilience‌‌ as a first-class runtime concern, naturally enabled by‌ dataflow-based execution models. These‌ lessons directly inform the‌‌ objectives of the NumPEx project, the French counterpart‌ to ECP, in which‌ the TOPAL team is‌‌ actively involved. By building on the experience gained‌ in ECP, our participation‌ in NumPEx aims to‌‌ transfer and extend these concepts to the French‌ exascale ecosystem, contributing runtime-level‌ solutions for communication efficiency,‌‌ adaptability, and fault tolerance on next-generation European supercomputing‌ platforms 10.

8.16‌ Tensor Contractions on Top‌‌ of Runtime Systems: Application to the Coupled-Cluster Method‌ (Topic 3.4.1)

Participants:‌ Thomas Herault.

This‌‌ year, we investigated how the benefits of task-based‌ and distributed runtime systems,‌ well established for dense‌‌ linear algebra, can be extended to tensor computations,‌ which play a central‌ role in modern high-performance‌‌ computing. Our work focused on tensor contractions arising‌ in computational quantum chemistry,‌ in particular in coupled-cluster‌‌ methods, where tensors have a small number of‌ dimensions but very large‌ sizes. We extended the‌‌ Chameleon dense linear algebra‌ library to support tensor contractions by expressing them‌ as sequences of optimized matrix operations, combined with‌ flexible tensor permutations. To address the challenges of‌ data layout and memory footprint, we identified a‌ set of elementary and composable tensor transformations and‌ implemented them on top of the StarPU runtime‌ system. We validated this approach on the computation‌ of coupled-cluster residuals with density fitting, demonstrating both‌ its efficiency and its scalability on modern heterogeneous‌ platforms 28.

8.17 Scalable Block-Sparse Matrix Multiplication‌ Using Template Task Graphs (Topic 3.4.1)

Participants:‌ Thomas Herault.

This year, we advanced the use of task-based runtime systems for sparse linear algebra by addressing communication scalability issues in distributed block-sparse matrix multiplication. Building on the Template Task Graph (TTG) programming model, we introduced application-defined scheduling constraints that allow the runtime to control task eligibility without resorting to ad-hoc control flow. Applied to block-sparse matrix multiplication, these constraints make it possible to throttle and structure communication, limiting the number of concurrent broadcasts and reducing network contention while preserving overlap between communication and computation. Experimental results demonstrate that this approach significantly improves scalability on large problem sizes and highlight the importance of exposing high-level execution constraints to the runtime. This work reinforces the role of expressive task-based runtimes as a key enabler for scalable and communication-efficient linear algebra on modern distributed systems 23.

8.18 Comparing and Contrasting User and Runtime‌ Directed Data Placement Strategies for Owner-Compute, Multi-accelerator Distributed‌ Task Based Scheduling (Topic 3.4.1)

Participants: Thomas‌ Herault.

This work explores data placement strategies‌ in task-based runtime systems for linear algebra applications‌ on multi-accelerator, distributed platforms. Using the PaRSEC runtime,‌ we compared runtime-directed heuristics and user-directed placement strategies‌ in the context of owner-compute scheduling, focusing on‌ dense matrix multiplication and Cholesky factorization. The results‌ show that while automated strategies can significantly improve‌ locality and outperform naïve approaches, they remain consistently‌ outperformed by carefully designed user-directed placements, particularly at‌ scale. The study also highlights the limitations of‌ relying on unified virtual memory and demonstrates the‌ importance of explicitly managing data received from the‌ network, especially on modern systems where network interfaces‌ are directly attached to accelerators. Overall, this work‌ emphasizes that runtime systems must expose flexible mechanisms‌ for expressing data placement policies, allowing expert users‌ to guide execution while preserving a clear separation‌ between algorithmic expression and performance tuning 17.‌

8.19 Optimizing Parallel Heterogeneous System Efficiency: Dynamic Task‌ Graph Adaptation with Recursive Tasks (Topic 3.4.1)‌

Participants: Abdou Guermouche.

Task-based programming models are currently a widespread way to leverage heterogeneous parallel systems productively (OpenACC, Kokkos, Legion, OmpSs, PaRSEC, StarPU, XKaapi, ...). Among these models, the Sequential Task Flow (STF) model is widely embraced (PaRSEC's DTD, OmpSs, StarPU) since it allows task graphs to be expressed naturally through a sequential-looking submission of tasks, with task dependencies inferred automatically. However, STF is limited to task graphs with task sizes that are fixed at submission, posing a challenge in determining the optimal task granularity. Notably, in heterogeneous systems, the optimal task size varies across different processing units, so a single task size would not fit all units. StarPU's recursive tasks allow graphs with several task granularities by turning some tasks into sub-graphs dynamically at runtime. The decision to transform these tasks into sub-graphs is made by a StarPU component called the Splitter. Once this decision is made, classical scheduling approaches are used, making this component generic and orthogonal to the scheduler. In this paper, we propose a new policy for the Splitter, designed for heterogeneous platforms, that relies on linear programming to minimize execution time and maximize resource utilization. This results in a dynamic, well-balanced set comprising both small tasks to fill multiple CPU cores and large tasks for efficient execution on accelerators such as GPUs. We then present an experimental evaluation showing that just-in-time adaptations of the task graph lead to improved performance across various dense linear algebra algorithms 12.

8.20‌ Improving energy efficiency of‌ HPC applications using unbalanced‌‌ GPU power capping (Topic 3.4.3)

Participants: Abdou‌ Guermouche, Hayfa Tayeb‌.

Energy efficiency represents a significant challenge in the domain of High-Performance Computing (HPC). One potential key parameter to improve energy efficiency is the use of power capping, a technique for controlling the power limits of a device, such as a CPU or GPU. In this paper, we propose to examine the impact of GPU power capping in the context of HPC applications using heterogeneous computing systems. To this end, we first conduct an extensive study of the impact of GPU power capping on a compute-intensive kernel, namely the matrix multiplication kernel (GEMM), on different Nvidia GPU architectures. Interestingly, such compute-intensive kernels are up to 30 % more energy efficient when the GPU is set to 55-70 % of its Thermal Design Power (TDP). Using the best power capping configuration provided by this study, we investigate how setting different power caps for GPU devices of a heterogeneous computing node can improve the energy efficiency of the running application. We consider dense linear algebra task-based operations, namely matrix multiplication and Cholesky factorization. We show how the underlying runtime system scheduler can then automatically adapt its decisions to take advantage of the heterogeneous performance capability of each GPU. The results show that for a given platform equipped with four GPU devices, applying a power cap on all GPUs improves the energy efficiency for matrix multiplication up to 24.3 % (resp. 33.78 %) for double (resp. single) precision 26.
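As an illustration of how a power-cap study of this kind can be post-processed, the following minimal sketch computes the energy efficiency (throughput per watt) of each cap setting and selects the best one. The `Measurement` type, the function names and the numeric values are all hypothetical placeholders introduced for this example; they are not measurements from the study.

```python
from typing import NamedTuple


class Measurement(NamedTuple):
    cap_fraction: float  # power cap as a fraction of TDP (1.0 means uncapped)
    gflops: float        # sustained GEMM throughput in GFlop/s
    watts: float         # average power draw in W


def efficiency(m: Measurement) -> float:
    """Energy efficiency in GFlop/s per watt."""
    return m.gflops / m.watts


def best_power_cap(measurements):
    """Return the most efficient setting and its relative gain over the uncapped run."""
    baseline = next(m for m in measurements if m.cap_fraction == 1.0)
    best = max(measurements, key=efficiency)
    gain = efficiency(best) / efficiency(baseline) - 1.0
    return best, gain


# Placeholder values chosen only to exercise the code; NOT measured data.
samples = [
    Measurement(1.00, 100.0, 100.0),
    Measurement(0.80, 96.0, 80.0),
    Measurement(0.60, 84.0, 60.0),
    Measurement(0.40, 48.0, 40.0),
]
best, gain = best_power_cap(samples)
print(f"best cap: {best.cap_fraction:.0%} of TDP, efficiency gain: {gain:.1%}")
```

With real measurements, the same selection could be run per GPU model, since the most efficient cap may differ from one architecture to another.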

8.21‌ Sparse Matrix Ordering for‌ Fine Grain Parallel Triangular‌‌ Solve Using SIMD (Topic 3.4.1)

Participants: Abdou‌ Guermouche.

The evolution of processor hardware increasingly supports fine grain parallelism through SIMD (Single Instruction, Multiple Data) vector instruction sets and hardware threading. For instance, the new ARM SVE instruction set allows hardware implementations with SIMD vectors of up to 32 double-precision values per hardware thread. In this work, we focus on vectorization of the triangular solves required in BiCGStab preconditioned with ILU(0), a combination that is particularly effective numerically for IFPEN applications. In our context, parallelism can be exposed by changing the sparse structure of the matrices through a reordering of the unknowns, which can be recast in terms of graph ordering and coloring. We use a graph coloring method named ColorRCM to exhibit fine grain parallelism to feed the SIMD computing units while improving the convergence of the Krylov solver compared to classical greedy graph coloring methods. We first evaluate the performance of SIMD-SpTRSV using the permutation provided by ColorRCM and achieve a speedup between 1.7 and 6 with AVX2 compared to Intel MKL 21.4. Then we examine the impact of ColorRCM ordering on ILU(0)-BiCGStab performance on 201 matrices, including matrices from the SuiteSparse Matrix Collection (formerly the University of Florida Sparse Matrix Collection) and from the IFPEN porous media flow simulations. The solver configuration using the ColorRCM ordering and vectorized with AVX2 instructions showed the best convergence times in two thirds of the tests 21.
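To make the link between coloring and parallel triangular solves concrete, here is a minimal sketch of the classical greedy coloring used as a baseline above (it is not ColorRCM, whose details are not reproduced here). Unknowns sharing a color are not coupled to each other, so once unknowns are reordered color by color, all rows of one color can be processed together, for instance in SIMD lanes. The function names and the small chain example are assumptions made for illustration.

```python
def greedy_coloring(adjacency):
    """Give each vertex the smallest color not used by its already-colored neighbors."""
    colors = {}
    for v in sorted(adjacency):
        used = {colors[u] for u in adjacency[v] if u in colors}
        c = 0
        while c in used:
            c += 1
        colors[v] = c
    return colors


def color_classes(colors):
    """Group vertices by color; each group holds mutually independent unknowns."""
    classes = {}
    for v, c in colors.items():
        classes.setdefault(c, []).append(v)
    return [classes[c] for c in sorted(classes)]


# Hypothetical example: a chain of 6 unknowns (tridiagonal sparsity pattern),
# where unknown i is coupled with i-1 and i+1.
n = 6
adjacency = {i: {j for j in (i - 1, i + 1) if 0 <= j < n} for i in range(n)}
colors = greedy_coloring(adjacency)
classes = color_classes(colors)
print(classes)  # [[0, 2, 4], [1, 3, 5]]
```

On this chain, the result is the familiar red-black ordering: two sweeps of three independent rows each, instead of six strictly sequential rows.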

8.22 Mind Bubbles and Memory: Bounds on Scheduling‌ Pipeline Parallelism with Rematerialization (Domain 4.2)

Participants:‌ Adrien Aguila–Multner, Yulia Gusak, Olivier Beaumont‌, Lionel Eyraud Dubois.

Training large neural‌ networks, especially Transformer-based Large Language Models (LLMs), requires‌ massive high-performance computing (HPC) resources. Within each microbatch,‌ computations follow a strictly sequential flow through a‌ stack of transformer blocks: a forward pass to‌ compute the loss, and a backward pass to‌ propagate gradients. This sequential structure limits intrinsic parallelism.‌ To improve performance, several complementary strategies have been‌ developed: data, tensor, sequence, and pipeline parallelism, typically‌ combined to achieve scalability over tens of thousands‌ of GPUs.

In 35, we present a‌ formal analysis of pipeline parallelism (PP) for large-scale‌ training. In PP, the model is partitioned into‌ multiple stages, and microbatches are injected into the‌ pipeline to overlap computation. The main challenge is‌ to minimize idle periods (pipeline bubbles) while managing‌ memory usage, since each GPU must store intermediate‌ activations from multiple in-flight microbatches. Existing scheduling algorithms‌ such as GPIPE, 1F1B, HANAYO, and MEGATRON reduce‌ idle time but lack formal lower bounds or‌ explicit modeling of memory constraints.

We develop a unified analytical approach for PP scheduling, deriving lower bounds on completion time for both single-wave and multi-wave regimes. Our analysis explicitly incorporates a memory constraint $K$, denoting the number of activations that can be stored per GPU. Exact results are provided for two extreme cases, minimal memory ($K = 1$) and large memory ($K \geq m$), while general lower bounds are established for intermediate configurations. Our analysis highlights the intrinsic coupling between pipeline utilization and memory footprint, providing a foundation for evaluating and comparing pipeline scheduling algorithms under realistic memory constraints.
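To illustrate what a pipeline bubble is, the following minimal sketch simulates a GPipe-style schedule (all forward passes, then all backward passes) and measures the idle fraction per stage. It is only a textbook illustration under simplifying assumptions (uniform stage times, no communication cost), and it does not reproduce the bounds derived in 35. The names `p` (number of stages), `m` (number of microbatches, assumed here to match the $m$ used above), `f` and `b` (forward and backward times) are chosen for this example.

```python
def gpipe_makespan(p, m, f=1.0, b=2.0):
    """Completion time of an all-forward-then-all-backward schedule on p stages."""
    fwd_end = [[0.0] * m for _ in range(p)]
    bwd_end = [[0.0] * m for _ in range(p)]
    for s in range(p):
        for j in range(m):
            ready = fwd_end[s - 1][j] if s > 0 else 0.0   # input from previous stage
            free = fwd_end[s][j - 1] if j > 0 else 0.0    # stage done with previous microbatch
            fwd_end[s][j] = max(ready, free) + f
    for s in reversed(range(p)):
        for j in range(m):
            ready = bwd_end[s + 1][j] if s < p - 1 else 0.0  # gradient from next stage
            # The first backward on a stage waits for all forwards on that stage.
            free = bwd_end[s][j - 1] if j > 0 else fwd_end[s][m - 1]
            bwd_end[s][j] = max(ready, free) + b
    return max(bwd_end[s][m - 1] for s in range(p))


def bubble_fraction(p, m, f=1.0, b=2.0):
    """Fraction of the makespan during which a stage is idle."""
    return 1.0 - m * (f + b) / gpipe_makespan(p, m, f, b)


makespan = gpipe_makespan(p=4, m=8)
bubble = bubble_fraction(p=4, m=8)
print(makespan, round(bubble, 4))
```

With uniform times, the simulated makespan equals $(m + p - 1)(f + b)$ and the idle fraction equals $(p - 1)/(m + p - 1)$. In this schedule a stage holds the activations of all $m$ microbatches before its first backward pass, which is why tighter memory budgets call for other schedules or for rematerialization.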

8.23‌ Optimized Forward-Backward Rematerialization for‌‌ Memory-Efficient Pipeline Parallel Training (Topic 3.4.4)

Participants:‌ Adrien Aguila–Multner, Olivier‌ Beaumont, Lionel Eyraud‌‌ Dubois, Yulia Gusak.

Pipeline parallelism is‌ a key technique for‌ scaling deep network training‌‌ across multiple devices. Recent works have significantly reduced‌ pipeline idle time by‌ improving scheduling efficiency. Decoupling‌‌ the computation of gradients with respect to weights‌ and activations led to‌ the development of schedules‌‌ with almost no idle time. However, these methods‌ still require substantial memory,‌ limiting their applicability on‌‌ resource-constrained hardware.

In 36, our first contribution‌ is to introduce recomputation‌ to the backward pass,‌‌ extending rematerialization beyond the forward pass. This enables‌ executing schedules with decoupled‌ gradient computations under much‌‌ tighter memory constraints. Our second contribution is a‌ unified optimization approach that,‌ given a model and‌‌ hardware memory constraints, formulates and solves an Integer‌ Linear Programming (ILP) problem‌ to determine the optimal‌‌ per-microbatch, per-GPU rematerialization strategy for a given schedule,‌ applicable to both one-wave‌ and multi-wave pipeline schedules.‌‌ Our third contribution shows that, as device memory‌ constraints vary, the relative‌ advantages of different pipeline‌‌ schedules also change in the presence of rematerialization.‌ We provide corresponding insights‌ and a PyTorch framework‌‌ that enables finding and executing the optimal combination‌ of pipeline scheduling and‌ rematerialization strategies. Experiments demonstrate‌‌ the effectiveness of all three contributions, showing that‌ our approach enables efficient‌ training of larger models‌‌ under tight memory budgets, adapts optimally to varying‌ memory capacities, and reduces‌ recomputation overhead compared to‌‌ existing recomputation solutions.

8.24 Leveraging Expert Usage to‌ Speed up LLM Inference‌ with Expert Parallelism (Topic‌‌ 3.4.4)

Participants: Olivier Beaumont, Raphael Bourgouin‌.

Large language models have become indispensable for many text-processing applications. Their inference 15, i.e. their use to generate text, is a time-consuming task since tokens have to be generated one after the other, even if the computational load has been reduced by model sparsification, e.g. by using Mixture of Experts (MoE) models. In the MoE context, a subset of experts is selected at each stage. Note that not all subsets of experts (pairs of experts in most cases) in a given layer have the same probability of being selected. When experts are mapped to different GPUs, there is a risk of load imbalance if the selected experts end up on a small number of GPUs. This paper proposes to leverage this heterogeneity in expert usage to map experts of popular subsets onto distinct GPUs, allowing them to be processed in parallel and thus reducing the time needed for inference. Even though this mapping problem is NP-complete, it is possible to design simple greedy strategies that significantly reduce the need for sequential expert processing. Our proof-of-concept confirms that our mapping strategies effectively reduce inference time on the Mixtral model.
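A greedy mapping of this kind can be sketched as follows. This is one possible heuristic written for illustration, not necessarily the strategy evaluated in the paper: experts are placed in decreasing order of usage, each on the GPU where it is least often co-selected with the experts already there. The function names, the balanced-capacity assumption and the pair counts are hypothetical.

```python
from itertools import combinations


def pair_count(pair_freq, a, b):
    """Number of times experts a and b were selected together."""
    return pair_freq.get((min(a, b), max(a, b)), 0)


def greedy_expert_mapping(pair_freq, n_experts, n_gpus):
    """Place experts one by one on the GPU that adds the least co-location cost."""
    capacity = -(-n_experts // n_gpus)  # ceiling division: balanced placement
    usage = [sum(pair_count(pair_freq, e, o) for o in range(n_experts) if o != e)
             for e in range(n_experts)]
    gpus = [[] for _ in range(n_gpus)]
    for e in sorted(range(n_experts), key=lambda x: -usage[x]):
        candidates = [g for g in range(n_gpus) if len(gpus[g]) < capacity]
        best = min(candidates,
                   key=lambda g: (sum(pair_count(pair_freq, e, o) for o in gpus[g]),
                                  len(gpus[g]), g))
        gpus[best].append(e)
    return gpus


def colocated_count(gpus, pair_freq):
    """Total count of selected pairs whose two experts share a GPU."""
    return sum(pair_count(pair_freq, a, b)
               for g in gpus for a, b in combinations(g, 2))


# Hypothetical co-selection counts for 4 experts (placeholder values).
pair_freq = {(0, 1): 50, (2, 3): 30, (0, 2): 10, (1, 3): 10}
mapping = greedy_expert_mapping(pair_freq, n_experts=4, n_gpus=2)
naive = [[0, 1], [2, 3]]
print(mapping, colocated_count(mapping, pair_freq), colocated_count(naive, pair_freq))
```

On this toy input, the greedy mapping separates every co-selected pair, whereas the naive contiguous mapping places the two most popular pairs on the same GPU, so their experts would have to be processed one after the other.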

8.25‌ Pallas: a generic‌ trace format for large‌‌ HPC trace analysis (Axis‌ 3.3.1)

Participant: Philippe Swartvagher.

Identifying performance bottlenecks in a parallel application is tedious, especially because it requires analyzing the behaviour of various software components, as bottlenecks may have several causes and symptoms. For example, a load imbalance may cause long MPI waiting times, or contention on disk may degrade the performance of I/O operations. Detecting a performance problem means investigating the execution of an application and applying several performance analysis techniques. To do so, one can use a tracing tool to collect information describing the behaviour of the application. At the end of the execution, a trace file in a specific format is available to the application user, which can be used to conduct a complete post-mortem investigation. Several challenges emerge from the generation and use of traces. Tracing applications may alter the performance of the application, and can create thousands of heavy trace files, especially at a large scale. Most importantly, the post-mortem analysis needs to load these thousands of trace files in memory, and process them. This quickly becomes impractical for large scale applications, as memory gets exhausted and the number of opened files exceeds the system capacity. In this paper, we propose Pallas 18, a generic trace format tailored for conducting various post-mortem performance analyses of traces describing large executions of HPC applications. During the execution of the application, Pallas collects events and detects their repetitions on-the-fly. When storing the trace to disk, Pallas groups the data from similar events or groups of events together in order to later speed up trace reading. We demonstrate that the Pallas online detection of the program structure does not significantly degrade the performance of the applications. Moreover, the Pallas format allows faster trace analysis compared to other evaluated trace formats. Overall, the Pallas trace format allows an interactive analysis of a trace, which is required when a user investigates a performance problem.
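The idea of detecting repetitions in an event stream can be illustrated with a small sketch that folds consecutive repetitions of a short pattern into a single (pattern, count) entry. This is a simplified illustration written for this report and is not the algorithm or the on-disk format used by Pallas; the function name, the `max_len` bound and the event names are hypothetical.

```python
def fold_repetitions(events, max_len=4):
    """Fold consecutive repetitions of patterns of at most max_len events."""
    folded = []
    i = 0
    n = len(events)
    while i < n:
        best_len, best_count = 1, 1
        for length in range(1, max_len + 1):
            pattern = events[i:i + length]
            if len(pattern) < length:
                break  # not enough events left for a pattern of this length
            count = 1
            while events[i + count * length:i + (count + 1) * length] == pattern:
                count += 1
            # Keep the pattern that covers the largest number of events.
            if count > 1 and count * length > best_len * best_count:
                best_len, best_count = length, count
        folded.append((tuple(events[i:i + best_len]), best_count))
        i += best_len * best_count
    return folded


# Hypothetical event stream: one loop body executed three times.
events = ["init"] + ["send", "recv", "compute"] * 3 + ["finalize"]
folded = fold_repetitions(events)
for pattern, count in folded:
    print(count, "x", pattern)
```

Storing a loop body once together with its iteration count, rather than every occurrence, is one way a trace can stay compact while keeping the program structure visible to the analysis tool.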

9 Bilateral contracts and grants‌ with industry

9.1 Bilateral Grants with Industry

Participants:‌ Olivier Beaumont, Lionel Eyraud-Dubois, Mathieu Faverge‌, Abdou Guermouche, Yulia Gusak, Pierre‌ Ramet.

Some of the ongoing PhD theses are carried out within bilateral contracts with industry for PhD advisory:

Airbus (2022-). This collaboration concerns the parallelization and optimization of the Flusepa application, which models the separation of boosters for space launchers at Airbus Safran Launchers. Flusepa combines computational fluid mechanics, adaptive mesh refinement (AMR) algorithms and task-based parallelism based on the StarPU runtime system. We are involved in the supervision of the PhD of Alice Lasserre in this context.
CEA-Cesta for the PhD of‌ Abel Calluaud. A direct solver developed at CEA‌ relies on the approximation by hierarchical matrices to‌ reduce both computational and memory costs. Although these‌ developments have met a growing demand for increased‌ simulation accuracy, there are still open problems to‌ pursue these research efforts in an HPC context.‌ In this thesis, we propose to develop and compare several approaches to‌ adapt the granularity of‌ hierarchical tasks and extract‌‌ parallelism to exploit the multicore computational nodes associated‌ with massively parallel architectures‌ such as GPUs.
CEA-Cesta‌‌ for the PhD of Dimitri Walther. In the‌ context of numerical simulation‌ of electromagnetism, integral methods‌‌ are among the most widely used because of‌ their power. These methods‌ lead to the solution‌‌ of dense linear problems and are therefore very‌ expensive. For this reason,‌ hierarchical compression methods have‌‌ been developed that drastically reduce the cost associated‌ with these matrices. They‌ are based on a‌‌ hierarchical partitioning of the matrix, and therefore of‌ the mesh, and the‌ efficiency of the compression‌‌ depends on this partitioning. In this context, the‌ aim of the thesis‌ is to develop efficient‌‌ and scalable hierarchical partitioners to optimise the compression‌ of the matrix.
Eviden for the PhD of Alycia Lisito. For over three years, we have been collaborating with Eviden on the development of an HPL benchmark on top of runtime systems. This work is continued as part of Alycia Lisito's thesis funded by a CIFRE contract. To guarantee a high level of flexibility and portability, it is possible to use a task-based implementation on top of a runtime system. This programming model has already proved its effectiveness in the implementation of various parallel algorithms, in particular for dense linear algebra (LU decomposition, Cholesky decomposition, QR, etc.). In this thesis, we will use Inria's existing software stack, through the dense linear algebra library Chameleon and the StarPU runtime system. These reference libraries for runtime-based linear algebra will be studied to enable the scaling up of more complex algorithms such as HPL.
Eviden for the PhD of Jean Conan. Within the framework of High-Performance Computing (HPC) tenders, Atos Bull must provide contractual performance guarantees for future supercomputers. However, direct measurement is often impossible during the bidding phase, either because the hardware components (processors, accelerators) are not yet commercially available or because the scale of the proposed system exceeds the testing resources available internally. Performance prediction has become a critical tool, not only for meeting client requirements but also for upstream architecture sizing (such as network topology) and optimizing massively parallel software. The transition to exascale computing introduces unprecedented complexity, driven by the increasing heterogeneity of compute nodes and the intricate structure of high-speed networks. The objective of the thesis is thus to explore novel methodologies for performance prediction, with a primary focus on simulation techniques, to reduce optimization overhead by minimizing the number of large-scale physical runs required to determine optimal execution parameters, and finally to accurately model the impact of node heterogeneity and network architecture on overall system performance.
Diabolocom and Inria are now in the final stage of the contract negotiation to start a PhD thesis co-supervised by Yulia Gusak and Olivier Beaumont on Optimization of Multi-Stage Generative Model Pipelines for Cost-Efficient, Scalable Inference.

10‌‌ Partnerships and cooperations

10.1‌ International initiatives

10.1.1 Associate Teams in the framework‌ of an Inria International Lab or in the‌ framework of an Inria International Program

ELF Associate Team on Efficient deep Learning Frameworks.

Partners‌

TOPAL
California Institute of Technology (Caltech)

Nowadays, Deep Learning (DL) and Artificial Intelligence (AI) technologies are incorporated in more and more areas to solve various problems in video, audio, and natural language processing, content generation, etc. Frameworks based on neural networks, which are the core modules of deep learning models, have already been used successfully for action recognition, weather forecasting, robotic surgery and other inspiring applications [24, 44, 48]. The drawbacks of modern neural networks are that they usually require a significant amount of data and many GPU devices to be trained, which makes them expensive in terms of energy and money, and harmful in terms of air emissions [27]. The general question that the associate team addresses is: given an application and a computation platform, how can model training be performed efficiently in terms of time and energy?

10.2 International research visitors

10.2.1 Visits to‌ international teams

Research stays abroad

Olivier Beaumont visited‌ Loris Marchal and Pablo Piantanida for ten days‌ in July 2025 at ETS (École de technologie‌ supérieure) to work on inference optimization using speculative‌ decoding. This collaboration led to the submission of‌ an international cooperation project, which is currently under‌ evaluation.

10.3 European initiatives

10.3.1 H2020 projects

EUPEX‌

Participants: Olivier Beaumont.

EUPEX project on cordis.europa.eu‌

Title:
EUROPEAN PILOT FOR EXASCALE
Duration:
From January‌ 1, 2022 to December 31, 2025
Partners:
- INSTITUT‌ NATIONAL DE RECHERCHE EN INFORMATIQUE ET AUTOMATIQUE (INRIA),‌ France
- GRAND EQUIPEMENT NATIONAL DE CALCUL INTENSIF (GENCI),‌ France
- VSB - TECHNICAL UNIVERSITY OF OSTRAVA (VSB‌ - TU Ostrava), Czechia
- FORSCHUNGSZENTRUM JULICH GMBH (FZJ),‌ Germany
- COMMISSARIAT A L ENERGIE ATOMIQUE ET AUX‌ ENERGIES ALTERNATIVES (CEA), France
- IDRYMA TECHNOLOGIAS KAI EREVNAS (FOUNDATION FOR RESEARCH AND TECHNOLOGY HELLAS), Greece
- SVEUCILISTE U ZAGREBU FAKULTET ELEKTROTEHNIKE I RACUNARSTVA (UNIVERSITY OF ZAGREB FACULTY OF ELECTRICAL ENGINEERING AND COMPUTING), Croatia
- UNIVERSITA DEGLI‌ STUDI DI TORINO (UNITO), Italy
- CYBELETECH (Cybeletech), France‌
- UNIVERSITA DI PISA (UNIPI), Italy
- GRAN SASSO SCIENCE‌ INSTITUTE (GSSI), Italy
- ISTITUTO NAZIONALE DI ASTROFISICA (INAF),‌ Italy
- UNIVERSITA DEGLI STUDI DEL MOLISE, Italy
- E4 COMPUTER ENGINEERING SPA (E4), Italy
- UNIVERSITA DEGLI‌ STUDI DELL'AQUILA (UNIVAQ), Italy
- CONSIGLIO NAZIONALE DELLE RICERCHE‌ (CNR), Italy
- JOHANN WOLFGANG GOETHE-UNIVERSITAET FRANKFURT AM MAIN‌ (GUF), Germany
- EUROPEAN CENTRE FOR MEDIUM-RANGE WEATHER FORECASTS‌ (ECMWF), United Kingdom
- BULL SAS (BULL), France
- POLITECNICO‌ DI MILANO (POLIMI), Italy
- EXASCALE PERFORMANCE SYSTEMS -‌ EXAPSYS IKE, Greece
- ALMA MATER STUDIORUM - UNIVERSITA‌ DI BOLOGNA (UNIBO), Italy
- PARTEC AG (PARTEC), Germany‌
- ISTITUTO NAZIONALE DI GEOFISICA E VULCANOLOGIA, Italy
- CINECA‌ CONSORZIO INTERUNIVERSITARIO (CINECA), Italy
- SECO SPA (SECO SRL),‌ Italy
- CONSORZIO INTERUNIVERSITARIO NAZIONALE PER L'INFORMATICA (CINI), Italy‌
Inria contact:
Olivier Beaumont
Coordinator:
Jean-Robert Bacou (Eviden)‌
Summary:

The EUPEX consortium aims to design, build,‌ and validate the first EU platform for HPC, covering end-to-end the spectrum‌ of required technologies with‌ European assets: from the‌‌ architecture, processor, system software, development tools to the‌ applications. The EUPEX prototype‌ will be designed to‌‌ be open, scalable and flexible, including the modular‌ OpenSequana-compliant platform and the‌ corresponding HPC software ecosystem‌‌ for the Modular Supercomputing Architecture. Scientifically, EUPEX is‌ a vehicle to prepare‌ HPC, AI, and Big‌‌ Data processing communities for upcoming European Exascale systems‌ and technologies. The hardware‌ platform is sized to‌‌ be large enough for relevant application preparation and‌ scalability forecast, and a‌ proof of concept for‌‌ a modular architecture relying on European technologies in‌ general and on European‌ Processor Technology (EPI) in‌‌ particular. In this context, a strong emphasis is‌ put on the system‌ software stack and the‌‌ applications.

Being the first of its kind, EUPEX‌ sets the ambitious challenge‌ of gathering, distilling and‌‌ integrating European technologies that the scientific and industrial‌ partners use to build‌ a production-grade prototype. EUPEX‌‌ will lay the foundations for Europe's future digital‌ sovereignty. It has the‌ potential for the creation‌‌ of a sustainable European scientific and industrial HPC‌ ecosystem and should stimulate‌ science and technology more‌‌ than any national strategy (for numerical simulation, machine‌ learning and AI, Big‌ Data processing).

The EUPEX consortium – constituted of key actors on the European HPC scene – has the capacity and the will to provide a fundamental contribution to the consolidation of the European supercomputing ecosystem. EUPEX aims to directly support an emerging and vibrant European entrepreneurial ecosystem in AI and Big Data processing that will leverage HPC as a main enabling technology.

DARE

Participants: Olivier‌ Beaumont, Lionel Eyraud-Dubois‌‌, Mathieu Faverge, Pierre Ramet, Florent‌ Pruvost.

DARE

Title:‌
A new era for‌‌ supercomputing in Europe
Duration:
From March 1, 2025‌ to March 1, 2026‌
Partners (partial list):
- BARCELONA‌‌ SUPERCOMPUTING CENTER (BSC)
- CODASIP GMBH (CODA-DE)
- AXELERA AI‌ SRL (AXE-IT)
- OPENCHIP SOFTWARE‌ TECHNOLOGIES SL (OCT)
- INTERUNIVERSITAIR‌‌ MICRO-ELECTRONICA CENTRUM (IMEC)
- FORSCHUNGSZENTRUM JUELICH GMBH (JSC)
- CINECA‌ CONSORZIO INTERUNIVERSITARIO (CINECA)
- E4‌ COMPUTER ENGINEERING SPA (E4)‌‌
- CHALMERS TEKNISKA HOGSKOLA AB (CHALMERS)
- POLITECNICO DI MILANO‌ (POLIMI)
- UNIVERSIDAD COMPLUTENSE DE‌ MADRID (UCM)
- UNIVERSITAT POLITECNICA DE VALENCIA (UPV)
- INSTITUT NATIONAL DE RECHERCHE‌ EN INFORMATIQUE ET AUTOMATIQUE‌ (INRIA)
- THALES (TRT)
- TECHNISCHE‌‌ UNIVERSITAET MUENCHEN (TUM)
- BULL SAS (BULL)
Inria contact:‌
Olivier Sentieys
Coordinator:
Osman‌ Unsal (BSC)
Summary:

DARE‌‌ explores new paths toward greater European autonomy in‌ HPC and AI by‌ advancing open technologies and‌‌ fostering homegrown innovation. The project aims to reduce‌ strategic dependencies and strengthen‌ Europe’s ability to shape‌‌ its digital future.

DARE’s technologies will power future‌ European supercomputers, enabling breakthroughs‌ in science, industry, and‌‌ AI. By strengthening Europe’s HPC supply chain and‌ IP portfolio, DARE creates‌ long-term economic, technological, and‌‌ societal benefits across critical sectors.

DARE sets out‌ to lay the technological‌ foundations for European digital‌‌ autonomy in HPC and AI. By combining open‌ RISC-V architectures, chiplet technologies,‌ and a co-designed software‌‌ ecosystem, DARE aims to‌ deliver working prototypes, shape the EU HPC roadmap,‌ and boost Europe’s ability to build and sustain‌ its own supercomputing value chain.

10.4 National initiatives‌

10.4.1 Inria Challenge

Challenge Cupseli: Collaborative Unified Platform‌ for a Scalable and Efficient Learning Infrastructure

Duration:‌
2025 – 2029
Coordinator:
Olivier Beaumont (Inria) and‌ Alexandru Dobrila (Hivenet)
Local contact:
Olivier Beaumont & Lionel Eyraud Dubois & Yulia Gusak & Thomas Herault & Philippe Swartvagher
Partners:
Hivenet
Inria teams:‌
- ARGO and MIMOVE, Inria Paris
- COAST, Inria Nancy‌ – Grand Est
- MAGELLAN, STACK and WIDE, Inria‌ Centre at Rennes University
- OCKHAM, Inria Centre of‌ Lyon
- COATI and NEO, Inria Centre at Université‌ Côte d’Azur
- TADAAM and TOPAL, Inria Centre of‌ the University of Bordeaux
Summary:
The Cupseli challenge aims to demonstrate that it is possible to run complex applications (particularly in the field of machine learning) on heterogeneous, distributed, and volatile resources, while achieving strong parallel efficiency and preserving both accuracy and confidentiality. Building on the combined expertise of Hivenet and Inria in storage technologies illustrated in Alvearium, this strategic partnership explores algorithmic and system solutions to optimize computation, memory, and communications, while ensuring security and fault tolerance. The work is organized around three axes: Frugality (adapting training and inference to limited and dynamic resources), Security and Confidentiality (protecting data and models through encryption, secure enclaves, and defenses against attacks), and Volatility (ensuring robustness and performance despite the unpredictable arrival and departure of resources). The shared goal is to offer a green and sovereign alternative to data centers, by leveraging already-existing resources for the benefit of AI and Big Data applications.

Challenge PULSE: Pushing low-carbon services towards the Edge‌

Duration:
2022 – 2026
Coordinator:
Romain Rouvoy
Local‌ contact:
Olivier Beaumont & Lionel Eyraud Dubois
Partners:‌
Qarnot Computing, ADEME
Inria teams:
- Avalon
- Ctrl-A
- Spirals‌
- Stack
- Storm
- Topal
Summary:
The PULSE challenge aims to develop and promote best practices in geo-distributed hardware and software infrastructures for more environmentally friendly intensive computing. The idea is to analyze which solutions are the most relevant, and which levers need to be focused on, to reduce the impact of infrastructures while maximizing the usefulness of their emissions. To this end, the challenge is structured around two complementary research axes to address this technological and environmental issue: holistic analysis of the environmental impact of intensive computing, and implementing more virtuous edge services.

10.5 Public policy support‌

Olivier Beaumont conducted an expert assessment, in collaboration‌ with IRD and other Inria colleagues, on the‌ current state and future development prospects of high-performance‌ computing in Africa. This study was commissioned by‌ the French Development Agency (Agence Française de‌ Développement).

11 Dissemination

11.1 Promoting scientific activities‌

11.1.1 Scientific events: organisation

Member of the organizing committees

Philippe Swartvagher and‌ Emmanuel Agullo (Concace team) were organizing chairs of‌ Compas 2025, the French Conference on Parallelism,‌ Architecture and System.

11.1.2 Scientific events: selection

Chair of conference program committees‌

Abdou Guermouche was chair‌ of the track System‌‌ Software and Cloud Computing of the SC25 (International‌ Conference for High Performance‌ Computing, Networking, Storage, and‌‌ Analysis) international conference.
Thomas Herault was chair of‌ the Programming Environments and‌ System Software track of‌‌ the ISC High Performance 2026 international conference.

Member‌ of the conference program‌ committees

Olivier Beaumont was involved in the following program committees: SC25 (Algorithms), HPDC25, IPDPS26 (Algorithms).
Lionel‌ Eyraud Dubois was involved‌‌ in the program committee of EuroPar 2025.‌
Philippe Swartvagher was involved‌ in the following program‌‌ committees: Cluster 2025, PMBS 2025 workshop,‌ and reproducibility committee of‌ SC 25 (41‌‌, 42).
Abdou Guermouche was involved in the following program committee: HCW 2025 (Heterogeneity in Computing Workshop).
Mathieu Faverge was involved in the program committee of SBAC-PAD 2025 (Parallel Applications and Algorithms).
Yulia Gusak was‌ involved in the following‌ program committees: ICML 2025‌‌, NeurIPS 2025, AAAI 2026, ICLR‌ 2026.
Laércio Lima‌ Pilla was involved in‌‌ the following program committees: ESSA 2025, HCW‌ 2025, IC2E 2025‌, and SC25 (Algorithms)‌‌.
Thomas Herault was involved in the following‌ program committees: ISC High‌ Performance 2025, ICPP‌‌ 2025, SC25 (Algorithms), HiPC'25 (System Software)‌, and the Workshop‌ on Asynchronous Many-Tasks Applications‌‌ 2025 30.

Reviewer

The members of the TOPAL project have also performed reviewing for the following conferences: IPDPS'25, SC 25, HiPC'25.

11.1.3 Journal‌

Member of the editorial‌‌ boards

Olivier Beaumont is Associate Editor in Chief for the Journal of Parallel and Distributed Computing (Elsevier JPDC).
Olivier Beaumont is Guest Editor for‌ a Special Issue of‌ IEEE Internet Computing with‌‌ Shadi Ibrahim et al. on Serverless Computing.‌
Thomas Herault is Associate‌ Editor for Algorithms of‌‌ The Journal of Parallel and Distributed Computing (JPDC)‌.

Reviewer - reviewing‌ activities

The members of the TOPAL project have performed reviewing for Journal of Parallel and Distributed Computing (Lionel Eyraud Dubois, Abdou Guermouche), ACM Transactions on Mathematical Software (Pierre Ramet, Abdou Guermouche), IEEE Transactions on Parallel and Distributed Systems (Lionel Eyraud Dubois, Abdou Guermouche, Mathieu Faverge), SoftwareX (Abdou Guermouche), Parallel Computing (Laércio Lima Pilla, Abdou Guermouche), 4OR - A Quarterly Journal of Operations Research (Lionel Eyraud Dubois).

11.1.4‌‌ Invited talks

Yulia Gusak gave a talk at Sharp+Foundary @ COLT workshop entitled “Training Neural Networks Under Memory Constraints”.
Yulia Gusak gave a talk at AI4Industry'25 entitled “Efficient Training of Neural Networks”.
Yulia Gusak gave a talk at the 18th Scheduling for large-scale systems workshop, entitled “Optimizing neural networks training using different types of parallelisms (data/tensor/model/pipeline) and re-materialization”.
Laércio Lima‌ Pilla gave a talk‌‌ at the 18th Scheduling for large-scale systems workshop,‌ entitled “Exploring scheduling solutions‌ for Federated Learning training”.‌‌
Olivier Beaumont gave a‌ talk at the 18th Scheduling for large-scale systems‌ workshop, entitled “Optimized Forward-Backward Rematerialization for Memory-Efficient Pipeline‌ Parallel Training”.

11.1.5 Leadership within the scientific community‌

Olivier Beaumont is a member of the IEEE CS Babbage Award selection committee.

11.1.6 Scientific expertise‌

Olivier Beaumont conducted an expert assessment, in collaboration‌ with IRD and other Inria colleagues, on the‌ current state and future development prospects of high-performance‌ computing in Africa. This study was commissioned by‌ the French Development Agency (Agence Française de‌ Développement).
Olivier Beaumont acted as external evaluator for several EuroHPC calls: Inno4Scale, Energy, FFPlus.
Pierre Ramet is Scientific Advisor at the‌ CEA-DAM CESTA.
Pierre Ramet participated in the HCERES evaluation committee of the IRFU (Institut de recherche sur les lois fondamentales de l'Univers) at CEA Saclay. The final report was published in March 2025.
Abdou Guermouche acted as external evaluator‌ for one ANRT proposal.

11.1.7 Research administration

Pierre‌ Ramet is the head of the CNRS Satanas‌ department.
Pierre Ramet is a member of the Scientific Committee of LaBRI.
Philippe Swartvagher is the communication‌ referent for the NumPEx/Exa-SofT project.
Philippe Swartvagher is‌ the point of contact in Bordeaux for Grid5000/SLICES-FR‌ infrastructure.
Philippe Swartvagher is the representative of the‌ TOPAL team at the Bordeaux CUMI.
Philippe Swartvagher is an elected member of the Center Committee of Inria Bordeaux.
Abdou Guermouche is the scientific lead‌ of the numerical library work package of the‌ ExaSoft project (PEPR NumPEx).
Abdou Guermouche is a member of the Scientific Committee of LaBRI.
Yulia Gusak‌ is a PI of the ELF associate team‌ between Topal and Caltech.
Laércio Lima Pilla is‌ a member of the societal challenges commission at‌ the LaBRI.
Laércio Lima Pilla is a member‌ of the committee on gender equality and equal‌ opportunities of the Inria Research center at the‌ University of Bordeaux.
Laércio Lima Pilla is a‌ member of the National Gender Equality and Equal‌ Opportunities Committee at Inria.

11.2 Teaching - Supervision‌ - Juries - Educational and pedagogical outreach

Undergraduate‌ level/Licence:
- Aurélien Esnard : Network (54h), Software technologies‌ (80h) at Bordeaux University.
- Pierre Ramet : System programming (24h), Databases (32h), Object programming (48h), Distributed programming (16h), Cryptography (16h), Introduction to unsupervised learning (16h) at Bordeaux University.
- Philippe Swartvagher : C‌ Programming (46h), Web Programming (36h), Tools for Programming‌ and C project (30h) at Bordeaux INP (‌Enseirb-MatMeca).
- Abdou Guermouche : System programming 36h at Bordeaux University.
- Mathieu Faverge : Programming environment (26h),‌ Numerical algorithmic (25h), C projects (25h) at Bordeaux‌ INP (Enseirb-MatMeca).
Postgraduate level/Master:
- Aurélien‌ Esnard : Network management (24h), Network security (24h)‌ at Bordeaux University.
- Lionel Eyraud Dubois : Graphs‌ and Algorithms (20h), Complexity and Approximation (20h) at‌ Bordeaux University.
- Olivier Beaumont : Parallel Algorithms, 20h‌ at Bordeaux INP.
- Pierre Ramet : Cryptography 20h‌ and Numerical algorithms 40h at Bordeaux INP (‌Enseirb-MatMeca).
- Philippe Swartvagher : Parallel Algorithms (17h),‌ Project of network and system programming (25h), Operating Systems (15h) at Bordeaux‌ INP (Enseirb-MatMeca).‌
- Abdou Guermouche : Network management 92h, Network security 64h, Operating system 24h at Bordeaux University.
- Mathieu Faverge : System programming: lecture, practice and project (54h), Linear Algebra for High Performance Computing (9h) at Bordeaux INP (Enseirb-MatMeca). He is also in charge of the Master 2 internships for the Computer Science department at Bordeaux INP (Enseirb-MatMeca) and, with Abdou Guermouche, of the High Performance Computing - High Performance Data Analytics specialty at Enseirb-MatMeca. This curriculum is shared between the Computer Science and MatMeca departments at Bordeaux INP and Bordeaux University, in the context of the Computer Science Research Master.
- Yulia Gusak :‌ Efficient Deep Learning (Outils‌‌ pour l'apprentissage) (19h) at Bordeaux INP (Enseirb-MatMeca‌).
- Laércio Lima Pilla : Algorithms for High-Performance Computing Platforms (16h) at Bordeaux INP (Enseirb-MatMeca) and Bordeaux University; Reading articles and scientific documentation (3h) at Bordeaux University.
- Thomas Herault : Introduction to tensor algebra for the Engineer in Computer Science (9h) at Bordeaux INP (Enseirb-MatMeca); OpenMP programming (8h) at Bordeaux INP (Enseirb-MatMeca).

11.2.1 Supervision

PhD in progress: Brieuc Nicolas; Scalable tensor algebra on top of runtime system; started Oct. 2024; advisors Thomas Herault, Mathieu Faverge, Abdou Guermouche.
PhD in progress: Nicolas Ducarton; Fault tolerance and task-based programming for large-scale systems; started April 2025; advisors Thomas Herault, Samuel Thibault, Amina Guermouche.
PhD in progress: Abel Calluaud; Combined compiler and runtime approach for a direct hierarchical solver; started Nov. 2022; advisors Pierre Ramet, Mathieu Faverge.
PhD in progress: Jean-François David; Dynamic Scheduling for Inference in Deep Neural Networks; advisors Olivier Beaumont, Lionel Eyraud Dubois.
PhD in progress: Alycia Lisito; Design and implementation of a portable linear algebra benchmark on runtime systems for performance evaluation of heterogeneous Exascale architectures; started Nov. 2023; advisors Pierre Ramet, Mathieu Faverge, Matthieu Kuhn (Eviden).
PhD in progress: Dimitri Walther; started Nov. 2024; advisors Pierre Ramet, Mathieu Faverge, M. Lecouvez (CEA Cesta).
PhD defended: Hayfa Tayeb; Optimization of high-performance applications on heterogeneous computing nodes; started Nov. 2021; defended March 25th, 2025; advisors A. Guermouche, B. Bramas, M. Faverge.
PhD in progress: Albert D'Aviau de Piolant; started October 2023; Energy aware scheduling for exascale architectures. Advisors: Abdou Guermouche and Amina Guermouche.
PhD in progress: Thomas Morin; started October 2023; Scheduling recursive task graphs. Advisors: Abdou Guermouche, Samuel Thibault, Pierre-André Wacrenier.
PhD in progress: Alice Lasserre; started Oct. 2022; Optimization of a task-based simulation code on a distributed supercomputer; advisors Jean-Marie Couteyen-Carpaye, Raymond Namyst and Abdou Guermouche.
PhD in progress: Samuel Mendoza; On the Scalability of sparse linear system solvers using the task-based paradigm; started Sept. 2025; advisors Abdou Guermouche, Emmanuel Agullo and Alfredo Buttari.
PhD in progress: Jean Conan; Simulation-based performance prediction of scientific computing applications on exascale supercomputers; started March 2025; advisors Abdou Guermouche, Louis Poirel and Arnaud Legrand.
PhD in progress: Adrien Aguila--Multner; started October 2024; Efficient Training of Neural Networks [36, 35]. Advisors: Yulia Gusak, Olivier Beaumont.
PhD defended: Diane Orhan; Modeling and dynamic optimization of software radio chains on heterogeneous architectures; defended in December 2025; advisors Denis Barthou, Christophe Jégo, and Laércio Lima Pilla.
PhD in progress: Alan Lira Nunes; Scheduling algorithms for the optimization of distributed machine learning models on heterogeneous resources; started in August 2022; advisors Cristina Boeres, Lúcia Drummond, and Laércio Lima Pilla.
PhD in progress: Vanderlei Munhoz Pereira Filho; Scheduling of task-based parallel applications on heterogeneous Cloud computing environments; started in February 2025; advisors Olivier Aumage, Márcio Castro, and Laércio Lima Pilla.
PhD in progress: Giorgio Bettonte; Large-Scale Artificial Intelligence Inference Optimization in Distributed Cloud Environments; started in October 2025; advisors Olivier Beaumont, Thomas Lambert, and Laércio Lima Pilla.
PhD in progress: Tristan Riehs; Integrate scheduling of asynchronous network communications and task scheduling; started in October 2025; advisors Samuel Thibault (Storm team), Alexandre Denis (Tadaam team), and Philippe Swartvagher.
Lionel Eyraud-Dubois and Philippe Swartvagher supervised the internship of Theo Grandsart about the use of task-based runtime systems to implement LLMs.
Philippe Swartvagher, with Alexandre Denis (Tadaam team) and Samuel Thibault (Storm team), supervised the internship of Tanguy Chatelain, about the anticipation of communications in task-based parallelism [64].
Thomas Herault and Philippe Swartvagher supervised the internship of Joachim Robert about communications for AI applications in a heterogeneous and geo-distributed network [45].
Thomas Herault and Philippe Swartvagher supervised the pre-PhD period of Fares Boudjaoui about the scheduling of communications in a heterogeneous network.
Internship on task-based systems for efficient deep learning (Enrique Galves). Supervised by Yulia Gusak and Olivier Beaumont.
Internship on diffusion model inference speed-up via parallelization within solver steps and solver composition (Victor Lucas Rosada Canesin). Supervised by Yulia Gusak.
Internship on efficient teacher–student pipeline-parallel training, with application to Reinforcement Learning from human feedback (Mohamed Kherraz). Supervised by Yulia Gusak.

11.2.2 Juries

Pierre‌ Ramet : chair of the PhD jury of‌ Lise Jolicoeur.
Olivier Beaumont : chair of the PhD juries of Luis Lopes Marques and Diane Orhan.
Lionel Eyraud Dubois acted as "opponent"‌ for the defense of Pirah Noor Soomro at‌ Chalmers University of Technology.
Thomas Herault : chair of the HDR jury of Francieli Boito; examiner on the jury for Atte Torri's PhD defense; examiner on the jury for Abdessalam Benhari's PhD defense.
Yulia Gusak : member of the‌ PhD jury of Yannick Malot on Quantized DNN‌ learning algorithms with limited hardware overhead for Edge implementation.
Yulia Gusak :‌ member of the PhD‌ monitoring committee (comité de‌‌ suivi) of Méline Trochon on Adaptive Checkpoint-Restart System‌ with Knowledge of the‌ Network Load.
Yulia Gusak‌‌ : member of the PhD monitoring committee of‌ Rafael Silva on Artificial‌ Intelligence for Cardiac Monitoring:‌‌ Portable Multimodal Cardiac Function Analysis.

11.3 Popularization

11.3.1‌ Participation in Live events‌

As part of the "Circuit Scientifique Bordelais", Philippe Swartvagher explained to high school pupils from the Lycée Stendhal in Aiguillon what research in computer science is and how to become a researcher.
As part of the "Fête de la Science", Olivier Beaumont presented HPC to students at Lycée Gaston Crampe, Aire-sur-l'Adour (Landes).
Olivier Beaumont participated in several internal events (closed-door sessions, etc.) to present the activities of the Inria Bordeaux center teams at the interface between HPC and AI.
As part of Maths en Jeans, Olivier Beaumont worked with groups of students from Andernos high school on combinatorial problems linked to training.
On several occasions, we have welcomed students from the French "troisième" and "seconde" classes (roughly ages 14 to 16) into the team, with the participation of TOPAL's PhD students, for periods of 2 hours to half a day.

12 Scientific production

12.1 Major publications

1 inproceedings Olivier Beaumont, Philippe Duchon, Lionel Eyraud-Dubois, Julien Langou and Mathieu Vérité. Symmetric Block-Cyclic Distribution: Fewer Communications Leads to Faster Dense Cholesky Factorization. SC 2022 - Supercomputing, Dallas, Texas, United States, November 2022.
2 inproceedings Olivier Beaumont, Lionel Eyraud-Dubois, Mathieu Vérité and Julien Langou. I/O-Optimal Algorithms for Symmetric Linear Algebra Kernels. ACM Symposium on Parallelism in Algorithms and Architectures, Philadelphia, United States, July 2022.
3 article Robert Falgout, Matthieu Lecouvez, Pierre Ramet and Clément Richefort. Toward an Algebraic Multigrid Method for the Indefinite Helmholtz Equation. SIAM Journal on Scientific Computing, August 2025, S285-S310.
4 article Mathieu Faverge, Nathalie Furmento, Abdou Guermouche, Gwenolé Lucas, Raymond Namyst, Samuel Thibault and Pierre-André Wacrenier. Programming Heterogeneous Architectures Using Hierarchical Tasks. Concurrency and Computation: Practice and Experience, vol. 35, no. 25, 2023.
5 inproceedings Julia Gusak, Xunyi Zhao, Théotime Le Hellard, Zhe Li, Lionel Eyraud-Dubois and Olivier Beaumont. HiRemate: Hierarchical Approach for Efficient Re-materialization of Large Neural Networks. Proceedings of the 42nd International Conference on Machine Learning (ICML 2025), vol. 267, Vancouver, Canada, 2025.
6 inproceedings Alycia Lisito, Mathieu Faverge, Matthieu Kuhn, Florent Pruvost and Pierre Ramet. Scalable and portable LU factorization with partial pivoting on top of runtime systems. IPDPS25 - 39th IEEE International Parallel and Distributed Processing Symposium, Milan, Italy, June 2025.
7 inproceedings Xunyi Zhao, Théotime Le Hellard, Lionel Eyraud-Dubois, Julia Gusak and Olivier Beaumont. Rockmate: an Efficient, Fast, Automatic and Generic Tool for Re-materialization in PyTorch. ICML 2023, Honolulu (HI), United States, July 2023.

12.2 Publications of the year

International journals‌

8 article Quentin Barbut, Lucas Perotin, Anne Benoit, Thomas Herault, Yves Robert and Frédéric Vivien. Fixed-Work vs. Fixed-Time Checkpointing on Large-Scale Failure-Prone Platforms. International Journal of High Performance Computing Applications, vol. 40, no. 1, August 2025, 96-114.
9 article Olivier Beaumont, Rémi Bouzel, Lionel Eyraud-Dubois, Esragul Korkmaz, Laércio Lima Pilla and Alexandre van Kempen. Approximation Algorithms for Scheduling with/without Deadline Constraints where Rejection Costs are Proportional to Processing Times. IEEE Transactions on Parallel and Distributed Systems, vol. 36, no. 12, December 2025, 2596-2608.
10 article Aurelien Bouteiller, Thomas Herault, Qinglei Cao, Joseph Schuchart and George Bosilca. PaRSEC: Scalability, flexibility, and hybrid architecture support for task-based applications in ECP. International Journal of High Performance Computing Applications, vol. 39, no. 1, 2025, 147-166.
11 article Robert Falgout, Matthieu Lecouvez, Pierre Ramet and Clément Richefort. Toward an Algebraic Multigrid Method for the Indefinite Helmholtz Equation. SIAM Journal on Scientific Computing, August 2025, S285-S310.
12 article Nathalie Furmento, Abdou Guermouche, Gwenolé Lucas, Thomas Morin, Samuel Thibault and Pierre-André Wacrenier. Optimizing Parallel Heterogeneous System Efficiency: Dynamic Task Graph Adaptation with Recursive Tasks. Journal of Parallel and Distributed Computing, vol. 205, June 2025, 105157.
13 article Diane Orhan, Laércio Lima Pilla, Denis Barthou, Adrien Cassagne, Olivier Aumage, Romain Tajan, Christophe Jégo and Camille Leroux. Optimal Scheduling Algorithms for Software-Defined Radio Pipelined and Replicated Task Chains on Multicore Architectures. Journal of Parallel and Distributed Computing, 2025, 105106. In press.
14 article Alix Tremodeux, Anne Benoit, Emmanuel Agullo, Thomas Herault, Luc Giraud and Yves Robert. Fault-tolerant numerical iterative algorithms at scale. International Journal of High Performance Computing Applications, 2025.

International peer-reviewed conferences

15 inproceedings Olivier Beaumont, Raphaël Bourgouin, Maxime Darrin, Loris Marchal and Pablo Piantanida. Leveraging Expert Usage to Speed up LLM Inference with Expert Parallelism. Euro-Par 2025: Parallel Processing, Lecture Notes in Computer Science, Dresden, Germany, August 2025.
16 inproceedings Anne Benoit, Thomas Herault, Yves Robert and Alix Tremodeux. Partial Detectors Versus Replication To Cope With Silent Errors. Euro-Par 2025 - 31st International European Conference on Parallel and Distributed Computing, Dresden, Germany, August 2025.
17 inproceedings Aurelien Bouteiller, Qinglei Cao, Joseph Schuchart and Thomas Herault. Comparing and Contrasting User and Runtime Directed Data Placement Strategies for Owner-Compute, Multi-accelerator Distributed Task Based Scheduling. Asynchronous Many-Task Systems and Applications: Third International Workshop, WAMTA 2025, Lecture Notes in Computer Science, vol. 15690, Saint Louis, Missouri, United States, Springer Nature Switzerland, October 2025, 140-153.
18 inproceedings Catherine Guelque, Valentin Honoré, Philippe Swartvagher, Gaël Thomas and François Trahay. PALLAS: a generic trace format for large HPC trace analysis. IPDPS 2025: 39th IEEE International Parallel & Distributed Processing Symposium, Milan, Italy, 2025.
19 inproceedings Julia Gusak, Xunyi Zhao, Théotime Le Hellard, Zhe Li, Lionel Eyraud-Dubois and Olivier Beaumont. HiRemate: Hierarchical Approach for Efficient Re-materialization of Large Neural Networks. Proceedings of the 42nd International Conference on Machine Learning (ICML 2025), vol. 267, Vancouver, Canada, 2025.
20 inproceedings Alycia Lisito, Mathieu Faverge, Matthieu Kuhn, Florent Pruvost and Pierre Ramet. Scalable and portable LU factorization with partial pivoting on top of runtime systems. IPDPS25 - 39th IEEE International Parallel and Distributed Processing Symposium, Milan, Italy, June 2025.
21 inproceedings Aboul-Karim Mohamed El Maarouf, Luc Giraud, Abdou Guermouche and Thomas Guignon. Sparse Matrix Ordering for Fine Grain Parallel Triangular Solve Using SIMD. Parallel Processing and Applied Mathematics (PPAM 2024) - 15th International Conference on Parallel Processing & Applied Mathematics, Lecture Notes in Computer Science, vol. 15579, Ostrava, Czech Republic, Springer Nature Switzerland, April 2025, 51-64.
22 inproceedings Diane Orhan, Yacine Idouar, Laércio Lima Pilla, Adrien Cassagne, Denis Barthou and Christophe Jégo. Scheduling Strategies for Partially-Replicable Task Chains on Two Types of Resources. 2025 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), Milano, Italy, IEEE, June 2025, 896-905.
23 inproceedings Joseph Schuchart, Aurelien Bouteiller, Thomas Herault, Edvard Valeev, George Bosilca and Robert J Harrison. Scalable Block-Sparse Matrix Multiplication Using Template Task Graphs. WAMTA 2025 - Workshop on Asynchronous Many-Task Systems and Applications, Lecture Notes in Computer Science, vol. 15690, Saint-Louis, Missouri, United States, Springer Nature Switzerland, October 2025, 120-132.
24 inproceedings Atte Torri, Przemysław Dominikowski, Brice Pointal, Oguz Kaya, Laércio Lima Pilla and Olivier Coulaud. Near-Optimal Contraction Strategies for the Scalar Product in the Tensor-Train Format. Euro-Par 2025: Parallel Processing - 31st International European Conference on Parallel and Distributed Computing, Lecture Notes in Computer Science, vol. 15902, Dresden, Germany, Springer Nature Switzerland, August 2025, 63-77.
25 inproceedings Nicolas Vanz, Vanderlei Munhoz, Márcio Castro, Laércio Lima Pilla and Olivier Aumage. Task-Based HPC in the Cloud: Price-Performance Analysis of N-Body Simulations with StarPU. IC2E 2025 - 13th IEEE International Conference on Cloud Engineering, Rennes, France, September 2025.
26 inproceedings Albert d'Aviau de Piolant, Hayfa Tayeb, Bérenger Bramas, Mathieu Faverge, Abdou Guermouche and Amina Guermouche. Improving energy efficiency of HPC applications using unbalanced GPU power capping. HCW (IPDPS workshop), Milan, Italy, June 2025.

Conferences‌ without proceedings

27 inproceedings Alycia Lisito, Mathieu Faverge, Matthieu Kuhn, Florent Pruvost and Pierre Ramet. Batching the tasks of the LU factorization with partial pivoting on top of runtime systems. COMPAS 2025 - Conférence francophone d'informatique en Parallélisme, Architecture et Système, Bordeaux, France, June 2025.
28 inproceedings Brieuc Nicolas, Mathieu Faverge and Thomas Herault. Contraction de tenseurs au-dessus de supports d'exécutions : Application à la méthode Coupled-Cluster. COMPAS 2025 - Conférence francophone d'informatique en Parallélisme, Architecture et Système, Bordeaux, France, June 2025.
29 inproceedings Dimitri Walther, Mathieu Faverge, Matthieu Lecouvez and Pierre Ramet. Algebraic hierarchical partitioning to improve H-matrix compression. PP 2026 - SIAM Conference on Parallel Processing for Scientific Computing, Berlin, Germany, March 2026.

Edition (books, proceedings,‌ special issue of a journal)

30 proceedings Asynchronous Many-Task Systems and Applications: Third International Workshop, WAMTA 2025. Lecture Notes in Computer Science, vol. 15690, Saint Louis, Missouri, United States, Springer Nature Switzerland, October 2025.
31 periodical Serverless Computing. IEEE Internet Computing, vol. 28, no. 6, January 2025, 5-7.

Doctoral dissertations and habilitation theses

32 thesis Jean-François David. Dynamic scheduling for deep neural network inference. Université de Bordeaux, March 2025.
33 thesis Hayfa Tayeb. Optimizing HPC applications with vectorization and multi-criteria task scheduling on heterogeneous systems. Université de Bordeaux, March 2025.

Reports‌‌ & preprints

34 report Céline Acary-Robert, Emmanuel Agullo, Benjamin Arrondeau, Lars Bilke, Dylan Bissuel, Ludovic Courtès, Collin J. Doering, Romain Garbage, Konrad Hinsen, Arun Isaac, Emmanuel Medernach, Sorina Camarasu-Pop, Pjotr Prins, Cayetano Santos, Philippe Swartvagher, Simon Tournier and Ricardo Wurmus. Guix-HPC Activity Report 2023-2024. Inria Bordeaux - Sud Ouest, February 2025, 1-36.
35 misc Adrien Aguila--Multner, Olivier Beaumont, Lionel Eyraud-Dubois and Julia Gusak. Mind Bubbles and Memory: Bounds on Scheduling Pipeline Parallelism with Rematerialization. September 2025.
36 misc Adrien Aguila--Multner, Olivier Beaumont, Lionel Eyraud-Dubois and Julia Gusak. Optimized Forward-Backward Rematerialization for Memory-Efficient Pipeline Parallel Training. May 2025.
37 misc Emmanuel Agullo, Alfredo Buttari, Abdou Guermouche and Antoine Jego. Redundant computations in task-based parallelism with applications to communication-reducing algorithms. July 2025.
38 report Anne Benoit, Thomas Herault, Yves Robert and Alix Tremodeux. Partial Detectors Versus Replication To Cope With Silent Errors. RR-9581, Inria, March 2025.
39 misc Yacine Idouar, Adrien Cassagne, Laércio Lima Pilla, Julien Sopena, Manuel Bouyer, Diane Orhan, Lionel Lacassagne, Dimitri Galayko, Denis Barthou and Christophe Jego. Energy-Aware Scheduling Strategies for Partially-Replicable Task Chains on Heterogeneous Processors. September 2025.
40 misc Alan Lira Nunes, Cristina Boeres, Laércio Lima Pilla and Lúcia Maria de A. Drummond. MetaCS-FL: A Metaheuristic-Based Framework for Client Selection in Federated Learning Systems. July 2025.
41 report Philippe Swartvagher. Reproducibility Report for SC25 Paper HP-MDR: High-performance and Portable Data Refactoring and Progressive Retrieval with Advanced GPUs. Inria, LaBRI, Bordeaux INP, November 2025, 2314-2315.
42 report Philippe Swartvagher. Reproducibility Report for SC25 Paper Story of Two GPUs: Characterizing the Resilience of Hopper H100 and Ampere A100 GPUs. Inria, LaBRI, Bordeaux INP, November 2025, 2278-2279.

Other scientific publications

43 inproceedings Przemysław Dominikowski, Atte Torri, Brice Pointal, Oguz Kaya, Laercio Lima Pilla and Olivier Coulaud. Exploring Near-Optimal Contraction Strategies for the Scalar Product in the Tensor-Train Format. IPDPS 2025 - 39th IEEE International Parallel & Distributed Processing Symposium, Milan, Italy, June 2025, 1274-1276.
44 misc Nasr-Allah Hitar and Raphaël Bourgouin. Empirical Law Synthesis via Galois LFSRs: Uniformity Statistical Assurance and High-Performance C++ Implementation for Non-Uniform Distributions: Galois LFSR analyses via extension galois fields. 2025.
45 thesis Joachim Robert. Communications réseau pour apprentissage automatique dans un environnement distribué, hétérogène et volatile. ENSEIRB-MATMECA, Talence, September 2025, 29.
46 inproceedings Dimitri Walther, Mathieu Faverge, Matthieu Lecouvez and Pierre Ramet. Hierarchical partitioning for the electromagnetic simulation of complex 3D objects. COMPAS 2025 - Conférence francophone d'informatique en Parallélisme, Architecture et Système, Bordeaux, France, June 2025.

12.3 Cited publications

47 article Emmanuel Agullo, Olivier Aumage, Mathieu Faverge, Nathalie Furmento, Florent Pruvost, Marc Sergent and Samuel Paul Thibault. Achieving High Performance on Supercomputers with a Sequential Task-based Programming Model. IEEE Transactions on Parallel and Distributed Systems, 2017, 1-1.
48 article Emmanuel Agullo, Alfredo Buttari, Abdou Guermouche and Florent Lopez. Implementing Multifrontal Sparse Solvers for Multicore Architectures with Sequential Task Flow Runtime Systems. ACM Trans. Math. Softw., vol. 43, no. 2, August 2016, 13:1-13:22.
49 inproceedings Emmanuel Agullo, Alfredo Buttari, Abdou Guermouche and Florent Lopez. Task-Based Multifrontal QR Solver for GPU-Accelerated Multicore Architectures. HiPC, Best paper award, IEEE Computer Society, 2015, 54-63.
50 inproceedings Pedro Alonso, Manuel F. Dolz, Francisco D. Igual, Rafael Mayo and Enrique S. Quintana-Ortí. Reducing Energy Consumption of Dense Linear Algebra Operations on Hybrid CPU-GPU Platforms. 2012 IEEE 10th International Symposium on Parallel and Distributed Processing with Applications, 2012, 56-62.
51 article Pedro Alonso, Manuel F. Dolz, Rafael Mayo and Enrique S. Quintana-Ortí. Modeling power and energy consumption of dense matrix factorizations on multicore processors. Concurrency and Computation: Practice and Experience, vol. 26, no. 17, 2014, 2743-2757. URL: https://onlinelibrary.wiley.com/doi/abs/10.1002/cpe.3162
52 inproceedings Hartwig Anzt, Jack Dongarra and Enrique S. Quintana-Ortí. Adaptive Precision Solvers for Sparse Linear Systems. Proceedings of the 3rd International Workshop on Energy Efficient Supercomputing (E2SC '15), Austin, Texas, Association for Computing Machinery, New York, NY, USA, 2015. URL: https://doi.org/10.1145/2834800.2834802
53 inproceedings Olivier Beaumont, Philippe Duchon, Lionel Eyraud-Dubois, Julien Langou and Mathieu Verite. Symmetric Block-Cyclic Distribution: Fewer Communications leads to Faster Dense Cholesky Factorization. SC'22: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (best paper, Algorithm Track), IEEE and ACM, 2022.
54 techreport Olivier Beaumont, Lionel Eyraud-Dubois, Julien Herrmann, Alexis Joly and Alena Shilova. Optimal checkpointing for heterogeneous chains: how to train deep neural networks with limited memory. RR-9302, Inria Bordeaux Sud-Ouest, November 2019.
55 inproceedings Olivier Beaumont, Lionel Eyraud-Dubois and Alena Shilova. Efficient Combination of Rematerialization and Offloading for Training DNNs. NeurIPS 2021 - Thirty-fifth Conference on Neural Information Processing Systems, Virtual-only Conference, December 2021.
56 inproceedings Olivier Beaumont, Lionel Eyraud-Dubois and Alena Shilova. MadPipe: Memory Aware Dynamic Programming Algorithm for Pipelined Model Parallelism. 2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), IEEE, 2022.
57 inproceedings Olivier Beaumont, Lionel Eyraud-Dubois and Alena Shilova. Optimal GPU-CPU Offloading Strategies for Deep Neural Network Training. Euro-Par 2020: Parallel Processing, Cham, Springer International Publishing, 2020, 151-166.
58 inproceedings Olivier Beaumont, Lionel Eyraud-Dubois and Alena Shilova. Pipelined Model Parallelism: Complexity Results and Memory Considerations. Proceedings of Europar 2021, Lisbon, Portugal, Springer, August 2021.
59 inproceedings Olivier Beaumont, Lionel Eyraud-Dubois and Mathieu Verite. 2D Static Resource Allocation for Compressed Linear Algebra and Communication Constraints. 2020 IEEE 27th International Conference on High Performance Computing, Data, and Analytics (HiPC), IEEE, 2020, 181-191.
60 inproceedings Olivier Beaumont, Lionel Eyraud-Dubois, Mathieu Vérité and Julien Langou. I/O-Optimal Algorithms for Symmetric Linear Algebra Kernels. ACM Symposium on Parallelism in Algorithms and Architectures, Association for Computing Machinery: SIGACT, SIGARCH, Philadelphia, United States, July 2022.
61 article Olivier Beaumont, Julien Herrmann, Guillaume Pallez and Alena Shilova. Optimal memory-aware backpropagation of deep join networks. Philosophical Transactions of the Royal Society A, vol. 378, no. 2166, 2020, 20190049.
62 inproceedings Rocío Carratalá-Sáez, Mathieu Faverge, Grégoire Pichon, Enrique Salvador Quintana-Ortí and Guillaume Sylvand. Exploiting Generic Tiled Algorithms Toward Scalable H-Matrices Factorizations on Top of Runtime Systems. SIAM PP20 - SIAM Conference on Parallel Processing for Scientific Computing, 2020.
63 inproceedings Rocío Carratalá-Sáez, Mathieu Faverge, Grégoire Pichon, Guillaume Sylvand and Enrique S Quintana-Ortí. Tiled Algorithms for Efficient Task-Parallel H-Matrix Solvers. 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), IEEE, 2020, 757-766.
64 mastersthesis Tanguy Chatelain. Anticipation des communications réseau grâce à la connaissance du futur dans le parallélisme à tâche. MA Thesis, Enseirb-Matmeca, September 2025.
65 inproceedings Viktoriia Chekalina, Georgiy Novikov, Julia Gusak, Alexander Panchenko and Ivan Oseledets. Efficient gpt model pre-training using tensor train matrix representation. Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation, 2023, 600-608.
66 article Tianqi Chen, Bing Xu, Chiyuan Zhang and Carlos Guestrin. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174, 2016.
67 article Daria Cherniuk, Stanislav Abukhovich, Anh-Huy Phan, Ivan Oseledets, Andrzej Cichocki and Julia Gusak. Quantization aware factorization for deep neural network compression. Journal of Artificial Intelligence Research, vol. 81, 2024, 973-988.
68 article Robert D Falgout, Stephanie Friedhoff, Tz V Kolev, Scott P MacLachlan and Jacob B Schroder. Parallel time integration with multigrid. SIAM Journal on Scientific Computing, vol. 36, no. 6, 2014, C635-C661.
69 article Martin J Gander and Stefan Vandewalle. Analysis of the parareal time-parallel time-integration method. SIAM Journal on Scientific Computing, vol. 29, no. 2, 2007, 556-578.
70 article P. Ghysels, X. S. Li, F.-H. Rouet, S. Williams and A. Napov. An Efficient Multicore Implementation of a Novel HSS-Structured Multifrontal Solver Using Randomized Sampling. SIAM Journal on Scientific Computing, vol. 38, no. 5, 2016, S358-S384.
71 inproceedings Aidan N Gomez, Mengye Ren, Raquel Urtasun and Roger B Grosse. The reversible residual network: Backpropagation without storing activations. Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017, 2211-2221.
72 inproceedingsA.Audrunas Gruslys‌, R.Rémi Munos, I.Ivo Danihelka‌, M.Marc Lanctot and A.Alex Graves‌. Memory-efficient backpropagation through time.Advances in‌ Neural Information Processing Systems2016, 4125--4133back‌ to text
73 inproceedingsU.Udit Gupta,‌ Y. G.Young Geun Kim, S.Sylvia‌ Lee, J.Jordan Tse, H.-H. S.‌Hsien-Hsin S Lee, G.-Y.Gu-Yeon Wei,‌ D.David Brooks and C.-J.Carole-Jean Wu.‌ Chasing Carbon: The Elusive Environmental Footprint of Computing‌.2021 IEEE International Symposium on High-Performance Computer‌ Architecture (HPCA)IEEE2021, 854--867back to‌ text
74 inproceedingsJ.Julia Gusak, D.‌Daria Cherniuk, A.Alena Shilova, A.‌Alexander Katrutsa, D.Daniel Bershatsky, X.‌Xunyi Zhao, L.Lionel Eyraud-Dubois, O.‌Oleg Shlyazhko, D.Denis Dimitrov, I.Ivan Oseledets and O.‌Olivier Beaumont. Survey‌ on Large Scale Neural‌‌ Network Training.The 31st International Joint Conference‌ on Artificial Intelligence (IJCAI)‌2022back to text‌‌back to text
75 articleA.A. Ida‌, T.T. Iwashita‌, T.T. Mifune‌‌ and Y.Y. Takahashi. Parallel Hierarchical Matrices‌ with Adaptive Cross Approximation‌ on Symmetric Multiprocessing Clusters‌‌.Journal of Information Processing2242014‌, 642--650back to‌ text
76 techreportE.‌‌Esragul Korkmaz, M.Mathieu Faverge, G.‌Grégoire Pichon and P.‌Pierre Ramet. Deciding‌‌ Non-Compressible Blocks in Sparse Direct Solvers using Incomplete‌ Factorization.RR-9396Inria‌ Bordeaux - Sud Ouest‌‌2021, 16HALback to text
77 Navjot Kukreja, Alena Shilova, Olivier Beaumont, Jan Huckelheim, Nicola Ferrier, Paul Hovland and Gerard Gorman. Training on the Edge: The why and the how. 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), IEEE, 2019, 899-903.
78 Xavier Lacoste, Mathieu Faverge, George Bosilca, Pierre Ramet and Samuel Thibault. Taking Advantage of Hybrid Systems for Sparse Direct Solvers via Task-Based Runtimes. 2014 IEEE International Parallel & Distributed Processing Symposium Workshops, Phoenix, AZ, USA, May 19-23, 2014, IEEE Computer Society, 2014, 29-38. URL: https://doi.org/10.1109/IPDPSW.2014.9
79 T. Mary. Block Low-Rank multifrontal solvers: complexity, performance and scalability. Ph.D. Dissertation, Université Toulouse 3 Paul Sabatier, 2017.
80 Salli Moustafa, François Févotte, Mathieu Faverge, Laurent Plagne and Pierre Ramet. Efficient Parallel Solution of the 3D Stationary Boltzmann Transport Equation for Diffusive Problems. Journal of Computational Physics, March 2019.
81 Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R. Devanur, Gregory R. Ganger, Phillip B. Gibbons and Matei Zaharia. PipeDream: generalized pipeline parallelism for DNN training. Proceedings of the 27th ACM Symposium on Operating Systems Principles, 2019, 1-15.
82 David Patterson, Joseph Gonzalez, Quoc Le, Chen Liang, Lluis-Miquel Munguia, Daniel Rothchild, David So, Maud Texier and Jeff Dean. Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350, 2021.
83 Anh-Huy Phan, Konstantin Sobolev, Konstantin Sozykin, Dmitry Ermilov, Julia Gusak, Petr Tichavský, Valeriy Glukhov, Ivan Oseledets and Andrzej Cichocki. Stable Low-rank Tensor Decomposition for Compression of Convolutional Neural Network. European Conference on Computer Vision (ECCV), Springer, 2020, 522-539.
84 Grégoire Pichon, Eric Darve, Mathieu Faverge, Pierre Ramet and Jean Roman. Sparse supernodal solver using block low-rank compression: Design, performance and analysis. International Journal of Computational Science and Engineering 27, July 2018, 255-270.
85 Grégoire Pichon, Mathieu Faverge and Pierre Ramet. Recent Developments Around the Block Low-Rank PaStiX Solver. SIAM Conference on Parallel Processing for Scientific Computing (SIAM PP 2020), 2020.
86 Dalal Sukkari, Hatem Ltaief, David Keyes and Mathieu Faverge. Leveraging Task-Based Polar Decomposition Using PARSEC on Massively Parallel Systems. 2019 IEEE International Conference on Cluster Computing (CLUSTER), IEEE, 2019, 1-12.

Keywords

Computer Science and Digital Science

Other Research Topics and Application Domains

1 Team members, visitors, external collaborators

Research Scientists

Faculty Members

PhD Students

Technical Staff

Interns and Apprentices

Administrative Assistants

2 Overall objectives

3 Research program

3.1 Objectives

3.2 Overall Positionning

3.3 Research Axes

3.3.1 Use of Runtime systems

3.3.2 Design of compression techniques

3.3.3 Energy minimization

3.3.4 Communication and Fault Tolerance

3.4 Main Research Topics

3.4.1 Task-based Linear Algebra and Tensor Computations

3.4.2 Multi-Linear Algebra and Tensor Decompositions

3.4.3 Energy Minimization in Linear Solvers

3.4.4 Task-based Approaches for Deep Learning

3.4.5 Tensor Compression for Inference

3.4.6 Carbon Saving and Energy-Efficient Training

3.4.7 Communication-Aware Resilience Patterns for Iterative Linear Algebra

3.4.8 Communication-Aware Resilience Patterns for Training and Inference

4 Application domains

4.1 Multi-Linear Algebra and Solvers

4.2 Training and Inference for DNNs

5 Social and environmental responsibility

5.1 Footprint of research activities

5.2 Impact of research results

5.2.1 Carbon Impact of Cloud Platforms

5.2.2 Democratization of Large Models Training

6 Highlights of the year

7 Latest software developments, platforms, open data

7.1 Latest software developments

7.1.1 Chameleon

7.1.2 ELF

7.1.3 PaStiX

7.1.4 pmtool

7.1.5 StarPart

7.1.6 StarPU

7.1.7 rockmate

7.1.8 rotor

7.1.9 VITE

8 New results

8.1 Scalable and portable LU factorization with partial pivoting on top of runtime systems (Topic 3.4.1)

8.2 Batching the tasks of the LU factorization with partial pivoting on top of runtime systems (Topic 3.4.1)

8.3 Toward an algebraic multigrid method for the indefinite Helmholtz equation (Topic 3.4.2)

8.4 Hierarchical partitioning for the numerical simulation of complex 3D objects (Topic 3.4.2)

8.5 Optimal scheduling algorithms for software-defined radio pipelined and replicated task chains on multicore architecture (Axis 3.3.1)

8.6 Task-Based HPC in the Cloud: Price-Performance Analysis of N-Body Simulations with StarPU (Axis 3.3.1)

8.7 Task-Based HPC in the Cloud: Price-Performance Analysis of N-Body Simulations with StarPU (Topic 3.4.2)

8.8 MetaCS-FL: A Metaheuristic-Based Framework for Client Selection in Federated Learning Systems (Topic 3.4.6)

8.9 Approximation Algorithms for Scheduling With/Without Deadline Constraints Where Rejection Costs are Proportional to Processing Times (Axis 3.3.3)

8.10 Energy-Aware Scheduling Strategies for Partially-Replicable Task Chains on Heterogeneous Processors (Axis 3.3.3)

8.11 HiRemate: Hierarchical Approach for Efficient Re-materialization of Large Neural Networks (Domain 4.2)

8.12 Fault-tolerant numerical iterative algorithms at scale (Topic 3.4.7)

8.13 Partial Detectors Versus Replication To Cope With Silent Errors (Axis 3.3.4)

8.14 Fixed-Work vs. Fixed-Time Checkpointing on Large-Scale Failure-Prone Platforms (Axis 3.3.4)

8.15 PaRSEC: Scalability, flexibility, and hybrid architecture support for task-based applications in ECP (Axis 3.3.1)

8.16 Tensor Contractions on Top of Runtime Systems: Application to the Coupled-Cluster Method (Topic 3.4.1)

8.17 Scalable Block-Sparse Matrix Multiplication Using Template Task Graphs (Topic 3.4.1)

8.18 Comparing and Contrasting User and Runtime Directed Data Placement Strategies for Owner-Compute, Multi-accelerator Distributed Task Based Scheduling (Topic 3.4.1)

8.19 Optimizing Parallel Heterogeneous System Efficiency: Dynamic Task Graph Adaptation with Recursive Tasks (Topic 3.4.1)

8.20 Improving energy efficiency of HPC applications using unbalanced GPU power capping (Topic 3.4.3)

8.21 Sparse Matrix Ordering for Fine Grain Parallel Triangular Solve Using SIMD (Topic 3.4.1)

8.22 Mind Bubbles and Memory: Bounds on Scheduling Pipeline Parallelism with Rematerialization (Domain 4.2)

8.23 Optimized Forward-Backward Rematerialization for Memory-Efficient Pipeline Parallel Training (Topic 3.4.4)

8.24 Leveraging Expert Usage to Speed up LLM Inference with Expert Parallelism (Topic 3.4.4)

8.25 Pallas: a generic trace format for large HPC trace analysis (Axis 3.3.1)

9 Bilateral contracts and grants with industry

9.1 Bilateral Grants with Industry

10 Partnerships and cooperations

10.1 International initiatives

10.1.1 Associate Teams in the framework of an Inria International Lab or in the framework of an Inria International Program

10.2 International research visitors

10.2.1 Visits to international teams

Research stays abroad

10.3 European initiatives

EUPEX

10.4 National initiatives

Challenge Cupseli: Collaborative Unified Platform for a Scalable and Efficient Learning Infrastructure

Challenge PULSE: Pushing low-carbon services towards the Edge

10.5 Public policy support

11 Dissemination