2025 Activity report: Project-Team TOPAL

RNSR: 202324391S
  • Research center: Inria Centre at the University of Bordeaux
  • In partnership with: Bordeaux INP, Université de Bordeaux, CNRS
  • Team name: Tools and Optimization for high Performance Applications and Learning
  • In collaboration with: Laboratoire Bordelais de Recherche en Informatique (LaBRI)

Creation of the​‌ Project-Team: 2023 March 01​​

Keywords

Computer Science‌​‌ and Digital Science

  • A1.1.4.​​ High performance computing
  • A1.1.5.​​​‌ Exascale
  • A1.1.9. Fault tolerant‌ systems
  • A1.2.10. Digital Communications‌​‌
  • A1.3. Distributed Systems
  • A1.3.4.​​ Peer to peer
  • A1.3.5.​​​‌ Cloud
  • A1.6. Green Computing‌
  • A2.6.4. Resource management
  • A6.2.5.‌​‌ Numerical Linear Algebra
  • A6.2.7.​​ HPC for machine learning​​​‌
  • A7.1. Algorithms
  • A7.1.1. Distributed‌ algorithms
  • A7.1.2. Parallel algorithms‌​‌
  • A8.1. Discrete mathematics, combinatorics​​
  • A8.2. Optimization
  • A9.2. Machine​​​‌ learning
  • A9.2.4. Optimization and‌ learning
  • A9.2.6. Neural networks‌​‌
  • A9.2.8. Deep learning
  • A9.7.​​ AI algorithmics
  • A9.9. Distributed​​​‌ AI, Multi-agent

Other Research‌ Topics and Application Domains‌​‌

  • B4.2.2. Fusion
  • B9.5.1. Computer​​ science
  • B9.5.2. Mathematics

1​​​‌ Team members, visitors, external‌ collaborators

Research Scientists

  • Olivier‌​‌ Beaumont [Team leader​​, INRIA, Senior​​​‌ Researcher, HDR]‌
  • Lionel Eyraud Dubois [‌​‌INRIA, Researcher]​​
  • Yulia Gusak [INRIA​​​‌, Researcher]
  • Thomas‌ Herault [INRIA,‌​‌ Senior Researcher, HDR​​]
  • Laercio Lima Pilla​​​‌ [CNRS, Researcher‌]

Faculty Members

  • Aurélien‌​‌ Esnard [UNIV BORDEAUX​​, Associate Professor]​​​‌
  • Mathieu Faverge [BORDEAUX‌ INP, Associate Professor‌​‌]
  • Abdou Guermouche [​​UNIV BORDEAUX, Associate​​​‌ Professor, HDR]‌
  • Pierre Ramet [UNIV‌​‌ BORDEAUX, Professor,​​ HDR]
  • Philippe Swartvagher​​​‌ [BORDEAUX INP,‌ Associate Professor]

PhD‌​‌ Students

  • Adrien Aguila–Multner [​​INRIA]
  • Giorgio Bettonte​​​‌ [HIVE COMPUTING SERVICES‌ SAS, CIFRE,‌​‌ from Oct 2025]​​
  • Abel Anas Calluaud [​​​‌CEA, CIFRE,‌ until Oct 2025]‌​‌
  • Jean Conan [BULL​​, CIFRE]
  • Jean​​​‌ Francois David [INRIA‌, until Feb 2025‌​‌]
  • Andrei Drozdov [​​DIABOLOCOM, CIFRE,​​​‌ from Oct 2025]‌
  • Alan Lira Nunes [‌​‌INRIA and UFF,​​ Joint-doctorate (cotutelle) with UFF,​​​‌ Brazil]
  • Alycia Lisito‌ [BULL]
  • Samuel‌​‌ Mendoza [INRIA,​​ from Sep 2025]​​​‌
  • Brieuc Nicolas [INRIA‌]
  • Hayfa Tayeb [‌​‌INRIA, until Mar​​ 2025]
  • Dimitri Walther​​​‌ [CEA, CIFRE‌]

Technical Staff

  • Pierre‌​‌ Estérie [INRIA,​​ Engineer]

Interns and​​​‌ Apprentices

  • Fares Boudjaoui [‌INRIA, Apprentice,‌​‌ from Dec 2025]​​
  • Raphael Bourgouin [INRIA​​​‌, from Oct 2025‌]
  • Raphael Bourgouin [‌​‌INRIA, Intern,​​ from May 2025 until​​​‌ Aug 2025]
  • Raphael‌ Bourgouin [INRIA,‌​‌ until Apr 2025]​​
  • Killian Chateau [INRIA​​​‌, Intern, until‌ Apr 2025]
  • Enrique‌​‌ Galvez [INRIA,​​ Intern, until Jan​​​‌ 2025]
  • Theo Grandsart‌ [INRIA, from‌​‌ Nov 2025]
  • Theo​​ Grandsart [INRIA,​​​‌ Intern, from May‌ 2025 until Aug 2025‌​‌]
  • Mohamed Kherraz [​​INRIA, Intern,​​​‌ from Apr 2025 until‌ Sep 2025]
  • Matteo‌​‌ Marcos [INRIA,​​ Intern, from Mar​​​‌ 2025 until Jul 2025‌]
  • Samuel Mendoza [‌​‌INRIA, from Apr​​ 2025 until Aug 2025​​​‌]
  • Zhaniya Nurkhanova [‌INRIA, Intern,‌​‌ until Apr 2025]​​​‌
  • Joachim Robert [INRIA​, Intern, from​‌ Apr 2025 until Aug​​ 2025]
  • Victor Lucas​​​‌ Rosada Canesin [INRIA​, Intern, from​‌ Sep 2025 until Sep​​ 2025]
  • Victor Lucas​​​‌ Rosada Canesin [INRIA​, Intern, from​‌ May 2025 until Aug​​ 2025]

Administrative Assistants​​​‌

  • Catherine Cattaert Megrat [​INRIA]
  • Marie-Melissandre Roy​‌ [INRIA]

2​​ Overall objectives

The team's expertise lies at the intersection of numerical simulation, training and HPC. In this context, the ability to use effectively the ever-increasing power of machines for numerical simulations (with the shift to exascale in the next few years) remains central. These new platforms are characterized by their huge size (in terms of number of cores) and by the heterogeneity of their computing resources, with most of the computational power provided by accelerators. We have largely anticipated these evolutions: in particular, team members have worked for several years to promote the use of dynamic runtimes such as StarPU, through a long-running collaboration with the Storm project team. Runtime systems allow heterogeneous resources to be used transparently and allow some placement and scheduling decisions to be made dynamically, without the need for static planning in advance. Indeed, a fully static allocation would not be able to cope with the uncertainties of task and communication durations in increasingly complex environments and with increasingly shared resources. The question of scaling up these solutions, in particular their use in neural network training and the effective management of large-scale distributed machines, remains largely open.

As in​‌ many other fields, Machine​​ Learning is changing the​​​‌ landscape at many levels.​ Training of large networks​‌ represents a new application​​ for HPC because of​​​‌ the huge computational and​ memory needs it generates.​‌ Training has become a​​ major source of use​​​‌ for converged HPC systems​ such as the Jean​‌ Zay supercomputer at IDRIS.​​ If considered as an​​​‌ HPC workflow, it is​ an application that is​‌ quite different from traditional​​ numerical simulation applications, because​​​‌ the calculations are tensor-based​ rather than matrix-based and​‌ because the nature of​​ the dependencies makes parallelization​​​‌ more difficult and more​ intertwined with memory management​‌ issues.

On the other hand, ML plays a central role in the analysis of data, particularly data produced by large scientific instruments and large numerical simulations. In this context, it is important to bridge the data placement, resource allocation and computational scheduling strategies that are used to perform simulations and those used to perform data analysis. There again, we believe that dynamic runtime schedulers, coupled with static data placement strategies, are a relevant and promising tool. Finally, training represents a very important market and has a strong and growing influence on processor architectures, their accuracy and their arithmetic. This requires further adapting the algorithms, the management of ever-increasing heterogeneity and the control of computational accuracy, both for classical numerical kernels and for training deep neural networks.

Another major concern is the minimization of energy consumption and carbon footprint. HPC is not naturally or historically an area of energy sobriety, but energy is a critical issue. Firstly, energy is a major subject because the race towards exascale has highlighted the difficulty of electrically powering all these resources, and the increasing presence of dark silicon in computing resources makes resource allocation and power management problems extremely difficult. Furthermore, the minimization of our carbon footprint is a major societal issue and must be an axis of evaluation for our research. In this context, we believe that the solution cannot only be at the architecture and system levels, but that it is necessary to rethink parallel numerical kernels and algorithms in such a way as to allow prolonged use of the computing resources.

A new development in‌​‌ the team’s research is​​ the explicit focus on​​​‌ communication efficiency and fault‌ tolerance as central challenges‌​‌ of modern high-performance computing.​​ As platforms continue to​​​‌ scale in size and‌ heterogeneity, the cost of‌​‌ data movement increasingly dominates​​ execution time, while hardware​​​‌ and software failures can‌ no longer be considered‌​‌ exceptional. Addressing these issues​​ requires approaches that integrate​​​‌ communication management and resilience‌ directly into algorithms and‌​‌ runtime systems, rather than​​ treating them as external​​​‌ concerns. This problem statement‌ is particularly relevant to‌​‌ the team’s main application​​ domains—linear algebra and machine​​​‌ learning—where large-scale data exchanges,‌ iterative methods, and long-running‌​‌ computations make performance and​​ robustness tightly coupled.

Overall,​​​‌ the objective of the‌ project is to transfer‌​‌ our historical expertise in​​ linear algebra, runtime systems​​​‌ and combinatorial optimization (resource‌ allocation, scheduling) to new‌​‌ problems (decompositions and tensor​​ algebra, training in DNNs)​​​‌ which require a change‌ of scale and new‌​‌ algorithms for new computing​​ platforms (with different number​​​‌ representations and an ever‌ increasing heterogeneity of computing‌​‌ resources). In addition, these​​ new applications and new​​​‌ platforms require a central‌ focus on data, since‌​‌ the gap between the​​ costs (in energy and​​​‌ time) of storing and‌ moving data compared to‌​‌ the costs of computation​​ is always growing, which​​​‌ encourages innovative solutions (compression,‌ redundant computation) that can‌​‌ in turn contribute to​​ increasing the duration of​​​‌ use of computing resources.‌

3 Research program

3.1‌​‌ Objectives

We propose to​​ structure our research around​​​‌ two main application fields‌ (see Section 4):‌​‌ linear multi-dimensional algebra and​​ solvers on the one​​​‌ hand, and training in‌ particular of deep learning‌​‌ networks on the other​​ hand. In these two​​​‌ domains, our contributions will‌ be organized around three‌​‌ main research axes (see​​ Section 3.3): the​​​‌ use of task based‌ runtime systems (to provide‌​‌ robust solutions and to​​ increase the portability in​​​‌ the context of heterogeneous‌ large scale platforms), the‌​‌ use of compression (to​​ limit memory footprint and​​​‌ data transfers) and the‌ minimization of energy consumption‌​‌ and carbon impact (using​​ an approach of rewriting​​​‌ algorithms and placement strategies‌ to limit data movements).‌​‌ This matrix organization of​​​‌ our activities (see Section​ 3.4) is intended​‌ to maximize the interactions​​ between the different researchers​​​‌ of the team and​ facilitate knowledge sharing and​‌ joint participation in projects.​​

Among these topics, the use of task-based runtime systems and the design of efficient linear algebra kernels and solvers belong to the historical expertise of the team and are shared by all team members, especially in the context of linear algebra kernels. Our goal is to build on this expertise to extend the use of task-based runtime systems to other types of applications such as training, and to use the precise knowledge of these linear algebra kernels to incorporate new criteria such as energy minimization. The application to training (and inference) in deep neural networks and data compression are subjects we have been interested in for a few years, typically during the last HiePACS evaluation period and within the Inria Challenge of AI, HPC and Big Data led by Bruno Raffin. The extension of the techniques developed in linear algebra to tensor algebra and tensor decompositions is natural, given the proximity of the fields and the practical importance of the subject, but it is more recent and reinforced by the arrival of Yulia Gusak, who is an expert in the field. Finally, the objective of energy and carbon footprint minimization, at the algorithmic and software levels rather than at the architecture level, is a field that we wish to emphasize in our research, both because of its own fundamental importance and because we believe that our expertise and the techniques that we have developed in recent years are well adapted to it and that the approach we propose is original.

3.2 Overall Positioning

The​​ general positioning of the​​​‌ team is to produce​ tools for users, academic​‌ or industrial, in the​​ form of algorithms and​​​‌ software libraries. These users​ can work either in​‌ numerical simulation or in​​ training. Nevertheless, as our​​​‌ experiences in simulation and​ training have already demonstrated,​‌ this interaction cannot be​​ carried out in the​​​‌ form of providing black​ boxes and it is​‌ crucial for us to​​ work directly with the​​​‌ users of our software​ to understand their needs​‌ and adapt our algorithms​​ and codes to the​​​‌ characteristics of their data.​ This interaction will be​‌ particularly critical to work​​ on data representation and​​​‌ compression, which requires a​ strong interaction with numerical​‌ methods and machine learning​​ in order to understand​​​‌ the application requirements and​ the characteristics of data,​‌ based on their significance.​​

At the other end of the spectrum, it is also essential for us to maintain close relationships with both the architecture and system communities. Indeed, the very rapid growth of machine learning applications has also renewed the landscape of computing resources with the emergence of very original solutions, at the architectural and arithmetic levels. Even if we cannot influence these evolutions, it is very important to propose solutions that make the best use of them. We also decided several years ago to rely on task-based runtime systems to implement our software developments. This decision has many implications for our developments and requires an extremely close collaboration with the designers of these runtime systems. In this context, we have co-supervised several PhD theses related to StarPU with the Storm project team and we will pursue this strategy, which is crucial in particular to take into account the challenges ahead of us: the transition to exascale, the integration of energy concerns, the extension to training applications and the ever-increasing heterogeneity of computing resources.

3.3 Research‌ Axes

3.3.1 Use of‌​‌ Runtime systems

Participants: Olivier​​ Beaumont, Aurélien Esnard​​​‌, Lionel Eyraud Dubois‌, Mathieu Faverge,‌​‌ Abdou Guermouche, Thomas​​ Herault, Laércio Lima​​​‌ Pilla, Philippe Swartvagher‌.

In previous work, our main goal was to study the methodology needed to efficiently exploit the new generation of high-performance computers with all the constraints that it induces (number of cores, heterogeneity, co-scheduling effects, etc.). To achieve this goal, we successfully proposed a methodology based on the use of modern task-based runtime systems to ensure both portability and performance portability (the ability to achieve high performance by tuning only a few parameters of the application). This work was done in the context of several projects (ANR Solhar, ANR SOLHARIS, Projet Région HPC Scalable Ecosystem, etc.). It mainly targeted single multicore nodes equipped with several accelerator devices, and the extension of these techniques to the multi-node case will be the focus of our future work, especially with the arrival of Philippe Swartvagher in the team. Indeed, it has been observed that in the context of distributed nodes, the placement strategies of runtime systems are insufficient and generate too much communication. In this context, it is therefore crucial to develop efficient placement strategies 60, 53. The extension of these mixed (static/dynamic) strategies to the case of tensors is largely open.
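
To make the notion of a static placement strategy concrete, the sketch below uses the classical 2D block-cyclic distribution of matrix tiles over a p x q grid of nodes. This is a standard mapping from dense linear algebra chosen for illustration, not necessarily the strategy developed in the cited works. The function names and grid sizes are assumptions made for this example.

```python
def owner(i, j, p, q):
    """Node owning tile (i, j) on a p x q process grid (2D block-cyclic)."""
    return (i % p) * q + (j % q)


def nodes_touched_by_row(i, n_tiles, p, q):
    """Distinct nodes holding tiles of tile-row i: a rough proxy for the
    number of destinations a broadcast along that row has to reach."""
    return {owner(i, j, p, q) for j in range(n_tiles)}


def nodes_touched_by_col(j, n_tiles, p, q):
    """Distinct nodes holding tiles of tile-column j."""
    return {owner(i, j, p, q) for i in range(n_tiles)}


if __name__ == "__main__":
    n_tiles = 8  # hypothetical number of tiles per dimension
    for p, q in [(1, 4), (2, 2)]:
        row = len(nodes_touched_by_row(0, n_tiles, p, q))
        col = len(nodes_touched_by_col(0, n_tiles, p, q))
        print(f"grid {p}x{q}: row spans {row} nodes, column spans {col} nodes")
```

With the same four nodes, the shape of the grid changes how many nodes a row or a column of tiles is spread over, which is one reason why the data mapping chosen before execution conditions the communication volume that a dynamic scheduler can later achieve.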

3.3.2 Design of compression‌​‌ techniques

Participants: Abdou Guermouche​​, Yulia Gusak,​​​‌ Mathieu Faverge, Pierre‌ Ramet, Philippe Swartvagher‌​‌.

The memory consumption of applications has been and will remain an important challenge for solving the larger problems that will lead to exascale computations. In recent years, we have demonstrated the benefit of data compression techniques in linear solvers, both to save space and to save computations. Increasingly complex compression schemes require programming models to evolve to properly express the parallelism of these formats and to accommodate the increasing irregularity of applications. In TOPAL, we propose to continue the study of data compression techniques (low-rank, mixed precision, ...) in the context of solvers, but also in the context of training and multi-linear algebra. This part will be a very relevant field for the study of applications over runtime systems, because of the strong irregularities that make load balancing more complicated. At the same time, it is an original and promising approach for energy reduction. Representing convolutional and fully-connected weights in tensor formats is an effective way to reduce the number of parameters and FLOPs in neural networks. However, post-training quantization (reduction of parameter precision, for example from float32 to int8) of networks with factorized weights yields a significant drop in accuracy. Due to the memory and power consumption limitations of real devices, the quantization step is necessary when pre-trained models are deployed. Therefore, our goal is to find algorithms that build tensorized neural networks whose weight factors directly contain elements in low-precision format. This will require an efficient implementation of operations on tensors represented in low-bit format, as well as the development of regularization techniques to tackle instability issues when training deep learning models with low-bit weights.
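
The interaction between factorization and quantization can be explored with a small experiment. The sketch below assumes a symmetric per-tensor int8 scheme and a random rank-r weight matrix; both are illustrative choices, not the team's method. It compares the reconstruction error obtained when quantizing a weight matrix directly with the error obtained when quantizing its two factors separately. All function names are hypothetical.

```python
import random


def quantize_int8(values):
    """Symmetric per-tensor quantization: floats -> integers in [-127, 127]."""
    scale = max(abs(v) for v in values) / 127.0 or 1.0
    return [round(v / scale) for v in values], scale


def dequantize(q, scale):
    """Map quantized integers back to floats."""
    return [x * scale for x in q]


def matmul(a, b):
    """Dense product of nested lists a (m x k) and b (k x n)."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def fake_quant(matrix):
    """Quantize then dequantize a matrix (its int8-representable copy)."""
    flat = [v for row in matrix for v in row]
    q, scale = quantize_int8(flat)
    deq = dequantize(q, scale)
    n = len(matrix[0])
    return [deq[r * n:(r + 1) * n] for r in range(len(matrix))]


def max_abs_diff(a, b):
    """Largest entrywise difference between two matrices of equal shape."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


if __name__ == "__main__":
    rng = random.Random(0)
    m, n, r = 16, 16, 4  # hypothetical layer size and rank
    u = [[rng.gauss(0, 1) for _ in range(r)] for _ in range(m)]
    v = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(r)]
    w = matmul(u, v)  # exact rank-r weight matrix
    err_dense = max_abs_diff(w, fake_quant(w))
    err_factors = max_abs_diff(w, matmul(fake_quant(u), fake_quant(v)))
    print(f"parameters: dense={m * n}, factorized={r * (m + n)}")
    print(f"max error when quantizing W directly: {err_dense:.4f}")
    print(f"max error when quantizing U and V:    {err_factors:.4f}")
```

In the factorized case the rounding errors of both factors are combined by the product, which gives an intuition for why building the factors directly in low precision is a different problem from quantizing them after the fact.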

3.3.3 Energy minimization

Participants:​ Olivier Beaumont, Lionel​‌ Eyraud Dubois, Mathieu​​ Faverge, Abdou Guermouche​​​‌, Yulia Gusak,​ Laércio Lima Pilla.​‌

Running computations with resource frugality is an important challenge, both for the upcoming exascale shift and for generally reducing the carbon impact of scientific computing. In addition to the usual objective of making computations run faster, we thus intend to design and evaluate our techniques and algorithms with the purpose of limiting their carbon footprint. In particular, given the lasting trend that the time and energy costs of computing are becoming ever lower than the costs of accessing and communicating data, we want to explore the trade-offs of exchanging more computation for fewer data movements. This can be achieved in several ways: compression techniques as described above, replication of some computations, or use of lower precision. We are planning to work on this issue from two points of view: more frugal numerical algorithms, and energy-aware scheduling techniques. As with the embedded architectures found in phones, and also in the latest generation of laptops (Apple M1 Pro and Max chips), we are starting to see the emergence of Big-Little type technologies in the design of HPC-oriented chips. In general, thermal design power (TDP) constraints push architects to increase the diversity and number of energy-efficient circuits, even if they cannot all be powered simultaneously. While this hardware solution is very debatable from the point of view of carbon impact, it raises difficult and original questions about the optimization of computing performance under energy constraints. This kind of approach opens new perspectives, both for scheduling algorithms and for the design of computational kernels in linear algebra.
We are also seeing the emergence of new processors (ARM or RISC-V technologies, Rhea from the SiPearl company within the EPI consortium), which should seriously challenge the supremacy of x86 architectures (Intel and AMD) with Nvidia accelerator cards in the search for a compromise between pure performance and energy sobriety.

In the field of training, a complementary opportunity is available. Indeed, contrary to classical HPC, the renewal of computational resources is often driven by the need to run larger models (and, to a lesser extent, data with a better resolution), rather than by the acceleration of computations. In this context, the possibility offered by tools such as Rotor 7.1.8, Rockmate 7.1.7 and ELF 7.1.2 to limit memory requirements contributes to limiting the carbon footprint. Our goal is to extend the scope of these techniques, including to fields of application other than training. Our collaboration with Qarnot Computing is consistent with this objective. The co-design environments of the TextaRossa and Eupex projects 10 are also great avenues to explore these questions.

3.3.4 Communication and‌​‌ Fault Tolerance

Participants: Olivier​​ Beaumont, Lionel Eyraud​​​‌ Dubois, Thomas Herault‌, Philippe Swartvagher.‌​‌

The new research axis​​ on communication and fault​​​‌ tolerance represents an opportunity‌ for the team to‌​‌ address a broader spectrum​​ of challenges arising in​​​‌ modern high-performance computing platforms.‌ As applications increasingly rely‌​‌ on large numbers of​​ interconnected components, communication costs​​​‌ and failures have become‌ central limitations to performance,‌​‌ scalability, and usability. This​​ axis builds on the​​​‌ expertise brought by the‌ arrival of Thomas Hérault‌​‌ as a research director,​​ together with the team’s​​​‌ existing strengths in communication‌ systems through Philippe Swartvagher,‌​‌ to explore new techniques​​ spanning communication optimization, resilience​​​‌ mechanisms, and their interaction‌ with runtime systems. By‌​‌ covering a wider range​​ of problems and solution​​​‌ strategies, this axis naturally‌ complements the existing research‌​‌ directions of the team​​ and reinforces their applicability​​​‌ to the targeted application‌ domains, enabling more scalable,‌​‌ efficient, and robust executions​​ on current and future​​​‌ computing platforms.

3.4 Main‌ Research Topics

The list‌​‌ of our contributions can​​ be read at the​​​‌ intersection of the research‌ domains described in Section‌​‌ 4 and research axes​​ described in Section 3.3​​​‌ as shown in the‌ following table:

                                 | Axis 3.3.1  | Axis 3.3.2  | Axis 3.3.3  | Axis 3.3.4
                                 | Runtime     | Compression | Energy      | Comm. & Fault Tol.
Domain 4.1 – Lin. Alg., Tensors  | Topic 3.4.1 | Topic 3.4.2 | Topic 3.4.3 | Topic 3.4.7
Domain 4.2 – Training            | Topic 3.4.4 | Topic 3.4.5 | Topic 3.4.6 | Topic 3.4.8

3.4.1‌ Task-based Linear Algebra and‌​‌ Tensor Computations

Participants: Olivier​​ Beaumont, Aurélien Esnard​​​‌, Lionel Eyraud Dubois‌, Mathieu Faverge,‌​‌ Abdou Guermouche, Thomas​​ Herault, Pierre Ramet​​​‌, Philippe Swartvagher.‌

We plan to continue our activity on task-based linear algebra to find solutions for expressing high-level algorithms in an elegant way while ensuring high performance. First, we want to consider the expressivity of the algorithms for large-scale distributed architectures while considering the specific problems of scheduling, data and task mapping, and data granularity. This work will be done in tight collaboration with the Storm and Tadaam teams and is a key objective of the ANR SOLHARIS project. Moreover, the foundations of this topic date back to the HiePACS project. Thus, we plan to collaborate and exchange with the CONCACE team on topics which are of interest to both teams (mainly expressivity and scalability). Second, as mentioned above, we plan to study data compression techniques in linear algebra 70, 75, 79, which bring new algorithmic schemes that are outside of the scope of the classical programming model used until now. As mid- and long-term objectives, we would like to find new ways to express these linear algebra algorithms to efficiently exploit large heterogeneous architectures. A further research topic focuses on the extension of the techniques developed in the framework of linear algebra, in particular with the Chameleon library, to multi-linear algebra and tensors. The idea is to build on the expertise we have in the field of compression and in the use of runtimes, in particular to exploit heterogeneous resources.

Another challenge would be to redesign the graph partitioning and matrix ordering algorithms in a task-based runtime, in order to facilitate the integration of this basic building block in modern task-based solvers. This work has already been initiated in the StarPart 7.1.5 project.

3.4.2 Multi-Linear​​​‌ Algebra and Tensor Decompositions​

Participants: Olivier Beaumont,​‌ Lionel Eyraud Dubois,​​ Mathieu Faverge, Abdou​​​‌ Guermouche, Yulia Gusak​, Thomas Herault,​‌ Pierre Ramet.

Tensor decompositions can be viewed as a natural generalization of SVD-type matrix decompositions from linear algebra. In the tensor setting, several decomposition formats have been developed, each offering different trade-offs between expressiveness, computational cost, and compression efficiency. These methods play an important role in the analysis of large-scale data, as well as in the compression and inference/training acceleration of neural networks. The addition of Yulia Gusak to the project strengthens our expertise in this area 83, 65.

In addition to the basic kernels to be integrated in Chameleon proposed in Topic 3.4.1, we will propose distributed tensor decomposition and compression algorithms, focusing on low-order tensors with large mode dimensions, which are common in neural network models.
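
As a rough illustration of the compression trade-offs between formats, the sketch below counts the number of stored values for a tensor of order d with all modes of size n, under the simplifying assumption of a single uniform rank r. It covers three widely used formats: CP, Tucker and tensor-train. The function names are hypothetical and the counts are the textbook ones, not measurements from the team's software.

```python
def dense_params(n, d):
    """Full tensor: one value per entry."""
    return n ** d


def cp_params(n, d, r):
    """CP format: one n x r factor matrix per mode."""
    return d * n * r


def tucker_params(n, d, r):
    """Tucker format: an r^d core plus one n x r factor matrix per mode."""
    return r ** d + d * n * r


def tt_params(n, d, r):
    """Tensor-train format: two boundary cores (n x r) and d - 2 inner
    cores (r x n x r)."""
    if d == 1:
        return n
    return 2 * n * r + (d - 2) * n * r * r


if __name__ == "__main__":
    n, d, r = 1024, 3, 16  # low order, large mode dimension (hypothetical)
    print("dense :", dense_params(n, d))
    print("CP    :", cp_params(n, d, r))
    print("Tucker:", tucker_params(n, d, r))
    print("TT    :", tt_params(n, d, r))
```

For low-order tensors with large modes, the factor matrices (terms proportional to n) dominate the storage in every format, which is why the mode dimension rather than the order drives the cost in this regime.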

3.4.3​ Energy Minimization in Linear​‌ Solvers

Participants: Mathieu Faverge​​, Abdou Guermouche.​​​‌

We plan to investigate how to reduce the energy consumption of linear algebra libraries (either sparse or dense). To do so, we will rely on an algorithmic approach rather than a system approach. The idea, in a first step, is to consider several implementations of the same kernel and to select one of them while taking energy consumption into account 51, 50, 52. For instance, a low-rank implementation of a given operation will be slower than a regular high-performance implementation, but it will tend to require less energy. In the longer term, we also plan to investigate how to design energy-efficient implementations of basic kernels. They will then be used within higher-level algorithms in order to find a better trade-off between energy consumption and high performance. In the context of developing linear algebra solvers using compression techniques, a research axis we would like to develop is the study of the energy consumption of these solvers: is it possible to provide computation kernels with different energy consumption levels that can be easily exchanged to lower the final energy consumption of the application while keeping the same numerical accuracy? Low-rank compression techniques, as well as mixed-precision solutions, are envisioned toward this objective.
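
The selection step described above can be sketched as a small constrained choice: among several variants of one kernel, keep those that meet a time budget and an accuracy bound, then pick the one with the lowest energy. The `KernelVariant` type, the `select_variant` function and every number below are hypothetical and only serve to illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KernelVariant:
    name: str
    time_s: float     # measured or modeled execution time
    energy_j: float   # measured or modeled energy consumption
    rel_error: float  # numerical error introduced by the variant


def select_variant(variants, max_time_s, max_rel_error):
    """Return the lowest-energy variant meeting the time and accuracy
    bounds, or None when no variant is admissible."""
    admissible = [v for v in variants
                  if v.time_s <= max_time_s and v.rel_error <= max_rel_error]
    return min(admissible, key=lambda v: v.energy_j, default=None)


# Made-up figures: the low-rank variant is slower but uses less energy.
VARIANTS = [
    KernelVariant("full-rank fp64", time_s=1.0, energy_j=300.0, rel_error=1e-15),
    KernelVariant("low-rank", time_s=1.4, energy_j=220.0, rel_error=1e-8),
    KernelVariant("mixed-precision", time_s=0.7, energy_j=180.0, rel_error=1e-6),
]

if __name__ == "__main__":
    choice = select_variant(VARIANTS, max_time_s=2.0, max_rel_error=1e-7)
    print("selected:", choice.name if choice else "none")
```

Tightening the time budget or the accuracy bound changes which variant is admissible, so the same application can trade speed for energy without changing its numerical requirements.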

3.4.4 Task-based Approaches‌​‌ for Deep Learning

Participants:​​ Olivier Beaumont, Lionel​​​‌ Eyraud Dubois, Mathieu‌ Faverge, Abdou Guermouche‌​‌, Yulia Gusak,​​ Thomas Herault, Laércio​​​‌ Lima Pilla, Pierre‌ Ramet, Philippe Swartvagher‌​‌.

In popular Deep Learning frameworks like TensorFlow or PyTorch, the parallelization of the training process is performed with a large granularity, mostly relying on Data Parallelism. Specialized frameworks have been proposed to explore finer parallel schemes, like PipeDream for model parallelism 81. These implementations are however very static and require explicit and error-prone data management policies. We believe that our expertise in using task-based runtime systems can be used to provide much simpler approaches for a finer-grain control over the execution of the corresponding task graphs and communication patterns, for both training and inference phases. We plan to design a prototype implementation that would make it easy to use clever scheduling and optimization techniques to improve the performance of inference. In the longer term, we expect that this approach will provide better scalability and flexibility, and unlock new opportunities for optimization, for a wide range of deep learning applications.
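
The following sketch illustrates, in simulation only, what scheduling a task graph on heterogeneous workers involves. It implements a greedy earliest-finish-time list scheduler: at each step, among the tasks whose predecessors have been scheduled, it places the (task, worker) pair that finishes first. It does not use the API of StarPU or of any other runtime system; the function name, the task names and the durations are assumptions made for the example.

```python
def schedule(tasks, deps, durations, workers):
    """Greedy earliest-finish-time list scheduling of a task graph.

    tasks: task names in submission order
    deps: dict task -> iterable of predecessor tasks
    durations: dict (task, worker kind) -> simulated duration
    workers: list of worker kinds, e.g. ["cpu", "gpu"]
    Returns (makespan, dict task -> (worker index, start, end)).
    """
    remaining = {t: set(deps.get(t, ())) for t in tasks}
    ready_at = {t: 0.0 for t in tasks}  # time at which inputs are available
    free_at = [0.0] * len(workers)      # time at which each worker is idle
    placed = {}
    while len(placed) < len(tasks):
        ready = [t for t in tasks if t not in placed and not remaining[t]]
        if not ready:
            raise ValueError("dependency cycle in the task graph")
        best = None
        for t in ready:
            for w, kind in enumerate(workers):
                start = max(free_at[w], ready_at[t])
                end = start + durations[(t, kind)]
                if best is None or end < best[0]:
                    best = (end, start, t, w)
        end, start, t, w = best
        placed[t] = (w, start, end)
        free_at[w] = end
        for s in tasks:  # release the successors of the task just placed
            if t in remaining[s]:
                remaining[s].discard(t)
                ready_at[s] = max(ready_at[s], end)
    return max(e for _, _, e in placed.values()), placed


if __name__ == "__main__":
    tasks = ["a", "b", "c"]
    deps = {"c": ["a", "b"]}
    durations = {("a", "cpu"): 4, ("a", "gpu"): 1,
                 ("b", "cpu"): 2, ("b", "gpu"): 1,
                 ("c", "cpu"): 3, ("c", "gpu"): 1}
    makespan, placement = schedule(tasks, deps, durations, ["cpu", "gpu"])
    print(makespan, placement)
```

In this model the application only declares tasks and dependencies, while the mapping onto CPU or GPU workers is decided by the scheduler, which is the separation of concerns that task-based approaches rely on.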

3.4.5​​​‌ Tensor Compression for Inference‌

Participants: Olivier Beaumont,‌​‌ Yulia Gusak.

We​​ envision a research activity​​​‌ focused on the use‌ of tensor compression for‌​‌ inference. Initially, the objective​​ is to combine tensor​​​‌ compression techniques and quantization‌ in order to enable‌​‌ inference under strict memory​​ constraints or low-latency requirements​​​‌ 67, 83.‌ These techniques can also‌​‌ be extended to the​​ context of on-device training,​​​‌ which in particular requires‌ memory-saving approaches 74.‌​‌ Finally, a more ambitious​​ goal would be to​​​‌ combine these approaches with‌ methods for designing neural‌​‌ networks that are inherently​​ efficient in terms of​​​‌ memory usage 71.‌

3.4.6 Carbon Saving and‌​‌ Energy-Efficient Training

Participants: Olivier​​ Beaumont, Lionel Eyraud​​​‌ Dubois, Yulia Gusak‌, Laércio Lima Pilla‌​‌.

The training phase of Deep Neural Networks is notoriously very resource-hungry, especially regarding its energy consumption. In recent years, we have proposed several algorithmic solutions (re-materialization 54, 5, offloading 57, their combination 55, pipelining 58, 36) to reduce the resource consumption of this training phase, with a focus on reducing the training time. We plan to broaden the scope of these studies by also taking into account the energy usage. A heterogeneous context and a flexible runtime system, as planned in Topic 3.4.4, may also be an opportunity to reduce energy consumption by allocating some tasks, typically the non-critical ones, to the most efficient resources for them, or by selecting a different implementation with better energy efficiency. This can be seen as a generalization of mixed-precision techniques, which are also very popular in this context to help achieve better frugality. However, care must be taken not to degrade the convergence of the training phase. Moreover, the carbon footprint comes essentially from the manufacturing 82, 73 of the computing resources (GPUs) and the main goal is to facilitate their non-renewal, as enabled by memory saving techniques.
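
To give an idea of the memory/compute trade-off behind re-materialization, the sketch below uses a deliberately simple cost model: a homogeneous chain of layers with unit-size activations and unit-cost forward steps, where one activation is stored every `segment` layers and the others are recomputed during the backward pass. This periodic scheme is the textbook variant and is much simpler than the optimization performed by the solutions cited above. The function names are hypothetical.

```python
import math


def periodic_checkpointing(n_layers, segment):
    """Cost model for storing one activation every `segment` layers.

    Returns (peak_activations, recomputed_forward_steps), assuming
    unit-size activations and unit-cost layers.
    """
    n_segments = math.ceil(n_layers / segment)
    longest = min(segment, n_layers)
    peak = n_segments + (longest - 1)   # checkpoints + one live segment
    recomputed = n_layers - n_segments  # every non-checkpointed activation
    return peak, recomputed


def best_segment(n_layers):
    """Segment length minimizing peak memory in this model."""
    return min(range(1, n_layers + 1),
               key=lambda k: periodic_checkpointing(n_layers, k)[0])


if __name__ == "__main__":
    n = 100  # hypothetical chain length
    for k in (1, best_segment(n)):
        peak, extra = periodic_checkpointing(n, k)
        print(f"segment={k}: peak={peak} activations, {extra} recomputed steps")
```

In this model, the peak memory is minimized for a segment length close to the square root of the number of layers, at the price of roughly one additional forward pass.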

3.4.7 Communication-Aware Resilience Patterns​ for Iterative Linear Algebra​‌

Participants: Thomas Herault.​​

This year, our work​​​‌ in the Communication and​ Fault Tolerance axis addressed​‌ the growing need for​​ efficient and resilient execution​​​‌ of large-scale iterative linear​ algebra algorithms on modern​‌ HPC platforms. At scale,​​ such computations are increasingly​​​‌ limited by communication costs​ and are exposed to​‌ a wide range of​​ faults, including process failures,​​​‌ silent data corruptions, and​ memory errors, which can​‌ no longer be treated​​ as rare events. Our​​​‌ contributions explore communication-aware resilience​ strategies that integrate fault​‌ detection, recovery, and checkpointing​​ directly into the algorithmic​​​‌ structure of iterative methods.​ By carefully controlling the​‌ frequency and granularity of​​ verification, redundancy, and checkpointing​​​‌ mechanisms, we showed that​ it is possible to​‌ bound error propagation while​​ significantly reducing overheads compared​​​‌ to classical replication-based approaches.​ A central outcome of​‌ this work is a​​ set of analytical models​​​‌ and optimization techniques that​ guide the design of​‌ hierarchical and adaptive resilience​​ patterns, balancing computation, communication,​​​‌ memory usage, and execution​ time under realistic system​‌ constraints such as bounded​​ detection latency and fixed-time​​​‌ resource allocations. Although primarily​ evaluated on linear algebra​‌ solvers, these techniques are​​ largely generic and directly​​​‌ applicable to other iterative​ workloads, including neural network​‌ training, where similar trade-offs​​ between communication efficiency, redundancy,​​​‌ and robustness arise.
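
To make the trade-off on checkpointing frequency concrete, the sketch below uses the classical first-order Young/Daly approximation for periodic checkpointing. It is a textbook baseline and not the hierarchical models developed in this work; the function names and the numerical values are illustrative assumptions.

```python
import math


def young_daly_period(checkpoint_cost, mtbf):
    """First-order optimal time between checkpoints: sqrt(2 * C * MTBF)."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def first_order_waste(period, checkpoint_cost, mtbf):
    """Fraction of time lost to checkpoints plus expected re-execution.

    Assumes checkpoint_cost << period << mtbf, and ignores downtime
    and recovery costs.
    """
    return checkpoint_cost / period + period / (2.0 * mtbf)


# Hypothetical platform: 50 s per checkpoint, one failure every 10,000 s.
C, MTBF = 50.0, 10_000.0
best = young_daly_period(C, MTBF)
for period in (best / 4, best, best * 4):
    waste = first_order_waste(period, C, MTBF)
    print(f"period = {period:6.0f} s   waste = {waste:.3f}")
```

Checkpointing too often or too rarely both increase the waste, which is why the frequency of checkpointing and verification has to be tuned rather than fixed arbitrarily.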

3.4.8​ Communication-Aware Resilience Patterns for​‌ Training and Inference

Participants:​​ Olivier Beaumont, Fares​​​‌ Boudjaoui, Lionel Eyraud​ Dubois, Thomas Herault​‌, Philippe Swartvagher.​​

The training and inference​​​‌ phases of modern machine​ learning workloads rely heavily​‌ on large-scale collective communications,​​ which are traditionally designed​​​‌ under the assumption of​ stable, homogeneous, and reliable​‌ infrastructures. However, as execution​​ platforms become increasingly heterogeneous​​​‌ and volatile, communication costs​ and failures have a​‌ growing impact on both​​ performance and correctness. Building​​​‌ on our recent work​ on resilience patterns for​‌ iterative algorithms, we are​​ initiating new research on​​​‌ communication-aware resilience mechanisms for​ distributed training and inference,​‌ with the goal of​​ jointly addressing efficiency and​​​‌ robustness. This work, launched​ with the PhD of​‌ Farès Boudjaoui in the​​ context of the Cupseli​​​‌ Inria challenge, explores adaptive​ communication schemes, fault-tolerant collectives,​‌ and dynamic reconfiguration strategies​​ that can tolerate node​​​‌ unavailability, bandwidth variability, and​ network contention, while limiting​‌ synchronization and data movement​​ overheads. By integrating resilience​​​‌ directly into communication patterns,​ rather than treating failures​‌ as exceptional events, we​​ aim to support scalable​​​‌ and robust executions in​ both centralized HPC platforms​‌ and more decentralized environments,​​ without degrading convergence or​​ inference quality.

4 Application​​​‌ domains

4.1 Multi-Linear Algebra‌ and Solvers

Participants: Olivier‌​‌ Beaumont, Aurélien Esnard​​, Lionel Eyraud Dubois​​​‌, Mathieu Faverge,‌ Abdou Guermouche, Yulia‌​‌ Gusak, Thomas Herault​​, Pierre Ramet,​​​‌ Philippe Swartvagher.

At the core of a large number of simulation tools, the resolution of large linear systems often represents the dominant part of the computing time. These linear solvers rely on a wide variety of numerical methods and algorithms. Massively parallel versions are required to support advances in multi-physics and multi-scale simulations, especially when targeting exascale platforms. The aim is therefore to address the major challenge of designing and building numerically robust solvers on top of runtime systems that can scale up and push back the limits of existing industrial codes by making full use of all computing resources such as CPUs, GPUs and other accelerator units. Following the ANR project SOLHARIS (and previously SOLHAR), we now have experience with the strong and weak scalability of sparse direct solvers on large-scale, distributed-memory, heterogeneous computers. These solvers already rely on asynchronous task-based parallelism 48, 49, 78, 47, rather than traditional and widely adopted message-passing and multithreading techniques. Indeed, modern runtime systems have proven to be good tools for the development of scientific computing applications 80, 62, 86, in particular in combination with compression 63, 85, 84, 59, 76 and communication-avoiding techniques 60, 53. This work can be extended naturally to multi-dimensional objects such as tensors. In the tensor case, we propose to extend the data distribution strategies to minimize communication, and to use runtime systems to handle the variability and heterogeneity of computational resources. Finally, we have focused so far on minimizing the execution time, whereas energy efficiency is becoming a critical element. We therefore plan to revisit the algorithms and methods we developed in linear algebra, and those we propose to design for handling tensors, to allow the optimal use of the available hardware in order to guarantee the performance of the computations within a fixed energy budget.

4.2 Training and​​​‌ Inference for DNNs

Participants:‌ Olivier Beaumont, Lionel‌​‌ Eyraud Dubois, Yulia​​ Gusak, Thomas Herault​​​‌, Laércio Lima Pilla‌, Pierre Ramet,‌​‌ Philippe Swartvagher.

The training phase of Deep Neural Networks has become an important source of HPC resource usage, and it is crucial to perform it efficiently on parallel architectures. To date, data parallelism remains the most widely used method, but the associated requirement to replicate all the weights on all computing resources causes memory issues at the level of each node and collective communication bottlenecks at the level of the platform.
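
The two costs of data parallelism can be quantified with simple arithmetic. The sketch below assumes fp32 weights and gradients (4 bytes per parameter) and a ring all-reduce of the gradients, in which each process sends 2(N-1)/N times the buffer size. The model size and GPU count are hypothetical, and optimizer states and activations are ignored.

```python
def data_parallel_costs(n_params, n_gpus, bytes_per_param=4):
    """Per-GPU weight memory and per-GPU bytes sent by one ring all-reduce."""
    # Every GPU holds a full replica of the weights.
    weight_bytes = n_params * bytes_per_param
    # Ring all-reduce = reduce-scatter + all-gather; each phase sends
    # (N-1)/N of the gradient buffer, which has the same size as the weights.
    sent_bytes = 2 * (n_gpus - 1) * weight_bytes / n_gpus
    return weight_bytes, sent_bytes


# Hypothetical 1-billion-parameter model replicated on 8 GPUs.
mem, sent = data_parallel_costs(1_000_000_000, 8)
print(f"weights per GPU: {mem / 1e9:.1f} GB")
print(f"sent per GPU per step: {sent / 1e9:.1f} GB")
```

Adding GPUs does not reduce either quantity: the per-GPU weight memory is constant and the per-GPU communication volume approaches twice the model size.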

In general, the overall shape of the dependency graphs associated with the feed-forward training phase has characteristics (long dependencies) that generate large memory needs and data exchanges. However, there are multiple opportunities to address these problems by combining 55 re-computations 66, 54, 61, 77, 72, 5, offloading 57, compression and different parallelism strategies (image, filter, kernel, model parallelism 58, 81, 56, 74, 36). It is also promising to consider other, more radical techniques to go beyond feed-forward training, such as the use of multigrid reduction in time (MGRIT) 68, 69, which comes from the field of numerical simulations and which we already address in other contexts.

Within this general framework, the minimization of the carbon footprint is obviously a major concern that must guide strategies. Tools to train complex and deep networks on otherwise obsolete hardware using memory-saving techniques are already a strong contribution in this direction, since they increase the lifetime of computing resources, and our goal is to extend these techniques in terms of efficiency and in terms of scope, even though they consume a little more energy because of the additional computations. As in the case of linear algebra, energy optimization also requires the use of heterogeneous computation resources (CPUs, GPUs, TPUs, FPGAs). However, this heterogeneity hinders scalability because of difficulties in predicting task durations, and makes the use of dynamic runtime schedulers necessary. Finally, the use of these dynamic runtimes also raises the question of what needs to be decided statically and what dynamically in terms of resource allocation and scheduling.

5 Social and​ environmental responsibility

5.1 Footprint​‌ of research activities

As​​ part of our research​​​‌ activities, we use local​ computing resources such as​‌ PlaFRIM and the national​​ computing resources of IDRIS​​​‌ and the TGCC.​

The environmental impact of using these platforms is significant, whether for numerical simulation or training applications. However, given the positioning of the team, which produces simulation and training tools but does not directly perform simulations and training, our own usage is relatively limited. For example, in the case of training, we have so far concentrated on techniques that do not modify the architecture of the networks and the computations that are performed, so that the number of epochs and the final accuracy are not impacted. In this way, it is possible to validate our developments to accelerate training on a single batch (at full machine scale) and then to extrapolate the acceleration to the scale of the whole training. Similarly, the techniques developed in linear algebra in the team often do not depend (typically for dense approaches) on the numerical properties of the matrices, so that acceleration (for a given problem size) can be validated without heavy experimental campaigns, beyond what is necessary to obtain valid experimental results in complex environments where performance varies from one experiment to another.

In this context, the use of simulation as opposed to direct experimentation is also a tool that enables us to limit the impact of our research on power consumption, since simulation can save several orders of magnitude in power consumption compared with direct experimentation. It is therefore crucial to produce simulation tools that are as precise and generic as possible, and the team has been actively collaborating for many years in the development of simulation tools such as SimGrid.

Nevertheless,​​​‌ the tools we produce‌ are used on a‌​‌ large scale in terms​​ of computation resources and​​​‌ simulation/training time, and the‌ associated energy consumption issue‌​‌ is therefore indirectly crucial.​​ In this context, we​​​‌ are developing original solutions‌ for reusing the heat‌​‌ dissipated by computation resources,​​ in particular as part​​​‌ of the Inria-Qarnot Computing‌ Pulse challenge (see Section‌​‌ 5.2). We have​​ also added a research​​​‌ axis aimed at minimizing‌ energy consumption for a‌​‌ given kernel (Section 3.3.3​​).

TOPAL has also​​​‌ signed the "Labos en‌ transitions" Charter of Commitment‌​‌ for research facilities on​​ the Bordeaux university site​​​‌ whose preamble states that‌ "Faced with contemporary environmental‌​‌ and societal challenges, and​​ the urgent need for​​​‌ systemic transformation to meet‌ them, the academic world‌​‌ has a particular responsibility:​​ to promote responsible research,​​​‌ aware of environmental issues‌ and respectful of the‌​‌ people who produce it,​​ which contributes to transitions​​​‌ and enables us to‌ understand and guide current‌​‌ and future societal transformations".​​ In exchange for this​​​‌ commitment, the establishments undertake‌ to provide us with‌​‌ an estimate of the​​ impact of our research​​​‌ activities (including the purchase‌ of equipment and missions).‌​‌ At this stage, this​​ information is difficult to​​​‌ aggregate at team level,‌ but making it available‌​‌ will enable us to​​ measure our progress and​​​‌ involvement.

5.2 Impact of‌ research results

5.2.1 Carbon‌​‌ Impact of Cloud Platforms​​

To limit the environmental​​​‌ impact of cloud computing,‌ Qarnot focuses on re-using‌​‌ the heat produced by​​ computations in heat circuits​​​‌ or boilers. As part‌ of the Pulse Inria‌​‌ challenge, we are working​​ with Qarnot on algorithms​​​‌ for placing computations on‌ their infrastructure, so as‌​‌ to maximize the use​​ of reusable heat sources,​​​‌ depending on computation demand‌ and task characteristics. The‌​‌ aim is to enable​​ users of the Qarnot​​​‌ platform to specify their‌ objective function on the‌​‌ (carbon footprint, time, cost)​​ axes, and to be​​​‌ able to meet it.‌

Our activities with Hivenet‌​‌, conducted within the​​ framework of the Cupseli​​​‌ challenge, complement this approach.‌ In the long term,‌​‌ one of Cupseli’s objectives​​ is to enable the​​​‌ use of distributed computing‌ resources—typically owned by gaming‌​‌ venues—to carry out inference​​ and learning tasks. The​​​‌ aim is therefore to‌ extend the lifespan and‌​‌ usage of these computing​​ resources by providing them​​​‌ with practical utility and‌ added value.

5.2.2 Democratization‌​‌ of Large Models Training​​

In the context of training, at one end of the spectrum we see the provision of computing resources, such as the Jean Zay supercomputer, whose efficient use requires large-scale parallel training algorithms and frameworks to optimize resource utilization and accelerate time to discovery. At the other end of the spectrum, we see the importance of enabling researchers from different communities to use the resources at their disposal (often just a few GPUs) to develop original models without being constrained by hardware limitations. In particular, recent transformer-based models are very heavy-weight, and techniques must be employed to run them on GPUs that are only a few years old, without compromising data quality, computational accuracy, or model size. The Topal team has been working for several years on memory-saving strategies to enable the training of large models on limited-capacity resources (re-materialization and offloading), and on software (Section 7) such as Rotor and Rockmate, which are recognized and visible in the AI applications community and enable researchers with access to limited-capacity resources to train large models. More recently, the ELF software (Section 7.1.2) has been developed to optimize multi-node, multi-GPU training using various types of parallelism and memory-saving techniques. While remaining user-friendly, it supports the easy integration of custom strategies and has been validated at large scale during the NVIDIA–OpenACC IDRIS'25 hackathon through the training of large language models and diffusion models.

6 Highlights of​‌ the year

Best Paper​​ Award for “Scheduling​​​‌ Strategies for Partially-Replicable Task​ Chains on Two Types​‌ of Resources” 22​​ in Heterogeneity in Computing​​​‌ Workshop (HCW) - Participant:​ Laércio Lima Pilla.

The​‌ Inria/Hivenet Cupseli challenge is​​ co-led by Olivier Beaumont​​​‌ (Topal) and Alexandru Dobrila​ (Hivenet) and brings together​‌ 11 Inria teams along​​ with researchers from Hivenet,​​​‌ representing a total of​ around thirty permanent staff​‌ members. Over a four-year​​ period, it plans for​​​‌ the recruitment of nine​ PhD students, two postdoctoral​‌ researchers, and three engineers.​​ The project kickoff meeting​​​‌ took place on September​ 25, 2025.

7 Latest​‌ software developments, platforms, open​​ data

7.1 Latest software​​​‌ developments

7.1.1 Chameleon

  • Keywords:​
    Runtime system, Task-based algorithm,​‌ Dense linear algebra, HPC,​​ Task scheduling
  • Scientific Description:​​​‌

    Chameleon is part of​ the MORSE (Matrices Over​‌ Runtime Systems @ Exascale)​​ project. The overall objective​​​‌ is to develop robust​ linear algebra libraries relying​‌ on innovative runtime systems​​ that can fully benefit​​​‌ from the potential of​ those future large-scale complex​‌ machines.

    We expect advances in three directions, based first on strong and close interactions between the runtime and numerical linear algebra communities. This initial activity will then naturally expand to more focused but still joint research in both fields.

    1.​​​‌ Fine interaction between linear​ algebra and runtime systems.​‌ On parallel machines, HPC​​ applications need to take​​​‌ care of data movement​ and consistency, which can​‌ be either explicitly managed​​ at the level of​​​‌ the application itself or​ delegated to a runtime​‌ system. We adopt the​​ latter approach in order​​​‌ to better keep up​ with hardware trends whose​‌ complexity is growing exponentially.​​ One major task in​​​‌ this project is to​ define a proper interface​‌ between HPC applications and​​ runtime systems in order​​​‌ to maximize productivity and​ expressivity. As mentioned in​‌ the next section, a​​ widely used approach consists​​ in abstracting the application​​​‌ as a DAG that‌ the runtime system is‌​‌ in charge of scheduling.​​ Scheduling such a DAG​​​‌ over a set of‌ heterogeneous processing units introduces‌​‌ a lot of new​​ challenges, such as predicting​​​‌ accurately the execution time‌ of each type of‌​‌ task over each kind​​ of unit, minimizing data​​​‌ transfers between memory banks,‌ performing data prefetching, etc.‌​‌ Expected advances: In a​​ nutshell, a new runtime​​​‌ system API will be‌ designed to allow applications‌​‌ to provide scheduling hints​​ to the runtime system​​​‌ and to get real-time‌ feedback about the consequences‌​‌ of scheduling decisions.

    2. Runtime systems. A runtime environment is an intermediate layer between the system and the application. It provides low-level functionality not provided by the system (such as scheduling or management of the heterogeneity) and high-level features (such as performance portability). In the framework of this proposal, we will work on the scalability of runtime environments. To achieve scalability, all centralization must be avoided. Here, the main problem is the scheduling of the tasks. In many task-based runtime environments the scheduler is centralized and becomes a bottleneck as soon as too many cores are involved. It is therefore necessary either to distribute the scheduling decisions or to compute a data distribution that imposes the mapping of tasks using, for instance, the so-called “owner-computes” rule. Expected advances: We will design runtime systems that enable an efficient and scalable use of thousands of distributed multicore nodes enhanced with accelerators.

    3. Linear algebra. Because of its central position in HPC and of the well-understood structure of its algorithms, dense linear algebra has often been at the forefront of the new challenges that HPC had to face. Again, dense linear algebra has been in the vanguard of the new era of petascale computing with the design of new algorithms that can efficiently run on a multicore node with GPU accelerators. These algorithms are called “communication-avoiding” since they have been redesigned to limit the amount of communication between processing units (and between the different levels of memory hierarchy). They are expressed through Directed Acyclic Graphs (DAG) of fine-grained tasks that are dynamically scheduled. Expected advances: First, we plan to investigate the impact of these principles in the case of sparse applications (whose algorithms are slightly more complicated but often rely on dense kernels). Furthermore, both in the dense and sparse cases, the scalability on thousands of nodes is still limited, and new numerical approaches need to be found. We will specifically design sparse hybrid direct/iterative methods that represent a promising approach.

    Overall end point. The​​​‌ overall goal of the‌ MORSE associate team is‌​‌ to enable advanced numerical​​ algorithms to be executed​​​‌ on a scalable unified‌ runtime system for exploiting‌​‌ the full potential of​​ future exascale machines.

  • Functional​​​‌ Description:
    Chameleon is a dense linear algebra software relying on sequential task-based algorithms where sub-tasks of the overall algorithms are submitted to a runtime system. A runtime system such as StarPU is able to automatically manage data transfers between non-shared memory areas (CPUs-GPUs, distributed nodes). This implementation paradigm makes it possible to design high-performance linear algebra algorithms on very different types of architectures: laptop, many-core nodes, CPUs-GPUs, multiple nodes. For example, Chameleon is able to perform a Cholesky factorization (double-precision) at 80 TFlop/s on a dense matrix of order 400 000 (i.e. 4 min 30 s).
  • Release Contributions:​‌

    Chameleon includes the following​​ features:

    - BLAS 3, LAPACK one-sided and LAPACK norms tile algorithms
    - Support QUARK and StarPU runtime systems and PaRSEC since 2018
    - Exploitation of homogeneous and heterogeneous platforms through the use of BLAS/LAPACK CPU kernels and cuBLAS/MAGMA CUDA kernels
    - Exploitation of clusters of interconnected nodes with distributed memory (using OpenMPI)

  • Contact:
    Mathieu​‌ Faverge
  • Participants:
    Mathieu Faverge,​​ Florent Pruvost, Emmanuel Agullo,​​​‌ Samuel Thibault
  • Partners:
    Innovative Computing Laboratory (ICL), King Abdullah University of Science and Technology, University of Colorado Denver
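
The sequential task-based model mentioned in the functional description can be illustrated with a toy example: tasks are submitted in program order together with the tiles they read and write, and dependencies are inferred from these accesses. The sketch below does this for a tiled Cholesky factorization. The function names and task tuples are hypothetical and are not the Chameleon or StarPU API; no numerical kernel is executed, and write-after-read dependencies are omitted because they do not occur in this algorithm.

```python
def tiled_cholesky_tasks(nt):
    """Yield (name, reads, writes) in sequential order for nt x nt tiles."""
    for k in range(nt):
        yield (f"POTRF({k})", [], [(k, k)])
        for i in range(k + 1, nt):
            yield (f"TRSM({i},{k})", [(k, k)], [(i, k)])
        for i in range(k + 1, nt):
            yield (f"SYRK({i},{k})", [(i, k)], [(i, i)])
            for j in range(k + 1, i):
                yield (f"GEMM({i},{j},{k})", [(i, k), (j, k)], [(i, j)])


def infer_dependencies(tasks):
    """Infer predecessor tasks from the last writer of each accessed tile."""
    last_writer = {}
    deps = {}
    for name, reads, writes in tasks:
        # Written tiles are updated in place, so they are also read.
        preds = {last_writer[t] for t in reads + writes if t in last_writer}
        deps[name] = sorted(preds)
        for tile in writes:
            last_writer[tile] = name
    return deps


deps = infer_dependencies(tiled_cholesky_tasks(3))
for task, preds in deps.items():
    print(f"{task:12s} <- {preds}")
```

As a side check on the performance figure quoted above: a Cholesky factorization of order n costs about n^3/3 floating-point operations, so n = 400 000 gives roughly 2.1 × 10^16 operations, i.e. about 267 s at 80 TFlop/s, which is consistent with the quoted 4 min 30 s.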

7.1.2 ELF​

  • Name:
    Efficient Deep Learning​‌ Framework
  • Keywords:
    Neural networks,​​ Pytorch, Python, GPU, Deep​​​‌ learning, Automatic parallelization
  • Functional​ Description:

    ELF is a​‌ deep learning framework designed​​ for efficient and easy-to-launch​​​‌ multi-GPU training. It enables​ users to input a​‌ PyTorch model and train​​ it on an HPC​​​‌ cluster by automatically handling​ data, model and other​‌ types of parallelization across​​ multiple devices.

    By optimizing​​​‌ the training schedule, minimizing​ communication overhead, and maximizing​‌ GPU utilization, ELF ensures​​ highly optimized execution. Users​​​‌ don’t need to manually​ implement parallelization—ELF does it​‌ automatically while maintaining computational​​ correctness throughout training iterations.​​​‌

  • Contact:
    Yulia​ Gusak

7.1.3 PaStiX

  • Name:​‌
    Parallel Sparse matriX package​​
  • Keywords:
    Direct solvers, Parallel​​​‌ numerical solvers, Linear Systems​ Solver
  • Scientific Description:
    PaStiX is based on an efficient static scheduling and memory manager, in order to solve 3D problems with more than 50 million unknowns. The mapping and scheduling algorithm handles a combination of 1D and 2D block distributions. Dynamic scheduling can also be applied to take care of NUMA architectures while taking into account very precisely the computational costs of the BLAS 3 primitives, the communication costs and the cost of local aggregations.
  • Functional Description:

    PaStiX​ is a scientific library​‌ that provides a high​​ performance parallel solver for​​​‌ very large sparse linear​ systems based on block​‌ direct and block ILU(k)​​ methods. It can handle​​​‌ low-rank compression techniques to​ reduce the computation and​‌ the memory complexity. Numerical​​ algorithms are implemented in​​​‌ single or double precision​ (real or complex) for​‌ LLt, LDLt and LU​​ factorization with static pivoting​​​‌ (for non symmetric matrices​ having a symmetric pattern).​‌ The PaStiX library uses​​ the graph partitioning and​​​‌ sparse matrix block ordering​ packages Scotch or Metis.​‌

    The PaStiX solver is​​ suitable for any heterogeneous​​ parallel/distributed architecture when its​​​‌ performance is predictable, such‌ as clusters of multicore‌​‌ nodes with GPU accelerators​​ or KNL processors. In​​​‌ particular, we provide a‌ high-performance version with a‌​‌ low memory overhead for​​ multicore node architectures, which​​​‌ fully exploits the advantage‌ of shared memory by‌​‌ using a hybrid MPI-thread​​ implementation.

    The solver also​​​‌ provides some low-rank compression‌ methods to reduce the‌​‌ memory footprint and/or the​​ time-to-solution.

  • Contact:​​
    Pierre Ramet
  • Participants:
    Alycia​​​‌ Lisito, Grégoire Pichon, Mathieu‌ Faverge, Pierre Ramet
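
The memory gain from low-rank compression follows from a simple count: an m × n block of rank r stored as the product of an m × r factor and an r × n factor needs r(m + n) entries instead of mn. The sketch below is illustrative only; the block sizes and ranks are hypothetical, and the function names are not part of the PaStiX API.

```python
def low_rank_entries(m, n, r):
    """Entries needed to store an m x n block as (m x r) times (r x n)."""
    return r * (m + n)


def compression_ratio(m, n, r):
    """Low-rank storage divided by dense storage (below 1 means a gain)."""
    return low_rank_entries(m, n, r) / (m * n)


def break_even_rank(m, n):
    """Largest rank for which the low-rank form is strictly smaller."""
    # r * (m + n) < m * n  <=>  r <= (m * n - 1) // (m + n)
    return (m * n - 1) // (m + n)


# Hypothetical 1000 x 1000 off-diagonal block.
for rank in (10, 100, 499, 500):
    print(rank, f"{compression_ratio(1000, 1000, rank):.3f}")
print("break-even rank:", break_even_rank(1000, 1000))
```

Compression therefore only pays off for blocks whose numerical rank is well below half of their dimension, which is why it is applied selectively.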

7.1.4‌​‌ pmtool

  • Keywords:
    Scheduling, Task​​ scheduling, StarPU, Heterogeneity, GPGPU,​​​‌ Performance analysis
  • Functional Description:‌
    Analyses post-mortem the behavior of StarPU applications. Provides lower bounds on makespan. Studies the performance of different schedulers in a simple context. Provides implementations of many scheduling algorithms from the literature.
  • Publications:
    hal-01386174,‌​‌ hal-01878606
  • Contact:
    Lionel Eyraud​​ Dubois
  • Participant:
    an anonymous​​​‌ participant
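
Two classical makespan lower bounds for a task graph on identical processors are the area bound (total work divided by the number of processors) and the critical-path bound. The sketch below computes both for a small hypothetical graph. It only illustrates what a makespan lower bound is; it is not the method implemented in pmtool, which targets heterogeneous StarPU platforms.

```python
from graphlib import TopologicalSorter


def makespan_lower_bound(durations, preds, n_procs):
    """Return max(area bound, critical-path bound) on identical processors."""
    area = sum(durations.values()) / n_procs
    # Longest path: earliest finish time of each task with unlimited processors.
    graph = {task: preds.get(task, ()) for task in durations}
    finish = {}
    for task in TopologicalSorter(graph).static_order():
        start = max((finish[p] for p in graph[task]), default=0.0)
        finish[task] = start + durations[task]
    critical_path = max(finish.values())
    return max(area, critical_path)


# Hypothetical fork-join graph: a -> {b, c, d} -> e
durations = {"a": 1.0, "b": 4.0, "c": 4.0, "d": 4.0, "e": 1.0}
preds = {"b": ["a"], "c": ["a"], "d": ["a"], "e": ["b", "c", "d"]}
for m in (2, 4):
    print(m, "processors:", makespan_lower_bound(durations, preds, m))
```

With few processors the area bound dominates; with many processors the critical path does. Comparing an actual schedule against such bounds indicates how much room for improvement remains.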

7.1.5 StarPart

  • Keyword:‌
    3-point-lighting technique
  • Functional Description:‌​‌
    StarPart is a flexible and extensible framework that integrates state-of-the-art methods for graph partitioning and sparse matrix ordering. More precisely, StarPart is a framework that offers a uniform API to manipulate graph, hypergraph and mesh structures. It is designed to be easily extensible by adding new methods and to plug all these methods into a comprehensive framework. It is initially designed to provide graph partitioning and sparse matrix ordering methods that come from state-of-the-art software such as Metis, Scotch, Patoh, Zoltan, etc. Besides, it provides some facilities for IO, diagnostic, benchmark, visualization (VTK, SVG, ...). StarPart is the core of the MetaPart project. It is built upon the LibGraph library.
  • Contact:
    Aurélien Esnard
  • Participant:​​
    an anonymous participant
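
The quality of a graph partition is usually measured by its edge cut and its load imbalance. The sketch below computes both for a tiny hypothetical graph; the function names are illustrative and do not correspond to the StarPart API.

```python
from collections import Counter


def edge_cut(edges, part):
    """Number of edges whose endpoints lie in different parts."""
    return sum(1 for u, v in edges if part[u] != part[v])


def imbalance(part, n_parts):
    """Size of the largest part divided by the ideal part size."""
    sizes = Counter(part.values())
    return max(sizes.values()) / (len(part) / n_parts)


# 2 x 3 grid graph, vertices numbered row by row:
#   0 - 1 - 2
#   |   |   |
#   3 - 4 - 5
edges = [(0, 1), (1, 2), (3, 4), (4, 5), (0, 3), (1, 4), (2, 5)]
by_rows = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
by_cols = {0: 0, 3: 0, 1: 0, 4: 0, 2: 1, 5: 1}

for name, part in (("rows", by_rows), ("columns", by_cols)):
    print(name, edge_cut(edges, part), round(imbalance(part, 2), 2))
```

The two partitions show the usual trade-off: one is perfectly balanced but cuts more edges, the other cuts fewer edges at the price of unequal part sizes.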

7.1.6​​​‌ StarPU

7.1.7 rockmate​​​‌

  • Name:
    rockmate
  • Keywords:
    Deep‌ learning, Optimization, Python, Pytorch,‌​‌ GPU, Automatic differentiation
  • Scientific​​ Description:

    We propose Rockmate to control the memory requirements when training PyTorch DNN models. Rockmate is an automatic tool that starts from the model code and generates an equivalent model, using a predefined amount of memory for activations, at the cost of a few re-computations. Rockmate automatically detects the structure of computational and data dependencies and rewrites the initial model as a sequence of complex blocks. We show that such a structure is widespread and can be found in many models in the literature (Transformer-based models, ResNet, RegNets, ...). This structure allows us to solve the problem in a fast and efficient way, using an adaptation of Checkmate (too slow on the whole model but general) at the level of individual blocks and an adaptation of Rotor (fast but limited to sequential models) at the level of the sequence itself. We show through experiments on many models that Rockmate is as fast as Rotor and as efficient as Checkmate, and that in many cases it achieves a significantly lower memory consumption for activations (by a factor of 2 to 5) for a rather negligible overhead (of the order of 10% to 20%). Rockmate is open source and available at https://github.com/topal-team/rockmate.

    Complete paper:‌​‌ https://openreview.net/pdf?id=wLAMOoL0KD

  • Functional Description:

    Given​​ a PyTorch model, a​​​‌ sample input, and a‌ GPU memory budget, Rockmate‌​‌ builds a new torch.nn.Module,​​ which performs forward and​​​‌ backward pass while keeping‌ the memory of activations‌​‌ under the given budget.​​

    The new model produces​​​‌ the same outputs and‌ gradients as the original‌​‌ one. Training the model​​ with a lower memory​​​‌ than PyTorch Autodiff is‌ achieved by re-computing some‌​‌ of the activations instead​​ of storing them for​​​‌ gradient calculation. Based on‌ the budget, Rockmate determines‌​‌ automatically which activations should​​ be recomputed.

  • Contact:
    Lionel Eyraud Dubois‌
  • Participants:
    Lionel Eyraud Dubois,‌​‌ Yulia Gusak, Olivier Beaumont,​​ Xunyi Zhao
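
The claim that re-computation leaves outputs and gradients unchanged can be checked on a toy example. The sketch below differentiates a chain of scalar functions twice: once storing every layer input, and once keeping only the initial input and re-materializing the others on demand. It is a hand-written illustration in plain Python, not Rockmate's algorithm or API, and the chain of functions is an arbitrary assumption.

```python
import math

# Each "layer" is a scalar function paired with its derivative.
LAYERS = [
    (math.sin, math.cos),
    (math.tanh, lambda x: 1.0 - math.tanh(x) ** 2),
    (lambda x: x * x, lambda x: 2.0 * x),
    (math.exp, math.exp),
]


def grad_store_all(x):
    """Backpropagation that keeps every layer input in memory."""
    inputs = []
    for f, _ in LAYERS:
        inputs.append(x)
        x = f(x)
    grad = 1.0
    for (_, df), a in zip(reversed(LAYERS), reversed(inputs)):
        grad *= df(a)
    return x, grad


def grad_recompute(x0):
    """Backpropagation that stores only x0 and recomputes layer inputs."""
    out = x0
    for f, _ in LAYERS:
        out = f(out)  # forward pass without storing activations
    grad = 1.0
    for i in reversed(range(len(LAYERS))):
        a = x0
        for f, _ in LAYERS[:i]:
            a = f(a)  # re-materialize the input of layer i
        grad *= LAYERS[i][1](a)
    return out, grad


print(grad_store_all(0.3))
print(grad_recompute(0.3))
```

The second version stores a single value but performs a quadratic number of forward evaluations; choosing which activations to keep under a given memory budget, so as to limit this overhead, is the optimization problem that the tool addresses.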

7.1.8 rotor​​​‌

  • Name:
    Re-materializing Optimally with‌ pyTORch
  • Keywords:
    Deep learning,‌​‌ Optimization, Python, GPU, Automatic​​ differentiation
  • Scientific Description:

    This software implements in PyTorch a new activation checkpointing method which significantly decreases memory usage when training Deep Neural Networks with the back-propagation algorithm. Similarly to checkpointing techniques coming from the literature on Automatic Differentiation, it consists in dynamically selecting the forward activations that are saved during the training phase, and then automatically recomputing missing activations from those previously recorded. We propose an original computation model that combines two types of activation savings: either only storing the layer inputs, or recording the complete history of operations that produced the outputs (this uses more memory, but requires fewer recomputations in the backward phase), and we provide in https://hal.inria.fr/hal-02352969 an algorithm to compute the optimal computation sequence for this model.

    Our PyTorch implementation processes the entire chain, dealing with any sequential DNN whose internal layers may be arbitrarily complex and automatically executing it according to the optimal checkpointing strategy computed given a memory limit. In https://hal.inria.fr/hal-02352969, through extensive experiments, we show that our implementation consistently outperforms existing checkpointing approaches for a large class of networks, image sizes and batch sizes.

  • Functional​‌ Description:
    Enables the training of very large convolutional networks on limited memory by optimally selecting which activations should be kept and which should be recomputed. This code is meant to replace the checkpoint.py utility available in PyTorch, by providing more efficient rematerialization strategies. The algorithm is easier to tune: the only required parameter is the available memory, instead of the number of segments.
  • Contact:
    Lionel​​ Eyraud Dubois
  • Participant:
    5​​​‌ anonymous participants
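
For context, the segment-based strategy that the functional description contrasts with can be modeled in a few lines. The sketch below uses a deliberately simplified cost model (homogeneous layers, unit-size activations), which is an assumption made for illustration and not the computation model or the algorithm used in rotor.

```python
import math


def peak_activations(n_layers, segment):
    """Peak number of stored activations with one checkpoint per segment.

    Simplified model: the forward pass keeps only the input of each
    segment; the backward pass re-runs one segment at a time and keeps
    its `segment` activations until their gradients are computed.
    """
    return math.ceil(n_layers / segment) + segment


def best_segment(n_layers):
    """Segment length minimizing the peak (shortest one in case of ties)."""
    return min(range(1, n_layers + 1),
               key=lambda k: peak_activations(n_layers, k))


n = 100
k = best_segment(n)
print("best segment length:", k)
print("peak with checkpointing:", peak_activations(n, k))
print("peak when storing everything:", n)
```

In this model the user has to pick the segment length, and the best choice is close to the square root of the number of layers. Taking the available memory as the only parameter, as described above, removes this tuning step.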

7.1.9 VITE​

  • Name:
    Visual Trace Explorer​‌
  • Keywords:
    Visualization, Execution trace​​
  • Functional Description:
    ViTE is a trace explorer. It is a tool made to visualize execution traces of large parallel programs. It supports Pajé, a trace format created by Inria Grenoble, as well as the OTF and OTF2 formats developed by the University of Dresden, and gives the programmer a simpler way to analyse, debug and/or profile large parallel applications.
  • Contact:
    Mathieu​​ Faverge
  • Participants:
    Mathieu Faverge,​​​‌ Philippe Swartvagher

8 New​ results

As explained in​‌ Section 3.4, our​​ contributions can be read​​​‌ at the intersection of​ the research domains described​‌ in Section 4 and​​ research axes described in​​​‌ Section 3.3 as shown​ in the following table:​‌

| | Axis 3.3.1 – Runtime | Axis 3.3.2 – Compression | Axis 3.3.3 – Energy | Axis 3.3.4 – Comm. & Fault Tol. |
|---|---|---|---|---|
| Domain 4.1 – Lin. Alg., Tensors | Topic 3.4.1 | Topic 3.4.2 | Topic 3.4.3 | Topic 3.4.7 |
| Domain 4.2 – Training | Topic 3.4.4 | Topic 3.4.5 | Topic 3.4.6 | Topic 3.4.8 |

8.1 Scalable and​‌ portable LU factorization with​​ partial pivoting on top​​​‌ of runtime systems (Topic​ 3.4.1)

Participants: Alycia​‌ Lisito, Mathieu Faverge​​, Pierre Ramet.​​​‌

Task-based runtime systems have demonstrated efficiency in leveraging the capabilities of large, heterogeneous architectures. Many linear algebra algorithms and applications have been implemented on top of runtime systems to increase their performance. However, the High Performance Linpack (HPL) benchmark, used by the TOP500 to rank supercomputers, has not yet been successfully implemented using task-based runtime systems. In this paper, we explore solutions to implement efficient LU factorization with partial pivoting using the sequential task-flow programming model. We show that, due to the pivoting strategy, this algorithm generates a large number of very small tasks, which usually overload the runtime system and make it inefficient. We propose two solutions to improve the efficiency and reduce the number of tasks. First, we apply well-known blocking strategies in the context of task-based algorithms. Secondly, we explore batching techniques to reduce the number of tasks submitted to the runtime system. Moreover, in distributed architectures, partial pivoting generates many reductions on the critical path throughout the factorization, which need to be carefully handled to reach high performance. Two task-based reduction algorithms are proposed to express these operations and improve the runtime reactivity on the critical path. These proposals have been implemented in the dense linear algebra library CHAMELEON on top of the STARPU runtime system. Experiments conducted on our cluster with these optimizations show that our LU with partial pivoting asymptotically reaches the performance of the non-pivoting algorithm.

This work has‌​‌ been presented at IPDPS​​ Conference, June 2025, Milan,​​​‌ Italy 20.
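To make the role of the pivot search concrete, the following minimal Python sketch performs an unblocked, sequential LU factorization with partial pivoting. It is a textbook formulation, not the task-based CHAMELEON/STARPU implementation described above; the function name `lu_partial_pivoting` and the list-of-lists storage are assumptions made for illustration. The `max` over column `k` is the reduction that, on a distributed machine, ends up on the critical path.

```python
def lu_partial_pivoting(a):
    """Return (lu, perm) such that P*A = L*U for a square matrix `a`.

    L (unit lower triangular) and U are packed into the single matrix `lu`;
    perm[i] is the index of the original row that ends up in row i.
    """
    n = len(a)
    lu = [row[:] for row in a]          # work on a copy of the input
    perm = list(range(n))
    for k in range(n):
        # Pivot search: a max-reduction over column k, on and below the diagonal.
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        if lu[p][k] == 0.0:
            raise ValueError("matrix is singular")
        if p != k:
            lu[k], lu[p] = lu[p], lu[k]
            perm[k], perm[p] = perm[p], perm[k]
        # Scale the column below the pivot and update the trailing submatrix.
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]
            for j in range(k + 1, n):
                lu[i][j] -= lu[i][k] * lu[k][j]
    return lu, perm
```

Each pivot search has to finish before the corresponding row swap and update can start, which is one way to see why pivoting produces many small, dependent operations.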

8.2‌ Batching the tasks of‌​‌ the LU factorization with​​ partial pivoting on top​​​‌ of runtime systems (Topic‌ 3.4.1)

Participants: Alycia‌​‌ Lisito, Mathieu Faverge​​, Florent Pruvost,​​​‌ Pierre Ramet.

Task-based runtime systems have demonstrated efficiency in leveraging the capabilities of large heterogeneous architectures. Many linear algebra algorithms and applications have been implemented on top of runtime systems to increase their performance. However, the LU factorization with partial pivoting has not yet been successfully implemented using task-based runtime systems. This operation is used to solve large dense linear systems in numerical simulations, such as the Maxwell equations in electromagnetism. This factorization is a major part of the High Performance Linpack (HPL) benchmark used in the TOP500 to evaluate and rank supercomputers. We explore solutions to implement efficient LU factorization with partial pivoting using the sequential task-flow programming model. These solutions have been implemented in the dense linear algebra library Chameleon on top of the StarPU runtime system. We showed that, due to the pivoting strategy, this algorithm generates a large number of very small tasks, which usually overload the runtime system and make it inefficient. With a naive task batching strategy, we improved the efficiency and reduced the number of tasks. We propose solutions to adapt the batch size to the granularity of the tasks. In order to do that, we first distinguish two types of tasks and set an adapted batch size for each. Then, we introduce a heuristic based on the number of operations per task to adapt the batch size to the computational complexity of the tasks during the factorization. Experiments conducted on our cluster with these optimizations show that our LU factorization with partial pivoting asymptotically reaches about 96% of the performance of the non-pivoting algorithm. Thanks to the adaptive batch size mechanism, the performance peak is reached even faster.

This work has‌​‌ been presented at COMPAS​​ Conference, June 2025, Bordeaux,​​​‌ France 27.
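The idea of adapting the batch size to the number of operations per task can be sketched as follows. The report does not give the actual heuristic, so this is only one plausible form: the function name `batch_size_for`, the target amount of work per batch, and the upper bound on the batch size are all hypothetical parameters introduced for illustration.

```python
import math


def batch_size_for(flops_per_task, target_flops_per_batch, max_batch_size):
    """Number of tasks to group so that one batch carries roughly the target work.

    All three parameters are illustrative assumptions, not values from Chameleon.
    """
    if flops_per_task <= 0:
        raise ValueError("flops_per_task must be positive")
    size = math.ceil(target_flops_per_batch / flops_per_task)
    # Never submit an empty batch, and cap the batch to bound latency.
    return max(1, min(size, max_batch_size))


# As tasks get smaller during the factorization, more of them are grouped.
for flops in (1_000_000, 1_000, 10):
    print(flops, batch_size_for(flops, 8_000, 64))
```

With such a rule, large tasks are submitted individually while very small ones are grouped up to the cap, so the per-task runtime overhead is amortized only where it matters.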

8.3‌ Toward an algebraic multigrid‌​‌ method for the indefinite​​ Helmholtz equation (Topic 3.4.2​​​‌)

Participants: Clement Richefort‌, Pierre Ramet.‌​‌

It is well known that multigrid methods are very competitive in solving a wide range of SPD problems. However, achieving such performance for non-SPD matrices remains an open problem. In particular, three main issues may arise when solving a Helmholtz problem: some eigenvalues may be negative or even complex, requiring the choice of an adapted smoother for capturing them; because the near-kernel space is oscillatory, the geometric smoothness assumption cannot be used to build efficient interpolation rules; and the coarse correction is not equivalent to a projection method since the indefinite matrix does not define a norm. We present some investigations about designing a method that converges in a constant number of iterations with respect to the wavenumber. The method builds on an ideal reduction-based framework and related theory for SPD matrices to improve an initial least squares minimization coarse selection operator formed from a set of smoothed random vectors. A new coarse correction is proposed to minimize the residual in an appropriate norm for indefinite problems. We also present numerical results at the end of the paper.

This paper has been​ published in SIAM SISC​‌ 11.

8.4 Hierarchical​​ partitioning for the numerical​​​‌ simulation of complex 3D​ objects (Topic 3.4.2)​‌

Participants: Dimitri Walther,​​ Mathieu Faverge, Pierre​​​‌ Ramet.

The Boundary Element Method (BEM) offers numerous advantages for simulating complex physical phenomena. By placing the unknowns (or degrees of freedom) on the interfaces between different media, it becomes possible to model problems with distant boundary conditions (such as fluid flow around an object, acoustic or electromagnetic wave diffraction, radiative heat transfer, etc.). However, this approach results in a fully coupled system with a dense matrix. When this dense matrix can be decomposed into low-rank sub-blocks, it is possible to construct a hierarchical matrix (H-matrix) that approximates the original system to a desired level of accuracy. In favorable scenarios, this approximation reduces spatial complexity from $O(n^2)$ to $O(n \log n)$ by compressing the matrix sub-blocks. This work investigates the relationship between the partitioning of degrees of freedom and the compression rate of the H-matrix. A new hierarchical partitioning technique, specifically designed to optimize H-matrix compression, is introduced. Unlike existing algorithms based on geometric information (such as Median cut, Cobblestone, or Space-filling curves), this new method relies on the construction of a connectivity graph of the degrees of freedom. This graph is built in quasi-linear time ($O(n \log n)$) from the mesh of the studied object and partitioned in log-quadratic time ($O(n^2 \log n)$) using a multi-level partitioning approach. An additional constraint is imposed to balance the partition loads, facilitating optimization on task-based execution environments. Numerical experiments are conducted on a variety of test cases from electromagnetic simulations.

This work has​ been presented at COMPAS​‌ Conference, June 2025, Bordeaux,​​ France 46. It​​​‌ was awarded the prize​ for best poster.
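The storage reduction from $O(n^2)$ to $O(n \log n)$ mentioned above can be illustrated with a small count of stored entries. The sketch below assumes a simplified HODLR-style layout (every off-diagonal block is low-rank with the same fixed rank, diagonal blocks are split recursively), which is a special case of an H-matrix and not the partitioning proposed in this work; the function names and the `rank` and `leaf_size` parameters are illustrative.

```python
def dense_storage(n):
    """Entries stored for a dense n x n matrix."""
    return n * n


def hodlr_storage(n, rank, leaf_size):
    """Entries stored when off-diagonal blocks are kept at a fixed rank."""
    if n <= leaf_size:
        return n * n                      # small diagonal blocks stay dense
    half = n // 2
    rest = n - half
    # Two off-diagonal blocks (half x rest and rest x half), each stored as
    # U * V^T with `rank` columns: rank * (half + rest) entries per block.
    off_diagonal = 2 * rank * (half + rest)
    return (hodlr_storage(half, rank, leaf_size)
            + hodlr_storage(rest, rank, leaf_size)
            + off_diagonal)


n = 1024
print(dense_storage(n), hodlr_storage(n, rank=8, leaf_size=64))
```

In this idealized setting each level of the hierarchy contributes about `2 * rank * n` entries, and there are about `log2(n / leaf_size)` levels, which gives the quasi-linear total. In practice the achievable ranks depend on how the degrees of freedom are partitioned, which is the question studied here.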

8.5​‌ Optimal scheduling algorithms for​​ software-defined radio pipelined and​​ replicated task chains on​​​‌ multicore architecture (Axis 3.3.1‌)

Participants: Laércio Lima‌​‌ Pilla.

Software-Defined Radio (SDR) represents a move from dedicated hardware to software implementations of digital communication standards. This approach offers flexibility, shorter time to market, maintainability, and lower costs, but it requires an optimized distribution of tasks in order to meet performance requirements. Thus, we studied the problem of scheduling SDR linear task chains of stateless and stateful tasks for streaming processing. We modeled this problem as a pipelined workflow scheduling problem based on pipelined and replicated parallelism on homogeneous resources. We proposed an optimal dynamic programming solution and an optimal greedy algorithm named OTAC for maximizing throughput while also minimizing resource utilization. Moreover, the optimality of the proposed scheduling algorithm was proved. We evaluated our solutions and compared their execution times and schedules to other algorithms using synthetic task chains and an implementation of the DVB-S2 communication standard on the AFF3CT SDR Domain Specific Language. Our results demonstrated how OTAC quickly finds optimal schedules, leading consistently to better results than other algorithms, or equivalent results with fewer resources.

This paper has‌ been published in the‌​‌ Journal of Parallel and​​ Distributed Computing in October​​​‌ 2025 13.
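The objective of such a pipelined and replicated schedule can be made concrete by computing its period (the inverse of the throughput). The sketch below evaluates a given schedule; it is not OTAC or the dynamic programming solution. It assumes the usual model in which the chain is cut into consecutive stages, a stage may run on several cores only if all of its tasks are stateless, and a replicated stage divides its load by its number of cores. The function name and the `(first, last, cores)` encoding are hypothetical.

```python
def pipeline_period(weights, stateful, stages):
    """Period of a schedule given as a list of (first, last, cores) stages.

    `first` and `last` are inclusive task indices in the chain; `weights[i]`
    is the execution time of task i and `stateful[i]` tells whether it keeps state.
    """
    period = 0.0
    for first, last, cores in stages:
        tasks = range(first, last + 1)
        if cores > 1 and any(stateful[i] for i in tasks):
            raise ValueError("a stage containing a stateful task cannot be replicated")
        load = sum(weights[i] for i in tasks)
        # A stage replicated on `cores` cores processes `cores` frames concurrently.
        period = max(period, load / cores)
    return period


weights = [2, 6, 3, 1]
stateful = [True, False, False, True]
print(pipeline_period(weights, stateful, [(0, 0, 1), (1, 2, 3), (3, 3, 1)]))
```

A scheduler in this model searches over the stage boundaries and core counts to minimize the period, and then, among schedules with the same period, to use as few cores as possible.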

8.6‌ Task-Based HPC in the‌​‌ Cloud: Price-Performance Analysis of​​ N-Body Simulations with StarPU​​​‌ (Axis 3.3.1)

Participants:‌ Laércio Lima Pilla.‌​‌

Public cloud environments present​​ significant challenges for traditional​​​‌ High Performance Computing (HPC)‌ applications due to infrastructure‌​‌ limitations that differ substantially​​ from dedicated HPC systems.​​​‌ Unlike traditional HPC clusters‌ optimized for tightly coupled‌​‌ parallel workloads, cloud platforms​​ were designed primarily for​​​‌ web services and data‌ processing applications. Key obstacles‌​‌ include high-latency networks, hardware​​ virtualization overhead, and limited​​​‌ availability of specialized accelerators,‌ all of which can‌​‌ severely impact the performance​​ of compute-intensive applications such​​​‌ as physics simulations. This‌ study investigated the feasibility‌​‌ of running HPC workloads​​ on public cloud infrastructure​​​‌ using standard and cost-effective‌ instance configurations rather than‌​‌ expensive specialized “HPC” offerings.​​ We deployed heterogeneous clusters​​​‌ on Amazon Web Services‌ using the HPC@Cloud Toolkit,‌​‌ incorporating various instance types,​​ including GPU-accelerated nodes with​​​‌ different computational capabilities. Our‌ evaluation focused on N-body‌​‌ simulations implemented using a​​ task-based parallel programming model,​​​‌ leveraging the StarPU runtime‌ system to dynamically schedule‌​‌ computational tasks across various​​ processing units. 
Our experimental​​​‌ results demonstrated three key‌ findings: (1) smaller GPU-equipped‌​‌ instances (g6.2xlarge)​​ achieve performance comparable to​​​‌ larger instances while costing‌ approximately one-sixth the price,‌​‌ challenging conventional scaling assumptions​​ for cloud-based HPC; (2)​​​‌ strategic GPU utilization yields‌ up to 8.‌​‌2× performance improvements​​ over CPU-only configurations while​​​‌ reducing total execution costs‌ by 24.4‌​‌×; and (3)​​ while task-based programming models​​​‌ effectively address network limitations‌ through dynamic scheduling, complex‌​‌ tree-based algorithms like TBFMM​​ face significant optimization challenges​​​‌ in cloud environments due‌ to load balancing issues‌​‌ and expensive parameter tuning​​ requirements. These findings provide​​​‌ practical guidance for researchers‌ and practitioners seeking cost-effective‌​‌ cloud HPC deployments, demonstrating​​​‌ that commodity cloud infrastructures​ can be viable for​‌ regular computational workloads but​​ require careful algorithmic-resource matching​​​‌ for optimal efficiency.

This​ work has been published​‌ in IEEE International Conference​​ on Cloud Engineering, September​​​‌ 2025, Rennes, France 25​.

8.7 Task-Based HPC​‌ in the Cloud: Price-Performance​​ Analysis of N-Body Simulations​​​‌ with StarPU (Topic 3.4.2​)

Participants: Laércio Lima​‌ Pilla.

Tensor-train (TT) decomposition has garnered tremendous popularity for its efficiency in handling high-dimensional data arising in scientific and quantum computing as well as machine learning applications. It provides a compact representation for matrices and vectors with a Kronecker product-like low-rank structure and enables efficient matrix-vector operations in this compressed form. The vector scalar product is among such key operations, comprising a series of tensor contractions in a specific tensor network topology whose order significantly impacts the computational cost. In this work, we proposed efficient algorithms for finding near-optimal contraction orderings for tensor networks representing scalar products in the TT format. We showed that our algorithms outperform all existing contraction ordering methods for general tensor networks where the best existing method incurs up to 15% higher cost for $x^T y$, twice the cost for $x^T A y$, and ten times higher cost for $x^T A B y$ scalar products where $x, y$ and $A, B$ are vectors and matrices expressed in the TT format, respectively.

This work has​‌ been published in the​​ European Conference on Parallel​​​‌ Processing, August 2025, Dresden,​ Germany 24.
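To illustrate why the contraction order matters even for the simplest case $x^T y$, the sketch below counts multiply-add operations for a left-to-right sweep over two TT vectors, choosing at each core which of the two cores to absorb first. This is a simple greedy rule inside a fixed sweep, given here as an assumption-laden illustration and not the near-optimal ordering algorithms of the paper; the function name and the rank-list encoding are hypothetical.

```python
def tt_dot_cost(x_ranks, y_ranks, dims):
    """Multiply-add count of a left-to-right sweep computing x^T y in TT format.

    x_ranks and y_ranks have len(dims) + 1 entries, with boundary ranks equal to 1.
    Core k of x has shape (x_ranks[k], dims[k], x_ranks[k + 1]); same for y.
    """
    total = 0
    for k, n in enumerate(dims):
        r0, r1 = x_ranks[k], x_ranks[k + 1]
        s0, s1 = y_ranks[k], y_ranks[k + 1]
        # The running r0 x s0 matrix must absorb both cores; try both orders.
        x_first = r0 * s0 * n * r1 + s0 * n * r1 * s1
        y_first = r0 * s0 * n * s1 + r0 * n * s1 * r1
        total += min(x_first, y_first)
    return total


print(tt_dot_cost([1, 4, 4, 1], [1, 2, 2, 1], [2, 2, 2]))
```

When the two vectors have different TT ranks, the cheaper order changes from one core to the next, so a fixed rule such as "always absorb the core of x first" is not optimal in general.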

8.8​‌ MetaCS-FL: A Metaheuristic-Based Framework​​ for Client Selection in​​​‌ Federated Learning Systems (Topic​ 3.4.6)

Participants: Alan​‌ Lira Nunes, Laércio​​ Lima Pilla.

Federated Learning (FL) enables the collaborative training of distributed machine learning models, with each participant (client) using their own local private data. In Cross-Device FL systems, clients usually include unreliable and heterogeneous mobile and edge devices with highly imbalanced and small local datasets. Given these characteristics, the selection of clients to participate in the training plays an essential role in the efficacy of these systems, as a poor selection can lead to long execution times, high energy consumption, and low accuracy. In this work, we proposed MetaCS-FL, a client selection framework built to support different metaheuristics, initial solution methods, and user-defined triggers for new client selections. It also employs client profiling and historical and current performance data to produce more efficient selections of clients and of the volume of data they should use for training locally. We evaluated our framework in an extensive series of experiments, including comparisons with state-of-the-art algorithms, revealing the effectiveness of our approach. Having FedAvg as the baseline for comparisons, MetaCS-FL reduced total time (resp. energy consumption) by up to 64.83% (resp. 56.79%) for CIFAR-10, and by up to 67.59% (resp. 60.87%) for Fashion-MNIST while reaching the target testing accuracy.

This report has​ been published in HAL​‌ in July 2025 40​​, and its paper​​ is currently under evaluation.​​​‌

8.9 Approximation Algorithms for‌ Scheduling With/Without Deadline Constraints‌​‌ Where Rejection Costs are​​ Proportional to Processing Times​​​‌ (Axis 3.3.3)

Participants:‌ Olivier Beaumont, Lionel‌​‌ Eyraud-Dubois, Laércio Lima​​ Pilla.

We studied two offline job scheduling problems where tasks can be processed on a limited number of energy-efficient edge machines or offloaded to an unlimited supply of energy-inefficient cloud machines (such tasks are called rejected). The objective was to minimize total energy consumption. First, we considered scheduling without deadlines, formulating it as a scheduling problem with rejection, where rejection costs are proportional to processing times. We proposed a novel $\frac{5}{4}(1+\epsilon)$-approximation algorithm, BEKP, by associating it to a Multiple Subset Sum problem, improving upon the existing $\left(\frac{3}{2} - \frac{1}{2m}\right)$-approximation for arbitrary rejection costs. Next, we addressed scheduling with deadlines, aiming to minimize the weighted number of rejected jobs. We positioned this problem within the literature and introduced a new $\left(1 - \frac{(m-1)^m}{m^m}\right)$-approximation algorithm, MDP, inspired by an interval selection algorithm with a $\left(1 - \frac{m^m}{(m+1)^m}\right)$-approximation for arbitrary rejection costs. Experimental results demonstrate that BEKP and MDP obtain better results (lower costs or higher profits) than other state-of-the-art algorithms while maintaining a competitive or better time complexity.

This work was‌ developed in the context‌​‌ of the Challenge PULSE,​​ and the paper has​​​‌ been published in IEEE‌ Transactions on Parallel and‌​‌ Distributed Systems in December​​ 2025 9.

8.10​​​‌ Energy-Aware Scheduling Strategies for‌ Partially-Replicable Task Chains on‌​‌ Heterogeneous Processors (Axis 3.3.3​​)

Participants: Laércio Lima​​​‌ Pilla.

The arrival of heterogeneous (or hybrid) multicore architectures has brought new performance trade-offs for applications, and efficiency opportunities to systems. They have also increased the challenges related to thread scheduling, as tasks' execution times vary depending on whether they are placed on big (performance) cores or little (efficient) ones. In this work, we focused on the challenges heterogeneous multicore processors bring to partially-replicable task chains, such as the ones that implement digital communication standards in Software-Defined Radio (SDR). Our objective was to maximize the throughput of these task chains while also minimizing their power consumption. We modeled this problem as a pipelined workflow scheduling problem using pipelined and replicated parallelism on two types of resources whose objectives were to minimize the period and to use as many little cores as necessary. We proposed two greedy heuristics (FERTAC and 2CATAC) and one optimal dynamic programming (HeRAD) solution to the problem. We evaluated our solutions and compared the quality of their schedules (in period and resource utilization) and their execution times using synthetic task chains. We also studied an open source implementation of the DVB-S2 communication standard based on the StreamPU runtime. Leading processor vendors were covered with ARM, Apple, AMD, and Intel platforms. Both the achieved throughput and the energy consumption were evaluated. Our results demonstrated the benefits and drawbacks of the different proposed solutions.

This work​‌ has been published in​​ Heterogeneity in Computing Workshop,​​​‌ June 2025, Milan, Italy​ 22, and its​‌ extended version 39 is​​ currently under evaluation.

8.11​​​‌ HiRemate: Hierarchical Approach for​ Efficient Re-materialization of Large​‌ Neural Networks (Domain 4.2​​)

Participants: Olivier Beaumont​​​‌, Lionel Eyraud Dubois​, Yulia Gusak.​‌

Training modern neural networks​​ poses a significant memory​​​‌ challenge, as storing intermediate​ results during the forward​‌ and backward passes requires​​ considerable memory resources. To​​​‌ address this issue without​ affecting model accuracy, re-materialization​‌ techniques have been introduced​​ to recompute selected intermediate​​​‌ results instead of storing​ them, thus fulfilling the​‌ memory size constraint. The​​ main algorithmic problem is​​​‌ to compute a re-materialization​ schedule that minimizes the​‌ computational overhead within a​​ given memory budget. Our​​​‌ proposed HiRemate framework is​ based on a new​‌ hierarchical approach that provides​​ generality and quality: we​​​‌ can handle any class​ of network graphs and​‌ satisfy the memory constraint​​ with a low computational​​​‌ overhead during training. The​ framework exhibits low algorithmic​‌ complexity, making it possible​​ to scale up and​​​‌ handle very large models.​ The framework automatically builds​‌ a dataflow graph from​​ a PyTorch model, decomposes​​​‌ the graph hierarchically, and​ then builds an nn.Module​‌ that executes forward and​​ backward passes within the​​​‌ given memory budget.

This​ work has been published​‌ in the Forty-Second International​​ Conference on Machine Learning​​​‌ (ICML 2025), July 2025,​ Vancouver, Canada 19.​‌
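The trade-off between memory and recomputation behind re-materialization can be illustrated on the simplest possible case. The sketch below is not HiRemate: it assumes a homogeneous chain of layers with unit-size activations and uniform segment checkpointing, where the inputs of `k` segments are kept and one segment is rebuilt at a time during the backward pass. The function name and the convention of returning 0 when nothing needs to be recomputed are choices made for this illustration.

```python
import math


def choose_segments(num_layers, memory_budget):
    """Return a segment count for uniform checkpointing, or None if none fits.

    Unit-size activations are assumed. With k segments, the k segment inputs are
    kept and one segment of ceil(num_layers / k) activations is rebuilt at a time.
    A return value of 0 means everything fits and nothing is recomputed.
    """
    if num_layers <= memory_budget:
        return 0
    best = None
    for k in range(1, num_layers + 1):
        peak = k + math.ceil(num_layers / k)
        if peak <= memory_budget and (best is None or peak < best[1]):
            best = (k, peak)
    return None if best is None else best[0]


print(choose_segments(100, 20))
```

In this toy model the smallest reachable peak is about `2 * sqrt(num_layers)` activations. Going below that, or handling layers of different sizes and arbitrary graphs, requires the kind of schedule optimization discussed above.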

8.12 Fault-tolerant numerical iterative​​ algorithms at scale (Topic​​​‌ 3.4.7)

Participants: Thomas​ Herault.

This year, we developed a coherent set of models and strategies to make large-scale iterative computations, with a strong emphasis on linear algebra kernels and solvers, both more resilient to errors and more efficient in their use of communication and storage. A first contribution revisits protection against silent data corruptions (SDCs), where errors may remain undetected for several iterations. Instead of relying on costly full replication, we studied the use of partial detectors whose detection latency is bounded, and we derived optimal execution schemes (segment lengths, and how many in-memory checkpoints must be kept) that guarantee correctness while reducing overhead. The analysis and Monte-Carlo results show that, across a broad range of parameters, partial detection can significantly outperform replication, sometimes yielding substantial speedups for realistic error rates 14.

8.13 Partial Detectors​​ Versus Replication To Cope​​​‌ With Silent Errors (Axis​ 3.3.4)

Participants: Thomas​‌ Herault.

We proposed a holistic fault-tolerance methodology for numerical iterative algorithms that jointly addresses the three dominant error sources at scale: fail-stop failures, computation silent errors, and memory bit flips. The key idea is a hierarchical periodic pattern that interleaves mechanisms at different frequencies: (i) frequent computation verifications (“chunks”), (ii) less frequent memory verifications combined with in-memory checkpoints (“segments”), and (iii) even less frequent global checkpoints to tolerate fail-stop failures (“patterns”). We provide an analytical framework to derive the optimal pattern minimizing the expected time per iteration. We instantiated and evaluated this approach on Preconditioned Conjugate Gradient, illustrating scenarios where the optimal pattern can dramatically reduce resilience overheads compared to more naïve strategies 16, 38.
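The nesting of chunks, segments, and patterns can be sketched by listing the operations of one pattern in order. The operation names and the exact ordering at the end of a segment are assumptions made for illustration; the sketch only shows the three frequencies and does not compute the optimal pattern parameters.

```python
def build_pattern(segments_per_pattern, chunks_per_segment, iterations_per_chunk):
    """List the operations of one periodic pattern, in execution order."""
    ops = []
    for _ in range(segments_per_pattern):
        for _ in range(chunks_per_segment):
            ops.extend(["iterate"] * iterations_per_chunk)
            ops.append("verify_computation")   # end of a chunk
        ops.append("verify_memory")            # end of a segment
        ops.append("memory_checkpoint")
    ops.append("global_checkpoint")            # end of the pattern
    return ops


pattern = build_pattern(segments_per_pattern=2, chunks_per_segment=2, iterations_per_chunk=3)
print(len(pattern), pattern[-1])
```

The analytical question is then how to choose the three counts so that the expected time per iteration, including verifications, checkpoints, and re-executions after errors, is minimal.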

8.14 Fixed-Work vs. Fixed-Time​​​‌ Checkpointing on Large-Scale Failure-Prone‌ Platforms (Axis 3.3.4)‌​‌

Participants: Thomas Herault.​​

We addressed a very practical systems constraint that directly impacts large-scale linear algebra runs: the prevalence of fixed-length reservations on HPC systems. We studied checkpointing not only in the classical “fixed-work” setting, but also in the dual fixed-time setting, where the goal is to maximize the expected progress achieved within a reservation. We show that fixed-time checkpointing is surprisingly harder than fixed-work checkpointing, propose dynamic threshold-based heuristics that perform well for short/medium reservations, and derive a (discretized) optimal dynamic-programming strategy, including extensions to stochastic checkpoint durations. These results provide actionable guidance for running iterative solvers robustly under real scheduling constraints 8.
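As a point of reference for the classical fixed-work setting, the well-known Young/Daly first-order approximation gives the checkpointing period $W = \sqrt{2 \mu C}$, where $C$ is the checkpoint cost and $\mu$ the platform mean time between failures. The sketch below implements this formula and the corresponding first-order waste; it is the textbook baseline, not the fixed-time strategies of this work, and the function names are chosen for illustration.

```python
import math


def young_daly_period(checkpoint_cost, mtbf):
    """First-order optimal amount of work between two checkpoints (same time unit)."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def first_order_waste(period, checkpoint_cost, mtbf):
    """Fraction of time lost to checkpoints plus expected re-execution."""
    return checkpoint_cost / period + period / (2.0 * mtbf)


# Example with a 60 s checkpoint and a one-day MTBF (values chosen for illustration).
w = young_daly_period(60.0, 86400.0)
print(round(w), round(first_order_waste(w, 60.0, 86400.0), 4))
```

In the fixed-time setting, a periodic rule of this kind is no longer sufficient, since the value of taking a checkpoint depends on how much of the reservation remains.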

8.15 PaRSEC:‌ Scalability, flexibility, and hybrid‌​‌ architecture support for task-based​​ applications in ECP (Axis​​​‌ 3.3.1)

Participants: Thomas‌ Herault.

The work‌​‌ conducted during the Exascale​​ Computing Project (ECP) provided​​​‌ several key lessons on‌ the role and design‌​‌ of task-based runtime systems​​ for future large-scale platforms.​​​‌ In particular, ECP confirmed‌ that data movement, rather‌​‌ than raw computation, is​​ the dominant performance limiter​​​‌ on heterogeneous and accelerated‌ systems, making it essential‌​‌ for the runtime to​​ manage communication, data placement,​​​‌ and the overlap of‌ computation and transfers. The‌​‌ diversity of ECP target​​ architectures also showed that​​​‌ performance portability cannot be‌ achieved through static programming‌​‌ models, but instead requires​​ runtimes that dynamically adapt​​​‌ scheduling, task granularity, and‌ resource usage based on‌​‌ runtime information. In addition,​​ the coexistence of legacy​​​‌ MPI-based components with task-based‌ execution emphasized the importance‌​‌ of interoperability, while the​​ scale and duration of​​​‌ ECP runs highlighted the‌ need to treat resilience‌​‌ as a first-class runtime​​ concern, naturally enabled by​​​‌ dataflow-based execution models. These‌ lessons directly inform the‌​‌ objectives of the NumPEx​​ project, the French counterpart​​​‌ to ECP, in which‌ the TOPAL team is‌​‌ actively involved. By building​​ on the experience gained​​​‌ in ECP, our participation‌ in NumPEx aims to‌​‌ transfer and extend these​​ concepts to the French​​​‌ exascale ecosystem, contributing runtime-level‌ solutions for communication efficiency,‌​‌ adaptability, and fault tolerance​​ on next-generation European supercomputing​​​‌ platforms 10.

8.16‌ Tensor Contractions on Top‌​‌ of Runtime Systems: Application​​ to the Coupled-Cluster Method​​​‌ (Topic 3.4.1)

Participants:‌ Thomas Herault.

This‌​‌ year, we investigated how​​ the benefits of task-based​​​‌ and distributed runtime systems,‌ well established for dense‌​‌ linear algebra, can be​​ extended to tensor computations,​​​‌ which play a central‌ role in modern high-performance‌​‌ computing. Our work focused​​ on tensor contractions arising​​​‌ in computational quantum chemistry,‌ in particular in coupled-cluster‌​‌ methods, where tensors have​​ a small number of​​​‌ dimensions but very large‌ sizes. We extended the‌​‌ Chameleon dense linear algebra​​​‌ library to support tensor​ contractions by expressing them​‌ as sequences of optimized​​ matrix operations, combined with​​​‌ flexible tensor permutations. To​ address the challenges of​‌ data layout and memory​​ footprint, we identified a​​​‌ set of elementary and​ composable tensor transformations and​‌ implemented them on top​​ of the StarPU runtime​​​‌ system. We validated this​ approach on the computation​‌ of coupled-cluster residuals with​​ density fitting, demonstrating both​​​‌ its efficiency and its​ scalability on modern heterogeneous​‌ platforms 28.

8.17​​ Scalable Block-Sparse Matrix Multiplication​​​‌ Using Template Task Graphs​ (Topic 3.4.1)

Participants:​‌ Thomas Herault.

This​​ year, we advanced the​​​‌ use of task-based runtime​ systems for sparse linear​‌ algebra by addressing communication​​ scalability issues in distributed​​​‌ block-sparse matrix multiplication. Building​ on the Template Task​‌ Graph (TTG) programming model,​​ we introduced application-defined scheduling​​​‌ constraints that allow the​ runtime to control task​‌ eligibility without resorting to​​ ad-hoc control flow. Applied​​​‌ to block-sparse matrix multiplication,​ these constraints make it​‌ possible to throttle and​​ structure communication, limiting the​​​‌ number of concurrent broadcasts​ and reducing network contention​‌ while preserving overlap between​​ communication and computation. Experimental​​​‌ results demonstrate that this​ approach significantly improves scalability​‌ on large problem sizes​​ and highlights the importance​​​‌ of exposing high-level execution​ constraints to the runtime.​‌ This work reinforces the​​ role of expressive task-based​​​‌ runtimes as a key​ enabler for scalable and​‌ communication-efficient linear algebra on​​ modern distributed systems 23​​​‌.

8.18 Comparing and​ Contrasting User and Runtime​‌ Directed Data Placement Strategies​​ for Owner-Compute, Multi-accelerator Distributed​​​‌ Task Based Scheduling (Topic​ 3.4.1)

Participants: Thomas​‌ Herault.

This work​​ explores data placement strategies​​​‌ in task-based runtime systems​ for linear algebra applications​‌ on multi-accelerator, distributed platforms.​​ Using the PaRSEC runtime,​​​‌ we compared runtime-directed heuristics​ and user-directed placement strategies​‌ in the context of​​ owner-compute scheduling, focusing on​​​‌ dense matrix multiplication and​ Cholesky factorization. The results​‌ show that while automated​​ strategies can significantly improve​​​‌ locality and outperform naïve​ approaches, they remain consistently​‌ outperformed by carefully designed​​ user-directed placements, particularly at​​​‌ scale. The study also​ highlights the limitations of​‌ relying on unified virtual​​ memory and demonstrates the​​​‌ importance of explicitly managing​ data received from the​‌ network, especially on modern​​ systems where network interfaces​​​‌ are directly attached to​ accelerators. Overall, this work​‌ emphasizes that runtime systems​​ must expose flexible mechanisms​​​‌ for expressing data placement​ policies, allowing expert users​‌ to guide execution while​​ preserving a clear separation​​​‌ between algorithmic expression and​ performance tuning 17.​‌

8.19 Optimizing Parallel Heterogeneous​​ System Efficiency: Dynamic Task​​​‌ Graph Adaptation with Recursive​ Tasks (Topic 3.4.1)​‌

Participants: Abdou Guermouche.​​

Task-based programming models are currently a major trend for leveraging heterogeneous parallel systems in a productive way (OpenACC, Kokkos, Legion, OmpSs, PaRSEC, StarPU, XKaapi, ...). Among these models, the Sequential Task Flow (STF) model is widely embraced (PaRSEC's DTD, OmpSs, StarPU) since it allows task graphs to be expressed naturally through a sequential-looking submission of tasks, with task dependencies inferred automatically. However, STF is limited to task graphs with task sizes that are fixed at submission, posing a challenge in determining the optimal task granularity. Notably, in heterogeneous systems, the optimal task size varies across different processing units, so a single task size would not fit all units. StarPU's recursive tasks allow graphs with several task granularities by turning some tasks into sub-graphs dynamically at runtime. The decision to transform these tasks into sub-graphs is made by a StarPU component called the Splitter. After deciding to transform some tasks, classical scheduling approaches are used, making this component generic and orthogonal to the scheduler. In this paper, we propose a new policy for the Splitter, designed for heterogeneous platforms, that relies on linear programming aimed at minimizing execution time and maximizing resource utilization. This results in a dynamic, well-balanced set comprising both small tasks to fill multiple CPU cores and large tasks for efficient execution on accelerators like GPU devices. We then present an experimental evaluation showing that just-in-time adaptations of the task graph lead to improved performance across various dense linear algebra algorithms 12.

8.20​​​‌ Improving energy efficiency of‌ HPC applications using unbalanced‌​‌ GPU power capping (Topic​​ 3.4.3)

Participants: Abdou​​​‌ Guermouche, Hayfa Tayeb‌.

Energy efficiency represents a significant challenge in the domain of High-Performance Computing (HPC). One potential key parameter to improve energy efficiency is the use of power capping, a technique for controlling the power limits of a device, such as a CPU or GPU. In this paper, we propose to examine the impact of GPU power capping in the context of HPC applications using heterogeneous computing systems. To this end, we first conduct an extensive study of the impact of GPU power capping on a compute-intensive kernel, namely the matrix multiplication kernel (GEMM), on different Nvidia GPU architectures. Interestingly, such compute-intensive kernels are up to 30 % more energy efficient when the GPU is set to 55-70 % of its Thermal Design Power (TDP). Using the best power capping configuration provided by this study, we investigate how setting different power caps for GPU devices of a heterogeneous computing node can improve the energy efficiency of the running application. We consider dense linear algebra task-based operations, namely matrix multiplication and Cholesky factorization. We show how the underlying runtime system scheduler can then automatically adapt its decisions to take advantage of the heterogeneous performance capability of each GPU. The results show that for a given platform equipped with four GPU devices, applying a power cap on all GPUs improves the energy efficiency for matrix multiplication up to 24.3 % (resp. 33.78 %) for double (resp. single) precision 26.

8.21​​​‌ Sparse Matrix Ordering for‌ Fine Grain Parallel Triangular‌​‌ Solve Using SIMD (Topic​​ 3.4.1)

Participants: Abdou​​​‌ Guermouche.

The evolution of processor hardware increasingly supports fine grain parallelism through SIMD (Single Instruction, Multiple Data) vector instruction sets and hardware threading. For instance, the new ARM SVE instruction set allows hardware implementations with SIMD vectors of up to 32 double-precision elements per hardware thread. In this work, we focus on the vectorization of the triangular solves required in BiCGStab preconditioned with ILU(0), a combination that is particularly effective numerically for IFPEN applications. In our context, parallelism can be exposed by changing the sparse structure of the matrices through a reordering of the unknowns, which can be recast in terms of graph ordering and coloring. We use a graph coloring method named ColorRCM to exhibit fine grain parallelism to feed the SIMD computing units while improving the convergence of the Krylov solver compared to the classical greedy graph coloring method. We first evaluate the performance of SIMD-SpTRSV using the permutation provided by ColorRCM and achieve a speedup between 1.7 and 6 with AVX2 compared to Intel MKL 21.4. Then we examine the impact of the ColorRCM ordering on ILU(0)-BiCGStab performance on 201 matrices, including matrices from the SuiteSparse Matrix Collection (formerly the University of Florida Sparse Matrix Collection) and from IFPEN porous media flow simulations. The solver configuration that uses the ColorRCM ordering and is vectorized with AVX2 instructions showed the best convergence times in two thirds of the tests 21.
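
As a point of reference, the classical greedy coloring mentioned above can be sketched in a few lines. This is the baseline technique, not ColorRCM itself, and the small chain graph is a hypothetical example. Unknowns that share a color have no edge between them, so the corresponding rows of a triangular solve can be processed together, for instance in one SIMD batch.

```python
def greedy_coloring(adjacency):
    """Classical greedy coloring: visit vertices in index order and give each
    one the smallest color not used by its already-colored neighbours.
    adjacency maps a vertex to the set of its neighbours (symmetric graph)."""
    colors = {}
    for v in sorted(adjacency):
        used = {colors[u] for u in adjacency[v] if u in colors}
        c = 0
        while c in used:
            c += 1
        colors[v] = c
    return colors


def color_classes(colors):
    """Group vertices by color; each group is a set of independent unknowns."""
    classes = {}
    for v, c in colors.items():
        classes.setdefault(c, []).append(v)
    return [classes[c] for c in sorted(classes)]


# Hypothetical example: the graph of a 6x6 tridiagonal matrix (a chain 0-1-2-3-4-5).
adjacency = {i: {j for j in (i - 1, i + 1) if 0 <= j < 6} for i in range(6)}
colors = greedy_coloring(adjacency)
print(color_classes(colors))  # [[0, 2, 4], [1, 3, 5]]
```

Reordering the unknowns color by color turns a fully sequential chain into two batches of three independent rows. The price is a different elimination order for ILU(0), which is why the choice of coloring affects the convergence of the Krylov solver.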

8.22 Mind Bubbles and​​ Memory: Bounds on Scheduling​​​‌ Pipeline Parallelism with Rematerialization​ (Domain 4.2)

Participants:​‌ Adrien Aguila–Multner, Yulia​​ Gusak, Olivier Beaumont​​​‌, Lionel Eyraud Dubois​.

Training large neural​‌ networks, especially Transformer-based Large​​ Language Models (LLMs), requires​​​‌ massive high-performance computing (HPC)​ resources. Within each microbatch,​‌ computations follow a strictly​​ sequential flow through a​​​‌ stack of transformer blocks:​ a forward pass to​‌ compute the loss, and​​ a backward pass to​​​‌ propagate gradients. This sequential​ structure limits intrinsic parallelism.​‌ To improve performance, several​​ complementary strategies have been​​​‌ developed: data, tensor, sequence,​ and pipeline parallelism, typically​‌ combined to achieve scalability​​ over tens of thousands​​​‌ of GPUs.

In 35​, we present a​‌ formal analysis of pipeline​​ parallelism (PP) for large-scale​​​‌ training. In PP, the​ model is partitioned into​‌ multiple stages, and microbatches​​ are injected into the​​​‌ pipeline to overlap computation.​ The main challenge is​‌ to minimize idle periods​​ (pipeline bubbles) while managing​​​‌ memory usage, since each​ GPU must store intermediate​‌ activations from multiple in-flight​​ microbatches. Existing scheduling algorithms​​​‌ such as GPIPE, 1F1B,​ HANAYO, and MEGATRON reduce​‌ idle time but lack​​ formal lower bounds or​​​‌ explicit modeling of memory​ constraints.

We develop a unified analytical approach for PP scheduling, deriving lower bounds on completion time for both single-wave and multi-wave regimes. Our analysis explicitly incorporates a memory constraint $K$, denoting the number of activations that can be stored per GPU. Exact results are provided for two extreme cases, minimal memory ($K = 1$) and large memory ($K \geq m$), while general lower bounds are established for intermediate configurations. Our analysis highlights the intrinsic coupling between pipeline utilization and memory footprint, providing a foundation for evaluating and comparing pipeline scheduling algorithms under realistic memory constraints.
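
For intuition about pipeline bubbles, the textbook idealised GPipe schedule gives a closed form. The sketch below assumes uniform stage times, no communication cost, and all forward passes before any backward pass; it is the standard back-of-the-envelope model, not one of the bounds derived in the work above. The parameter names are chosen here for illustration.

```python
def gpipe_makespan(stages, microbatches, t_fwd=1.0, t_bwd=2.0):
    """Completion time of an idealised GPipe schedule: the pipeline needs
    (stages - 1) extra slots to fill and drain in each direction."""
    return (microbatches + stages - 1) * (t_fwd + t_bwd)


def bubble_fraction(stages, microbatches):
    """Fraction of the schedule during which a given GPU is idle."""
    return (stages - 1) / (microbatches + stages - 1)


for m in (4, 8, 32):
    print(f"4 stages, {m} microbatches: "
          f"makespan {gpipe_makespan(4, m)}, bubbles {bubble_fraction(4, m):.1%}")
```

Increasing the number of microbatches shrinks the bubble fraction, but in this schedule the first stage holds the activations of every microbatch before its first backward pass, which illustrates the coupling between pipeline utilization and memory footprint.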

8.23‌ Optimized Forward-Backward Rematerialization for‌​‌ Memory-Efficient Pipeline Parallel Training​​ (Topic 3.4.4)

Participants:​​​‌ Adrien Aguila–Multner, Olivier‌ Beaumont, Lionel Eyraud‌​‌ Dubois, Yulia Gusak​​.

Pipeline parallelism is​​​‌ a key technique for‌ scaling deep network training‌​‌ across multiple devices. Recent​​ works have significantly reduced​​​‌ pipeline idle time by‌ improving scheduling efficiency. Decoupling‌​‌ the computation of gradients​​ with respect to weights​​​‌ and activations led to‌ the development of schedules‌​‌ with almost no idle​​ time. However, these methods​​​‌ still require substantial memory,‌ limiting their applicability on‌​‌ resource-constrained hardware.

In 36​​, our first contribution​​​‌ is to introduce recomputation‌ to the backward pass,‌​‌ extending rematerialization beyond the​​ forward pass. This enables​​​‌ executing schedules with decoupled‌ gradient computations under much‌​‌ tighter memory constraints. Our​​ second contribution is a​​​‌ unified optimization approach that,‌ given a model and‌​‌ hardware memory constraints, formulates​​ and solves an Integer​​​‌ Linear Programming (ILP) problem‌ to determine the optimal‌​‌ per-microbatch, per-GPU rematerialization strategy​​ for a given schedule,​​​‌ applicable to both one-wave‌ and multi-wave pipeline schedules.‌​‌ Our third contribution shows​​ that, as device memory​​​‌ constraints vary, the relative‌ advantages of different pipeline‌​‌ schedules also change in​​ the presence of rematerialization.​​​‌ We provide corresponding insights‌ and a PyTorch framework‌​‌ that enables finding and​​ executing the optimal combination​​​‌ of pipeline scheduling and‌ rematerialization strategies. Experiments demonstrate‌​‌ the effectiveness of all​​ three contributions, showing that​​​‌ our approach enables efficient‌ training of larger models‌​‌ under tight memory budgets,​​ adapts optimally to varying​​​‌ memory capacities, and reduces‌ recomputation overhead compared to‌​‌ existing recomputation solutions.
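
The flavour of the underlying optimization problem can be shown on a toy instance: choose which activations to keep in memory and which to recompute, so that the memory budget is respected and the recomputation overhead is minimal. The sketch below uses brute-force enumeration as a stand-in for an ILP solver and ignores the per-microbatch, per-GPU and scheduling dimensions of the real formulation; all sizes and costs are hypothetical.

```python
from itertools import product


def best_rematerialization(mem, recompute_cost, budget):
    """Enumerate every keep/discard choice (1 = keep the activation in memory,
    0 = discard it and recompute it later) and return the feasible choice
    with the smallest total recomputation cost as (overhead, keep)."""
    best = None
    for keep in product((0, 1), repeat=len(mem)):
        used = sum(m for m, k in zip(mem, keep) if k)
        if used > budget:
            continue  # this choice does not fit in memory
        overhead = sum(c for c, k in zip(recompute_cost, keep) if not k)
        if best is None or overhead < best[0]:
            best = (overhead, keep)
    return best


# Hypothetical layers: activation sizes and the time needed to recompute each one.
mem = [4, 3, 2, 1]
cost = [5.0, 4.0, 1.0, 1.0]
overhead, keep = best_rematerialization(mem, cost, budget=7)
print(overhead, keep)  # 2.0 (1, 1, 0, 0)
```

Here the budget is spent on the two activations that are most expensive to recompute. An ILP solver reaches the same kind of decision without enumerating all subsets, which is what makes the approach usable on real models.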

8.24​​ Leveraging Expert Usage to​​​‌ Speed up LLM Inference‌ with Expert Parallelism (Topic‌​‌ 3.4.4)

Participants: Olivier​​ Beaumont, Raphael Bourgouin​​​‌.

Large language models have become indispensable for many text-processing applications. Their inference, i.e. their use to generate text, is a time-consuming task since tokens have to be generated one after the other, even if the computational load has been reduced by model sparsification, e.g. by using Mixture of Experts (MoE) models. In the MoE context, a subset of experts is selected at each stage. Note that not all subsets of experts (pairs of experts in most cases) in a given layer have the same probability of being selected. When experts are mapped to different GPUs, there is a risk of load imbalance if the selected experts end up on a small number of GPUs. This paper 15 proposes to leverage this heterogeneity in expert usage to map experts of popular subsets onto distinct GPUs, allowing them to be processed in parallel and thus reducing the time needed for inference. Even though this mapping problem is NP-complete, it is possible to design simple greedy strategies that significantly reduce the need for sequential expert processing. Our proof-of-concept confirms that our mapping strategies effectively reduce inference time on the Mixtral model.
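
One possible greedy strategy of this kind is sketched below. It is an illustrative heuristic written for this report, not necessarily the one evaluated in the paper: expert pairs are processed by decreasing selection frequency, and each expert is placed on the least loaded GPU that differs from its partner's GPU when capacity allows. The frequencies are hypothetical.

```python
def greedy_expert_mapping(pair_freq, n_experts, n_gpus):
    """Map experts to GPUs so that frequently co-selected experts are separated.
    pair_freq maps a pair of expert indices to its selection frequency."""
    capacity = -(-n_experts // n_gpus)  # ceiling division: experts per GPU
    gpu_of = {}
    load = [0] * n_gpus

    def place(expert, avoid):
        candidates = [g for g in range(n_gpus) if load[g] < capacity]
        preferred = [g for g in candidates if g != avoid] or candidates
        g = min(preferred, key=lambda g: (load[g], g))
        gpu_of[expert] = g
        load[g] += 1

    for (a, b), _ in sorted(pair_freq.items(), key=lambda kv: -kv[1]):
        if a not in gpu_of:
            place(a, gpu_of.get(b))
        if b not in gpu_of:
            place(b, gpu_of.get(a))
    for e in range(n_experts):  # experts that never appear in a pair
        if e not in gpu_of:
            place(e, None)
    return gpu_of


def colocated_fraction(pair_freq, gpu_of):
    """Share of selections whose two experts sit on the same GPU
    and therefore have to be processed one after the other."""
    total = sum(pair_freq.values())
    same = sum(f for (a, b), f in pair_freq.items() if gpu_of[a] == gpu_of[b])
    return same / total


# Hypothetical usage statistics for 4 experts on 2 GPUs.
pair_freq = {(0, 1): 50, (2, 3): 30, (0, 2): 15, (1, 3): 5}
mapping = greedy_expert_mapping(pair_freq, n_experts=4, n_gpus=2)
naive = {0: 0, 1: 0, 2: 1, 3: 1}
print(mapping, colocated_fraction(pair_freq, mapping), colocated_fraction(pair_freq, naive))
```

On this toy input, the naive contiguous mapping serializes 80 % of the selections, against 20 % for the greedy mapping.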

8.25​​​‌ Pallas: a generic‌ trace format for large‌​‌ HPC trace analysis (Axis​​​‌ 3.3.1)

Participant: Philippe​ Swartvagher.

Identifying performance bottlenecks in a parallel application is tedious, especially because it requires analyzing the behaviour of various software components, as bottlenecks may have several causes and symptoms. For example, a load imbalance may cause long MPI waiting times, or contention on disk may degrade the performance of I/O operations. Detecting a performance problem means investigating the execution of an application and applying several performance analysis techniques. To do so, one can use a tracing tool to collect information describing the behaviour of the application. At the end of the execution, a trace file in a specific format is available to the application user, which can be used to conduct a complete post-mortem investigation. Several challenges emerge from the generation and use of traces. Tracing may alter the performance of the application, and can create thousands of heavy trace files, especially at a large scale. Most importantly, the post-mortem analysis needs to load these thousands of trace files in memory and process them. This quickly becomes impractical for large scale applications, as memory gets exhausted and the number of open files exceeds the system capacity. In this paper, we propose Pallas 18, a generic trace format tailored for conducting various post-mortem performance analyses of traces describing large executions of HPC applications. During the execution of the application, Pallas collects events and detects their repetitions on-the-fly. When storing the trace to disk, Pallas groups the data from similar events or groups of events together in order to later speed up trace reading. We demonstrate that the Pallas online detection of the program structure does not significantly degrade the performance of the applications. Moreover, the Pallas format allows faster trace analysis compared to other evaluated trace formats. Overall, the Pallas trace format allows the interactive analysis of a trace that is required when a user investigates a performance problem.
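
The simplest possible form of on-the-fly repetition detection is run-length folding of consecutive identical events, sketched below. It only conveys the idea of compressing a trace while it is being recorded; the class and event names are hypothetical, and the actual Pallas algorithm and file layout, which handle repeated sequences of events and timing data, are not reproduced here.

```python
class RepetitionRecorder:
    """Record events one at a time and fold consecutive repetitions
    into [event, count] runs as they arrive."""

    def __init__(self):
        self.runs = []

    def record(self, event):
        if self.runs and self.runs[-1][0] == event:
            self.runs[-1][1] += 1  # same event as the previous one: extend the run
        else:
            self.runs.append([event, 1])


recorder = RepetitionRecorder()
for event in ["compute"] * 3 + ["MPI_Wait"] + ["compute"] * 2:
    recorder.record(event)
print(recorder.runs)  # [['compute', 3], ['MPI_Wait', 1], ['compute', 2]]
```

Six events are stored as three runs, and the recorder never needs to look back further than the last run, so the cost per recorded event stays constant.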

9​​ Bilateral contracts and grants​​​‌ with industry

9.1 Bilateral​ Grants with Industry

Participants:​‌ Olivier Beaumont, Lionel​​ Eyraud-Dubois, Mathieu Faverge​​​‌, Abdou Guermouche,​ Yulia Gusak, Pierre​‌ Ramet.

Some of the ongoing PhD theses are carried out within bilateral contracts with industry:

  • Airbus (2022-). This collaboration concerns the parallelization and optimization of the Flusepa application, which models the separation of boosters for space launchers at Airbus Safran Launchers. Flusepa combines computational fluid mechanics, AMR algorithms and task-based parallelism based on the StarPU runtime system. We are involved in the supervision of the PhD of Alice Lasserre in this context.
  • CEA-Cesta​ for the PhD of​‌ Abel Calluaud. A direct​​ solver developed at CEA​​​‌ relies on the approximation​ by hierarchical matrices to​‌ reduce both computational and​​ memory costs. Although these​​​‌ developments have met a​ growing demand for increased​‌ simulation accuracy, there are​​ still open problems to​​​‌ pursue these research efforts​ in an HPC context.​‌ In this thesis, we​​ propose to develop and​​ compare several approaches to​​​‌ adapt the granularity of‌ hierarchical tasks and extract‌​‌ parallelism to exploit the​​ multicore computational nodes associated​​​‌ with massively parallel architectures‌ such as GPUs.
  • CEA-Cesta‌​‌ for the PhD of​​ Dimitri Walther. In the​​​‌ context of numerical simulation‌ of electromagnetism, integral methods‌​‌ are among the most​​ widely used because of​​​‌ their power. These methods‌ lead to the solution‌​‌ of dense linear problems​​ and are therefore very​​​‌ expensive. For this reason,‌ hierarchical compression methods have‌​‌ been developed that drastically​​ reduce the cost associated​​​‌ with these matrices. They‌ are based on a‌​‌ hierarchical partitioning of the​​ matrix, and therefore of​​​‌ the mesh, and the‌ efficiency of the compression‌​‌ depends on this partitioning.​​ In this context, the​​​‌ aim of the thesis‌ is to develop efficient‌​‌ and scalable hierarchical partitioners​​ to optimise the compression​​​‌ of the matrix.
  • Eviden‌ for the PhD of‌​‌ Alycia Lisito. For over​​ three years, we have​​​‌ been collaborating with Eviden‌ on the development of‌​‌ an HPL benchmark on​​ top of runtime systems.​​​‌ This work is continued‌ as part of Alycia‌​‌ Lisito's thesis funded by​​ a CIFRE contract. To​​​‌ guarantee a high level‌ of flexibility and portability,‌​‌ it is possible to​​ use a task-based implementation​​​‌ through an executive support‌ (or runtime). This programming‌​‌ model has already proved​​ its effectiveness in the​​​‌ implementation of various parallel‌ algorithms, in particular for‌​‌ dense linear algebra (LU​​ decomposition, Cholesky decomposition, QR,​​​‌ etc.). In this thesis,‌ we will use Inria's‌​‌ existing software stack, through​​ the dense linear algebra​​​‌ library Chameleon and the‌ executive support StarPU. These‌​‌ reference libraries for runtime​​ linear algebra will be​​​‌ studied to enable the‌ scaling up of more‌​‌ complex algorithms such as​​ HPL.
  • Eviden for the PhD of Jean Conan. Within the framework of High-Performance Computing (HPC) tenders, Atos Bull must provide contractual performance guarantees for future supercomputers. However, direct measurement is often impossible during the bidding phase, either because the hardware components (processors, accelerators) are not yet commercially available or because the scale of the proposed system exceeds the testing resources available internally. Performance prediction has become a critical tool, not only for meeting client requirements but also for upstream architecture sizing (such as network topology) and optimizing massively parallel software. The transition to exascale computing introduces unprecedented complexity, driven by the increasing heterogeneity of compute nodes and the intricate structure of high-speed networks. The objective of the thesis is thus to explore novel methodologies for performance prediction, with a primary focus on simulation techniques, to reduce optimization overhead by minimizing the number of large-scale physical runs required to determine optimal execution parameters, and finally to accurately model the impact of node heterogeneity and network architecture on overall system performance.
  • Diabolocom and Inria are now in the final stage of contract negotiation to start a PhD thesis co-supervised by Yulia Gusak and Olivier Beaumont on Optimization of Multi-Stage Generative Model Pipelines for Cost-Efficient, Scalable Inference.

10‌​‌ Partnerships and cooperations

10.1​​​‌ International initiatives

10.1.1 Associate​ Teams in the framework​‌ of an Inria International​​ Lab or in the​​​‌ framework of an Inria​ International Program

ELF Associate Team on Efficient deep Learning Frameworks.

Partners​​​‌

  • TOPAL
  • California Institute of​ Technology (Caltech)

Nowadays, Deep​‌ Learning (DL) and Artificial​​ Intelligence (AI) technologies are​​​‌ incorporated in more and​ more areas to solve​‌ various problems of video,​​ audio, natural language processing,​​​‌ content generation, etc. Frameworks​ based on neural networks,​‌ which are core modules​​ of deep learning models,​​​‌ have been already successfully​ used for action recognition,​‌ weather forecasting, robotic surgery​​ and other inspiring applications​​​‌ [24, 44, 48]. The​ drawbacks of modern neural​‌ networks are that they​​ usually require a significant​​​‌ amount of data and​ a lot of GPU​‌ devices to be trained,​​ which makes them expensive​​​‌ in terms of energy​ and money costs, and​‌ harmful in terms of​​ air emissions [27]. The​​​‌ general question we are​ going to address during​‌ the work of the​​ associate team is: given​​​‌ your application and your​ computation platform, how to​‌ perform the model training​​ efficiently in terms of​​​‌ time/energy?

10.2 International research​ visitors

10.2.1 Visits to​‌ international teams

Research stays abroad

Olivier Beaumont visited​​​‌ Loris Marchal and Pablo​ Piantanida for ten days​‌ in July 2025 at​​ ETS (École de technologie​​​‌ supérieure) to work on​ inference optimization using speculative​‌ decoding. This collaboration led​​ to the submission of​​​‌ an international cooperation project,​ which is currently under​‌ evaluation.

10.3 European initiatives​​

10.3.1 H2020 projects

EUPEX​​​‌

Participants: Olivier Beaumont.

EUPEX project on cordis.europa.eu​‌

  • Title:
    EUROPEAN PILOT FOR​​ EXASCALE
  • Duration:
    From January​​​‌ 1, 2022 to December​ 31, 2025
  • Partners:
    • INSTITUT​‌ NATIONAL DE RECHERCHE EN​​ INFORMATIQUE ET AUTOMATIQUE (INRIA),​​​‌ France
    • GRAND EQUIPEMENT NATIONAL​ DE CALCUL INTENSIF (GENCI),​‌ France
    • VSB - TECHNICAL​​ UNIVERSITY OF OSTRAVA (VSB​​​‌ - TU Ostrava), Czechia​
    • FORSCHUNGSZENTRUM JULICH GMBH (FZJ),​‌ Germany
    • COMMISSARIAT A L'ENERGIE ATOMIQUE ET AUX ENERGIES ALTERNATIVES (CEA), France
    • IDRYMA TECHNOLOGIAS KAI EREVNAS (FOUNDATION FOR RESEARCH AND TECHNOLOGY HELLAS), Greece
    • SVEUCILISTE U ZAGREBU FAKULTET ELEKTROTEHNIKE I RACUNARSTVA (UNIVERSITY OF ZAGREB FACULTY OF ELECTRICAL ENGINEERING AND COMPUTING), Croatia
    • UNIVERSITA DEGLI​​​‌ STUDI DI TORINO (UNITO),​ Italy
    • CYBELETECH (Cybeletech), France​‌
    • UNIVERSITA DI PISA (UNIPI),​​ Italy
    • GRAN SASSO SCIENCE​​​‌ INSTITUTE (GSSI), Italy
    • ISTITUTO​ NAZIONALE DI ASTROFISICA (INAF),​‌ Italy
    • UNIVERSITA DEGLI STUDI​​ DEL MOLISE, Italy
    • E4 COMPUTER ENGINEERING SPA (E4), Italy
    • UNIVERSITA DEGLI​‌ STUDI DELL'AQUILA (UNIVAQ), Italy​​
    • CONSIGLIO NAZIONALE DELLE RICERCHE​​​‌ (CNR), Italy
    • JOHANN WOLFGANG​ GOETHE-UNIVERSITAET FRANKFURT AM MAIN​‌ (GUF), Germany
    • EUROPEAN CENTRE​​ FOR MEDIUM-RANGE WEATHER FORECASTS​​​‌ (ECMWF), United Kingdom
    • BULL​ SAS (BULL), France
    • POLITECNICO​‌ DI MILANO (POLIMI), Italy​​
    • EXASCALE PERFORMANCE SYSTEMS -​​​‌ EXAPSYS IKE, Greece
    • ALMA​ MATER STUDIORUM - UNIVERSITA​‌ DI BOLOGNA (UNIBO), Italy​​
    • PARTEC AG (PARTEC), Germany​​​‌
    • ISTITUTO NAZIONALE DI GEOFISICA​ E VULCANOLOGIA, Italy
    • CINECA​‌ CONSORZIO INTERUNIVERSITARIO (CINECA), Italy​​
    • SECO SPA (SECO SRL),​​​‌ Italy
    • CONSORZIO INTERUNIVERSITARIO NAZIONALE​ PER L'INFORMATICA (CINI), Italy​‌
  • Inria contact:
    Olivier Beaumont​​
  • Coordinator:
    Jean-Robert Bacou (Eviden)​​​‌
  • Summary:

    The EUPEX consortium​ aims to design, build,​‌ and validate the first​​ EU platform for HPC,​​ covering end-to-end the spectrum​​​‌ of required technologies with‌ European assets: from the‌​‌ architecture, processor, system software,​​ development tools to the​​​‌ applications. The EUPEX prototype‌ will be designed to‌​‌ be open, scalable and​​ flexible, including the modular​​​‌ OpenSequana-compliant platform and the‌ corresponding HPC software ecosystem‌​‌ for the Modular Supercomputing​​ Architecture. Scientifically, EUPEX is​​​‌ a vehicle to prepare‌ HPC, AI, and Big‌​‌ Data processing communities for​​ upcoming European Exascale systems​​​‌ and technologies. The hardware‌ platform is sized to‌​‌ be large enough for​​ relevant application preparation and​​​‌ scalability forecast, and a‌ proof of concept for‌​‌ a modular architecture relying​​ on European technologies in​​​‌ general and on European‌ Processor Technology (EPI) in‌​‌ particular. In this context,​​ a strong emphasis is​​​‌ put on the system‌ software stack and the‌​‌ applications.

    Being the first​​ of its kind, EUPEX​​​‌ sets the ambitious challenge‌ of gathering, distilling and‌​‌ integrating European technologies that​​ the scientific and industrial​​​‌ partners use to build‌ a production-grade prototype. EUPEX‌​‌ will lay the foundations​​ for Europe's future digital​​​‌ sovereignty. It has the‌ potential for the creation‌​‌ of a sustainable European​​ scientific and industrial HPC​​​‌ ecosystem and should stimulate‌ science and technology more‌​‌ than any national strategy​​ (for numerical simulation, machine​​​‌ learning and AI, Big‌ Data processing).

    The EUPEX consortium – constituted of key actors on the European HPC scene – has the capacity and the will to provide a fundamental contribution to the consolidation of the European supercomputing ecosystem. EUPEX aims to directly support an emerging and vibrant European entrepreneurial ecosystem in AI and Big Data processing that will leverage HPC as a main enabling technology.

DARE

Participants: Olivier‌ Beaumont, Lionel Eyraud-Dubois‌​‌, Mathieu Faverge,​​ Pierre Ramet, Florent​​​‌ Pruvost.

DARE

  • Title:‌
    A new era for‌​‌ supercomputing in Europe
  • Duration:​​
    From March 1, 2025​​​‌ to March 1, 2026‌
  • Partners (partial list):
    • BARCELONA‌​‌ SUPERCOMPUTING CENTER (BSC)
    • CODASIP​​ GMBH (CODA-DE)
    • AXELERA AI​​​‌ SRL (AXE-IT)
    • OPENCHIP SOFTWARE‌ TECHNOLOGIES SL (OCT)
    • INTERUNIVERSITAIR‌​‌ MICRO-ELECTRONICA CENTRUM (IMEC)
    • FORSCHUNGSZENTRUM​​ JUELICH GMBH (JSC)
    • CINECA​​​‌ CONSORZIO INTERUNIVERSITARIO (CINECA)
    • E4‌ COMPUTER ENGINEERING SPA (E4)‌​‌
    • CHALMERS TEKNISKA HOGSKOLA AB​​ (CHALMERS)
    • POLITECNICO DI MILANO​​​‌ (POLIMI)
    • UNIVERSIDAD COMPLUTENSE DE‌ MADRID (UCM)
    • UNIVERSITAT POLITECNICA DE VALENCIA (UPV)
    • INSTITUT NATIONAL DE RECHERCHE​​​‌ EN INFORMATIQUE ET AUTOMATIQUE‌ (INRIA)
    • THALES (TRT)
    • TECHNISCHE‌​‌ UNIVERSITAET MUENCHEN (TUM)
    • BULL​​ SAS (BULL)
  • Inria contact:​​​‌
    Olivier Sentieys
  • Coordinator:
    Osman‌ Unsal (BSC)
  • Summary:

    DARE‌​‌ explores new paths toward​​ greater European autonomy in​​​‌ HPC and AI by‌ advancing open technologies and‌​‌ fostering homegrown innovation. The​​ project aims to reduce​​​‌ strategic dependencies and strengthen‌ Europe’s ability to shape‌​‌ its digital future.

    DARE’s​​ technologies will power future​​​‌ European supercomputers, enabling breakthroughs‌ in science, industry, and‌​‌ AI. By strengthening Europe’s​​ HPC supply chain and​​​‌ IP portfolio, DARE creates‌ long-term economic, technological, and‌​‌ societal benefits across critical​​ sectors.

    DARE sets out​​​‌ to lay the technological‌ foundations for European digital‌​‌ autonomy in HPC and​​ AI. By combining open​​​‌ RISC-V architectures, chiplet technologies,‌ and a co-designed software‌​‌ ecosystem, DARE aims to​​​‌ deliver working prototypes, shape​ the EU HPC roadmap,​‌ and boost Europe’s ability​​ to build and sustain​​​‌ its own supercomputing value​ chain.

10.4 National initiatives​‌

10.4.1 Inria Challenge

Challenge​​ Cupseli: Collaborative Unified Platform​​​‌ for a Scalable and​ Efficient Learning Infrastructure
  • Duration:​‌
    2025 – 2029
  • Coordinator:​​
    Olivier Beaumont (Inria) and​​​‌ Alexandru Dobrila (Hivenet)
  • Local​ contact:
    Olivier Beaumont & Lionel Eyraud Dubois & Yulia Gusak & Thomas Herault & Philippe Swartvagher
  • Partners:
    Hivenet
  • Inria teams:​‌
    • ARGO and MIMOVE, Inria​​ Paris
    • COAST, Inria Nancy​​​‌ – Grand Est
    • MAGELLAN,​ STACK and WIDE, Inria​‌ Centre at Rennes University​​
    • OCKHAM, Inria Centre of​​​‌ Lyon
    • COATI and NEO,​ Inria Centre at Université​‌ Côte d’Azur
    • TADAAM and​​ TOPAL, Inria Centre of​​​‌ the University of Bordeaux​
  • Summary:
    The Cupseli challenge​‌ aims to demonstrate that​​ it is possible to​​​‌ run complex applications (particularly​ in the field of​‌ machine learning) on heterogeneous,​​ distributed, and volatile resources,​​​‌ while achieving strong parallel​ efficiency and preserving both​‌ accuracy and confidentiality. Building​​ on the combined expertise​​​‌ of hive and Inria​ in storage technologies illustrated​‌ in Alvearium, this​​ strategic partnership explores algorithmic​​​‌ and system solutions to​ optimize computation, memory, and​‌ communications, while ensuring security​​ and fault tolerance. The​​​‌ work is organized around​ three axes: Frugality (adapting​‌ training and inference to​​ limited and dynamic resources),​​​‌ Security and Confidentiality (protecting​ data and models through​‌ encryption, secure enclaves, and​​ defenses against attacks), and​​​‌ Volatility (ensuring robustness and​ performance despite the unpredictable​‌ arrival and departure of​​ resources). The shared goal​​​‌ is to offer a​ green and sovereign alternative​‌ to data centers, by​​ leveraging already-existing resources for​​​‌ the benefit of AI​ and Big Data applications.​‌
Challenge PULSE: Pushing low-carbon​​ services towards the Edge​​​‌
  • Duration:
    2022 – 2026​
  • Coordinator:
    Romain Rouvoy
  • Local​‌ contact:
    Olivier Beaumont &​​ Lionel Eyraud Dubois
  • Partners:​​​‌
    Qarnot Computing, ADEME
  • Inria​ teams:
    • Avalon
    • Ctrl-A
    • Spirals​‌
    • Stack
    • Storm
    • Topal
  • Summary:​​
    The Pulse challenge aims to develop and promote best practices in geo-distributed hardware and software infrastructures for more environmentally friendly intensive computing. The idea is to analyze which solutions are the most relevant, and which levers need to be focused on, to reduce the impact of infrastructures while maximizing the usefulness of their emissions. To this end, the challenge is structured around two complementary research axes to address this technological and environmental issue: holistic analysis of the environmental impact of intensive computing, and implementing more virtuous edge services.

10.5 Public policy support​‌

Olivier Beaumont conducted an​​ expert assessment, in collaboration​​​‌ with IRD and other​ Inria colleagues, on the​‌ current state and future​​ development prospects of high-performance​​​‌ computing in Africa. This​ study was commissioned by​‌ the French Development Agency​​ (Agence Française de​​​‌ Développement).

11 Dissemination​

11.1 Promoting scientific activities​‌

11.1.1 Scientific events: organisation​​

Member of the organizing committees

Philippe Swartvagher and​‌ Emmanuel Agullo (Concace team)​​ were organizing chairs of​​​‌ Compas 2025, the​ French Conference on Parallelism,​‌ Architecture and System.

11.1.2​​ Scientific events: selection

Reviewer

The members of the TOPAL project have also performed reviewing for the following conferences: IPDPS'25, SC'25, HiPC'25.

11.1.3 Journal‌

Reviewer - reviewing activities

The members of the TOPAL project have performed reviewing for Journal of Parallel and Distributed Computing (Lionel Eyraud Dubois, Abdou Guermouche), ACM Transactions on Mathematical Software (Pierre Ramet, Abdou Guermouche), IEEE Transactions on Parallel and Distributed Systems (Lionel Eyraud Dubois, Abdou Guermouche, Mathieu Faverge), SoftwareX (Abdou Guermouche), Parallel Computing (Laércio Lima Pilla, Abdou Guermouche), 4OR - A Quarterly Journal of Operations Research (Lionel Eyraud Dubois).

11.1.4‌​‌ Invited talks

  • Yulia Gusak gave a talk at the Sharp+Foundary @ COLT workshop entitled “Training Neural Networks Under Memory Constraints”.
  • Yulia Gusak gave a talk at AI4Industry'25 entitled “Efficient Training of Neural Networks”.
  • Yulia Gusak gave a talk at the 18th Scheduling for large-scale systems workshop, entitled “Optimizing neural networks training using different types of parallelisms (data/tensor/model/pipeline) and re-materialization”.
  • Laércio Lima‌ Pilla gave a talk‌​‌ at the 18th Scheduling​​ for large-scale systems workshop,​​​‌ entitled “Exploring scheduling solutions‌ for Federated Learning training”.‌​‌
  • Olivier Beaumont gave a​​​‌ talk at the 18th​ Scheduling for large-scale systems​‌ workshop, entitled “Optimized Forward-Backward​​ Rematerialization for Memory-Efficient Pipeline​​​‌ Parallel Training”.

11.1.5 Leadership​ within the scientific community​‌

11.1.6 Scientific expertise​‌

  • Olivier Beaumont conducted an​​ expert assessment, in collaboration​​​‌ with IRD and other​ Inria colleagues, on the​‌ current state and future​​ development prospects of high-performance​​​‌ computing in Africa. This​ study was commissioned by​‌ the French Development Agency​​ (Agence Française de​​​‌ Développement).
  • Olivier Beaumont acted as external evaluator for several EuroHPC calls: Inno4Scale, Energy, FFPlus.
  • Pierre Ramet is​ Scientific Advisor at the​‌ CEA-DAM CESTA.
  • Pierre Ramet participated in the HCERES evaluation committee of the IRFU (Institut de recherche sur les lois fondamentales de l'Univers) at CEA Saclay. The final report was published in March 2025.
  • Abdou Guermouche​​ acted as external evaluator​​​‌ for one ANRT proposal.​

11.1.7 Research administration

  • Pierre​‌ Ramet is the head​​ of the CNRS Satanas​​​‌ department.
  • Pierre Ramet is a member of the Scientific Committee of LaBRI.
  • Philippe​​ Swartvagher is the communication​​​‌ referent for the NumPEx/Exa-SofT​ project.
  • Philippe Swartvagher is​‌ the point of contact​​ in Bordeaux for Grid5000/SLICES-FR​​​‌ infrastructure.
  • Philippe Swartvagher is​ the representative of the​‌ TOPAL team at the​​ Bordeaux CUMI.
  • Philippe Swartvagher is an elected member of the Center Committee of Inria Bordeaux.
  • Abdou Guermouche​​ is the scientific lead​​​‌ of the numerical library​ work package of the​‌ ExaSoft project (PEPR NumPEx).​​
  • Abdou Guermouche is a member of the Scientific Committee of LaBRI.
  • Yulia Gusak​‌ is a PI of​​ the ELF associate team​​​‌ between Topal and Caltech.​
  • Laércio Lima Pilla is​‌ a member of the​​ societal challenges commission at​​​‌ the LaBRI.
  • Laércio Lima​ Pilla is a member​‌ of the committee on​​ gender equality and equal​​​‌ opportunities of the Inria​ Research center at the​‌ University of Bordeaux.
  • Laércio​​ Lima Pilla is a​​​‌ member of the National​ Gender Equality and Equal​‌ Opportunities Committee at Inria.​​

11.2 Teaching - Supervision​​​‌ - Juries - Educational​ and pedagogical outreach

  • Undergraduate​‌ level/Licence:
    • Aurélien Esnard: Network (54h), Software technologies (80h) at Bordeaux University.
    • Pierre Ramet: System programming (24h), Databases (32h), Object programming (48h), Distributed programming (16h), Cryptography (16h), Introduction to unsupervised learning (16h) at Bordeaux University.
    • Philippe Swartvagher: C Programming (46h), Web Programming (36h), Tools for Programming and C project (30h) at Bordeaux INP (Enseirb-MatMeca).
    • Abdou Guermouche: System programming (36h) at Bordeaux University.
    • Mathieu Faverge: Programming environment (26h), Numerical algorithmic (25h), C projects (25h) at Bordeaux INP (Enseirb-MatMeca).
  • Post graduate level/Master:
    • Aurélien​​​‌ Esnard : Network management​ (24h), Network security (24h)​‌ at Bordeaux University.
    • Lionel​​ Eyraud Dubois : Graphs​​​‌ and Algorithms (20h), Complexity​ and Approximation (20h) at​‌ Bordeaux University.
    • Olivier Beaumont​​ : Parallel Algorithms, 20h​​​‌ at Bordeaux INP.
    • Pierre​ Ramet : Cryptography 20h​‌ and Numerical algorithms 40h​​ at Bordeaux INP (​​​‌Enseirb-MatMeca).
    • Philippe Swartvagher​ : Parallel Algorithms (17h),​‌ Project of network and​​ system programming (25h), Operating​​ Systems (15h) at Bordeaux​​​‌ INP (Enseirb-MatMeca).‌
    • Abdou Guermouche Network management‌​‌ 92h, Network security 64h,​​ Operating system 24h at​​​‌ Bordeaux University.
    • Mathieu Faverge‌ : System programming: lecture,‌​‌ practice and project (54h),​​ Linear Algebra for high​​​‌ Performance Computing (9h) at‌ Bordeaux INP (Enseirb-MatMeca‌​‌). He is also​​ in charge of the​​​‌ master 2 internship for‌ the Computer Science department‌​‌ at Bordeaux INP (Enseirb-MatMeca)​​ and he is in​​​‌ charge, with Abdou Guermouche‌ , of the High‌​‌ Performance Computing - High​​ Performance Data Analytics specialty​​​‌ at Enseirb-MatMeca. This is‌ a common training curriculum‌​‌ between the Computer Science​​ and the MatMeca departments​​​‌ at Bordeaux INP and‌ with the Bordeaux University‌​‌ in the context of​​ the Computer Science Research​​​‌ Master.
    • Yulia Gusak :‌ Efficient Deep Learning (Outils‌​‌ pour l'apprentissage) (19h) at​​ Bordeaux INP (Enseirb-MatMeca​​​‌).
    • Laércio Lima Pilla‌ : Algorithms for High-Performance‌​‌ Computing Platforms (16h) at​​ Bordeaux INP (Enseirb-MatMeca​​​‌) and Bordeaux University,‌ Reading articles and scientific‌​‌ documentation (3h) at Bordeaux​​ University.
    • Thomas Herault : Introduction to tensor algebra for the Engineer in Computer Science (9h) at Bordeaux INP (Enseirb-MatMeca); OpenMP programming (8h) at Bordeaux INP (Enseirb-MatMeca).

11.2.1​​ Supervision

  • PhD in progress: Brieuc Nicolas ; Scalable tensor algebra on top of runtime system; started Oct 2024; advisors Thomas Herault, Mathieu Faverge, Abdou Guermouche.
  • PhD in progress: Nicolas Ducarton ; Fault tolerance and task-based programming for large-scale systems; started April 2025; advisors Thomas Herault, Samuel Thibault, Amina Guermouche.
  • PhD in progress: Abel‌​‌ Calluaud; Combined compiler and​​ runtime approach for a​​​‌ direct hierarchical solver; started‌ Nov. 2022; advisors Pierre‌​‌ Ramet , Mathieu Faverge​​ .
  • PhD defended: Jean-François David; Dynamic Scheduling for Inference in Deep Neural Networks; defended in March 2025; advisors Olivier Beaumont, Lionel Eyraud Dubois.
  • PhD in‌ progress: Alycia Lisito; Design‌​‌ and implementation of a​​ portable linear algebra benchmark​​​‌ on runtime systems for‌ performance evaluation of heterogeneous‌​‌ Exascale architectures ; started​​ Nov. 2023; advisors Pierre​​​‌ Ramet , Mathieu Faverge‌ , Matthieu Kuhn (Eviden).‌​‌
  • PhD in progress: Dimitri Walther; started Nov. 2024; advisors Pierre Ramet, Mathieu Faverge, M. Lecouvez (CEA Cesta).
  • PhD defended: Hayfa Tayeb ; Optimization of high-performance applications on heterogeneous computing nodes; started Nov. 2021; advisors A. Guermouche, B. Bramas, M. Faverge; defended March 25th, 2025.
  • PhD in progress:​​ Albert D'Aviau de Piolant​​​‌ ; started October 2023;‌ Energy aware scheduling for‌​‌ exascale architectures. Advisors: Abdou​​ Guermouche and Amina Guermouche.​​​‌
  • PhD in progress: Thomas‌ Morin ; started October‌​‌ 2023; Scheduling recursive task​​ graphs. Advisors: Abdou Guermouche,​​​‌ Samuel Thibault, Pierre-André Wacrenier.‌
  • PhD in progress :‌​‌ Alice Lasserre ; Started​​ Oct. 2022; Optimization of​​​‌ a task-based simulation code‌ on a distributed supercomputer;‌​‌ Advisors: Jean-Marie Couteyen-Carpaye, Raymond​​ Namyst and Abdou Guermouche.​​​‌
  • PhD in progress: Samuel‌ Mendoza; On the Scalability‌​‌ of sparse linear system​​ solvers using the task-based​​​‌ paradigm. Started Sept. 2025;‌ advisors Abdou Guermouche ,‌​‌ Emmanuel Agullo and Alfredo​​​‌ Buttari.
  • PhD in progress:​ Jean Conan; Simulation-based performance​‌ prediction of scientific computing​​ applications on exascale supercomputers;​​​‌ Started March 2025; advisors​ Abdou Guermouche , Louis​‌ Poirel and Arnaud Legrand.​​
  • PhD in progress: Adrien Aguilla–Multner ; started October 2024; Efficient Training of Neural Networks 36, 35. Advisors: Yulia Gusak, Olivier Beaumont.
  • PhD defended: Diane​‌ Orhan ; Modeling and​​ dynamic optimization of software​​​‌ radio chains on heterogeneous​ architectures; defended in December​‌ 2025; advisors Denis Barthou​​ , Christophe Jégo ,​​​‌ and Laércio Lima Pilla​ .
  • PhD in progress:​‌ Alan Lira Nunes ;​​ Scheduling algorithms for the​​​‌ optimization of distributed machine​ learning models on heterogeneous​‌ resources; started in August​​ 2022; advisors Cristina Boeres​​​‌ , Lúcia Drummond ,​ and Laércio Lima Pilla​‌ .
  • PhD in progress:​​ Vanderlei Munhoz Pereira Filho​​​‌ ; Scheduling of task-based​ parallel applications on heterogeneous​‌ Cloud computing environments; started​​ in February 2025; advisors​​​‌ Olivier Aumage , Márcio​ Castro , and Laércio​‌ Lima Pilla .
  • PhD​​ in progress: Giorgio Bettonte​​​‌ ; Large-Scale Artificial Intelligence​ Inference Optimization in Distributed​‌ Cloud Environments; started in​​ October 2025; advisors Olivier​​​‌ Beaumont , Thomas Lambert​ , and Laércio Lima​‌ Pilla .
  • PhD in​​ progress: Tristan Riehs ;​​​‌ Integrate scheduling of asynchronous​ network communications and task​‌ scheduling; started in October​​ 2025; advisors Samuel Thibault​​​‌ (Storm team), Alexandre Denis​ (Tadaam team), and Philippe​‌ Swartvagher .
  • Lionel Eyraud-Dubois​​ and Philippe Swartvagher supervised​​​‌ the internship of Theo​ Grandsart about the use​‌ of task-based runtime systems​​ to implement LLMs.
  • Philippe​​​‌ Swartvagher , with Alexandre​ Denis (Tadaam team) and​‌ Samuel Thibault (Storm team),​​ supervised the internship of​​​‌ Tanguy Chatelain, about the​ anticipation of communications in​‌ task-based parallelism 64.​​
  • Thomas Herault and Philippe Swartvagher supervised the internship of Joachim Robert about communications for AI applications in a heterogeneous and geo-distributed network 45.
  • Thomas Herault and Philippe Swartvagher supervised the pre-PhD period of Fares Boudjaoui about the scheduling of communications in a heterogeneous network.
  • Internship on task-based​​ systems for efficient deep​​​‌ learning (Enrique Galves​ ). Supervised by Yulia​‌ Gusak and Olivier Beaumont​​ .
  • Internship on diffusion​​​‌ model inference speed-up via​ parallelization within solver steps​‌ and solver composition (​​Victor Lucas Rosada Canesin​​​‌ ). Supervised by Yulia​ Gusak .
  • Internship on​‌ efficient teacher–student pipeline-parallel training,​​ with application to Reinforcement​​​‌ Learning from human feedback​ (Mohamed Kherraz ).​‌ Supervised by Yulia Gusak​​ .

11.2.2 Juries

  • Pierre​​​‌ Ramet : chair of​ the PhD jury of​‌ Lise Jolicoeur.
  • Olivier Beaumont : chair of the PhD juries of Luis Lopes Marques and Diane Orhan.
  • Lionel Eyraud​​ Dubois acted as "opponent"​​​‌ for the defense of​ Pirah Noor Soomro at​‌ Chalmers University of Technology.​​
  • Thomas Herault : chair of the HDR jury of Francieli Boito ; examiner on the jury for Atte Torri's PhD defense; examiner on the jury for Abdessalam Benhari's PhD defense.
  • Yulia Gusak​​ : member of the​​​‌ PhD jury of Yannick​ Malot on Quantized DNN​‌ learning algorithms with limited​​ hardware overhead for Edge​​ implementation.
  • Yulia Gusak :​​​‌ member of the PhD‌ monitoring committee (comité de‌​‌ suivi) of Méline Trochon​​ on Adaptive Checkpoint-Restart System​​​‌ with Knowledge of the‌ Network Load.
  • Yulia Gusak‌​‌ : member of the​​ PhD monitoring committee of​​​‌ Rafael Silva on Artificial‌ Intelligence for Cardiac Monitoring:‌​‌ Portable Multimodal Cardiac Function​​ Analysis.

11.3 Popularization

11.3.1​​​‌ Participation in Live events‌

  • As part of the "Circuit Scientifique Bordelais", Philippe Swartvagher explained to high school pupils from the Lycée Stendhal in Aiguillon what research in computer science is and how to become a researcher.
  • As part of the "Fête de la Science", Olivier Beaumont presented HPC to students at Lycée Gaston Crampe, Aire-sur-l'Adour (Landes).
  • Olivier Beaumont participated in several internal events (closed-door sessions, etc.) to present the activities of the Inria Bordeaux center teams at the interface between HPC and AI.
  • As part of Maths en Jeans, Olivier Beaumont worked with groups of students from Andernos high school on combinatorial problems linked to training.
  • On several occasions, we welcomed troisième and seconde students (roughly ninth and tenth grade) into the team, with the participation of Topal's PhD students, for periods ranging from 2 hours to half a day.

12 Scientific production

12.1​​ Major publications

12.2 Publications of​ the year

International journals​‌

International peer-reviewed conferences​

Conferences​​​‌ without proceedings

  • 27 Alycia Lisito, Mathieu Faverge, Matthieu Kuhn, Florent Pruvost and Pierre Ramet. Batching the tasks of the LU factorization with partial pivoting on top of runtime systems. COMPAS 2025 - Conférence francophone d'informatique en Parallélisme, Architecture et Système, Bordeaux, France, June 2025.
  • 28 Brieuc Nicolas, Mathieu Faverge and Thomas Herault. Contraction de tenseurs au-dessus de supports d'exécutions Application à la méthode Coupled-Cluster. COMPAS 2025 - Conférence francophone d'informatique en Parallélisme, Architecture et Système, Bordeaux, France, June 2025.
  • 29 Dimitri Walther, Mathieu Faverge, Matthieu Lecouvez and Pierre Ramet. Algebraic hierarchical partitioning to improve H-matrix compression. PP 2026 - SIAM Conference on Parallel Processing for Scientific Computing, Berlin, Germany, March 2026.

Edition (books, proceedings,​‌ special issue of a​​ journal)

  • 30 Asynchronous Many-Task Systems and Applications: Third International Workshop, WAMTA 2025. Lecture Notes in Computer Science 15690, Saint Louis, Missouri, United States, Springer Nature Switzerland, October 2025.
  • 31 Serverless Computing. IEEE Internet Computing 28(6), January 2025, 5-7.

Doctoral dissertations and​​ habilitation theses

  • 32 Jean-François David. Dynamic scheduling for deep neural network inference. Université de Bordeaux, March 2025.
  • 33 Hayfa Tayeb. Optimizing HPC applications with vectorization and multi-criteria task scheduling on heterogeneous systems. Université de Bordeaux, March 2025.

Reports‌​‌ & preprints

Other scientific publications

12.3 Cited publications

  • 47 Emmanuel Agullo, Olivier Aumage, Mathieu Faverge, Nathalie Furmento, Florent Pruvost, Marc Sergent and Samuel Paul Thibault. Achieving High Performance on Supercomputers with a Sequential Task-based Programming Model. IEEE Transactions on Parallel and Distributed Systems, 2017, 1-1.
  • 48 Emmanuel Agullo, Alfredo Buttari, Abdou Guermouche and Florent Lopez. Implementing Multifrontal Sparse Solvers for Multicore Architectures with Sequential Task Flow Runtime Systems. ACM Trans. Math. Softw. 43(2), August 2016, 13:1-13:22.
  • 49 Emmanuel Agullo, Alfredo Buttari, Abdou Guermouche and Florent Lopez. Task-Based Multifrontal QR Solver for GPU-Accelerated Multicore Architectures. HiPC (best paper award), IEEE Computer Society, 2015, 54-63.
  • 50 Pedro Alonso, Manuel F. Dolz, Francisco D. Igual, Rafael Mayo and Enrique S. Quintana-Ortí. Reducing Energy Consumption of Dense Linear Algebra Operations on Hybrid CPU-GPU Platforms. 2012 IEEE 10th International Symposium on Parallel and Distributed Processing with Applications, 2012, 56-62.
  • 51 Pedro Alonso, Manuel F. Dolz, Rafael Mayo and Enrique S. Quintana-Ortí. Modeling power and energy consumption of dense matrix factorizations on multicore processors. Concurrency and Computation: Practice and Experience 26(17), 2014, 2743-2757. URL: https://onlinelibrary.wiley.com/doi/abs/10.1002/cpe.3162
  • 52 Hartwig Anzt, Jack Dongarra and Enrique S. Quintana-Ortí. Adaptive Precision Solvers for Sparse Linear Systems. Proceedings of the 3rd International Workshop on Energy Efficient Supercomputing (E2SC '15), Austin, Texas, Association for Computing Machinery, New York, NY, USA, 2015. URL: https://doi.org/10.1145/2834800.2834802
  • 53 Olivier Beaumont, Philippe Duchon, Lionel Eyraud-Dubois, Julien Langou and Mathieu Verite. Symmetric Block-Cyclic Distribution: Fewer Communications leads to Faster Dense Cholesky Factorization. SC'22: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (best paper, Algorithm Track), IEEE and ACM, 2022.
  • 54 Olivier Beaumont, Lionel Eyraud-Dubois, Julien Herrmann, Alexis Joly and Alena Shilova. Optimal checkpointing for heterogeneous chains: how to train deep neural networks with limited memory. RR-9302, Inria Bordeaux Sud-Ouest, November 2019.
  • 55 Olivier Beaumont, Lionel Eyraud-Dubois and Alena Shilova. Efficient Combination of Rematerialization and Offloading for Training DNNs. NeurIPS 2021 - Thirty-fifth Conference on Neural Information Processing Systems, Virtual-only Conference, December 2021.
  • 56 Olivier Beaumont, Lionel Eyraud-Dubois and Alena Shilova. MadPipe: Memory Aware Dynamic Programming Algorithm for Pipelined Model Parallelism. 2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), IEEE, 2022.
  • 57 Olivier Beaumont, Lionel Eyraud-Dubois and Alena Shilova. Optimal GPU-CPU Offloading Strategies for Deep Neural Network Training. Euro-Par 2020: Parallel Processing, Cham, Springer International Publishing, 2020, 151-166.
  • 58 Olivier Beaumont, Lionel Eyraud-Dubois and Alena Shilova. Pipelined Model Parallelism: Complexity Results and Memory Considerations. Proceedings of Europar 2021, Lisbon, Portugal, Springer, August 2021.
  • 59 Olivier Beaumont, Lionel Eyraud-Dubois and Mathieu Verite. 2D Static Resource Allocation for Compressed Linear Algebra and Communication Constraints. 2020 IEEE 27th International Conference on High Performance Computing, Data, and Analytics (HiPC), IEEE, 2020, 181-191.
  • 60 Olivier Beaumont, Lionel Eyraud-Dubois, Mathieu Vérité and Julien Langou. I/O-Optimal Algorithms for Symmetric Linear Algebra Kernels. ACM Symposium on Parallelism in Algorithms and Architectures, Association for Computing Machinery: SIGACT, SIGARCH, Philadelphia, United States, July 2022.
  • 61 Olivier Beaumont, Julien Herrmann, Guillaume Pallez and Alena Shilova. Optimal memory-aware backpropagation of deep join networks. Philosophical Transactions of the Royal Society A 378(2166), 2020, 20190049.
  • 62 Rocío Carratalá-Sáez, Mathieu Faverge, Grégoire Pichon, Enrique Salvador Quintana-Ortí and Guillaume Sylvand. Exploiting Generic Tiled Algorithms Toward Scalable H-Matrices Factorizations on Top of Runtime Systems. SIAM PP20 - SIAM Conference on Parallel Processing for Scientific Computing, 2020.
  • 63 Rocío Carratalá-Sáez, Mathieu Faverge, Grégoire Pichon, Guillaume Sylvand and Enrique S. Quintana-Ortí. Tiled Algorithms for Efficient Task-Parallel H-Matrix Solvers. 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), IEEE, 2020, 757-766.
  • 64 Tanguy Chatelain. Anticipation des communications réseau grâce à la connaissance du futur dans le parallélisme à tâche. MA Thesis, Enseirb-Matmeca, September 2025.
  • 65 Viktoriia Chekalina, Georgiy Novikov, Julia Gusak, Alexander Panchenko and Ivan Oseledets. Efficient gpt model pre-training using tensor train matrix representation. Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation, 2023, 600-608.
  • 66 Tianqi Chen, Bing Xu, Chiyuan Zhang and Carlos Guestrin. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174, 2016.
  • 67 Daria Cherniuk, Stanislav Abukhovich, Anh-Huy Phan, Ivan Oseledets, Andrzej Cichocki and Julia Gusak. Quantization aware factorization for deep neural network compression. Journal of Artificial Intelligence Research 81, 2024, 973-988.
  • 68 Robert D. Falgout, Stephanie Friedhoff, Tz. V. Kolev, Scott P. MacLachlan and Jacob B. Schroder. Parallel time integration with multigrid. SIAM Journal on Scientific Computing 36(6), 2014, C635-C661.
  • 69 Martin J. Gander and Stefan Vandewalle. Analysis of the parareal time-parallel time-integration method. SIAM Journal on Scientific Computing 29(2), 2007, 556-578.
  • 70 P. Ghysels, X. S. Li, F.-H. Rouet, S. Williams and A. Napov. An Efficient Multicore Implementation of a Novel HSS-Structured Multifrontal Solver Using Randomized Sampling. SIAM Journal on Scientific Computing 38(5), 2016, S358-S384.
  • 71 Aidan N. Gomez, Mengye Ren, Raquel Urtasun and Roger B. Grosse. The reversible residual network: Backpropagation without storing activations. Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017, 2211-2221.
  • 72 Audrunas Gruslys, Rémi Munos, Ivo Danihelka, Marc Lanctot and Alex Graves. Memory-efficient backpropagation through time. Advances in Neural Information Processing Systems, 2016, 4125-4133.
  • 73 Udit Gupta, Young Geun Kim, Sylvia Lee, Jordan Tse, Hsien-Hsin S. Lee, Gu-Yeon Wei, David Brooks and Carole-Jean Wu. Chasing Carbon: The Elusive Environmental Footprint of Computing. 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), IEEE, 2021, 854-867.
  • 74 Julia Gusak, Daria Cherniuk, Alena Shilova, Alexander Katrutsa, Daniel Bershatsky, Xunyi Zhao, Lionel Eyraud-Dubois, Oleg Shlyazhko, Denis Dimitrov, Ivan Oseledets and Olivier Beaumont. Survey on Large Scale Neural Network Training. The 31st International Joint Conference on Artificial Intelligence (IJCAI), 2022.
  • 75 A. Ida, T. Iwashita, T. Mifune and Y. Takahashi. Parallel Hierarchical Matrices with Adaptive Cross Approximation on Symmetric Multiprocessing Clusters. Journal of Information Processing 22(4), 2014, 642-650.
  • 76 Esragul Korkmaz, Mathieu Faverge, Grégoire Pichon and Pierre Ramet. Deciding Non-Compressible Blocks in Sparse Direct Solvers using Incomplete Factorization. RR-9396, Inria Bordeaux - Sud Ouest, 2021, 16.
  • 77 Navjot Kukreja, Alena Shilova, Olivier Beaumont, Jan Huckelheim, Nicola Ferrier, Paul Hovland and Gerard Gorman. Training on the Edge: The why and the how. 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), IEEE, 2019, 899-903.
  • 78 Xavier Lacoste, Mathieu Faverge, George Bosilca, Pierre Ramet and Samuel Thibault. Taking Advantage of Hybrid Systems for Sparse Direct Solvers via Task-Based Runtimes. 2014 IEEE International Parallel & Distributed Processing Symposium Workshops, Phoenix, AZ, USA, May 19-23, 2014, IEEE Computer Society, 2014, 29-38. URL: https://doi.org/10.1109/IPDPSW.2014.9
  • 79 T. Mary. Block Low-Rank multifrontal solvers: complexity, performance and scalability. Ph.D. Dissertation, Université Toulouse 3 Paul Sabatier, 2017.
  • 80 Salli Moustafa, François Févotte, Mathieu Faverge, Laurent Plagne and Pierre Ramet. Efficient Parallel Solution of the 3D Stationary Boltzmann Transport Equation for Diffusive Problems. Journal of Computational Physics, March 2019.
  • 81 Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R. Devanur, Gregory R. Ganger, Phillip B. Gibbons and Matei Zaharia. PipeDream: generalized pipeline parallelism for DNN training. Proceedings of the 27th ACM Symposium on Operating Systems Principles, 2019, 1-15.
  • 82 David Patterson, Joseph Gonzalez, Quoc Le, Chen Liang, Lluis-Miquel Munguia, Daniel Rothchild, David So, Maud Texier and Jeff Dean. Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350, 2021.
  • 83 Anh-Huy Phan, Konstantin Sobolev, Konstantin Sozykin, Dmitry Ermilov, Julia Gusak, Petr Tichavský, Valeriy Glukhov, Ivan Oseledets and Andrzej Cichocki. Stable Low-rank Tensor Decomposition for Compression of Convolutional Neural Network. European Conference on Computer Vision (ECCV), Springer, 2020, 522-539.
  • 84 Grégoire Pichon, Eric Darve, Mathieu Faverge, Pierre Ramet and Jean Roman. Sparse supernodal solver using block low-rank compression: Design, performance and analysis. International Journal of Computational Science and Engineering 27, July 2018, 255-270.
  • 85 Grégoire Pichon, Mathieu Faverge and Pierre Ramet. Recent Developments Around the Block Low-Rank PaStiX Solver. SIAM Conference on Parallel Processing for Scientific Computing (SIAM PP 2020), 2020.
  • 86 Dalal Sukkari, Hatem Ltaief, David Keyes and Mathieu Faverge. Leveraging Task-Based Polar Decomposition Using PARSEC on Massively Parallel Systems. 2019 IEEE International Conference on Cluster Computing (CLUSTER), IEEE, 2019, 1-12.