2025 Activity report
Project-Team BENAGIL
RNSR: 202324438T
- Research center Inria Saclay Centre at Institut Polytechnique de Paris
- In partnership with: Institut Polytechnique de Paris, TELECOM SUDPARIS
- Team name: Efficient and safe distributed systems
- In collaboration with: Services répartis, Architectures, MOdélisation, Validation, Administration des Réseaux
Creation of the Project-Team: 2023 September 01
Keywords
Computer Science and Digital Science
- A1.1.1. Multicore, Manycore
- A1.1.4. High performance computing
- A1.1.13. Virtualization
- A1.3.5. Cloud
Other Research Topics and Application Domains
- B6.1.1. Software engineering
1 Team members, visitors, external collaborators
Research Scientist
- Gael Thomas [Team leader, INRIA, Senior Researcher, HDR]
Faculty Members
- Mathieu Bacou [TELECOM SUDPARIS, Associate Professor]
- Elisabeth Brunet [TELECOM SUDPARIS, Associate Professor]
- Valentin Honore [ENSIIE, Associate Professor]
- Alexandre Nolin [TELECOM SUDPARIS, Associate Professor, from Nov 2025]
- Pierre Sutra [TELECOM SUDPARIS, Professor, HDR]
- Francois Trahay [TELECOM SUDPARIS, Professor, HDR]
Post-Doctoral Fellows
- Nicolas Derumigny [TELECOM SUDPARIS, Post-Doctoral Fellow]
- Ayush Pandey [TELECOM SUDPARIS, Post-Doctoral Fellow, from Apr 2025]
PhD Students
- Tara Aggoun [TELECOM SUDPARIS, from Sep 2025]
- Mickaël Boichot [TELECOM SUDPARIS, until Jul 2025]
- Adam Chader [TELECOM SUDPARIS]
- Jean-Francois Dumollard [TELECOM SUDPARIS]
- Catherine Guelque [TELECOM SUDPARIS]
- Boubacar Kane [TELECOM SUDPARIS, until Jan 2025]
- Harena Rakotondratsima [TELECOM SUDPARIS, from Sep 2025]
- Marie Reinbigler [TELECOM SUDPARIS, until Sep 2025]
- Jules Risse [INRIA]
- Jana Toljaga [TELECOM SUDPARIS]
- Guillermo Toyos Marfurt [TELECOM SUDPARIS, from Aug 2025]
- Nguyen Tung [TELECOM SUDPARIS, from Aug 2025]
- Lucas Van Lanker [CEA]
- Nevena Vasilevska [TELECOM SUDPARIS]
Interns and Apprentices
- Tara Aggoun [TELECOM SUDPARIS, Intern, from Mar 2025 until Sep 2025]
- Joni Dervishi [Telecom SudParis, Intern, from Mar 2025 until May 2025]
- Harena Rakotondratsima [TELECOM SUDPARIS, Intern, from Mar 2025 until Sep 2025]
Administrative Assistant
- Julienne Moukalou [INRIA]
2 Overall objectives
Distributed systems are pivotal to many applications used in our daily life: AI, data analytics, online gaming, social networks, web services, healthcare, etc. Because they have to sustain massive workloads, these systems scatter computation across many units, which coordinate to store the input, execute the computation and return results in a usable manner to the application. Inefficiencies in these infrastructures hinder the ability to handle large computations. They also waste energy and hardware resources. Errors at runtime may result in painful data losses and exploitable security loopholes. As a consequence, designing and implementing such systems in an efficient and safe manner is essential, and all the major IT companies are strongly committed to it.
The Benagil team works on the design and implementation of more efficient and safer distributed systems. To that end, the Benagil team focuses on the core system components at the frontier with the hardware: hypervisors, operating systems, language runtimes, storage systems and communication libraries. Improving the efficiency and safety of distributed systems is a challenging task. Modern distributed systems manage large pools of machines, serve a plethora of users and process very large datasets. Consequently, they are inherently complex, and both their design and implementation are notoriously hard. Complexity arises from the software stack, the algorithms at the core of these systems, as well as the hardware itself:
- System software level. A typical modern computer system runs many software components: hypervisors (e.g., KVM/Qemu), operating systems (e.g., Linux), container systems (e.g., Linux containers), language runtimes (e.g., the Java virtual machine) and specialized runtimes for HPC (e.g., MPI), data analytics (e.g., Spark) or AI (e.g., PyTorch). Such software is now very large. For instance, the latest version of the Linux kernel comprises over 22,000,000 lines of code.
- Distributed system level. As pointed out above, modern systems are distributed, involving many machines. These machines are connected by heterogeneous networks, ranging from fast local networks (e.g., Infiniband or Ethernet 10Gb) to high-latency planet-scale connections. Many of these systems have to be highly available, that is, they need to be responsive 99.999% of the time. This requires complex monitoring mechanisms and replication algorithms that solve trade-offs between availability and performance. Distributed systems also need to make fine-grained task and data placement choices. They aggregate resources, have to use them efficiently, and must provide sufficient isolation between the multiple applications using them.
- Machine level. Internally, each machine is a very complex entity. It is today composed of multiple processors, memory banks and devices interconnected by a complex network. A processor contains tens of cores with finely tunable cache hierarchies and out-of-order execution pipelines. Each core is a very dense unit of computation, as evidenced by the specification of Intel Skylake that covers more than 4,800 pages. A machine also often includes multiple heterogeneous accelerators and specialized hardware such as persistent memory that provides durability at the nanosecond scale, GPUs specialized for massively parallel computations, FPGAs used to offload complex computations from the CPUs, and TPUs specialized in deep neural network computation. Access to these components is not uniform, in terms of both bandwidth and latency. Heterogeneity must be taken into account at multiple levels of the system stack. This makes data access optimization especially challenging. This complexity also opens security breaches, such as cache timing attacks, code timing attacks and data access pattern attacks. Preventing these attacks requires solving complex trade-offs between performance, security and usability.
The inherent complexity of distributed systems makes analyzing their performance and safety difficult. This difficulty is increased by complex and unexpected interactions between software and hardware components. Besides that, understanding and improving the system components in the context of distributed systems requires expertise in many areas: hypervisors, operating systems, containerization, language runtimes, compilation, network, architecture, web, databases, data analytics runtimes, cloud runtimes and distributed algorithms. As an example, in previous work, we observed a large performance degradation in a data analytics application written in Scala (namely, PageRank in Apache Spark). This phenomenon was caused by poor memory placement performed by the Java virtual machine on a non-uniform memory architecture. This issue was also reinforced by the use of a (system) virtual machine that blindly allocates memory from any memory bank. Another source of inefficiency was the hypervisor, which was continuously moving memory without telling the virtual machine. All in all, understanding and solving the performance bottleneck at each level of the system stack took us 8 years. It involved 3 PhD students and 6 researchers with expertise in different system areas.
3 Research program
The Benagil team works on improving the performance and the safety of the core system components of distributed systems. In order to achieve this goal, we propose a systematic approach. This approach first consists in profiling and analyzing current distributed systems to identify their limits in terms of efficiency and/or safety when they execute large distributed applications. Then, building upon this analysis, we develop new algorithms, mechanisms and components to improve them.
The Benagil team is structured along three main axes which articulate the above approach. The first axis is devoted to performance profiling and analysis. In this axis, we introduce new tools and techniques to automatically analyze the performance of a large distributed system. Based on this analysis, we identify performance issues, which we use as input in the two other axes to improve performance. The two other axes study two aspects of the system components. In the system components for cloud infrastructure axis, we devise new system techniques to improve the performance and safety of two core system components used in cloud infrastructure: virtualization and storage. In the system components for emerging computing models, we propose new system mechanisms and interfaces for two pivotal upcoming programming models: serverless and edge computing.
3.1 Performance analysis
Due to the high complexity of modern large-scale distributed applications, understanding performance problems is a tedious task even for the most experienced programmers. A performance bottleneck may arise from different interactions, between hardware and software, or between different software components. Even just a single contended lock, or a falsely shared cache line, in one of the system components may lead to a dramatic slowdown.
Because of this complexity, manually identifying the root cause of a performance bottleneck is notoriously difficult. In this axis, we propose to help the developer by designing new profiling tools able to handle the complexity of hardware and software stacks, and able to scale with the size of the system.
3.2 System components for the cloud
In this axis, we aim at studying and designing the next generation of systems for cloud infrastructures. Today, these infrastructures are undergoing major changes at the hardware level with the generalization of ultra-fast networks at the microsecond scale (e.g., RDMA) and storage devices (e.g., NVMe or Non-Volatile Memory). Their joint arrival requires radically revisiting the way we design two core system components of any cloud infrastructure: the virtualization system and the storage system.
3.3 System components for emerging computing models
At a higher level of the system stack, we are witnessing the arrival of two new computing models: serverless computing and edge computing. These computing models deeply change the assumptions under which the current system components were built. Current system components assume long-running applications and powerful computing infrastructures. However, this is no longer the case with these two new computing models. In serverless computing, applications are split into short-lived tasks. In edge computing, applications execute at the border of the network, atop low-performance hardware.
4 Application domains
Overall, the Benagil team mostly specializes in the low-level components of distributed systems. This specialization is at the frontier of security, hardware, high-performance computing (HPC), machine learning, data analytics and databases. With respect to security, the team studies some system aspects, such as trusted execution environments (e.g., Intel SGX) to protect applications, or data replication to improve availability. However, the Benagil team is not a security team per se. Regarding hardware, the Benagil team has a strong background in using modern hardware such as persistent memory or GPUs. This knowledge is crucial to efficiently use the hardware in system components. However, the team only uses hardware and does not directly design it. This is also the case with HPC, machine learning and data analytics. The Benagil team understands the system requirements of these highly demanding applications, and uses them to benchmark its system components. However, the team only rarely contributes to these runtimes themselves. The Benagil team also has strong knowledge of the storage system components used in databases. This includes the algorithmic and implementation concerns related to data distribution, consistency, replication and persistence. However, the Benagil team is not specialized in databases in general.
5 Highlights of the year
- Four PhD students of the team defended their PhD in 2025: Boubacar Kane, Mickaël Boichot, Marie Reinbigler, and Adam Chader
- The team obtained three new grants: the ANR JCJC VHS, the ANR Centeanes, and the PIA Camelia
6 Latest software developments, platforms, open data
6.1 Latest software developments
6.1.1 EZTrace
- Keywords: MPI communication, Execution trace, Traces, High performance computing, Performance analysis, HPC, OpenMP, CUDA
- Functional Description: Improving the performance of parallel applications (numerical simulation, for example) is an important phase of development. This requires detecting the various phases of the application and understanding their performance. Automatically generating execution traces lets the developer quickly and simply detect the various phases of the application and understand its behavior.
- URL:
- Publications: hal-01257904v1, hal-00707236v1, hal-03215663v1, hal-03276036v1, hal-02179717v1, inria-00587216v1, hal-00918733v1, hal-00865845v1, tel-03278305v1
- Contact: Francois Trahay
- Participant: 2 anonymous participants
6.1.2 Pallas
- Keywords: Performance analysis, HPC, High performance computing, Execution trace
- Functional Description: Pallas is a generic trace format tailored for conducting various post-mortem performance analyses of traces describing large executions of HPC applications. During the execution of the application, Pallas collects events and detects their repetitions on-the-fly. When storing the trace to disk, Pallas groups the data from similar events or groups of events together in order to later speed up trace reading. The Pallas format allows faster trace analysis compared to other trace formats.
- URL:
- Contact: Francois Trahay
- Participant: an anonymous participant
6.1.3 numamma
- Keywords: NUMA, Memory Allocation, Profiling
- Functional Description: NumaMMa is both a NUMA memory profiler/analyzer and a NUMA application execution engine. The profiler runs an application while gathering information about its memory accesses. The analyzer visually reports information about the memory behavior of the application, making it possible to identify memory access patterns. Based on the results of the analyzer, the execution engine can execute the application efficiently by choosing where to allocate memory pages.
- URL:
- Publications:
- Contact: Francois Trahay
- Participant: an anonymous participant
6.1.4 ForkNox
- Name: ForkNox: a micro-hypervisor to protect Linux
- Keywords: Virtualization, Security
- Functional Description: ForkNox is a micro-hypervisor designed to protect Linux. By leveraging virtualization techniques, ForkNox can revoke read, write, and execute permissions for specific memory regions of Linux. This ensures that, even if Linux is under attack, the attacker cannot modify those parts of the system.
- Release Contributions: Initial version of the software.
- News of the Year: Initial version of the software.
- URL:
- Contact: Gael Thomas
- Participant: 4 anonymous participants
6.1.5 VoliMem
- Name: VoliMem: a lightweight virtualization for processes
- Keyword: Virtualization
- Functional Description: VoliMem is a small library that remaps a native process inside a virtual machine. Thanks to this, the process gains access to low-level system hardware primitives, such as a page table in user space or fast inter-processor interrupts.
- Release Contributions: Initial version of the prototype
- News of the Year: Initial implementation of the software.
- URL:
- Contact: Gael Thomas
- Participant: 3 anonymous participants
6.1.6 Tele-GC
- Name: Tele-GC: a garbage collector for disaggregated memory
- Keywords: Garbage Collection, Java, Disaggregated memory
- Functional Description: Tele-GC is a garbage collector specifically designed for disaggregated memory. It runs the application on the compute node while the garbage collector operates on the memory node. Tele-GC leverages the discrepancy between the cache on the compute node and the memory on the memory node to avoid any synchronization during a collection.
- URL:
- Contact: Gael Thomas
6.1.7 FaaSLoad
- Keywords: Cloud computing, Serverless, Function-as-a-Service, Measures, Resource utilization, Workload injection, Performance measure
- Scientific Description: FaaSLoad is a tool to gather fine-grained data about the performance and resource usage of the programs that run on Function-as-a-Service cloud platforms. It considers individual instances of functions to collect hardware and operating-system performance information, by monitoring them while injecting a workload. FaaSLoad helps build a dataset of function executions to train machine learning models, study the behavior of function runtimes at fine grain, and replay real workload traces for in situ observations.
- Functional Description: Invoke functions in a Function-as-a-Service platform, and gather data about their performance and their resource usage to understand their behavior in Serverless environments.
- Release Contributions: Stabilization and opening to outsiders.
- News of the Year: Release of public version 2.0 (and then 2.1.0), the first version mature enough to be useful to outsiders. Published in a dedicated scientific paper at OPODIS'24.
- URL:
- Publications:
- Contact: Mathieu Bacou
- Participant: an anonymous participant
7 New results
This year, the Benagil team carried out research projects along three axes.
In the performance analysis axis, the Benagil team studied: (i) the optimization of performance trace representations to improve analysis time, (ii) methods for measuring energy consumption at a fine granularity, and (iii) the performance prediction of an application when we change hardware.
In the system components for cloud infrastructures axis, the Benagil team studied: (i) how we can simplify the use of persistent memory by relying on a page table to identify the dirty set of a transaction, (ii) the protection of the internal data structures of the Linux kernel with virtualization techniques, and (iii) the memory collection of a large heap in a disaggregated context.
In the system components for emerging computing models axis, the Benagil team worked on: (i) analyzing large images on a modest cluster, and (ii) adjusting data consistency to the actual needs of an application.
7.1 Performance analysis
7.1.1 Scalable trace format
Participants: Catherine Guelque, Valentin Honoré, Philippe Swartvegher [Inria TOPAL], François Trahay.
Identifying performance bottlenecks in a parallel application is tedious, especially because it requires analyzing the behavior of various software components, as bottlenecks may have several causes and symptoms. Detecting a performance problem means investigating the execution of an application and applying several performance analysis techniques. To do so, one can use a tracing tool to collect information describing the behavior of the application. At the end of the execution, a trace file in a specific format is available to the application user, which can be used to conduct a complete post-mortem investigation. When analyzing the performance of an application running at a large scale, the post-mortem analysis needs to load thousands of trace files in memory, and process them. This quickly becomes impractical for large-scale applications, as memory gets exhausted and the number of open files exceeds the system capacity.
As part of the Exa-SofT project, Catherine Guelque proposes Pallas, a generic trace format tailored for conducting various post-mortem performance analyses of traces describing large executions of HPC applications 7. During the execution of the application, Pallas collects events and detects their repetitions on-the-fly. When storing the trace to disk, Pallas groups the data from similar events or groups of events together in order to later speed up trace reading. We conducted large-scale experiments on the Jean-Zay supercomputer to evaluate Pallas. Our experiments show that the Pallas format allows faster trace analysis compared to other evaluated trace formats. Overall, the Pallas trace format allows the interactive trace analysis that is required when a user investigates a performance problem. These results were presented at IPDPS'25 7.
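To give an intuition of what detecting repetitions on-the-fly can look like, here is a minimal Python sketch. It is not the Pallas algorithm or API: the `Loop` type, the `append_event` and `compress` functions and the `max_body_len` bound are all hypothetical names introduced for illustration. Each incoming event is appended to a token list, and the tail of the list is folded into a loop token whenever it repeats the block just before it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Loop:
    """A folded repetition: `body` occurred `count` times in a row."""
    body: tuple
    count: int


def append_event(tokens, event, max_body_len=8):
    """Append one event, then fold repetitions found at the tail of `tokens`."""
    tokens.append(event)
    folded = True
    while folded:
        folded = False
        for k in range(1, max_body_len + 1):
            tail = tuple(tokens[-k:])
            # Case 1: the tail repeats the body of the loop just before it.
            if (len(tokens) > k and isinstance(tokens[-k - 1], Loop)
                    and tokens[-k - 1].body == tail):
                previous = tokens[-k - 1]
                del tokens[-k - 1:]
                tokens.append(Loop(previous.body, previous.count + 1))
                folded = True
                break
            # Case 2: the last 2k tokens are two identical blocks of length k.
            if len(tokens) >= 2 * k and tokens[-2 * k:-k] == tokens[-k:]:
                del tokens[-2 * k:]
                tokens.append(Loop(tail, 2))
                folded = True
                break
    return tokens


def compress(events, max_body_len=8):
    """Fold a whole event stream, one event at a time."""
    tokens = []
    for event in events:
        append_event(tokens, event, max_body_len)
    return tokens
```

For instance, the stream `send, recv, send, recv, send, recv, barrier` is stored as one loop of body `(send, recv)` with count 3, followed by `barrier`. Grouping the timestamps of the folded events together is what would then make reading faster, but that part is outside this sketch.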
7.1.2 Fine-grain energy measurement
Participants: Jules Risse, Amina Guermouche [Inria STORM], François Trahay.
The power consumption of supercomputers is and will remain a major concern. As a matter of fact, Frontier, one of the fastest supercomputers in the world, consumes around 20 MW. As a consequence, reducing the power consumption of HPC applications is mandatory. The first step towards reducing the power consumption of programs is being able to monitor their energy consumption. Servers usually contain wattmeters able to measure the power consumption of the CPU, the memory, the GPU, etc. However, these wattmeters only provide coarse-grained energy measurements, with a typical measurement period of tens of milliseconds. During this period of time, the application may execute hundreds of tasks. As a result, analyzing the power consumption of an application at the microsecond scale is tedious.
As part of the Exa-SofT project, Jules Risse's PhD investigates fine-grained energy measurement in StarPU. Since StarPU executes many instances of a few types of tasks, it should be possible to build an energy consumption model of each type of task. The energy consumption model can then be provided to StarPU so that the task scheduling takes into account both the performance of tasks and their energy consumption. In this project, we measure the energy consumption of a server (its CPU, GPU, etc.) at coarse grain (typically, one sample every 20 ms), and we log which tasks were executed during this period of time. By repeating this many times, we build a linear system that can be solved to model the energy consumption of microsecond-scale tasks. We show that the model can accurately predict the energy consumption of fine-grained tasks running on CPUs. We conducted similar experiments on GPUs, where the accuracy is lower due to erroneous power consumption metrics reported by the GPU. These results were presented at Cluster'25 9.
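The linear system mentioned above can be illustrated with a small Python sketch. This is an assumption-laden toy and not the method implemented in StarPU: each coarse sample is modeled as the sum of (number of tasks of each type) times (unknown energy of that type) plus an unknown idle energy per window, and the unknowns are recovered by least squares. The function names, the idle term and the synthetic numbers are all hypothetical.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: samples do not separate the task types")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - known) / aug[i][i]
    return x


def fit_task_energy(samples):
    """Least-squares fit of per-task energy (joules) and idle energy per window.

    samples: list of (task_counts, measured_energy), one entry per
    measurement window; task_counts[i] is the number of tasks of type i
    executed during that window.
    """
    rows = [list(counts) + [1.0] for counts, _ in samples]  # last column: idle term
    energies = [energy for _, energy in samples]
    n = len(rows[0])
    # Normal equations: (A^T A) x = A^T b
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    atb = [sum(r[i] * e for r, e in zip(rows, energies)) for i in range(n)]
    return solve_linear_system(ata, atb)


def synthetic_samples(task_energy, idle_energy, counts_per_window):
    """Build noise-free samples from known (made-up) energies, for testing."""
    return [
        (counts, sum(n * e for n, e in zip(counts, task_energy)) + idle_energy)
        for counts in counts_per_window
    ]
```

With noise-free synthetic windows, the fit recovers the per-task energies used to generate the data. On real measurements, many more windows than unknowns are needed so that the noise averages out.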
7.1.3 Performance prediction
Participants: Lucas Van Lanker, Hugo Taboada [CEA/DAM], Mickaël Boichot, Adrien Roussel [CEA/DAM], Patrick Carribault [CEA/DAM], Elisabeth Brunet, François Trahay.
With the advent of heterogeneous systems that combine CPUs and GPUs, designing a supercomputer becomes more and more complex. The hardware characteristics of GPUs significantly impact the performance. Choosing the GPU that will maximize performance for a limited budget is tedious because it requires predicting the performance on a non-existing hardware platform.
During his PhD, Mickaël Boichot studied the relation between the expressed parallelism and the memory footprint of loops in order to extrapolate which data sizes provide sufficient parallelism to load a new GPU architecture. In the case of memory oversubscription, his work focused on how to efficiently exploit the new unified memory feature of GPUs in order to place data where it is most often reused. These results are detailed in his thesis and in 6.
Lucas Van Lanker's PhD explores means for predicting the performance of kernels running on GPUs. We propose a methodology that analyzes the behavior of an application running on an existing platform, and projects its performance on another GPU based on the target hardware characteristics. The performance projection relies on a hierarchical roofline model as well as on a comparison of the kernel’s assembly instructions of both GPUs to estimate the operational intensity of the target GPU. Our experiments show that the performance can be predicted accurately at a low cost.
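As background, a hierarchical roofline bounds the attainable throughput of a kernel by the compute peak and by one bandwidth ceiling per memory level. The Python sketch below shows only this generic bound, under the assumption that the operational intensities are already known. It does not reproduce the assembly-based estimation described above, and the `Gpu` type and the hardware figures used with it are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gpu:
    name: str
    peak_gflops: float      # compute ceiling, in GFLOP/s
    bandwidth_gbs: dict     # memory level -> bandwidth, in GB/s


def attainable_gflops(gpu, intensity):
    """Hierarchical roofline bound.

    intensity: memory level -> operational intensity (FLOP per byte moved
    at that level). The bound is the lowest of the compute ceiling and of
    the per-level ceilings intensity * bandwidth.
    """
    ceilings = [gpu.peak_gflops]
    ceilings += [intensity[level] * gpu.bandwidth_gbs[level] for level in intensity]
    return min(ceilings)


def projected_time_s(gpu, kernel_gflop, intensity):
    """Lower bound on the kernel execution time on `gpu`, in seconds."""
    return kernel_gflop / attainable_gflops(gpu, intensity)


# Made-up hardware figures, for illustration only.
SOURCE = Gpu("source", 10000.0, {"L1": 8000.0, "L2": 3000.0, "HBM": 900.0})
TARGET = Gpu("target", 20000.0, {"L1": 20000.0, "L2": 5000.0, "HBM": 2000.0})
KERNEL_INTENSITY = {"L1": 0.5, "L2": 2.0, "HBM": 8.0}
```

With these invented figures, the kernel is bound by the L1 ceiling on the source GPU and by both the L1 and L2 ceilings on the target GPU, so the projected speedup is the ratio of the two bounds.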
7.2 System components for the cloud
7.2.1 VoliPMem: using persistent memory transparently
Participants: Jana Toljaga, Tara Aggoun, Gaël Thomas, Mathieu Bacou, Nicolas Derumigny.
Handling persistent memory is complex because the application can fail at any time, leaving the persistent memory in an inconsistent state. To avoid inconsistency, the developer must use transactions, which are applied with an all-or-nothing semantics at the end of a transaction. To achieve this, persistent memory writes are first executed in volatile memory and only applied to persistent memory at the end of a transaction. Unfortunately, currently, to propagate the writes, the developer has to explicitly indicate the modified memory locations, which is cumbersome and error-prone. With VoliPMem (PhD thesis of Jana Toljaga), we propose to transparently identify the memory locations modified inside a transaction for the developer. To achieve this, we rely on a library called VoliMem, which, instead of executing a process natively, executes it in a lightweight virtual machine. By executing the process inside a virtual machine, the application can directly manage a secondary page table within its address space. In VoliPMem, we use this page table to automatically identify the modified memory locations. Specifically, VoliPMem identifies the modified pages by traversing the page table to find the dirty pages. At the end of a transaction, VoliPMem collects these dirty pages and copies them atomically to persistent memory with an all-or-nothing semantics. Thanks to these abstractions, using persistent memory becomes straightforward: the developer simply has to indicate the boundaries of the transaction and no longer has to worry about annotating each write.
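The following Python sketch simulates the idea of committing only dirty pages at the end of a transaction. It is a toy model, not the VoliPMem or VoliMem API: the `ToyPersistentHeap` class and its methods are hypothetical, the dirty bits are set by the `write` method instead of by the MMU, and replacing a Python list reference stands in for the atomic propagation to persistent memory.

```python
PAGE_SIZE = 4096


class ToyPersistentHeap:
    """Toy model: writes go to volatile pages, commit copies dirty pages."""

    def __init__(self, num_pages):
        self.persistent = [bytes(PAGE_SIZE) for _ in range(num_pages)]
        self.volatile = [bytearray(PAGE_SIZE) for _ in range(num_pages)]
        self.dirty = [False] * num_pages  # stand-in for page-table dirty bits

    def write(self, address, data):
        # On real hardware the MMU sets the dirty bit; here we emulate it.
        for offset, byte in enumerate(data):
            page, index = divmod(address + offset, PAGE_SIZE)
            self.volatile[page][index] = byte
            self.dirty[page] = True

    def commit(self):
        """End of transaction: propagate all dirty pages, or none of them."""
        dirty_pages = [p for p, is_dirty in enumerate(self.dirty) if is_dirty]
        staged = list(self.persistent)
        for p in dirty_pages:
            staged[p] = bytes(self.volatile[p])
        self.persistent = staged  # single swap: stands in for the atomic copy
        self.dirty = [False] * len(self.dirty)
        return dirty_pages

    def abort(self):
        """Discard the writes of the current transaction."""
        for p, is_dirty in enumerate(self.dirty):
            if is_dirty:
                self.volatile[p] = bytearray(self.persistent[p])
        self.dirty = [False] * len(self.dirty)
```

The caller only marks the transaction boundaries (here, by calling `commit` or `abort`) and never lists the modified locations, which is the usability gain described above.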
7.2.2 ForkNox: protecting the internal data structures of Linux
Participants: Jean-François Dumollard, Harena Rakotondratsima, Gaël Thomas, Mathieu Bacou, Nicolas Derumigny.
Linux is designed as a monolithic kernel, which leaves it vulnerable as soon as an attacker can execute code in system mode. Technically, if an attacker can execute code in system mode, the attacker can modify any part of Linux: the attacker can alter Linux’s code, modify any data structure, and disable any security mechanisms installed by Linux. With ForkNox (PhD thesis of Jean-François Dumollard), we propose a new technique to enforce Linux’s security, even if an attacker is able to execute code inside the kernel. To do so, we introduce a new protection ring by leveraging the processor’s virtualization feature. Specifically, ForkNox is a Linux module that runs as a hypervisor while Linux runs as a guest operating system. By leveraging virtualization, ForkNox can revoke read, write, or execute permissions for important Linux memory regions, which allows ForkNox to protect Linux against an attacker capable of executing code inside the Linux kernel.
7.2.3 Tele-GC: a garbage collector for disaggregated memory
Participants: Adam Chader, Nevena Vasilevska, Yohan Pipereau [Engineer at Gandi], Gaël Thomas, Mathieu Bacou, Nicolas Derumigny.
A disaggregated infrastructure simplifies hardware resource management. In detail, in a disaggregated infrastructure, the cloud system can dynamically adjust the hardware resources allocated to a virtual machine to its actual use by allocating hardware resources from a specialized blade. Designing a garbage collector in this context is challenging because of the high memory latency. With TéléGC (PhD of Adam Chader), we propose a new garbage collector (GC) for a disaggregated infrastructure. TéléGC runs on the memory node while the application runs on the CPU node. It runs concurrently with the application while avoiding most synchronization. To achieve this, we introduce the write-back barrier. With the write-back barrier, instead of synchronously executing a barrier when the application writes to the heap, TéléGC executes a barrier asynchronously later, when the CPU node writes back a page to the memory node. Thanks to this, the application does not pay the cost of synchronizing with the GC, boosting its performance. Our evaluation on a disaggregated infrastructure shows that TéléGC significantly reduces both completion time and pause time compared to Mako and G1, which are the state-of-the-art GCs of Hotspot.
7.3 System components for emerging computing models
7.3.1 Efficient Pyramidal Analysis of Gigapixel Images on a Decentralized Modest Computer Cluster
Participants: Marie Reinbigler, Rishi Sharma [EPFL], Rafael Pires [EPFL], Elisabeth Brunet, Anne-Marie Kermarrec [EPFL], Catalin Fetita [Telecom SudParis].
Analyzing gigapixel images is recognized as computationally demanding. In this work, we introduce PyramidAI, a technique for analyzing gigapixel images with reduced computational cost 8. The proposed approach adopts a gradual analysis of the image, beginning with lower resolutions and progressively concentrating on regions of interest for detailed examination at higher resolutions. We investigated two strategies for tuning the accuracy-computation performance trade-off when implementing the adaptive resolution selection, validated against the Camelyon16 dataset of biomedical images. Our results demonstrate that PyramidAI decreases the amount of data processed for analysis by up to 2.65x, while preserving the accuracy in identifying relevant sections on a single computer. To ensure democratization of gigapixel image analysis, we evaluated the potential to use mainstream computers to perform the computation by exploiting the parallelism potential of the approach. Using a simulator, we estimated the best data distribution and load balancing algorithm according to the number of workers. The selected algorithms were then implemented, and experiments in a real-world setting confirmed the same conclusions. Analysis time is reduced from more than an hour to a few minutes using 12 modest workers, offering a practical solution for efficient large-scale image analysis.
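The coarse-to-fine principle can be sketched in a few lines of Python. This is only an illustration of the general idea, not the PyramidAI implementation: the `pyramidal_analysis` function, the quadtree layout (each tile has four children at the next level), the `score` callback and the threshold are assumptions, and `make_toy_score` replaces the real classifier with a lookup in a hand-picked set of interesting tiles.

```python
def pyramidal_analysis(score, top_tiles, num_levels, threshold):
    """Coarse-to-fine traversal of an image pyramid.

    score(level, x, y) returns the interest of tile (x, y) at `level`
    (level 0 is the lowest resolution). Only the children of tiles whose
    score reaches `threshold` are examined at the next level.
    Returns (tiles kept at the finest level, number of tiles scored).
    """
    frontier = list(top_tiles)
    scored = 0
    for level in range(num_levels):
        kept = []
        for (x, y) in frontier:
            scored += 1
            if score(level, x, y) >= threshold:
                kept.append((x, y))
        if level == num_levels - 1:
            return kept, scored
        # Each kept tile is refined into its 4 children at the next level.
        frontier = [(2 * x + dx, 2 * y + dy)
                    for (x, y) in kept for dx in (0, 1) for dy in (0, 1)]
    return [], scored


def make_toy_score(interesting, num_levels):
    """Stand-in for a classifier: a tile is interesting if it contains
    at least one of the hand-picked finest-level tiles."""
    def score(level, x, y):
        shift = num_levels - 1 - level
        hit = any((fx >> shift, fy >> shift) == (x, y) for fx, fy in interesting)
        return 1.0 if hit else 0.0
    return score
```

On a toy pyramid with 4 levels, a 2x2 top level and two interesting tiles at the finest level, the traversal scores 20 tiles instead of the 256 tiles of an exhaustive analysis at full resolution.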
7.3.2 Efficient and Principled Approaches to Scalable Programming
Participants: Boubacar Kane, Tung Nguyen, Pierre Sutra.
Parallel programs require software support to coordinate access to shared data. For this purpose, modern programming languages provide strongly-consistent shared objects. To account for their many usages, these objects offer a large API. However, in practice, each program calls only a tiny fraction of the interface. Leveraging such an observation, we propose to tailor a shared object for a specific usage. We call this principle adjusted objects.
Adjusted objects already exist in the wild. Our work provides their first systematic study. We explain how everyday programmers already adjust common shared objects (such as queues, maps, and counters) for better performance. We present the formal foundations of adjusted objects using a new tool to characterize scalability, the indistinguishability graph. Leveraging this study, we introduce a library named DEGO to inject adjusted objects in a Java program. In micro-benchmarks, objects from the DEGO library improve the performance of standard JDK shared objects by up to two orders of magnitude. We also evaluate DEGO with a Retwis-like benchmark modeled after a social network application. On a modern server-class machine, DEGO boosts the performance of the benchmark by up to 1.7x. This work was conducted during the PhD of Boubacar Kane, who successfully defended in January 2025 11.
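As a rough illustration of the principle, the Python sketch below adjusts a counter for a program that only ever increments it and occasionally reads it. It is not the DEGO API (DEGO targets Java), and the class and function names are hypothetical. Since no caller needs compare-and-set or decrement, each thread can update a private cell without synchronizing with the others, and a read sums the cells. Because of CPython's global interpreter lock, the sketch shows the structure of the idea and not its performance benefit.

```python
import threading


class IncrementOnlyCounter:
    """Counter adjusted to a usage restricted to increment() and read()."""

    def __init__(self):
        self._local = threading.local()
        self._cells = []                      # one single-writer cell per thread
        self._register_lock = threading.Lock()

    def increment(self):
        cell = getattr(self._local, "cell", None)
        if cell is None:
            # First call from this thread: register a private cell.
            cell = [0]
            self._local.cell = cell
            with self._register_lock:
                self._cells.append(cell)
        cell[0] += 1  # only the owning thread ever writes this cell

    def read(self):
        with self._register_lock:
            return sum(cell[0] for cell in self._cells)


def run_demo(num_threads=4, increments_per_thread=1000):
    """Increment the counter from several threads and return the final value."""
    counter = IncrementOnlyCounter()

    def worker():
        for _ in range(increments_per_thread):
            counter.increment()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter.read()
```

Once all writers have finished, `read` returns the exact number of increments. The price of the adjustment is the narrower interface: an operation such as compare-and-set could not be supported by this representation.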
A key question in concurrent programming is determining the synchronization power of a shared object. An object has consensus number n when n is the largest number of processes for which we may solve consensus with copies of this object and registers. The indistinguishability graph can be used to characterize the consensus number of a shared object. However, this characterization is incomplete, and it covers only objects that are readable. In a seminal work, Herlihy and Ruppert provide an exact characterization of the consensus number for deterministic one-shot objects (that can be accessed by each process at most once). In [12], we extend the study of Herlihy and Ruppert to deterministic two-shot objects in a two-process system. Such objects can be accessed by each process at most twice. We introduce three disjoint classes of two-shot objects. The first class is similar to one-shot objects in the sense that the first operation call gives enough information to solve consensus. Objects in the second class do not provide any useful information after the first call to one of the two processes. The last class contains objects for which calling the object twice is always necessary. In this class, the second operation to call is chosen adaptively, which may lead to using different operations in different schedules. For instance, the second operation used in a solo run might differ from the one called when processes interleave. We show that these three classes provide an exact characterization of the two-shot deterministic objects able to solve two-process consensus.
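To make the notion of solving consensus with an object and registers concrete, here is a sketch of the classic two-process construction from a test-and-set bit and two registers. Each process accesses the bit once. The class and method names are hypothetical, and the example illustrates the textbook construction, not the two-shot classes introduced in this work.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReferenceArray;

/** Sketch: two-process consensus from one test-and-set bit and two registers. */
final class TwoProcessConsensus<T> {
    private final AtomicBoolean bit = new AtomicBoolean(false);       // test-and-set object
    private final AtomicReferenceArray<T> proposals =
        new AtomicReferenceArray<>(2);                                // one register per process

    /** Called at most once by each process; id must be 0 or 1. */
    T decide(int id, T value) {
        proposals.set(id, value);                 // announce the proposal first
        boolean lost = bit.getAndSet(true);       // the first caller reads false and wins
        // The winner wrote its proposal before winning, so the loser can read it.
        return lost ? proposals.get(1 - id) : value;
    }

    /** Runs both processes concurrently and checks agreement and validity. */
    static boolean demo() {
        TwoProcessConsensus<String> consensus = new TwoProcessConsensus<>();
        String[] decided = new String[2];
        Thread p0 = new Thread(() -> decided[0] = consensus.decide(0, "a"));
        Thread p1 = new Thread(() -> decided[1] = consensus.decide(1, "b"));
        p0.start();
        p1.start();
        try {
            p0.join();
            p1.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        boolean agreement = decided[0].equals(decided[1]);
        boolean validity = decided[0].equals("a") || decided[0].equals("b");
        return agreement && validity;
    }

    public static void main(String[] args) {
        System.out.println("agreement and validity hold: " + demo());
    }
}
```

Both processes return the proposal of whichever process set the bit first, regardless of how their steps interleave.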
8 Bilateral contracts and grants with industry
8.1 Bilateral contracts with industry
Participants: Mickaël Boichot, Lucas Van Lanker.
- Contracts with CEA for the PhDs of Mickaël Boichot (2021-2025) and Lucas Van Lanker (2024-2027)
- Adobe research gift to support our research activities.
9 Partnerships and cooperations
9.1 National initiatives
PEPR NumPex – Exa-SofT
Participants: Catherine Guelque, Jules Risse, Élisabeth Brunet, Valentin Honoré, François Trahay.
Partners: Université Paris Saclay, Télécom SudParis, CEA, CNRS, Inria
Coordinator: Raymond Namyst, Inria Bordeaux
Funding: 453 k€
Date: 2023-2028
Summary: Though significant efforts have been devoted to the implementation and optimization of several crucial parts of a typical HPC software stack, most HPC experts agree that exascale supercomputers will raise new challenges, mostly because the trend in exascale compute-node hardware is toward heterogeneity and scalability: Compute nodes of future systems will have a combination of regular CPUs and accelerators (typically GPUs), along with a diversity of GPU architectures. Meeting the needs of complex parallel applications and the requirements of exascale architectures raises numerous challenges which are still left unaddressed. As a result, several parts of the software stack must evolve to better support these architectures. More importantly, the links between these parts must be strengthened to form a coherent, tightly integrated software suite. Our project aims at consolidating the exascale software ecosystem by providing a coherent, exascale-ready software stack featuring breakthrough research advances enabled by multidisciplinary collaborations between researchers. The main scientific challenges we intend to address are: productivity, performance portability, heterogeneity, scalability and resilience, performance and energy efficiency.
PEPR Cloud – DiVa
Participants: Jana Toljaga, Nevena Vasilevska, Tara Aggoun, Mathieu Bacou, Nicolas Derumigny, Gaël Thomas.
Partners: LIP6, LIG, IRIT, Inria Paris, Benagil/Telecom SudParis
Coordinator: Gaël Thomas, Télécom SudParis
Funding: 864 k€
Date: 2023-2030
Summary: The DiVa project investigates new virtualization mechanisms tailored for a disaggregated infrastructure and for an infrastructure composed of small edge infrastructures connected to powerful data centers. In the context of a disaggregated cloud, the DiVa project will focus on the virtualization interfaces, the scheduling, the use of programmable networks, and replication mechanisms. In the context of the continuum between the edge and the cloud, the DiVa project will focus on migration between heterogeneous machines, edge/edge and edge/data center network optimizations, and virtualization interfaces for micro virtual machines.
PEPR Cloud – Archi-CESAM
Participants: Jean-François Dumollard, Harena Rakotondratsima, Mathieu Bacou, Nicolas Derumigny, Gaël Thomas.
Partners: Université de Rennes, Benagil/Telecom SudParis, Institut Polytechnique de Grenoble, CEA, Inria
Coordinator: Denis Dutoit, CEA
Funding: 580 k€
Date: 2023-2030
Summary: European sovereignty in the cloud also means sovereignty over hardware, especially processors and accelerators. Dennard's Law is now over and Moore's Law is slowing down. In this technological context, which will continue, the improvement of processor performance will require hardware architectures that evolve towards more parallelism (multi-core), more specialization (accelerators), towards a closer relationship between computing and memory and new types of interconnections between components. On the other hand, by dissociating hardware resources (computing, memory, interconnection) from logical resources, virtualization facilitates the deployment of converged architectures that bring together the computing, storage and network infrastructure. The cloud gains in modularity, speed and agility for the deployment of new services with optimal use of resources. Hardware disaggregation on the one hand and resource virtualization on the other are making the intermediate adaptation layer increasingly complex, difficult to validate and prone to failure. The Archi-CESAM project proposes to rethink the hardware (computing, memory and interconnection) so that it is co-designed with the application in a perspective of converged architecture and trust, in an environment known for its abundance of data to be processed. The Archi-CESAM project addresses this major evolution of the Cloud in a global and coordinated approach between distributed architectures, acceleration, interconnection and security bricks, without forgetting the design methods.
ANR PRC – FrugalDinet
Participants: Gaël Thomas.
Partners: LIP6, LISTIC, Benagil/Inria Saclay, New York University Shanghai
Coordinator: Pierre Sens, LIP6
Funding: 171 k€
Date: 2024-2028
Summary: In recent years, innovative hardware technologies have emerged to enhance distributed computations in datacenters. Programmable switches enable user-defined processing of packets in transit. Similarly, SmartNIC DPUs offload data-centric computations from host CPUs. Simultaneously, the urgency of the climate and energy crises has emphasized the need for frugal architectures. These technologies present an opportunity to reduce the overall network traffic of distributed services by offloading computations from CPUs to the network itself. They should be integrated in the design of fundamental distributed system components like failure detectors, group membership, reliable broadcast, or consensus. We propose FrugalDinet, a framework to build reliable, low-cost distributed services that leverages these technologies to minimize CPU usage in datacenters and, in turn, their energy consumption. Our holistic approach extends key algorithms such as leader election, group membership and broadcasting, necessary for the creation of reliable services. We intend not only to offload algorithmic logic onto network elements, but also to make opportunistic use of the information available at the switch level. We also plan to introduce a new high-level programming language facilitating transparent utilization of these frugal, reliable distributed services. The implemented frugal algorithms and programming abstractions will be applied to design a distributed transaction system.
ANR PRCE – Centeanes
Participants: Pierre Sutra.
Partners: Télécom SudParis, Université Paris Cité, École Polytechnique, Université de Paris 6.
Coordinator: Pierre Sutra
Funding: 196 k€
Date: 2025-2029
Summary: Cloud computing of the past was concerned with the management of infrastructure resources, e.g., servers, VMs or containers. Today, serverless computing promises to abstract this worry away. In this new paradigm, the quantum of computation is the function; a function-as-a-service platform automatically manages deployment of functions, executing them on demand and at scale. This greatly simplifies access to the cloud, letting the application developer focus on getting the application code right, and ignore infrastructure issues.
Unfortunately, serverless computing remains difficult to use and to reason about. Indeed, the serverless environment is inherently unpredictable and non-deterministic, making it hard to understand and to control. Being distributed, serverless must cope with concurrency, unpredictable failures, and the impossibility of consensus. On top of that, serverless poses new challenges to the application programmer. Events may cause the same function to be invoked multiple times and/or to be terminated before it has finished. Functions are stateless, starting afresh every time; but they must often access an external storage service, and are thus exposed to stale or inconsistent state. Finally, existing platforms suffer from inefficiencies, such as excessive data movement or random placement.
The Centeanes project aims to address these challenges from the perspectives of correctness, efficiency, and expressivity, in a real application context. It will develop tools for specifying, programming and running correct-by-design serverless applications. In detail, we propose a formal framework to study the foundations of serverless computing, including function composition and fault-tolerance. This framework is implemented in a lightweight runtime environment, where stateful operations and data locality are first-class citizens. We also construct a toolchain to program and verify serverless applications executing in the runtime. This verification toolchain simplifies the programming of applications and helps enforce their correctness. The design is informed by, and will be validated against, benchmarks and full-scale industrial cloud or edge applications built with Eclipse Zenoh.
ANR PRC – Maplurinum
Participants: Adam Chader, Mathieu Bacou, Gaël Thomas.
Partners: INPG, Inria Rennes, CEA, Benagil/Telecom SudParis
Coordinator: Gaël Thomas, Telecom SudParis
Funding: 184 k€
Date: 2021-2025
Summary: High-performance architectures are increasingly heterogeneous and often incorporate specialized hardware. We first saw the generalization of GPUs in the most powerful machines, followed a few years later by the introduction of FPGAs. More recently, we have seen the emergence of many other accelerators, such as tensor processing units (TPUs) for DNNs or variable-precision FPUs. Recent hardware manufacturing trends make it very likely that specialization will not only persist, but increase in future supercomputers. Because manually managing this heterogeneity in each application is complex and not maintainable, we propose in this project to revisit how we design both hardware and operating systems in order to better hide the heterogeneity from supercomputer users. In summary, we propose to rethink the hardware/software boundary in order to hide the heterogeneity behind a common minimal instruction set and a unified address space.
ANR JCJC – VHS
Participants: Valentin Delis, François Trahay.
Partners: CEA/DAM, Benagil/Telecom SudParis
Coordinator: Valentin Delis, ensIIE
Funding: 225 k€
Date: 2025-2029
Summary: Magnetic tapes have been used to store computer data since the 1950s, so the layman now often considers them an outdated technology. However, tape storage is and will remain essential in many fields such as academic research, international organisations or cloud companies for its strong practical benefits: low cost per TB, low energy consumption, longevity, etc. This dependency on tapes has motivated industrial efforts in technology improvements, resulting in much faster progress in data density on tape than on disk. After recent breakthroughs in the materials used, tape capacity is expected to make a massive leap in the coming years, increasing to several hundred TB per tape. This evolution will amplify the main benefits of tape storage.
However, tapes have often been primarily considered for archiving cold data, because of their main drawback: it takes around a minute to mount a tape from its shelf into a drive and position the reading head before starting to read data. This explains the current lack of academic effort to optimize relatively frequent data accesses. Nevertheless, more and more research projects need to handle tremendous volumes of data, which are not only destined to be archived but also regularly accessed for scientific analysis. Budget constraints impose the usage of tape storage, and optimizing tape data access therefore becomes more and more significant, and is not limited to improving archive retrieval.
The general idea of the VHS project is to propose new interactions between resource management and tape systems. Using filesystems on tapes, we plan to design novel data placement strategies that provide efficient data accesses by treating tapes as a level of the storage hierarchy and optimizing its operational cost. Our methodology starts from the tapes themselves, to better understand the physical processes involved in the different operations. Then, we will leverage this knowledge to derive interactions between tape and disk storage systems in order to improve data placement.
PIA Camelia
Participants: Élisabeth Brunet, Gaël Thomas.
Partners: CEA, Inria, CNRS, IMT, UGA, ECL, SU, INSA Rennes, UM, UB, IJL, INL, IM2NP, UJM, Mines Paris, UniStra, UPVD, UBO
Coordinator: C. Auliac and O. Santieys
Funding: 319 k€
Date: 2026-2032
Summary: This project aims to design and develop an environment and a software stack for the training and inference of large neural networks in resource-demanding environments such as the near-edge, the Cloud, or HPC. As such, it must make it possible to take full advantage of the hardware accelerators developed in projects 1 (digital accelerators), 2 (analog accelerators) and 3 (hardware co-integration platform) of the programme. The solutions developed must also be flexible enough to allow the later use of hardware targets from outside the programme, notably French or European industrial solutions such as those of SiPearl, STMicroelectronics and Kalray. In addition, ease of adoption by AI engineers and researchers (who are often unfamiliar with the hardware or the low-level software layers they use) and compatibility with emerging AI concepts are key to the success of the project and will be studied closely.
Chist-ERA - Redonda
Participants: Pierre Sutra.
Partners: Institut Mines-Télécom, IMDEA Software Institute, University of Surrey, Royal Holloway College - University of London, University of Neuchâtel
Coordinator: Pierre Sutra
Funding: 320 k€
Date: 2023-2026
Summary: The Redonda project's ambition is to design a next-generation replication protocol for blockchain. To achieve this, the project taps into recent advances in networking, secure computing and distributed systems. At the scale of a datacenter, the protocol relies on two recent technologies: RDMA and TEE. Both technologies are leveraged to create a sub-microsecond consensus layer that tolerates Byzantine failures. TEEs are also used in a novel upgradable and portable smart contract engine to execute blockchain transactions across a variety of infrastructures and hardware. Between datacenters, the protocol relies on leaderless state-machine replication. This recent approach decomposes transaction ordering into two sub-tasks that can execute in parallel, without a central coordinator to bottleneck the system. To ensure security and safety at runtime, the Redonda project creates the blockchain protocol by composing mechanically-verified building blocks. The new blockchain protocol is assessed using real hardware against benchmarks and publicly available traces. Our target is for it to scale across hundreds of geo-distributed nodes while offering 100k+ transactions per second and split-second latency.
10 Dissemination
10.1 Promoting scientific activities
10.1.1 Scientific events: organisation
- Gaël Thomas : annual Inria Defi OS workshop (11/2025), annual PEPR DiVa workshop (05/2025)
- Mathieu Bacou : annual thematic workshop of the working group "Virtualization" of CNRS's GDR RSD about virtualization of systems and networks (12/2025)
Member of the organizing committees
- François Trahay : participation in the organization of the Per3S workshop as part of the steering committee
10.1.2 Scientific events: selection
Member of the steering committees
- Gaël Thomas : chair of the steering committee of Compas (French)
- François Trahay : member of the steering committee of Compas (French)
- Pierre Sutra : member of the steering committee for PaPoC
Member of the conference program committees
- Gaël Thomas : member of the Usenix ATC 2025, Eurosys 2025, Apsys 2025, and Resdis 2025 program committees.
- François Trahay : member of the ISC 2025 program committee.
- Élisabeth Brunet : member of the PDS 2025, Compas 2025, and SC 2025 program committees.
- Valentin Delis : member of the Cluster 2025 and ESA 2025 program committees.
- Pierre Sutra : TPC member for Middleware 2025, ICDCS 2025, PaPoC 2025, and SRDS 2025.
Reviewer - reviewing activities
- Valentin Delis : Reviewer for TPDS
10.1.3 Invited talks
- Gaël Thomas
  - 06/2025, invited talk at Epita, téléGC: a barrier-free garbage collector for disaggregated memory
  - 07/2025, invited talk at Sushi Seminar, téléGC: a barrier-free garbage collector for disaggregated memory
- François Trahay
  - workshop ECLAT
  - scientific day (journée scientifique) of Institut Polytechnique de Paris
  - tutorial on performance analysis with EZTrace, as part of the Compas conference
- Pierre Sutra
  - keynote at LADC '25
10.1.4 Scientific expertise
- François Trahay was a member of the selection committee for an Associate Professor position at Télécom SudParis, May 2025.
- Élisabeth Brunet was a member of the selection committee for an Associate Professor position at INSA Lyon in 2025.
- Élisabeth Brunet was a member of the committee awarding the Prix de thèse Gilles Kahn of the SIF (Société Informatique de France) in 2025.
- Mathieu Bacou was a member of the selection committee for twin Associate Professor positions at Université de Lille, May 2025.
10.1.5 Research administration
- François Trahay : head of research action "Energy Efficiency" of the Energy4Climate interdisciplinary center.
- François Trahay : head of working group "Large Scale Computing" of CNRS's GDR C4P.
- Mathieu Bacou : co-head of working group "Virtualization" of CNRS's GDR RSD.
10.2 Teaching - Supervision - Juries - Educational and pedagogical outreach
- Master: François Trahay is the head of the master of Computer Science at Institut Polytechnique de Paris
- Master: Pierre Sutra and Gaël Thomas are the heads of the Parallel & Distributed Systems master track at Institut Polytechnique de Paris
- Engineering: Élisabeth Brunet is in charge of the AI 3rd year track at Télécom SudParis
- Engineering: Pierre Sutra is in charge of the ASR 3rd year track at Télécom SudParis
- Engineering: Valentin Delis is in charge of the CIDM HPC track at ensIIE (2nd and 3rd years of the engineering program). He holds the Chair "Technologies avancées & émergentes pour la Souveraineté Numérique" between ensIIE and CEA, and has 330h of teaching duties (including administrative duties) at ensIIE from 1st to 3rd year, in both initial and apprenticeship training programs. He teaches in the CPES Data Science course at Lycée International de Saclay (course leader: Maria Boritchev, Télécom Paris) and is a jury member for the oral examination of Concours Mines-Telecom.
10.2.1 Supervision
PhD in progress:
- Jean-François Dumollard, "Virtualization techniques to enforce the security of an operating system", supervised by G. Thomas, M. Bacou and N. Derumigny
- Catherine Guelque, "Large scale performance analysis", supervised by F. Trahay and V. Delis
- Martin Horth, "Static analysis methods for obfuscated software reverse engineering", supervised by F. Trahay and O. Levillain
- Jules Risse, "Fine-grain energy consumption measurement", supervised by F. Trahay and A. Guermouche
- Jana Toljaga, "Virtualization techniques for persistent memory", supervised by G. Thomas, M. Bacou and N. Derumigny
- Guillermo Toyos Marfurt, "A Next-Generation State-Machine Replication Protocol for Blockchain", supervised by P. Sutra and P. Kuznetsov
- Lucas Van Lanker, "Performance projection of GPU applications", supervised by F. Trahay, E. Brunet and H. Taboada
- Nevena Vasilevska, "Hardware cache controlled by software for memory disaggregation", supervised by G. Thomas, J. Dumas and N. Derumigny
- Tara Aggoun, "Design and implementation of a disaggregated Java virtual machine", supervised by G. Thomas and J.-P. Lozi
- Harena Rakotondratsima, "Design and implementation of in-process isolation mechanisms", supervised by G. Thomas and N. Derumigny
- Minh Tung Nguyen, "Computability and Complexity in Mixed-Trust Distributed Systems", supervised by P. Sutra
Defended PhD:
- Mickaël Boichot, "Caracterizing parallel applications for porting to multi-GPUs systems", supervised by P. Carribault and E. Brunet
- Adam Chader, "Large-scale garbage collectors", supervised by G. Thomas and M. Bacou
- Marie Reinbigler, "Frugal multiresolution analysis of gigapixel images : application to biomedical data and beyond", supervised by C. Fetita and E. Brunet
- Boubacar Kane, "Les objets ajustés : Une approche bien fondée et efficace pour la programmation concurrente", supervised by P. Sutra
10.2.2 Juries
- Gaël Thomas
  - Reviewer of the PhDs of Papa Assane Fall, Nahuel Palumbo, Xiaoxiang (William) Wu (Australia), Lana Scravaglieri, Simon Lambert, Adrian Khelili, Aghiles Ait Messaoud, Guillermo Polito (HdR)
  - Examiner of the PhDs of Léo Cosseron, Eduardo Tomasi Ribeiro, Himadri Pandya, Matthieu Bettinger, Ayush Pandey
- François Trahay
  - President of the PhD committee for Boubacar Kane, Institut Polytechnique de Paris
  - Reviewer of the PhDs of Aymeric Millan, Louis Boulanger, Himadri Pandya
- Pierre Sutra
  - President of the PhD committee for Luciano Freitas de Souza, Institut Polytechnique de Paris
- Élisabeth Brunet : examiner of the PhD of Youssouph Faye.
- Mathieu Bacou
  - Expert member of the jury to award the VAE "Expert en Sécurité des Systèmes d'Information (ESSI)" of ANSSI
  - Examiner of the PhD of Jean-Baptiste Decourcelle
11 Scientific production
11.1 Major publications
- [1] PALLAS: a generic trace format for large HPC trace analysis. In IPDPS 2025: 39th IEEE International Parallel & Distributed Processing Symposium, Milan, Italy, 2025.
- [2] Adjusted Objects: An Efficient and Principled Approach to Scalable Programming. In MIDDLEWARE '25: 26th International Middleware Conference, Nashville (Tennessee), United States, ACM, December 2025, pp. 215-227.
- [3] An Exact Characterization of the Two-shot Deterministic Objects Solving Two-process Consensus. In PODC '25: ACM Symposium on Principles of Distributed Computing, Santa María Huatulco, Mexico, ACM, June 2025, pp. 477-487.
11.2 Publications of the year
International journals
International peer-reviewed conferences
Conferences without proceedings
Edition (books, proceedings, special issue of a journal)
Doctoral dissertations and habilitation theses
Reports & preprints