2025Activity reportProject-TeamCORSE
RNSR: 201421151J- Research center Inria Centre at Université Grenoble Alpes
- In partnership with:Université de Grenoble Alpes
- Team name: Compiler Optimization and Run-time SystEms
- In collaboration with:Laboratoire d'Informatique de Grenoble (LIG)
Creation of the Project-Team: 2016 July 01
Each year, Inria research teams publish an Activity Report presenting their work and results over the reporting period. These reports follow a common structure, with some optional sections depending on the specific team. They typically begin by outlining the overall objectives and research programme, including the main research themes, goals, and methodological approaches. They also describe the application domains targeted by the team, highlighting the scientific or societal contexts in which their work is situated.
The reports then present the highlights of the year, covering major scientific achievements, software developments, or teaching contributions. When relevant, they include sections on software, platforms, and open data, detailing the tools developed and how they are shared. A substantial part is dedicated to new results, where scientific contributions are described in detail, often with subsections specifying participants and associated keywords.
Finally, the Activity Report addresses funding, contracts, partnerships, and collaborations at various levels, from industrial agreements to international cooperations. It also covers dissemination and teaching activities, such as participation in scientific events, outreach, and supervision. The document concludes with a presentation of scientific production, including major publications and those produced during the year.
Keywords
Computer Science and Digital Science
- A1. Architectures, systems and networks
- A1.1. Architectures
- A1.1.1. Multicore, Manycore
- A1.1.2. Hardware accelerators (GPGPU, FPGA, etc.)
- A1.1.3. Memory models
- A1.1.4. High performance computing
- A1.1.12. Non-conventional architectures
- A1.6. Green Computing
- A2.1.6. Concurrent programming
- A2.1.7. Distributed programming
- A2.1.10. Domain-specific languages
- A2.2. Compilation
- A2.2.1. Static analysis
- A2.2.2. Memory models
- A2.2.3. Memory management
- A2.2.4. Parallel architectures
- A2.2.6. GPGPU, FPGA...
- A2.2.8. Code generation
- A2.3.1. Embedded systems
- A7.1. Algorithms
- A8.2. Optimization
- A8.2.1. Operations research
- A9.2.1. Supervised learning
- A9.2.3. Reinforcement learning
- A9.2.5. Bayesian methods
- A9.2.8. Deep learning
Other Research Topics and Application Domains
- B3.1. Sustainable development
- B4.5. Energy consumption
- B5.3. Nanotechnology
- B6.1.2. Software evolution, maintenance
- B6.6. Embedded systems
- B6.7. Computer Industry (harware, equipments...)
- B9.1. Education
1 Team members, visitors, external collaborators
Research Scientists
- Fabrice Rastello [Team leader, INRIA, Senior Researcher, HDR]
- Guillaume Iooss [INRIA, Researcher]
- Hugo Pompougnac [INRIA, Starting Research Position, from Mar 2025]
Faculty Members
- Florent Bouchez [UGA, Associate Professor]
- Ylies Falcone [UGA, Associate Professor, until Jul 2025]
- Guillaume Huard [UGA, Associate Professor]
- Manuel Selva [GRENOBLE INP, Associate Professor]
Post-Doctoral Fellow
- Hugo Pompougnac [INRIA, Post-Doctoral Fellow, until Feb 2025]
PhD Students
- Alban Dutilleul [INRIA, from Oct 2025]
- Leon Frenot [UGA, from Sep 2025]
- Sylvain Noiry [KALRAY]
Technical Staff
- Rui Cesista [INRIA, Engineer, from Dec 2025]
- Gallas Gaye [ GROUPE LR TECHNOLOGIES, Engineer, from Aug 2025]
- Gallas Gaye [GROUPE LR TECHNOLOGIES, Engineer, from Feb 2025 until Jun 2025]
- Christophe Guillon [INRIA, Engineer]
- Zikang Liu [INRIA, Engineer, until Aug 2025]
- Liam Semeria [INRIA, Engineer, from Nov 2025]
- Lucas Sube [INRIA, Engineer, from Oct 2025]
Interns and Apprentices
- Arthur Debaud [INRIA, Intern, from Jun 2025 until Aug 2025]
- Emile Dechenaud [INRIA, Intern, from Apr 2025 until Jul 2025]
- Alban Dutilleul [INRIA, Intern, from Mar 2025 until Aug 2025]
- Nikolaos Mavrogeorgis [INRIA, Intern, until Mar 2025]
- Oussama Ouaghlissi [UGA, Intern, from May 2025 until Jul 2025]
- Ryan Rouibah [INRIA, Intern, from May 2025 until Jul 2025]
- Ryan Rouibah [INRIA, Intern, from Mar 2025 until Apr 2025]
- Leopold Saint Denis [UGA, Intern, from May 2025 until Jul 2025]
- Lucas Sube [INRIA, Intern, from Mar 2025 until Apr 2025]
- Lucas Sube [INRIA, Intern, until Mar 2025]
- Anh Khoi Truong [INRIA, Intern, from Jun 2025 until Jul 2025]
- Chloe Viennot [INRIA, Intern, from Jun 2025 until Aug 2025]
Administrative Assistants
- Marie-Anne Dauphin-Rizzi [INRIA]
- Maria Immaculada Presseguer [INRIA]
2 Overall objectives
Our research team investigates the foundations and tools required to fully exploit modern computing systems, with a particular focus on the tight interaction between compilers, software stacks, and microarchitectural design. Our overarching objective is to bridge the growing gap between increasingly complex hardware platforms and the software that must efficiently and reliably use them.
As computing enters a new era dominated by hardware accelerators (especially for artificial intelligence and data-intensive workloads) traditional approaches must be revisited. Performance, energy efficiency, and predictability increasingly depend on a deep understanding of architectural and microarchitectural details such as memory hierarchies, dispatch queues, or specialized compute units. Our work recognizes that these details must be first-class citizens in the software toolchain rather than hidden behind opaque layers.
Compilers and compiler-related tools play a central role in this vision. We develop techniques that combine high-level programmability with low-level awareness, enabling automated transformations, optimizations, and code generation that are explicitly informed by hardware characteristics. By integrating microarchitectural knowledge into compilation, scheduling, and analysis tools, we aim to unlock the potential of systems composed of CPUs, GPUs, and domain-specific accelerators.
Ultimately, our goal is to contribute to a co-designed hardware–software ecosystem in which compiler technology serves as the key enabler for performance portability.
3 Research program
Today, the team’s research is structured around four interconnected axes, each targeting critical aspects of computational efficiency and education:
Automatic Derivation of Parametric Data Movement Complexity
This research axis focuses on understanding and optimizing data movement in algorithms to improve computational performance. The team has developed methods to derive both lower and upper bounds for data movement. We use polyhedral compilation tools to automate the derivation of lower bounds, addressing challenges like small problem sizes and specific patterns that hinder tiling. To determine upper bounds, we leveraged our expertise in the polyhedral model, data movement complexity estimation, and mathematical proof techniques to create an algorithm that calculates symbolic over-approximations of data movement for tiled affine codes. This algorithm helps in determining optimal tile sizes and loop orderings. One of the objectives of this axis is to gain insights that help the development of more efficient algorithms and tools in the third axis.
Automatic Resource Characterization and Performance Feedback
This axis focuses on developing tools for characterizing CPU architectures and debugging performance issues. The team created tools for automatic resource characterization, which can be used for modeling the performance of computing kernels. These tools enhance CPU performance understanding and optimization, requiring expertise in CPU architecture, performance modeling, and binary instrumentation. The insights gained are crucial for optimizing computational kernels, as explored in the next axis.
Performance Modeling, Schedule Optimization, and Code Generation for Tensor Computations
This research axis focuses on optimizing tensor computations, which are crucial for various applications including maching learning. The team develops optimization schemes, code generators, as well as autotuning methods, evaluates cache models, and explores reinforcement learning for complex architectures. We also optimize Sparse Matrix Multi-vector multiplication for graph neural networks and pruned convolutional neural networks. This work requires expertise in performance modeling, schedule optimization, code generation, cache modeling, and reinforcement learning to create efficient tensor computation solutions. The last two axes are decoupled from the first three but share common expertise in program instrumentation.
Teaching of Algorithms, Programming, and Debugging
This research axis focuses on enhancing computer science education by developing tools for visualizing program execution, gamifying debugging practice, implementing active learning strategies, and generating automated exercises for Intelligent Tutoring Systems. It requires expertise in educational pedagogy, visualization tools, game design, AI, machine learning, software development, and user interface design. These efforts leverage techniques from program instrumentation and visualization to improve the teaching and learning experience.
4 Application domains
4.1 Transfer
The main industrial sector related to the research activities of Corse is the one of semi-conductor (programmable architectures spanning from embedded systems to servers). Obviously any computing application which has the objective of exploiting as much as possible the resources (in terms of high-performance but also low energy consumption) of the host architecture is intended to take advantage of advances in compiler and run-time technology. These applications are based on numerical kernels (linear algebra, FFT, convolution, etc.) that can be adapted to a wide spectrum of architectures. More specifically, an important activity concerns the optimization of machine learning applications for some high-performance accelerators. Members of Corse already maintain fruitful and strong collaborations with several companies such as Google, STMicroelectronics, AMD, or ARM.
5 Social and environmental responsibility
5.1 Footprint of research activities
As expected, after the COVID pandemia, team members kept travel activities quite low compared to before the pandemia. Whenever long distance meetings (such as conference PC) could be done virtually, travel has been avoided. Also, team members try to better use existing hardware instead of replacing them (buying new ones).
5.2 Impacting research directions for environment
Because of rebound effect, improving efficiency does not necessarily improve environmental impact. It is thus crucial to think how our community can have actual impact on sustainable computing, that is, influence better design ("R" friendly – Reuse, Reduce, Repair...) and better usage (consume less) of our compute resources. To achieve this goal, we arrange panels with the aim of raising awareness within our community about this significant issue. We expect some of our future research projects to address the challenge of sustainable computing without just focusing on energy efficiency but by considering the global systemic impact as much as possible.
5.3 Impacting usage
The main two challenges of sustainable computing are:
- Decrease usage: While the actual environmental impact of our usage is already not that clear to experts like us (lack of open data), it is even less clear for users and developers. It is thus our responsibility to expose estimations of resource usage (and associated environmental impact) to the developers. Performance debugging tools should evolve to provide meaningful metrics and make them accessible to non experts.
- Increase the lifetime of hardware (that is, Reuse, Repair, Re...): The need for supporting the development of simple, open-source, digital commons, low-impact (not necessarily low-tech) hardware/software solutions is becoming critical but not sufficient. We also need to provide the microscope and the tool-box so that a majority (including sometimes the end-user) can repair or repurpose their device.
Compiler analysis, programming infrastructure, hardware modeling, teaching tools, HIM, etc. are at the heart of those challenges.
6 Latest software developments, platforms, open data
6.1 Latest software developments
6.1.1 GUS
-
Keywords:
CPU, Microarchitecture simulation, Performance analysis, Dynamic Analysis
-
Functional Description:
GUS' goal is to detect performance bottlenecks at the very low level on monothread applications by the use of sensitivity analysis. It is coded as a QEMU plug-in in order to collect runtime information that are later treated by the generic CPU model.
-
News of the Year:
We have designed a “tainting” mechanism to identify instructions or resources that constrain others.
In addition, we have extended Gus to new architectures and microarchitectures beyond Intel x86 Sky Lake: ARM Maia, Intel x86 Sandy Bridge, Intel x86 Ice Lake, Intel x86 Alder Lake.
- URL:
- Publication:
-
Contact:
Fabrice Rastello
6.1.2 SEATOP
-
Name:
Semantically enhanced asm tracer for optimised programs
-
Keywords:
Software, Debug
-
Functional Description:
SEATOP combines binary instrumentation and static debug information to compare an unoptimized execution with an optimized execution of the same program. It produces an assembly trace of the instructions executed by the optimized version, annotated with high-level semantic information: equivalent source code line, variable values.
SEATOP assists in debugging performance or the correctness of the implemented optimization (tiling, interchange, -O3, ...).
- URL:
-
Contact:
Lucas Sube
6.1.3 PALMED
-
Keywords:
CPU, Performance measure, Performance analysis, Reverse engineering
-
Scientific Description:
PALMED is a software to generate automatically performance models for superscalar processors. It supports Armv8a and x86 ISA, and outputs a bipartite graph (instructions, resources) that represents the behaviour of the tested CPU. PALMED is based on a convex encoding of the resource mapping problem, solved by a convex solver, Gurobi.
-
Functional Description:
PALMED computes a bipartite graph between assembly instructions and abstract resources that may be used for performance prediction, targeting static analysis tools and compilers. Internally, PALMED uses PIPEDREAM as a framework for microbenchmarking code generation, and uses gurobi to find a first small graph. Then, PALMED deduces from the found resources and the microbenchmarks that saturates them a mapping of every supported instruction.
- URL:
- Publications:
-
Contact:
Fabrice Rastello
-
Participant:
3 anonymous participants
6.1.4 Diesect
-
Name:
Diesect
-
Keywords:
CPU, Reverse engineering, Microarchitecture
-
Scientific Description:
Diesect computes the decomposition of instructions in a given ISA towards the µOPs into backend structures. At first, it tries to isolate the capacities and relationships of instruction queues (IQs) thanks to a set of basic instructions. This characterization then allow to stress these resources to design contention situations to deduce the mapping between instructions and micro-operations (µOPs) footprints in IQs. This data can then be plugged into a performance model or cross-referenced with other microbenchmarks to have a finer-grain modelization of the pipeline.
-
Functional Description:
Diesect is a reverse-engineering tool that reveals the hidden internal logic of modern processors. While hardware manufacturers often keep their chip designs secret, most processors follow a set of universal design principles. Diesect operates as a grey-box tool, which means it uses these shared structural patterns to look inside a processor without needing proprietary documentation or specialized vendor tools. The software automatically creates specialized tests that cause intentional and controlled bottlenecks within the chip. By measuring the precise timing of these delays, the tool can deduce the processor's internal organization and performance limits. This approach is highly portable and works across various manufacturers like Intel, Apple, or ARM because it is independent of vendor-specific details. Ultimately, Diesect provides the analytical insights needed to build accurate performance models and develop more efficient software.
- URL:
-
Contact:
Alban Dutilleul
6.1.5 sarcasm
-
Name:
Set-Associative Rotating Cache Analytical/Simulating Model
-
Keywords:
Cache, Operational intensity
-
Functional Description:
This repository includes: (i) A definition and management of Ttile scheme, which describes schedules for tensor operators, (ii)Algorithms to generate such scheme and explore a microkernel-based search space (algorithm from Ttile), (iii) Interfacing with XTC to launch them, using TVM/MLIR as possible backends, (iv)A fully-associative cache model (inspired by IOOpt/IOUB), and (v) the Sarcasm set-associative cache model, that can be applied to any Ttile scheme to predict its number of cache misses.
- URL:
-
Contact:
Guillaume Iooss
6.1.6 XTC
-
Name:
Xdsl Transform Compiler
-
Keywords:
Compilation, Machine learning, Autotuning
-
Functional Description:
The XTC (XDSL Transform Compiler) project, is a compiler framework that provides high-level scheduling specifications for linear algebra operations over a dataflow graph.
The project provides high-level syntax for applying transformations like: (i) Tiling, (ii) Loop interchange, (iii) Vectorization, (iv) Unrolling.
It allows the integration of multiple backend and implements: (i) MLIR backend (using MLIR linalg and transform dialects), (ii) TVM backend (using TVM tensor IR and schedule APIs), and (iii) JIR backend (using JIR intermediate representation).
Its core components are: (i) abstract interfaces for tensors, graphs, nodes, and operators, (ii) implementers for different backends, (iii) schedulers for transformation management, (iv) compilers for code generation, and (v) evaluators for performance measurement
The main tools are: (i) `mlir-loop`: Compilation driver, (ii) `loop-explore`: Automatic exploration of different scheduling strategies, (iii) `loop-display`: Visualization of exploration results.
- URL:
-
Contact:
Hugo Pompougnac
6.1.7 JIR
-
Name:
JIR Intermediate representation and Tools
-
Keywords:
Compilation, Autotuning, Machine learning
-
Functional Description:
JIR (J IR) is an intermediate representation. The JIR project provides in addition a compiler/optimization framework.
Its key Design Principles are: (i) Concision & clarity for interactive transformation, (ii) Ease of manipulation for scheduling transformations, (iii) Limited scope focusing exclusively on scheduling of linear algebra operations found in ML frameworks, and (iv) Integration with existing compiler infrastructure (MLIR, Polygeist, Z3)
Its core Features are: (i) Code transformation and optimization framework, (ii) Schedule-based optimization system, (iii) Constraint-based autotuning capabilities, (iv) Support for benchmarking and performance analysis, and (v) Interactive transformation tool
Technical Details: (i) AST based J IR for loops and buffer descriptions, (ii) Transformations directly done at the J IR level, (iii) Backend codegenration for MLIR/LLVM infrastructure, (iv) Leverages Z3 theorem prover for constraint solving, and (v) Implements Gibbs sampling for schedule exploration
Specific Features: (i) Unlike Halide, TACO, or TVM, JIR deliberately excludes computation description, (ii) Relies on externally supplied computation primitives, (iii) Concentrates purely on scheduling problems
High-Level MLIR Integration: (i) Uses MLIR affine dialect as target, (ii) Preserves loop structure and iteration space information, (iii) Enables additional backend-level optimizations
Unified Intermediate Language: (i) Maintains consistent representation throughout transformations, (ii) No lowering until final MLIR translation, (iii) Enables better reasoning about transformations.
Constraint System: (i) Implements scheduling space through discrete constraints, (ii) Integrates with SMT solver (Z3), (iii) Enables schedule verification and completion, (iv) Supports Gibbs sampling for schedule space exploration
- URL:
-
Contact:
Christophe Guillon
6.1.8 SDist
-
Name:
MLIR-SDist
-
Keywords:
MLIR, Compilation, Matrix calculation, Kalray, Optimizing compiler
-
Functional Description:
The development of a compilation toolchain for tensor operations on a machine with distributed local memories requires to generate complex on-chip horizontal communications between memories. SDist provides a series of abstractions dedicated for the distribution of tensors and computation loops in this context. So the user can reason on higher level concepts like memory meshes. Then, unlike existing tools that produces MPI-style primitives, SDist automates the code generation of Hardware-specific communication primitives, like one sided DMA transfers. The project is designed as an extension to the MLIR infrastructure, by adding dialects and passes. The code generation currently supports two accelerators with distributed local memories, the Kalray MPPA, and the Amd AIE.
- URL:
-
Contact:
Sylvain Noiry
6.1.9 MLOpBench
-
Name:
Machine Learning Operators Benchmark
-
Keywords:
Machine learning, Benchmarking, LLM, Optimizing compiler
-
Functional Description:
MLOpBench provides an easy way to extract operators from modern machine learning models (including LLMs), and benchmark them on several state of the art libraries and compilers. The tool has a command line interface. The user chooses the model or set of models (e.g. MLPerf) to analyze and the input parameters (e.g. context size, batch, etc). The extraction of operators, that relies on the Pytorch ecosystem, outputs information like the operator type and shape, data type, and shapes of inputs/outputs. Then, MLOpBench can benchmark some of the extracted operators on one of the following operators libraries: Intel MKL, OpenBLAS, ARM Compute Library, ARM Performance Library, CuBLAS and CuBLAS-lt, Google XNNPACK. As well as the following compilers: XTC (https://bil.inria.fr/fr/software/view/5448/tab), TVM. The tool reuses the runtime the evaluation harness developed for XTC for fair and reproducible measurements with hardware counters.
- URL:
-
Contact:
Sylvain Noiry
6.1.10 Agdbentures
-
Keywords:
Debug, Teaching of programming, Video Game
-
Functional Description:
Agdbentures is a game to teach debugging. It is based on GDB (the Gnu Debugger), uses the Easytracker library, and proposes to students an RPG-like (Role Playing Game) 2D interface, where each level is a program in the C language that contains one or more bugs. To validate a level, one needs to first correct the bugs that block the main character in its goals. Difficulty is gradual, and the code for each level is based (and expands) on the preceding level, which allows players to get familiar with the code base.
- URL:
-
Contact:
Florent Bouchez
7 New results
Our team effort is structured along four axis:
- Automatic Derivation of Parametric Data Movement Complexity: This research axis focuses on understanding and optimizing data movement in algorithms to improve computational performance
- Automatic Resource Characterization and Performance Feedback: This axis focuses on developing tools for characterizing CPU architectures and debugging performance issues
- Performance Modeling, Schedule Optimization, and Code Generation for Tensor Computations: This research axis focuses on optimizing tensor computations, which are crucial for various applications including machine learning.
- Teaching of Algorithms, Programming, and Debugging: This research axis focuses on enhancing computer science education by developing tools for visualizing program execution, gamifying debugging practice, implementing active learning strategies, and generating automated exercises for Intelligent Tutoring Systems.
7.1 Automatic Derivation of Parametric Data Movement Complexity
Participants: Guillaume Iooss, Fabrice Rastello.
I/O Complexity
When analyzing an algorithm's properties, we often focus on its computational complexity, which measures the amount of computation required (ex: for a quadratic algorithm) and provides a rough information about the evolution of its performance related to the problem sizes. However, there are other important factor that are critical for performance, such as the amount of data movement needed to perform these computations.
We consider a simple generic memory model, composed of (i) a large memory of infinite size which contains the input data of the program; and (ii) a small memory of limited parametric size () from which the computational unit can interact with its content. The I/O complexity of a program, also known as data-movement complexity, refers to the minimum amount of data movement an algorithm requires across all valid execution schedules.
To estimate this I/O complexity, we derive lower and upper parametric bounds using two different approaches. A lower bound can be established by mathematically proving that a certain volume of computation is necessary, using strategies such as the K-partitioning method or the wavefront method. An upper bound can be determined by demonstrating a schedule that matches this data movement quantity.
Specialization to the Hourglass pattern
Our effort is focused on improving the lower bound derivation techniques, in order to improve their tightness compared to the upper bound. In particular, we consider specializations of the classical derivation techniques, and we leverage the additional properties of the targeted programs to improve asymptotically the derived bound.
We have identified a pattern of data dependencies, called the hourglass dependency pattern. An hourglass pattern involves successive reductions and broadcasts over a parametric number of elements, significantly constraining the shape of a valid tiling. Several important linear algebra algorithms, such as Gram-Schmidt, QR Householder, reduction to a bidiagonal matrix (gebd2), and Hessenberg matrix factorization (gehd2), exhibit this pattern.
Recent contribution
This year, we have studied an generalization of the hourglass pattern to cover even more algorithms, in particular from dynamic programming. Instead of focusing on pairs of reduction and broadcast dependencies, we rely on dominance properties. We also consider the possibility of reordering the terms of a reduction (using associativity and commutativity properties of their operators) as part of the scheduling decision, which can lead to a different bound. This is important for some kernels such as nussinov. A journal paper on the topic of this work is currently in preparation.
7.2 Automatic Resource Characterization and Performance Feedback
Participants: Alban Dutilleul, Lucas Sube, Guillaume Huard, Fabrice Rastello.
7.2.1 Gus
Modern Out-of-Order (OoO) CPUs are intricate systems with numerous interconnected components. Identifying performance bottlenecks and understanding the root causes of program performance issues are essential to fully leverage the hardware's capabilities. Current performance debugging methods either measure resource utilization to determine which CPU parts cause performance limitations or analyze code based on capacity/throughput models to derive bottleneck information. These approaches are constrained by instrumental and methodological precision, face portability issues across different microarchitectures, and often provide factual data about resource constraints without offering causal insights on how to resolve them.
Gus (Section 6.1.1), a performance debugging and analysis tool we began designing and developing several years ago, uses a resource-centric CPU model driven by dynamic binary instrumentation to detect complex bottlenecks resulting from the interplay of hardware and software factors. Bottlenecks are identified through sensitivity-based analysis, a form of model parameterization that employs differential analysis to reveal constrained resources.
This year efforts on Gus aimed at enhancing the user-friendliness of its output. Key updates include integrating the causality feature and establishing a benchmarking campaign to validate the model against native performance and GEM5. This initiative also provide an opportunity for code refactoring, model fine-tuning, and bug resolution.
7.2.2 SeaTop
As Gus provides the developer with information about kernels bottlenecks at the assembly code level, the link between these bottlenecks and the original source code is usually not straightforward at all. Indeed, optimized kernel generally undergo several source-to-source transformations, such as tiling, loop interchange, loop unrolling, vectorization, in addition to the optimization passes performed by the compiler. Thus, the resulting assembly code can be especially hard to understand and correlate with the source code on which the developer is working.
To address this issue, we developed SeaTop (Semantically Enhanced Asm Tracer for Optimized Programs, Section 6.1.2), a tool to help debugging the performance of an optimized kernel. It enriches traces collected from the optimized application with semantic information extracted from the original unoptimized source code. To this indent, it combines binary instrumentation and static debug information to compare two executions of the same program : an unoptimized one an optimized one. It then produces an assembly trace of the instructions executed by the optimized version, annotated with high-level semantic information: equivalent source code line and relevant variables value. Although SeaTop main objective is to assist in debugging performance of the implemented optimizations (tiling, loop interchange, loop unrolling, vectorization, ...), it can also be used to help debugging correctness of some transformations (for instance to check that the transformed iteration domain remains unchanged). At the time of this report, SeaTop is still under development and should be released in the upcoming year.
7.2.3 DieSect
The rapid proliferation of diverse and opaque processor microarchitectures from heterogeneous cores in Apple Silicon to specialized ARM server CPUs presents a critical scalability challenge for performance modeling.
Conventional characterization faces an unsatisfactory dichotomy: black-box methods are often brittle and susceptible to noise, while white-box approaches rely on non-portable, poorly documented performance counters that require significant human intuition.
To keep pace with the current rate of microarchitectural innovation, there is a clear need for automated, "grey-box" methodologies that can extract deep insights across a broad spectrum of designs.
In this context, we introduce DieSect (Section 6.1.4), a portable microbenchmarking harness that exploits the fundamental, stable design principles of superscalar, out-of-order pipelines rooted in Tomasulo’s algorithm. We utilize timing side-channel attacks on backend structures to infer the decomposition of instructions into micro-operations (μops). By systematically inducing targeted analytical bottlenecks, we disentangle the intricacies of the throughput function without demanding vendor-specific tooling.
We have already begun applying this technique to extract microarchitectural information across multiple platforms, including Apple Silicon and several ARM designs.
Our objective is to integrate this reverse-engineered structural data with other microbenchmarks, such as throughput measurements, to develop a new, high-fidelity model of the processor pipeline. This combined approach aims to provide deeper, more accurate insights than existing black-box methods while maintaining the portability required for the modern hardware landscape.
7.3 Performance Modeling, Schedule Optimization, and Code Generation for Tensor Computations
Participants: Hugo Pompougnac, Christophe Guillon, Léon Frenot, Sylvain Noiry, Rui Cesista, Liam Semeria, Guillaume Iooss, Fabrice Rastello, Albert Cohen [Google, France], Saday Sadayappan [University of Utah].
Tensor computations is a class of program that includes matrix multiplication, convolution and similar operations. These are important for machine learning, since most of the computation of Convolutional Neural Network (CNN) or Large Language Model (LLM) fall into this class.
Our effort focuses on exploring how to optimize these computations from a compilation approach, and is centered around XTC, our new research platform (Section 7.3.1). From this tool, we have several on-going directions:
- With XTC, we have introduced Descript, a scheduling language that allows us to describe a schedule. We are investigating an extension to this language (called Descript++, Section 7.3.2) that allows us to define a search space and to automatically derive a sampler from it.
- In order to find the best schedules on CPU, we have to consider their behavior in respect to caches. Sarcasm (Section 7.3.3) is a new set-associative cache model which is specialized to tensor operator schedules, and fast enough to be applied on many schedules in a reasonable time.
- MlOpBench (Section 7.3.4) is a benchmarking tool that extracts operators from a set of models, and provides benchmarking utilities to estimate the efficiency of their performance.
- SDist (Section 7.3.7) is a backend for the MPPA machine and the AMD AIE. It requires extending Descript and the whole XTC framework to support distributed computation, due to the nature of these targets.
7.3.1 XTC: A Research Platform for Optimizing AI Workload Operators
We developed XTC, a research platform that decouples scheduling specification from code generation and performance measurement for AI workload optimization. The platform addresses a fundamental limitation in current compiler research: existing scheduling languages such as TVM's Tensor Expression, Halide, and MLIR's Transform dialect remain tightly coupled to their respective compilation ecosystems, preventing fair comparison, reuse, and evaluation across frameworks.
XTC provides three main contributions. First, a unified API for scheduling that abstracts core components from multiple scheduling languages, exposing ten scheduling primitives (strip-mine, interchange, split, unroll, vectorize, parallelize, pack, bufferize, and fusion operations) through a consistent interface. This API enables seamless integration with diverse scheduling compilers while supporting a higher-level declarative language for manual experimentation. Second, a unified API for measurement that provides a controlled, cross-platform harness for performance evaluation. This infrastructure accesses hardware performance counters on x86 and ARM CPUs (including, to our knowledge, the first such access on Apple Silicon), as well as NVIDIA GPUs, enabling reproducible and quantitative comparisons across compilation pipelines. Third, integration with state-of-the-art backends including TVM and MLIR, allowing researchers to leverage rapidly evolving compiler ecosystems while maintaining the flexibility to evaluate research prototypes through user-defined backends.
The platform also provides interfaces for automating design space exploration, enabling experts to connect high-level scheduling strategies with custom sampling and predictive models. We demonstrated this capability by reproducing the search spaces of state-of-the-art autotuners such as Ansor.
Our evaluation showed that XTC matches hand-tuned C code performance for matrix multiplication using the Goto strategy, validated cross-backend consistency between TVM and MLIR, identified limitations in MLIR's vectorization pass for convolution operations, and demonstrated integration within the Aidge framework achieving 15-30× speedup on Intel CPUs compared to generic C++ export for neural network inference.
Impact and dissemination. XTC was first presented at the University of Cambridge during the MLIR Summer School, introducing the platform to the compiler research community. We are delivering a tutorial on performance engineering based on XTC at the CGO conference, and presenting the tool at the C4ML workshop. From a research perspective, XTC now serves as the foundation for developing a novel scheduling language based on constraint programming for defining loop optimization search spaces, opening new directions for automated exploration of transformation sequences.
7.3.2 Descript++
Optimizing a linear algebra operator implies searching for performing configuration in a large space of configurations, while balancing the various bottlenecks of the considered architecture. An expert can guide this exploration using their insight of the architecture and manually examine a sequence of configurations. On the other extreme, autotuning techniques are fully automated process that require less expert knowledge. However, this knowledge is usually embedded in the search space and cannot be modified.
Descript++ aims at unifying these approaches by introducing a scheduling language, Descript, that is extended to obtain an autotuning languae. This means that an expert can specify a configuration to examine it, but also build a search space by adding variables and constraints. From this search space specification, we can automatically infer a usable sampling procedure that outputs a list of configuration from this search space. This allows a fine-grain control of the expertise used in an autotuner.
7.3.3 Sarcasm
In order to estimate the performance of a schedule at compile time, data movement is a critical part. On CPU, this means that we need to estimate how many cache misses occur at each cache level, since it linked with one of the major potential performance bottleneck. More precisely, given a set of schedules which are already optimized for the register level (vectorized + microkernel), we wish for a cache model that is able to identify the set of configurations that minimize the cache misses.
In the literature, the analytical cache models are either (i) very precise but very slow (for example, by using Presburger sets to accurately model the points in the program in which a cache miss occur); (ii) very fast but very inaccurate (all fully associative cache model falls into this category, but are unusable on a set-associative cache); or (iii) heuristics that sacrifices precision of the model for speed.
We introduced Sarcasm (Set-Associative Rotating Cache Analytical/Simulating Model, Section 6.1.5) as a set-associative analytical cache model that is fast enough to be usable in the context of a sampler, while mostly preserving the ordering of configurations according to their number of cache misses. We exploit the shape of the considered kernels, and in particular the fact that all the footprints at every tile levels are rectangular.
This tool has been integrated inside XTC, and extended to support sequential composition of microkernels, which was a feature of one of our previous work (called Ttile 7). A journal paper on this topic has been submitted and is currently being reviewed.
7.3.4 MlOpBench: Benchmarking of Machine learning Operators
Having access to relevant benchmark suites is crucial for the evaluation of compiler optimizations that are developed in the team. In the context of machine learning models, we need to compare our work with state of the art compilers and libraries, at the granularity of operators. Thus we have decided to create a dedicated tool to extract operators from models and benchmark them on several implementations.
We have specifically designed the tools for researchers working on compilers or performance optimization for machine learning operators. Starting from existing sets of models (e.g. MLPerf 5.1 inference) or a list of models provided by the user (e.g. LLama3.1 8B), it extracts the datatype, shape and name of each operator. This first step furnishes some data for some early analysis, like operators classification, extraction of tensor sizes, or I/O complexity analysis.
The second step is the benchmarking of each operator on a set of libraries that implement it. They can be general (e.g. XNNPack) or hardware specific (e.g ARM performance library). The tool reuses the runtime developed in XTC for a fair evaluation of the generated code using performance counters. More advanced analysis like the Roofline analysis can be performed at this point.
7.3.5 First-Class Verification Dialects for MLIR
A five-month research stay at the University of Cambridge led to a collaboration that resulted in a publication at PLDI 2. This work addresses a core challenge in the MLIR compilation infrastructure: the lack of a unified framework for expressing and verifying semantic properties directly within the intermediate representation. The paper proposes introducing verification dialects as first-class constructs in MLIR, enabling the encoding of invariants, pre- and post-conditions, and correctness properties at each abstraction level of the compiler. This approach ensures that correctness guarantees are preserved throughout successive transformations of the intermediate representation, whereas existing methods rely on ad hoc or external verification outside the compilation pipeline. The paper demonstrates the applicability of this methodology across several existing MLIR dialects and evaluates its impact on detecting transformation errors.
7.3.6 Multi-level x86 Backend for xDSL
This collaboration with Sasha Lopoukhine, building on our joint work with the University of Cambridge, aims to design an x86 backend for the MLIR xDSL compiler. It extends the approach presented in A Multi-level Compiler Backend for Accelerated Micro-kernels Targeting RISC-V ISA Extensions. The original work proposed a multi-level compilation pipeline in xDSL, progressively lowering high-level representations to emit optimized micro-kernels targeting RISC-V ISA extensions, leveraging MLIR’s abstractions at each stage. The ongoing work adapts this methodology to the x86 architecture, addressing specific challenges related to the complexity of instruction encoding, register set management, and the exploitation of vector extensions (AVX). The goal is to demonstrate the generality of the multi-level approach beyond a single target architecture and to provide a functional x86 backend within the xDSL ecosystem, paving the way for high-performance code generation on widely deployed platforms.
7.3.7 SDist: Abstract distribution primitives and generate code for Kalray MPPA and Amd AIE
Following the development of an end-to-end MLIR based toolchain from machine learning models implemented with Pytorch to the Kalray MPPA accelerator, SDist has been created to facilitate the exploration of schedules for tensor operations, for machines with distributed local memories. It can generate code for the MPPA machine, but also the AMD AIE accelerator.
To reason on distributed implementations of machine learning operators, we need to abstract the low-level communication primitives of each Hardware. We decided to create a small set of scheduling primitives for the distribution of tensors and loops. Then we implemented the lowering to the MPPA using the previously developed toolchain, and a lowering to the AIE is in progress. SDist is entirely MLIR based. A utility for the visualization of distributed tensor is also provided.
7.4 Teaching of Algorithms, Programming, and Debugging
Participants: Florent Bouchez Tichadou, Guillaume Huard, Christophe Guillon, Manuel Selva, Guillaume Huard, Fabrice Rastello.
Our goal here is to combine our expertise in compilation and teaching to help teachers and learners in computer science fields such as programming, algorithms, data structures, automata, debugging, or more generally computing literacy. This axis is developed into two ongoing projects:
- EasyTracker: a library for controlling and inspecting the execution of a program.
- Agdbentures: a game that helps learners to gain skills in debugging, which is based on EasyTracker. See Section 6.1.10.
7.4.1 Easytracker : A generic library for controlling and inspecting program execution and state
Learning to program involves building a mental representation of how a machine executes instructions and stores data in memory. To help students, teachers often use visual representations to illustrate the execution of programs or particular concepts in their lectures. As a famous example, teachers often represent references/pointers with arrows pointing to objects or memory locations. While these visual representations are mostly hand-drawn, there is a tendency to supplement them with tools. However, building such a tool from scratch requires much effort and a high level of debugging technical expertise, while existing tools are difficult to adapt to different contexts.
EasyTracker is a Python library targeting teachers who are not debugging experts. By providing ways of controlling the execution and inspecting the state of programs, EasyTracker simplifies the development of tools that generate tuned visual representations from the controlled execution of a program. The controlled program can be written either in Python, C, or assembly languages.
Currently, the EasyTracker library is sufficiently stable for our usage and only minor improvements were made this year. EasyTracker is now also used in more courses:
- 1st year at Ensimag - Basies of imperative programming with Python
- 1st year at Ensimag - C programming
- L2 at UGA - Algorithms and Imperative Programming
- L3 at UGA - Algorithms and Modelization
7.4.2 Agdbentures: A game to learn how to debug in autonomy
Debugging is an important task in software development and can be the source of a lot of frustration and time consumption. However, it is not often taught explicitly in computer science curricula even at university level. For these reasons, we develop Agdbentures (see Section 6.1.10), a debug practicing game where “levels” consist of programs containing bugs that the learner needs to debug to advance in the game.
In Agdbentures, the level programs are executed using Easytracker, which enables us to present a live visual representation of the program state during execution in the form of a 2D RPG-like world. For instance, the “player_x” and “player_y” variables in the level code are inspected at runtime and used to place a character representing the player on a graphical 2D map. The interest is three-fold: First, this makes the game appealing as the player/learner is plunged into a “real” game; Second, it showcases the importance of having information on the state of the program being executed in order to be able to debug; Third, it separates completely the graphical code, which can be very complex and is hidden from players, from the level code which is given to players: this let us simplify the source code so novice programmers won't be rebuked. The levels share a common codebase that is increasing in size and complexity as the player advances in the game. It initially only controls the main character position, then more features are added such as interactive objects, NPCs (non playable characters), level logic (activating levers, collecting items...). This helps the player getting familiar with a codebase of increasing size over time so we can present more interesting bugs where locating the problem is similar to what happens in real life development. It also let us to create “fun” levels where bugs have interesting or amusing effects on the visual representation, and where finding the solution (fixing the bugs) is rewarding.
In past experiments, we obtained very encouraging results about the engagement of students at the L2 university level. All were eager to participate and declared they would really like to continue playing Agdbentures on their own with more levels. This year, we continued to propose internship positions to students to help develop Agdbentures and the main current efforts focus on testing the current projet, coverage of interesting debugging situtations, and on the whole game scenario. The objective is to complete a release that meet interesting pedagogical goals while maintaining players investment. Future efforts will be focused on making the game accessible to students and evaluating the pedagogical impact on students' debugging skills.
8 Bilateral contracts and grants with industry
8.1 Bilateral contracts with industry
With the PhD thesis (CIFRE) of Syvlain Noiry a bilateral contract has been signed with Kalray.
-
Title
Neural Network Compilation for a Distributed Memory Acceleration Platform
-
Duration
2024-2027
-
Abstract
There are various infrastructures for optimizing deep learning networks, but they often rely on existing libraries, complicating joint optimization of operators. The thesis aims to uniformly manage tensor operator optimization across different hardware levels using a domain-specific intermediate representation (IR) and auto-tuning mechanisms, targeting the Kalray MPPA Coolidge data processing unit.
9 Partnerships and cooperations
9.1 International research visitors
9.1.1 Visits to international teams
Research stays abroad
Hugo Pompougnac
-
Visited institution:
University of Cambridge
-
Country:
United Kingdom
-
Dates:
Oct 2024 - Feb 2025
-
Context of the visit:
Advancing compiler infrastructure by introducing formal verification dialects in MLIR and extending a multi-level backend approach from RISC-V to x86 architectures.
-
Mobility program/type of mobility:
Research stay
9.2 National initiatives
BPI OTPaaS
-
Title:
Développement et renforcement de la filière française et européenne du Cloud
-
Duration:
October 2021 – Mai 2025
-
Coordinator:
P. Betinelli
-
Corse contact:
Fabrice Rastello
-
Corse participants:
Fabrice Rastello, Christophe Guillon
-
Partners:
Agileo, Atos, Captronic, Duliprint, IMT, MDM, Prosyst, SE, Soben, Tridimeo, Solem, CEA, Valeo
-
Inria Partners:
DataMove
-
Summary:
The OTPaaS project targets massive digitization by offering a suitable cloud for scanning that is compatible with Gaia-X and easy to use by companies including SMEs. The consortium brings together national technology providers and users from major groups and SMEs/ETIs, with strong support from major French research institutes. The platform OTPaaS will be validated by 6 demonstrators and followed by ambitious industrialization programs.
BPI DeepGreen
-
Title:
Plateforme independante pour le deep learning embarqué
-
Duration:
April 1st 2023 – Mai 2027
-
Coordinator:
CEA
-
Corse contact:
Fabrice Rastello
-
Corse participants:
Fabrice Rastello, Christophe Guillon, Hugo Pompougnac, Valentin Trophine, Guillaume Iooss
-
Partners:
CEA, Adagos, Pulse Audition, Kalray, Dolphin Design, Thales Research & Technology France, Arcys, MBDA, Arcelormittal, EDF, Sysnav, Hawai.tech, Ezako
-
Summary:
The DeepGreen project aims to bring together major industrial players and small and medium-sized enterprises (SMEs) in France for the deployment of Artificial Intelligence on constrained hardware targets through a software platform that meets the requirements and expectations of each stakeholder.
HOLIGRAIL – PEPR AI
-
Title:
HOLIistic approaches to GReener model Architectures for Inference and Learning
-
Duration:
Oct 1st 2023 – 2027
-
Coordinator:
Olivier Sentieys
-
Corse contact:
Fabrice Rastello
-
Corse participants:
Fabrice Rastello, Christophe Guillon, Hugo Pompougnac
-
Partners:
CEA List, TIMA
-
Inria Partners:
Taran, Emeraude
-
Summary:
The vision of this action is to create a synergy with the research on the foundations of AI frugality to propose cutting-edge methods that significantly improve the energy efficiency of both inference and training of a model. We will propose (i) more compact and efficient number representations that still maintain a quality of inference or training close to the reference, (ii) hardware-aware training algorithms that enhance certain types of sparsity (e.g., more structured), coding compactness (aggressive quantization, maximum entropy) and tensor transformations. Most state-of-the-art solutions are agnostic of the hardware they run on. By taking advantage of this interplay between the hardware and the algorithms, we can achieve breakthroughs beyond current solutions, in particular by developing (iii) efficient hardware mechanisms, especially optimized to take advantage of sparsity, extreme quantization and ad-hoc number representations, together with (iv) compiler optimizations, to demonstrate the effectiveness of the proposed methods. Our approaches are holistic in the sense that they will jointly optimize the whole computing stack of AI, i.e., at the algorithm, arithmetic, compiler and hardware levels.
10 Dissemination
10.1 Promoting scientific activities
10.1.1 Scientific events: organisation
Member of the organizing committees
- Fabrice Rastello: Member of the steering Committee of ACM/IEEE CGO
- Fabrice Rastello: Member of the steering committee of the MLIR (Un)School September 2025
- Hugo Pompougnac: Member of the steering committee of the MLIR (Un)School September 2025
10.1.2 Scientific events: selection
Member of the conference program committees
- Fabrice Rastello: ACM/IEEE CGO 2026 conference
- Fabrice Rastello, End25
10.1.3 Invited talks
- Guillaume Iooss: Invited talk by the CASH team in Lyon (May 2025).
- Fabrice Rastello: Invited talk at UDC (Universidade da Coruña) "Compiler design in the age of specialized libraries: bridging performance model with autotuning"
10.1.4 Leadership within the scientific community
- Fabrice Rastello: member of Inria evaluation committee (CE) since Sept 2023
- Fabrice Rastello: deputy scientific director of Inria Grenoble Rhône-Alpes (DSA) since Sept 2022
- Fabrice Rastello: scientific council of Inria Grenoble Rhône-Alpes (CoS)
- Fabrice Rastello: vice-president of the Inria CRCN/IFSP recruiting committees 2025
- Fabrice Rastello: co-PI of the Numeric/ASIC agencies PEPR Camelia (32M€)
- Fabrice Rastello: co-PI of the sub-project PC4 of PEPR Camelia
- Fabrice Rastello: coordinator of UGA Idex REUSE program (500k€)
- Fabrice Rastello: Inria Delegate to CoARA (Coalition for Advancing Research Assessment)
10.1.5 Scientific expertise
- Fabrice Rastello: Strategic Orientation Committee for ICube/ICPS
- Fabrice Rastello: Inria CRCN-TH recruting committee 2025
- Fabrice Rastello: ANR review
10.1.6 Research administration
- Guillaume Iooss: RADAR Correspondant
10.2 Teaching - Supervision - Juries - Educational and pedagogical outreach
10.2.1 Teaching
- Licence 1: Florent Bouchez Tichadou, Operating System and Programming, 40 hours, UGA.
- Licence 2: Florent Bouchez Tichadou, Algorithms languages and programming, 101 hours, UGA.
- Licence 3: Florent Bouchez Tichadou, Algorithms, 17 hours, UGA.
- Licence 3: Florent Bouchez Tichadou, Programming project, 20 hours, UGA.
- Licence 3: Guillaume Huard, UNIX & C programming, 50 hours, UGA.
- Licence 3: Guillaume Huard, Object Oriented and Event-Driven Programming, 50 hours, UGA.
- Master 1: Guillaume Huard, Programming for teacher apprentices, 8 hours, UGA.
- Licence 3: Guillaume Huard is responsible of L3 Informatique Générale at UGA.
- Licence: Guillaume Huard is responsible of Licence Informatique diploma at UGA.
10.2.2 Supervision
- PhD in progress: Lucas Maisonnave: Maximum Entropy Coding for Deep Neural Networks, march 2023, co-advised by Olivier Bichler (CEA LIAE – 95%) and Fabrice Rastello (5%)
- PhD in progress: Sylvain Noiry: Compilation of Neural Networks for Distributed Memory Architecture, April 2024, advised by Fabrice Rastello
- PhD in progress: Valentin Trophine: Asynchronous Programming Under Memory Constraints, Oct 2024, co-advised by Frédéric Wagner (Inria DATAMOVE – 90%) and Fabrice Rastello (10%)
- PhD in progress: Alban Dutilleul: Beyond Microarchitectural Black Boxes: Grey-Box Throughput Modeling for Superscalar Out-of-Order Processors, Oct 2025, advised by Fabrice Rastello
- PhD in progress: Léon Frénot: Towards an Interactive Compilation Infrastructure Guided by Expertise for the Optimisation of Deep Learning Programs, Oct 2025, advised by Fabrice Rastello
10.2.3 Juries
Guillaume Iooss
- CSI (Comité de Suivi Individuel de thèse), Arthur Viens.
Fabrice Rastello
- PhD, Nahuel PALUMBO–Lille, Chair, Nov 28 2025. DRUID - Métacompilation des compilateurs JIT de base.
- PhD, Clément Rosseti–Strasbourg, Chair, Dec 18 2025. Algebraic loop transformations.
- PhD, Omid Asudeh–Univ Utah, Committee, Dec 15 2025. Reordering Sparse Matrices for Performance Gain: Evaluation, Analytical Modeling, and Strategy Design.
10.3 Popularization
- Fabrice Rastello: scientific council of CEA-EDF-Inria summer schools
- Hugo Pompougnac: scientific council of the MLIR (Un)School September 2025
10.3.1 Book
Hugo Pompougnac co-authored a collective work titled Que faire de l'IA (What to Do with AI), produced in collaboration with economists, legal experts, and union representatives. This book aims to bridge the gap in public understanding of the societal challenges posed by artificial intelligence. Positioned at the intersection of technical expertise and social sciences, the book seeks to demystify the underlying mechanisms of AI while examining their tangible impacts on labor, law, and social relations. It critically engages with dominant narratives about AI—whether overly optimistic or alarmist—by grounding the discussion in economic and legal realities. Central to its analysis are the transformation of work, regulatory frameworks, and decision-making power. The multidisciplinary author team brings together complementary perspectives: technical insights into AI systems, analysis of existing and emerging legal frameworks, and firsthand accounts of working conditions in sectors affected by automation.
11 Scientific production
11.1 Publications of the year
International journals
International peer-reviewed conferences
Reports & preprints
11.2 Cited publications
- 7 articleAutotuning Convolutions is Easier Than You Think.ACM Transactions on Architecture and Code OptimizationNovember 2022, 1-23HALDOIback to text