Languages, compilers, and run-time systems are some of the most important components to bridge the gap between applications and hardware.
With the continuously increasing power of computers, expectations are rising, with ever more ambitious, computationally intensive, and complex applications.
As desktop PCs become a niche and servers mainstream, three categories of computing stand out for the next decade:
mobile, cloud, and super-computing.
The resulting diversity and heterogeneity (even on a single chip), as well as hardware virtualization, are putting more and more pressure on both compilers and run-time systems.
Moreover, because of the energy wall, architectures are becoming more and more complex and parallelism ubiquitous at every level.
Unfortunately, the memory-CPU gap continues to increase and energy consumption remains an important issue for future platforms.
To address the challenge of performance and energy consumption raised by silicon companies, compilers and run-time systems must evolve and, in particular, interact, taking into account the complexity of the target architecture.

The overall objective of Corse is to address this challenge by combining static and dynamic compilation techniques, with a more interactive embedding of programs and of the compiler environment in the run-time system.

One of the characteristics of Corse is to ground our research in a diverse set of advanced mathematical tools.
Compiler optimization requires the use of several tools from discrete mathematics:
combinatorial optimization, algorithmics, and graph theory.
The aim of Corse is to tackle optimization not only for general purpose but also for domain specific applications.
We believe that new challenges in compiler design, and in particular split compilation, should also take advantage of graph labeling techniques.
In addition to run-time and compiler techniques for program instrumentation, hybrid analysis and compilation advances will be mainly based on polynomial and linear algebra.

The other specificity of Corse is to address technical challenges related to compiler technology, run-time systems, and hardware characteristics.
This implies mastering the details of each.
This is especially important as any optimization is based on a reasonably accurate model.
Compiler expertise will be used to model applications (e.g. through automatic analysis of memory and computational complexity);
run-time expertise will be used to model concurrent activities and the overhead due to contention (including memory management);
hardware expertise will be used extensively to model physical resources and hardware mechanisms (including synchronization, pipelines, etc.).

The core foundation of the team is the combination of static and dynamic techniques for compilation and run-time systems. We believe this combination to be essential to address the high-performance and low-energy challenges raised by current application, software, and architecture trends.

Our project is structured along two main directions.
The first direction belongs to the area of run-time systems with the objective of developing strong relations with compilers.
The second direction belongs to the area of compiler analysis and optimization with the objective of combining dynamic analysis and optimization with static techniques.
The aim of Corse is to ground these two research activities in the end-to-end optimization of applications from specific domains.

The main industrial sector related to the research activities of Corse is the semiconductor sector (programmable architectures spanning from embedded systems to servers).
Obviously, any computing application that aims to exploit the resources of the host architecture as fully as possible (in terms of high performance but also of low energy consumption) stands to benefit from advances in compiler and run-time technology.
These applications rely on numerical kernels (linear algebra, FFT, convolution, ...) that can be adapted to a large spectrum of architectures.
More specifically, an important activity concerns the optimization of machine learning applications for some high-performance accelerators.
Members of Corse already maintain strong and fruitful collaborations with several companies, such as STMicroelectronics, Atos/Bull, and Kalray.

This year saw a sharp drop in the number of trips. The main cause was the COVID pandemic and the associated travel bans; we expect people to reduce their travel activities in the future.

HiPEAC Paper Award for our PLDI paper on IOOpt 10.

This year saw the birth of EasyTracker, a library for building visual representations of the execution of a program.
Concerning GUS (a performance debugging tool based on abstract simulation): we worked on ARM support in Pipedream (a tool for measuring throughput based on micro-benchmarks); we improved the integration into GUS of PALMED (a tool for reverse engineering the throughput characteristics of superscalar CPUs) and of uops.info (an existing throughput characterization that uses hardware counters); and we improved the cache modeling in GUS. We also developed a web demo for PALMED.
Concerning IOLB (a tool for automatically deriving the data-movement complexity of affine programs) and IOOpt (a tool for automatically deriving optimal rectangular tilings for affine programs), we improved the interface and developed a web demo for IOOpt.
TTile has been integrated into TVM (a compiler framework for deep neural networks) along with several extensions (OpenOpt and Pariopt, two search-space exploration strategies for compiler optimization).
Concerning BISM, we worked on augmenting the versatility of the tool by developing new use cases. We also redesigned and refactored parts of the tool to improve its usability.

Pipedream reverse engineers the following performance characteristics:
(1) Instruction latency: the number of cycles an instruction requires to execute.
(2) Peak micro-op retirement rate: how many fused micro-ops the CPU can retire per cycle.
(3) Micro-fusion: the number of fused micro-ops an instruction decomposes into.
(4) Micro-op decomposition and micro-op port usage: the list of unfused micro-ops every instruction decomposes into, and the list of execution ports each of these micro-ops can execute on.

The first step of the reverse engineering process consists of generating a number of microbenchmarks. Pipedream then runs these benchmarks, measuring their performance using hardware counters. The latency, throughput, and micro-fusion of different instructions can then be read directly from these measurements.
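
As an illustration of what such a generated microbenchmark can look like, here is a toy generator (the chained-register shape and the instruction string are assumptions for illustration; Pipedream's real generator is more elaborate):

```python
# Toy generator (not Pipedream's code): a latency microbenchmark chains
# n copies of an instruction through the same register, so every copy
# depends on the previous one and the kernel runs at "latency" speed.
def latency_bench(instr="add %%rax, %%rax", n=4):
    chain = "\n".join(f'    "{instr}\\n\\t"' for _ in range(n))
    return ("void bench(void) {\n"
            "  asm volatile(\n"
            f"{chain}\n"
            '    : /* no outputs */ : /* no inputs */ : "rax");\n'
            "}\n")

# For throughput, one would instead use independent registers so that
# consecutive copies can execute in parallel.
print(latency_bench())
```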

The process of finding port mappings, i.e. micro-op decompositions and micro-op port usage, however, is more involved. For this purpose, we have defined a variation of the maximum flow problem which we call the "instruction flow problem". We have developed a linear program (LP) formulation of the instruction flow problem which can be used to calculate the peak IPC and micro-operations per cycle (MPC) a benchmark kernel can theoretically achieve with a given port mapping. The actual port mapping of the underlying hardware is then determined by finding the mapping for which the throughput predicted by instruction flow best matches the actual measured IPC and MPC.
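
To make this concrete, the following sketch formulates a tiny instance of such an LP with the PuLP library (the three-port mapping and the kernel are invented for illustration; this is not the actual Pipedream formulation, which is richer and also accounts for micro-fusion and measured MPC):

```python
# Minimal "instruction flow" LP sketch: given a hypothetical port
# mapping, compute the peak kernel throughput it can sustain.
import pulp

PORTS = ["p0", "p1", "p5"]
KERNEL = {                 # instruction -> (count in kernel, uop port sets)
    "add": (2, [{"p0", "p1", "p5"}]),   # 1 uop, any ALU port
    "mul": (1, [{"p0"}]),               # 1 uop, port 0 only
    "lea": (1, [{"p1", "p5"}]),
}

prob = pulp.LpProblem("instruction_flow", pulp.LpMaximize)
ipc = pulp.LpVariable("ipc", lowBound=0)   # kernel iterations per cycle

flow = {}   # flow[i, u, p]: uop u of instruction i issued to port p per cycle
for i, (count, uops) in KERNEL.items():
    for u, ports in enumerate(uops):
        for p in sorted(ports):
            flow[i, u, p] = pulp.LpVariable(f"f_{i}_{u}_{p}", lowBound=0)
        # every uop of every dynamic instance must be routed to some port
        prob += pulp.lpSum(flow[i, u, p] for p in ports) == count * ipc

# each port retires at most one micro-op per cycle
for p in PORTS:
    prob += pulp.lpSum(v for (i, u, q), v in flow.items() if q == p) <= 1

prob += ipc   # objective: maximize throughput
prob.solve(pulp.PULP_CBC_CMD(msg=False))
# 4 instructions per iteration, so IPC = 4 * iterations/cycle here
print("peak kernel iterations per cycle:", pulp.value(ipc))
```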

Our current efforts with regard to code optimization follow two directions.

Researchers and practitioners have long worked on improving the computational complexity of algorithms, focusing on reducing the number of operations needed to perform a computation. However, the hardware trend nowadays clearly shows a higher performance and energy cost for data movement than for computation: quality algorithms have to minimize data movement as much as possible.

The theoretical operational complexity of an algorithm is a function of the total number of operations that must be executed, regardless of the order in which they will actually be executed. Theoretical data-movement (or I/O) complexity is fundamentally different: one must consider all possible legal schedules of the operations to determine the minimal number of data movements achievable, a major theoretical challenge. Until now, I/O complexity has been studied via complex, manually refined proofs.

We developed the first static analysis to automatically derive non-asymptotic parametric expressions of data-movement lower bounds, with scaling constants, for arbitrary affine computations. Our approach is fully automatic, assisting algorithm designers to reason about I/O complexity and make educated decisions about algorithmic alternatives. The tool allowed us to compute the lower bound for all the kernels of the Polybench suite. We extended this work by:
1. designing the first algorithm for computing a symbolic over-approximation of the data movement of a parametric (multi-dimensional) tiled version of an affine code;
2. designing the first fully automated scheme for expressing the minimization of this data-movement expression as an operational-research problem;
3. integrating those techniques into a tool that computes, for a class of affine computations, parametric expressions for I/O upper bounds.
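
As a point of reference for the kind of expression involved, the classical matrix-multiplication bound can serve as an illustration (a well-known result stated here asymptotically; it is not IOLB's output, which carries explicit constants):

```latex
% For the standard n x n matrix-multiplication algorithm executed on a
% machine with a fast memory (cache/scratchpad) of size S, any legal
% schedule must perform at least on the order of n^3 / sqrt(S) data
% movements between slow and fast memory:
Q \;\geq\; \Omega\!\left(\frac{n^{3}}{\sqrt{S}}\right)
% IOLB derives such lower bounds automatically, as parametric
% expressions with explicit scaling constants instead of Omega-notation.
```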

This work was presented in a paper at PLDI 2021 10.

Tensor computations such as Sparse Matrix Multi-vector Multiplication, Sampled Dense-Dense Matrix Multiplication, Dense Matrix Multiplication, Tensor Contraction, and Convolution are important kernels used in many domains such as fluid dynamics, data analytics, economic modelling, and machine learning. Developing highly optimized code for such kernels requires combining highly tuned register/instruction-level micro-kernels with appropriate multi-level tiling. In this context we developed a hybrid (analytical/statistical) performance-based optimization scheme along with a code generator for DNNs.

Addressing the problem of automatic generation of optimized operators raises two challenges. The first is the design of a domain-specific code generation framework able to output high-quality binary code. The second is to carefully bound the search space and to choose an optimizing objective function that neither leads to yet another combinatorial optimization problem, nor to a too approximate performance objective. This work tackles those two challenges by:
1. revisiting the usual belief that packing should enable stride-1 accesses at every level, making packing optional;
2. highlighting the importance of considering the packing decision and shape as part of the optimization problem;
3. revisiting the usual belief that register spilling should be avoided if possible, allowing other (more packing-friendly) micro-kernels to be considered as good candidates;
4. revisiting the misleading intuition that convolution dimensions should be brought to the innermost level, allowing more freedom for memory reuse at outer dimensions;
5. showing that the optimization problem can be decoupled into: finding a small set of good micro-kernel candidates using an exhaustive search; finding a good schedule (loop tiling/permutation) and associated packing using operational research; and finding the best tile sizes using auto-tuning;
6. designing a single-pass micro-kernel generation algorithm that emits code for any choice of register blocking dimensions, unrolling factor, and packing decisions;
7. designing a lowering scheme for abstract iterators, compatible with diverse packing and tiling strategies, that is thrifty with integer arithmetic and loop control;
8. designing a packing algorithm compatible with various choices of transposition and subviews;
9. implementing a code generator based on these algorithms, driven by a simple and modular configuration language.
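
For intuition, the sketch below shows the loop structure such a generator targets: cache-level tiles around a small block that would be a fully unrolled, register-blocked micro-kernel in generated code (illustrative Python with made-up tile sizes; TTile emits optimized binary code, not Python):

```python
# Toy two-level tiled matrix multiplication (illustration only).
import numpy as np

TI, TJ, TK = 64, 64, 64    # cache-level tile sizes (made up)
MI, MJ = 4, 8              # micro-kernel block, i.e. register tile (made up)

def tiled_matmul(A, B):
    n, kdim = A.shape
    _, m = B.shape
    C = np.zeros((n, m))
    for i0 in range(0, n, TI):
        i_end = min(i0 + TI, n)
        for j0 in range(0, m, TJ):
            j_end = min(j0 + TJ, m)
            for k0 in range(0, kdim, TK):
                k_end = min(k0 + TK, kdim)
                # walk micro-kernel blocks inside the cache tile
                for i in range(i0, i_end, MI):
                    ih = min(i + MI, i_end)
                    for j in range(j0, j_end, MJ):
                        jh = min(j + MJ, j_end)
                        C[i:ih, j:jh] += A[i:ih, k0:k_end] @ B[k0:k_end, j:jh]
    return C

A, B = np.random.rand(128, 96), np.random.rand(96, 80)
assert np.allclose(tiled_matmul(A, B), A @ B)
```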

This work has been done in the context of the IOcomplexity associate team (see Section 8.1.1) and the BPI project ES3CAP (see Section 9.2.1).

Performance modeling is a critical component of program optimization, assisting compilers as well as developers in predicting the performance of code variations ahead of time. Performance models can be obtained through different approaches that span from precise and complex simulation of a hardware description (Zesto, GEM5, PTLSim) to application-level analytical formulations. An interesting approach for modeling the CPU of modern pipelined, super-scalar, out-of-order processors trades accuracy for simulation speed by separately characterizing the latency and throughput of instructions. This approach is suitable not only for optimizing compilers but also for hand-tuning critical kernels written in assembly (see Section 8.1.2). It is used by performance-analysis tools such as CQA, Intel IACA, OSACA, MIAMI, and llvm-mca. Cycle-approximate simulators such as ZSim or MCsimA can also take advantage of such an instruction characterization. In this context, we developed two tools: PALMED and GUS (see Section 7).

PALMED is a tool that automatically builds a resource mapping, which can be used as a performance model for pipelined, super-scalar, out-of-order CPU architectures. Resource mappings describe the execution of a program by assigning instructions in the program to abstract resources. They can be used to predict the throughput of basic blocks or as a machine model for the backend of an optimizing compiler.

PALMED does not require hardware performance counters, and relies solely on runtime measurements to construct resource mappings. This allows it to model not only execution port usage, but also other limiting resources, such as the frontend or the reorder buffer. Also, thanks to a dual representation of resource mappings, our algorithm for constructing mappings scales to large instruction sets, like that of x86.
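
To make the resource-mapping idea concrete, here is a minimal sketch of how such a mapping can predict throughput (a toy abstract-resource model with made-up instructions and capacities; this is not PALMED's code or output):

```python
# Toy throughput prediction from a resource mapping (illustration only):
# each instruction uses abstract resources; each resource absorbs a fixed
# number of micro-ops per cycle; the most loaded resource dictates the
# steady-state execution time of the basic block.
MAPPING = {                       # hypothetical instruction -> resource uses
    "add":  {"alu": 1.0},
    "mul":  {"alu": 1.0, "mul_unit": 1.0},
    "load": {"load_port": 1.0, "frontend": 1.0},
}
CAPACITY = {"alu": 3.0, "mul_unit": 1.0, "load_port": 2.0, "frontend": 4.0}

def predicted_cycles(block):
    """Predicted cycles per iteration of a basic block."""
    load = {r: 0.0 for r in CAPACITY}
    for instr in block:
        for r, uses in MAPPING[instr].items():
            load[r] += uses / CAPACITY[r]   # normalized pressure on r
    return max(load.values())              # bottleneck resource wins

print(predicted_cycles(["add", "mul", "load", "load"]))  # -> 1.0
```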

GUS is the first framework for performance debugging of complex, realistic, pipelined, out-of-order processors executing arbitrary programs, via abstract simulation for sensitivity analysis. Abstract simulation aims at providing a significant speed improvement over cycle-accurate simulation. This makes it possible to quickly perform multiple runs with several values for the latency, throughput, and count of the modeled resources: this is sensitivity analysis. To determine whether a resource is a bottleneck, its capacity is artificially increased to see whether this affects the (simulated) program execution time accordingly. Such an approach is not feasible with a cycle-accurate simulator, yet to be realistic it requires accurate modeling of the execution time.
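
The following toy sketch illustrates the sensitivity-analysis principle on the same kind of abstract resource model (made-up numbers; GUS operates on real program executions and a much richer model):

```python
# Toy sensitivity analysis (illustration only): a resource is flagged as
# a bottleneck if doubling its capacity speeds up the abstract simulation.
CAPACITY = {"alu": 2.0, "load_port": 1.0, "frontend": 4.0}
USES = {"alu": 3.0, "load_port": 2.0, "frontend": 4.0}  # uses per iteration

def simulated_cycles(capacity):
    # abstract simulation: the most loaded resource dictates the time
    return max(USES[r] / capacity[r] for r in capacity)

base = simulated_cycles(CAPACITY)
for r in CAPACITY:
    boosted = dict(CAPACITY, **{r: 2 * CAPACITY[r]})    # double resource r
    speedup = base / simulated_cycles(boosted)
    tag = "  <- bottleneck" if speedup > 1.01 else ""
    print(f"{r}: speedup {speedup:.2f}{tag}")
```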

This work has been done in the context of the European project CPS4EU (see Section 10.3.1).

During the period, our new results and contributions on monitoring can be categorized as follows.

Runtime enforcement analyses an execution trace, detects when the execution deviates from its expected behaviour with respect to a given property, and corrects the trace to make it satisfy the property. In this work 7, we present new enforcement techniques that reorder actions when necessary, inject actions into the application to ensure progress of the property, and discard actions to avoid storing too many unnecessary ones. At any step of the enforcement, we provide a verdict, called the enforcement trend in this work, which takes its value in a 4-valued truth domain. Our approach has been implemented in a tool and validated on several application examples. Experimental results show that our techniques better preserve the application's actions, hence ensuring better service continuity.
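
As a minimal illustration of the buffering/reordering principle (a hypothetical property and a much simplified enforcer, not the algorithm of the paper):

```python
# Much simplified enforcer (illustration only) for the hypothetical
# property "no 'use' action before an 'init' action": violating actions
# are buffered and released (reordered) once 'init' is observed. The
# paper's enforcement trend is 4-valued; this sketch shows only two values.
def make_enforcer():
    state = {"init_seen": False, "buffer": []}
    def step(action):
        released = []
        if action == "init":
            state["init_seen"] = True
            released.append(action)
            released.extend(state["buffer"])   # release delayed actions
            state["buffer"].clear()
        elif action == "use" and not state["init_seen"]:
            state["buffer"].append(action)     # delaying avoids a violation
        else:
            released.append(action)
        trend = "currently-true" if not state["buffer"] else "currently-false"
        return released, trend
    return step

enforce = make_enforcer()
for a in ["use", "log", "init", "use"]:
    print(a, "->", enforce(a))
```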

In this work 8, we propose a semi-automated approach for helping non-experts in Business Process Model and Notation (BPMN). We focus on a timed version of BPMN where tasks are associated with a range indicating the minimum and maximum duration needed to execute that task. In a first step, the user defines the tasks involved in the process and possibly gives a partial order between tasks. A first algorithm then generates an abstract graph, which serves as a simplified version of the process being specified. Given such an abstract graph, a second algorithm computes the minimum and maximum time for executing the whole graph. The user can rely on this information for refining the graph. For each version of the graph, these minimum/maximum execution times are computed. Once the user is satisfied with a specific abstract graph, we propose a third algorithm to synthesize the BPMN process from that graph.
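
A minimal sketch of the min/max execution-time computation (assuming an abstract graph of tasks with [min, max] durations, precedence edges, and unbounded parallelism; the concrete algorithms of the paper are richer):

```python
# Bounds on the execution time of an abstract task graph, propagated
# in topological order (longest-path style).
import graphlib   # standard library, Python >= 3.9

DUR = {"A": (1, 2), "B": (2, 5), "C": (1, 1), "D": (3, 4)}
PRED = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}  # D waits for B, C

finish_min, finish_max = {}, {}
for t in graphlib.TopologicalSorter(PRED).static_order():
    start_min = max((finish_min[p] for p in PRED[t]), default=0)
    start_max = max((finish_max[p] for p in PRED[t]), default=0)
    finish_min[t] = start_min + DUR[t][0]
    finish_max[t] = start_max + DUR[t][1]

# overall bounds for executing the whole graph
print(max(finish_min.values()), max(finish_max.values()))  # -> 6 11
```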

In this work 6, we monitor asynchronous distributed component-based systems with multi-party interactions. We consider independent components whose interactions are managed by several distributed schedulers. In this context, neither a global state nor the total ordering of the executions of the system is available at runtime. We instrument the system to retrieve local events from the local traces of the schedulers. Local events are sent to a global observer which reconstructs, on the fly, the set of global traces that are compatible with the local traces, in a concurrency-preserving fashion. The set of compatible global traces is represented in the form of an original lattice over partial states, such that each path of the lattice corresponds to a possible execution of the system.
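
The following toy sketch conveys the underlying idea (merely enumerating, rather than building the lattice of, the global traces compatible with two local traces):

```python
# Toy illustration (not the paper's lattice construction): the global
# traces compatible with two local traces are exactly the interleavings
# that preserve each local order.
def interleavings(t1, t2):
    if not t1:
        yield list(t2)
        return
    if not t2:
        yield list(t1)
        return
    for rest in interleavings(t1[1:], t2):
        yield [t1[0]] + rest
    for rest in interleavings(t1, t2[1:]):
        yield [t2[0]] + rest

# Two schedulers emitting ["a1", "a2"] and ["b1"] yield 3 global traces.
for trace in interleavings(["a1", "a2"], ["b1"]):
    print(trace)
```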

In this work 9, we consider the runtime enforcement of Linear-time Temporal Logic formulas on decentralized systems with no central observation point or authority. A so-called enforcer is attached to each system component and observes its local trace. Should the global trace violate the specification, the enforcers coordinate to correct their local traces. We formalize the decentralized runtime enforcement problem and define the expected properties of enforcers, namely soundness, transparency, and optimality. We present two enforcement algorithms. In the first one, the enforcers explore all possible local modifications to find the best global correction. Although this guarantees an optimal correction, it forces the system to synchronize and is more costly, both computation- and communication-wise. In the second one, each enforcer makes a local correction before communicating. The reduced cost of this version comes at the price of the optimality of the enforcer corrections.

In this work, we edited an honorary volume dedicated to Klaus Havelund on the occasion of his 66th birthday. The volume features contributions by leading researchers in the domain and covers a broad range of academic research and engineering successes in programming language design.

In this work, we aim at providing an efficient and expressive tool for the instrumentation of Java programs at the bytecode level. BISM (Bytecode-Level Instrumentation for Software Monitoring) is a lightweight Java bytecode instrumentation tool that features an expressive, high-level, control-flow-aware instrumentation language. The language is inspired by the aspect-oriented programming paradigm in modularizing instrumentation into separate transformers that encapsulate joinpoint selection and advice inlining. BISM allows capturing joinpoints ranging from bytecode instructions to method executions, and provides comprehensive static and dynamic context information. It runs in two instrumentation modes: build-time and load-time.

This domain is a new axis of the CORSE team. Our goal here is to combine our expertise in compilation and teaching to help teachers and learners in computer-science fields such as programming, algorithms, data structures, automata, or more generally computing literacy. This axis comprises two projects. The first is the automated generation and recommendation of exercises using artificial intelligence. The second focuses on tools that help learning through visualization (data structures, debuggers, automata) or gamification.

In Intelligent Tutoring Systems (ITS), methods to choose the next exercise for a student are inspired by generic recommender systems, used, for instance, in online shopping or multimedia recommendation. As such, collaborative filtering, especially matrix factorization, is often included as part of recommendation algorithms in ITS. One notable difference in ITS is the rapid evolution of users, who improve their performance, as opposed to multimedia recommendation, where preferences are more static. This raises the following question: how reliably can we use matrix factorization, a tool tried and tested in a static environment, in a context where timelines seem to be of importance? In this work we tried to quantify empirically how much information can be extracted statically from datasets in education versus datasets in multimedia, as the quality of such information is critical to accurately make predictions and recommendations. We found that educational datasets contain less static information than multimedia datasets, to the extent that vectors of higher dimensions only marginally increase the precision of the matrix factorization compared to a 1-dimensional characterization. These results show that educational datasets must be used with time information, and warn against the dangers of directly applying existing algorithms developed for static datasets.
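
For readers unfamiliar with the technique, here is a minimal matrix-factorization sketch on a made-up student-by-exercise matrix (truncated SVD via NumPy; the datasets and models of the study are of course different):

```python
# Toy matrix factorization (illustration only): rank-k truncated SVD of
# a student x exercise success matrix.
import numpy as np

R = np.array([[1.0, 0.0, 1.0],    # rows: students, columns: exercises
              [1.0, 1.0, 0.0],    # 1 = solved, 0 = failed
              [0.0, 1.0, 1.0]])

k = 1                             # a 1-dimensional characterization
U, s, Vt = np.linalg.svd(R, full_matrices=False)
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # best rank-k approximation
print("reconstruction error:", np.linalg.norm(R - R_hat))
```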

This work has been done in the context of the AI4HI Inria exploratory project and has been presented at the EDM 2021 conference 4.

Learning to program involves building a mental representation of how a machine executes instructions and stores information in memory. To help students, teachers often use visual representations to illustrate executions of programs or particular concepts in their lectures. As a famous example, references/pointers are very often represented with arrows pointing to objects or memory locations. While these visual representations are most of the time hand-drawn, they nowadays tend to be supplemented by tool-generated ones. Such tools have the advantage of being usable by learners, empowering them to validate their own understanding of the concept the tool aims at representing. However, building such a tool from scratch requires a lot of effort and a high level of technical expertise, and the ones that already exist are difficult to adapt to different contexts. In this work we developed EasyTracker, a library that assists teachers of programming courses in building tools that generate representations tuned to their needs from actual programs. At its core, EasyTracker provides ways of controlling the execution and inspecting the state of programs. The control and inspection are driven and customized through a Python interface. The controlled program itself can be written either in Python or in any GDB-supported language such as C. This work showcases two tools built on EasyTracker, used in a teaching context to explain the notions of stack and heap and to visualize recursion, as well as a game prototype to help students learn debugging.
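
To illustrate the kind of execution control and state inspection involved, here is a minimal example using only Python's standard sys.settrace hook (this is not EasyTracker's API, just the underlying principle):

```python
# Stop at every executed line and dump the local variables; this is the
# raw material from which visual representations can be built.
import sys

def tracer(frame, event, arg):
    if event == "line":           # called before each executed line
        code = frame.f_code
        print(f"{code.co_name}:{frame.f_lineno} locals={frame.f_locals}")
    return tracer                 # keep tracing nested calls

def fact(n):
    return 1 if n <= 1 else n * fact(n - 1)

sys.settrace(tracer)
fact(3)                           # watch the recursion unfold, frame by frame
sys.settrace(None)
```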

This work has been submitted for publication at ACM ITICSE 2022.

To address the issue of ventilator shortages, our group (eSpiro Network) developed a freely replicable, open-source hardware ventilator.
Design: We performed a bench study. Setting: A dedicated research room in an ICU affiliated with a university hospital.
Subjects: We set the lung model to three conditions of resistance and linear compliance, mimicking the respiratory mechanics of representative intensive care unit (ICU) patients.
Interventions: The performance of the device was tested using the ASL5000 lung model.
Measurements and Main Results: Twenty-seven conditions were tested.
All the measurements fell within the 10% limits for the tidal volume (VT).
The volume error was influenced by the mechanical condition.

This work 2 has been done in the context of the Recovid project supported by Inria.