Languages, compilers, and run-time systems are some of the most important components to bridge the gap between applications and hardware.
With the continuously increasing power of computers, expectations are evolving, with ever more ambitious, computationally intensive, and complex applications.
As desktop PCs become a niche and servers mainstream, three categories of computing stand out for the next decade:
mobile, cloud, and super-computing.
Diversity, heterogeneity (even on a single chip), and hardware virtualization are thus putting more and more pressure on both compilers and run-time systems.
Moreover, because of the energy wall, architectures are becoming more and more complex, and parallelism is now ubiquitous at every level.
Unfortunately, the memory-CPU gap continues to widen, and energy consumption remains a major issue for future platforms.
To address the performance and energy-consumption challenges raised by silicon companies, compilers and run-time systems must evolve and, in particular, interact, taking into account the complexity of the target architecture.

The overall objective of Corse is to address this challenge by combining static and dynamic compilation techniques, with a more interactive embedding of programs and the compiler environment in the run-time system.

One of the characteristics of Corse is to ground our research in advanced mathematical tools.
Compiler optimization requires the use of several tools from discrete mathematics:
combinatorial optimization, algorithmics, and graph theory.
The aim of Corse is to tackle optimization not only for general-purpose but also for domain-specific applications.
We believe that new challenges in compiler technology design, and in particular in split compilation, should also take advantage of graph labeling techniques.
In addition to run-time and compiler techniques for program instrumentation, hybrid analysis and compilation advances will be mainly based on polynomial and linear algebra.

The other distinguishing feature of Corse is to address technical challenges related to compiler technology, run-time systems, and hardware characteristics.
This implies mastering the details of each.
This is especially important as any optimization is based on a reasonably accurate model.
Compiler expertise will be used to model applications (e.g., through automatic analysis of memory and computational complexity);
run-time expertise will be used to model concurrent activities and the overhead due to contention (including memory management);
hardware expertise will be used extensively to model physical resources and hardware mechanisms (including synchronization, pipelines, etc.).

The core foundation of the team is the combination of static and dynamic techniques, for both compilation and run-time systems. We believe this to be essential for addressing the high-performance and low-energy challenges raised by current application, software, and architecture trends.

Our project is structured along two main directions.
The first direction belongs to the area of run-time systems with the objective of developing strong relations with compilers.
The second direction belongs to the area of compiler analysis and optimization with the objective of combining dynamic analysis and optimization with static techniques.
The aim of Corse is to ground these two research activities in the end-to-end optimization of applications from specific domains.

The main industrial sector related to the research activities of Corse is the semiconductor sector (programmable architectures spanning from embedded systems to servers).
Obviously, any computing application that aims to exploit the resources of its host architecture as fully as possible (for high performance but also for low energy consumption) stands to benefit from advances in compiler and run-time technology.
These applications are built on numerical kernels (linear algebra, FFT, convolution...) that can be adapted to a large spectrum of architectures.
More specifically, an important activity concerns the optimization of machine learning applications for some high-performance accelerators.
Members of Corse already maintain strong and fruitful collaborations with several companies, such as STMicroelectronics, Atos/Bull, and Kalray.

Interactive Debugging with a traditional debugger can be tedious. One has to manually run a program step by step and set breakpoints to track a bug.

i-RV is an approach to bug fixing that aims to help developers during their Interactive Debugging sessions using Runtime Verification.

Verde is the reference implementation of i-RV.

Pipedream reverse engineers the following performance characteristics:
- Instruction latency: the number of cycles an instruction requires to execute.
- Peak micro-op retirement rate: how many fused micro-ops the CPU can retire per cycle.
- Micro-fusion: the number of fused micro-ops an instruction decomposes into.
- Micro-op decomposition and micro-op port usage: the list of unfused micro-ops every instruction decomposes into, and the list of execution ports each of these micro-ops can execute on.

The first step of the reverse-engineering process consists of generating a number of microbenchmarks. Pipedream then runs these benchmarks, measuring their performance using hardware counters. The latency, throughput, and micro-fusion of different instructions can then be read directly from these measurements.
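To give a flavor of the generation step, the sketch below emits the body of a latency microbenchmark as a chain of dependent instructions: the dependency serializes execution, so cycles per instruction approximate the latency, whereas independent chains would instead expose throughput. The emitter and its output format are illustrative assumptions, not Pipedream's actual generator.

```python
# Illustrative sketch (not Pipedream's code): emit x86 (AT&T syntax) assembly
# for a latency microbenchmark. Each instruction reads its own previous
# result, so the chain executes serially and cycles/instruction ~ latency.
# Using independent destination registers instead would measure throughput.
def emit_latency_chain(mnemonic="add", reg="%rax", unroll=16):
    body = "\n".join(f"    {mnemonic} {reg}, {reg}" for _ in range(unroll))
    return f".loop:\n{body}\n    dec %rcx\n    jnz .loop\n"

print(emit_latency_chain())
```

Running such a loop under hardware counters then yields the latency as the measured cycle count divided by the number of chain instructions retired.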

The process of finding port mappings, i.e. micro-op decompositions and micro-op port usage, however, is more involved. For this purpose, we have defined a variation of the maximum flow problem which we call the "instruction flow problem". We have developed a linear program (LP) formulation of the instruction flow problem which can be used to calculate the peak IPC and micro-operations per cycle (MPC) a benchmark kernel can theoretically achieve with a given port mapping. The actual port mapping of the underlying hardware is then determined by finding the mapping for which the throughput predicted by instruction flow best matches the actual measured IPC and MPC.
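The sketch below illustrates how such an instruction-flow LP could be written down for a toy port mapping, using the PuLP modeling library; the mapping, kernel, and names are illustrative assumptions, not Pipedream's actual formulation.

```python
# Illustrative LP sketch of the "instruction flow problem" (toy port mapping,
# not Pipedream's code): given a candidate mapping, compute the peak rate at
# which a benchmark kernel can theoretically execute.
import pulp

# Candidate mapping: instruction -> list of micro-ops, each micro-op given as
# the set of execution ports it may run on.
port_mapping = {
    "add":  [{"p0", "p1", "p5"}],   # one micro-op, three possible ports
    "load": [{"p2", "p3"}],         # one micro-op, two possible ports
}
kernel = ["add", "add", "load"]     # instruction mix of the benchmark kernel

prob = pulp.LpProblem("instruction_flow", pulp.LpMaximize)
rate = pulp.LpVariable("rate", lowBound=0)   # kernel iterations per cycle
prob += rate                                 # objective: maximize the rate

# flow[i, u, p]: micro-ops per cycle of micro-op u of kernel slot i on port p.
flow = {(i, u, p): pulp.LpVariable(f"f_{i}_{u}_{p}", lowBound=0)
        for i, ins in enumerate(kernel)
        for u, ports in enumerate(port_mapping[ins])
        for p in ports}

# Every micro-op of every kernel slot must be issued 'rate' times per cycle.
for i, ins in enumerate(kernel):
    for u, ports in enumerate(port_mapping[ins]):
        prob += pulp.lpSum(flow[i, u, p] for p in ports) == rate

# Each port can execute at most one micro-op per cycle.
for port in {p for (_, _, p) in flow}:
    prob += pulp.lpSum(v for (i, u, p), v in flow.items() if p == port) <= 1

prob.solve(pulp.PULP_CBC_CMD(msg=False))
r = pulp.value(rate)
uops_per_iter = sum(len(port_mapping[ins]) for ins in kernel)
print(f"peak IPC = {r * len(kernel):.2f}, peak MPC = {r * uops_per_iter:.2f}")
```

Solving this LP for each candidate mapping, and keeping the mapping whose predicted IPC and MPC best match the measurements, is the matching step described above.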

Grid'5000 1 is a large-scale and versatile testbed for experiment-driven research in all areas of computer science, with a focus on parallel and distributed computing including Cloud, HPC, and Big Data. It provides access to a large amount of resources: 14828 cores and 829 compute nodes grouped in homogeneous clusters located in 8 sites in France, connected through a dedicated network (Renater), and featuring various technologies (GPU, SSD, NVMe, 10G and 25G Ethernet, Infiniband, Omni-Path). It is highly reconfigurable and controllable: researchers can experiment with a fully customized software stack thanks to bare-metal deployment features, and can isolate their experiments at the networking layer. Advanced monitoring and measurement features for the collection of networking and power-consumption traces provide a deep understanding of experiments. The testbed is designed to support Open Science and reproducible research, with full traceability of infrastructure and software changes. Frédéric Desprez is director of the Grid'5000 GIS.

Frédéric Desprez is co-PI, with Serge Fdida (Université Sorbonne), of the SILECS 2 infrastructure ("IR ministère"), whose goal is to provide a platform for experimental computer science (Internet of Things, clouds, HPC, big data, AI, wireless technologies, ...). This new infrastructure is based on two existing infrastructures, Grid'5000 and FIT. A European infrastructure (SLICES, ESFRI proposal) is also currently being designed with 15 European partners (Spain, Cyprus, Greece, the Netherlands, Norway, Poland, Switzerland, ...).

Our current efforts with regard to code optimization follow two directions.

Researchers and practitioners have long worked on improving the computational complexity of algorithms, focusing on reducing the number of operations needed to perform a computation. However, current hardware trends clearly show a higher performance and energy cost for data movement than for computation: quality algorithms have to minimize data movement as much as possible.

The theoretical operational complexity of an algorithm is a function of the total number of operations that must be executed, regardless of the order in which they are actually executed. Theoretical data-movement (or I/O) complexity is fundamentally different: one must consider all possible legal schedules of the operations to determine the minimal number of data movements achievable, a major theoretical challenge. Until now, I/O complexity has been studied via complex manual proofs, successively refined over time.

We developed the first static analysis to automatically derive non-asymptotic parametric expressions of data-movement lower bounds, with scaling constants, for arbitrary affine computations. Our approach is fully automatic, assisting algorithm designers to reason about I/O complexity and make educated decisions about algorithmic alternatives. The tool allowed us to compute the lower bound for all the kernels of the Polybench suite. We extended this work by: 1. designing the first algorithm for computing a symbolic over-approximation of the data movement of a parametric (multi-dimensional) tiled version of an affine code; 2. designing the first fully automated scheme for expressing the minimization of this data-movement expression as an operational-research problem; 3. integrating those techniques into a tool that computes, for a class of affine computations, parametric expressions for I/O upper bounds.
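To give a concrete flavor of such bounds, here is the classical parametric I/O lower bound for dense N×N matrix multiplication with a fast memory (cache) of size S, a textbook example of the kind of expression involved (quoted from the literature, not from the tool's output):

```latex
% Any valid schedule of the N^3 multiply-accumulates must transfer at least
\[
  Q \;\ge\; \frac{2N^3}{\sqrt{S}} - O(S)
\]
% words between slow and fast memory. The explicit scaling constant (2) is
% what a non-asymptotic bound adds over the asymptotic \Omega(N^3/\sqrt{S}).
```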

This work has been done in the context of the IOcomplexity associate team (see Section 9.1.1). A paper presented at PLDI 2020 8 details the automated approach for the lower-bound complexity. The extensions to this approach have been conditionally accepted at PLDI 2021.

Tensor computations such as Sparse Matrix Multi-vector Multiplication, Sampled Dense-Dense Matrix Multiplication, Dense Matrix Multiplication, Tensor Contraction, and Convolution are important kernels used in many domains such as fluid dynamics, data analytics, economic modelling, and machine learning. Developing highly optimized code for such kernels requires combining highly tuned register/instruction-level micro-kernels with appropriate multi-level tiling. In this context we developed: 1. an analytical cache model for sparse matrices; 2. an analytical cache-model-based tile-size optimization along with a code generator for DNNs.

Tiling is a
key technique to reduce data movement in matrix computations.
While tiling is well understood and widely used for dense matrix/tensor computations,
effective tiling of sparse matrix computations remains a challenging
problem. This work proposes a novel method to efficiently summarize the
impact of the sparsity structure of a matrix on achievable data reuse as a
one-dimensional signature, which is then used to build an
analytical cost model for tile size optimization for sparse matrix
computations. The proposed model-driven approach to sparse tiling is
evaluated on two key sparse matrix kernels: Sparse Matrix - Dense Matrix
Multiplication (SpMM) and Sampled Dense-Dense Matrix Multiplication (SDDMM).
Experimental results demonstrate that model-based tiled SpMM and
SDDMM achieve high performance relative to the state-of-the-art.
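A toy illustration of the signature idea (deliberately simplified; not the paper's exact definition): in SpMM, a nonzero A[i][j] touches row j of the dense matrix, so reuse depends on how soon column j recurs in subsequent rows of A. A one-dimensional histogram of these row distances summarizes that reuse potential:

```python
# Toy sketch (not the paper's exact signature): summarize the reuse potential
# of a sparse matrix as a 1-D histogram of column reuse distances. In SpMM,
# a nonzero A[i][j] touches row j of the dense matrix B; if column j recurs
# in a nearby row i', that row of B may still be resident in cache.
import numpy as np
from scipy.sparse import random as sparse_random

def reuse_signature(A_csr, max_dist=64):
    hist = np.zeros(max_dist + 1, dtype=np.int64)
    last_row = {}          # column index -> last row where it appeared
    for i in range(A_csr.shape[0]):
        for j in A_csr.indices[A_csr.indptr[i]:A_csr.indptr[i + 1]]:
            if j in last_row:
                hist[min(i - last_row[j], max_dist)] += 1
            last_row[j] = i
    return hist            # hist[d]: number of reuses at row distance d

A = sparse_random(1000, 1000, density=0.01, format="csr", random_state=0)
sig = reuse_signature(A)
print("fraction of reuses within 8 rows:", sig[:9].sum() / max(sig.sum(), 1))
```

An analytical cost model can then estimate, for candidate tile sizes, how much of this reuse is actually captured in cache, without simulating the full computation.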

Addressing the problem of automatic generation of optimized operators raises two challenges. The first is associated with the design of a domain-specific code-generation framework able to output high-quality binary code. The second is to carefully bound the search space and choose an optimizing objective function that neither leads to yet another combinatorial optimization problem, nor leads to a too approximate performance objective. This work tackles those two challenges by:
1. revisiting the usual belief that packing should enable stride-1 accesses at every level, making packing optional;
2. highlighting the importance of considering the packing decision and shape as part of the optimization problem;
3. revisiting the usual belief that register spilling should be avoided if possible, allowing other (more packing-friendly) micro-kernels to be considered as good candidates;
4. revisiting the misleading intuition that convolution dimensions should be brought to the innermost level, allowing more freedom for memory reuse at outer dimensions;
5. showing that the optimization problem can be decoupled into: finding a small set of good micro-kernel candidates using an exhaustive search; finding a good schedule (loop tiling/permutation) and associated packing using operational research; and finding the best tile sizes using auto-tuning;
6. designing a single-pass micro-kernel generation algorithm to emit code for any choice of register-blocking dimensions, unrolling factor, and packing decisions (sketched below);
7. designing a lowering scheme for abstract iterators, compatible with diverse packing and tiling strategies, and thrifty with integer arithmetic and loop control;
8. designing a packing algorithm compatible with various choices of transposition and subviews;
9. implementing a code generator based on these algorithms, driven by a simple and modular configuration language.
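As a rough sketch of item 6, here is an illustrative single-pass emitter for a register-blocked micro-kernel; the fixed data layout and the absence of packing decisions are simplifying assumptions, and this is not the project's actual generator.

```python
# Illustrative sketch (not the project's generator): emit straight-line C code
# for an MR x NR register-blocked micro-kernel computing
# C[MR][NR] += A[MR][KU] * B[KU][NR], fully unrolled over the reduction dim.
def emit_microkernel(mr=4, nr=4, ku=2):
    lines = ["void microkernel(const float *A, const float *B, float *C) {"]
    # Load the C block into scalar accumulators (register-allocated by the
    # C compiler when MR x NR fits the register file).
    for i in range(mr):
        for j in range(nr):
            lines.append(f"  float c{i}_{j} = C[{i * nr + j}];")
    # Fully unrolled multiply-accumulate updates.
    for k in range(ku):
        for i in range(mr):
            for j in range(nr):
                lines.append(f"  c{i}_{j} += A[{i * ku + k}] * B[{k * nr + j}];")
    # Store the accumulators back.
    for i in range(mr):
        for j in range(nr):
            lines.append(f"  C[{i * nr + j}] = c{i}_{j};")
    lines.append("}")
    return "\n".join(lines)

print(emit_microkernel(mr=2, nr=2, ku=1))
```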

This work has been done in the context of the IOcomplexity associate team (see Section 9.1.1) and the PBI project ES3CAP (see Section 8.2.1). A paper presented at SC 2020 details the tiling approach for sparse matrix computations.

Performance modeling is a critical component for program optimization, assisting compilers as well as developers in predicting the performance of code variations ahead of time. Performance models can be obtained through different approaches that span from precise but complex simulation of a hardware description (Zesto, GEM5, PTLSim) to application-level analytical formulations. An interesting approach for modeling the CPU of modern pipelined, super-scalar, out-of-order processors trades simulation time for accuracy by separately characterizing the latency and throughput of instructions. This approach is suitable both for optimizing compilers and for hand-tuning critical kernels written in assembly (see Section 7.1.2). It is used by performance-analysis tools such as CQA, Intel IACA, OSACA, MIAMI, and llvm-mca. Cycle-approximate simulators such as ZSim or MCsimA can also take advantage of such an instruction characterization. In this context, we developed two tools: PALMED and GUS (see Section 6).

PALMED is a tool that automatically builds a resource mapping, a performance model for pipelined, super-scalar, out-of-order CPU architectures. Resource mappings describe the execution of a program by assigning the instructions of the program to abstract resources. They can be used to predict the throughput of basic blocks, or as a machine model for the backend of an optimizing compiler.

PALMED does not require hardware performance counters, and relies solely on runtime measurements to construct resource mappings. This allows it to model not only execution port usage, but also other limiting resources, such as the frontend or the reorder buffer. Also, thanks to a dual representation of resource mappings, our algorithm for constructing mappings scales to large instruction sets, like that of x86.
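To give an intuition of how a resource mapping yields a throughput prediction, here is a minimal sketch under the usual abstract-resource assumption that a basic block's steady-state rate is limited by its most loaded resource; the mapping values are made up for illustration, and this is not PALMED's code.

```python
# Minimal sketch of throughput prediction from a resource mapping
# (illustrative assumptions, not PALMED's code). Each instruction consumes
# some amount of each abstract resource; a resource serves one unit of usage
# per cycle, so cycles per block iteration = max over resources of total usage.
def predict_cycles(block, mapping):
    load = {}
    for ins in block:
        for resource, usage in mapping[ins].items():
            load[resource] = load.get(resource, 0.0) + usage
    return max(load.values())   # the most loaded resource is the bottleneck

mapping = {                     # hypothetical mapping, for illustration only
    "add":  {"ALU": 0.5, "frontend": 0.25},   # two ALU-like ports -> 0.5
    "load": {"LSU": 1.0, "frontend": 0.25},
    "mul":  {"MUL": 1.0, "frontend": 0.25},
}
block = ["add", "add", "load", "mul"]
cycles = predict_cycles(block, mapping)
print(f"predicted: {cycles} cycles/iteration, IPC = {len(block) / cycles:.2f}")
```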

GUS is the first framework for performance debugging of complex, realistic, pipelined, out-of-order processors executing arbitrary programs, via abstract simulation for sensitivity analysis. Abstract simulation aims at providing a significant speed improvement over cycle-accurate simulation. This makes it possible to quickly perform multiple runs with several values for the latency, throughput, and count of the modeled resources: this is sensitivity analysis. To determine whether a resource is a bottleneck, its capacity is artificially increased, checking whether the (simulated) program execution time improves accordingly. Such an approach is not feasible with a cycle-accurate simulator; yet, to be realistic, it requires an accurate modeling of the execution time.
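Schematically, the sensitivity-analysis loop looks as follows, assuming a hypothetical simulate(program, resources) abstract simulator that returns a cycle count (an illustration of the method, not GUS's code):

```python
# Illustrative sensitivity-analysis loop (not GUS's code). 'simulate' stands
# for a hypothetical abstract simulator returning the simulated cycle count of
# a program under a given resource configuration (capacities, latencies, ...).
def sensitivity_analysis(simulate, program, resources, factor=2.0):
    base = simulate(program, resources)
    report = {}
    for name in resources:
        relaxed = dict(resources)
        relaxed[name] = resources[name] * factor   # enlarge one resource
        report[name] = base / simulate(program, relaxed)
    return report   # speedup > 1 flags the resource as a bottleneck
```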

This work has been done in the context of the associate team IOComplexity (see Section 9.1.1) and the European project CPS4EU (see Section 9.2.1).

Mathieu Stoffel started his PhD in February 2018 on a CIFRE contract with Atos/Bull. The purpose of this work is to reduce the energy consumption of HPC applications on large-scale platforms by monitoring and predicting the behavior of executed applications in order to optimize resource utilization, notably power consumption. In this context, Mathieu developed Phase-TA, an offline tool which detects and characterizes the inherent periodicities of iterative HPC applications, with no prior knowledge of the latter. To do so, it analyses the evolution of several performance counters at the scale of the compute node, and infers patterns representing the identified periodicities. As a result, Phase-TA offers a non-intrusive means to gain insight into the processor use associated with an application, and paves the way to predicting its behavior. Phase-TA was tested on a panel of three applications and benchmarks from the supercomputing field: HPCG, NEMO, and OpenFoam. For all of them, periodicities accounting for 78% of their execution time on average were detected and represented by accurate patterns. Furthermore, it was demonstrated that there is no need to analyse the whole profile of an application to precisely characterize its periodic behaviors: an extract of the aforementioned profile is enough for Phase-TA to infer representative patterns on the fly, opening the way to energy-efficiency optimization through Dynamic Voltage-Frequency Scaling (DVFS). This work was accepted at the HPCS 2020 conference (postponed to 2021). The next step will be to use Phase-TA to identify DVFS opportunities and dynamically adapt core frequencies at runtime.
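One standard way to detect such periodicities, shown here as a simplified illustration rather than Phase-TA's actual algorithm, is to look for the dominant peak in the autocorrelation of a performance-counter trace:

```python
# Simplified illustration (not Phase-TA's algorithm): estimate the dominant
# period of a performance-counter trace via its autocorrelation.
import numpy as np

def dominant_period(trace, min_lag=2):
    x = np.asarray(trace, dtype=float) - np.mean(trace)
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]   # lags 0..n-1
    return int(np.argmax(ac[min_lag:]) + min_lag)       # lag of highest peak

rng = np.random.default_rng(0)
trace = np.tile([1.0, 1.0, 3.0, 5.0, 2.0], 200) + rng.normal(0, 0.1, 1000)
print("estimated period:", dominant_period(trace))      # expected: 5
```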

During the period, our new results and contributions can be categorized as follows. First, our main efforts were related to the verification and validation of applications in the domain of the Internet of Things, in the context of the CLAPS project. We had two main contributions, namely the development of the BISM tool and the design and implementation of runtime enforcement monitors. BISM is a tool for expressive and efficient instrumentation of Java applications at the bytecode level. For our verification purposes, source-level instrumentation (which we previously used) was neither sufficiently expressive nor sufficiently fine-grained (e.g., for changes in the control flow). Second, we consolidated work started a few years ago through the preparation and revision of journal submissions, some of which were published this year.

Moreover, we co-edited a special issue on the use of formal methods to help in the software development process 2.

Finally, we wrote a teaching book 10 introducing automata theory and regular languages. This is part of our effort to contribute to the evolution of teaching practice, so as to provide better-quality material to students and make them more autonomous.

In this work 4 we deal with runtime enforcement of timed properties with uncontrollable events. Runtime enforcement consists in defining and using an enforcement mechanism that modifies the executions of a running system to ensure their correctness with respect to a desired property. Uncontrollable events cannot be modified by the enforcement mechanism and thus have to be released immediately. We present a complete theoretical framework for synthesizing such a mechanism, modeling the runtime enforcement problem as a Büchi game. This makes it possible to pre-compute the decisions of the enforcement mechanism, thus avoiding exploring the whole execution tree at runtime. The obtained enforcement mechanism is sound, compliant, and optimal, meaning that it outputs correct executions as soon as possible, and as close as possible to the input execution. The framework takes as input any timed regular property modelled by a timed automaton. We present GREP, a tool implementing this approach. We provide algorithms and implementation details for the different modules of GREP, and evaluate its performance. The results are compared with another state-of-the-art runtime enforcement tool.
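As an untimed toy illustration of this enforcement discipline (the actual mechanism is synthesized from a Büchi game and handles timed properties): controllable events may be delayed until the property allows releasing them, while uncontrollable events are released immediately.

```python
# Untimed toy illustration of runtime enforcement with uncontrollable events
# (not GREP's game-based mechanism). 'is_correct' is a predicate over output
# sequences; the property and event names below are hypothetical.
class Enforcer:
    def __init__(self, is_correct, controllable):
        self.is_correct = is_correct
        self.controllable = controllable
        self.output, self.buffer = [], []

    def receive(self, event):
        if event not in self.controllable:
            self.output.append(event)   # uncontrollable: release immediately
        else:
            self.buffer.append(event)   # controllable: may be delayed
        # Release buffered events as soon as the output remains correct.
        while self.buffer and self.is_correct(self.output + self.buffer[:1]):
            self.output.append(self.buffer.pop(0))
        return list(self.output)

# Hypothetical property: 'use' must never occur before 'init'.
def ok(seq):
    return "use" not in seq or ("init" in seq
                                and seq.index("init") < seq.index("use"))

e = Enforcer(ok, controllable={"use"})
for ev in ["use", "init"]:
    print(ev, "->", e.receive(ev))      # 'use' is held back until 'init'
```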

In this work 3, we define a method to automatically synthesize efficient distributed implementations from high-level global choreographies. A global choreography describes the execution and communication logic between a set of provided processes, which are described by their interfaces. At the choreography level, the operations include multiparty communications, choice, loop, and branching. A choreography is master-triggered: it has one master that triggers its execution. This allows us to automatically generate conflict-free distributed implementations without controllers. The behavior of the synthesized implementations follows that of the choreographies. In addition, the absence of controllers makes the implementation efficient and reduces the communication needed at runtime. Moreover, we define a translation of the distributed implementations to equivalent Promela versions, which makes it possible to verify the distributed system against behavioral properties. We implemented a Java prototype to validate the approach and applied it to automatically synthesize micro-service architectures. We also illustrate our method on the automatic synthesis of a verified distributed buying system.

In this work, we introduce two complementary approaches to monitor decentralized systems. The first approach targets systems with a centralized specification, i.e., when the specification is written for the behavior of the entire system. To do so, our approach introduces a data structure that (i) keeps track of the execution of an automaton, (ii) has predictable parameters and size, and (iii) guarantees strong eventual consistency. The second approach defines decentralized specifications, wherein multiple specifications are provided for separate parts of the system. We study two properties of decentralized specifications pertaining to monitorability and to compatibility between specification and architecture. We also present a general algorithm for monitoring decentralized specifications. We map three existing algorithms to our approaches and provide a framework for analyzing their behavior. Furthermore, we present THEMIS, a framework for designing such decentralized algorithms and simulating their behavior. We demonstrate the usage of THEMIS to compare multiple algorithms and validate the trends predicted by the analysis in two scenarios: a synthetic benchmark and the Chiron user interface.

In this work 9, we introduce BISM (Bytecode-level Instrumentation for Software Monitoring), a lightweight Java bytecode instrumentation tool featuring an expressive, high-level, control-flow-aware instrumentation language. The language follows the aspect-oriented programming paradigm by adopting the joinpoint model, advice inlining, and separate instrumentation mechanisms. BISM provides joinpoints ranging from bytecode instructions to method executions, access to comprehensive context information, and instrumentation methods. BISM runs in two modes: build-time and load-time. We demonstrate the effectiveness of BISM in two experiments: a security scenario and a general runtime-verification case. The results show that BISM instrumentation incurs low runtime and memory overheads.

This domain is a new axis of the Corse team. Our goal here is to combine our expertise in compilation and teaching to help teachers and learners in computer-science fields such as programming, algorithms, data structures, and automata, or more generally computing literacy. The most important project in this regard is the automated generation and recommendation of exercises using artificial intelligence, a thesis that started last year. Other projects focus on tools to help learning through visualization (data structures, debuggers, automata) or gamification (AppoLab), and are the source of many internships that give younger students experience in a research team.

In an ideal educational world, each learner would have access to individual pedagogical help tailored to their needs: for instance, a tutor who could rapidly react to questions, propose pedagogical content matching the learner's skills, and identify and work on his or her weaknesses. However, the real world imposes constraints that make this individual pedagogical help hard to achieve.

The goal of the AI4HI project is to combine the new advances in artificial intelligence with the team's skills in compilation and teaching to aid teaching through the automated generation and recommendation of exercises to learners. In particular, we target the teaching of programming and debugging to novices. This system will propose exercises that match the learners' needs and hence improve the learning, progression, and self-confidence of learners.

This project has received an “Action Exploratoire” funding from Inria, and Théo Barollet started his PhD in September 2019. A collaboration is ongoing with the startup Toxicode, which provided data from one of their educational games. Our current goal is to reliably predict how much time or how many tries a student will need to successfully pass an exercise, for which we also use publicly available data from the educational data mining community. So far, we obtained mixed results using convolutional neural networks, but now have promising results using matrix factorization.
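A minimal sketch of the matrix-factorization approach (an illustration of the standard technique, not the project's actual model): student and exercise latent factors are fit by stochastic gradient descent on observed (student, exercise, passed) triples, and unseen pairs are predicted from dot products.

```python
# Minimal matrix-factorization sketch (standard technique for illustration,
# not the project's model): predict the probability that a student passes an
# exercise from latent factors fit by SGD with a logistic loss.
import numpy as np

def fit_mf(triples, n_students, n_exercises, k=8, lr=0.05, reg=0.01,
           epochs=200):
    rng = np.random.default_rng(0)
    S = rng.normal(0, 0.1, (n_students, k))    # student latent factors
    E = rng.normal(0, 0.1, (n_exercises, k))   # exercise latent factors
    for _ in range(epochs):
        for s, e, passed in triples:
            p = 1.0 / (1.0 + np.exp(-S[s] @ E[e]))  # predicted pass prob.
            g = p - passed                           # logistic-loss gradient
            S[s], E[e] = (S[s] - lr * (g * E[e] + reg * S[s]),
                          E[e] - lr * (g * S[s] + reg * E[e]))
    return S, E

triples = [(0, 0, 1), (0, 1, 0), (1, 0, 1), (1, 1, 1)]  # toy observations
S, E = fit_mf(triples, n_students=2, n_exercises=2)
print("P(student 0 passes exercise 1):", 1 / (1 + np.exp(-S[0] @ E[1])))
```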

Classical teaching of algorithms and low-level data structures is often tedious and unappealing to students. AppoLab is an online platform to engage students in their learning by bringing gamification into Problem-Based Learning. At its core, it is a server with scripted “exercises”. Students can communicate with the server manually, but ultimately they need to script the communication from their side as well, since the server gradually imposes constraints on the problems, such as timeouts or large input sizes. The platform receives gradual improvements. This year we provided a full “game” on the platform to teach, over a three-week period, basic data-structure complexity and how to understand and modify basic algorithms working on lists.

Debuggers are powerful tools to observe a program's behavior and find bugs, but they have a steep learning curve. They provide information on low-level data but are not able to analyze higher-level elements such as data structures. This work tries to provide a more intuitive representation of the program execution to ease debugging and the understanding of algorithms. We developed a prototype, Moly, a GDB extension that explores a program's runtime memory and analyzes its data structures. It provides an interface with an external visualizer, Lotos, through a formatted output. Work has also started on a tutorial on how to use GDB and these extensions, but was put on hold due to the difficulty of recruiting interns during the Covid crisis.
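To give a flavor of what such a GDB extension looks like, here is a minimal sketch using GDB's Python API; the command name and the assumed struct node layout (fields value and next) are illustrative, not Moly's actual code.

```python
# Minimal GDB Python extension in the spirit of Moly (illustrative, not
# Moly's code): a user command that walks a singly linked list in the
# debuggee's memory and prints its values.
import gdb

class WalkList(gdb.Command):
    """walk-list EXPR -- print the values of a singly linked list."""
    def __init__(self):
        super().__init__("walk-list", gdb.COMMAND_USER)

    def invoke(self, arg, from_tty):
        node = gdb.parse_and_eval(arg)          # e.g. 'head', a struct node*
        values = []
        while int(node) != 0:
            values.append(int(node["value"]))   # read field 'value'
            node = node["next"]                 # follow the 'next' pointer
        gdb.write(" -> ".join(map(str, values)) + "\n")

WalkList()  # register; usage inside gdb: (gdb) walk-list head
```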

COVID-19 is a pneumonia that may culminate in acute respiratory distress syndrome (ARDS). With the pandemic, intensive care unit (ICU) beds and ventilators have become resources of the utmost value. While France and many other countries were able to rapidly increase the number of ICU beds by upgrading step-down units and post-operative recovery rooms, a major shortage of ventilators became a worldwide critical concern. The COVID-19 pandemic therefore stressed the need for emergency ventilator systems that can be rapidly deployed to prevent the demand for ventilators from outstripping their supply. Various scenarios, such as regional emergencies, a global pandemic, or low-resource ICUs, require a ventilator-sharing strategy that maximizes the number of patients able to receive potentially life-saving treatment. To address this issue of ventilator shortage, our group (eSpiro Network) developed a rapidly deployable, open-source ventilator. Based on frugal innovation, we expect it to enable better care with fewer resources, for more critically ill patients.

This work has been done in the context of the Recovid project supported by Inria.

Epidemiological modeling is an important tool that makes it possible to predict the evolution of a pandemic (predictive use) and to understand the parameters that affect it (mechanistic use). There exist several, usually opposed, approaches: agent-based versus compartmental; stochastic versus deterministic; analytical versus computational. The different approaches trade expressiveness for complexity. In this project we started the development of an open-source computational model whose objective is to provide the expressiveness of a stochastic agent-based model with a computational complexity that is orders of magnitude lower. Because of health issues (long COVID) of the project leader, development has been interrupted but should resume in 2021.

Most health databases used for COVID suffer from several problems: lack of coordination (double counting); reliability (manual gathering, obfuscation); representativeness (no quota method); lack of longitudinal data...
The objective of the project is the development of an open-source survey application that would provide longitudinal data.
As opposed to Zoe (Covid Symptom Study App – https://

The goal of this project is to extend techniques for the automatic characterization of an application's data movement to the design of performance-estimation tools.

The associate team has three main objectives: 1. broader applicability of I/O complexity analysis; 2. hardware characterization; 3. performance modeling.

In this work, we provide a complete reference manual for Bachelor students taking a course on automata theory and regular languages. The book covers the major topics: languages, finite-state automata in their various forms, regular expressions, and grammars. It provides intuitive summaries, lecture notes, and exercises with their solutions.