Keywords
 A3.1.1. Modeling, representation
 A3.1.8. Big data (production, storage, transfer)
 A3.3. Data and knowledge analysis
 A3.3.3. Big data analysis
 A3.4. Machine learning and statistics
 A3.4.1. Supervised learning
 A3.4.2. Unsupervised learning
 A3.4.3. Reinforcement learning
 A3.4.4. Optimization and learning
 A3.4.5. Bayesian methods
 A3.4.6. Neural networks
 A3.4.7. Kernel methods
 A3.4.8. Deep learning
 A3.5.1. Analysis of large graphs
 A6.1. Methods in mathematical modeling
 A9.2. Machine learning
 B1.1.4. Genetics and genomics
 B1.1.7. Bioinformatics
 B2.2.4. Infectious diseases, Virology
 B2.3. Epidemiology
 B4. Energy
 B4.4. Energy delivery
 B4.5. Energy consumption
 B5.2.1. Road vehicles
 B5.2.2. Railway
 B5.5. Materials
 B5.9. Industrial maintenance
 B7.1. Traffic management
 B7.1.1. Pedestrian traffic and crowds
 B9.5.2. Mathematics
 B9.8. Reproducibility
 B9.9. Ethics
1 Team members, visitors, external collaborators
Research Scientists
 Kevin Bleakley [INRIA, Researcher]
 Gilles Celeux [INRIA, Emeritus]
 Evgenii Chzhen [CNRS, Researcher]
 Gilles Stoltz [CNRS, Senior Researcher, HDR]
Faculty Members
 Sylvain Arlot [Team leader, UNIV PARIS SACLAY, Professor]
 Christophe Giraud [UNIV PARIS SACLAY, Professor]
 Alexandre Janon [UNIV PARIS SACLAY, Associate Professor]
 Christine Keribin [UNIV PARIS SACLAY, Associate Professor, HDR]
 Pascal Massart [UNIV PARIS SACLAY, Professor]
 Patrick Pamphile [UNIV PARIS SACLAY, Associate Professor]
 Marie-Anne Poursat [UNIV PARIS SACLAY, Associate Professor]
Post-Doctoral Fellows
 Pierre Andrieu [UNIV PARIS-SACLAY, until Aug 2022, Co-advised with Sarah Cohen-Boulakia (LISN)]
 Pierre Humbert [INRIA]
PhD Students
 Yvenn Amara-Ouali [UNIV PARIS-SACLAY, until Sep 2022, Co-advised with Yannig Goude (EDF)]
 Filippo Antonazzo [Inria, until Sep 2022, Co-advised with Christophe Biernacki (Modal)]
 Emilien Baroux [GROUPE PSA, CIFRE, Co-advised with Andrei Constantinescu (LMS)]
 Antoine Barrier [ENS Lyon, Co-advised with Aurélien Garivier]
 Samy Clementz [UNIV PARIS 1 PANTHEON SORBONNE, Co-advised with Alain Celisse]
 Olivier Coudray [GROUPE PSA, CIFRE]
 Rémi Coulaud [SNCF, CIFRE]
 Solenne Gaucher [UNIV PARIS-SACLAY, until Jun 2022]
 Karl Hajjar [UNIV PARIS-SACLAY, Co-advised with Lénaïc Chizat (EPFL)]
 Perrine Lacroix [UNIV PARIS-SACLAY, Co-advised with Marie-Laure Martin-Magniette (INRAE)]
 Etienne Lasalle [UNIV PARIS-SACLAY, Co-advised with Frédéric Chazal (Datashape)]
 Leonardo Martins-Bianco [Inria, from Apr 2022, Co-advised with Zacharie Naulet (LMO); M2 intern from April to August, then PhD student]
 Chiara Mignacco [UNIV PARIS-SACLAY, from Sep 2022]
 Louis Pujol [Inria, Co-advised with Marc Glisse (Datashape)]
 El Mehdi Saad [UNIV PARIS-SACLAY, Co-advised with Gilles Blanchard (Datashape)]
 Gayane Taturyan [IRT SystemX, Co-advised with Jean-Michel Loubes (IMT) and Mohamed Hebiri (Univ Paris-Est)]
Administrative Assistants
 Aissatou-Sadio Diallo [Inria, from May 2022]
 Laurence Fontana [Inria, until May 2022]
External Collaborators
 Claire Lacour [UNIV PARIS EST]
 Matthieu Lerasle [ENSAE]
 Jean-Michel Poggi [UNIV PARIS CITE, from Mar 2022]
2 Overall objectives
2.1 Mathematical statistics and learning
Data science—a vast field that includes statistics, machine learning, signal processing, data visualization, and databases—has become front-page news due to its ever-increasing impact on society, over and above the important role it already played in science over the last few decades. Within data science, the statistical community has long-term experience in how to infer knowledge from data, based on solid mathematical foundations. The recent field of machine learning has also made important progress by combining statistics and optimization, with a fresh point of view that originates in applications where prediction is more important than building models.
The Celeste project-team is positioned at the interface between statistics and machine learning. We are statisticians in a mathematics department, with strong mathematical backgrounds, interested in interactions between theory, algorithms, and applications. Indeed, applications are the source of many of our interesting theoretical problems, while the theory we develop plays a key role in (i) understanding how and why successful statistical learning algorithms work—hence improving them—and (ii) building new algorithms upon foundations rooted in mathematical statistics. We therefore tackle several major challenges of machine learning from our mathematical statistics point of view (in particular the algorithmic fairness issue), always keeping in mind that modern datasets are often high-dimensional and/or large-scale, which must be taken into account when building statistical learning algorithms. For instance, there are often trade-offs between statistical accuracy and complexity which we want to clarify as much as possible.
In addition, most theoretical guarantees that we prove are non-asymptotic, which is important because the number of features $p$ is often larger than the sample size $n$ in modern datasets, so asymptotic results with $p$ fixed and $n\to +\infty $ are not relevant. The non-asymptotic approach is also closer to the real world than specific asymptotic settings, since it is difficult to say whether $p=1000$ and $n=100$ corresponds to the setting $p=10n$ or $p={n}^{3/2}$.
Finally, a key ingredient in our research program is connecting our theoretical and methodological results with (a great number of) real-world applications. This is why a large part of our work is devoted to industrial and medical data modeling, on a set of real-world problems coming from long-term collaborations with several partners as well as various opportunistic one-shot collaborations.
3 Research program
3.1 General presentation
We split our research program into four research axes, distinguishing problems and methods that are traditionally considered part of mathematical statistics (e.g., model selection and hypothesis testing, see section 3.2) from those usually tackled by the machine learning community (e.g., multi-armed bandits, deep learning, clustering and pairwise-data inference, see section 3.3). Section 3.4 is devoted to industrial and medical data modeling questions which arise from several long-term collaborations and more recent research contracts. Finally, section 3.5 is devoted to algorithmic fairness, a theme of Celeste which we want to specifically emphasize. Despite presenting mathematical statistics, machine learning, and data modeling as separate axes, we would like to make clear that these axes are strongly interdependent in our research and that this interdependence is a key factor in our success.
3.2 Mathematical statistics
One of our main goals is to address major challenges in machine learning in which mathematical statistics naturally plays a key role, in particular in the following two areas of research.
Estimator selection.
Any machine learning procedure requires a choice of values for its hyperparameters, and one must also choose among the numerous procedures available for any given learning problem; both situations correspond to an estimator selection problem. High-dimensional variable (feature) selection is another key estimator selection problem. Celeste addresses all such estimator selection problems, where the goal is to select an estimator (or a set of features) minimizing the prediction/estimation risk; the corresponding non-asymptotic theoretical guarantee—which we want to prove in various settings—is an oracle inequality.
Statistical reproducibility.
Science currently faces a reproducibility crisis, making it necessary to provide statistical inference tools (hypothesis tests, confidence regions) for assessing the significance of the output of any learning algorithm in a computationally efficient way. Our goal here is to develop methods for which we can prove upper bounds on the type I error rate, while maximizing the detection power under this constraint. We are particularly interested in the variable selection case, which here leads to a multiple testing problem for which key metrics are the family-wise error rate (FWER) and the false discovery rate (FDR).
3.3 Theoretical foundations of machine learning
Our distinguishing approach (compared to peer groups around the world) is to offer a statistical and mathematical point of view on machine-learning (ML) problems and algorithms. Our main focus is to provide theoretical guarantees for certain ML problems, with special attention paid to the statistical point of view, in particular minimax optimality and statistical adaptivity. In the areas of deep learning and big data, computationally efficient optimization algorithms are essential. The choice of the optimization algorithm has been shown to have a dramatic impact on the generalization properties of predictors. Such empirical observations have led us to investigate the interplay between computational efficiency and statistical properties. The set of problems we tackle includes online learning (stochastic bandits and expert aggregation), clustering and co-clustering, pairwise-data inference, semi-supervised learning, and the interplay between optimization and statistical properties.
3.4 Industrial and medical data modeling
Celeste collaborates with industry and with medicine/public-health institutes to develop methods and apply results of a broadly statistical nature—prediction, aggregation, anomaly detection, forecasting, and so on—in relation to pressing industrial and/or societal needs (see sections 4 and 5.2). Most of these methods and applied results are directly related to the more theoretical subjects of the first two research axes, including for instance estimator selection, aggregation, and supervised and unsupervised classification. Furthermore, Celeste is well positioned for problems whose data require unconventional methods—for instance, non-asymptotic analysis or data with selection bias—and in particular for problems that can give rise to technology transfer in the context of CIFRE PhDs.
3.5 Algorithmic fairness
Machine-learning algorithms make pivotal decisions which influence our lives on a daily basis, using data about individuals. Recent studies show that imprudent use of these algorithms may lead to unfair and discriminatory decisions, often inheriting or even amplifying disparities present in the data. The goal of Celeste on this topic is to design and analyze novel tractable algorithms that, while still optimizing prediction performance, mitigate or remove unfair decisions of the learned predictor. A major challenge in the machine-learning fairness literature is to obtain algorithms which satisfy fairness and risk guarantees simultaneously. Several empirical studies suggest that there is a trade-off between the fairness and accuracy of a learned model: more accurate models are less fair. We focus on providing user-friendly statistical quantification of such trade-offs and on building statistically optimal algorithms in this context, with special attention paid to the online learning setting. Relying on the strong mathematical and statistical competency of the team, we approach the problem from an angle that differs from the mainstream computer science literature.
4 Application domains
4.1 Neglected tropical diseases
Celeste collaborates with researchers at Institut Pasteur on encephalitis in South-East Asia, especially with Jean-David Pommier.
4.2 Electricity load consumption: forecasting and control
Celeste has a long-term collaboration with EDF R&D on electricity consumption. An important problem is to forecast consumption. We currently work on an approach involving back-and-forth disaggregation (of the total consumption into the consumption of well-chosen groups/regions) and aggregation of local estimates.
4.3 Reliability
Collected product lifetime data is often non-homogeneous, affected by production variability and differing real-world usage. Usually, this variability is neither controlled nor observed, but it needs to be taken into account in reliability analysis. Latent structure models are flexible models commonly used to model such unobservable causes of variability.
Celeste currently collaborates with Stellantis. To dimension its vehicles, Stellantis uses a reliability design method called Strength–Stress, which takes into consideration both the statistical distribution of part strength and the statistical distribution of customer load (called stress). In order to minimize the risk of in-service failure, the probability that a “severe” customer will encounter a weak part must be quantified. Quantifying severity is not simple, since vehicle use and driver behaviour can be “severe” for some types of materials and not for others. The aim of the study is thus to define a new and richer notion of “severity” from Stellantis's databases, resulting either from tests or from client usage. This will lead to more robust and accurate part-dimensioning methods. Two CIFRE theses (one recently defended, the other in progress) tackle such subjects:
Olivier Coudray, “A statistical point of view on fatigue criteria: from supervised classification to positive-unlabeled learning” 24. Here, we seek to build probabilistic fatigue criteria to identify the critical zones of a mechanical part.
Emilien Baroux, “Reliability dimensioning under complex loads: from specification to validation”. Here, we seek to identify and model the critical loads that a vehicle can undergo according to its usage profile (driver, roads, climate, etc.).
4.4 Exploiting a data-rich environment for railway operations, with SNCF–Transilien
We have an ongoing collaboration with SNCF–Transilien to exploit rich datasets of railway operations and passenger flows, obtained by automatic recording devices (for passenger flows, these correspond to infrared sensors at the door level). We tackle two problems. First, we model and estimate the dwell time of trains, as well as their delay with respect to scheduled arrival; both are important factors to control in order to guarantee punctuality. Second, we model and forecast passenger movements inside train coaches, so as to be able to provide incoming passengers with information on the crowding of coaches. Both series of problems come with new results described in section 7.11. They correspond to the final year of the CIFRE PhD of Rémi Coulaud between our Celeste team and SNCF–Transilien 25.
5 Social and environmental responsibility
5.1 Footprint of research activities
Still influenced by the aftermath of the Covid-19 pandemic, the carbon emissions of Celeste team members related to their jobs were very low and came essentially from:
 limited levels of transport to and from work, and a small amount of travel, essentially by land, to conferences in France and Europe.
 electronic communication (email, Google searches, Zoom meetings, online seminars, etc.).
 the carbon emissions embedded in their personal computing devices (construction), either laptops or desktops.
 electricity for personal computing devices and for the workplace, plus water, heating, and maintenance for the latter. Note that only 7.1% (2018) of France's electricity is not sourced from nuclear energy or renewables, so team members' carbon emissions related to electricity are minimal.
In terms of magnitude, the largest per capita ongoing emissions (excluding flying) likely come simply from buying computers, whose construction carries a carbon footprint in the range of 100 kg CO2e each. In contrast, typical email use comes to around 10 kg CO2e per person per year, a Zoom call to around 10 g CO2e per hour per person, and web browsing to around 100 g CO2e per hour. Consequently, 2022 was a very low-carbon year for the Celeste team. To put this in the context of work travel by flying, one return Paris–Nice flight corresponds to 160 kg CO2e of emissions, which likely dwarfs the total work-related emissions of any one Celeste team member in 2022.
The approximate (rounded for simplicity) kg CO2e values cited above come from the book “How Bad Are Bananas?” by Mike Berners-Lee (2020), which estimates carbon emissions in everyday life.
5.2 Impact of research results
In addition to the long-term impact of our theoretical work—which is of course impossible to assess immediately—we are involved in several applied research projects which aim at having a short- or mid-term positive impact on society.
First, we collaborate with the Pasteur Institute on neglected tropical diseases; encephalitis in particular, with implications in global health strategies.
Second, the broad use of artificial intelligence/machine learning/statistics nowadays comes with several major ethical issues, one being to avoid making unfair or discriminatory decisions. Our theoretical work on algorithmic fairness has already led to several “fair” algorithms that could be widely used in the short term (one of them is already used for enforcing fair decision-making in student admissions at the University of Genoa).
Third, we expect short-term positive impact on society from several direct collaborations with companies such as EDF (forecasting and control of electricity load consumption), SNCF (punctuality of trains and better passenger information on crowding inside train coaches) and Stellantis (automobile reliability, with two CIFRE contracts).
6 New software and platforms
6.1 New software
6.1.1 CRTLogit

Keywords:
Hypothesis testing, Variable selection, High Dimensional Data, Statistical learning, Classification

Functional Description:
This package runs the CRT-logit algorithm, performing a conditional randomization test to identify the relevant variables of a classification model.
 URL:
 Publication:

Contact:
Tuan Binh Nguyen
6.1.2 pysarpu

Keywords:
Statistical learning, Semi-supervised classification

Functional Description:
Implementation of the estimation procedure of a PU learning model (joint estimation of the classifier and the propensity function) under parametric assumptions using the EM (Expectation Maximization) algorithm. Several classification models (logistic regression, linear discriminant analysis) and propensity models (logistic, probit, Gumbel) are implemented.
 URL:
 Publication:

Contact:
Christine Keribin

Partner:
STELLANTIS
6.1.3 binMixtC

Keywords:
Clustering, EM algorithm

Functional Description:
Software implementing a data compression scheme preserving bin-marginal values, with a specific EM-like algorithm.
 URL:

Contact:
Christine Keribin
6.1.4 gsbm

Keywords:
Missing data, Outlier detection, Stochastic block model

Functional Description:
Given an adjacency matrix drawn from a Generalized Stochastic Block Model with missing observations, this package robustly estimates the probabilities of connection between nodes and detects outlier nodes.
 URL:
 Publication:

Contact:
Solenne Gaucher

Partner:
Ecole des Ponts ParisTech
6.1.5 DBABST

Keywords:
Classification, Algorithmic fairness, Demographic parity

Functional Description:
Post-processing algorithm for binary classification with abstention under demographic parity (DP) constraints.
 URL:
 Publication:

Contact:
Evgenii Chzhen
6.1.6 HiDimStat

Keywords:
Statistical inference, High Dimensional Data, Variable selection

Functional Description:
The HiDimStat package provides statistical inference methods to solve the problem of support recovery in the context of high-dimensional and spatially structured data.
 URL:
 Publications:

Contact:
Tuan Binh Nguyen
6.2 New platforms
Participants: Alexandre Janon.
 A. Janon: Deployment of an R (Shiny) server for the LAGUN platform in uncertainty quantification. Link.
7 New results
7.1 Conditional Randomization Test for Sparse Logistic Regression in High Dimension
Participants: Sylvain Arlot.
Identifying the relevant variables for a classification model with correct confidence levels is a central but difficult task in high dimension. Despite the core role of sparse logistic regression in statistics and machine learning, it still lacks a good solution for accurate inference in the regime where the number of features $p$ is as large as or larger than the number of samples $n$. In 19, in collaboration with B. Nguyen and B. Thirion, we tackle this problem by improving the Conditional Randomization Test (CRT). The original CRT algorithm shows promise as a way to output p-values while making few assumptions on the distribution of the test statistics. As it comes with a prohibitive computational cost even in mildly high-dimensional problems, faster solutions based on distillation have been proposed. Yet, they rely on unrealistic hypotheses and result in low-power solutions. To improve this, we propose CRT-logit, an algorithm that combines a variable-distillation step and a decorrelation step taking into account the geometry of the ${\ell}^{1}$-penalized logistic regression problem. We provide a theoretical analysis of this procedure, and demonstrate its effectiveness on simulations, along with experiments on large-scale brain-imaging and genomics datasets.
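As a point of reference, the vanilla CRT underlying this work can be sketched in a few lines. The sketch below is purely illustrative and is not the CRT-logit algorithm: it assumes independent standard Gaussian covariates (so the conditional law of $X_j$ given the others is known exactly) and uses a simple empirical covariance as test statistic.

```python
import numpy as np

rng = np.random.default_rng(0)

def crt_pvalue(X, y, j, n_resamples=500, rng=rng):
    """Vanilla conditional randomization test for feature j.

    Toy setting: covariates are independent standard Gaussians, so
    X_j | X_{-j} ~ N(0, 1) is known exactly.  The test statistic is
    the absolute empirical covariance between X_j and y.
    """
    n = len(y)
    t_obs = abs(X[:, j] @ y) / n
    t_null = np.empty(n_resamples)
    for b in range(n_resamples):
        x_tilde = rng.standard_normal(n)   # draw from X_j | X_{-j}
        t_null[b] = abs(x_tilde @ y) / n
    # finite-sample valid p-value (the +1's make it conservative)
    return (1 + np.sum(t_null >= t_obs)) / (1 + n_resamples)

# toy data: y depends on feature 0 only
n, p = 300, 5
X = rng.standard_normal((n, p))
y = (X[:, 0] + 0.5 * rng.standard_normal(n) > 0).astype(float)

p0 = crt_pvalue(X, y, 0)   # relevant feature: small p-value expected
p1 = crt_pvalue(X, y, 1)   # irrelevant feature: p-value roughly uniform
```

The computational burden motivating CRT-logit is visible here: each feature requires hundreds of refitted statistics, which becomes prohibitive once the statistic itself is a penalized logistic regression.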
7.2 Robust Kernel Density Estimation with Median-of-Means principle
Participants: Pierre Humbert.
In 17, in collaboration with Batiste Le Bars (Inria Magnet) and Ludovic Minvielle (ENS Paris-Saclay), we introduce a robust nonparametric density estimator combining the popular Kernel Density Estimation method and the Median-of-Means principle (MoM-KDE). This estimator is shown to achieve robustness to any kind of anomalous data, even in the case of adversarial contamination. In particular, while previous works only prove consistency results under a known contamination model, this work provides finite-sample high-probability error bounds without a priori knowledge of the outliers. Finally, when compared with other robust kernel estimators, we show that MoM-KDE achieves competitive results while having significantly lower computational complexity.
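The Median-of-Means principle applied to KDE can be sketched as follows: split the sample into disjoint blocks, run an ordinary KDE on each block, and take the pointwise median. The Gaussian kernel, bandwidth, block count, and contamination below are illustrative choices, not those of the paper.

```python
import numpy as np

def mom_kde(x_eval, data, n_blocks, bandwidth=0.3):
    """Median-of-Means kernel density estimate at points x_eval.

    Split the sample into n_blocks disjoint blocks, compute a standard
    Gaussian-kernel KDE on each block, and take the pointwise median.
    As long as fewer than half the blocks are contaminated, outliers
    cannot corrupt the estimate.  n_blocks=1 recovers the plain KDE.
    """
    estimates = []
    for block in np.array_split(data, n_blocks):
        diffs = (x_eval[:, None] - block[None, :]) / bandwidth
        kernel_vals = np.exp(-0.5 * diffs**2) / np.sqrt(2 * np.pi)
        estimates.append(kernel_vals.mean(axis=1) / bandwidth)
    return np.median(np.stack(estimates), axis=0)

rng = np.random.default_rng(1)
clean = rng.standard_normal(500)
outliers = np.full(20, 50.0)          # adversarial point mass far away
data = np.concatenate([clean, outliers])
rng.shuffle(data)

grid = np.linspace(-3, 3, 61)
est = mom_kde(grid, data, n_blocks=50)

# plain KDE shows a spurious bump at the outlier location; MoM-KDE does not
plain_at_50 = mom_kde(np.array([50.0]), data, n_blocks=1)[0]
mom_at_50 = mom_kde(np.array([50.0]), data, n_blocks=50)[0]
```

Note the robustness condition in action: with 20 outliers, at most 20 of the 50 blocks are contaminated, so the median is always computed over a majority of clean blocks.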
7.3 Stochastic bandits
Participants: Antoine Barrier, Zhen Li, Gilles Stoltz, Solenne Gaucher, Christophe Giraud.
In 18, we revisit the problem of contextual bandits with budget constraints (called contextual bandits with knapsacks) and apply it to conversion models. This is joint work with BNP Paribas under a research contract; our running example is the clever use of a discount budget to grant loans, where conversions correspond to whether or not a customer takes the loan offer. We provide a strategy in a direct primal formulation, whereas previous contributions in the literature rather suggested strategies based on dual formulations of the problem, with tuning issues (how to set the dual variables).
In 31, we tackle the problem of best-arm identification (BAI) with a fixed budget $T$, a problem that remains vastly unexplored: the literature is rich as far as BAI with fixed confidence is concerned. We survey existing results, both upper bounds (typically based on successive-rejects-type algorithms) and lower bounds, and extend them to the case of nonparametric models consisting, e.g., of the set of all distributions over a compact interval. We introduce new information-theoretic quantities measuring the hardness of these problems. However, we could not fully close the $\ln K$ gap between upper and lower bounds, though we reduced it.
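For context, the classical successive-rejects algorithm underlying the upper bounds can be sketched as follows, here on Bernoulli arms (the arm means and budget are illustrative, and this is the baseline algorithm of Audibert, Bubeck and Munos, not the new results of the paper).

```python
import numpy as np

def successive_rejects(means, budget, rng):
    """Successive-rejects for best-arm identification with fixed budget.

    `means` are the hidden Bernoulli arm means, used only to sample.
    The budget is split into K-1 phases; at the end of each phase the
    arm with the lowest empirical mean is discarded, and the last
    surviving arm is recommended.
    """
    K = len(means)
    log_bar = 0.5 + sum(1.0 / i for i in range(2, K + 1))
    active = list(range(K))
    pulls = np.zeros(K, dtype=int)
    sums = np.zeros(K)
    n_prev = 0
    for k in range(1, K):
        # number of pulls per active arm by the end of phase k
        n_k = int(np.ceil((budget - K) / (log_bar * (K + 1 - k))))
        for arm in active:
            for _ in range(n_k - n_prev):        # top up each active arm
                sums[arm] += rng.random() < means[arm]
                pulls[arm] += 1
        n_prev = n_k
        worst = min(active, key=lambda a: sums[a] / pulls[a])
        active.remove(worst)                     # reject empirically worst arm
    return active[0]

rng = np.random.default_rng(2)
recommended = successive_rejects([0.5, 0.4, 0.3, 0.6], budget=2000, rng=rng)
```

The phase lengths are exactly where the $\ln K$ factor discussed above enters: the normalization $\overline{\log}(K) = 1/2 + \sum_{i=2}^{K} 1/i$ inflates the per-phase budget logarithmically in the number of arms.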
In 16, we address discriminative bias in linear bandits and quantify the price of unfair evaluations in the worst-case and gap-minimax settings. The results reveal a transition between a regime where the problem is as difficult as its unbiased counterpart, and a regime where it can be much harder. Unlike the previously mentioned contributions, which were model-free, this work postulates an explicit source of discriminating bias.
7.4 Constant regret for sequence prediction with limited advice
Participants: El Mehdi Saad.
In 42, in collaboration with Gilles Blanchard (Datashape), we consider the problem of cumulative regret minimization for individual sequence prediction with respect to the best expert in a finite family, under limited access to information. We provide a strategy that combines two experts and observes at least two experts' predictions in each round. We prove that this strategy achieves a constant bound on the regret (an upper bound independent of the time horizon $T$), both in expectation and with high probability with respect to the player's randomization. Finally, we show that if the player is restricted to playing one expert or observing one expert's prediction per round, her regret is lower bounded by $\sqrt{T}$ for some sequences.
7.5 High Dimension and Estimation Challenges in Model-Based Co-Clustering
Participants: Christine Keribin.
In collaboration with Christophe Biernacki (Inria Modal) and Julien Jacques (Université de Lyon 2), we have written a survey 32 on model-based co-clustering. This problem can be seen as a particularly valuable extension of model-based clustering for three main reasons: (1) it parsimoniously allows a drastic reduction of both the number of rows/individuals and of columns/variables of a data set; (2) the resulting reduced data set remains interpretable, since the meaning of the initial individuals and features is preserved; (3) it benefits from the powerful mathematical statistics theory for both estimation and model selection. Hence, many authors have produced new advances on this topic in recent years, and we offer a general update of the related literature. In addition, it is an opportunity to convey two messages, supported by specific research materials: (1) co-clustering still requires new and motivating research to fix some well-identified estimation issues; (2) co-clustering is probably one of the most promising clustering approaches for the (very) high-dimensional setting, which corresponds to the global trend in modern data sets.
7.6 Optimization
Participants: Karl Hajjar, Evgenii Chzhen.
Symmetries are expected to play an important role in the effectiveness of neural networks. In 37, we described a class of symmetries that are preserved during the learning process.
Leveraging statistical ideas, we developed zero-order optimization algorithms using ${\ell}_{1}$-randomized noisy function evaluations 14. We showed that the dual-averaging algorithm with ${\ell}_{1}$-randomized noisy function evaluations improves on the convergence rates of the previously best-performing constructions.
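The flavour of such an estimator can be conveyed by a short sketch. The following is not the dual-averaging scheme of the paper: it plugs a two-point gradient estimate with ${\ell}_{1}$ randomization, in the spirit of the construction studied in 14, into plain stochastic gradient descent on a toy quadratic; the step sizes and the ${\ell}_{1}$-sphere sampler are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)

def l1_sphere_sample(d, rng):
    """Sample on the l1 unit sphere: exponential magnitudes
    normalized to sum one, with independent random signs."""
    e = rng.exponential(size=d)
    signs = rng.choice([-1.0, 1.0], size=d)
    return signs * e / e.sum()

def l1_two_point_gradient(f, x, h, rng):
    """Two-point zero-order gradient estimate with l1 randomization:
    g = d/(2h) * (f(x + h z) - f(x - h z)) * sign(z).
    For a quadratic f this estimate is unbiased for the gradient."""
    d = len(x)
    z = l1_sphere_sample(d, rng)
    return d / (2 * h) * (f(x + h * z) - f(x - h * z)) * np.sign(z)

# minimize a smooth quadratic using only noisy-free function values
def f(x):
    return 0.5 * np.sum((x - 1.0) ** 2)

x = np.zeros(5)
for t in range(1, 3001):
    g = l1_two_point_gradient(f, x, h=1e-3, rng=rng)
    x -= 0.1 / np.sqrt(t) * g   # simple decreasing steps, not the
                                # dual-averaging updates of the paper
```

After these iterations the iterate should be close to the minimizer at the all-ones vector, illustrating that function evaluations alone suffice for optimization.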
7.7 A statistical point of view on fatigue criteria: from supervised classification to positive-unlabeled learning
Participants: Olivier Coudray, Christine Keribin, Patrick Pamphile.
Celeste currently collaborates with the automobile manufacturer Stellantis. In Olivier Coudray's Ph.D. 20, the challenge for Stellantis was to identify critical zones of a mechanical part. This is a supervised classification problem with a selection bias in the data (during a test, the absence of the start of a crack does not necessarily mean that the zone is safe). We proposed to use a semi-supervised classification method called positive-unlabeled (PU) learning 34. Optimal convergence rates for the classifier (in the minimax sense) were obtained in this work. The PU learning classifier was then implemented on simulated datasets, to compare its performance with other classification methods, and on Stellantis datasets [software pysarpu, section 6.1.2].
7.8 Reliable design under complex loads: from specification to validation
Participants: Emilien Baroux, Patrick Pamphile.
Celeste currently collaborates with the automobile manufacturer Stellantis. In Emilien Baroux's Ph.D., co-advised with Andrei Constantinescu, the challenge for Stellantis is to identify critical areas for vehicle chassis fatigue, and in particular to identify severe fatigue profiles (driver and road type) for vehicle chassis. We started by creating a multiaxial mechanical damage measurement system that allowed us to comprehensively understand cases of locally critical loads (torsion, bending, etc.), independent of the vehicle model. During a driving campaign, damage measurements were taken simultaneously along the longitudinal, vertical, and horizontal axes, as well as on the four wheels, for 50 vehicles. We further proposed factorial analyses and unsupervised classification methods to construct a robust severity distribution for fatigue design tasks 46.
7.9 Daily peak electrical load forecasting with a multiresolution approach
Participants: Yvenn AmaraOuali.
In the context of smart grids and load balancing, daily peak load forecasting has become a critical activity for stakeholders in the energy industry. An understanding of peak magnitude and timing is paramount for the implementation of smart grid strategies such as peak shaving. In 3, in collaboration with Matteo Fasiolo, Yannig Goude and Hui Yan, we proposed a modelling approach which leverages high-resolution and low-resolution information to forecast the size and timing of daily peak demand. The resulting multi-resolution modelling framework can be adapted to different model classes, and is implemented via generalised additive models and neural networks. Experimental results on real data from the UK electricity market confirm that the predictive performance of the proposed modelling approach is competitive with that of low- and high-resolution alternatives.
7.10 A benchmark of electric vehicle load and occupancy models for day-ahead forecasting on open charging session data
Participants: Yvenn AmaraOuali.
The development of electric vehicles (EVs) is a major lever towards low-carbon transportation. It comes with increasing numbers of charging infrastructures, which can be smartly managed to control the CO2 cost of EV electricity consumption, or used as flexible assets for grid management. To achieve this, an efficient day-ahead forecast of charging behaviours is required at different spatial resolutions (e.g., household and public stations). In 15, in collaboration with Yannig Goude, Bachir Hamrouche and Matthew Bishara, we proposed an extensive benchmark of 14 models for both load and occupancy day-ahead forecasts, covering 8 open charging-session datasets of different types (residential, workplace and public stations). Two modelling approaches are compared: direct and bottom-up. The direct approach forecasts the aggregated load (resp. occupancy) of an area/station directly, whereas the bottom-up approach models each individual EV charging session before aggregating them. Both machine learning models and statistical models are considered. Results show that direct approaches reach better performances than bottom-up approaches, and that combining the different approaches via an adaptive aggregation strategy can further improve the performance of direct approaches.
7.11 Exploiting a data-rich environment for railway operations, with SNCF–Transilien
Participants: Rémi Coulaud, Christine Keribin, Gilles Stoltz.
As mentioned in Section 4.4, two sources of data may and should be combined: railway operations (e.g., scheduled and observed arrival and departure times of trains in stations) and passenger flows (e.g., numbers of alighting and boarding passengers in stations). In 8, we model dwell times, the difference between departure and arrival times, based on various machine-learning models (linear regression, random forests and XGBoost, neural networks). Typically, the literature used only one of the two sources of data at a time (simply because only one such source was available for each given problem). By combining the two sources, we are able to identify the most critical source (railway operations) and to quantify the added value of the other source (passenger flows, which help model critical situations, like delayed trains). To exploit this initial complex modeling of dwell time, we need to forecast variables like the numbers of passengers alighting and boarding. We do so in 21 by introducing simple bi-autoregressive models, which we call L-shaped as they exploit the past both in terms of previous trains at a given station and of previous stations of a given train ride.
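The L-shaped idea can be sketched on hypothetical synthetic data (the coefficients, counts, and linear form below are illustrative, not the models of 21): each count is regressed on the same station at the previous train and the previous station of the same train, the two arms of the "L".

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy passenger-count surface: rows = successive trains on a line,
# columns = successive stations (synthetic data for illustration).
T, S = 60, 10
counts = np.zeros((T, S))
counts[0, :] = 100.0   # first train of the day
counts[:, 0] = 100.0   # first station of the line
for t in range(1, T):
    for s in range(1, S):
        counts[t, s] = (0.5 * counts[t - 1, s]      # previous train, same station
                        + 0.4 * counts[t, s - 1]    # same train, previous station
                        + 10 + rng.normal(0, 2))

# L-shaped bi-autoregressive fit by least squares, with an intercept.
prev_train = counts[:-1, 1:].ravel()
prev_station = counts[1:, :-1].ravel()
target = counts[1:, 1:].ravel()
A = np.column_stack([prev_train, prev_station, np.ones_like(target)])
coef, *_ = np.linalg.lstsq(A, target, rcond=None)
pred = A @ coef
```

Since the synthetic surface is generated by exactly this linear recursion, the least-squares fit should recover coefficients close to 0.5 and 0.4, which is how such a sketch can be sanity-checked.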
A second series of works is for now only described in the PhD manuscript of Rémi Coulaud 25 and is currently being finalized (it will thus be described in greater detail in next year's report). It deals with the modeling and forecasting of passengers' movements inside communicating train coaches. The model exhibited relies on an inhomogeneous Markov chain, using coach loads as latent states.
7.12 SEAE: The South-East Asia Encephalitis Project
Participants: Kevin Bleakley.
In 13, K. Bleakley was the statistician, modeler and machine-learning lead for the large-scale 3-year SEAE encephalitis project in South-East Asia, looking for patterns in the (relatively “big”) dataset relating environmental variables to encephalitis causes and outcomes. His work ranged across (multiple) statistical testing, machine learning (trees, random forests, and logistic regression), PCA, data visualisation, survival analysis, and missing data. Several interesting risk factors for severe encephalitis were uncovered, and first steps were made towards looking for new, unknown causes (bacterial, viral, etc.) of encephalitis in South-East Asia. This work was published in the renowned journal The Lancet Global Health; related work is ongoing.
7.13 Study of demographic parity constraint
Participants: Evgenii Chzhen, Solenne Gaucher.
In 35, we obtained several fundamental characterizations of the optimal classification function under the demographic parity constraint. In the awareness framework, akin to the classical unconstrained classification case, we showed that maximizing accuracy under this fairness constraint is equivalent to solving a corresponding regression problem, followed by thresholding at level 1/2. We extended this result to linear-fractional classification measures (e.g., F-score, AM measure, balanced accuracy). These results further deepen our understanding of fairness constraints and of their impact on decision making.
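The regression-then-threshold characterization suggests a simple post-processing recipe in the awareness framework: map each group's scores onto a common (Wasserstein-barycenter) distribution via quantile matching, then threshold at 1/2. The sketch below is our illustration of this idea, not the exact construction of 35 (all function and variable names are ours):

```python
import numpy as np

def dp_fair_predictions(scores, groups, weights=None):
    """Post-process real-valued scores so that the thresholded predictions
    satisfy (approximate) demographic parity across groups.

    Each group's score distribution is pushed onto their common Wasserstein
    barycenter by quantile matching, then thresholded at 1/2 -- a simplified
    sketch of the regression-then-threshold characterization.
    """
    scores, groups = np.asarray(scores, float), np.asarray(groups)
    labels = np.unique(groups)
    if weights is None:  # default: group frequencies
        weights = {g: float(np.mean(groups == g)) for g in labels}
    # empirical quantile functions, one per group
    quantiles = {g: np.sort(scores[groups == g]) for g in labels}
    out = np.empty_like(scores)
    for g in labels:
        mask = groups == g
        # level of each score within its own group (empirical CDF)
        ranks = np.searchsorted(quantiles[g], scores[mask], side="right")
        u = np.clip(ranks / len(quantiles[g]), 0.0, 1.0)
        # barycenter score: weighted average of all group quantiles at level u
        out[mask] = sum(weights[h] * np.quantile(quantiles[h], u)
                        for h in labels)
    return (out >= 0.5).astype(int)
```

Because scores are transformed through within-group ranks only, the positive-prediction rates of the groups are (approximately) equalized, which is exactly the demographic parity requirement.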
8 Bilateral contracts and grants with industry
Participants: Alexandre Janon, Christine Keribin, Patrick Pamphile, Jean-Michel Poggi, Gilles Stoltz, Yvenn Amara-Ouali.
8.1 Bilateral contracts with industry
 C. Keribin and P. Pamphile. OpenLab IA Inria–Groupe Stellantis collaboration contract. 85 kE.
 A. Constantinescu and P. Pamphile. Collaboration contract with Stellantis. 95 kE.
 C. Keribin and G. Stoltz. Ongoing contract with SNCF (45 kE), on the modeling and forecasting of dwell time.
 J.M. Poggi. Analysis and modeling of the biases of numerical NO2 models, towards the fusion of data from heterogeneous measurement networks, ATMO NORMANDIE, 20 kE.
 Y. Amara-Ouali, P. Massart, J.M. Poggi. Spatio-temporal modeling of electric vehicle charging load, EDF, 20 kE.
 G. Stoltz. Ongoing contract with BNP Paribas (2 x 10 kE), on stochastic bandits under budget constraints, with applications to loan management.
 A. Janon. Contract with INSERM Toulouse (3.3 kE), on variable selection for the identification of links between microbiota dysbiosis and type-2 diabetes.
9 Partnerships and cooperations
9.1 National initiatives
Participants: Sylvain Arlot, Kevin Bleakley, Christophe Giraud, Matthieu Lerasle.
9.1.1 ANR
Sylvain Arlot and Matthieu Lerasle are part of the ANR grant FASTBIG (Efficient Statistical Testing for high-dimensional Models: application to Brain Imaging and Genetics), which is led by Bertrand Thirion (Inria Saclay, Parietal).
Sylvain Arlot and Christophe Giraud are part of the ANR Chair IA grant Biscotte, which is led by Gilles Blanchard (Université Paris-Saclay).
Christophe Giraud is part of the ANR grant ASCAI (Active and batch segmentation, clustering, and seriation: toward unified foundations in AI), together with Potsdam University, Munich University and INRAE Montpellier.
9.1.2 Other
K. Bleakley works at one-third time (disponibilité) with IRT SystemX, under the umbrella of Confiance.AI, on anomaly detection in high-dimensional time-series data for French industry.
10 Dissemination
Participants: The full Celeste team.
10.1 Promoting scientific activities
10.1.1 Scientific events: organisation
General chair, scientific chair
 C. Keribin is president of the Specialized Group MAchine Learning et Intelligence Artificielle (MALIA) of the French Statistical Society (SFdS)
 J.M. Poggi is President of ENBIS (European Network for Business and Industrial Statistics)
Member of the organizing committees
 S. Arlot is a member of the scientific committee of the Séminaire Palaisien
 E. Chzhen is a co-organizer of the DATAIA institutional seminar
 C. Giraud is a co-organizer, with Estelle Kuhn, of StatMathAppli 2022, Fréjus
 C. Giraud organized a “statistical learning” session for the joint AMS/EMS/SMF conference, Grenoble 2022
 A. Janon is a co-organizer of the UQSay seminar
 C. Keribin is a co-organizer of the FrENBIS satellite event “Statistical learning for temporal data, new horizons and industrial applications” during JDS2022 (Lyon, 17/06/2022).
 J.M. Poggi is the organizer of the ECAS-ENBIS course, Text Mining: from basics to deep learning tools, Trondheim, June 26, 2022
10.1.2 Scientific events: selection
Member of the conference program committees
 A. Janon is a member of the program committee of the MascotNum 2023 conference in Le Croisic.
 J.M. Poggi is a member of the program committee of the ENBIS-22 conference, Trondheim, June 26–30, 2022
Reviewer
 We performed many reviews for various international conferences.
10.1.3 Journal
Member of the editorial boards
 S. Arlot: Associate Editor for Annales de l'Institut Henri Poincaré B – Probability and Statistics
 C. Giraud: Action Editor for JMLR
 C. Giraud: Associate Editor for ESAIMproc
 P. Massart: Associate editor for Panoramas et Synthèses (SMF), Foundations and Trends in Machine Learning, and Confluentes Mathematici
 J.M. Poggi: Associate Editor for Advances in Data Analysis and Classification
 J.M. Poggi: Associate Editor for the Journal of Data Science, Statistics and Visualization (JDSSV)
 J.M. Poggi: Associate Editor for the Journal of Statistical Software
 J.M. Poggi: co-editor of the book "Interpretability for Industry 4.0: Statistical and Machine Learning Approaches", Springer, November 2022
 G. Stoltz: Associate Editor for Mathematics of Operations Research
Reviewer  reviewing activities
 We performed many reviews for various international journals.
10.1.4 Invited talks
S. Arlot, Journées Statistiques du Sud, Avignon, 02/06/2022
E. Chzhen, Rethinking High-dimensional Mathematical Statistics, Oberwolfach, 20/05/2022
E. Chzhen, New trends in statistical learning II, Porquerolles, 14/06/2022
E. Chzhen, Computational Statistics and Machine Learning, Genoa, 12/07/2022
E. Chzhen, Workshop on Ethical AI, Paris, 29/09/2022
E. Chzhen, MADSTAT seminar, Toulouse, 13/10/2022
C. Giraud, ASCAI, Montpellier, 29/02/2022
C. Giraud, Séminaire INRIA Paris Centre, 14/04/2022
C. Giraud, ETH-FDS seminar series, Zurich, 08/06/2022
C. Giraud, IMS, London, 28/06/2022
C. Giraud, course at the summer school "Geometry and Statistics", Cargèse, 05–09/09/2022
C. Giraud, Van Dantzig seminar, Amsterdam, 16/12/2022
C. Keribin, JSTAR 2022, Rennes
C. Keribin, CMStatistics 2022, London
J.M. Poggi, Mathmet 2022, Paris, France, November 2022
J.M. Poggi, Compstat 2022, Bologna, August 2022
J.M. Poggi, ISBIS 2022, Naples, Italy, June 2022
J.M. Poggi, CISEM 2022, Mahdia, Tunisia, May 2022
J.M. Poggi, Seminar of the Department of Mathematics, Université du Luxembourg, October 2022
10.1.5 Leadership within the scientific community
G. Stoltz: main contact point [2018–2022] for the author, G. Favre, of a sociological report on collaborations of mathematicians with companies, commissioned by AMIES (agence maths-entreprises, an Inria–CNRS–Université Grenoble Alpes entity).
10.1.6 Research administration
 S. Arlot is a member of the council of the Computer Science Graduate School (GS ISN) of University Paris-Saclay.
 S. Arlot is a member of the council of the Computer Science Doctoral School (ED STIC) of University Paris-Saclay.
 E. Chzhen is a member of the scientific committee of the Bibliothèque Jacques Hadamard
 C. Giraud is a member of the Scientific Committee of labex IRMIA+, Strasbourg
 C. Giraud is in charge of the whole Masters program in mathematics of University Paris-Saclay
 C. Giraud is a member of the board of the Mathematics Graduate School of University Paris-Saclay
 C. Giraud is a senior member of CCUPS (Commission Consultative de l'Université Paris-Saclay)
 C. Keribin is in charge of the M1 Applied Mathematics and M2 Data Science programs in the masters of the mathematics school
 C. Keribin is a member of the board of the Computer Science Doctoral School (ED MSTIC) of Paris-Est Sup.
 C. Keribin is vice-president of the mathematics section of CCUPS (Commission Consultative de l'Université Paris-Saclay).
 C. Keribin is a member of the council of the mathematics department.
 C. Keribin is co-president of the MixtAI scholarship allocation committee of the SaclAI school.
 P. Massart is the director of the Fondation Mathématique Jacques Hadamard.
 M.A. Poursat is in charge of the M1 Mathematics and Artificial Intelligence program in the masters of the mathematics school
10.1.7 Service to the academic community
 K. Bleakley: translation into English of the LMO webpages dedicated to research activities
 C. Giraud: coordinator of the computing resources at the Institut de Mathématique d'Orsay (10 engineers)
 C. Keribin: selection committee for an assistant professor position, UTC, May 2022
 C. Keribin: selection committee for an assistant professor position, ENSMM Besançon, May 2022
 C. Keribin: member of the follow-up committee of PhD student Tom Guedon (INRAE)
 C. Keribin: member of the follow-up committee of PhD student Anderson Augusma (Laboratoire d'informatique de Grenoble)
 G. Stoltz: co-head of the committee working out a new version of the LMO website, 2019–2022, with the year 2022 dedicated to the Intranet
 G. Stoltz: selection committee for a professor position, Université Rennes 2, May 2022
 G. Stoltz: selection committee for a professor position, Université Paris 1, May 2022
10.2 Teaching  Supervision  Juries
10.2.1 Teaching
Most team members (especially Professors, Associate Professors and PhD students) teach several courses at University Paris-Saclay, as part of their teaching duties. We mention below some of the classes in which we teach.
 Masters: S. Arlot, Statistical learning and resampling, 30h, M2, Université Paris-Saclay
 Masters: S. Arlot, Preparation for the French mathematics agrégation (statistics), 25h, M2, Université Paris-Saclay
 Masters: E. Chzhen, Fairness and Privacy in Machine Learning, 18h, M2, ENSAE
 Masters: E. Chzhen, Statistical Theory of Algorithmic Fairness, 20h, M2, Université Paris-Saclay
 Masters: C. Giraud, High-Dimensional Probability and Statistics, 45h, M2, Université Paris-Saclay
 Masters: C. Giraud, Mathematics for AI, 75h, M1, Université Paris-Saclay
 Masters: C. Keribin, Unsupervised and supervised learning, 42h, M1, Université Paris-Saclay
 Masters: C. Keribin, Intensive course in statistics, 21h, M2, Université Paris-Saclay
 Masters: C. Keribin, Statistical modeling, 2 x 20h, M1, Université Paris-Saclay
 Masters: C. Keribin, Internship supervision for the M1 Applied Mathematics and M2 Data Science programs, Université Paris-Saclay
 Masters: M.A. Poursat, Applied statistics, 21h, M1 Artificial Intelligence, Université Paris-Saclay
 Masters: M.A. Poursat, Statistical learning, 42h, M2 Bioinformatics, Université Paris-Saclay
 Masters: M.A. Poursat, Classification methods, 24h, M1, Université Paris-Saclay
 Licence: M.A. Poursat, Statistical inference, 72h, L3, Université Paris-Saclay
 Masters: G. Stoltz, Sequential Learning and Optimization, 18h, M2, Université Paris-Saclay
10.2.2 Supervision
 PhD defended in June 2022: Solenne Gaucher, Sequential learning in random networks, started Sept. 2018, advised by C. Giraud.
 PhD defended in Sep. 2022: Yvenn Amara-Ouali, Spatio-temporal modeling of electric vehicle load, started Oct. 2019, co-advised by P. Massart, J.M. Poggi (Université de Paris) and Y. Goude (EDF R&D)
 PhD defended in Sep. 2022: Filippo Antonazzo, Unsupervised learning of huge datasets with limited computer resources, started Nov. 2019, co-advised by C. Biernacki (Inria, Modal) and C. Keribin, DGA grant
 PhD defended in Nov. 2022: Rémi Coulaud, Forecast of dwell time during train parking at stations, started Oct. 2019, co-advised by G. Stoltz and C. Keribin, CIFRE with SNCF
 PhD defended in Dec. 2022: Etienne Lasalle, Statistical foundations of topological data analysis for graph-structured data, started Sept. 2018, co-advised by F. Chazal (Inria, Datashape) and P. Massart
 PhD defended in Dec. 2022: Olivier Coudray, Fatigue data-based design, started Nov. 2019, co-advised by C. Keribin and P. Pamphile, CIFRE with Groupe PSA
 PhD defended in Dec. 2022: Louis Pujol, CYTOPART: flow cytometry data clustering, started Nov. 2019, co-advised by P. Massart and M. Glisse (Inria, Datashape)
 PhD defended in Dec. 2022: El Mehdi Saad, Interactions between statistical and computational aspects in machine learning, started Sept. 2019, co-advised by S. Arlot and G. Blanchard (Inria, Datashape)
 PhD defended in Dec. 2022: Perrine Lacroix, High-dimensional linear regression applied to gene-interaction network inference, started Sept. 2019, co-advised by P. Massart and M.L. Martin-Magniette (INRAE)
 PhD in progress: Emilien Baroux, Reliability dimensioning under complex loads: from specification to validation, started July 2020, co-advised by A. Constantinescu (LMS) and P. Pamphile, CIFRE with Groupe PSA
 PhD in progress: Antoine Barrier, Best-arm identification, started Sept. 2020, co-advised by G. Stoltz and A. Garivier (ENS Lyon)
 PhD in progress: Karl Hajjar, Dynamical analysis of neural networks, started Oct. 2020, co-advised by C. Giraud and L. Chizat (EPFL).
 PhD in progress: Samy Clementz, Data-driven early stopping rules for saving computation resources in AI, started Sept. 2021, co-advised by S. Arlot and A. Celisse
 PhD in progress: Gayane Taturyan, Fairness and robustness in machine learning, started Nov. 2021, co-advised by E. Chzhen, J.M. Loubes (Univ. Toulouse Paul Sabatier) and M. Hebiri (Univ. Gustave Eiffel)
 PhD in progress: Leonardo Martins-Bianco, Disentangling the relationships between different community detection algorithms, started Oct. 2022, co-advised by C. Keribin and Z. Naulet (Univ. Paris-Saclay)
 PhD in progress: Chiara Mignacco, Aggregation (orchestration) of reinforcement learning policies, started Oct. 2022, co-advised by G. Stoltz and M. Jonckheere (LAAS Toulouse)
10.2.3 Juries
We participated in many PhD committees (too many to keep an exact record), at University Paris-Saclay as well as at other universities, and we refereed several of these PhDs.
10.3 Popularization
10.3.1 Articles and contents
K. Bleakley gave an interview for Inria's “News and Events” outreach pages on his work on encephalitis in South-East Asia, in collaboration with the Pasteur Institute. The resulting news article was published on the Inria website.
10.3.2 Education
Christophe Giraud produces educational videos on his YouTube channel "Highdimensional probability and statistics".
10.3.3 Interventions
Christine Keribin was an invited speaker at the Ateliers de la Statistique of the SFdS, giving an introductory lecture on machine learning (2022).
11 Scientific production
11.1 Major publications
 1 article. A minimax framework for quantifying risk-fairness trade-off in regression. Annals of Statistics, 50(4), August 2022, 2416–2442
 2 article. Childhood encephalitis in the Greater Mekong region (the South-East Asia Encephalitis Project): a multicentre prospective study. The Lancet Global Health, 10(7), July 2022, e989–e1002
11.2 Publications of the year
International journals
 3 article. Daily peak electrical load forecasting with a multiresolution approach. International Journal of Forecasting, July 2022
 4 article. Online hierarchical forecasting for power consumption data. International Journal of Forecasting, 38(1), January 2022, 339–351
 5 article. Hierarchical clustering of spectral images with spatial constraints for the rapid processing of large and heterogeneous data sets. SN Computer Science, 3, article 194, March 2022
 6 article. On the robustness of the minimum l2 interpolator. Bernoulli, 2022
 7 article. A minimax framework for quantifying risk-fairness trade-off in regression. Annals of Statistics, 50(4), August 2022, 2416–2442
 8 article. Modeling dwell time in a data-rich railway environment: with operations and passenger flows data. Transportation Research Part C: Emerging Technologies, 146, 2023, 103980
 9 article. KL-UCB-switch: optimal regret bounds for stochastic bandits from both a distribution-dependent and a distribution-free viewpoints. Journal of Machine Learning Research, 2022
 10 article. Aggregated hold-out for sparse linear regression with a robust loss function. Electronic Journal of Statistics, 16(1), 2022, 935–997
 11 article. Concentration study of M-estimators using the influence function. Electronic Journal of Statistics, 16(1), January 2022, 3695–3750
 12 article. Adaptive Greedy Algorithm for Moderately Large Dimensions in Kernel Conditional Density Estimation. Journal of Machine Learning Research, 23(254), 2022
 13 article. Childhood encephalitis in the Greater Mekong region (the South-East Asia Encephalitis Project): a multicentre prospective study. The Lancet Global Health, 10(7), July 2022, e989–e1002
International peerreviewed conferences
 14 inproceedings. A gradient estimator via L1-randomization for online zero-order optimization with two-point feedback. NeurIPS 2022 (Thirty-sixth Conference on Neural Information Processing Systems), New Orleans, United States, November 2022
 15 inproceedings. A benchmark of electric vehicle load and occupancy models for day-ahead forecasting on open charging session data. e-Energy '22 (The Thirteenth ACM International Conference on Future Energy Systems), virtual event, France, ACM, June 2022, 193–207
 16 inproceedings. The price of unfairness in linear bandits with biased feedback. NeurIPS 2022, New Orleans, United States, November 2022
 17 inproceedings. Robust Kernel Density Estimation with Median-of-Means principle. Proceedings of the 39th International Conference on Machine Learning (ICML), Proceedings of Machine Learning Research 162, Baltimore, United States, July 2022
 18 inproceedings. Contextual Bandits with Knapsacks for a Conversion Model. Advances in Neural Information Processing Systems 35 (NeurIPS 2022), New Orleans, United States, 2022
 19 inproceedings. A Conditional Randomization Test for Sparse Logistic Regression in High-Dimension. Advances in Neural Information Processing Systems 35 (NeurIPS 2022), New Orleans, United States, November 2022
National peerreviewed Conferences
 20 inproceedings. Convergence rates for Positive-Unlabeled learning under Selected At Random assumption: sensitivity analysis with respect to propensity. CAp & RFIAP 2022 (Conférence sur l'Apprentissage automatique), Vannes, France, July 2022
Conferences without proceedings
 21 inproceedings. One-Station-Ahead Forecasting of Dwell Time, Arrival Delay and Passenger Flows on Trains Equipped with Automatic Passenger Counting (APC) Device. WCRR 2022 (World Congress on Railway Research), Birmingham, United Kingdom, June 2022
Doctoral dissertations and habilitation theses
 22 thesis. Statistical modelling of electric vehicle charging behaviours. Université Paris-Saclay, September 2022
 23 thesis. Unsupervised learning of huge data sets with limited computer resources. Université de Lille, September 2022
 24 thesis. A statistical point of view on fatigue criteria: from supervised classification to positive-unlabeled learning. Université Paris-Saclay, December 2022
 25 thesis. Modeling and forecasting of railway operations variables and passenger flows for dense traffic areas. Université Paris-Saclay, November 2022
 26 thesis. Contributions to stochastic bandits and link prediction problems. Université Paris-Saclay, June 2022
 27 thesis. Contributions to variable selection in high dimension and its uses in biology. Université Paris-Saclay, December 2022
 28 thesis. Contributions to statistical analysis of graph-structured data. Université Paris-Saclay, December 2022
 29 thesis. Contributions to Frugal Learning. Université Paris-Saclay, December 2022
Reports & preprints
 30 misc. Hierarchical transfer learning with applications for electricity load forecasting. November 2022
 31 misc. On Best-Arm Identification with a Fixed Budget in Non-Parametric Multi-Armed Bandits. September 2022
 32 misc. A Survey on Model-Based Co-Clustering: High Dimension and Estimation Challenges. September 2022
 33 misc. Multivariate evidence-based pediatric dengue severity prediction at hospital arrival. December 2022
 34 misc. Risk bounds for PU learning under Selected At Random assumption. January 2022
 35 misc. Fair learning with Wasserstein barycenters for non-decomposable performance measures. September 2022
 36 misc. Adaptation to the Range in K-Armed Bandits. June 2022
 37 misc. On the symmetries in the dynamics of wide two-layer neural networks. October 2022
 38 misc. Local asymptotics of cross-validation around the optimal model. November 2022
 39 misc. Gradient Descent can Learn Less Over-parameterized Two-layer Neural Networks on Classification Problems. November 2022
 40 misc. ISDE: Independence Structure Density Estimation. May 2022
 41 misc. Nonparametric estimation of a multivariate density under Kullback-Leibler loss with ISDE. May 2022
 42 misc. Constant regret for sequence prediction with limited advice. October 2022
 43 misc. Model-based Clustering with Missing Not At Random Data. 2022
 44 misc. Optimal Estimation of Schatten Norms of a rectangular Matrix. December 2022
Other scientific publications
 45 inproceedings. Convergence rates for PU learning under the SAR assumption: influence of propensity. CAp & RFIAP 2022 (Conférence sur l'Apprentissage automatique), Vannes, France, July 2022
11.3 Cited publications
 46 article. Analysis Of Real-Life Multi-Input Loading Histories For The Reliable Design Of Vehicle Chassis. Procedia Structural Integrity, 38, 2022, 497–506