Data science—a vast field that includes statistics, machine learning, signal processing, data visualization, and databases—has become front-page news due to its ever-increasing impact on society, over and above the important role it already played in science over the last few decades. Within data science, the statistical community has long-term experience in how to infer knowledge from data, based on solid mathematical foundations. The recent field of machine learning has also made important progress by combining statistics and optimization, with a fresh point of view that originates in applications where prediction is more important than building models.

The Celeste project-team is positioned at the interface between statistics and machine learning. We are statisticians in a mathematics department, with strong mathematical backgrounds, interested in interactions between theory, algorithms, and applications. Indeed, applications are the source of many of our interesting theoretical problems, while the theory we develop plays a key role in (i) understanding how and why successful statistical learning algorithms work—hence improving them—and (ii) building new algorithms upon mathematical statistics-based foundations. Therefore, we tackle several major challenges of machine learning with our mathematical statistics point of view (in particular the algorithmic fairness issue), always having in mind that modern datasets are often high-dimensional and/or large-scale, which must be taken into account at the building stage of statistical learning algorithms. For instance, there are often trade-offs between statistical accuracy and complexity which we want to clarify as much as possible.

In addition, most theoretical guarantees that we prove are non-asymptotic, which is important because the number of features in modern datasets can be comparable to, or even much larger than, the sample size.

Finally, a key ingredient in our research program is connecting our theoretical and methodological results with (a great number of) real-world applications. This is the reason why a large part of our work is devoted to industrial and medical data modeling on a set of real-world problems coming from our long-term collaborations with several partners, as well as various opportunistic one-shot collaborations.

We split our research program into four research axes, distinguishing problems and methods that are traditionally considered part of mathematical statistics (e.g., model selection and hypothesis testing, see section 3.2) from those usually tackled by the machine learning community (e.g., multi-armed bandits, deep learning, clustering and pairwise-data inference, see section 3.3). Section 3.4 is devoted to industrial and medical data modeling questions which arise from several long-term collaborations and more recent research contracts. Finally, section 3.5 is devoted to algorithmic fairness, a theme of Celeste which we want to specifically emphasize. Despite presenting mathematical statistics, machine learning, and data modeling as separate axes, we would like to make clear that these axes are strongly interdependent in our research and that this dependence is a key factor in our success.

One of our main goals is to address major challenges in machine learning in which mathematical statistics naturally plays a key role, in particular in the following two areas of research.

Any machine learning procedure requires a choice of values for its hyper-parameters, and one must also choose among the numerous procedures available for any given learning problem; both situations correspond to an estimator selection problem. High-dimensional variable (feature) selection is another key estimator selection problem. Celeste addresses all such estimator selection problems, where the goal is to select an estimator (or a set of features) minimizing the prediction/estimation risk, and the corresponding non-asymptotic theoretical guarantee—which we want to prove in various settings—is an oracle inequality.
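As a deliberately simple illustration of what estimator selection means in practice (a toy example, not one of Celeste's own procedures), the sketch below selects the number of neighbors k of a nearest-neighbor regressor by minimizing a hold-out estimate of the prediction risk; the data-generating function, sample sizes, and split are all hypothetical.

```python
import random

random.seed(0)

def knn_predict(train, x, k):
    # Average the responses of the k training points nearest to x.
    neighbors = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in neighbors) / k

def holdout_risk(train, valid, k):
    # Empirical squared prediction risk on the validation split.
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in valid) / len(valid)

# Noisy observations of f(x) = x^2 on [0, 1] (invented data).
data = [(x, x * x + random.gauss(0, 0.05))
        for x in (random.random() for _ in range(200))]
train, valid = data[:150], data[150:]

# Estimator selection: pick the k minimizing the estimated risk.
best_k = min(range(1, 31), key=lambda k: holdout_risk(train, valid, k))
print(best_k)
```

An oracle inequality would then compare the risk of the selected estimator to that of the best estimator in the family.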

Science currently faces a reproducibility crisis, making it necessary to provide statistical inference tools (hypothesis tests, confidence regions) for assessing the significance of the output of any learning algorithm in a computationally efficient way. Our goal here is to develop methods for which we can prove upper bounds on the type I error rate, while maximizing the detection power under this constraint. We are particularly interested in the variable selection case, which here leads to a multiple testing problem for which key metrics are the family-wise error rate (FWER) and the false discovery rate (FDR).
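For concreteness, here is a sketch of two textbook corrections behind these metrics: Bonferroni (which controls the FWER) and Benjamini-Hochberg (which controls the FDR under independence). The p-values are invented for illustration, and the procedures shown are the classical ones, not the team's refinements.

```python
def bonferroni(pvals, alpha=0.05):
    # Reject H_i iff p_i <= alpha / m: guarantees FWER <= alpha.
    m = len(pvals)
    return [p <= alpha / m for p in pvals]

def benjamini_hochberg(pvals, alpha=0.05):
    # Reject the k smallest p-values, where k is the largest rank with
    # p_(k) <= alpha * k / m: guarantees FDR <= alpha for independent tests.
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= alpha * rank / m:
            k = rank
    rejected = [False] * m
    for i in order[:k]:
        rejected[i] = True
    return rejected

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.3, 0.74]
print(sum(bonferroni(pvals)))          # → 1 (FWER control is conservative)
print(sum(benjamini_hochberg(pvals)))  # → 2 (FDR control rejects more)
```

The trade-off visible here (power versus strength of the error guarantee) is exactly what the team's work seeks to optimize.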

Our distinguishing approach (compared to peer groups around the world) is to offer a statistical and mathematical point of view on machine-learning (ML) problems and algorithms. Our main focus is to provide theoretical guarantees for certain ML problems, with special attention paid to the statistical point of view, in particular minimax optimality and statistical adaptivity. In the areas of deep learning and big data, computationally-efficient optimization algorithms are essential. The choice of the optimization algorithm has been shown to have a dramatic impact on generalization properties of predictors. Such empirical observations have led us to investigate the interplay between computational efficiency and statistical properties. The set of problems we tackle includes online learning (stochastic bandits and expert aggregation), clustering and co-clustering, pairwise-data inference, semi-supervised learning, and the interplay between optimization and statistical properties.

Celeste collaborates with industry and with medicine/public health institutes to develop methods and apply results of a broadly statistical nature—whether they be prediction, aggregation, anomaly detection, forecasting, and so on—in relationship with pressing industrial and/or societal needs (see sections 4 and 5.2). Most of these methods and applied results are directly related to the more theoretical subjects examined in the first two research axes, including for instance estimator selection, aggregation, and supervised and unsupervised classification. Furthermore, Celeste is well positioned for problems with data requiring unconventional methods—for instance, non-asymptotic analysis and data with selection bias—and in particular problems that can give rise to technology transfers in the context of Cifre Ph.D.s.

Machine-learning algorithms make pivotal decisions which influence our lives on a daily basis, using data about individuals. Recent studies show that imprudent use of these algorithms may lead to unfair and discriminatory decisions, often inheriting or even amplifying disparities present in data. The goal of Celeste on this topic is to design and analyze novel tractable algorithms that, while still optimizing prediction performance, mitigate or remove unfair decisions of the learned predictor. A major challenge in the machine-learning fairness literature is to obtain algorithms which satisfy fairness and risk guarantees simultaneously. Several empirical studies suggest that there is a trade-off between the fairness and accuracy of a learned model: more accurate models are less fair. We are focused on providing user-friendly statistical quantification of such trade-offs and building statistically-optimal algorithms in this context, with special attention paid to the online learning setting. Relying on the strong mathematical and statistical competency of the team, we approach the problem from an angle that differs from the mainstream computer science literature.

Celeste collaborates with researchers at Institut Pasteur on encephalitis in South-East Asia, especially with Jean-David Pommier.

Celeste has a long-term collaboration with EDF R&D on electricity consumption.
An important problem is to forecast consumption. We currently work on an approach involving back-and-forth disaggregation (of the total consumption into the consumption of well-chosen groups/regions) and aggregation of local estimates.

Collected product lifetime data is often non-homogeneous, affected by production variability and differing real-world usage. Usually, this variability is not controlled or observed in any way, but needs to be taken into account for reliability analysis. Latent structure models are flexible models commonly used to model unobservable causes of variability.

Celeste currently collaborates with Stellantis. To dimension its vehicles, Stellantis uses a reliability design method called Strength-Stress, which takes into consideration both the statistical distribution of part strength and the statistical distribution of customer load (called Stress). In order to minimize the risk of in-service failure, the probability that a “severe” customer will encounter a weak part must be quantified. Severity quantification is not simple since vehicle use and driver behaviour can be “severe” for some types of materials and not for others. The aim of the study is thus to define a new and richer notion of “severity” from Stellantis's databases, resulting either from tests or client usages. This will lead to more robust and accurate parts dimensioning methods. Two CIFRE theses (one recently defended, the other in progress) tackle such subjects:

Olivier Coudray, “A statistical point of view on fatigue criteria: from supervised classification to positive-unlabeled learning” 24. Here, we are seeking to build probabilistic fatigue criteria to identify the critical zones of a mechanical part.

Emilien Baroux, “Reliability dimensioning under complex loads: from specification to validation”. Here, we seek to identify and model the critical loads that a vehicle can undergo according to its usage profile (driver, roads, climate, etc.).
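The Strength-Stress principle described above can be illustrated by a small Monte Carlo computation of the failure probability, i.e., the probability that a customer's load ("stress") exceeds a part's strength. The two Gaussian distributions below are purely hypothetical stand-ins, not Stellantis data.

```python
import random

random.seed(0)

N = 100_000
# Hypothetical distributions: strength ~ N(100, 10), stress ~ N(60, 12),
# in arbitrary load units.
failures = sum(
    random.gauss(60.0, 12.0) > random.gauss(100.0, 10.0) for _ in range(N))
p_failure = failures / N
print(p_failure)  # analytically, P(N(60,12) > N(100,10)) ≈ 0.005
```

Quantifying this probability for a richer, data-driven notion of "severity" is precisely what makes the real problem hard: severity is not a single scalar distribution but depends on material type and usage profile.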

We have an on-going collaboration with SNCF–Transilien to exploit rich datasets of railway operations and passenger flows, obtained by automatic recording devices (for passenger flows, these correspond to infra-sensors at the door level). We tackle two problems. First, we model and estimate the dwell time of trains, as well as their delay to scheduled arrival; both are important factors to control to guarantee punctuality. Second, we model and forecast passenger movements inside train coaches, so as to be able to provide incoming passengers with information on crowding of coaches. Both series of problems come with new results described in section 7.11. They correspond to the final year of the CIFRE PhD of Rémi Coulaud between our Celeste team and SNCF–Transilien 25.

In a year still influenced by the aftermath of the Covid-19 pandemic, the job-related carbon emissions of Celeste team members were very low and came essentially from:

In terms of magnitude, the largest per capita ongoing emissions (excluding flying) are likely to be those from buying computers, whose construction carries a carbon footprint in the range of 100 kg CO2-e each. In contrast, typical email use comes to around 10 kg CO2-e per person per year, a Zoom call to around 10 g CO2-e per hour per person, and web browsing to around 100 g CO2-e per hour. Consequently, 2022 was a very low-carbon year for the Celeste team. To put this in the context of work travel by flying, one return Paris-Nice flight corresponds to 160 kg CO2-e of emissions, which likely dwarfs the total work-related emissions of any one Celeste team member in 2022.
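As a back-of-the-envelope check of these orders of magnitude, the figures above can be combined as follows; the yearly hours of Zoom calls and web browsing are invented for illustration.

```python
# All values in kg CO2-e per person per year.
email = 10.0               # typical yearly email use
zoom = 0.010 * 100         # 10 g CO2-e/hour x 100 hours of calls (assumed)
browsing = 0.100 * 500     # 100 g CO2-e/hour x 500 hours of browsing (assumed)
digital_total = email + zoom + browsing
flight = 160.0             # one return Paris-Nice flight
print(digital_total)       # ≈ 61 kg CO2-e
print(flight / digital_total)  # a single flight outweighs a year of digital use
```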

The approximate (rounded for simplicity) kg CO2-e values cited above come from the book “How Bad Are Bananas?” (Mike Berners-Lee, 2020), which estimates the carbon emissions of everyday life.

In addition to the long-term impact of our theoretical work—which is of course impossible to assess immediately—we are involved in several applied research projects which aim at having a short/mid-term positive impact on society.

First, we collaborate with the Pasteur Institute on neglected tropical diseases, encephalitis in particular, with implications for global health strategies.

Second, the broad use of artificial intelligence/machine learning/statistics nowadays comes with several major ethical issues, one being to avoid making unfair or discriminatory decisions. Our theoretical work on algorithmic fairness has already led to several “fair” algorithms that could be widely used in the short term (one of them is already used for enforcing fair decision-making in student admissions at the University of Genoa).

Third, we expect short-term positive impact on society from several direct collaborations with companies such as EDF (forecasting and control of electricity load consumption), SNCF (punctuality of trains and better passenger information on crowding inside train coaches) and Stellantis (automobile reliability, with two Cifre contracts).

Identifying the relevant variables for a classification model with correct confidence levels is a central but difficult task in high dimension. Despite the core role of sparse logistic regression in statistics and machine learning, it still lacks a good solution for accurate inference in the regime where the number of features is comparable to, or larger than, the number of samples.

In 17, in collaboration with Batiste Le Bars (Inria Magnet) and Ludovic Minvielle (ENS Paris-Saclay), we introduce a robust nonparametric density estimator combining the popular Kernel Density Estimation method and the Median-of-Means principle (MoM-KDE). This estimator is shown to be robust to any kind of anomalous data, even in the case of adversarial contamination. In particular, while previous works only prove consistency results under a known contamination model, this work provides finite-sample high-probability error bounds without a priori knowledge of the outliers. Finally, when compared with other robust kernel estimators, we show that MoM-KDE achieves competitive results while having significantly lower computational complexity.
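A minimal sketch of the Median-of-Means principle applied to KDE, in the spirit of (but much simpler than) the estimator of 17: split the sample into blocks, compute an ordinary Gaussian-kernel density estimate on each block, and take the median of the block estimates. The bandwidth, number of blocks, and contamination below are hypothetical choices.

```python
import math
import random
import statistics

def kde(points, x, h):
    # Standard Gaussian kernel density estimator at point x.
    return sum(math.exp(-0.5 * ((x - p) / h) ** 2) for p in points) / (
        len(points) * h * math.sqrt(2 * math.pi))

def mom_kde(sample, x, h, n_blocks):
    # Median over disjoint blocks: unaffected as long as fewer than half
    # of the blocks contain contaminated points.
    size = len(sample) // n_blocks
    blocks = [sample[i * size:(i + 1) * size] for i in range(n_blocks)]
    return statistics.median(kde(b, x, h) for b in blocks)

random.seed(0)
inliers = [random.gauss(0.0, 1.0) for _ in range(990)]
sample = inliers + [0.0] * 10   # adversarial mass placed at the query point
random.shuffle(sample)

plain = kde(sample, 0.0, 0.1)            # inflated by the adversarial mass
robust = mom_kde(sample, 0.0, 0.1, 25)   # median filters spiked blocks out
print(plain, robust)  # true N(0,1) density at 0 is about 0.399
```

Note that the median is cheap to compute, which reflects the low computational overhead of MoM-KDE mentioned above.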

In 18, we revisit the problem of contextual bandits with budget constraints (called contextual bandits with knapsacks), and apply it to conversion models. This is joint work with BNP Paribas under a research contract, and our running example is the clever use of a discount budget to grant loans; conversions correspond to whether or not a customer takes the loan offer. We provide a strategy in a direct primal formulation, whereas previous contributions in the literature suggested strategies based on dual formulations of the problem, which come with tuning issues (how to set the dual variables).

In 31, we tackle the problem of best-arm identification (BAI) with a fixed budget.

In 16, we addressed discriminatory bias in linear bandits and quantified the price of unfair evaluations in the worst case and in the gap-minimax setting. The results revealed a transition between a regime where the problem is as difficult as its unbiased counterpart, and a regime where it can be much harder. Unlike the previously mentioned contributions, which were model-free, this work postulated an explicit source of discriminatory bias.

In 42, in collaboration with Gilles Blanchard (Datashape), we consider the problem of cumulative regret minimization for individual sequence prediction with respect to the best expert in a finite family, under limited access to information. We provide a strategy that combines two experts and observes at least two experts' predictions in each round. We prove that this strategy achieves a constant bound on the regret (an upper bound independent of the time horizon).

In collaboration with Christophe Biernacki (Inria Modal) and Julien Jacques (Université de Lyon 2), we have written a survey 32 on model-based co-clustering. This problem can be seen as a particularly valuable extension of model-based clustering for three main reasons: (1) it allows a drastic yet parsimonious reduction of both the number of rows/individuals and columns/variables of a data set; (2) the resulting reduced data set remains interpretable, since the meaning of the original individuals and features is preserved; (3) it benefits from the powerful theory of mathematical statistics for both estimation and model selection. Many authors have produced new advances on this topic in recent years, and we offer a general update of the related literature. The survey is also the opportunity to convey two messages, supported by specific research material: (1) co-clustering still requires new and motivating research to fix some well-identified estimation issues; (2) co-clustering is probably one of the most promising clustering approaches in the (very) high-dimensional setting, which corresponds to the global trend in modern data sets.

Symmetries are expected to play an important role in the effectiveness of Neural Networks. We have described in 37 a class of symmetries that are preserved during the learning process.

Leveraging statistical ideas, we have developed zero-order optimization algorithms using only (possibly noisy) evaluations of the objective function.

Celeste currently collaborates with the automobile manufacturer Stellantis. In Olivier Coudray's Ph.D. 20, the challenge for Stellantis was to identify critical zones of a mechanical part. This is a classification problem with a selection bias in the data (during a test, the absence of the start of a crack does not necessarily mean that the zone is safe). We proposed to use a semi-supervised classification method called positive-unlabelled (PU) learning 34. The minimax optimality of the classifier's rate of convergence was established in this work. The PU learning classifier was then implemented on simulated datasets, to compare its performance to that of other classification methods, and on Stellantis datasets [software pysarpu, section 6.1.2].

Celeste currently collaborates with the automobile manufacturer Stellantis. In Emilien Baroux's Ph.D., co-advised with Andréï Constantinescu, the challenge for Stellantis is to identify critical areas for vehicle chassis fatigue, and in particular to identify severe fatigue profiles (driver and road type) for vehicle chassis. We started by creating a multi-axial mechanical damage measurement system that allowed us to comprehensively understand cases of locally critical loads (torsion, bending, etc.), independent of the vehicle model. During a driving campaign, damage measurements were taken simultaneously in the longitudinal, vertical, and horizontal axes, as well as on the four wheels, for 50 vehicles. We further proposed factorial analyses and unsupervised classification methods to construct a robust severity distribution for fatigue design tasks 46.

In the context of smart grids and load balancing, daily peak load forecasting has become a critical activity for stakeholders in the energy industry. An understanding of peak magnitude and timing is paramount for the implementation of smart grid strategies such as peak shaving. In 3, in collaboration with Matteo Fasiolo, Yannig Goude and Hui Yan, we proposed a modelling approach which leverages high-resolution and low-resolution information to forecast daily peak demand size and timing. The resulting multi-resolution modelling framework can be adapted to different model classes and is implemented here via generalised additive models and neural networks. Experimental results on real data from the UK electricity market confirm that the predictive performance of the proposed modelling approach is competitive with that of low- and high-resolution alternatives.

The development of electric vehicles (EV) is a major lever towards low-carbon transportation. It comes with increasing numbers of charging infrastructures, which can be smartly managed to control the CO2 cost of EV electricity consumption or used as flexible assets for grid management. To achieve that, an efficient day-ahead forecast of charging behaviours is required at different spatial resolutions (e.g., household and public stations). In 15, in collaboration with Yannig Goude, Bachir Hamrouche and Matthew Bishara, we proposed an extensive benchmark of 14 models for both load and occupancy day-ahead forecasts, covering 8 open charging-session datasets of different types (residential, workplace and public stations). Two modelling approaches are compared: direct and bottom-up. The direct approach forecasts the aggregated load (resp. occupancy) of an area/station directly, whereas the bottom-up approach models each individual EV charging session before aggregating them. Both machine-learning and statistical models are considered. Results show that direct approaches reach better performance than bottom-up approaches, and that their performance can be further improved by using an adaptive aggregation strategy.

As mentioned in Section 4.4, two sources of data may and should be mixed: railway operations (e.g., scheduled and observed arrival and departure times of trains in stations) and passenger flows (e.g., numbers of alighting and boarding passengers in stations). We model dwell times (the difference between departure and arrival times) in 8, based on various machine-learning models (linear regression, random forests and XGBoost, neural networks). The literature typically used only one of the two sources of data at a time (simply because only one such source was available for each given problem). By combining the two sources, we are able to identify the most critical source (railway operations) and quantify the added value of the other source (passenger flows, which help model critical situations, like delayed trains). To be able to exploit this initial complex modeling of dwell time, we need to forecast variables like numbers of passengers alighting and boarding. We do so in 21 by introducing simple bi-autoregressive models, which we call L-shaped as they exploit the past both in terms of previous trains at a given station and of previous stations of a given train ride.
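The "L-shaped" dependence structure can be sketched as follows; the coefficients and the toy load matrix are invented for illustration, whereas the actual models of 21 are estimated from SNCF-Transilien data.

```python
# Sketch of an L-shaped bi-autoregressive forecast: the boarding count of
# train t at station s is predicted from (i) the previous train at the same
# station and (ii) the same train at the previous station. The coefficients
# a, b, c below are hypothetical, not estimated values.

def l_shaped_forecast(load, t, s, a=0.5, b=0.45, c=5.0):
    # load[t][s]: passengers boarding train t at station s.
    return a * load[t - 1][s] + b * load[t][s - 1] + c

# Toy grid of observed boardings (rows: successive trains, columns: stations).
load = [
    [120, 80, 60, 40],
    [130, 90, 70, 50],
    [125, 85, 65, 45],
]
print(l_shaped_forecast(load, t=2, s=1))  # combines load[1][1] and load[2][0]
```

The "L" refers to the two arms of the conditioning set: one step back in the train index and one step back in the station index.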

A second series of works is for now only described in the PhD manuscript of Rémi Coulaud 25 and is currently being finalized (and will thus be described in greater detail in next year's report). It deals with the modeling and forecasting of passengers' movements inside communicating train coaches. The proposed model relies on an inhomogeneous Markov chain, using coach loads as latent states.

In 13, K. Bleakley was the statistician/modeler/machine-learning lead for the large-scale 3-year SEAE encephalitis project in South-East Asia, looking for patterns in the (relatively “big”) dataset relating environmental variables to encephalitis causes and outcomes. His work ranged across (multiple) statistical testing, machine learning (trees, random forests, and logistic regression), PCA, data visualisation, survival analysis, and missing data. Several interesting risk factors for severe encephalitis were uncovered, and steps were taken to start looking for new, unknown causes (bacterial, viral, etc.) of encephalitis in South-East Asia. This work was published in the renowned journal The Lancet Infectious Diseases (IF: 71); related work is ongoing.

In 35, we have shown several fundamental characterizations of the optimal classification function under the demographic parity constraint. In the awareness framework, akin to the classical unconstrained classification case, we have shown that maximizing accuracy under this fairness constraint is equivalent to solving a corresponding regression problem followed by thresholding at level 1/2. We have extended this result to linear-fractional classification measures (e.g., F-score, AM measure, balanced accuracy, etc.). These results further deepen our understanding of fairness constraints and their impact on decision making.
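To make the demographic parity constraint concrete, here is a toy illustration in the awareness framework (a simple score-shifting heuristic, not the exact regression-based construction of 35): per-group score shifts are searched so that both groups receive positive predictions at the same rate, after thresholding the shifted scores at 1/2. All scores and groups are synthetic.

```python
def positive_rate(scores, thr=0.5):
    # Fraction of individuals predicted positive after thresholding.
    return sum(s > thr for s in scores) / len(scores)

def dp_shift(scores_a, scores_b):
    # Grid-search a shift t (subtracted from group A's scores, added to
    # group B's) that equalizes the two positive-prediction rates.
    candidates = [k / 200 for k in range(-100, 101)]
    return min(candidates,
               key=lambda t: abs(positive_rate([s - t for s in scores_a])
                                 - positive_rate([s + t for s in scores_b])))

scores_a = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]    # group A tends to score higher
scores_b = [0.6, 0.5, 0.45, 0.35, 0.2, 0.1]  # group B tends to score lower
t = dp_shift(scores_a, scores_b)
rate_a = positive_rate([s - t for s in scores_a])
rate_b = positive_rate([s + t for s in scores_b])
print(rate_a, rate_b)  # equal rates: demographic parity holds on this sample
```

The fairness/accuracy trade-off discussed above appears here directly: the shift that equalizes rates moves some individuals across the decision boundary, typically at a cost in accuracy.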

Sylvain Arlot and Matthieu Lerasle are part of the ANR grant FAST-BIG (Efficient Statistical Testing for high-dimensional Models: application to Brain Imaging and Genetics), which is led by Bertrand Thirion (Inria Saclay, Parietal).

Sylvain Arlot and Christophe Giraud are part of the ANR Chair-IA grant Biscotte, which is led by Gilles Blanchard (Université Paris Saclay).

Christophe Giraud is part of the ANR project ASCAI (Active and batch segmentation, clustering, and seriation: toward unified foundations in AI), together with Potsdam University, Munich University, and Montpellier INRAE.

K. Bleakley works at 1/3-time (disponibilité) with IRT SystemX under the umbrella of Confiance.AI on the subject of anomaly detection in high-dimensional time series data for French industry.

S. Arlot, Journées Statistiques du Sud, Avignon, 02/06/2022

E. Chzhen, Re-thinking High-dimensional Mathematical Statistics, Oberwolfach, 20/05/2022

E. Chzhen, New trends in statistical learning II, Porquerolles, 14/06/2022

E. Chzhen, Computational Statistics and Machine Learning, Genoa, 12/07/2022

E. Chzhen, Workshop on Ethical AI, Paris, 29/09/2022

E. Chzhen, MADSTAT seminar, Toulouse, 13/10/2022

C. Giraud, ASCAI, Montpellier, 29/02/2022

C. Giraud, Séminaire INRIA Paris Centre, 14/04/2022

C. Giraud, ETH-FDS seminar series, Zurich, 08/06/2022

C. Giraud, IMS, London, 28/06/2022

C. Giraud, course in the summer school "Geometry and Statistics", Cargèse, 05-09/09/2022

C. Giraud, Van Dantzig seminar, Amsterdam, 16/12/2022

C. Keribin, JSTAR 2022 (Rennes)

C. Keribin, CMStatistics 2022, London

J.-M. Poggi, Mathmet 2022, Paris, France, November 2022

J.-M. Poggi, Compstat 2022, Bologna, August 2022

J.-M. Poggi, ISBIS 2022, June 2022, Naples, Italy

J.-M. Poggi, CISEM 2022, Mahdia, Tunisia, May 2022

J.-M. Poggi, Seminar Department of Mathematics, Université du Luxembourg, October 2022

G. Stoltz: main contact point [2018–2022] for the author, G. Favre, of a sociological report on collaborations between mathematicians and companies, commissioned by AMIES (agence maths-entreprises, an Inria–CNRS–Université Grenoble Alpes entity).

Most of the team members (especially Professors, Associate Professors and PhD students) teach several courses at University Paris-Saclay, as part of their teaching duties. We mention below some of the classes in which we teach.

We participated in many PhD committees (too many to keep an exact record), at University Paris-Saclay as well as at other universities, and we refereed several of these PhDs.

K. Bleakley gave an interview for Inria's “News and Events” outreach on his work on encephalitis in South-East Asia in collaboration with the Pasteur Institute; the resulting news article was published on Inria's website.

Christophe Giraud produces educational videos on his YouTube channel "High-dimensional probability and statistics".

Christine Keribin was an invited speaker at the Ateliers de la Statistique de la SFdS, giving an introductory lecture on machine learning (2022).