Data science—a vast field that includes statistics, machine learning, signal processing, data visualization, and databases—has become front-page news due to its ever-increasing impact on society, over and above the important role it already played in science over the last few decades. Within data science, the statistical community has long-term experience in how to infer knowledge from data, based on solid mathematical foundations. The recent field of machine learning has also made important progress by combining statistics and optimization, with a fresh point of view that originates in applications where prediction is more important than building models.

The Celeste project-team is positioned at the interface between statistics and machine learning. We are statisticians in a mathematics department, with strong mathematical backgrounds, interested in interactions between theory, algorithms, and applications. Indeed, applications are the source of many of our interesting theoretical problems, while the theory we develop plays a key role in (i) understanding how and why successful statistical learning algorithms work—hence improving them—and (ii) building new algorithms upon mathematical statistics-based foundations. Therefore, we tackle several major challenges of machine learning with our mathematical statistics point of view (in particular the algorithmic fairness issue), always having in mind that modern datasets are often high-dimensional and/or large-scale, which must be taken into account at the building stage of statistical learning algorithms. For instance, there often are trade-offs between statistical accuracy and complexity which we want to clarify as much as possible.

In addition, most theoretical guarantees that we prove are non-asymptotic, which is important because in modern datasets the number of features is often of the same order as, or even larger than, the number of observations, so that asymptotic approximations can be misleading.

Finally, a key ingredient in our research program is connecting our theoretical and methodological results with (a great number of) real-world applications. This is the reason why a large part of our work is devoted to industrial and medical data modeling on a set of real-world problems coming from our long-term collaborations with several partners, as well as various opportunistic one-shot collaborations.

We split our research program into four research axes, distinguishing problems and methods that are traditionally considered part of mathematical statistics (e.g., model selection and hypothesis testing, see section 3.2) from those usually tackled by the machine learning community (e.g., multi-armed bandits, deep learning, clustering and pairwise-data inference, see section 3.3). Section 3.4 is devoted to industrial and medical data modeling questions which arise from several long-term collaborations and more recent research contracts. Finally, section 3.5 is devoted to algorithmic fairness, a theme of Celeste which we want to specifically emphasize. Despite presenting mathematical statistics, machine learning, and data modeling as separate axes, we would like to make clear that these axes are strongly interdependent in our research and that this dependence is a key factor in our success.

One of our main goals is to address major challenges in machine learning in which mathematical statistics naturally plays a key role, in particular in the following two areas of research.

Any machine learning procedure requires a choice for the values of hyper-parameters, and one must also choose among the numerous procedures available for any given learning problem; both situations correspond to an estimator selection problem. High-dimensional variable (feature) selection is another key estimator selection problem. Celeste addresses all such estimator selection problems, where the goal is to select an estimator (or a set of features) minimizing the prediction/estimation risk, and the corresponding non-asymptotic theoretical guarantee—which we want to prove in various settings—is an oracle inequality.
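As a concrete illustration of estimator selection, the following sketch selects the regularization hyper-parameter of a ridge estimator by V-fold cross-validation, one classical procedure of this kind; the data and grid are synthetic, and this is a minimal illustration rather than one of our actual implementations.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimator: (X'X + lam*I)^{-1} X'y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def cv_risk(X, y, lam, n_folds=5):
    """V-fold cross-validation estimate of the prediction risk of ridge(lam)."""
    folds = np.array_split(np.arange(len(y)), n_folds)
    errors = []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
        beta = ridge_fit(X[train_idx], y[train_idx], lam)
        errors.append(np.mean((y[test_idx] - X[test_idx] @ beta) ** 2))
    return float(np.mean(errors))

# synthetic regression data (hypothetical example)
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
beta_true = np.zeros(10)
beta_true[:3] = 1.0
y = X @ beta_true + 0.5 * rng.standard_normal(200)

# estimator selection: pick the hyper-parameter minimizing the estimated risk
grid = [0.01, 0.1, 1.0, 10.0, 100.0]
best_lam = min(grid, key=lambda lam: cv_risk(X, y, lam))
```

An oracle inequality for such a procedure would guarantee that the risk of the selected estimator is close to the smallest risk over the whole grid.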

Science currently faces a reproducibility crisis, making it necessary to provide statistical inference tools (hypothesis tests, confidence regions) for assessing the significance of the output of any learning algorithm in a computationally efficient way. Our goal here is to develop methods for which we can prove upper bounds on the type I error rate, while maximizing the detection power under this constraint. We are particularly interested in the variable selection case, which here leads to a multiple testing problem for which key metrics are the family-wise error rate (FWER) and the false discovery rate (FDR).
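For illustration, the Benjamini-Hochberg step-up procedure, a standard baseline that controls the FDR at a prescribed level for independent p-values, can be sketched as follows; the p-values below are made up.

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Indices of hypotheses rejected by the BH step-up procedure,
    which controls the FDR at level alpha for independent p-values."""
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    below = p[order] <= alpha * np.arange(1, m + 1) / m
    if not below.any():
        return np.array([], dtype=int)
    k = int(np.max(np.nonzero(below)[0]))  # largest i with p_(i) <= alpha*i/m
    return np.sort(order[:k + 1])

# made-up p-values for five hypotheses
rejected = benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.6], alpha=0.05)
```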

Our distinguishing approach (compared to peer groups around the world) is to offer a statistical and mathematical point of view on machine-learning (ML) problems and algorithms. Our main focus is to provide theoretical guarantees for certain ML problems, with special attention paid to the statistical point of view, in particular minimax optimality and statistical adaptivity. In the areas of deep learning and big data, computationally efficient optimization algorithms are essential. The choice of the optimization algorithm has been shown to have a dramatic impact on the generalization properties of predictors. Such empirical observations have led us to investigate the interplay between computational efficiency and statistical properties. The set of problems we tackle includes online learning (stochastic bandits and expert aggregation), clustering and co-clustering, pairwise-data inference, semi-supervised learning, and the interplay between optimization and statistical properties.

Celeste collaborates with industry and with medicine/public health institutes to develop methods and apply results of a broadly statistical nature (prediction, aggregation, anomaly detection, forecasting, and so on) in relation to pressing industrial and/or societal needs (see sections 4 and 5.2). Most of these methods and applied results are directly related to the more theoretical subjects examined in the first two research axes, including for instance estimator selection, aggregation, and supervised and unsupervised classification. Furthermore, Celeste is well positioned for problems with data requiring unconventional methods (for instance, non-asymptotic analysis and data with selection bias), and in particular for problems that can give rise to technology transfers in the context of Cifre Ph.D.s.

Machine-learning algorithms make pivotal decisions which influence our lives on a daily basis, using data about individuals. Recent studies show that imprudent use of these algorithms may lead to unfair and discriminatory decisions, often inheriting or even amplifying disparities present in data. The goal of Celeste on this topic is to design and analyze novel tractable algorithms that, while still optimizing prediction performance, mitigate or remove unfair decisions of the learned predictor. A major challenge in the machine-learning fairness literature is to obtain algorithms which satisfy fairness and risk guarantees simultaneously. Several empirical studies suggest that there is a trade-off between the fairness and accuracy of a learned model: more accurate models are less fair. We are focused on providing user-friendly statistical quantification of such trade-offs and building statistically optimal algorithms in this context, with special attention paid to the online learning setting. Relying on the strong mathematical and statistical competency of the team, we approach the problem from an angle that differs from the mainstream computer science literature.

Celeste collaborates with researchers at Institut Pasteur on encephalitis in South-East Asia, especially with Jean-David Pommier.

Celeste has a long-term collaboration with EDF R&D on electricity consumption. An important problem is to forecast consumption. We currently work on an approach involving back and forth disaggregation (of the total consumption into the consumption of well-chosen groups/regions) and aggregation of local estimates.
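As a generic illustration of the aggregation step (not EDF's actual pipeline; all data below are synthetic), exponentially weighted aggregation combines the forecasts of several local or expert models, reweighting each expert according to its past squared-loss performance.

```python
import numpy as np

def ewa_forecast(expert_preds, outcomes, eta=1.0):
    """Exponentially weighted aggregation of expert forecasts under squared loss.
    expert_preds: (T, K) array of K expert forecasts over T rounds."""
    T, K = expert_preds.shape
    weights = np.ones(K) / K
    preds = np.empty(T)
    for t in range(T):
        preds[t] = weights @ expert_preds[t]           # aggregated forecast
        losses = (expert_preds[t] - outcomes[t]) ** 2  # experts' losses
        weights *= np.exp(-eta * losses)               # downweight bad experts
        weights /= weights.sum()
    return preds

# synthetic example: expert 0 always right, expert 1 always wrong
outcomes = np.ones(50)
expert_preds = np.column_stack([np.ones(50), np.zeros(50)])
preds = ewa_forecast(expert_preds, outcomes)
```

The aggregated forecast starts at the uniform mixture and quickly concentrates on the better expert.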

Data collected on the lifetime of complex systems is often non-homogeneous, affected by variability in component production and differences in real-world system use. In general, this variability is neither controlled nor observed in any way, but must be taken into account in reliability analysis. We use latent structure models to identify the main causes of failure, and to predict system reliability as accurately as possible.

Celeste has an established collaboration with the manufacturer Stellantis, which has led to the completion of two Cifre theses: the first was defended in 2022, and Emilien Baroux's followed in 2023. In the case of personal vehicles, various loads need to be taken into account, due to different road types (e.g., freeway, city) and driving styles (aggressive, sporty, etc.).
In Emilien Baroux's thesis we used a multidimensional characterization of damage caused by multiple-input external wheel loads during vehicle use, including load combinations between the left and right wheels of the front and rear axles.
Field measurements were used to calculate pseudo-damage for each load and road type, creating multivariate data with a hierarchical structure. Unsupervised analyses were used to explore correlations between pseudo-damage values and to identify driving profiles, providing a multidimensional assessment of severity while avoiding overfitting. A multidimensional Gaussian mixture model was then fitted to damage-equivalent stresses. The resulting probabilistic model can then be used to extrapolate damage calculations and simulate driving styles, providing design teams with a stress-analysis tool for accurate and realistic fatigue design of chassis components in future vehicles.
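The following minimal sketch illustrates the kind of Gaussian-mixture modeling and simulation involved, on made-up two-dimensional pseudo-damage data (the actual study used multidimensional field data with a hierarchical structure).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# two made-up "driving styles" in a 2-D pseudo-damage space
calm = rng.normal(loc=[1.0, 1.0], scale=0.2, size=(200, 2))
aggressive = rng.normal(loc=[3.0, 3.5], scale=0.4, size=(200, 2))
damage = np.vstack([calm, aggressive])

# fit a Gaussian mixture and simulate new pseudo-damage values from it
gmm = GaussianMixture(n_components=2, random_state=0).fit(damage)
simulated, components = gmm.sample(1000)
```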

Celeste collaborates with Metafora to explore the use of multiple instance learning in flow cytometry as a means of early detection of specific cancers. This is in collaboration with Pascal Massart, in the context of Pierre-André Mikem's Cifre PhD, which follows Louis Pujol's thesis defended in 2022.

Ensuring student success is a central goal of universities, and a high success rate is seen as an indicator of the academic excellence of the institution. The transition from high school to university is seen as a critical time for first-time students, who must adapt not only to a different academic environment but also to greater autonomy. Universities face the challenge of facilitating this transition for a student population that is heterogeneous in terms of academic preparation and cultural and socio-economic background.

We currently collaborate with the EST laboratory (Univ. Paris-Saclay) on this topic. Our work aims at identifying the various factors that hinder success, identifying success profiles, and proposing welcome and support measures that effectively foster each student's success.

Still influenced by the aftermath of the Covid-19 pandemic, the carbon emissions of Celeste team members related to their jobs were very low and came essentially from:

In terms of magnitude, the largest per capita ongoing emissions (excluding flying) are likely those from buying computers, whose construction has a carbon footprint in the range of 100 kg CO2-e each. In contrast, typical email use comes to around 10 kg CO2-e per person per year, a Zoom call to around 10 g CO2-e per person per hour, and web browsing to around 100 g CO2-e per hour. Consequently, 2023 was a very low-carbon year for the Celeste team. To put this in the context of work travel by plane, one return Paris-Nice flight corresponds to 160 kg CO2-e, which likely exceeds the total work-related emissions of any one Celeste team member in 2023.

The approximate (rounded for simplicity) kg CO2-e values cited above come from the book How Bad Are Bananas? by Mike Berners-Lee (2020), which estimates the carbon emissions of everyday life.

In addition to the long-term impact of our theoretical work—which is of course impossible to assess immediately—we are involved in several applied research projects which aim at having a short/mid-term positive impact on society.

First, we collaborate with the Pasteur Institute on neglected tropical diseases, encephalitis in particular, with implications for global health strategies.

Second, we collaborate with the EST laboratory (Univ. Paris-Saclay) on questions related to student success in universities and how to maximize it (see Section 4.5).

Third, the broad use of artificial intelligence/machine learning/statistics nowadays comes with several major ethical issues, one being to avoid making unfair or discriminatory decisions. Our theoretical work on algorithmic fairness has already led to several “fair” algorithms that could be widely used in the short term (one of them is already used for enforcing fair decision-making in student admissions at the University of Genoa).

Fourth, we expect short-term positive impact on society from several direct collaborations with companies such as EDF (forecasting and control of electricity load consumption), Stellantis (automobile reliability, with two Cifre contracts) and Metafora (early detection of cancers).

In collaboration with Batiste Le Bars and Aurélien Bellet (Inria Lille, Magnet project-team), we introduce in 19 a conformal prediction method to construct prediction sets in a one-shot federated learning setting. More specifically, we define a quantile-of-quantiles estimator and prove that, for any distribution, it is possible to output prediction sets with the desired coverage in only one round of communication. To mitigate privacy issues, we also describe a locally differentially private version of our estimator. Finally, over a wide range of experiments, we show that our method returns prediction sets with coverage and length very similar to those obtained in a centralized setting. Overall, these results demonstrate that our method is particularly well suited to conformal prediction in a one-shot federated learning setting.
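To fix ideas, here is a sketch of the classical split conformal construction on which such methods build (not the quantile-of-quantiles federated estimator of 19 itself): the calibration residuals yield a radius q such that intervals of the form y_hat ± q achieve the desired marginal coverage.

```python
import numpy as np

def split_conformal_radius(residuals, alpha=0.1):
    """Split conformal prediction: from absolute residuals on a calibration
    set, return the radius q such that [y_hat - q, y_hat + q] has marginal
    coverage at least 1 - alpha for a fresh exchangeable point."""
    n = len(residuals)
    k = int(np.ceil((n + 1) * (1 - alpha)))  # conformal quantile index
    return float(np.sort(residuals)[k - 1])

rng = np.random.default_rng(0)
calibration = np.abs(rng.standard_normal(999))  # synthetic residuals
q = split_conformal_radius(calibration, alpha=0.1)

# empirical coverage on fresh residuals from the same distribution
fresh = np.abs(rng.standard_normal(100_000))
coverage = float(np.mean(fresh <= q))
```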

In 28, we consider the problem of minimizing a convex function over a closed convex set with Projected Gradient Descent (PGD). We propose a fully parameter-free version of AdaGrad, which is adaptive to the distance between the initialization and the optimum, and to the sum of the squared norms of the subgradients. Our algorithm handles projection steps and involves no restarts, no reweighting along the trajectory, and no additional gradient evaluations compared to classical PGD. It also achieves optimal rates of convergence for cumulative regret, up to logarithmic factors. We provide an extension of our approach to stochastic optimization.
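A minimal sketch of projected gradient descent with AdaGrad-norm step sizes, illustrating the general family of algorithms discussed (the fully parameter-free variant of 28 is more involved; here the scale D must be supplied by the user).

```python
import numpy as np

def adagrad_pgd(grad, project, x0, T=500, D=1.0):
    """Projected gradient descent with AdaGrad-norm step sizes
    eta_t = D / sqrt(sum of squared gradient norms); returns the average
    iterate. D is a user-supplied scale (the parameter-free variant removes it)."""
    x = np.asarray(x0, dtype=float)
    avg = np.zeros_like(x)
    sq_sum = 1e-12
    for _ in range(T):
        g = grad(x)
        sq_sum += float(g @ g)
        x = project(x - (D / np.sqrt(sq_sum)) * g)
        avg += x
    return avg / T

# toy problem: minimize ||x - c||^2 over the unit Euclidean ball
c = np.array([2.0, 0.0])
grad = lambda x: 2.0 * (x - c)
project = lambda x: x / max(1.0, float(np.linalg.norm(x)))
x_star = adagrad_pgd(grad, project, np.zeros(2))  # optimum is (1, 0)
```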

In collaboration with Arya Akhavan (CMAP, IP Paris), Massimiliano Pontil (Italian Institute of Technology) and Alexandre Tsybakov (CREST, IP Paris), we study in 24 minimization problems with zero-order noisy oracle information under the assumption that the objective function is highly smooth and possibly satisfies additional properties. We consider two kinds of zero-order projected gradient descent algorithms, which differ in the form of the gradient estimator. The first algorithm uses a gradient estimator based on randomization over the

In collaboration with Sholom Schechtman (IP Paris), we consider in 29 the problem of unconstrained minimization of finite sums of functions. We propose a simple yet practical way to incorporate variance-reduction techniques into SignSGD, guaranteeing convergence similar to that of full sign gradient descent. The core idea is first instantiated on the problem of minimizing sums of convex and Lipschitz functions, and is then extended to the smooth case via variance reduction. Our analysis is elementary and much simpler than typical proofs for variance-reduction methods.
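A sketch of the base SignSGD method on a finite sum, without the variance-reduction ingredient of 29; the toy objective is a sum of quadratics, with anchors chosen so that the mean and the coordinate-wise median coincide at (1, 1).

```python
import numpy as np

def sign_sgd(grads, x0, T=2000, step=0.01, seed=0):
    """SignSGD on a finite sum: at each step, sample a component i and move
    along -sign(grad_i(x)) with a fixed step size (no variance reduction)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(T):
        i = rng.integers(len(grads))
        x = x - step * np.sign(grads[i](x))
    return x

# toy finite sum: f(x) = sum_i 0.5*||x - a_i||^2, so grad_i(x) = x - a_i
anchors = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([2.0, 2.0])]
grads = [lambda x, a=a: x - a for a in anchors]
x_final = sign_sgd(grads, np.zeros(2))
```

With a fixed step size the iterates oscillate around the balance point; variance reduction removes this floor and recovers the behavior of full sign gradient descent.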

In collaboration with Zhen Li (BNP Paribas), we consider in 16 contextual bandit problems with knapsacks (CBwK), a setting in which, at each round, a scalar reward is obtained and vector-valued costs are suffered. The learner aims to maximize the cumulative rewards while ensuring that the cumulative costs remain below some predetermined cost constraints. We assume that contexts come from a continuous set, that costs can be signed, and that the expected reward and cost functions, while unknown, may be uniformly estimated, a typical assumption in the literature. In this setting, total cost constraints so far had to be at least of order

In collaboration with Matthieu Jonckheere (LAAS Toulouse), we consider in 31 a setting of structured reinforcement learning and work on the concept of orchestration, where a (small) set of expert policies guides decision-making. The modeling thereof, with expert policies considered as super-actions, constitutes our first contribution. We then establish value-function regret bounds for orchestration in the tabular setting by transferring regret-bound results from adversarial settings. We generalize and extend the analysis of natural policy gradient in Agarwal et al. [2021, Section 5.3] to arbitrary adversarial aggregation strategies. We also extend it to the case of estimated advantage functions, providing insights into sample complexity both in expectation and with high probability. A key point of our approach lies in its arguably more transparent proofs compared to existing methods.

In collaboration with Alex Barbier-Chebbah, Christian Vestergaard and Jean-Baptiste Masson (Institut Pasteur & Inria Paris, EPIMETHEE project-team), we propose in 25 a new bandit algorithm based on information-maximization principles. More precisely, we introduce a class of bandit algorithms that maximize an approximation to the information of a key variable within the system. To this end, we develop an approximate analytical physics-based representation of an entropy to forecast the information gain of each action, and greedily choose the action with the largest information gain.

Multiplayer bandits have recently been extensively studied because of their application to cognitive radio networks. While the literature mostly considers synchronous players, radio networks (e.g. for IoT) tend to have asynchronous devices. This motivates the harder, asynchronous multiplayer bandits problem, which was first tackled with an explore-then-commit (ETC) algorithm (see Dakdouk, 2022), with a regret upper-bound in

In collaboration with Hugo Richard (Criteo and Inria Saclay, Mind project-team) and Vianney Perchet (Criteo and Ensae),
in 34 we answer this question positively, as a natural extension of UCB exhibits a
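For reference, the classical single-player UCB1 index policy that such extensions build upon can be sketched as follows (synchronous setting, stochastic Bernoulli rewards; the arm means below are made up).

```python
import math
import random

def ucb1(arm_means, T=5000, seed=0):
    """UCB1 for stochastic bandits with Bernoulli rewards: pull the arm
    maximizing (empirical mean + sqrt(2 log t / n_pulls))."""
    rng = random.Random(seed)
    K = len(arm_means)
    counts, sums = [0] * K, [0.0] * K
    for t in range(1, T + 1):
        if t <= K:
            arm = t - 1  # initialization: pull each arm once
        else:
            arm = max(range(K), key=lambda k: sums[k] / counts[k]
                      + math.sqrt(2.0 * math.log(t) / counts[k]))
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts

# made-up arm means; the last arm is best and should dominate the pulls
counts = ucb1([0.3, 0.5, 0.7])
```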

Model-based co-clustering can be seen as a particularly important extension of model-based clustering. It allows for a significant reduction of both the number of rows (individuals) and columns (variables) of a data set in a parsimonious manner, and also allows interpretability of the resulting reduced data set since the meaning of the initial individuals and features is preserved. Moreover, it benefits from the rich statistical theory for both estimation and model selection. Many works have produced new advances on this topic in recent years.

In collaboration with Christophe Biernacki (Inria Lille, Modal project-team) and Julien Jacques (Univ. Lyon), we offer in 6 a general update of the related literature. In addition, we advocate two main messages, supported by specific research material: (1) co-clustering requires further research to fix some well-identified estimation issues, and (2) co-clustering is one of the most promising approaches for clustering in the (very) high-dimensional setting, which corresponds to the global trend in modern data sets.
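As a quick illustration of co-clustering (here via scikit-learn's spectral co-clustering on a synthetic matrix with planted blocks, rather than the model-based latent block approach discussed above):

```python
import numpy as np
from sklearn.cluster import SpectralCoclustering
from sklearn.datasets import make_biclusters

# synthetic data matrix with 3 planted row/column blocks
data, true_rows, true_cols = make_biclusters(
    shape=(120, 80), n_clusters=3, noise=5, random_state=0)

# jointly cluster the rows (individuals) and columns (variables)
model = SpectralCoclustering(n_clusters=3, random_state=0).fit(data)
row_labels, col_labels = model.row_labels_, model.column_labels_
```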

Vehicle reliability is a major issue for automotive manufacturers. In particular, mechanical fatigue is an important concern for design offices. In order to accelerate the development of new mechanical parts, car manufacturers want to rely more on numerical simulation and drastically reduce the number of validation tests on prototypes. To do this, they need efficient fatigue criteria, able to correctly identify critical zones on a numerical model. However, the fatigue criteria currently used to post-process numerical results fail to correlate well with fatigue test-rig results.

In 30, in collaboration with Philippe Bristiel and Miguel Dinis (Stellantis), we first propose a probabilistic Dang Van criterion that accounts for the dispersion of fatigue results in a multiaxial setting. We then introduce a fatigue database built upon numerical results and fatigue test reports on automotive chassis components. A novel approach, based on Positive-Unlabeled learning (PU learning), is developed to leverage this source of data and improve the predictivity of the fatigue criterion. The methodology is applied to the fatigue database to illustrate the benefits of the approach.
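A minimal sketch of the PU-learning idea on synthetic one-dimensional data, using the classical Elkan-Noto correction (train a labeled-vs-unlabeled classifier, then rescale its probabilities); this is a generic baseline, not the methodology of 30.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# synthetic 1-D data: positives around +1, negatives around -1
X_pos = rng.normal(1.0, 0.7, size=(300, 1))
X_neg = rng.normal(-1.0, 0.7, size=(300, 1))

# PU setting: only some positives are labeled, the rest are unlabeled
labeled = X_pos[:100]
unlabeled = np.vstack([X_pos[100:], X_neg])

# step 1: "non-traditional" classifier, labeled vs unlabeled
X = np.vstack([labeled, unlabeled])
s = np.concatenate([np.ones(len(labeled)), np.zeros(len(unlabeled))])
clf = LogisticRegression().fit(X, s)

# step 2 (Elkan-Noto): estimate c = P(labeled | positive) on known positives
c = float(clf.predict_proba(labeled)[:, 1].mean())

def p_positive(x):
    """Corrected positive-class probability: P(y=1 | x) ~ P(s=1 | x) / c."""
    return np.clip(clf.predict_proba(x)[:, 1] / c, 0.0, 1.0)
```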

Quantitative surveys are effective at identifying general trends, but can lack depth. Qualitative surveys, on the other hand, generate rich textual data that provides a deep, individualized understanding of the situation under study. However, processing this data is more complex.

In our recent works 7, 21, 33, 37 in collaboration with the EST laboratory (Univ. Paris-Saclay), we propose a mixed method that uses both unsupervised methods and neural network models to process both quantitative and textual data, in order to take advantage of the articulation between quantitative and qualitative surveys. See also Section 4.5 about our work applied to education sciences.

Consider a hiring process with candidates coming from different universities. It is easy to order candidates who have the same background, yet it can be challenging to compare them otherwise. The latter case requires additional costly assessments and can result in sub-optimal hiring decisions. Given an assigned budget, what would be an optimal strategy to select the most qualified candidate?

In 27, in collaboration with Ziyad Benomar (Inria Saclay, Fairplay project-team), Nicolas Schreuder (LIGM) and Vianney Perchet (Criteo and Ensae), we model the above problem by introducing a new variant of the secretary problem in which sequentially observed candidates are split into two distinct groups. For each new candidate, the decision maker observes its rank among already seen candidates from the same group and can access its rank among all observed candidates at some fixed cost. To tackle this new problem, we introduce and study the family of Dynamic Double Threshold (DDT) algorithms. We show that, with well-chosen parameters, their success probability converges rapidly to
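For context, the classical single-group secretary baseline (the 1/e rule) that such variants extend can be sketched as follows; its success probability is known to converge to 1/e ≈ 0.368 as the number of candidates grows.

```python
import random

def one_over_e_rule(values):
    """Classical secretary rule: reject the first ~n/e candidates, then hire
    the first candidate better than all of them (the last one if none is)."""
    n = len(values)
    cutoff = max(1, int(n / 2.718281828))
    best_seen = max(values[:cutoff])
    for v in values[cutoff:]:
        if v > best_seen:
            return v
    return values[-1]

# Monte Carlo estimate of the probability of hiring the best candidate
rng = random.Random(0)
n, trials, wins = 50, 20000, 0
for _ in range(trials):
    values = [rng.random() for _ in range(n)]
    if one_over_e_rule(values) == max(values):
        wins += 1
success_rate = wins / trials  # close to 1/e for moderate n
```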

Sylvain Arlot is part of the ANR grant FAST-BIG (Efficient Statistical Testing for high-dimensional Models: application to Brain Imaging and Genetics), which is led by Bertrand Thirion (Inria Saclay, Parietal).

Sylvain Arlot, Evgenii Chzhen, Christophe Giraud and Gilles Stoltz are part of the PEPR-IA grant CAUSALI-T-AI (CAUSALIty Teams up with Artificial Intelligence), which is led by Marianne Clausel (Univ. de Lorraine).

Sylvain Arlot and Christophe Giraud are part of the ANR Chair-IA grant Biscotte, which is led by Gilles Blanchard (Université Paris Saclay).

Christophe Giraud is part of the ANR grant ASCAI (Active and batch segmentation, clustering, and seriation: toward unified foundations in AI), with Potsdam University, Munich University, and INRAE Montpellier.

Celeste has a contract with IRT SystemX, under the umbrella of Confiance.AI, on the subject of anomaly detection in high-dimensional time-series data for French industry.

Most of the team members (especially Professors, Associate Professors, and Ph.D. students) teach several courses at University Paris-Saclay, as part of their teaching duties. We mention below some of the classes in which we teach.

We participated in many PhD committees (too many to keep an exact record), at University Paris-Saclay as well as at other universities, and we refereed several of these PhDs.

Christine Keribin wrote two articles for Tangente magazine: one on clustering and one on mixture models.

Christophe Giraud produces educational videos on his YouTube channel "High-dimensional probability and statistics".

Gilles Stoltz supervises an “atelier MATh.en.JEANS” at Lycée Douanier Rousseau and Collège Fernand Puech, Laval.