Data science – a vast field that includes statistics, machine learning, signal processing, data visualization, and databases – has become front-page news due to its ever-increasing impact on society, over and above the important role it already played in science over the last few decades. Within data science, the statistical community has long-term experience in how to infer knowledge from data, based on solid mathematical foundations. The more recent field of machine learning has also made important progress by combining statistics and optimization, with a fresh point of view that originates in applications where prediction is more important than building models.

The Celeste project-team is positioned at the interface between statistics and machine learning. We are statisticians in a mathematics department, with strong mathematical backgrounds, interested in the interactions between theory, algorithms and applications. Indeed, applications are the source of many of our interesting theoretical problems, while the theory we develop plays a key role in (i) understanding how and why successful statistical learning algorithms work – hence improving them – and (ii) building new algorithms upon foundations from mathematical statistics.

In the theoretical and methodological domains,
Celeste aims to analyze statistical learning algorithms
– especially those which are most used in practice –
with our mathematical statistics point of view,
and develop new learning algorithms based upon our
mathematical statistics skills.

A key ingredient in our research program is connecting our theoretical and methodological results with (a great number of) real-world applications. Indeed, Celeste members work in many domains, including—but not limited to—Covid-19, neglected tropical diseases, pharmacovigilance, high-dimensional transcriptomic analysis, and energy and the environment.

Our objectives correspond to four major challenges of machine learning where mathematical statistics has a key role. First, any machine learning procedure depends on hyperparameters that must be chosen, and many procedures are available for any given learning problem: both choices are instances of the same estimator selection problem. Second, with high-dimensional and/or large data, the computational complexity of algorithms must be taken into account differently, leading to possible trade-offs between statistical accuracy and complexity, for machine learning procedures themselves as well as for estimator selection procedures. Third, real data are almost always partially corrupted, making it necessary to provide learning (and estimator selection) procedures that are robust to outliers and heavy tails, while being able to handle large datasets. Fourth, science currently faces a reproducibility crisis, making it necessary to provide statistical inference tools (p-values, confidence regions) for assessing the significance of the output of any learning algorithm (including the tuning of its hyperparameters), in a computationally efficient way.

An important goal of Celeste is to build and study estimator selection procedures – such as cross-validation and Lepski's method – that can deal with general estimators, especially those actually used in practice, which often rely on some optimization algorithm.
In order to be practical, estimator selection procedures must be
fully data-driven (that is, not relying on any unknown quantity),
computationally tractable (especially in the high-dimensional setting,
for which specific procedures must be developed)
and robust to outliers (since most real data sets include a few outliers).
Celeste aims at providing a precise theoretical analysis, for both new and existing popular estimator selection procedures, that explains as well as possible their observed behaviour in practice.

When several learning algorithms are available, with increasing computational complexity and statistical performance, which one should be used, given the amount of data and the computational power available?
This problem has emerged as a key question induced by the challenge of analyzing large amounts of data – the “big data” challenge.
Celeste wants to tackle the major challenge of understanding the time-accuracy trade-off,
which requires providing new statistical analyses of machine learning procedures
– as they are implemented in practice, including their optimization algorithms –
that are precise enough to account for the differences in performance observed in practice,
leading to conclusions that can be trusted beyond the specific settings studied.
For instance, we study the performance of ensemble methods combined with subsampling, which is a common strategy for handling big data; examples include random forests and median-of-means algorithms.
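To fix ideas, subsample-and-aggregate (the common ingredient of the ensemble methods mentioned above) can be sketched in a few lines of pure Python. The base learner (1-nearest-neighbour regression), the data and all parameter values below are hypothetical choices for illustration, not the team's actual procedures:

```python
import random

def one_nn_predict(train, x):
    """1-nearest-neighbour regression: return the y of the closest training x."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

def bagged_one_nn(data, n_estimators=50, subsample_size=50, seed=0):
    """Subsample-and-aggregate in the spirit of random forests: fit a highly
    variable base learner (1-NN) on many random subsamples and average
    the resulting predictors."""
    rng = random.Random(seed)
    subsamples = [rng.sample(data, subsample_size) for _ in range(n_estimators)]
    return lambda x: sum(one_nn_predict(s, x) for s in subsamples) / n_estimators

# Hypothetical smooth regression problem: y = x^2 + noise on [0, 1].
rng = random.Random(1)
data = [(i / 100, (i / 100) ** 2 + rng.gauss(0, 0.2)) for i in range(100)]
predictor = bagged_one_nn(data)
print(round(predictor(0.5), 2))  # typically close to 0.25, with far less variance than a single 1-NN
```

Averaging many unstable predictors fit on random subsamples reduces variance; replacing the average by a median in the aggregation step yields the median-of-means flavour of such ensembles.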

The classical theory of robustness in statistics
has recently received a lot of attention in the machine learning community.
The reason is simple: large datasets are easily corrupted, due to – for instance – storage and transmission issues, and most learning algorithms are highly sensitive to dataset corruption.
For example, the lasso can be completely misled by the presence of even a single outlier in a dataset.
A major challenge in robust learning is to provide computationally tractable
estimators with optimal subgaussian guarantees.
A second important challenge in robust learning is to deal with datasets where every observation may have been corrupted.

Celeste considers the problems of quantifying the uncertainty of predictions or estimations (thanks to confidence intervals) and of providing significance levels (p-values) for the outputs of learning algorithms.

Celeste collaborates with Anavaj Sakuntabhai and Philippe Dussart (Pasteur Institute) on predicting dengue
severity using only low-dimensional clinical data obtained at hospital arrival. Further collaborations are
underway in dengue fever and encephalitis with researchers at the Pasteur Institute, including with Jean-David
Pommier.

We collaborate with researchers at the Pasteur Institute and the University Hospital of Guadeloupe on the development of a rapid test for Covid-19 severity prediction as well as risk modeling and outcome prediction for patients admitted to ICU units.

Celeste has a long-term collaboration with EDF R&D on electricity consumption.
An important problem is to forecast consumption. We currently work on an approach involving back and forth disaggregation (of the total consumption into the consumptions of well-chosen groups/regions) and aggregation of local estimates.
We also work on consumption control by price incentives sent to specific users (volunteers), seeing it as a bandit problem.
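The bandit viewpoint on such sequential decisions can be illustrated with the classical UCB1 strategy (a textbook algorithm, shown here for intuition, not the team's actual method). Arms stand for hypothetical price incentives and rewards for a noisy, normalized measure of the induced consumption reduction; all values are made up:

```python
import math
import random

def ucb1(arm_means, horizon=5000, seed=0):
    """UCB1: pull the arm maximizing (empirical mean + exploration bonus).
    Arms are hypothetical price incentives; rewards stand in for the
    observed reduction in consumption, normalized to [0, 1]."""
    rng = random.Random(seed)
    n_arms = len(arm_means)
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for t in range(1, horizon + 1):
        if t <= n_arms:                       # initialization: pull each arm once
            arm = t - 1
        else:                                 # optimism in the face of uncertainty
            arm = max(range(n_arms), key=lambda a: sums[a] / counts[a]
                      + math.sqrt(2 * math.log(t) / counts[a]))
        reward = min(1.0, max(0.0, rng.gauss(arm_means[arm], 0.1)))
        counts[arm] += 1
        sums[arm] += reward
    return counts

counts = ucb1([0.1, 0.6, 0.3])   # arm 1 has the best (unknown) mean reward
print(counts)                    # arm 1 should receive most of the pulls
```

The exploration bonus shrinks as an arm is pulled more often, so the strategy provably concentrates its budget on the best incentive while still testing the others occasionally.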

Collected product lifetime data is often non-homogeneous, affected by production variability and differing real-world usage. Usually, this variability is not controlled or observed in any way, but needs to be taken into account for reliability analysis. Latent structure models are flexible models commonly used to model unobservable causes of variability.

Celeste currently collaborates with the PSA Group. To dimension its vehicles, the PSA Group uses a reliability design method called Strength-Stress, which takes into consideration both the statistical distribution of part strength and the statistical distribution of customer load (called Stress). In order to minimize the risk of in-service failure, the probability that a “severe” customer will encounter a weak part must be quantified. Severity quantification is not simple, since vehicle use and driver behaviour can be “severe” for some types of materials and not for others. The aim of the study is thus to define a new and richer notion of “severity” from PSA databases, resulting either from tests or from client usage. This will lead to more robust and accurate part dimensioning methods. Two CIFRE theses are in progress on these subjects:

Olivier COUDRAY, “Fatigue Data-based Design: Probabilistic Modeling of Fatigue Behavior and Analysis of Fatigue Data to Assist in the Numerical Design of a Mechanical Part”. Here, we are seeking to build probabilistic fatigue criteria to identify the critical zones of a mechanical part.

Emilien BAROUX, “Reliability dimensioning under complex loads: from specification to validation”. Here, we seek to identify and model the critical loads that a vehicle can undergo according to its usage profile (driver, roads, climate, etc.).

Ancient materials, encountered in archaeology and paleontology, are often complex, heterogeneous and poorly characterized before physico-chemical analysis. A popular technique for gathering as much physico-chemical information as possible is spectro-microscopy (spectral imaging), where a full spectrum, made of more than a thousand samples, is measured for each pixel. The produced data is tensorial, with two or three spatial dimensions and one or more spectral dimensions, and requires combining an "image" approach with a "curve analysis" approach. Since 2010, Celeste (previously Select) has collaborated with Serge Cohen (IPANEMA) on clustering problems that take spatial constraints into account.

This is a Cifre PhD in collaboration with the SNCF.

One of the factors in the punctuality of trains in dense areas (and in crisis management in the event of an incident on a line) is the respect of both the travel time between two stations and the parking time in a station. These depend, among other things, on the train, its mission, the schedule, the instantaneous load, and the configuration of the platform or station. Preliminary internal studies at the SNCF have shown that the problem is complex. Using a dataset from line E of the Transilien in Paris, we aim to address both prediction (machine learning) and modeling (statistics): (1) construct a model of station-hours, or station-hours-type of train, for example using co-clustering techniques; (2) study the correlations between the number of passengers (load), up and down flows, parking times, and possibly other variables to be defined; (3) model the flows or loads (within the same station, or the same train) as a stochastic process; (4) develop a realistic digital simulator of passenger flows and test different scenarios of incidents and their resolution, in order to propose effective solutions.

Machine learning algorithms make pivotal decisions, which influence our lives on a daily basis, using data about individuals. Recent studies show that imprudent use of these algorithms leads to unfair and discriminating decisions, often inheriting or even amplifying disparities present in data. The goal of this research program is to design and analyze novel tractable algorithms that, while still optimizing prediction performance, mitigate or remove unfair decisions of the learned predictor. A major challenge in the machine learning fairness literature is to obtain algorithms which satisfy fairness and risk guarantees simultaneously. Several empirical studies suggest that there is a trade-off between fairness and accuracy of a learned model – more accurate models are less fair. A theoretical study of these types of trade-offs is among the main directions of this research project. The goal is to provide user-friendly statistical quantification of these trade-offs and build statistically optimal algorithms in this context.

Influenced in particular by the Covid-19 pandemic in 2020, the work-related carbon emissions of Celeste team members were very low and came essentially from digital activity: computer purchases, email, videoconferencing and web browsing.

In terms of magnitude, the largest per capita ongoing emissions (excluding flying) are likely to be those from buying computers, whose construction has a carbon footprint in the range of 100 kg CO2-e each. In contrast, typical email use comes to around 10 kg CO2-e per person per year, a Zoom call to around 10 g CO2-e per hour per person, and web browsing to around 100 g CO2-e per hour. Consequently, 2020 was a very low carbon year for the Celeste team. To put this in the context of work travel by flying, one return Paris-Nice flight corresponds to 160 kg CO2-e, which likely dwarfs the total work-related emissions of any one Celeste team member in 2020.

The approximate (rounded for simplicity) CO2-e values cited above come from the book “How Bad Are Bananas?” by Mike Berners-Lee (2020), which estimates the carbon emissions of everyday life.

In addition to the long-term impact of our theoretical works—which is of course impossible to assess immediately—we are involved in several applied research projects which aim at having a short/mid-term positive impact on society.

First, we collaborate with the Pasteur Institute and the University Hospital of Guadeloupe on medical issues related to some neglected tropical diseases and to Covid-19.

Second, the broad use of artificial intelligence/machine learning/statistics nowadays comes with several major ethical issues, one being to avoid making unfair or discriminatory decisions. Our theoretical work on algorithmic fairness has already led to several “fair” algorithms that could be widely used in the short term (one of them is already used for enforcing fair decision-making in student admissions at the University of Genoa).

Third, we expect short-term positive impact on society thanks to several direct collaborations with companies such as EDF (forecasting and control of electricity load consumption), SNCF (punctuality of trains in densely-populated regions, 1 Cifre contract ongoing) and the PSA group (reliability, with 2 Cifre contracts ongoing).

S. Arlot has been a junior member of the Institut Universitaire de France (IUF) since September 2020.

The paper 2, first-authored by E. Chzhen, has been selected for an oral presentation at NeurIPS 2020 (1.1% of submitted works accepted).

Mixmod is a free toolbox for data mining and statistical learning designed for large and high-dimensional data sets. Mixmod provides reliable estimation algorithms and relevant model selection criteria.

It has been successfully applied to marketing, credit scoring, epidemiology, genomics and reliability among other domains. Its particularity is to propose a model-based approach leading to a lot of methods for classification and clustering.

Mixmod makes it possible to assess the stability of the results with simple and thorough scores. It provides an easy-to-use graphical user interface (mixmodGUI) and functions for the R (Rmixmod) and Matlab (mixmodForMatlab) environments.

Aggregated hold-out (Agghoo) is a method which averages learning rules selected by hold-out (that is, cross-validation with a single split).

G. Maillard, S. Arlot and M. Lerasle provided in 11 the first theoretical guarantees on Agghoo, ensuring that it can be used safely: Agghoo performs at worst like hold-out when the risk is convex, and the same holds true in classification with the 0-1 risk, up to an additional constant factor. For hold-out, oracle inequalities were previously known for bounded losses, as in binary classification. They show that similar results can be proved, under appropriate assumptions, for other risk-minimization problems; in particular, an oracle inequality holds true for regularized kernel regression with a Lipschitz loss, without requiring that the loss be bounded.
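A minimal sketch may help fix ideas about how Agghoo operates. The toy below uses k-nearest-neighbour regression as the base family and synthetic data, both hypothetical choices for illustration rather than the setting of the paper: on each random split, a hyperparameter is selected by hold-out, and the selected predictors (not a single retrained one) are then averaged:

```python
import random

def knn_predict(train, x, k):
    """1-D k-nearest-neighbour regression: average the y's of the k
    training points whose x is closest to the query point."""
    neigh = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(p[1] for p in neigh) / k

def agghoo(data, ks, n_splits=25, train_frac=0.8, seed=0):
    """Aggregated hold-out: on each random split, select the hyperparameter k
    with the smallest hold-out risk, then AVERAGE the selected predictors
    (instead of retraining once with a k chosen by cross-validation)."""
    rng = random.Random(seed)
    n_train = int(train_frac * len(data))
    selected = []
    for _ in range(n_splits):
        shuffled = data[:]
        rng.shuffle(shuffled)
        train, val = shuffled[:n_train], shuffled[n_train:]
        risks = {k: sum((knn_predict(train, x, k) - y) ** 2 for x, y in val) / len(val)
                 for k in ks}
        best_k = min(risks, key=risks.get)
        selected.append((train, best_k))   # keep the rule selected on this split
    # Agghoo predictor: average of the hold-out-selected predictors
    return lambda x: sum(knn_predict(tr, x, k) for tr, k in selected) / len(selected)

# Hypothetical smooth regression problem: y = x^2 + noise on [0, 1].
rng = random.Random(1)
data = [(x, x * x + rng.gauss(0, 0.1)) for x in [i / 200 for i in range(200)]]
f_hat = agghoo(data, ks=[1, 3, 9, 27])
print(round(f_hat(0.5), 3))  # should be close to the true value 0.25
```

The averaging step is what distinguishes Agghoo from plain cross-validation: the randomness of the splits is exploited to stabilize the final predictor rather than only to estimate risks.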

In another paper 33, G. Maillard studied aggregated hold-out for sparse linear regression with a robust loss function. Sparse linear regression methods generally have a free hyperparameter which controls the amount of sparsity and is subject to a bias-variance trade-off. This article considers the use of aggregated hold-out to aggregate over values of this hyperparameter, in the context of linear regression with the Huber loss function. Aggregated hold-out (Agghoo) is a procedure which averages estimators selected by hold-out (cross-validation with a single split). In the theoretical part of the article, it is proved that Agghoo satisfies a non-asymptotic oracle inequality when it is applied to sparse estimators which are parametrized by their zero-norm. In particular, this includes a variant of the Lasso introduced by Zou, Hastie and Tibshirani. Simulations are used to compare Agghoo with cross-validation. They show that Agghoo performs better than CV when the intrinsic dimension is high and when there are confounders correlated with the predictive covariates.

In his Ph.D. thesis 4, G. Maillard obtained more precise results in a specific setting, showing that Agghoo then strictly improves on the performance of any model selection procedure. This remarkable result is, to the best of our knowledge, the first of its kind; its proof required several advanced mathematical results.

Greedy algorithms for feature selection are widely used for recovering sparse high-dimensional vectors in linear models. In classical procedures, the main emphasis is put on the sample complexity, with little or no consideration of the computational resources required. E.M. Saad and S. Arlot, in collaboration with G. Blanchard, proposed in 34 a novel online algorithm, called Online Orthogonal Matching Pursuit (OOMP), for online support recovery in the random design setting of sparse linear regression. The procedure selects features sequentially, alternating between allocating samples only as needed to candidate features, and optimizing over the selected set of variables to estimate the regression coefficients. Theoretical guarantees about the output of this algorithm are proven and its computational complexity is analysed.
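The greedy principle underlying (offline) orthogonal matching pursuit can be sketched as follows. To keep the refit step trivial, this illustration assumes an orthonormal design (so the least-squares refit reduces to inner products), and the data are hypothetical; the online variant of the paper differs, notably in how samples are allocated:

```python
import random

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of 2)."""
    H = [[1]]
    while len(H) < n:
        H = [row + row for row in H] + [row + [-v for v in row] for row in H]
    return H

def omp(X, y, n_steps):
    """Orthogonal matching pursuit for orthonormal columns of X:
    greedily pick the column most correlated with the residual,
    set its coefficient, and subtract its contribution."""
    n, p = len(X), len(X[0])
    residual = y[:]
    support, coefs = [], {}
    for _ in range(n_steps):
        corrs = [sum(X[i][j] * residual[i] for i in range(n)) for j in range(p)]
        j = max(range(p), key=lambda c: abs(corrs[c]))
        support.append(j)
        coefs[j] = corrs[j]                 # refit coefficient (orthonormal case)
        residual = [residual[i] - coefs[j] * X[i][j] for i in range(n)]
    return support, coefs

# Hypothetical sparse problem: 16 orthonormal features, true support {2, 5}.
n = 16
H = hadamard(n)
X = [[H[i][j] / n ** 0.5 for j in range(n)] for i in range(n)]
rng = random.Random(0)
y = [3.0 * X[i][2] - 2.0 * X[i][5] + rng.gauss(0, 0.05) for i in range(n)]
support, coefs = omp(X, y, n_steps=2)
print(sorted(support))  # expected to recover the true support [2, 5]
```

With correlated designs the refit step becomes a genuine least-squares problem over the selected columns, which is where the computational cost, and the motivation for sample-efficient online variants, comes from.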

T.-B. Nguyen and S. Arlot, in collaboration with J.-A. Chevalier and B. Thirion, developed an extension of the knockoff inference procedure, introduced by Barber and Candès [2015]. This new method, called aggregation of multiple knockoffs (AKO), addresses the instability inherent to the random nature of knockoff-based inference. Specifically, AKO improves both the stability and power compared with the original knockoff algorithm while still maintaining guarantees for false discovery rate control. They provided in 13 a new inference procedure, proved its core properties, and demonstrated its benefits in a set of experiments on synthetic and real datasets.

G. Stoltz and H. Hadiji (see 30) studied adaptation to the range for stochastic bandit problems with finitely many arms, each associated with a distribution supported on a bounded range that is not known in advance.

The finite continuum-armed bandit problem arises in many applications where an agent must allocate a finite budget of samples over a continuum of options.

In collaboration with S. Minsker (USC), T. Mathieu worked on obtaining new excess risk bounds in robust empirical risk minimization. The method proposed in their paper 36 is inspired by the robust risk minimization procedure using median-of-means estimators of Lecué, Lerasle and Mathieu (2018). The obtained excess risk bounds are faster than the so-called “slow rate of convergence” obtained for the minimization procedure of Lecué, Lerasle and Mathieu (2018), and a slightly modified procedure achieves a minimax rate of convergence under low moment assumptions. Experiments on synthetic corrupted data and a real dataset illustrate the accuracy of the method, showing high performance in classification and regression tasks in a corrupted setting.

Until very recently, results on algorithmic fairness were almost exclusively focused on classification problems. Yet, in a lot of application domains, continuous outputs are more valuable even if the underlying problem is that of classification (e.g., credit scoring). In collaboration with C. Denis, M. Hebiri (Univ. Gustave Eiffel), L. Oneto (Univ. Genoa), M. Pontil (Istituto Italiano di Tecnologia, Univ. College London), E. Chzhen proposed a post-processing regression method which enjoys risk and fairness finite sample guarantees in 18. Their approach is based on a carefully chosen discretization of the signal space, essentially reducing the problem of regression to a problem of multi-class classification. Later, in 19 a connection between the problem of finding the optimal fair regression (in the sense of Demographic Parity) and the Wasserstein barycenter problem is derived. This connection allows us to build a data-driven post-processing method, which avoids the discretization step using the theory of optimal transport. This algorithm enjoys distribution-free fairness guarantees. Under additional assumptions, risk guarantees are also derived. A statistical minimax framework is proposed by E. Chzhen and N. Schreuder (CREST, ENSAE) in 27. This framework is built upon the earlier established connection of fair regression and the optimal transport theory, and allows us to study partially fair predictions. Within the proposed setup, Chzhen and Schreuder quantify the trade-off between Demographic Parity fairness and squared risk by obtaining a characterization of the Pareto frontier. Finally, they derive a general problem-dependent lower bound on the risk of any partially fair prediction and confirm its tightness on a Gaussian regression model with systematic group-dependent bias.
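In one dimension, the Wasserstein barycenter of the groups' prediction distributions has a simple quantile-averaging form, which the following simplified sketch exploits. It is an illustration of the idea, not the papers' exact procedure, and all scores, group labels and parameters are hypothetical:

```python
import random

def fair_postprocess(scores_by_group):
    """Demographic-parity post-processing for regression, sketching the 1-D
    Wasserstein-barycenter idea: the fair prediction for an individual at
    quantile level q of its own group is the (group-size-weighted) average
    of all groups' empirical quantiles at level q."""
    total = sum(len(s) for s in scores_by_group.values())
    weights = {g: len(s) / total for g, s in scores_by_group.items()}
    sorted_scores = {g: sorted(s) for g, s in scores_by_group.items()}

    def quantile(g, q):
        s = sorted_scores[g]
        return s[min(len(s) - 1, int(q * len(s)))]

    def transform(g, score):
        s = sorted_scores[g]
        q = sum(1 for v in s if v <= score) / len(s)   # rank of score in group g
        return sum(w * quantile(h, q) for h, w in weights.items())

    return transform

# Hypothetical biased scores: group 'b' is systematically shifted by +2.
rng = random.Random(0)
scores = {"a": [rng.gauss(0, 1) for _ in range(1000)],
          "b": [rng.gauss(2, 1) for _ in range(1000)]}
transform = fair_postprocess(scores)
fair_a = [transform("a", s) for s in scores["a"]]
fair_b = [transform("b", s) for s in scores["b"]]
# After the transform, the two groups' score distributions nearly coincide,
# while the within-group ranking of individuals is preserved.
print(round(sum(fair_a) / 1000 - sum(fair_b) / 1000, 2))
```

Because each group's scores are pushed to the same barycenter distribution, Demographic Parity holds (approximately, at the empirical level) while each individual keeps its rank within its own group.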

When clustering the nodes of a graph, a unique partition of the nodes is usually built, whether the graph is undirected or directed. While this choice is pertinent for undirected graphs, it is debatable for directed graphs because it implies that no difference is made between the clusters of source and target nodes. Defining two different clusterings for source and target nodes leads to considering a kind of bipartite clustering. We examine this question in the context of probabilistic models with latent variables, and compare the use of the stochastic block model (SBM) and the latent block model (LBM). We analyze and discuss this comparison through simulated and real data sets and provide recommendations 32.

Live imaging of lysosomal secretion monitored by total internal reflection fluorescence imaging of VAMP7-pHluorin is a straightforward way to explore secretion from this compartment. Taking advantage of cell culture on micropatterned surfaces to normalize cell shape, we employed a variety of statistical tools to perform a spatial analysis of secretory patterns. Using Ripley’s K function and a statistical test based on nearest neighbor distance (NND) we confirmed that secretion from lysosomes is not a random process but shows significant clustering 9.

The World Health Organization (WHO) proposed guidelines on dengue clinical classification in 1997 and more recently in 2009 for the clinical management of patients. The WHO 1997 classification defines three categories of dengue infection according to severity: dengue fever (DF), dengue hemorrhagic fever (DHF), and dengue shock syndrome (DSS). Alternative WHO 2009 guidelines provide a cross-sectional classification aiming to discriminate dengue fever from dengue with warning signs (DWWS) and severe dengue (SD). In this study we performed a comparison of the two dengue classifications both from a biological and statistical point of view 7.

The Latent Block Model (LBM) is a model-based method to cluster simultaneously the rows and the columns of a data matrix.

We state and prove in 8 a quantitative version of the bounded difference inequality for geometrically ergodic Markov chains. Our proof uses the same martingale decomposition as an earlier result but, compared to that work, the exact coupling argument is modified to fill a gap between the strongly aperiodic case and the general aperiodic case.

We introduce in 10 new estimators for robust machine learning based on median-of-means (MOM) estimators of the mean of real-valued random variables. These estimators achieve optimal rates of convergence under minimal assumptions on the dataset. The dataset may also have been corrupted by outliers, on which no assumption is made. We also analyze these new estimators with standard tools from robust statistics. In particular, we revisit the concept of breakdown point. We modify the original definition by studying the number of outliers that a dataset can contain without deteriorating the estimation properties of a given estimator. This new notion of breakdown number, which takes into account the statistical performance of the estimators, is non-asymptotic in nature and adapted to machine learning purposes. We prove that the breakdown number of our estimator is of the order of (number of observations) × (rate of convergence).
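The breakdown phenomenon can be illustrated on the basic median-of-means mean estimator: with K blocks, the estimate survives up to roughly K/2 corrupted points, whereas the empirical mean breaks down with a single one. The toy data and parameters below are hypothetical:

```python
import random
import statistics

def median_of_means(data, n_blocks, seed=0):
    """Median-of-means: shuffle, split the sample into blocks, average each
    block, and return the median of the block means."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    size = len(shuffled) // n_blocks
    block_means = [sum(shuffled[b * size:(b + 1) * size]) / size
                   for b in range(n_blocks)]
    return statistics.median(block_means)

# Hypothetical mean-estimation task: true mean 0, with gross outliers injected.
rng = random.Random(2)
clean = [rng.gauss(0, 1) for _ in range(1000)]
K = 20
for n_outliers in (5, 9):          # fewer than K/2 corrupted points
    corrupted = clean[:-n_outliers] + [1e9] * n_outliers
    est = median_of_means(corrupted, n_blocks=K)
    print(n_outliers, abs(est) < 0.5)   # the estimate stays near the true mean
# The empirical mean, by contrast, is destroyed by a single outlier:
print(abs(sum(clean[:-1] + [1e9]) / 1000) > 1e5)
```

As long as fewer than half of the blocks contain an outlier, the median of the block means ignores the corrupted blocks entirely, which is the mechanism behind the breakdown-number analysis above.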

Clustering is impacted by the regular increase of sample sizes, which provides an opportunity to reveal information previously out of scope. However, the volume of data leads to issues related to the need for large computational resources and to high energy consumption. Resorting to binned data on an adaptive grid is expected to give a proper answer to such green computing issues while not harming the quality of the related estimation. After a brief review of existing methods, a first application in the context of univariate model-based clustering is provided in 12, with a numerical illustration of its advantages. Finally, an initial formalization of the multivariate extension is given, highlighting both issues and possible strategies.

C. Keribin collaborates with Christophe Biernacki (INRIA-Modal) on unsupervised learning of huge datasets with limited computer resources. A co-advised thesis (DGA grant) is ongoing.

Sylvain Arlot and Matthieu Lerasle are part of the ANR grant FAST-BIG (Efficient Statistical Testing for high-dimensional Models: application to Brain Imaging and Genetics), which is led by Bertrand Thirion (Inria Saclay, Parietal).

Sylvain Arlot and Christophe Giraud are part of the ANR Chair-IA grant Biscotte, which is led by Gilles Blanchard (Université Paris Saclay).

C. Giraud: Co-organizer, with Estelle Kuhn, of the conference “StatMathAppli”, to be held in August 2021.

We performed many reviews for various international conferences.

We performed many reviews for various international journals.

S. Arlot, Statistics seminar of LPSM, Paris, 01/12/2020.

C. Keribin, Statistics seminar AgroParisTech, Paris, 18/05/2020

C. Keribin, Statistics seminar INRAE-MaIAGE, Jouy en Josas, 14/12/2020

C. Keribin, ERCIM-CMStatistics, online, 20/12/2020

E. Chzhen, Le Seminaire Palaisien, online, 06/10/2020

E. Chzhen, Statistics seminar, AgroParisTech, Paris, 02/03/2020


C. Keribin is President of the MALIA (Machine Learning and IA) group of the French Statistical Society (SFdS).

Most of the team members (especially Professors, Associate Professors and Ph.D. students) teach several courses at University Paris-Saclay, as part of their teaching duty. We mention below some of the classes in which we teach.

S. Arlot is member of the steering committee of a general-audience exhibition about artificial intelligence (“Entrez dans le monde de l'IA”), that is co-organized by Fermat Science (Toulouse), Institut Henri Poincaré (IHP, Paris) and Maison des Mathématiques et de l'Informatique (MMI, Lyon).