Un dispositif d’accompagnement dans la transition lycée-université (IUT) : enjeux et effets

CELESTE mathematical statistics and learning

Optimization, machine learning and statistical methods

Applied Mathematics, Computation and Simulation

http://team.inria.fr/celeste Laboratoire de mathématiques d'Orsay de l'Université de Paris-Sud (LMO) CNRS, Université Paris-Saclay Creation of the Project-Team: 2019 June 01 Project-Team A3.1.1. - Modeling, representation A3.1.8. - Big data (production, storage, transfer) A3.3. - Data and knowledge analysis A3.3.3. - Big data analysis A3.4. - Machine learning and statistics A3.4.1. - Supervised learning A3.4.2. - Unsupervised learning A3.4.3. - Reinforcement learning A3.4.4. - Optimization and learning A3.4.5. - Bayesian methods A3.4.7. - Kernel methods A3.5.1. - Analysis of large graphs A5.9.2. - Estimation, modeling A6. - Modeling, simulation and control A6.1. - Methods in mathematical modeling A6.2. - Scientific computing, Numerical Analysis & Optimization A6.2.4. - Statistical methods A6.3. - Computation-data interaction A6.3.1. - Inverse problems A6.3.3. - Data processing A6.3.4. - Model reduction A9.2. - Machine learning B1.1.4. - Genetics and genomics B1.1.7. - Bioinformatics B2.2.4. - Infectious diseases, Virology B2.3. - Epidemiology B2.4.1. - Pharmaco kinetics and dynamics B3.4. - Risks B4. - Energy B4.4. - Energy delivery B4.5. - Energy consumption B5.2.1. - Road vehicles B5.2.2. - Railway B5.2.3. - Aviation B5.5. - Materials B5.9. - Industrial maintenance B7.1. - Traffic management B7.1.1. - Pedestrian traffic and crowds B9.5.2. - Mathematics B9.8. - Reproducibility B9.9. - Ethics Saclay - Île-de-France Kevin Bleakley Chercheur Inria, Researcher Gilles Celeux Chercheur Inria, Emeritus, until Jul 2021 Evgenii Chzhen Chercheur CNRS, Researcher, from Oct 2021 Gilles Stoltz Chercheur CNRS, Researcher oui Sylvain Arlot Enseignant Team leader, Univ Paris-Saclay, Professor oui Christophe Giraud Enseignant Univ Paris-Saclay, Professor oui Alexandre Janon Enseignant Univ Paris-Saclay, Associate Professor Christine Keribin Enseignant Univ Paris-Saclay, Associate Professor oui Pascal Massart Enseignant Univ Paris-Saclay, Professor oui Patrick Pamphile Enseignant Univ Paris-Saclay, Associate Professor Marie-Anne Poursat Enseignant Univ Paris-Saclay, Associate Professor Pierre Andrieu PostDoc Univ Paris-Saclay, from Oct 2021 Evgenii Chzhen PostDoc Univ Paris-Saclay, until Sep 2021 Pierre Humbert PostDoc Inria, from Sep 2021 Yvenn Amara-Ouali PhD Univ Paris-Saclay Filippo Antonazzo PhD Inria, Emilien Baroux PhD Groupe PSA Antoine Barrier PhD ENS Lyon Simon Briend PhD Univ Paris-Saclay, from Sep 2021, Samy Clementz PhD Univ Paris 1, from Sep 2021 Olivier Coudray PhD Groupe PSA Remi Coulaud PhD SNCF, CIFRE Solenne Gaucher PhD Univ Paris-Saclay Karl Hajjar PhD Univ Paris-Saclay Perrine Lacroix PhD Univ Paris-Saclay Etienne Lasalle PhD Inria, Timothee Mathieu PhD Ecole normale supérieure Paris-Saclay, until Aug 2021 Binh Nguyen PhD Inria, Louis Pujol PhD Inria, from Nov 2021, El Mehdi Saad PhD Univ Paris-Saclay, Gayane Taturyan PhD IRT SystemX Nawel Arab Stagiaire Inria, from Apr 2021 until Sep 2021 Diane Iris Berenger Stagiaire Inria, from Apr 2021 until Jul 2021 Laurence Fontana Assistant Inria Claire Lacour CollaborateurExterieur Univ Paris-Est Marne La Vallée Matthieu Lerasle CollaborateurExterieur CNRS Overall objectives Mathematical statistics and learning

Data science – a vast field that includes statistics, machine learning, signal processing, data visualization, and databases – has become front-page news due to its ever-increasing impact on society, over and above the important role it already played in science over the last few decades. Within data science, the statistical community has long-term experience in how to infer knowledge from data, based on solid mathematical foundations. The more recent field of machine learning has also made important progress by combining statistics and optimization, with a fresh point of view that originates in applications where prediction is more important than building models.

The Celeste project-team is positioned at the interface between statistics and machine learning. We are statisticians in a mathematics department, with strong mathematical backgrounds behind us, interested in interactions between theory, algorithms and applications. Indeed, applications are the source of many of our interesting theoretical problems, while the theory we develop plays a key role in (i) understanding how and why successful statistical learning algorithms work – hence improving them – and (ii) building new algorithms upon mathematical statistics-based foundations

In the theoretical and methodological domains, Celeste aims to analyze statistical learning algorithms – especially those which are most used in practice – with our mathematical statistics point of view, and develop new learning algorithms based upon our mathematical statistics skills.

A key ingredient in our research program is connecting our theoretical and methodological results with (a great number of) real-world applications. Indeed, Celeste members work in many domains, including—but not limited to—Covid-19, neglected tropical diseases, reliability, and energy and the environment.

Research program General presentation

Our objectives correspond to four major challenges of machine learning where mathematical statistics have a key role. First, any machine learning procedure depends on hyperparameters that must be chosen, and many procedures are available for any given learning problem: both are an estimator selection problem. Second, with high-dimensional and/or large-scale data, the computational complexity of algorithms must be taken into account differently, leading to possible trade-offs between statistical accuracy and complexity, for machine learning procedures themselves as well as for estimator selection procedures. Third, the imprudent use of machine learning algorithms may lead to unfair and discriminatory decisions on individuals, often inheriting or even amplifying data biases; such biases must be taken into account in order to build algorithms with fairness guarantees on their decisions. Fourth, science currently faces a reproducibility crisis, making it necessary to provide statistical inference tools (p-values, confidence regions) for assessing the significance of the output of any learning algorithm (including the tuning of its hyperparameters) in a computationally efficient way.

Estimator selection

An important goal of Celeste is to build and study procedures that can deal with general estimators (especially those actually used in practice, which often rely on some optimization algorithm), such as cross-validation and Lepski's method. In order to be practical, estimator selection procedures must be fully data-driven (that is, not relying on any unknown quantity), computationally tractable (especially in the high-dimensional setting, for which specific procedures must be developed), and robust to outliers (since most real data sets include a few outliers). Celeste aims to provide a precise theoretical analysis (for new and existing popular estimator selection procedures) that describes as well as possible their observed behaviour in practice.

Relating statistical accuracy to computational complexity

When several learning algorithms are available, with increasing computational complexity and statistical performance, which one should be used, given the amount of data and the computational power available? This problem has emerged as a key question induced by the challenge of analyzing large amounts of data – the “big data” challenge. Celeste wants to tackle the major challenge of understanding the time-accuracy trade-off, which requires providing new statistical analyses of machine learning procedures – as they are done in practice, including optimization algorithms – that are precise enough in order to account for differences of performance observed in practice, leading to general conclusions that can be trusted more generally. For instance, we study the performance of ensemble methods combined with subsampling, which is a common strategy for handling big data; examples include random forests and median-of-means algorithms.

Algorithmic fairness

Machine learning algorithms make pivotal decisions, which influence our lives on a daily basis, using data about individuals. Recent studies show that imprudent use of these algorithms may lead to unfair and discriminatory decisions, often inheriting or even amplifying disparities present in data. The goal of Celeste on this topic is to design and analyze novel tractable algorithms that, while still optimizing prediction performance, mitigate or remove unfair decisions of the learned predictor. A major challenge in the machine learning fairness literature is to obtain algorithms which satisfy fairness and risk guarantees simultaneously. Several empirical studies suggest that there is a trade-off between fairness and accuracy of a learned model – more accurate models are less fair. One of our main research directions is to provide a theoretical study of these types of trade-offs. The goal is to provide user-friendly statistical quantification of such trade-offs and build statistically optimal algorithms in this context. A special attention will be paid to the online learning setting.

Statistical inference: (multiple) tests and confidence regions (including post-selection)

Celeste considers the problems of quantifying the uncertainty of predictions or estimations (thanks to confidence intervals) and of providing significance levels (p-values, corrected for multiplicity if needed) for each “discovery” made by a learning algorithm. This is an important practical issue when performing feature selection – one then speaks of post-selection inference, change-point detection, and outlier detection, to name but a few. We tackle this in particular through a collaboration with the Parietal team (Inria Saclay) and LBBE (CNRS), with applications in neuroimaging and genomics.

Application domains Neglected tropical diseases

Celeste collaborates with Anavaj Sakuntabhai and Philippe Dussart (Pasteur Institute) on predicting dengue severity using only low-dimensional clinical data obtained at hospital arrival. Other collaborations are underway in dengue fever and encephalitis with researchers at the Pasteur Institute, including with Jean-David Pommier.

Covid-19

We collaborate with researchers at the Pasteur Institute and the University Hospital of Guadeloupe on the development of a rapid test for Covid-19 severity prediction as well as risk modeling and outcome prediction for patients admitted to ICU units.

Electricity load consumption: forecasting and control

Celeste has a long-term collaboration with EDF R&D on electricity consumption. An important problem is to forecast consumption. We currently work on an approach involving back and forth disaggregation (of the total consumption into the consumption of well-chosen groups/regions) and aggregation of local estimates. We also work on consumption control by price incentives sent to specific users (volunteers), seeing it as a bandit problem.

Reliability

Collected product lifetime data is often non-homogeneous, affected by production variability and differing real-world usage. Usually, this variability is not controlled or observed in any way, but needs to be taken into account for reliability analysis. Latent structure models are flexible models commonly used to model unobservable causes of variability.

Celeste currently collaborates with Stellantis. To dimension its vehicles, Stellantis uses a reliability design method called Strength-Stress, which takes into consideration both the statistical distribution of part strength and the statistical distribution of customer load (called Stress). In order to minimize the risk of in-service failure, the probability that a “severe” customer will encounter a weak part must be quantified. Severity quantification is not simple since vehicle use and driver behaviour can be “severe” for some types of materials and not for others. The aim of the study is thus to define a new and richer notion of “severity” from Stellantis's databases, resulting either from tests or client usages. This will lead to more robust and accurate parts dimensioning methods. Two CIFRE theses are in progress on such subjects:

Olivier COUDRAY, “Fatigue Data-based Design: Probabilistic Modeling of Fatigue Behavior and Analysis of Fatigue Data to Assist in the Numerical Design of a Mechanical Part”. Here, we are seeking to build probabilistic fatigue criteria to identify the critical zones of a mechanical part.

Emilien BAROUX, “Reliability dimensioning under complex loads: from specification to validation”. Here, we seek to identify and model the critical loads that a vehicle can undergo according to its usage profile (driver, roads, climate, etc.).

Spectroscopic imaging analysis of ancient materials

Ancient materials, encountered in archaeology and paleontology are often complex, heterogeneous and poorly characterized before physico-chemical analysis. A popular technique is to gather as much physico-chemical information as possible, with spectro-microscopy or spectral imaging, where a full spectra, made of more than a thousand samples, is measured for each pixel. The produced data is tensorial with two or three spatial dimensions and one or more spectral dimensions, and requires the combination of an "image" approach with a "curve analysis" approach. Since 2010 Celeste (previously Select) collaborates with Serge Cohen (IPANEMA) on clustering problems, taking spatial constraints into account.

Forecast of dwell time during train parking at stations

One of the factors in the punctuality of trains in dense areas (and management crises in the event of an incident on a line) is the respect of both the travel time between two stations and the parking time in a station. These depend, among other things, on the train, its mission, the schedule, the instantaneous charge, and the configuration of the platform or station. Preliminary internal studies at the SNCF have shown that the problem is complex. From a dataset concerning the line E of the Transilien in Paris, we aim to address prediction (machine learning) and modeling (statistics): (1) construct a model of station-hours, station-hours-type of train, by example using co-clustering techniques; (2) study the correlations between the number of passengers (load), up and down flows, and parking times, and possibly other variables to be defined; (3) model the flows or loads (within the same station, or the same train) as a stochastic process; (4) develop a realistic digital simulator of passenger flows and test different scenarios of incidents and resolution, in order to propose effective solutions. A CIFRE PhD is in progress on this topic, in collaboration with the SNCF (Rémi Coulaud).

Social and environmental responsibility Footprint of research activities

As in 2020, the carbon emissions of Celeste team members related to their jobs were very low and came essentially from:

limited levels of transport to and from work, and not from travel to conferences.

electronic communication (email, Google searches, Zoom meetings, online seminars, etc.).

the carbon emissions embedded in their personal computing devices (construction), either laptops or desktops.

electricity for personal computing devices and for the workplace, plus also water, heating, and maintenance for the latter. Note that only 7.1% (2018) of France's electricity is not sourced from nuclear energy or renewables so team member carbon emissions related to electricity are minimal.

In terms of magnitude, the largest per capita ongoing emissions (excluding flying) are likely simply to be those from buying computers that have a carbon footprint from their construction, in the range of 100 kg Co2-e each. In contrast, typical email use per year is around 10 kg Co2-e per person, and a Zoom call comes to around 10g Co2-e per hour per person, while web browsing uses around 100g Co2-e per hour. Consequently, 2021 was a very low carbon year for the Celeste team. To put this in the context of work travel by flying, one return Paris-Nice flight corresponds to 160 kg Co2-e emissions, which likely dwarfs the total emissions of any one Celeste team member's work-related emissions in 2021.

The approximate (rounded for simplicity) Co2-e values cited above come from the book, “How Bad are Bananas” by Mike Berners-Lee (2020) which estimates carbon emissions in everyday life.

Impact of research results

In addition to the long-term impact of our theoretical work—which is of course impossible to assess immediately—we are involved in several applied research projects which aim at having a short/mid-term positive impact on society.

First, we collaborate with the Pasteur Institute and the University Hospital of Guadeloupe on medical issues related to some neglected tropical diseases and to Covid-19.

Second, the broad use of artificial intelligence/machine learning/statistics nowadays comes with several major ethical issues, one being to avoid making unfair or discriminatory decisions. Our theoretical work on algorithmic fairness has already led to several “fair” algorithms that could be widely used in the short term (one of them is already used for enforcing fair decision-making in student admissions at the University of Genoa).

Third, we expect short-term positive impact on society from several direct collaborations with companies such as EDF (forecasting and control of electricity load consumption), SNCF (punctuality of trains in densely-populated regions, one Cifre contract ongoing) and Stellantis (automobile reliability, with two Cifre contracts ongoing).

Highlights of the year Awards

Margaux Brégère: “Coup de cœur du jury” in the Smart Grids 2021 Thesis Prizes

Margaux Brégère: Finalist in the Amies Thesis Prize, 2021

Evegnii Chzhen, with Nicolas Schreuder (Univ. of Genoa): runner-up best student paper award at UAI 2021 15

New results Model selection criteria for the Latent Block Model ChristineKeribinDiane-IrisBéranger

The Latent Block Model (LBM) is a model-based approach for co-clustering: the simultaneous clustering of rows and columns of a data matrix. After results obtained last year on the consistency and asymptotic normality of the maximum likelihood estimator for LBM, we continued the theoretical study of this model: C. Keribin supervised M2 student Diane-Iris Béranger during a four month internship hosted by Inria-Celeste on model selection criteria for LBM. They extended to LBM the results of Wang et al (2017) on stochastic block models for clustering random graphs. They provided the asymptotic behavior of the likelihood ratio, and conducted numerical experiments to illustrate it in underestimation cases. Furthermore, they defined a penalty term which leads to a consistant criterion for model selection in LBM. However, the BIC penalty does not fulfill this condition, which is only sufficient. BIC consistency therefore remains a conjecture only.

Data analysis of bacterial interactions in complex ecosystems ChristineKeribin

Datasets in biology are often faced with the problem of having many more variables than individuals or units. In the supervised context, well-known regularized methods involving lasso or ridge penalization bring answers. The ridge approach uniformly shrinks the estimates, while the lasso can be viewed as a variable selection method and is then often used for its ease of interpretation. However, variables in biology are often correlated and it could be better to select groups of variables rather than individual variables. The group-lasso is one method, but it requires knowing the group a priori. C. Keribin (in collaboration with Béatrice Laroche, INRAE- MaIAGE) has started to study the added value of using co-clustering as an unsupervised method to define a way of qualifying groups of variables and their interactions along with classification or regression tasks. Several use cases underlie this study, involving the determination of color of cheese with regards to bacteria and yeast composition of the flora, or the dynamics of bacteria flora in the presence or absence of a pathogen.

Fast rates for prediction with limited expert advice E.M.Saad

E.M. Saad, in collaboration with G. Blanchard, investigated in 9 the problem of minimizing the excess generalization error with respect to the best expert prediction in a finite family in the stochastic setting, under limited access to information. They assumed that the learner only has access to a limited number of expert advice per training round, as well as for prediction. Assuming that the loss function is Lipschitz and strongly convex, they showed that if we are allowed to see the advice of only one expert per round for $T$ rounds in the training phase, or to use the advice of only one expert for prediction in the test phase, the worst-case excess risk is $Ω (1 / \sqrt{T})$ with probability lower bounded by a constant. However, if we are allowed to see at least two actively chosen experts' advice per training round and use at least two experts for prediction, a fast rate of $𝒪 (1 / T)$ can be achieved. We designed novel algorithms achieving this rate in this setting and that where the learner has a budget constraint on the total number of observed experts' advice, and gave precise instance-dependent bounds on the number of training rounds and queries needed to achieve a given generalization error precision.

A binned technique for scalable model-based clustering on huge datasets FilippoAntonazzoChristineKeribin

Clustering is impacted by the regular increase in sample sizes, the latter providing opportunity to reveal previously-undiscoverable information. However, the sheer volume of data leads to issues related to computational resource requirements, not to mention high energy consumption. Resorting to binned data depending on an adaptive grid is expected to help with such green computing issues, while not overly harming the quality of the related estimation. After a brief review of existing methods, a first application in the context of univariate model-based clustering is provided by F. Antonazzo and C. Keribin (in collaboration with C. Biernacki) in 16, with a numerical illustration of its advantages.

In a second step, F. Antonazzo and C. Keribin (in collaboration with C. Biernacki) focus on discovery of tiny but possible high-valued clusters which were not visible with more modest sample sizes. In this case, clustering is dependent on computational limits due to the high volume of data, possibly requiring extremely high memory and computation resources. In addition, the classical subsampling strategy, often adopted to overcome such limitations, is expected to fail to discover clusters in highly imbalanced cluster cases. Our proposal first consists in drastically compressing the data volume by only preserving its marginal-bin values, thus discarding the cross-bin ones. Despite this extreme information loss, we nevertheless prove an identifiability property for the diagonal mixture model, and also introduce a specific EM-like algorithm associated with a composite likelihood approach. The latter is much more frugal than a regular but unfeasible EM algorithm you would normally use on such marginal-bin data, while preserving all consistency properties. Finally, numerical experiments highlight that this proposed method outperforms subsampling both in controlled simulations and in various real applications where imbalanced clusters may typically appear, such as image segmentation, hazardous asteroid recognition, and fraud detection 19.

Optimality of variational inference for a stochastic block model with missing links SolenneGaucher

Variational methods are extremely popular in the analysis of network data. Statistical guarantees obtained for these methods typically provide asymptotic normality for the problem of estimation of global model parameters under the stochastic block model. In 24, S. Gaucher (in collaboration with O. Klopp) considers the case of networks with missing links, which is important in applications, and show that the variational approximation to the maximum likelihood estimator converges at the minimax rate. This provides the first minimax optimal and tractable estimator for the problem of parameter estimation for the stochastic block model with missing links. The theoretical results are complemented with numerical studies of simulated and real networks, which confirm the advantages of this estimator over other current methods.

Massive data in structural geology ChristineKeribinNawalArab

C. Keribin initiated a collaboration with Antonio Benedicto (GEOPS Paris-Saclay). Data in structural geology are essentially treated manually and machine learning could contribute to real advances in this field. To conduct an exploratory study, they co-supervised the five-month long M2-Datascience internship of the student Nawel Arab (supported by Inria-Celeste) and M2-Geology student Muchan Chai (supported by GEOPS). The goal was to integrate heterogeneous databases (mineralogy, structural data, geochemistry, lithology, radiometry) to determine types of structures and more generally features that could predict uranium fields. Heterogeneity (due to the types of variables as well as the location of the measures) and an extremely unbalanced situation (very few mineralized samples) made the learning tricky. An integrated database was built, and standard machine learning methods tested. While the preliminary results are not completely convincing for the moment, some interesting points have been identified to be studied further.

Localization in 1D non-parametric latent space models from pairwise affinities ChristopheGiraud

In 25, C. Giraud (in collaboration with Y. Issartel and N. Verzelen) considers the problem of estimating latent positions in a one-dimensional torus from pairwise affinities. The observed affinity between a pair of items is modeled as a noisy observation of a function $f (x_{i}, x_{j})$ of the latent positions $x_{i}, x_{j}$ of the two items on the torus. The affinity function $f$ is unknown, and it is only assumed to fulfill some shape constraints ensuring that $f (x, y)$ is large when the distance between $x$ and $y$ is small, and vice-versa. This non-parametric modeling offers good flexibility to fit data. An estimation procedure is introduced that provably locates all of the latent positions with a maximum error in the order of $log (n) / n$ , with high probability. This rate is proven to be minimax optimal. A computationally efficient variant of the procedure is also analyzed under some more restrictive assumptions. These general results can be used for the problem of statistical seriation, leading to new bounds for the maximum error in an ordering.

Algorithmic fairness EvgeniiChzhenChristopheGiraudGillesStoltz

Many decision problems are of a sequential nature, and efforts are needed to better handle fairness in such settings. In 8, E. Chzhen, C. Giraud, and G. Stoltz have introduced a unified approach to sequential fair learning in the presence of sensitive and non-sensitive contexts. The introduced approach translates the problem of fair learning as an approachability problem. It relies on Blackwell's approachability framework and, hence, it inherits the main appealing features of the Blackwell's result: a generic way to produce necessary and sufficient conditions when learning is possible, and a tractable algorithm in the latter case. Using this framework, the authors provided several (im)possibility results unifying previous results, and obtaining brand new ones. In particular, the authors provide a complete description of the trade-off between the demographic parity constraint and the group-wise calibration performance criterion.

In 15, N. Schreuder (Univ. of Genoa) and E. Chzhen study the problem of binary classification under demographic parity constraint with controlled abstention rate. They provide an efficient post-processing algorithm and exhibit distribution-free fairness, abstention, and risk guarantees. An empirical study on real data demonstrates that a very low level of abstention suffices in order to bypass the trade-off between fairness and risk. Thus, this work proposes a third parameter, which, in the presence of humans in the loop, can lighten the burden of additional constraints. Furthermore, a python implementation is provided by the authors: here.

Classification with the F-score: minimax optimality EvgeniiChzhen

In 3, E. Chzhen studies the problem of binary classification with the F-score as the performance measure. We propose a post-processing algorithm for this problem which fits a threshold for any score base classifier to yield a high F-score. The post-processing step involves only unlabeled data and can be performed in logarithmic time. Finite sample post-processing bounds for the proposed procedure are derived. Furthermore, it is shown that the procedure is minimax rate optimal, when the underlying distribution satisfies classical nonparametric assumptions. This result improves upon previously known rates for the F-score classification and bridges the gap between standard classification risk and the F-score. Finally, the generalization of this approach to the set-valued classification is provided.

Set-valued classification: advantages of the semi-supervised approach EvgeniiChzhen

In collaboration with C. Denis and M. Hebiri 2, E. Chzhen studies study supervised and semi-supervised algorithms in the set-valued classification framework with controlled expected size. While the former can use only labeled samples, the latter are able to make use of additional unlabeled data. The theoretical analysis implies that if no further assumption is made, there is no supervised method that outperforms the semi-supervised estimator proposed in this work – the best achievable rate for any supervised method is parametric even if the margin assumption is extremely favorable; on the contrary, the developed semi-supervised estimator can achieve faster rate of convergence provided that sufficiently many unlabeled samples are available. We also show that under additional smoothness assumption, supervised methods are able to achieve faster rates and the unlabeled sample cannot improve the rate of convergence. Finally, a numerical study supports our theory and emphasizes the relevance of the theoretical assumptions from an empirical perspective.

Spatially relaxed inference on high-dimensional linear models T.-B.Nguyen

T.-B. Nguyen, in collaboration with J.-A. Chevalier, B. Thirion, and J. Salmon, considered the inference problem for high-dimensional linear models, when covariates have an underlying spatial organization reflected in their correlation. A typical example of such a setting is high-resolution imaging, in which neighboring pixels are usually very similar. Accurate point and confidence interval estimation is not possible in this context with many more covariates than samples, not to mention high correlation between covariates. This calls for a reformulation of the statistical inference problem that takes into account the underlying spatial structure: if covariates are locally correlated, it is acceptable to detect them up to a given spatial uncertainty. They thus proposed to rely on the $δ$ -FWER, that is, the probability of making a false discovery at a distance greater than $δ$ from any true positive. With this target measure in mind, they studied the properties of ensemble clustered inference algorithms which combine three techniques: spatially constrained clustering, statistical inference, and ensembling to aggregate several clustered inference solutions. They showed that ensembled clustered inference algorithms control the $δ$ -FWER under standard assumptions for $δ$ equal to the largest cluster diameter. They complemented the theoretical analysis with empirical results, demonstrating accurate $δ$ -FWER control and decent power achieved by such inference algorithms.

Training Integrable Parameterizations of Deep Neural Networks in the Infinite-Width Limit KarlHajjarChristopheGiraud

To theoretically understand the behavior of trained deep neural networks, it is necessary to study the dynamics induced by gradient methods from a random initialization. However, the nonlinear and compositional structure of these models make these dynamics difficult to analyze. To overcome these challenges, large-width asymptotics have recently emerged as a fruitful viewpoint and have led to practical insights on real-world deep networks. For two-layer neural networks, it has been understood via these asymptotics that the nature of the trained model radically changes depending on the scale of the initial random weights, ranging from a kernel regime (for large initial variance) to a feature learning regime (for small initial variance). For deeper networks, more regimes are possible. In 26, K. Hajjar, L. Chizat and C. Giraud study in detail a specific choice of a “small” initialization corresponding to “mean-field” limits of neural networks, called integrable parameterizations (IPs). First, they show that under a standard i.i.d. zero-mean initialization, integrable parameterizations of neural networks with more than four layers start at a stationary point in the infinite-width limit, and no learning occurs. Then, they propose various methods to avoid this kind of behavior, and analyze in detail the resulting dynamics. Theoretical results were confirmed by numerical experiments on image classification tasks.

Characteristics and Mortality Risk Factors for Covid ICU Patients in the French West Indies KevinBleakley

Guadeloupe, a French West Indies island, was heavily affected by the first two large Covid waves. The therapeutic approach taken was different for the two waves in the ICU. We aimed to compare the two different periods in terms of characteristics and outcomes, and to evaluate risk factors associated with 60-day mortality in our overall cohort.

Patients were treated during the first wave with a combination of Hydroxychloroquine and Azithromycin, and during the second wave with dexamethasone and reinforced anticoagulation. We found that overall mortality at day 60 was high (45%) and not different between the two waves. In patients under mechanical ventilation, risk factors associated with death in a multivariate analysis were a high number of comorbidities, a high SOFA (Sequential Organ Failure Assesment) score, and—surprisingly—a delay in starting invasive mechanical ventilation after admission to the ICU 7.

Hierarchical transfer learning with applications for electricity load forecasting SolenneGaucher

The recent abundance of data on electricity consumption at different scales opens new challenges and highlights the need for new techniques to leverage information present at finer scales in order to improve forecasts at wider scales. In 34, S. Gaucher (in collaboration with A. Antoniadis and Y. Goude) takes advantage of the similarity between this hierarchical prediction problem and multi-scale transfer learning. They develop two methods for hierarchical transfer learning, based respectively on the stacking of generalized additive models and random forests, and on the use of aggregation of experts. They apply these methods to two problems of electricity load forecasting at a national scale, using smart meter data in the first case, and regional data in the second. For these two use cases, they compare the performances of their methods to that of benchmark algorithms, and investigate their behaviour using variable importance analysis. Their results demonstrate the interest of the two methods, both of which lead to a significant improvement in predictions.

Fatigue data-based design OlivierCoudrayChristineKeribinPatrickPamphilePascalMassart

Deterministic fatigue criteria are used to identify critical zones for fatigue failure of mechanical parts. While these criteria prove to be effective on experimental test data with standardized specimens, they are less effective for bench tests with prototypes: the variability inherent to tests on prototypes is poorly addressed by deterministic criteria, not to mention errors due to numerical simulations. We therefore propose to use statistical methods, (1) to improve the deterministic criteria; (2) to build new fatigue criteria 10. While an observed failure during testing suggests the presence of a critical zone, the absence of failure does not necessarily imply a safe zone. We have related this setting to PU learning (learning from positive and unlabeled data); indeed, it can be seen as a semi-supervised situation, with a special case of label noise where only a fraction of the positive instances is labeled. Some results exist for when the labeling noise is constant (the selected completely at random assumption). We are interested in the case where the probability of being labeled may depend on the covariates (selection bias, selected at random assumption). In this context, we have provided upper and lower bounds on the minimax risk, proving that the upper bound on the excess risk is almost optimal. In addition, we have quantified the impact of label noise on PU learning compared to the standard classification setting. 23.

Analysis Of Real-Life Multi-Input Loading Histories For The Reliable Design Of Vehicle Chassis EmilienBarouxPatrickPamphile

In order to reliably design automotive structures, engineers must determine and justify validation conditions and levels. These must be derived from a thorough understanding of the structural damage induced by in-service loading conditions. Based on variable amplitudes and multiple input loading histories applied on car axles, E. Baroux and P. Pamphile (in collaboration with B. Delattre, A. Constantinescu, and I. Raoult) propose in 12 a multidimensional description of pseudo-damage for the design of weak points of a car chassis. A classification of drivers allowed us to identify driving profiles that are more or less damaging to the structure. This allows the design office to identify critical points of the structure specific to certain driving events (sharp turns, potholes, pavements, etc.) Moreover, we propose a damage reconstruction of a critical driving profile using track tests. The design office can then set up bench tests representative of custumer driving profiles.

Bilateral contracts and grants with industry ChristineKeribinPatrickPamphileGillesStoltz Bilateral contracts with industry

C. KERIBIN and P. PAMPHILE. OpenLabIA Inria-Groupe Stellantis collaboration contract. 85 KE.

A. CONSTANTINESCU and P. PAMPHILE. Collaboration contract with Stellantis. 95 KE.

C. KERIBIN and G. STOLTZ. Ongoing contrat with SNCF (45 kE), on the modeling and forecasting of dwell time.

G. STOLTZ: Ongoing contract with BNP Paribas (10 kE), on stochastic bandits under budget constraints, for an application to loan management.

Partnerships and cooperations SylvainArlotKevinBleakleyChristopheGiraudMatthieuLerasle International initiatives

Christophe Giraud is part of the DFG/ANR PCRI project ASCAI (“Segmentation, clustering, et seriation actifs et passifs: vers des fondations unifiées en IA”), which is jointly lead by Alexandra Carpentier (Postdam University) and Nicolas Verzelen (INRA Montpellier)

National initiatives ANR

Sylvain Arlot and Matthieu Lerasle are part of the ANR grant FAST-BIG (Efficient Statistical Testing for high-dimensional Models: application to Brain Imaging and Genetics), which is lead by Bertrand Thirion (Inria Saclay, Parietal).

Sylvain Arlot and Christophe Giraud are part of the ANR Chair-IA grant Biscotte, which is led by Gilles Blanchard (Université Paris Saclay).

Other

Kevin Bleakley worked partially for IRT SystemX inside the Confiance.ai program.

Dissemination EquipeCeleste Promoting scientific activities Scientific events: organisation General chair, scientific chair

C. Keribin is president of the Specialized Group MAchine Learning et Intelligence Artificielle (MALIA) of the French Statistical Society (SFdS)

Member of the organizing committees

Christophe Giraud is co-organizer with Estelle Kuhn of StatMathAppli Frejus, 2022

Christophe Giraud is organizing a “statistical learning” session for the joint AMS/EMS/SMF conference, Grenoble 2022

Scientific events: selection Chair of conference program committees

Christine Keribin is VP of the scientific committee of Federated Learning Workshops, in collaboration with Owkin and Accenture Lab (two live-streamed events and one full day in person): workshop webpage.

Reviewer

We performed many reviews for various international conferences.

Journal Member of the editorial boards

S. Arlot: associate editor for Annales de l'Institut Henri Poincaré B – Probability and Statistics

G. Stoltz: associate editor for Mathematics of Operations Research

Reviewer - reviewing activities

We performed many reviews for various international journals.

Invited talks

S. Arlot, Colloquium, Department of Statistics and Actuarial Science, The University of Iowa, 18/03/2021

S. Arlot, International Conference on Statistics and Related Fields (ICON STARF), University of Luxembourg, 14/07/2021

S. Arlot, Séminaire MODAL'X, Université Paris Nanterre, 23/09/2021

S. Arlot, Séminaire de probabilités de Lyon, ENS Lyon, 25/11/2021

K. Bleakley, Parietal Inria Team seminar, 01/06/2021

E. Chzhen, DATAIA seminar, DATAIA, 22/06/2021

E. Chzhen, Séminaire Rennais de statistique, ENSAI Rennes, 05/10/2021

E. Chzhen, ML-MTP : Machine Learning in Montpellier, Theory & Practice, Université de Montpellier, 25/10/2021

C. Giraud, Weierstrass Insitute seminar, Berlin, 17/11/2021

C. Giraud, ETH Foundations of Data Science monthly seminar, ETH Zurich, 02/12/2021

C. Keribin, MHC2021 international conference, Université Paris-Saclay, 03/06/2021

C. Keribin, Working Group on Mode Based Clustering WGMBC2021, Athènes, 27/10/2021

C. Keribin, Séminaire Entropie-Mots-Stats, Caen, 19/11/2021

G. Stoltz, Séminaire de l'équipe MIA de l'AgroParisTech, 12/04/2021

Scientific expertise

G. Stoltz: HCERES panel member for Laboratoire Raphaël Salem, Rouen

Research administration

S. Arlot is a member of the board of the Computer Science Graduate School of University Paris-Saclay.

S. Arlot is a member of the board of the Computer Science Doctoral School (ED STIC) of University Paris-Saclay.

C. Giraud is a member of the Scientific Committe of labex IRMIA+, Strasbourg

C. Giraud is in charge of the whole Masters program in mathematics for University Paris-Saclay

C. Giraud is member of the board of the Mathematics Graduate School of University Paris-Saclay

C. Keribin is member of the board of the Computer Science Doctoral School (ED MSTIC) of Paris-Est Sup.

M-A. Poursat is in charge of the bachelor program in computer science and mathematics of University Paris-Saclay

Teaching - Supervision - Juries Teaching

Most of the team members (especially Professors, Associate Professors and Ph.D. students) teach several courses at University Paris-Saclay, as part of their teaching duty. We mention below some of the classes in which we teach.

Masters: S. Arlot, Statistical learning and resampling, 30h, M2, Université Paris-Sud

Masters: S. Arlot, Preparation for French mathematics agrégation (statistics), 50h, M2, Université Paris-Sud

Masters: C. Giraud, High-Dimensional Probability and Statistics, 45h, M2, Université Paris-Saclay

Masters: C. Giraud, Mathematics for AI, 75h, M1, Université Paris-Saclay

Masters: C. Keribin, unsupervised and supervised learning, M1, 42h, Université Paris-Saclay

Masters: M-A Poursat, applied statistics, 21h, in M1 Artificial Intelligence and statistical learning, 42h, in M2 Bioinformatics Université paris-Saclay

Licence: M-A Poursat, statistical inference, 72h, in LDD3 Université Paris Saclay

Masters: G. Stoltz, Sequential Learning and Optimization, 18h, M2 Université Paris-Saclay

Supervision

PhD defended on Jan. 2021: Timothée Mathieu, M-estimation and Median of Means applied to statistical learning, started Sept. 2018, co-advised by G. Lecué (ENSAE) and M. Lerasle

PhD defended on Aug. 2021: Vincent Divol, Contributions to geometric inference on manifolds and to the statistical study of persistence diagrams, started Oct. 2018, co-advised by F. Chazal (INRIA Datashape) and P. Massart

PhD defended on Dec. 2021: Tuan-Binh Nguyen, Efficient Statistical Testing for high-dimensional Models, started Oct. 2018, co-advised by S. Arlot and B. Thirion (INRIA Parietal)

PhD in progress: Solenne Gaucher, Sequential learning in random networks, started Sept. 2018, C. Giraud.

PhD in progress: Etienne Lasalle, Statistical foundations of topological data analysis for graph structured data, started Sept. 2018, co-advised by F. Chazal (INRIA Datashape) and P. Massart

PhD in progress: El Mehdi Saad, Interactions between statistical and computational aspects in machine learning, started Sept. 2019, co-advised by S. Arlot and G. Blanchard (INRIA Datashape)

PhD in progress: Perrine Lacroix, High-dimensional linear regression applied to gene interactions network inference, started Sept. 2019, co-advised by P. Massart & M.-L. Martin-Magniette (INRAE)

PhD in progress: Yvenn Amara-Ouali, Spatio-temporal modelization of electrical vehicles load, started Oct. 2019, co-advised by P. Massart, J.M. Poggi (Université de Paris) and Y. Goude (EDF R& D)

PhD in progress: Rémi Coulaud, Forecast of dwell time during train parking at station, started Oct. 2019, co-advised by G. Stoltz and C. Keribin, Cifre with SNCF

PhD in progress: Olivier Coudray, Fatigue data-based design, started Nov. 2019, co-advised by C. Keribin and P. Pamphile, Cifre with Groupe PSA

PhD in progress: Filippo Antonnazo, Unsupervised learning of huge datasets with limited computer resources, started Nov. 2019, co-advised by C. Biernacki (INRIA-Modal) and C. Keribin, DGA grant

PhD in progress: Louis Pujol, CYTOPART - Flow cytometry data clustering, started Nov. 2019, co-advised by P. Massart and M. Glisse (INRIA Datashape)

PhD in progress: Emilien Baroux, Reliability dimensioning under complex loads: from specification to validation, started July. 2020, co-advised by A. Constantinescu and P. Pamphile, CIFRE with Groupe PSA

PhD in progress: Antoine Barrier, started Sept. 2020, Best Arm Identification, co-advised by G. Stoltz and A. Garivier (ENS Lyon)

PhD in progress: Karl Hajjar, analyse dynamique de réseaux de neurones, started Oct. 2020, C. Giraud and L. Chizat.

PhD in progress: Samy Clementz, Data-driven Early Stopping Rules for saving computation resources in AI, started Sept. 2021, co-advised by S. Arlot and A. Celisse

PhD in progress: Simon Briend, Statistical theory for overparametrized deep neural networks, started Sept. 2021, co-advised by C. Giraud and G. Lugosi

PhD in progress: Gayane Taturyan, Fairness and Robustness in Machine Learning, started Nov. 2021, co-advised by E. Chzhen, J.-M. Loubes (Univ. Toulouse Paul Sabatier) and M. Hebiri (Univ. Gustave Eiffel)

Juries

S. Arlot: member of the HdR committee of Mohamed Hebiri, Université Gustave Eiffel, 07/01/2021.

S. Arlot: referee for the PhD of TrungTin Nguyen, Université de Caen Normandie, 14/12/2021.

C. Keribin: referee for the PhD of Gabriel Frisch, Université de Technologie de Compiègne, 18/10/2021

C. Keribin: member of the PhD jury for Rem-Sophia Mouradi, Institut National Polytechnique de Toulouse, 16/03/2021

G. Stoltz: referee for the PhD of David Obst, Aix-Marseille Université, 10/12/2021.

G. Stoltz: president of the PhD jury for Rémy Garnier, Université de Cergy, 08/12/2021.

Popularization Interventions

S. Arlot is member of the steering committee of a general-audience exhibition about artificial intelligence (“Entrez dans le monde de l'IA”), that is co-organized by Fermat Science (Toulouse), Institut Henri Poincaré (IHP, Paris) and Maison des Mathématiques et de l'Informatique (MMI, Lyon). The exhibition was inaugurated at MMI Lyon in September 2021.

Un dispositif d’accompagnement dans la transition lycée-université (IUT) : enjeux et effets I. Isabelle Bournaud P. Patrick Pamphile Revue internationale de pédagogie de l’enseignement supérieur March 2021 37 2 Minimax semi-supervised set-valued approach to multi-class classification E. Evgenii Chzhen C. Christophe Denis M. Mohamed Hebiri Bernoulli November 2021 Optimal Rates for Nonparametric F-Score Binary Classification via Post-Processing E. Evgenii Chzhen Mathematical Methods of Statistics September 2021 Cluster or co-cluster the nodes of oriented graphs? C. Christine Keribin Journal de la Société Française de Statistique 2021 162 1 24 Cross-validation improved by aggregation: Agghoo G. Guillaume Maillard S. Sylvain Arlot M. Matthieu Lerasle Journal of Machine Learning Research February 2021 22 20 1--55 Excess risk bounds in robust empirical risk minimization S. Stanislav Minsker T. Timothée Mathieu Information and Inference April 2021 A Tale of a Two Waves Epidemic: Characteristics and Mortality Risk Factors for COVID-19 ICU Patients in the French West Indies J.-D. Jean-David Pommier F. Frederic Martino K. Kevin Bleakley L. Laure Flurin F. V. Fabien Van Roy M. Michel Carles M. Marc Valette L. Laurent Camous Journal of Biology and Today's World April 2021 A Unified Approach to Fair Online Learning via Blackwell Approachability E. Evgenii Chzhen C. Christophe Giraud G. Gilles Stoltz Advances in Neural Information Processing Systems 35th Conference on Neural Information Processing Systems (NeurIPS 2021) Virtual conference, Australia 2021 34 Fast rates for prediction with limited expert advice E. M. El Mehdi Saad G. Gilles Blanchard NeurIPS 2021 proceedings Neural Information Processing Systems (NeurIPS) 2021 [Online], United States December 2021 34 Fatigue Data-Based Design: statistical methods for the identification of critical zones O. Olivier Coudray P. Philippe Bristiel M. Miguel Dinis C. Christine Keribin P. Patrick Pamphile SIA Simulation Numérique Online, France April 2021 Frugal Gaussian clustering of huge imbalanced datasets through a bin-marginal approach F. Filippo Antonazzo C. Christophe Biernacki C. Christine Keribin Working Group - Model-based Clustering Athens, Greece October 2021 Analysis Of Real-Life Multi-Input Loading Histories For The Reliable Design Of Vehicle Chassis E. Emilien Baroux B. Benoit Delattre A. Andrei Constantinescu P. Patrick Pamphile I. Ida Raoult Fatigue Design 2021 SENLIS, France November 2021 Dealing with missing data in model-based clustering through a MNAR model C. Christophe Biernacki C. Claire Boyer G. Gilles Celeux J. Julie Josse F. Fabien Laporte M. M. Matthieu Marbac Lourdelle A. Aude Sportisse The 14th Professor Aleksander Zeliaś International Conference on Modelling and Forecasting of Socio-Economic Phenomena Zakopane, Poland May 2021 Impact of Missing Data on Mixtures and Clustering C. Christophe Biernacki C. Claire Boyer G. Gilles Celeux J. Julie Josse F. Fabien Laporte M. Matthieu Marbac Lourdelle A. Aude Sportisse V. Vincent Vandewalle MHC2021 - Mixtures, Hidden Markov Models, Clustering Orsay, France June 2021 Classification with abstention but without disparities N. Nicolas Schreuder E. Evgenii Chzhen Conference on Uncertainty in Artificial Intelligence (UAI 2021) Virtual conference, Australia July 2021 A binned technique for scalable model-based clustering on huge datasets F. Filippo Antonazzo C. Christophe Biernacki C. Christine Keribin October 2021 11-16 Daily peak electrical load forecasting with a multi-resolution approach Y. Yvenn Amara-Ouali M. Matteo Fasiolo Y. Yannig Goude H. Hui Yan December 2021 A review of electric vehicle load open data and models Y. Yvenn Amara-Ouali Y. Yannig Goude P. Pascal Massart J.-M. Jean-Michel Poggi H. Hui Yan April 2021 Frugal Gaussian clustering of huge imbalanced datasets through a bin-marginal approach F. Filippo Antonazzo C. Christophe Biernacki C. Christine Keribin December 2021 Hierarchical transfer learning with applications for electricity load forecasting. A. Anestis Antoniadis S. Solenne Gaucher Y. Yannig Goude November 2021 Hierarchical clustering of spectral images with spatial constraints for the rapid processing of large and heterogeneous datasets G. Gilles Celeux S. X. Serge X. Cohen A. Agnès Grimaud P. Pierre Gueriau December 2021 Set-valued classification -- overview via a unified framework E. Evgenii Chzhen C. Christophe Denis M. Mohamed Hebiri T. Titouan Lorieul March 2021 Risk bounds for PU learning under Selected At Random assumption O. Olivier Coudray C. Christine Keribin P. Pascal Massart P. Patrick Pamphile January 2022 Optimality of variational inference for stochastic block model with missing links S. Solenne Gaucher O. Olga Klopp November 2021 Localization in 1D non-parametric latent space models from pairwise affinities C. Christophe Giraud Y. Yann Issartel N. Nicolas Verzelen August 2021 Training Integrable Parameterizations of Deep Neural Networks in the Infinite-Width Limit K. Karl Hajjar L. Lénaïc Chizat C. Christophe Giraud October 2021 A comprehensive review of variable selection in high-dimensional regression for molecular biology P. Perrine Lacroix M. Mélina Gallopin M.-L. Marie-Laure Martin October 2021 Heat diffusion distance processes: a statistically founded method to analyze graph data sets E. Etienne Lasalle October 2021 Aggregated hold out for sparse linear regression with a robust loss function G. Guillaume Maillard May 2021 Bayesian Estimations for Weibull Competing Risk Model with Masked Causes and Heavily Censored Data P. Patrick Pamphile G. Gilles Celeux August 2021 ISDE : Independence Structure Density Estimation L. Louis Pujol November 2021 Online Orthogonal Matching Pursuit E. M. El Mehdi Saad G. Gilles Blanchard S. Sylvain Arlot February 2021 Frugal Gaussian clustering of huge imbalanced datasets through a bin-marginal approach F. Filippo Antonazzo C. Christophe Biernacki C. Christine Keribin France November 2021 Hierarchical transfer learning with applications for electricity load forecasting S. Solenne Gaucher Y. Yannig Goude A. Anestis Antoniadis arXiv preprint arXiv:2111.08512 2021