Keywords
 A3.1.4. Uncertain data
 A3.2.3. Inference
 A3.3.2. Data mining
 A3.3.3. Big data analysis
 A3.4.1. Supervised learning
 A3.4.2. Unsupervised learning
 A3.4.5. Bayesian methods
 A3.4.7. Kernel methods
 A5.2. Data visualization
 A5.9.2. Estimation, modeling
 A6.2.3. Probabilistic methods
 A6.2.4. Statistical methods
 A6.3.3. Data processing
 A9.2. Machine learning
 B2.2.3. Cancer
 B9.5.6. Data science
 B9.6.3. Economy, Finance
 B9.6.5. Sociology
1 Team members, visitors, external collaborators
Research Scientists
 Christophe Biernacki [Inria, Senior Researcher, team leader until Nov 2020, HDR]
 Benjamin Guedj [Inria, Researcher]
 Hemant Tyagi [Inria, Researcher]
Faculty Members
 Cristian Preda [Université de Lille, Professor, team leader from Dec 2020, HDR]
 Vlad Barbu [Université de Rouen, Associate Professor, until Feb 2020, HDR]
 Alain Celisse [Université de Lille, Associate Professor, HDR]
 Sophie Dabo-Niang [Université de Lille, Professor, HDR]
 Philippe Heinrich [Université de Lille, Associate Professor]
 Serge Iovleff [Université de Lille, Associate Professor]
 Guillemette Marot [Université de Lille, Associate Professor, HDR]
 Vincent Vandewalle [Université de Lille, Associate Professor, HDR]
Post-Doctoral Fellows
 Florent Dewez [Inria]
 Vera Shalaeva [Inria, until Jun 2020]
PhD Students
 Reuben Adams [University College London, United Kingdom, from Sep 2020]
 Filippo Antonazzo [Inria]
 Yaroslav Averyanov [Inria]
 Felix Biggs [University College London, United Kingdom]
 Rajeev Bopche [Inria, from Oct 2020]
 Guillaume Braun [Insee]
 Wilfried Heyse [Inserm]
 Eglantine Karlé [Inria, from Nov 2020]
 Etienne Kronert [Worldline, from Jul 2020]
 Arthur Leroy [Université Paris Descartes, until Sep 2020]
 Issam Ali Moindjie [Inria, from Oct 2020]
 Axel Potier [Inria, from Jul 2020]
 Antonin Schrab [University College London, United Kingdom, from Sep 2020]
 Antoine Vendeville [University College London, United Kingdom]
 Luxin Zhang [Worldline, CIFRE]
Technical Staff
 Maxime Brunin [Inria, Engineer, from Jul 2020]
 Iheb Eladib [Inria, Engineer, until Feb 2020]
 Quentin Grimonprez [Inria, Engineer, until Sep 2020]
 Etienne Kronert [Inria, Engineer, from Feb 2020 until Jun 2020]
 Issam Ali Moindjie [Inria, Engineer, until Sep 2020]
 Arthur Talpaert [Inria, Engineer, until Sep 2020]
Interns and Apprentices
 Theophile Cantelobre [Inria, from Feb 2020 until Jul 2020]
 Issa Dabo [Inria, from Jun 2020 until Aug 2020]
 Cadmos Kahale-Abdou [Inria, from Jul 2020 until Oct 2020]
 Komlan Midodzi Noukpoape [Inria, from Apr 2020 until Sep 2020]
Administrative Assistant
 Anne Rejl [Inria]
Visiting Scientist
 Apoorv Vikram Singh [Indian Institute of Science, Bangalore, India, until Jan 2020]
External Collaborators
 Jean-Francois Bouin [DiagRAMS Technologies, until Mar 2020]
 Margot Correard [DiagRAMS Technologies, until Mar 2020]
2 Overall objectives
2.1 Context
In several respects, modern society has strengthened the need for statistical analysis, from both an applied and a theoretical point of view. This trend originates in the easier availability of data brought by technological breakthroughs (storage, transfer, computing); data are now so widespread that they are no longer restricted to large human organizations. The more or less explicit goal of this data availability is to improve the quality of the age-old statistical tasks of discovering new knowledge and making better predictions. These two central tasks are usually referred to as unsupervised and supervised learning respectively, even if statistics is not limited to them and other names exist depending on the community. In short, the underlying hope is: “more data for better quality and more numerous results”.
However, today's data are increasingly complex. They combine features of mixed type (for instance continuous data mixed with categorical data), missing or partially missing items (like intervals) and numerous variables (high-dimensional situations). As a consequence, the target “better quality and more numerous results” of the previous adage (both parts matter: “better quality” and also “more numerous”) cannot be reached in a somewhat “manual” way, but must inevitably rely on some theoretical formalization and guarantees. Indeed, data can be so numerous and so complex (they can live in quite abstract spaces) that the “empirical” statistician is quickly overwhelmed. Since data are by nature subject to randomness, the probabilistic framework is a very sensible theoretical environment to serve as a general guide for modern statistical analysis.
2.2 Goals
Modal is a project-team working on today's complex data sets (mixed data, missing data, high-dimensional data) for classical statistical targets (unsupervised learning, supervised learning, regression, etc.), with approaches relying on the probabilistic framework. The latter can be tackled through both model-based methods (such as mixture models, a generic tool) and model-free methods (such as probabilistic bounds on empirical quantities). Furthermore, Modal is connected to the real world through applications, typically biological ones (some members have this expertise), but many others are also considered since the application coverage of the Modal methodology is very broad. It is also important to note that, in return, applications are often real opportunities to initiate academic questioning for the statistician (as in some projects treated by the bilille platform and some bilateral contracts of the team).
From the academic point of view, Modal can be seen as belonging simultaneously to both the statistical learning and the machine learning communities, as attested by its publications. In a sense, it is an opportunity to build a bridge between these two stochastic communities around a common and broad probabilistic framework.
3 Research program
3.1 Research axis 1: Unsupervised learning
Scientific challenges related to unsupervised learning are numerous, concerning the validity of the clustering outcome, the ability to manage different kinds of data, the handling of missing data, the dimensionality of the data set, etc. Many of them are addressed by the team, leading to publications, often with a specific package delivery (sometimes upgraded to a full software product, or even to a platform grouping several software packages). Because of the breadth of this scope, it involves nearly all the permanent team members, often with PhD students and some engineers. The related works are always embedded in a probabilistic framework, typically model-based approaches but also model-free ones like PAC-Bayes (PAC stands for Probably Approximately Correct), because such a mathematical environment offers both a well-posed problem and a rigorous answer.
3.2 Research axis 2: Performance assessment
One main concern of the Modal team is to provide theoretical justifications for the procedures it designs. Such guarantees are important to avoid misleading conclusions resulting from any unsuitable use. For example, one ingredient in proving these guarantees is the use of the PAC framework, leading to finite-sample concentration inequalities. More precisely, contributions to PAC learning rely on classical empirical process theory and on PAC-Bayesian theory. The Modal team exploits such non-asymptotic tools to analyze the performance of iterative algorithms (such as gradient descent), cross-validation estimators, online change-point detection procedures, ranking algorithms, matrix factorization techniques and clustering methods, for instance. The team also develops expertise on the formal dynamic study of algorithms related to mixture models (important models used in the unsupervised setting above), such as degeneracy for the EM algorithm or label switching for the Gibbs algorithm.
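As a small numerical illustration of the finite-sample flavour of such guarantees, the sketch below evaluates a standard Hoeffding-type deviation term for a bounded loss; the figures are toy values of our own, not results from the team:

```python
import math

def hoeffding_bound(n, delta):
    """Deviation term of a Hoeffding-type PAC bound for a [0, 1]-valued loss:
    with probability at least 1 - delta over an i.i.d. sample of size n,
    true risk <= empirical risk + sqrt(log(1/delta) / (2n))."""
    return math.sqrt(math.log(1.0 / delta) / (2.0 * n))

# Toy example: empirical risk 0.10 on n = 1000 samples, confidence level 95%
empirical_risk = 0.10
bound = empirical_risk + hoeffding_bound(n=1000, delta=0.05)
print(round(bound, 4))  # -> 0.1387
```

The deviation term shrinks at the usual O(1/sqrt(n)) rate; PAC-Bayesian bounds replace the log(1/delta) term with a complexity term involving a Kullback-Leibler divergence between a posterior and a prior over predictors.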
3.3 Research axis 3: Functional data
Mainly due to technological advances, functional data are more and more widespread in many application domains. Functional data analysis (FDA) is concerned with the modeling of data such as curves, shapes, images or more complex mathematical objects, thought of as smooth realizations of a stochastic process (an infinite-dimensional data object valued in a space of possibly infinite dimension, e.g. the space of square-integrable functions). Time series are an emblematic example, even if FDA is not limited to them (spectral data, spatial data, etc.). Basically, FDA considers that data correspond to realizations of stochastic processes, usually assumed to lie in a metric, semi-metric, Hilbert or Banach space. One may consider functional data objects that are independent or dependent (in time or space) and of different types (qualitative, quantitative, ordinal, multivariate, time-dependent, space-dependent, etc.). The last decade has seen a dynamic literature on parametric and nonparametric FDA approaches (principal component analysis, clustering, regression and prediction) for different types of data and applications to various domains.
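Once curves are observed on a common grid, a basic FDA step such as functional principal component analysis reduces to an eigendecomposition of the empirical covariance of the discretized curves. Below is a minimal sketch on simulated curves; the simulation design is ours, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 100)                 # common observation grid
# Simulate n curves as random combinations of two smooth modes plus noise
n = 50
scores = rng.normal(size=(n, 2)) * np.array([2.0, 0.5])
modes = np.vstack([np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)])
X = scores @ modes + 0.05 * rng.normal(size=(n, t.size))

# Functional PCA on the discretized curves: eigendecomposition of the
# empirical covariance operator (up to the quadrature weights of the grid)
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / n
eigval, eigvec = np.linalg.eigh(cov)
ratio = eigval[::-1] / eigval.sum()        # explained-variance ratios, decreasing
print(ratio[:2].sum())                     # two modes explain almost all variance
```

The eigenvectors (columns of `eigvec`, read in reverse order) are the discretized principal modes of variation; smoothing penalties or basis expansions refine this discretized version in proper FDA treatments.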
3.4 Research axis 4: Applications motivating research
The fourth axis consists in translating real application issues into statistical problems that raise new (academic) challenges for the models developed by the Modal team. CIFRE PhDs in industry and interdisciplinary projects with research teams in health and biology are at the core of this objective. The main originality of this objective lies in the use of statistics with complex data, including in particular ultra-high-dimensional problems. We focus on real applications which cannot be solved by classical data analysis.
4 Application domains
4.1 Economic world
The Modal team applies its research to the economic world through CIFRE PhD supervision with companies such as CACF (credit scoring), A-Volute (3D sound expert), Meilleur Taux (insurance comparison) and Worldline. It also has several contracts with companies such as COLAS, Nokia, Apsys/Airbus and Safety Line (through the PERF-AI consortium).
4.2 Biology
The second main application domain of the team is biology. Members of the team are involved in the supervision and scientific animation of bilille, the bioinformatics platform of Lille, and of the OncoLille Institute.
5 Highlights of the year
 Christophe Biernacki is now Deputy Scientific Director at Inria in charge of the national scientific domain “applied mathematics, computation and simulation”.
 Christophe Biernacki has been president of the scientific committee of JdS 2020.
 Benjamin Guedj has led the emerging Inria London Programme since 2019 and was appointed Scientific Director of the programme in September 2020. The partnership between Inria and University College London (United Kingdom) officially kicks off on February 1, 2021.
 Sophie Dabo-Niang was appointed in 2020 as a member of the Committee for Diversity of the International Mathematical Union (IMU).
 DiagRAMS Technologies, a software editor dedicated to predictive maintenance, has been created this year. This startup relies on the research of the MODAL team, developing a data analysis solution to anticipate breakdowns and malfunctions on industrial equipment.
 Cristian Preda has been the head of the MODAL team since December 2020; Vincent Vandewalle is the deputy head of the team.
5.1 Awards
Wilfried Heyse has been awarded the Spring of Cardiology prize for the best oral presentation of his poster [79].
Benjamin Guedj has obtained a best reviewer award (top 10% of the reviewers) for NeurIPS 2020.
Benjamin Guedj has co-authored a paper at NeurIPS 2020 which was selected for an oral presentation (top 3%) [39].
6 New software and platforms
6.1 New software
6.1.1 pycobra
 Keywords: Statistics, Data visualization, Machine learning

Scientific Description:
pycobra is a Python library for ensemble learning, which serves as a toolkit for regression, classification and visualisation. It is scikit-learn compatible and fits into the existing scikit-learn ecosystem.
pycobra offers a Python implementation of the COBRA algorithm introduced by Biau et al. (2016) for regression.
Another implemented algorithm is the EWA (Exponentially Weighted Aggregate) aggregation technique (among several other references, see the paper by Dalalyan and Tsybakov (2007)).
Apart from these two regression aggregation algorithms, pycobra implements a version of COBRA for classification, a procedure introduced by Mojirsheibani (1999).
pycobra also offers various visualisation and diagnostic methods built on top of matplotlib which let the user analyse and compare different regression machines with COBRA. The Visualisation class also lets the user apply some of these tools (such as Voronoi tessellations) to other problems, such as clustering.
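The consensus step behind COBRA can be sketched in a few lines: a query prediction is the average of the training responses retained because enough base machines place them close to the query. This is a from-scratch illustration of the aggregation idea with toy data of our own, not the actual pycobra API:

```python
import numpy as np

def cobra_predict(machine_train_preds, y_train, machine_query_preds, eps=0.1, alpha=1.0):
    """Minimal sketch of COBRA-style combined regression (Biau et al., 2016).
    machine_train_preds: (M, n) predictions of M base machines on training points
    machine_query_preds: (M,) predictions of the M machines at the query point
    A training point is retained if at least alpha*M machines place it within
    eps of the query; the prediction is the mean response over retained points."""
    agree = np.abs(machine_train_preds - machine_query_preds[:, None]) <= eps  # (M, n)
    keep = agree.sum(axis=0) >= alpha * machine_train_preds.shape[0]
    if not keep.any():
        return float(np.mean(y_train))  # fallback: plain average
    return float(np.mean(y_train[keep]))

# Toy check: two "machines" whose predictions agree near the query,
# so the consensus keeps only the two nearby training points
y_train = np.array([1.0, 1.1, 5.0, 5.2])
train_preds = np.array([[1.0, 1.1, 5.0, 5.2],
                        [0.9, 1.2, 4.9, 5.3]])
query_preds = np.array([1.05, 1.0])
print(cobra_predict(train_preds, y_train, query_preds, eps=0.3))  # -> 1.05
```

A notable property of this aggregate, established by Biau et al., is that it performs asymptotically at least as well as the best machine in the collection.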


 URL: https://github.com/bhargavvader/pycobra
 Publication: hal-01514059
 Contact: Benjamin Guedj
 Participants: Bhargav Srinivasa Desikan, Benjamin Guedj
6.1.2 MixtComp.V4
 Keywords: Clustering, Statistics, Missing data, Mixed data
 Functional Description: MixtComp (Mixture Computation) is a model-based clustering package for mixed data originating from the Modal team (Inria Lille). It has been engineered around the idea of easy and quick integration of any new univariate model, under the conditional independence assumption. New models will progressively become available from research carried out by the Modal team or by other teams. The central architecture of MixtComp is built and its functionality has been field-tested through industry partnerships. Five basic models (Gaussian, Multinomial, Poisson, Weibull, Negative Binomial) are implemented, as well as two advanced models (Functional and Rank). MixtComp natively manages missing data (completely missing or missing by interval). MixtComp is used as an R package, but its internals are coded in C++ using state-of-the-art libraries for faster computation.
 Release Contributions: new I/O system; replacement of the regex library; improved initialization; new criteria for stopping the algorithm; management of partially missing data for several models; user documentation; additional user features in R.
 Contact: Christophe Biernacki
 Participants: Christophe Biernacki, Vincent Kubicki, Matthieu MarbacLourdelle, Serge Iovleff, Quentin Grimonprez, Etienne Goffinet
 Partners: Université de Lille, CNRS
6.1.3 MASSICCC
 Name: Massive Clustering with Cloud Computing
 Keywords: Statistic analysis, Big data, Machine learning, Web Application
 Scientific Description: The web application lets users run several software packages developed by Inria directly in a web browser. Mixmod is a classification library for continuous and categorical data. MixtComp allows for missing data and a larger choice of data types. BlockCluster is a library for co-clustering of data. When using the web application, the user can first upload a data set, then configure a job using one of the libraries mentioned and start its execution on a cluster. The results are then displayed directly in the browser, allowing for rapid understanding and interactive visualisation.
 Functional Description: The MASSICCC web application offers a simple and dynamic interface for analysing heterogeneous data with a web browser. Various software packages for statistical analysis are available (Mixmod, MixtComp, BlockCluster), which allow for supervised and unsupervised classification of large data sets.

 URL: https://massiccc.lille.inria.fr
 Contact: Christophe Biernacki
6.1.4 cfda
 Name: Categorical functional data analysis
 Keyword: Functional data
 Functional Description: The R package cfda performs descriptive statistics for categorical functional data, as well as dimension reduction and optimal encoding of states (multiple correspondence analysis extended to functional data).

 URL: https://github.com/modal-inria/cfda
 Contact: Cristian Preda
 Participants: Cristian Preda, Quentin Grimonprez, Vincent Vandewalle
 Partner: Université de Lille
6.1.5 PyRotor
 Name: Python Route Trajectory Optimiser
 Keywords: Optimization, Machine learning, Trajectory Modeling

Scientific Description:
PyRotor is a Python implementation of the trajectory optimisation method introduced in the paper “An end-to-end data-driven optimisation framework for constrained trajectories”.
The method proposes trajectories optimising a given criterion. Unlike classical approaches (such as optimal control), the method is based on the information contained in the available data. This makes it possible to restrict the search area to a neighbourhood of the observed trajectories and to incorporate the correlations estimated from the data, through a regularisation term in the cost function. An iterative approach is also developed to enforce additional constraints.
 Functional Description: PyRotor leverages available trajectory data to focus the search space and to estimate some properties which are then incorporated in the optimisation problem. This constrains the optimisation problem in a natural and simple way, so that its solution inherits realistic patterns from the data. In particular, PyRotor does not require any knowledge of the dynamics of the system.
 News of the Year: Methodology development and implementation of the first results

 URL: https://pypi.org/project/pyrotor/
 Publication: hal-03024720
 Contact: Florent Dewez
 Participants: Florent Dewez, Benjamin Guedj, Arthur Talpaert, Vincent Vandewalle
6.2 New platforms
6.2.1 MASSICCC Platform
MASSICCC is a demonstration platform giving access, through a SaaS (software as a service) concept, to data analysis libraries developed at Inria. It allows results to be obtained either directly through website-specific displays (specific and interactive visual outputs) or through the download of an R data object. It started in October 2015 for two years and is common to the Modal team (Inria Lille) and the Select team (Inria Saclay). In 2016, two packages were integrated: Mixmod and MixtComp (see the specific section about MixtComp). In 2017, the BlockCluster package was integrated, and particular attention to providing meaningful graphical outputs (for Mixmod, MixtComp and BlockCluster) directly in the web platform itself led to some specific developments. In 2019, a new version of the MixtComp software was developed. In 2020, Julien Vandaele joined the MODAL team as a research engineer to upgrade both the MixtComp software and the MASSICCC platform.
7 New results
7.1 Axis 1: Model-based Co-clustering for Ordinal Data of Different Dimensions
Participants: Christophe Biernacki.
This work has been motivated by a psychological survey on women affected by a breast tumor. Patients replied, at different moments of their treatment, to questionnaires with answers on an ordinal scale. The questions relate to aspects of their life called dimensions. To assist the psychologists in analyzing the results, it is useful to emphasize a structure in the dataset. A clustering method achieves this by creating groups of individuals depicted by a representative of the group. From a psychological standpoint, it is also useful to observe how questions may be grouped. This is why a clustering should also be performed on the features, which is called a co-clustering problem. However, gathering questions that are not related to the same dimension does not make sense from a psychologist's standpoint. Therefore, the present work performs a constrained co-clustering aiming to prevent questions from different dimensions from being assembled in a same column-cluster. In addition, the evolution of co-clusters along time has been investigated. The method relies on a constrained Latent Block Model embedding a probability distribution for ordinal data. Parameter estimation relies on a Stochastic EM algorithm associated with a Gibbs sampler, and the ICL-BIC criterion is used for selecting the numbers of co-clusters. The resulting work was accepted in an international journal in 2019, and the related R package ordinalClust has been accepted this year in another international journal [29].
This is joint work with Margot Selosse (PhD student) and Julien Jacques, both from Université de Lyon 2, and Florence Cousson-Gélie from Université Paul Valéry Montpellier 3.
7.2 Axis 1: Model-based Co-clustering for Mixed-Type Data
Participants: Christophe Biernacki.
Over decades, many studies have shown the importance of clustering to emphasize groups of observations. More recently, due to the emergence of high-dimensional datasets with a huge number of features, co-clustering techniques have emerged, with several methods for simultaneously producing groups of observations and features. By synthesizing the dataset in blocks (the crossing of a row-cluster and a column-cluster), this technique can sometimes better summarize the data and its inherent structure. The Latent Block Model (LBM) is a well-known method for performing co-clustering. However, contexts with features of different types (here called mixed-type datasets) are becoming more common, and the LBM is not directly applicable to this kind of dataset. The present work extends the usual LBM to the so-called Multiple Latent Block Model (MLBM), which is able to handle mixed-type datasets. The inference is done through a Stochastic EM algorithm embedding a Gibbs sampler, and a model selection criterion is defined to choose the numbers of row and column clusters. This method was successfully used on simulated and real datasets. This work is now accepted in an international journal [27].
This is joint work with Margot Selosse (PhD student) and Julien Jacques, both from Université de Lyon 2.
7.3 Axis 1: Relaxing the Identically Distributed Assumption in Gaussian Co-clustering for High-Dimensional Data
Participants: Christophe Biernacki.
A co-clustering model for continuous data that relaxes the identically distributed assumption within blocks of traditional co-clustering is presented. The proposed model, although allowing more flexibility, still maintains the very high degree of parsimony achieved by traditional co-clustering. A stochastic EM algorithm along with a Gibbs sampler is used for parameter estimation, and an ICL criterion is used for model selection. Simulated and real datasets are used for illustration and comparison with traditional co-clustering. This work has been submitted to an international journal [65].
This is a joint work with Michael Gallaugher (PhD student) and Paul McNicholas, both from McMaster University (Canada). Michael Gallaugher visited Modal for three months in 2018.
7.4 Axis 1: Gaussian-based Visualization of Gaussian and non-Gaussian Model-based Clustering
Participants: Christophe Biernacki, Vincent Vandewalle.
A generic method is introduced to visualize, in a Gaussian-like way and onto ${R}^{2}$, results of Gaussian or non-Gaussian model-based clustering. The key point is to explicitly force a spherical Gaussian mixture visualization to inherit the within-cluster overlap present in the initial clustering mixture. The result is a particularly user-friendly drawing of the clusters, allowing any practitioner to have a thorough overview of a potentially complex clustering result. An entropic measure informs about the quality of the drawn overlap, in comparison to the true one in the initial space. The proposed method is illustrated on four real data sets of different types (categorical, mixed, functional and network) and is implemented in the R package ClusVis. This work is now published in an international journal [15].
This is a joint work with Matthieu Marbac from ENSAI.
7.5 Axis 1: Dealing with Missing Data in Model-based Clustering through a MNAR Model
Participants: Christophe Biernacki.
Since the 90s, model-based clustering has been widely used to classify data. Nowadays, with the increase of available data, missing values are more frequent. Traditional ways to deal with them consist in obtaining a filled data set, either by discarding missing values or by imputing them. In the first case, some information is lost; in the second case, the final clustering purpose is not taken into account in the imputation step. Thus, both solutions risk blurring the clustering estimation result. Alternatively, we defend the need to embed the missingness mechanism directly within the clustering modeling step. There exist three types of missing data: missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR). In all situations, logistic regression is proposed as a natural and flexible candidate model. In particular, its flexibility allows us to design some meaningful parsimonious variants, such as dependency on the missing values or dependency on the cluster label. In this unified context, standard model selection criteria can be used to select between such different missing data mechanisms, simultaneously with the number of clusters. The practical interest of our proposal is illustrated on data derived from medical studies suffering from many missing data. Currently, a preprint is being finalized for submission to an international journal.
It is a joint work with Claire Boyer from Sorbonne Université, Gilles Celeux from Inria Saclay, Julie Josse from Inria Montpellier, Fabien Laporte from Institut Pasteur and Matthieu Marbac from ENSAI.
7.6 Axis 1: Organized Co-clustering for Textual Data Synthesis
Participants: Christophe Biernacki.
Recently, different studies have demonstrated the interest of co-clustering, which simultaneously produces clusters of rows and columns. The present work introduces a novel co-clustering model for parsimoniously summarizing textual data in documents × terms format. Besides highlighting homogeneous co-clusters, as other existing algorithms do, we also distinguish noisy co-clusters from significant ones, which is particularly useful for sparse documents × terms matrices. Furthermore, our model proposes a structure among the significant co-clusters and thus offers better interpretability to the user. By forcing a structure through row-clusters and column-clusters, this approach is competitive in terms of document clustering and offers user-friendly results. The algorithm derived for the proposed method is a Stochastic EM algorithm embedding a Gibbs sampling step, based on the Poisson distribution. A paper has now been accepted in an international journal [28] and also in a national conference with international audience [47].
This is joint work with Margot Selosse (PhD student) and Julien Jacques, both from Université de Lyon 2.
7.7 Axis 1: Model-Based Co-clustering with Covariables
Participants: Serge Iovleff.
This work has been motivated by an epidemiological and genetic survey of malaria disease in Senegal, with data collected between 1990 and 2008. It is based on a latent block model taking into account the problem of grouping variables and clustering individuals by integrating information given by a set of covariables. Numerical experiments on simulated data sets and an application to real genetic data highlight the interest of this approach. An article has been submitted to the Journal of Classification and is undergoing major revisions.
7.8 Axis 1: Predictive Clustering
Participants: Christophe Biernacki, Vincent Vandewalle.
Many data sets, for instance in biostatistics, contain sets of variables which permit evaluating unobserved traits of the subjects (e.g. we ask how many pizzas, hamburgers, chips, etc. are eaten to assess how healthy the subjects' food habits are). Moreover, we often want to measure the relations between these unobserved traits and some target variables (e.g. obesity). Thus, a two-step procedure is often used: first, a clustering of the observations is performed on the sets of variables related to the same topic; second, the predictive model is fitted by plugging the estimated partitions in as covariates. Generally, the estimated partitions are not exactly equal to the true ones. We investigate the impact of these measurement errors on the estimators of the regression parameters, and we explain when this two-step procedure is consistent. We also present a specific EM algorithm which simultaneously estimates the parameters of the clustering and predictive models. This has led to the preprint [71], now submitted to an international journal.
It is a joint work with Matthieu Marbac from ENSAI and Mohammed Sedki from Université ParisSaclay.
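The two-step procedure above (cluster the topic variables, then plug the estimated partition into the regression) can be sketched on simulated data. The data-generating process and the tiny 2-means loop below are illustrative assumptions of ours, not the paper's exact setting:

```python
import numpy as np

rng = np.random.default_rng(1)

# A latent binary trait z drives both the topic variables X and the target y
n = 400
z = rng.integers(0, 2, size=n)
X = rng.normal(loc=3.0 * z[:, None], scale=1.0, size=(n, 3))  # topic variables
y = 2.0 + 1.5 * z + 0.1 * rng.normal(size=n)                  # true effect: 1.5

# Step 1: cluster the topic variables with a tiny 2-means
# (extreme-point initialisation keeps the two starting centres apart)
centers = X[[int(np.argmin(X[:, 0])), int(np.argmax(X[:, 0]))]]
for _ in range(20):
    labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
    centers = np.vstack([X[labels == k].mean(axis=0) for k in (0, 1)])

# Step 2: plug the estimated partition into the regression as a covariate
design = np.column_stack([np.ones(n), labels])
beta = np.linalg.lstsq(design, y, rcond=None)[0]
print(abs(beta[1]))  # slope magnitude close to the true effect 1.5
```

With well-separated clusters the plug-in slope is close to the true effect; the measurement-error issue studied in the preprint arises when clusters overlap and the estimated partition differs from the true one.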
7.9 Axis 1: A Binned Technique for Scalable Model-based Clustering on Huge Datasets
Participants: Filippo Antonazzo, Christophe Biernacki.
Clustering is impacted by the steady increase of sample sizes, which provides an opportunity to reveal information previously out of scope. However, the volume of data raises issues related to the need for extensive computational resources and high energy consumption. Resorting to binned data, based on an adaptive grid, is expected to properly address such green computing issues while not harming the quality of the related estimation. After a brief review of existing methods, a first application in the context of univariate model-based clustering is provided, with a numerical illustration of its advantages. Finally, an initial formalization of the multivariate extension is given, highlighting both issues and possible strategies. This work has been accepted at a national conference with international audience [43] and also at an international conference [33].
It is a joint work with Christine Keribin from Université ParisSaclay.
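The univariate binned strategy can be sketched as follows: the raw sample is compressed once into bin centres and counts, after which a weighted EM touches only this small summary regardless of the sample size. The simulation and tuning choices below are ours, for illustration only (not the paper's adaptive grid):

```python
import numpy as np

rng = np.random.default_rng(2)

# Large 1D sample from a two-component Gaussian mixture
x = np.concatenate([rng.normal(-2, 1, 200000), rng.normal(3, 1, 300000)])

# Bin the data once: EM below only touches 200 bin centres and their weights
counts, edges = np.histogram(x, bins=200)
centres = 0.5 * (edges[:-1] + edges[1:])
w = counts / counts.sum()

# Weighted EM for a 2-component univariate Gaussian mixture on the binned summary
pi, mu, sig = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])
for _ in range(100):
    dens = pi * np.exp(-0.5 * ((centres[:, None] - mu) / sig) ** 2) / sig  # (B, 2)
    resp = dens / dens.sum(axis=1, keepdims=True)                          # E step
    nk = (w[:, None] * resp).sum(axis=0)                                   # M step
    pi = nk
    mu = (w[:, None] * resp * centres[:, None]).sum(axis=0) / nk
    sig = np.sqrt((w[:, None] * resp * (centres[:, None] - mu) ** 2).sum(axis=0) / nk)

print(np.round(np.sort(mu), 1))  # component means recovered near -2 and 3
```

Each EM iteration costs O(B × K) with B the number of bins, independently of the 500 000 raw observations, which is the green computing argument of the approach.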
7.10 Axis 1: A Bumpy Journey: Exploring Deep Gaussian Mixture Models
Participants: Christophe Biernacki.
The deep Gaussian mixture model (DGMM) is a framework directly inspired by the finite mixture of factor analysers model (MFA) and by deep learning architectures composed of multiple layers. The MFA is a generative model that considers a data point as arising from a latent variable (termed the score), which is sampled from a standard multivariate Gaussian distribution and then transformed linearly. The linear transformation matrix (termed the loading matrix) is specific to a component of the finite mixture. The DGMM consists of stacking MFA layers, in the sense that the latent scores are no longer assumed to be drawn from a standard Gaussian, but rather from a mixture of factor analysers model. Thus the latent scores are at one point considered to be the input of an MFA and also to have latent scores themselves. Only the latent scores of the DGMM's last layer are considered to be drawn from a standard multivariate Gaussian distribution. In recent years, the DGMM has gained prominence in the literature: intuitively, this model should be able to capture complex distributions more precisely than a simple Gaussian mixture model. We show in this work that, while the DGMM is an original and novel idea, in certain cases it is challenging to infer its parameters. In addition, we give some insights into the probable reasons for this difficulty. Experimental results are provided on github: https://
This is a joint work with Margot Selosse (PhD student) and Julien Jacques, both from Université de Lyon 2, and also Isobel Claire Gormley from University College Dublin (Ireland).
7.11 Axis 1: Multiple partition clustering subspaces
Participants: Vincent Vandewalle.
In model-based clustering, it is often supposed that only one clustering latent variable explains the heterogeneity of the whole dataset. However, in many cases several latent variables could explain the heterogeneity of the data at hand. Finding such class variables could result in a richer interpretation of the data. In the continuous data setting, a multi-partition model-based clustering is proposed. It assumes the existence of several latent clustering variables, each one explaining the heterogeneity of the data with respect to some clustering subspace. It makes it possible to simultaneously find the multiple partitions and the related subspaces. Parameters of the model are estimated through an EM algorithm relying on a probabilistic reinterpretation of factorial discriminant analysis. A model choice strategy relying on the BIC criterion is proposed to select the number of subspaces and the number of clusters per subspace. The obtained results are thus several projections of the data, each one conveying its own clustering of the data.
This work is now published in 32.
7.12 Axis 1: Ranking and synchronization from pairwise measurements via SVD
Participants: Hemant Tyagi.
Given a measurement graph $G=([n],E)$ and an unknown signal $r\in \mathbb{R}^{n}$, we investigate algorithms for recovering $r$ from pairwise measurements of the form $r_i - r_j$; $\{i,j\}\in E$. This problem arises in a variety of applications, such as ranking teams in sports data and time synchronization of distributed networks. Framed in the context of ranking, the task is to recover the ranking of $n$ teams (induced by $r$) given a small subset of noisy pairwise rank offsets. We propose a simple SVD-based algorithmic pipeline for both the problem of time synchronization and ranking. We provide a detailed theoretical analysis in terms of robustness against both sampling sparsity and noise perturbations with outliers, using results from matrix perturbation and random matrix theory. Our theoretical findings are complemented by a detailed set of numerical experiments on both synthetic and real data, showcasing the competitiveness of our proposed algorithms with other state-of-the-art methods. This is joint work with Alexandre d'Aspremont (CNRS & ENS, Paris) and Mihai Cucuringu (University of Oxford, United Kingdom) and has now been published in an international journal 19.
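The core spectral observation can be sketched in a few lines: the noiseless measurement matrix $C$ with $C_{ij} = r_i - r_j$ equals $r\mathbf{1}^T - \mathbf{1}r^T$, a rank-2 matrix whose column space is spanned by $r$ and the all-ones vector. The toy code below (a simplified illustration on a fully observed, noiseless matrix, not the paper's full pipeline; names are ours) recovers the scores from the top-2 singular subspace.

```python
import numpy as np

def svd_recover_scores(C):
    """Recover a score vector (up to shift and scale) from a matrix of
    pairwise differences C[i, j] ~ r[i] - r[j].  The noiseless C equals
    r 1^T - 1 r^T, whose rank-2 column space is span{r, 1}."""
    n = C.shape[0]
    U, _, _ = np.linalg.svd(C)
    ones = np.ones(n) / np.sqrt(n)
    a = U[:, :2].T @ ones          # coordinates of the 1-direction
    b = np.array([-a[1], a[0]])    # orthogonal direction inside the subspace
    v = U[:, :2] @ b               # proportional to the centered scores
    if v @ C.sum(axis=1) < 0:      # row sums of C equal n*(r - mean(r)):
        v = -v                     # use them to fix the sign ambiguity
    return v

# ranking of 4 teams from noiseless pairwise offsets
r = np.array([3.0, 1.0, 2.0, 0.0])
C = r[:, None] - r[None, :]
v = svd_recover_scores(C)
```

With noisy, partially observed offsets the same subspace is estimated from the observed entries, which is where the matrix perturbation analysis enters.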
7.13 Axis 1: Regularized spectral methods for clustering signed networks
Participants: Hemant Tyagi.
We study the problem of $k$-way clustering in signed graphs. Considerable attention in recent years has been devoted to analyzing and modeling signed graphs, where the affinity measure between nodes takes either positive or negative values. Recently, Cucuringu et al. [CDGT 2019] proposed a spectral method, namely SPONGE (Signed Positive over Negative Generalized Eigenproblem), which casts the clustering task as a generalized eigenvalue problem optimizing a suitably defined objective function. This approach is motivated by social balance theory, where the clustering task aims to decompose a given network into disjoint groups, such that individuals within the same group are connected by as many positive edges as possible, while individuals from different groups are mainly connected by negative edges. Through extensive numerical simulations, SPONGE was shown to achieve state-of-the-art empirical performance. On the theoretical front, [CDGT 2019] analyzed SPONGE and the popular Signed Laplacian method under the setting of a Signed Stochastic Block Model (SSBM), for $k=2$ equal-sized clusters, in the regime where the graph is moderately dense. In this work, we build on the results in [CDGT 2019] on two fronts for the normalized versions of SPONGE and the Signed Laplacian. Firstly, for both algorithms, we extend the theoretical analysis in [CDGT 2019] to the general setting of $k\ge 2$ unequal-sized clusters in the moderately dense regime. Secondly, we introduce regularized versions of both methods to handle sparse graphs – a regime where standard spectral methods underperform – and provide theoretical guarantees under the same SSBM model. To the best of our knowledge, regularized spectral methods have so far not been considered in the setting of clustering signed graphs. We complement our theoretical results with an extensive set of numerical experiments on synthetic data.
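A simplified SPONGE-style embedding can be sketched as the smallest generalized eigenvectors of the pencil $(L^+ + D^-,\; L^- + D^+)$, where $L = D - A$ for the positive and negative adjacency matrices. The code below (our sketch with both regularization parameters set to 1, assuming the second matrix is positive definite; not the authors' implementation) reduces the generalized problem to a standard symmetric one.

```python
import numpy as np

def sponge_embedding(Ap, Am, k=1):
    """Simplified SPONGE-style embedding for a signed graph: the k
    smallest generalized eigenvectors of (L+ + D-, L- + D+), with
    L = D - A.  Assumes L- + D+ is positive definite."""
    Dp, Dm = np.diag(Ap.sum(axis=1)), np.diag(Am.sum(axis=1))
    M = (Dp - Ap) + Dm                       # L+ + D-
    N = (Dm - Am) + Dp                       # L- + D+
    w, V = np.linalg.eigh(N)                 # reduce to a standard problem
    N_isqrt = V @ np.diag(1.0 / np.sqrt(w)) @ V.T
    vals, vecs = np.linalg.eigh(N_isqrt @ M @ N_isqrt)
    return N_isqrt @ vecs[:, :k]             # cluster rows (e.g. with k-means)

# two clusters: positive edges inside, negative edges across
Ap = np.zeros((4, 4)); Ap[0, 1] = Ap[1, 0] = Ap[2, 3] = Ap[3, 2] = 1.0
Am = np.ones((4, 4));  Am[:2, :2] = 0.0;    Am[2:, 2:] = 0.0
emb = sponge_embedding(Ap, Am, k=1).ravel()
```

On this noiseless example the leading embedding coordinate has opposite signs on the two groups, so thresholding it recovers the clusters.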
This is joint work with Mihai Cucuringu (University of Oxford, United Kingdom), Apoorv Vikram Singh (NYU) and Deborah Sulem (University of Oxford, United Kingdom). It was initiated when Apoorv Vikram Singh visited the MODAL team to work with Hemant Tyagi from Oct 2019 to Jan 2020. It is currently under review in an international journal. A summary of the results was presented at the GCLR (Graphs and more Complex structures for Learning and Reasoning) workshop at AAAI 2021 (https://
7.14 Axis 1: An extension of the angular synchronization problem to the heterogeneous setting
Participants: Hemant Tyagi.
Given an undirected measurement graph $G=([n],E)$, the classical angular synchronization problem consists of recovering unknown angles ${\theta}_{1},\cdots ,{\theta}_{n}$ from a collection of noisy pairwise measurements of the form $({\theta}_{i}-{\theta}_{j}) \bmod 2\pi$, for each $\{i,j\}\in E$. This problem arises in a variety of applications, including computer vision, time synchronization of distributed networks, and ranking from preference relationships. In this paper, we consider a generalization to the setting where there exist $k$ unknown groups of angles ${\theta}_{l,1},\cdots ,{\theta}_{l,n}$, for $l=1,\cdots ,k$. For each $\{i,j\}\in E$, we are given noisy pairwise measurements of the form ${\theta}_{\ell ,i}-{\theta}_{\ell ,j}$ for an unknown $\ell \in \{1,2,\cdots,k\}$. This can be thought of as a natural extension of the angular synchronization problem to the heterogeneous setting of multiple groups of angles, where the measurement graph has an unknown edge-disjoint decomposition $G={G}_{1}\cup {G}_{2}\cup \cdots \cup {G}_{k}$, where the ${G}_{i}$'s denote the subgraphs of edges corresponding to each group. We propose a probabilistic generative model for this problem, along with a spectral algorithm for which we provide a detailed theoretical analysis in terms of robustness against both sampling sparsity and noise. The theoretical findings are complemented by a comprehensive set of numerical experiments, showcasing the efficacy of our algorithm under various parameter regimes. Finally, we consider an application of bi-synchronization to the graph realization problem, and provide along the way an iterative graph disentangling procedure that uncovers the subgraphs ${G}_{i}$, $i=1,\cdots,k$, which is of independent interest, as it is shown to improve the final recovery accuracy across all the experiments considered.
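For the classical single-group case, the standard spectral approach (which the heterogeneous extension builds upon) can be sketched as follows: form the Hermitian matrix $H_{ij} = e^{\mathrm{i}(\theta_i-\theta_j)}$ on observed pairs; its top eigenvector recovers the angles up to one global rotation. The code below is our toy illustration on a fully observed, noiseless graph.

```python
import numpy as np

def angular_sync(D, mask):
    """Spectral angular synchronization (single-group case): D[i, j]
    holds the measurement (theta_i - theta_j) mod 2*pi on observed
    pairs (mask[i, j] = True); the top eigenvector of
    H[i, j] = exp(1j * D[i, j]) estimates the angles up to a global
    rotation."""
    H = np.where(mask, np.exp(1j * D), 0)    # Hermitian since D is antisymmetric
    _, V = np.linalg.eigh(H)
    return np.angle(V[:, -1])                # angles, modulo one global shift

theta = np.array([0.3, 1.2, 2.5, 4.0, 5.9])
D = theta[:, None] - theta[None, :]
est = angular_sync(D, np.ones((5, 5), dtype=bool))
```

In the heterogeneous setting the observed matrix mixes contributions from $k$ such rank-one structures, which is what the proposed algorithm disentangles.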
This is joint work with Mihai Cucuringu (University of Oxford, United Kingdom) and is currently under review in an international journal.
7.15 Axis 1&2: Clustering on Multilayer Graphs with Missing Values
Participants: Christophe Biernacki, Guillaume Braun, Hemant Tyagi.
Multilayer graph clustering has gained increasing interest in the last decade due to numerous applications in various fields. Several clustering methods have been proposed, but they all rely on the assumption that the network is fully observed. We propose a statistical framework to handle nodes that are missing from some layers, as well as a method to estimate the model parameters and to impute missing edge values.
This PhD work has recently begun and has led to a national conference paper with international audience 34. An extended version has been submitted and accepted at an international conference for 2021.
7.16 Axis 2: Denoising modulo samples: kNN regression and tightness of SDP relaxation
Participants: Hemant Tyagi.
Many modern applications involve the acquisition of noisy modulo samples of a function $f$, with the goal being to recover estimates of the original samples of $f$. For a Lipschitz function $f:{[0,1]}^{d}\to \mathbb{R}$, suppose we are given the samples ${y}_{i}=(f({x}_{i})+{\eta}_{i}) \bmod 1$, $i=1,\cdots ,n$, where ${\eta}_{i}$ denotes noise. Assuming the ${\eta}_{i}$ are zero-mean i.i.d. Gaussians and the ${x}_{i}$'s form a uniform grid, we derive a two-stage algorithm that recovers estimates of the samples $f({x}_{i})$ with a uniform error rate $O\left({\left(\frac{\log n}{n}\right)}^{\frac{1}{d+2}}\right)$ holding with high probability. The first stage involves embedding the points on the unit complex circle and obtaining denoised estimates of $f({x}_{i}) \bmod 1$ via a $k$NN (nearest neighbor) estimator. The second stage involves a sequential unwrapping procedure which unwraps the denoised mod 1 estimates from the first stage.
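The two stages can be sketched in one dimension as follows (our simplified illustration, exercised here on noiseless data; function names are ours): circle embedding plus $k$NN averaging, then a left-to-right unwrap that shifts each value by the integer making consecutive estimates closest.

```python
import numpy as np

def denoise_mod1(x, y, k=5):
    """Stage 1 (sketch): embed the mod-1 samples on the unit circle and
    average the k nearest neighbours in x; the angle of the mean is the
    denoised mod-1 estimate."""
    z = np.exp(2j * np.pi * y)
    out = np.empty(len(y))
    for i, xi in enumerate(x):
        nn = np.argsort(np.abs(x - xi))[:k]          # k nearest x's
        out[i] = (np.angle(z[nn].mean()) / (2 * np.pi)) % 1.0
    return out

def unwrap_1d(x, y_mod):
    """Stage 2 (sketch): sequential unwrapping along increasing x."""
    order = np.argsort(x)
    f = y_mod[order].copy()
    for j in range(1, len(f)):
        f[j] += np.round(f[j - 1] - f[j])            # closest integer shift
    out = np.empty(len(f))
    out[order] = f
    return out

x = np.linspace(0.0, 1.0, 50)
y = (2.0 * x) % 1.0                # noiseless mod-1 samples of f(x) = 2x
est = unwrap_1d(x, denoise_mod1(x, y, k=1))
```

Averaging on the circle rather than on the raw mod-1 values is what avoids the artificial jumps at the wrap-around points.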
Recently, Cucuringu and Tyagi proposed an alternative way of denoising modulo 1 data which works with their representation on the unit complex circle. They formulated a smoothness regularized least squares problem on the product manifold of unit circles, where the smoothness is measured with respect to the Laplacian of a proximity graph $G$ involving the ${x}_{i}$'s. This is a non-convex quadratically constrained quadratic program (QCQP), hence they proposed solving its semidefinite programming (SDP) relaxation. We derive sufficient conditions under which the SDP is a tight relaxation of the QCQP. Hence, under these conditions, the global solution of the QCQP can be obtained in polynomial time.
This is joint work with Michael Fanuel (KU Leuven). It is currently under review in an international journal and is undergoing revision.
7.17 Axis 2: Error analysis for denoising smooth modulo signals on a graph
Participants: Hemant Tyagi.
In many applications, we are given access to noisy modulo samples of a smooth function with the goal being to robustly unwrap the samples, i.e. to estimate the original samples of the function. In a recent work, Cucuringu and Tyagi proposed denoising the modulo samples by first representing them on the unit complex circle and then solving a smoothness regularized least squares problem – the smoothness measured w.r.t. the Laplacian of a suitable proximity graph $G$ – on the product manifold of unit circles. This problem is a quadratically constrained quadratic program (QCQP) which is non-convex, hence they proposed solving its sphere relaxation leading to a trust region subproblem (TRS). In terms of theoretical guarantees, ${\ell}_{2}$ error bounds were derived for (TRS). These bounds are however weak in general and do not really demonstrate the denoising performed by (TRS).
In this work, we analyse the (TRS) as well as an unconstrained relaxation of (QCQP). For both these estimators we provide a refined analysis in the setting of Gaussian noise and derive noise regimes where they provably denoise the modulo observations w.r.t. the ${\ell}_{2}$ norm. The analysis is performed in a general setting where $G$ is any connected graph.
This is currently under review in an international journal, and is undergoing revision.
7.18 Axis 2: Multi-kernel unmixing and super-resolution using the Modified Matrix Pencil method
Participants: Hemant Tyagi.
Consider $L$ groups of point sources or spike trains, with the ${l}^{th}$ group represented by ${x}_{l}(t)$. For a function $g:\mathbb{R}\to \mathbb{R}$, let ${g}_{l}(t)=g(t/{\mu}_{l})$ denote a point spread function with scale ${\mu}_{l}>0$, and with ${\mu}_{1}<\cdots <{\mu}_{L}$. With $y(t)={\sum}_{l=1}^{L}({g}_{l}\star {x}_{l})(t)$, our goal is to recover the source parameters given samples of $y$, or given the Fourier samples of $y$. This problem is a generalization of the usual super-resolution setup wherein $L=1$; we call it the multi-kernel unmixing super-resolution problem. Assuming access to Fourier samples of $y$, we derive an algorithm for estimating the source parameters of each group, along with precise non-asymptotic guarantees. Our approach involves estimating the group parameters sequentially in the order of increasing scale parameters, i.e. from group 1 to $L$. In particular, the estimation process at stage $1\le l\le L$ involves (i) carefully sampling the tail of the Fourier transform of $y$, (ii) a deflation step wherein we subtract the contribution of the groups processed thus far from the obtained Fourier samples, and (iii) applying Moitra's modified Matrix Pencil method on a deconvolved version of the samples in (ii).
This is joint work with Stéphane Chrétien (National Physical Laboratory, United Kingdom & Alan Turing Institute, London) and was mostly done while Hemant Tyagi was affiliated to the Alan Turing Institute. It has now been published in an international journal 17.
7.19 Axis 2: Provably robust estimation of modulo 1 samples of a smooth function with applications to phase unwrapping
Participants: Hemant Tyagi.
Consider an unknown smooth function $f:{[0,1]}^{d}\to \mathbb{R}$, and assume we are given $n$ noisy mod 1 samples of $f$, i.e. ${y}_{i}=(f({x}_{i})+{\eta}_{i}) \bmod 1$, for ${x}_{i}\in {[0,1]}^{d}$, where ${\eta}_{i}$ denotes the noise. Given the samples ${({x}_{i},{y}_{i})}_{i=1}^{n}$, our goal is to recover smooth, robust estimates of the clean samples $f({x}_{i}) \bmod 1$. We formulate a natural approach for solving this problem, which works with angular embeddings of the noisy mod 1 samples over the unit circle, inspired by the angular synchronization framework. This amounts to solving a smoothness regularized least-squares problem – a quadratically constrained quadratic program (QCQP) – where the variables are constrained to lie on the unit circle. Our proposed approach is based on solving its relaxation, which is a trust-region subproblem and hence solvable efficiently. We provide theoretical guarantees demonstrating its robustness to noise for adversarial, as well as random Gaussian and Bernoulli noise models. To the best of our knowledge, these are the first such theoretical results for this problem. We demonstrate the robustness and efficiency of our proposed approach via extensive numerical simulations on synthetic data, along with a simple least-squares based solution for the unwrapping stage, that recovers the original samples of $f$ (up to a global shift). It is shown to perform well at high levels of noise, when taking as input the denoised modulo 1 samples. Finally, we also consider two other approaches for denoising the modulo 1 samples that leverage tools from Riemannian optimization on manifolds, including a Burer-Monteiro approach for a semidefinite programming relaxation of our formulation.
For the two-dimensional version of the problem, which has applications in synthetic aperture radar interferometry (InSAR), we are able to solve instances of real-world data with a million sample points in under 10 seconds, on a personal laptop.
This is joint work with Mihai Cucuringu (University of Oxford, United Kingdom) and was mostly done while Hemant Tyagi was affiliated to the Alan Turing Institute. It has now been published in an international journal 18.
7.20 Axis 2: Pseudo-Bayesian learning with kernel Fourier transform as prior
Participants: Pascal Germain.
We revisit the kernel random Fourier features (RFF) method through the lens of the PAC-Bayesian theory. While the primary goal of RFF is to approximate a kernel, we look at the Fourier transform as a prior distribution over trigonometric hypotheses. It naturally suggests learning a posterior on these hypotheses. We derive generalization bounds that are optimized by learning a pseudo-posterior obtained from a closed-form expression, and corresponding learning algorithms.
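The starting point, the classical RFF construction, can be sketched as follows: for the RBF kernel $k(x,y)=\exp(-\gamma\|x-y\|^2)$, frequencies drawn from the kernel's Fourier transform (a $\mathcal{N}(0, 2\gamma I)$ distribution) yield random cosine features whose inner products approximate the kernel. This is the standard approximation the PAC-Bayesian reinterpretation builds on, not the learned pseudo-posterior itself.

```python
import numpy as np

def rff_features(X, n_feat=2000, gamma=0.5, seed=0):
    """Random Fourier features approximating the RBF kernel
    k(x, y) = exp(-gamma * ||x - y||^2): sample frequencies from the
    kernel's Fourier transform, N(0, 2*gamma*I), plus random phases."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, n_feat))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_feat)
    return np.sqrt(2.0 / n_feat) * np.cos(X @ W + b)

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [1.5, 1.5]])
Z = rff_features(X)
K_approx = Z @ Z.T
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K_true = np.exp(-0.5 * d2)
```

The PAC-Bayesian view replaces this fixed sampling distribution (the "prior") with a learned posterior over the trigonometric hypotheses.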
This joint work with Emilie Morvant from Université Jean Monnet de Saint-Etienne and Gaël Letarte from Laval University (Québec, Canada) was initiated in 2018 when Gaël Letarte was doing an internship at Inria, and led to a publication in the proceedings of the AISTATS 2019 conference. The same work has been presented as a poster at the “Workshop on Machine Learning with guarantees @ NeurIPS 2019”.
An extension of this work, co-authored with Léo Gautheron, Amaury Habrard, Marc Sebban, and Valentina Zantedeschi – all from Université Jean Monnet de Saint-Etienne – has been presented at the national conference CAp 2019. It is also the topic of a technical report.
7.21 Axis 2: Improved PAC-Bayesian Bounds for Linear Regression
Participants: Pascal Germain, Vera Shalaeva.
We improve the PAC-Bayesian error bound for linear regression provided in the literature. The improvements are twofold. First, the proposed error bound is tighter, and converges to the generalization loss with a well-chosen temperature parameter. Second, the error bound also holds for training data that are not independently sampled. In particular, the error bound applies to certain time series generated by well-known classes of dynamical models, such as ARX models.
It is a joint work with Mihaly Petreczky and Alireza Fakhrizadeh Esfahani from Université de Lille. It has been accepted for publication as part of the AAAI 2020 conference 41.
7.22 Axis 2: Multiview Boosting by controlling the diversity and the accuracy of view-specific voters
Participants: Pascal Germain.
We propose a boosting-based multiview learning algorithm which iteratively learns (i) weights over view-specific voters capturing view-specific information; and (ii) weights over views by optimizing a PAC-Bayes multiview C-Bound that takes into account the accuracy of view-specific classifiers and the diversity between the views. We derive a generalization bound for this strategy following the PAC-Bayes theory, which is a suitable tool to deal with models expressed as weighted combinations over a set of voters.
It is a joint work with Emilie Morvant from Université Jean Monnet de Saint-Etienne, with Massih-Reza Amini of Université Grenoble-Alpes, and with Anil Goyal, affiliated to both institutions. This work has been published in the journal Neurocomputing.
7.23 Axis 2: PAC-Bayes and Domain Adaptation
Participants: Pascal Germain.
In machine learning, Domain Adaptation (DA) arises when the distribution generating the test (target) data differs from the one generating the learning (source) data. It is well known that DA is a hard task even under strong assumptions, among which the covariate-shift, where the source and target distributions diverge only in their marginals, i.e. they have the same labeling function. Another popular approach is to consider a hypothesis class that moves the two distributions closer while implying a low error for both tasks. This is a VC-dim approach that restricts the complexity of a hypothesis class in order to get good generalization. Instead, we propose a PAC-Bayesian approach that seeks suitable weights to be given to each hypothesis in order to build a majority vote. We prove a new DA bound in the PAC-Bayesian context. This leads us to design the first DA PAC-Bayesian algorithm based on the minimization of the proposed bound. Doing so, we seek a $\rho $-weighted majority vote that takes into account a trade-off between three quantities. The first two quantities are, as usual in the PAC-Bayesian approach, (a) the complexity of the majority vote (measured by a Kullback-Leibler divergence) and (b) its empirical risk (measured by the $\rho $-average errors on the source sample). The third quantity is (c) the capacity of the majority vote to distinguish some structural difference between the source and target samples.
This work has been published in the journal Neurocomputing 24.
It is a joint work with Emilie Morvant and Amaury Habrard from Université Jean Monnet de Saint-Etienne (France), and with François Laviolette from Laval University (Québec, Canada).
7.24 Axis 2: Interpreting Neural Networks as Majority Votes through the PAC-Bayesian Theory
Participants: Pascal Germain, Paul Viallard.
We propose a PAC-Bayesian theoretical study of the two-phase learning procedure of a neural network introduced by Kawaguchi et al. 84. In this procedure, a network is expressed as a weighted combination of all the paths of the network (from the input layer to the output one), which we reformulate as a PAC-Bayesian majority vote. Starting from this observation, their learning procedure consists in (1) learning a “prior” network for fixing some parameters, then (2) learning a “posterior” network by only allowing a modification of the weights over the paths of the prior network. This allows us to derive a PAC-Bayesian generalization bound that involves the empirical individual risks of the paths (known as the Gibbs risk) and the empirical diversity between pairs of paths. Note that, similarly to classical PAC-Bayesian bounds, our result involves a KL-divergence term between the “prior” network and the “posterior” network. We show that this term is computable by dynamic programming without assuming any distribution on the network weights.
This early result has been accepted as a poster presentation at the international workshop “Workshop on Machine Learning with guarantees @ NeurIPS 2019”.
This is a joint work with researchers from Université Jean Monnet de Saint-Etienne: Amaury Habrard, Emilie Morvant, and Rémi Emonet.
7.25 Axis 2: PAC-Bayesian Bound for the Conditional Value at Risk
Participants: Benjamin Guedj.
Conditional Value at Risk (CVaR) is a family of “coherent risk measures” which generalize the traditional mathematical expectation. Widely used in mathematical finance, it is garnering increasing interest in machine learning, e.g. as an alternate approach to regularization, and as a means for ensuring fairness. This paper presents a generalization bound for learning algorithms that minimize the CVaR of the empirical loss. The bound is of PAC-Bayesian type and is guaranteed to be small when the empirical CVaR is small. We achieve this by reducing the problem of estimating CVaR to that of merely estimating an expectation. This then enables us, as a by-product, to obtain concentration inequalities for CVaR even when the random variable in question is unbounded.
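For concreteness, one common empirical estimator of CVaR at level $\alpha$ is the average of the worst $(1-\alpha)$ fraction of observed losses (our sketch; the paper works with a different, expectation-based reduction to obtain its bounds).

```python
import numpy as np

def empirical_cvar(losses, alpha=0.95):
    """One common empirical estimator of the Conditional Value at Risk:
    the average of the worst ceil((1 - alpha) * n) losses."""
    losses = np.sort(np.asarray(losses, dtype=float))[::-1]   # descending
    n = len(losses)
    k = max(1, n - int(np.floor(alpha * n)))                  # tail size
    return losses[:k].mean()
```

At $\alpha = 0$ this reduces to the ordinary mean, which is the sense in which CVaR generalizes the mathematical expectation.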
Joint work with Zakaria Mhammedi (Australian National University) and Robert Williamson. Published: 39
7.26 Axis 2: PAC-Bayesian Contrastive Unsupervised Representation Learning
Participants: Benjamin Guedj, Pascal Germain.
Contrastive unsupervised representation learning (CURL) is the state-of-the-art technique to learn representations (as a set of features) from unlabelled data. While CURL has collected several empirical successes recently, theoretical understanding of its performance was still missing. In a recent work, Arora et al. 86 provide the first generalisation bounds for CURL, relying on a Rademacher complexity. We extend their framework to the flexible PAC-Bayes setting, allowing us to deal with the non-iid setting. We present PAC-Bayesian generalisation bounds for CURL, which are then used to derive a new representation learning algorithm. Numerical experiments on real-life datasets illustrate that our algorithm achieves competitive accuracy, and yields generalisation bounds with non-vacuous values.
Joint work with Kento Nozawa (University of Tokyo & RIKEN, Japan). Published: 40
7.27 Axis 2: Revisiting clustering as matrix factorisation on the Stiefel manifold
Participants: Benjamin Guedj.
This work studies clustering for possibly high-dimensional data (e.g. images, time series, gene expression data, and many other settings), and rephrases it as low-rank matrix estimation in the PAC-Bayesian framework. Our approach leverages the well-known Burer-Monteiro factorisation strategy from large-scale optimisation, in the context of low-rank estimation. Moreover, our Burer-Monteiro factors are shown to lie on a Stiefel manifold. We propose a new generalized Bayesian estimator for this problem and prove novel prediction bounds for clustering. We also devise a component-wise Langevin sampler on the Stiefel manifold to compute this estimator.
Joint work with Stéphane Chrétien (Université Lyon 2). Published: 35
7.28 Axis 2: Kernel-Based Ensemble Learning in Python
Participants: Benjamin Guedj.
We propose a new supervised learning algorithm for classification and regression problems where two or more preliminary predictors are available. We introduce KernelCobra, a non-linear learning strategy for combining an arbitrary number of initial predictors. KernelCobra builds on the COBRA algorithm, which combined estimators based on a notion of proximity of predictions on the training data. While the COBRA algorithm used a binary threshold to decide which training data were close enough to be used, we generalise this idea by using a kernel to better encapsulate the proximity information. Such a smoothing kernel provides more representative weights to each of the training points, which are used to build the aggregate and final predictor, and KernelCobra systematically outperforms the COBRA algorithm. While COBRA is intended for regression, KernelCobra deals with both classification and regression. KernelCobra is included in the open-source Python package Pycobra (0.2.4 and onward). Numerical experiments were undertaken to assess the performance (in terms of pure prediction and computational complexity) of KernelCobra on real-life and synthetic datasets.
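The kernelised aggregation idea can be sketched as follows (our simplified illustration with a Gaussian proximity kernel, not the Pycobra implementation): training labels are averaged with weights reflecting how closely the base predictors' outputs at the query match their outputs at each training point.

```python
import numpy as np

def kernel_aggregate(train_preds, y_train, query_preds, bandwidth=1.0):
    """COBRA-style aggregation with a smooth Gaussian proximity kernel.
    train_preds: (n, M) outputs of M base predictors on training points;
    query_preds: (q, M) outputs on query points.  Replaces COBRA's
    binary closeness threshold with smooth kernel weights."""
    out = np.empty(len(query_preds))
    for j, p in enumerate(query_preds):
        d2 = ((train_preds - p) ** 2).sum(axis=1)   # disagreement with each
        w = np.exp(-d2 / (2.0 * bandwidth ** 2))    # training point
        out[j] = w @ y_train / w.sum()              # weighted label average
    return out

# toy check: one perfect base predictor, query near the third training point
y_train = np.array([0.0, 1.0, 10.0])
train_preds = y_train[:, None]
pred = kernel_aggregate(train_preds, y_train, np.array([[10.0]]), bandwidth=0.1)
```

The bandwidth plays the role of COBRA's threshold: small values concentrate the aggregate on training points where the base predictors closely agree with the query.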
Published: 25
7.29 Axis 2: Nonlinear aggregation of filters to improve image denoising
Participants: Benjamin Guedj.
We introduce a novel aggregation method to efficiently perform image denoising. Preliminary filters are aggregated in a non-linear fashion, using a new metric of pixel proximity based on how the pool of filters reaches a consensus. We provide a theoretical bound to support our aggregation scheme, illustrate its numerical performance, and show that the aggregate significantly outperforms each of the preliminary filters.
Joint work with Juliette Rengot (École des Ponts ParisTech).
Published: 37
7.30 Axis 2: Multiple change-point detection with reproducing kernels
Participants: Alain Celisse.
We tackle the change-point problem with data belonging to a general set. We build a penalty for choosing the number of change-points in the kernel-based method of Harchaoui and Cappé 83. This penalty generalizes the one proposed by Lebarbier 85 for a one-dimensional signal changing only through its mean. We prove a non-asymptotic oracle inequality for the proposed method, thanks to a new concentration result for some function of Hilbert-space-valued random variables. Experiments on synthetic and real data illustrate the accuracy of our method, showing that it can detect changes in the whole distribution of data, even when the mean and variance are constant.
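The kernel least-squares segmentation criterion underlying this line of work can be sketched in its simplest, one-split form (our illustration; the actual method handles many change-points via dynamic programming plus the proposed penalty): each segment pays $\sum_i K_{ii} - \frac{1}{n}\sum_{i,j} K_{ij}$, and the split minimizing the total cost is selected.

```python
import numpy as np

def rbf_gram(X, gamma=0.5):
    """Gram matrix of the RBF kernel on the rows of X."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def segment_cost(K):
    """Kernel least-squares cost of one segment:
    sum_i K_ii - (1/n) sum_ij K_ij."""
    return np.trace(K) - K.sum() / K.shape[0]

def best_single_split(X, gamma=0.5):
    """One-split special case of kernel change-point detection in the
    spirit of Harchaoui and Cappe: try every split and keep the one
    with the smallest total within-segment cost."""
    K = rbf_gram(X, gamma)
    n = len(X)
    costs = [segment_cost(K[:t, :t]) + segment_cost(K[t:, t:])
             for t in range(1, n)]
    return 1 + int(np.argmin(costs))    # start index of the second segment

X = np.r_[np.zeros(10), 5.0 * np.ones(10)][:, None]  # mean shift at index 10
tau = best_single_split(X)
```

Because the cost depends on the data only through the kernel, the same criterion detects changes in the full distribution, not only in the mean.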
Joint work with Sylvain Arlot (Orsay) and Zaïd Harchaoui (Seattle). This work has been accepted in JMLR.
7.31 Axis 2: Analysis of early stopping rules based on the discrepancy principle
Participants: Alain Celisse.
We describe a general unified framework for analyzing the statistical performance of early stopping rules based on the minimum discrepancy principle (DP). Finite-sample bounds such as deviation or oracle inequalities are derived with high probability. Since it turns out that DP suffers some deficiencies when estimating smooth functions, refinements involving smoothing of the residuals are introduced and analyzed. Theoretical bounds are established in the fixed design setting under mild assumptions such as the boundedness of the kernel. When focusing on the smoothed discrepancy principle, such bounds are even extended to the random design setting by means of a new change-of-norm argument.
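A minimal sketch of the rule (our illustration, assuming a known noise level $\sigma$ and kernel gradient descent as the iterative learner): iterate $f \leftarrow f + \mathrm{lr}\, K(y - f)$ and stop as soon as the squared residual norm drops below $n\sigma^2$.

```python
import numpy as np

def gd_discrepancy_stop(K, y, sigma, max_iter=1000):
    """Kernel gradient descent (L2-boosting) stopped by the minimum
    discrepancy principle: stop as soon as ||y - f||^2 <= n * sigma^2,
    i.e. when the residual reaches the presumed noise level."""
    n = len(y)
    lr = 1.0 / np.linalg.eigvalsh(K).max()   # step size for stability
    alpha = np.zeros(n)
    for t in range(max_iter):
        resid = y - K @ alpha
        if resid @ resid <= n * sigma ** 2:  # discrepancy principle check
            break
        alpha += lr * resid                  # f <- f + lr * K (y - f)
    return K @ alpha, t

y = np.arange(5.0)
f_hat, t_stop = gd_discrepancy_stop(2.0 * np.eye(5), y, sigma=1e-6)
```

Stopping at the noise level prevents the iterates from fitting the noise, which is the bias-variance trade-off the finite-sample bounds quantify; the smoothed variants replace the plain residual norm in the stopping check.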
Joint work with Markus Reiß (Humboldt) and Martin Wahl (Humboldt). This work has already been presented several times in seminars.
7.32 Axis 3: Short-term air temperature forecasting using Nonparametric Functional Data Analysis and SARMA models
Participants: Sophie Dabo-Niang.
Air temperature is a significant meteorological variable that affects social activities and economic sectors. In this paper, a nonparametric and a parametric approach are used to forecast hourly air temperature up to 24 h in advance. The former is a regression model in the Functional Data Analysis framework. The nonlinear regression operator is estimated using a kernel function. The smoothing parameter is obtained by a cross-validation procedure and used for the selection of the optimal number of closest curves. The other method applied is a Seasonal Autoregressive Moving Average (SARMA) model, the order of which is determined by the Bayesian Information Criterion. The obtained forecasts are combined using weights calculated based on the forecast errors. The results show that SARMA has a better performance for the first 6 forecasted hours, after which the Non-Parametric Functional Data Analysis (NPFDA) model provides superior results. Forecast pooling improves the accuracy of the forecasts.
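One standard way to pool two forecasts with error-based weights, shown below as a sketch, is to weight each forecast inversely to its historical mean squared error; the exact weighting used in the paper may differ.

```python
import numpy as np

def pool_forecasts(f_a, f_b, past_err_a, past_err_b):
    """Combine two forecasts with weights inversely proportional to
    their historical mean squared errors."""
    mse_a = np.mean(np.square(past_err_a))
    mse_b = np.mean(np.square(past_err_b))
    w_a = (1.0 / mse_a) / (1.0 / mse_a + 1.0 / mse_b)
    return w_a * f_a + (1.0 - w_a) * f_b
```

Computing the weights separately per forecast horizon lets the pooled forecast lean on SARMA for the early hours and on the NPFDA model later on, matching the behaviour reported above.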
It is a joint work with Stelian Curceac (Rothamsted Research, United Kingdom), Camille Ternynck (CERIM, Université de Lille), Taha B.M.J. Ouarda (INRS, Québec, Canada) and Fateh Chebana (INRS, Québec, Canada). This work has been published in the journal Environmental Modelling and Software.
7.33 Axis 3: Mathematical Modeling and Study of Random or Deterministic Phenomena
Participants: Sophie Dabo-Niang.
In order to identify mathematical modeling (including functional data analysis) and interdisciplinary research issues in evolutionary biology, epidemiology, epistemology, environmental and social sciences encountered by researchers in Mayotte, the first international conference on mathematical modeling (CIMOM’18) was held in Dembéni, Mayotte, from November 15 to 17, 2018, at the Centre Universitaire de Formation et de Recherche. The objective was to focus on mathematical research with interdisciplinarity. This contribution is a book discussing key aspects of recent developments in applied mathematical analysis and modeling. It was written after the international conference on mathematical modeling in Mayotte, where a call for book chapters was made. The chapters were written in the form of journal articles, with new results extending the talks given during the conference, and were reviewed by independent reviewers and the book publishers. The book highlights a wide range of applications in the fields of biological and environmental sciences, epidemiology and social perspectives. Each chapter examines selected research problems and presents a balanced mix of theory and applications on some selected topics. Particular emphasis is placed on presenting the fundamental developments in mathematical analysis and modeling, and on highlighting the latest developments in different fields of probability and statistics. The chapters are presented independently and contain enough references to allow the reader to explore the various topics presented.
It is a joint work with Solym Manou-Abi and Jean-Jacques Salone (Centre Universitaire de Mayotte). This book is to appear at Wiley (ISTE).
7.34 Axis 3: Categorical functional data analysis
Participants: Cristian Preda, Quentin Grimonprez, Vincent Vandewalle.
Research on functional data analysis is very active. The R package “fda” is the most famous one implementing methodology for functional data. To the best of our knowledge, and quite surprisingly, there is no recent research devoted to categorical functional data, despite its ability to model real situations in different fields of application: health and medicine (status of a patient over time), economy (status of the market), sociology (evolution of social status), and so on. We have developed methodology to visualize, perform dimension reduction on and extract features from categorical functional data. For this, the cfda R package has been developed. This has led to the preprint 72 that will be submitted to an international journal.
7.35 Axis 3: Scan Statistics
Participants: Cristian Preda, Alexandru Amarioarei.
The one-dimensional discrete scan statistic is considered over sequences of random variables generated by block-factor dependence models. Viewed as the maximum of a 1-dependent stationary sequence, the distribution of the scan statistic is approximated with known accuracy, and sharp bounds are provided. The longest-increasing-run statistic is related to the scan statistic and its distribution is studied. The moving-average process is a particular case of a block factor, and the distribution of the associated scan statistic is approximated. Numerical results are presented.
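As a minimal sketch of the object under study (our own illustrative code, not the authors'), the one-dimensional discrete scan statistic with window length m is the maximum moving sum of m consecutive observations; for a 0/1 sequence it equals m exactly when a run of m ones occurs, which is the link to the longest-run statistic mentioned above:

```python
# Minimal illustration (our code, not the authors'): the 1-D discrete scan
# statistic is the maximum of the moving sums over windows of length m.
def scan_statistic(x, m):
    """Maximum, over all windows of m consecutive entries of x, of the window sum."""
    return max(sum(x[i:i + m]) for i in range(len(x) - m + 1))

x = [0, 1, 1, 0, 1, 1, 1, 0, 0, 1]
print(scan_statistic(x, 3))   # 3: attained on the window (1, 1, 1)
```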
7.36 Axis 3: Clustering categorical functional data
Participants: Cristian Preda, Vincent Vandewalle, Vlad Stefan Barbu.
The objective of this research direction was: (i) to propose possible modelling approaches for categorical functional data and (ii) to investigate the identifiability of such models. A first modelling framework is to consider that an observed functional data path is a sample path of a Markov process, and that the $n$ sample paths come from several, say $K$, different processes. Consequently, we have a mixture of $K$ different Markov processes. A second modelling framework is to consider that the observed sample paths come from several semi-Markov processes. The parameters are estimated through techniques based on the EM algorithm, while the number of classes is selected using information criteria. An important problem is to determine the class membership of each sample path, but our main concern, which we have started to investigate, is the identifiability of these models. As far as we have studied, it seems that identifiability cannot be obtained in general, but only by imposing restrictions on the parameters of the model, cf. 81, 82. Our work in progress aims at finding sufficiently general conditions that guarantee this identifiability.
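The first modelling framework can be sketched as follows (a hypothetical simulation with made-up transition matrices and weights; the EM estimation itself is not shown): each of the $n$ sample paths is drawn from one of $K$ Markov chains chosen according to mixing weights:

```python
import random

# Hedged sketch (illustrative values only): sample categorical functional
# data as paths of a mixture of K discrete-time Markov chains.
def sample_path(P, length, start=0):
    """Simulate one sample path of a Markov chain with transition matrix P."""
    path = [start]
    for _ in range(length - 1):
        probs = P[path[-1]]
        path.append(random.choices(range(len(probs)), weights=probs)[0])
    return path

def sample_mixture(Ps, weights, n, length):
    """Draw n paths; path i comes from the chain labels[i], drawn with the given weights."""
    labels = random.choices(range(len(Ps)), weights=weights, k=n)
    return labels, [sample_path(Ps[k], length) for k in labels]

random.seed(0)
Ps = [[[0.9, 0.1], [0.2, 0.8]],   # component 1: "sticky" states
      [[0.5, 0.5], [0.5, 0.5]]]   # component 2: memoryless switching
labels, paths = sample_mixture(Ps, [0.5, 0.5], n=5, length=10)
```

Recovering the hidden `labels` from the `paths` alone is exactly the clustering problem described above, and the identifiability question asks when the component parameters are uniquely determined by the law of the observed paths.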
7.37 Axis 3: Estimation of right-censored categorical functional data
Participants: Cristian Preda, Vincent Vandewalle, Vlad Stefan Barbu.
As mentioned in Section 7.36, we are interested in modelling categorical functional data by means of semi-Markov processes. These processes generalize Markov processes in the sense that the sojourn time in a state can be arbitrarily distributed, as opposed to the Markov case. For this reason, semi-Markov processes are flexible tools, better adapted to concrete applications than Markov processes 80. As in any modelling framework, one crucial point is to obtain reliable estimators of the parameters of the model. A very important feature in many applications (e.g. survival analysis, reliability, etc.) is taking censored data into account. In the presence of right-censored sample paths, the estimation of semi-Markov processes in continuous time is still an open problem, while for discrete-time semi-Markov processes there is only one existing work, in a nonparametric setting 87. For this framework, we have already established the main setting and derived the form of the $Q$-function for the EM algorithm. Several choices remain to be made, each opening a different research path: parametric versus nonparametric estimation of the sojourn-time distributions, types of semi-Markov processes, mixtures of sojourn-time distributions, mixtures of semi-Markov processes, etc. The next step of our work is to implement this estimation algorithm and to investigate, calibrate and adapt it. Another feature that we have not yet considered, but which could be of great importance in some applications, is data censored under other censoring schemes, such as censoring at the beginning of the sample path or interval censoring.
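A toy simulation (our own, with arbitrary illustrative values) makes the two ingredients concrete: an embedded chain drives the jumps, the sojourn times follow an arbitrary distribution rather than the geometric one implied by a Markov chain, and truncating the path at a fixed horizon mimics a right-censored observation window:

```python
import random

# Hedged, purely illustrative sketch (our values, not the team's model):
# a discrete-time semi-Markov path alternates jumps of an embedded Markov
# chain with sojourn times drawn from an arbitrary distribution.
def semi_markov_path(J, sojourn, horizon, start=0):
    """Simulate the state sequence up to `horizon` time steps.
    J: transition matrix of the embedded chain; sojourn(state) -> duration."""
    path, state = [], start
    while len(path) < horizon:
        path.extend([state] * sojourn(state))
        probs = J[state]
        state = random.choices(range(len(probs)), weights=probs)[0]
    # Truncating at the horizon mimics right-censoring of the sample path.
    return path[:horizon]

random.seed(0)
J = [[0.0, 1.0], [1.0, 0.0]]                    # embedded chain: alternate states
sojourn = lambda s: random.choice([1, 2, 3])    # arbitrary sojourn distribution
path = semi_markov_path(J, sojourn, horizon=12)
```

The estimation problem discussed above is the inverse task: recover `J` and the sojourn-time distributions from such censored paths.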
7.38 Axis 4: Statistical analysis of high-throughput proteomic data
Participants: Guillemette Marot, Vincent Vandewalle, Wilfried Heyse.
In November 2019, Wilfried Heyse started a PhD thesis funded by INSERM and supervised by Christophe Bauters, Guillemette Marot and Vincent Vandewalle. The aim is to identify, early after myocardial infarction (MI), patients at high risk of developing left ventricular remodelling (LVR), which is quantified by imaging one year after MI, or patients at high risk of death. For that purpose, a high-throughput proteomic approach is used. This technology allows the measurement of 5000 proteins simultaneously. In parallel to these measures, corresponding to the concentration of a protein in a plasma sample collected from one patient at a specific time, echocardiographic and clinical information has been collected on each of the 200 patients. One of the main challenges is to take into account the variation of the biomarkers over time (several measurement times), in order to improve the understanding of the biological mechanisms involved in LVR or patient survival. Preliminary results have been presented in 38, 79.
This is a joint work with Florence Pinet and Christophe Bauters from INSERM.
7.39 Axis 4: Reject Inference Methods in Credit Scoring
Participants: Christophe Biernacki, Adrien Ehrhardt, Philippe Heinrich, Vincent Vandewalle.
The granting process of all credit institutions rejects applicants with a low credit score. Developing a scorecard, i.e. a correspondence table between a client’s characteristics and their score, requires a learning dataset in which the target variable good/bad borrower is known. Rejected applicants are de facto excluded from this process. This biased learning population might have deep consequences for the relevance of the scorecard. Some works, mostly empirical, try to exploit rejected applicants in the scorecard-building process. This work proposes a rational criterion to evaluate the quality of a scoring model, applies it to the existing Reject Inference methods, and uncovers their implicit mathematical hypotheses. It is shown that, up to now, no Reject Inference method can guarantee a better credit scorecard. These conclusions are illustrated on simulated data and on real data from the French branch of Crédit Agricole Consumer Finance (CACF). This has led to the preprint 63, which is now in revision in an international journal.
This is a joint work with Sébastien Beben of Crédit Agricole Consumer Finance.
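The core issue, selection bias in the learning population, can be reproduced in a few lines (a toy simulation with made-up score and default mechanisms, unrelated to the CACF data): the default rate observed on accepted applicants underestimates the population default rate, because low-score applicants never enter the training set:

```python
import random

# Toy illustration only (all numbers made up): scores are uniform on [0, 1]
# and low scores default more often; the granting policy rejects scores <= 0.5.
random.seed(42)
population = []
for _ in range(10_000):
    score = random.random()                      # hypothetical credit score
    default = random.random() < (1.0 - score)    # riskier when the score is low
    population.append((score, default))

overall_rate = sum(d for _, d in population) / len(population)
accepted = [(s, d) for s, d in population if s > 0.5]  # biased learning population
accepted_rate = sum(d for _, d in accepted) / len(accepted)
# accepted_rate is markedly lower than overall_rate: a scorecard learned on
# accepted applicants only sees a biased picture of the whole population.
```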
7.40 Axis 4: Usability study
Participants: Vincent Vandewalle.
Since 2018, Vincent Vandewalle has been working with Alexandre Caron and Benoît Dervaux on estimating the number of problems and the value of information in the field of usability. Based on a usability study of a medical device, the objective is to determine the number of possible problems linked to the use of the device (e.g. an insulin pump) as well as their respective occurrence probabilities. Estimating this number and these probabilities is essential to decide whether or not an additional usability study should be conducted, and to determine the number of users to include in such a study to maximize the expected benefits.
The discovery process can be modeled by a binary matrix whose number of columns depends on the number of defects discovered by users. In this framework, they have proposed a probabilistic modeling of this matrix, embedded in a Bayesian framework where the number of problems and the discovery probabilities are considered as random variables. The article 31 has been published in this framework. It shows the interest of the approach compared to the approaches proposed in the usability state of the art. Beyond point estimation, the approach also makes it possible to obtain the distribution of the number of problems and of their respective probabilities given the discovery matrix.
The proposed model also makes it possible to measure the value of additional information with respect to the discovery process. In this framework, they are writing a second paper and developing the R package useval, to be released soon. This work has been presented at a conference 48.
This is a joint work with Alexandre Caron and Benoît Dervaux both from ULR 2694: METRICS.
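A naive frequentist sketch gives a feel for the quantities at stake (our illustration only; the published model is Bayesian and also infers the number of problems never yet observed, which this sketch cannot do): column means of the discovery matrix estimate per-problem discovery probabilities, from which the expected fraction of those problems found by n users follows under an independence assumption:

```python
# Illustrative sketch only (not the published model): D is a binary discovery
# matrix with rows = users and columns = problems discovered so far.
def discovery_probs(D):
    """Column means of D: naive per-problem discovery probabilities."""
    n_users = len(D)
    return [sum(row[j] for row in D) / n_users for j in range(len(D[0]))]

def expected_found(p, n):
    """Expected fraction of these problems found by at least one of n users,
    assuming independent discoveries."""
    return sum(1 - (1 - pj) ** n for pj in p) / len(p)

D = [[1, 0, 1],
     [1, 1, 0],
     [0, 0, 1]]                     # 3 users, 3 problems discovered so far
p = discovery_probs(D)              # [2/3, 1/3, 2/3]
print(round(expected_found(p, 5), 3))   # 0.953
```

Comparing `expected_found(p, n)` for increasing n is the intuition behind deciding whether recruiting additional users is worth the cost.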
7.41 Axis 4: Artificial intelligence for aviation
Participants: Florent Dewez, Benjamin Guedj, Arthur Talpaert, Vincent Vandewalle.
Since November 2018, Benjamin Guedj and Vincent Vandewalle have been participating in the European PERF-AI project (Enhance Aircraft Performance and Optimisation through the utilisation of Artificial Intelligence) in partnership with the company Safety Line. The project uses data collected during flights to develop Machine Learning models that optimize the aircraft's trajectory, for instance with respect to fuel consumption. In this context they have hired Florent Dewez (postdoctoral researcher) and Arthur Talpaert (engineer).
The article 21 is now published. It explains how, using flight-recording data, it is possible to fit learning models for variables that have not been directly observed, and in particular to predict the drag and lift coefficients as a function of the angle and speed of the aircraft.
A second article, about the optimization of the aircraft's trajectory based on a consumption model learned from the data, is about to be submitted and is available as the preprint 62. The originality of the approach consists in decomposing the trajectory on a functional basis and carrying out the optimization over the coefficients of this decomposition, rather than approaching the problem through optimal control. Furthermore, to guarantee compliance with aeronautical constraints, we have proposed an approach penalized by a deviation term from reference flights. A generic Python module (PyRotor) solving such optimization problems with the proposed approach has been developed.
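A drastically simplified sketch of the principle (the basis, cost and numbers below are ours, not PyRotor's): the trajectory is identified with its coefficients on a small functional basis, and a toy consumption-like cost, penalized by the squared distance to a reference flight's coefficients, is minimized over coefficient vectors rather than via optimal control:

```python
# Hedged toy example (not PyRotor's code): optimize a trajectory through its
# coefficients on a functional basis, with a penalty toward a reference flight.
def basis(t):
    return [1.0, t, t * t]          # tiny polynomial basis on [0, 1]

def trajectory(c, t):
    return sum(ci * bi for ci, bi in zip(c, basis(t)))

def cost(c, c_ref, lam=1.0):
    """Toy fuel-like cost plus a penalty for deviating from the reference."""
    grid = [i / 10 for i in range(11)]
    fuel = sum(trajectory(c, t) ** 2 for t in grid) / len(grid)
    penalty = sum((a - b) ** 2 for a, b in zip(c, c_ref))
    return fuel + lam * penalty

c_ref = [1.0, 0.5, -0.2]            # coefficients of a reference flight
candidates = [[1.0, 0.5, -0.2], [0.8, 0.4, -0.2], [0.0, 0.0, 0.0]]
best = min(candidates, key=lambda c: cost(c, c_ref))
```

The penalty `lam` trades off the learned cost against staying close to operationally validated reference flights, which is the role of the deviation term described above.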
7.42 Axis 4: Domain Adaptation from a Pretrained Source Model
Participants: Christophe Biernacki, Pascal Germain, Luxin Zhang.
The traditional statistical learning paradigm assumes that train and test data follow the same distribution. This rarely holds in real-life applications. The domain adaptation paradigm proposes a variety of techniques to overcome this issue. Most works in this area seek either a latent space where source and target data share the same distribution, or a transformation of the source distribution that matches the target one. Both strategies require learning a model on the transformed source data. We study an original scenario in which one is given a model constructed using expertise on source data that is no longer accessible. To use this model directly on target data, we propose to learn a transformation from the target domain to the source domain. To the best of our knowledge, this is a new perspective on domain adaptation. We introduce and formalize this learning problem, and study the assumptions and sufficient conditions needed to guarantee good accuracy when the source model is used directly on transformed target data. Pursuing this idea, we propose a new domain adaptation method based on optimal transport, which we experiment with on a fraud detection problem. This work has been accepted at an international conference 42.
It is a joint work with Yacine Kessaci from Worldline company.
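In one dimension and with equal sample sizes, the principle reduces to rank (quantile) matching, which gives a self-contained caricature of the approach (toy data of ours; the actual method handles multivariate data and a real fraud-detection task): target points are mapped onto the source points of the same rank, and the fixed source model is then applied to the mapped data:

```python
# Caricature of the idea (our toy data): in 1-D, optimal transport between two
# equal-size empirical samples matches sorted target points to sorted source
# points; the unchanged source model can then score the mapped target data.
def ot_map_1d(target, source):
    """Map each target point to the source point of the same rank."""
    order = sorted(range(len(target)), key=lambda i: target[i])
    src_sorted = sorted(source)
    mapped = [0.0] * len(target)
    for rank, i in enumerate(order):
        mapped[i] = src_sorted[rank]
    return mapped

source = [0.0, 1.0, 2.0, 3.0]      # domain the (fixed) source model was trained on
target = [10.0, 13.0, 11.0, 12.0]  # shifted target domain
print(ot_map_1d(target, source))   # [0.0, 3.0, 1.0, 2.0]
```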
7.43 Other: Projection Under Pairwise Control
Participants: Christophe Biernacki.
Visualizing high-dimensional and possibly complex (for instance non-continuous) data in a low-dimensional space can be difficult. Several projection methods have already been proposed for displaying such high-dimensional structures in a lower-dimensional space, but the information lost is not always easy to use. Here, a new projection paradigm is presented: a nonlinear projection method that takes into account the projection quality of each projected point in the reduced space, this quality being directly available on the same scale as the reduced space. More specifically, this novel method allows a straightforward visualization of data in R² with a simple reading of the approximation quality, and thus provides a novel variant of dimensionality reduction. This work has been accepted in an international journal 13.
It is a joint work with Hiba Alawieh and Nicolas Wicker, both from Université de Lille.
7.44 Other: On the Local and Global Properties of the Gravitational Spheres of Influence
Participants: Christophe Biernacki.
We revisit the concept of the sphere of gravitational activity, to which we give both a geometrical and a physical meaning. This study aims to refine this concept in a much broader context that could, for instance, be applied to exoplanetary problems (in a Galactic stellar disc–Star–Planets system) to define a first-order “border” of a planetary system. The methods used in this paper rely on classical celestial mechanics and develop the equations of motion in the framework of the 3-body problem (e.g. a Star–Planet–Satellite system). We start with the basic definition of a planet's sphere of activity as the region of space in which it is feasible to assume the planet as the central body and the Sun as the perturbing body when computing perturbations of the satellite's motion. We then investigate the geometrical properties and physical meaning of the ratios of Solar accelerations (central and perturbing) and planetary accelerations (central and perturbing), and the boundaries they define. We clearly distinguish throughout the paper between the sphere of activity, the Chebotarev sphere (a particular case of the sphere of activity), the Laplace sphere, and the Hill sphere; the last two are often wrongly thought to be one and the same. Furthermore, comparing the ratio of the star's accelerations (central/perturbing) to that of the planetary accelerations (central/perturbing) as a function of the planetocentric distance, we have identified different dynamical regimes, which are presented in a semi-analytical analysis. This work has been published in an international journal 30.
This is a joint work with Damya Souami from Observatoire de Paris and Jacky Cresson from Université de Pau et des Pays de l'Adour.
8 Bilateral contracts and grants with industry
8.1 Bilateral contracts with industry
COLAS company
Participants: Christophe Biernacki.
COLAS is a world leader in the construction and maintenance of transport infrastructure. This bilateral contract aims at classifying mixed data obtained with sensors in a study of the aging of road surfacing. The challenge is to deal with many missing values (sensor failures) and correlated data (sensor proximity).
PAYBACK company
Participants: Christophe Biernacki.
PAYBACK Group is an audit firm specializing in the analysis and reliability of transactions. This bilateral contract aims at predicting store sales both from past sales (time series) and by exploiting external covariates (of different types).
ADULM
Participants: Sophie DaboNiang, Cristian Preda.
The main goal of this project with the Lille Metropole Urban Development and Planning Agency (ADULM) is to design a tool for the Territorial Coherence Scheme (SCoT) to monitor urban developments and develop territorial observation.
8.2 Bilateral grants with industry
Worldline
Participants: Christophe Biernacki.
Worldline is the new world-class leader in the payments and transactional services industry, with a global reach. A PhD began in Feb. 2019 with Luxin Zhang under the supervision of Christophe Biernacki, Pascal Germain (Laval University, Canada) and Yacine Kessaci (Worldline) on the topic of domain adaptation from a pretrained source model (with application to fraud detection in electronic payments).
ADEO
Participants: Christophe Biernacki, Vincent Vandewalle.
Adeo is No. 1 in Europe and No. 3 worldwide in the DIY market. A PhD began in Dec. 2020 with Axel Potier under the supervision of Christophe Biernacki, Vincent Vandewalle, Matthieu Marbac (ENSAI) and Julien Favre (ADEO) on the topic of sales forecasting for “slow movers”, i.e. items sold in low quantities.
EIT Sysbooster: Nokia - Apsys/Airbus
Participants: Alain Celisse.
Nokia and Airbus are two world-renowned companies working in the communications and transport areas, respectively. The purpose of this contract is to perform root cause analysis in order to reduce, ultimately, the number of failures.
9 Partnerships and cooperations
9.1 International initiatives
9.1.1 Inria International Labs
6PAC
Participants: Benjamin Guedj.
 Title: Making Probably Approximately Correct Learning Active, Sequential, Structure-aware, Efficient, Ideal and Safe
 Duration: 2018–2022
 Partners: Machine Learning Group, CWI (The Netherlands)
 Summary: This project is rooted in statistical learning theory, which can be viewed as the theoretical foundation of machine learning. The most common framework is a setup in which one is given n training examples, and the goal is to build a predictor that is efficient on new (similar) data. This efficiency should be supported by PAC (Probably Approximately Correct) guarantees, e.g. upper bounds on the excess risk of a predictor that hold with high probability. Such guarantees, however, often hold under stringent assumptions that are typically never met in real-life applications, e.g. independent, identically distributed data. More realistic modelling of data has triggered research efforts in several directions: first, accommodating more general data (e.g. dependent, heavy-tailed); second, sequential learning, in which the predictor is built on the fly while new data is gathered. We believe that an even more realistic paradigm is active learning, a setup in which the learner actively requests data (possibly facing constraints, such as storage, velocity, cost, etc.) and adapts its queries to optimize its performance. The 3-year objective of 6PAC (where 6 stands for Sequential, Active, Efficient, Structured, Ideal, Safe — the six research directions we intend to contribute to) is to pave the way to new PAC generalization and sample-complexity upper and lower bounds, beyond batch learning. Our ambition is to contribute to several learning setups, ranging from sequential learning (where data streams are collected) to adaptive and active learning (where data streams are requested by the learning algorithm).
9.1.2 Inria international partners
Benjamin Guedj leads The Inria London Programme, an initiative from Inria to increase the volume of scientific collaborations with the UK and in particular with the London region, with the prime partnership with University College London (United Kingdom).
More details at https://
9.2 International research visitors
9.2.1 Visits of international scientists
 Apoorv Vikram Singh (IISc Bangalore, India) visited Hemant Tyagi from Oct 2019 to Jan 2020 to work on a project related to clustering of signed networks. This was partially funded by the Turing Institute, London. Apoorv worked under the joint supervision of Hemant Tyagi and Mihai Cucuringu (University of Oxford, United Kingdom) during this period.
 Déborah Sulem (PhD student, University of Oxford, United Kingdom) visited Hemant Tyagi on January 13–15, 2020.
9.3 European initiatives
9.3.1 FP7 & H2020 Projects
H2020 FAIR
Participants: Guillemette Marot.
 Acronym: FAIR
 Project title: Flagellin aerosol therapy as an immunomodulatory adjunct to the antibiotic treatment of drug-resistant bacterial pneumonia
 Coordinator: JC. Sirard (Inserm, CIIL)
 Duration: 4 years (2020–2023)
 Partners: Inserm, Université de Lille, Free University of Berlin (Germany), Epithelix (Switzerland), Aerogen (Ireland), Statens Serum Institute (Denmark), CHRU Tours, Academic Medical Center of the University of Amsterdam (The Netherlands), University of Southampton (United Kingdom), European Respiratory Society (Switzerland)
 Abstract: The FAIR project aims at evaluating an alternative adjunct strategy to standard-of-care antibiotics for treating pneumonia caused by antibiotic-resistant bacteria: activation of the innate immune system in the airways. Guillemette Marot is involved in this H2020 project as scientific head of the bilille platform, and will supervise a 1-year engineer position on the integration of omics data.
H2020 PERF-AI
Participants: Florent Dewez, Benjamin Guedj, Arthur Talpaert, Vincent Vandewalle.
 Acronym: PERF-AI
 Project title: Enhance Aircraft Performance and Optimisation through utilisation of Artificial Intelligence
 Coordinator: Pierre Jouniaux (SafetyLine)
 Duration: 2 years (2018–2020)
 Partners: SafetyLine

 Abstract: PERF-AI will apply Machine Learning techniques to flight data (parametric & non-parametric approaches) to accurately measure actual aircraft performance throughout its lifecycle.
Within current airline operations, both at the flight preparation (on-ground) and flight management (in-air) levels, the trajectory is first planned, then managed by the Flight Management System (FMS) using a single manufacturer's performance model that is the same for every aircraft of a given type, & a weather forecast computed long before the flight. This induces a lack of accuracy during the planning phase, with a flight route pre-established at specific altitudes & speeds to optimize fuel burn, from takeoff to landing, using aircraft performances that are not those of the real aircraft. Also, the actual flight will usually shift from the original plan because of Air Traffic Control (ATC) constraints, adverse weather, wind changes & tactical rerouting, without any possibility for the flight crew, either through the FMS or through connected services, to tactically recompute the trajectory in order to continuously optimize the flight path. This is due in particular to the limitations of the performance databases that current systems use.
Hence, PERF-AI focuses on identifying adequate machine learning algorithms, testing their accuracy & capability to perform statistical analysis of flight data, & developing mathematical models to optimize real flight trajectories with respect to the actual aircraft performance, thus minimizing fuel consumption throughout the flight.
The consortium consists of SafetyLine & Inria, with full expertise in Aircraft Performance & Data Science, hence able to propose, test & validate different statistical models that will accurately solve these optimization challenges & implement them in an operational environment.
The PERF-AI total grant request to the CSJU is 568 550 €, with a total project duration of 24 months.
9.4 National initiatives
COVIDOM project
During the first lockdown in France, Christophe Biernacki supervised a task force composed of three Inria research teams (MODAL, STATIFY, TAU) analysing data from COVIDOM, the AP-HP medical database of suspected COVID-19 patients. This project was part of the overall national Inria “mission COVID” initiative.
Programme of Investments for the Future (PIA)
Bilille is a member of the PIA “Infrastructures en biologie-santé”
IFB, French Institute of Bioinformatics (https://
RHU PreciNASH
Participants: Guillemette Marot.
 Acronym: PreciNASH
 Project title: Non-alcoholic steatohepatitis (NASH) from disease stratification to novel therapeutic approaches
 Coordinator: François Pattou (Université de Lille, Inserm, CHRU Lille)
 Duration: 5 years
 Partners: FHU Integra and Sanofi
 Abstract: PreciNASH, a project coordinated by Pr. F. Pattou (UMR 859, EGID), aims at better understanding non-alcoholic steatohepatitis (NASH) and improving its diagnosis and care. In this RHU, Guillemette Marot supervises a 2-year postdoc, as her team ULR 2694 METRICS is a member of the FHU Integra. METRICS is involved in WP1 for the development of a clinical-biological model for the prediction of NASH. Other partners of the FHU are UMR 859, UMR 1011 and UMR 8199, the last three teams being part of the labex EGID (European Genomic Institute for Diabetes). Sanofi is the main industrial partner of the RHU PreciNASH. The whole project will last 5 years (2016–2021).
CNRS PEPS Blanc — BayesRealForRNN project
Participants: Pascal Germain, Vera Shalaeva.
 Acronym: BayesRealForRNN
 Project title: PAC-Bayesian theory for recurrent neural networks: a control-theoretic approach
 Coordinator: Mihaly Petreczky (CNRS, UMR 9189 CRIStAL, Université de Lille)
 Year: 2019
 Abstract: The project proposes to analyze the mathematical correctness of deep learning algorithms by combining techniques from control theory and PAC-Bayesian statistical theory. More precisely, the project concentrates on recurrent neural networks (RNNs), develops their structure theory using techniques from control theory, and then applies this structure theory to derive PAC-Bayesian error bounds for RNNs.
CNRS AMIES PEPS 2 — DiagChange project
Participants: Cristian Preda, Quentin Grimonprez.
 Acronym: DiagChange
 Year: 2019
 Abstract: The project studies the detection of changes in distribution for multivariate signals in an industrial context. The project is in collaboration with the DiagRAMS startup.
CNRS AMIES PEPS 1 — PIVISCoT
Participants: Sophie DaboNiang, Cristian Preda.
 Year: 2020
 Abstract: The project aims to create a software for Territorial Coherence Scheme (SCoT) in Lille in order to monitor urban developments and develop territorial observation.
AMIES PEPS 2 — MadiPa
Participants: Stéphane Girard, Serge Iovleff.
 Acronym: MadiPa
 Project title: Modèles Autoassociatifs pour la Dispersion de Polluants dans l’Atmosphère
 Duration: 18 months (started in December 2019)

 Partners: Société Phimeca (http://phimeca.com/), Mistis team (Inria Grenoble Rhône-Alpes)
 Abstract: Our goal is to develop a method for predicting the dispersion of pollutants in the atmosphere from an initial emission map and meteorological data. A map of the probabilities of exceeding a critical pollutant threshold will be estimated through the construction of a metamodel: the large dimension of the problem is reduced by the use of auto-associative models, a nonlinear extension of Principal Component Analysis.
9.4.1 ANR
APRIORI
Participants: Benjamin Guedj, Pascal Germain, Hemant Tyagi, Vera Shalaeva.
 Type: ANR PRC
 Acronym: APRIORI
 Project title: PAC-Bayesian theory and algorithms for deep learning and representation learning
 Coordinator: Emilie Morvant (Université Jean Monnet)
 Duration: 2019–2023
 Funding: 300k EUR
 Partners: MODAL, Laboratoire Hubert Curien (UMR CNRS 5516)
BEAGLE
Participants: Benjamin Guedj, Pascal Germain.
 Type: ANR JCJC
 Acronym: BEAGLE
 Duration: 2019–2023
 Project title: PAC-Bayesian theory and algorithms for agnostic learning
 Funding: 180k EUR
 Partners: Pierre Alquier (RIKEN AIP, Japan), Peter Grünwald (CWI, The Netherlands), Rémi Bardenet (UMR CRIStAL 9189)
SMILE
Participants: Christophe Biernacki, Vincent Vandewalle.
 Acronym: SMILE
 Duration: 2018–2022
 Project title: Statistical Modeling and Inference for unsupervised Learning at largE-scale
 Coordinator: Faicel Chamroukhi (LMNO, Université de Caen)
 Partners: MODAL, LMNO UMR CNRS 6139 (Caen), LMRS UMR CNRS 6085 (Rouen), LIS UMR CNRS 7020 (Toulon)
TheraSCUD2022
Participants: Guillemette Marot.
 Acronym: TheraSCUD2022
 Project title: Targeting the IL20/IL22 balance to restore pulmonary, intestinal and metabolic homeostasis after cigarette smoking and unhealthy diet
 Coordinator: P. Gosset (Institut Pasteur de Lille)
 Duration: 3 years (2017–2020)
 Partners: CIIL Institut Pasteur de Lille and UMR 1019 INRA ClermontFerrand
 Abstract: The TheraSCUD2022 project studies inflammatory disorders associated with cigarette smoking and unhealthy diet (SCUD). Guillemette Marot is involved in this ANR project as head of bilille platform, and will supervise 1 year engineer on integration of omic data.
9.4.2 Working groups
 Sophie Dabo-Niang belongs to the following working groups:
 STAFAV (STatistiques pour l'Afrique Francophone et Applications au Vivant)
 ERCIM Working Group on computational and Methodological Statistics, Nonparametric Statistics Team
 Franco-African IRN (International Research Network) in Mathematics, funded by CNRS
 ONCOLille (Cancer Research Institute in Lille)
 Benjamin Guedj belongs to the following working groups (GdR) of CNRS:
 ISIS (local referee for Inria Lille - Nord Europe)
 MaDICS
 MASCOT-NUM (local referee for Inria Lille - Nord Europe)
 Guillemette Marot belongs to the StatOmique working group
9.5 Regional initiatives
9.5.1 bilille, the bioinformatics platform of Lille
Participants: Guillemette Marot, Maxime Brunin, Iheb Eladib.
bilille, the bioinformatics platform of Lille, officially joined the UMS 2014/US 41 PLBS (Plateformes Lilloises en Biologie Santé) in January 2020. In 2020, Guillemette Marot co-headed the platform with Hélène Touzet (CNRS, CRIStAL). Inria employed two engineers for this platform:
 M. Brunin, who participated in the development of visCorVar, a tool facilitating multi-block analysis for the statistical integration of omics data, and contributed to the analyses of the TheraSCUD2022 ANR project;
 I. Eladib, who participated in the development of tools for the bilille cloud, in order to simplify and optimize its use.
More information about the platform is available at
https://
Collaborations of the year linked to bilille
Participants: Guillemette Marot.
Guillemette Marot supervised the data-analysis part, or provided support in biostatistics tool testing, for the following research projects involving engineers from bilille (only the names of the principal investigators are given, even when several partners are involved): CIIL, L. Poulin, InflammReg
 Infinite, V. Sobanski, Evapass
 U1011, Y. Sebti, Circaregen
 U1011, D. Dombrowicz, DeconImmunMetab
10 Dissemination
10.1 Promoting scientific activities
10.1.1 Scientific events: organisation
General chair, scientific chair
 Benjamin Guedj has been appointed (March 2020) general local chair of COLT 2022 to be held in London
 Hemant Tyagi is the organizer of the MODAL team scientific seminar
 Sophie Dabo-Niang is co-chair of the Statistics, applied math and computer science group of the Pan-African Scientific Research Council, funded by Princeton University (USA)
Member of the organizing committees
 Sophie Dabo-Niang is co-chair of the Organizing Committee of the 3rd Conference on Econometrics for Environment, December 2020, Lille.
10.1.2 Scientific events: selection
Christophe Biernacki was president of the scientific committee of JdS 2020, the annual meeting of the French statistical society (SFdS).
Reviewer
 Sophie Dabo-Niang has reviewed several papers for several journals during 2020, including Spatial Statistics, JSPI, Metrika, JRSS C
 Benjamin Guedj has served as reviewer for most toptier machine learning conferences, including AISTATS, ALT, COLT, ICML, NeurIPS
 Hemant Tyagi has reviewed for the following conferences during 2020: International Conference on Learning Representations (ICLR), International Conference on Machine Learning (ICML) and Symposium on Computational Geometry (SoCG)
 Christophe Biernacki has reviewed for the Cap2020 (Conférence sur l'Apprentissage Automatique) and also for several journals (IMAIAI, STCO, LSSP, SAM, GSCS, TNNLS, ESWA, JMIV, JCGS)
10.1.3 Journal
Member of the editorial boards
 Sophie Dabo-Niang is a member of the editorial boards of Revista Colombiana de Estadística and the Journal of Statistical Modeling and Analytics
 Benjamin Guedj is a member of the Editorial Board of reviewers for the Journal of Machine Learning Research (JMLR), since June 2020 and an Associate Editor and member of the Editorial Board for the journal Information and Inference (Oxford), since March 2020
 Christophe Biernacki is an Associate Editor of the North-Western European Journal of Mathematics (NWEJM) and a Guest Editor for the Special Issue on Innovations in Model-Based Clustering and Classification of the journal Advances in Data Analysis and Classification (ADAC)
 Cristian Preda is an Associate Editor for Methodology and Computing in Applied Probability (https://www.springer.com/journal/11009) and for the Romanian Journal of Mathematics and Computer Science (http://www.rjmcs.ro)
Reviewing activities
 Hemant Tyagi has reviewed for the following journals during 2020: Journal of the Royal Statistical Society (JRSS), IEEE Open Journal of Signal Processing, and Mathematical Reviews.
 Vincent Vandewalle has reviewed for the following journals during 2020: JCGS, Spatial Statistics, Methodology & Computing in Applied Probability.
10.1.4 Invited talks
Benjamin Guedj has given a number of scientific talks in seminars, including at
 Oxford University (United Kingdom)
 UCL (United Kingdom)
 The Alan Turing Institute (United Kingdom)
 RIKEN (Japan)
Sophie Dabo-Niang has been invited to:
 Next Einstein Forum (NEF) 2020, December 8–10, 2020. Panel on The contribution of Mathematical Sciences in supporting robust disease prevention and modelling in Africa.
 AIMS South Africa webinar, November 4, 2020. Statistical Modeling of Spatial Big Data and Applications.
Hemant Tyagi:
 Café des Sciences, Inria Lille, January 2020.
 STADIUS seminar, KU Leuven, February 2020.
 Séminaire SAMM : Statistique, Analyse et Modélisation Multidisciplinaire, Université Paris 1, November 2020.
10.1.5 Leadership within the scientific community
Sophie Dabo-Niang is:
 Chair of the Committee for Developing Countries (CDC) of the European Mathematical Society (EMS), 2019–2022
 Member of the executive committee and scientific officer of CIMPA
Guillemette Marot is the scientific head of bilille, the bioinformatics platform of Lille.
10.1.6 Scientific expertise
Sophie Dabo-Niang has served as an expert for:
 L'Oreal Women in Science Awards
 HCERES
10.2 Teaching  Supervision  Juries
10.2.1 Teaching
 Pascal Germain taught
 Master: Introduction aux réseaux de neurones, 15 heures, M2, Université de Lille, France
 Hemant Tyagi is teaching
 Master: Statistics I, 24h, M1, Centrale Lille, France (Nov. 2020 – 7 Jan. 2021)
 Master: Statistics II, 24h, M1, Centrale Lille, France (11 Jan. 2021 – 18 March 2021)
 Sophie Dabo-Niang is teaching
 Master: Spatial Statistics, 24h, M2, Université de Lille, France
 Master: Advanced Statistics, 24h, M2, Université de Lille, France
 Master: Multivariate Data Analyses, 24h, M2, Université de Lille, France
 Licence: Probability, 24h, L2, Université de Lille, France
 Licence: Multivariate Statistics, 24h, L3, Université de Lille, France
 Guillemette Marot is teaching
 Licence: Biostatistics, 15h, L1, Université de Lille (Faculty of Medicine), France
 Master: Biostatistics, 62h, M1, Université de Lille (Faculty of Medicine), France
 Master: Supervised classification, 34h, M1, Polytech'Lille, France
 Master: Biostatistics, 20h, M1, Université de Lille (Departments of Computer Science and Biology), France
 Master: Statistical analysis of omics data, 22h, M2, Université de Lille (Department of Mathematics), France
 Doctorat: Artificial intelligence and health, 7h, Université de Lille (Faculty of Medicine), France
 Cristian Preda is teaching
 Polytech'Lille engineering school: Linear Models, 48h, France
 Polytech'Lille engineering school: Advanced Statistics, 48h, France
 Polytech'Lille engineering school: Biostatistics, 10h, France
 Polytech'Lille engineering school: Supervised clustering, 24h, France
 Christophe Biernacki is teaching
 New Master Data Science: Statistics, 24h, M1, Université de Lille, France
 Benjamin Guedj is teaching
 Advanced machine learning (M2, 6h), University College London, United Kingdom
 Serge Iovleff is teaching
 Licence: Analyse et méthodes numériques, 56h, Université de Lille, DUT Informatique
 Licence: R.O. et aide à la décision, 32h, Université de Lille, DUT Informatique
 Vincent Vandewalle is teaching
 Licence: Probability, 60h, Université de Lille, DUT STID
 Licence: Case study in statistics, 45h, Université de Lille, DUT STID
 Licence: R programming, 45h, Université de Lille, DUT STID
 Licence: Supervised clustering, 32h, Université de Lille, DUT STID
 Licence: Analysis, 24h, Université de Lille, DUT STID
10.2.2 Supervision
PhD defense:
 Arthur Leroy, December 9th 2020, on “Apprentissage de données fonctionnelles par modèles multi-tâches : application à la prédiction de performances sportives”
 Yaroslav Averyanov, December 15th 2020, supervised by Alain Celisse and Cristian Preda on “Designing and analyzing new early stopping rules for saving computational resources”
 Margot Selosse, November 13th 2020, supervised by Christophe Biernacki and Julien Jacques on “Introducing parsimony to analyse complex data with modelbased clustering”
PhD in progress:
 Axel Potier, Sales prediction for low-turnover products, November 2020, Christophe Biernacki, Matthieu Marbac, Vincent Vandewalle
 Felix Biggs, Generative models and kernels, University College London (United Kingdom), Sep 2019, Benjamin Guedj
 Antoine Vendeville, Learning on graphs to stop the propagation of fake news, University College London (United Kingdom), Sep 2019, Benjamin Guedj
 Luxin Zhang, Domain adaptation from a pretrained source model – Application to fraud detection in electronic payments, February 2019, Christophe Biernacki, Pascal Germain, Yacine Kessac
 Paul Viallard, Interpreting representation learning through PAC-Bayes theory, September 2019, Amaury Habrard, Emilie Morvant, Pascal Germain
 Dang Khoi Pham, Planning and replanning of nurses in an oncology department using a multiobjective and interdisciplinary approach, September 2016, Sophie DaboNiang
 Solange Doumun, Performance evaluation and contribution to the development of multispectral image analysis strategies for automatic and rapid diagnosis of malaria, December 2018, Sophie DaboNiang
 Alaa Ali Ayad, Statistical modeling of large spatial data and its applications in health, September 2018, Sophie DaboNiang
 Wilfried Heyse, Prise en compte de la structure temporelle dans l'analyse statistique de données protéomiques à haut débit, October 2019, Christophe Bauters, Guillemette Marot and Vincent Vandewalle
 Margot Selosse, October 2017, Christophe Biernacki and Julien Jacques
 Filippo Antonazzo, October 2019, Christophe Biernacki and Christine Keribin
 Eglantine Karle, November 2020, Hemant Tyagi and Cristian Preda
 Guillaume Braun, January 2020, Christophe Biernacki and Hemant Tyagi
 Rajeev Bopche, September 2020, Christophe Biernacki and Martine Vaxillaire
 Antonin Schrab, September 2020, co-supervised by Arthur Gretton and Benjamin Guedj, University College London (United Kingdom)
 Reuben Adams, September 2020, co-supervised by John Shawe-Taylor and Benjamin Guedj, University College London (United Kingdom)
10.2.3 Juries
 Sophie Dabo-Niang acted as a reviewer and an examiner for PhD theses
 Benjamin Guedj has been the discussion leader for the licentiate thesis of Fredrik Hellström on December 16th, 2020, at Chalmers University (Sweden)
 Benjamin Guedj has been a member of 2 hiring panels for Inria permanent researchers
 Guillemette Marot acted as an examiner for the PhD thesis of Audrey Hulot, Nov 2020 (Université Paris-Saclay) and served on a research engineer (IR) hiring jury, Oct 2020 (Université de Lille)
 Christophe Biernacki acted as a reviewer for four PhD theses and as an examiner for two HDR defenses
 Vincent Vandewalle served on an Associate Professor (Maître de Conférences) hiring committee at Université d'Avignon, May 2020
 Cristian Preda acted as a referee for the HDR defense of Christophe Crambes, Université de Montpellier 2, June 30, 2020
 Cristian Preda acted as a referee for the HDR defense of Dan Lascu, November 19, 2020, Universitatea Ovidius, Constanța (Romania)
11 Scientific production
11.1 Major publications
 1 article: Simpler PAC-Bayesian Bounds for Hostile Data. Machine Learning, 2018
 2 article: An R Package and C++ library for Latent block models: Theory, usage and applications. Journal of Statistical Software, 2016
 3 article: Unifying Data Units and Models in (Co-)Clustering. Advances in Data Analysis and Classification, 12, 41, May 2018
 4 article: Optimal cross-validation in density estimation with the L2-loss. The Annals of Statistics, 42(5), 2014, 1879–1910
 5 article: Nonparametric prediction in the multivariate spatial context. Journal of Nonparametric Statistics, 28(2), 2016, 428–458
 6 article: The logic of transcriptional regulator recruitment architecture at cis-regulatory modules controlling liver functions. Genome Research, 27(6), June 2017, 985–996
 7 inproceedings: Dichotomize and Generalize: PAC-Bayesian Binary Activated Deep Neural Networks. NeurIPS 2019, Vancouver, Canada, December 2019
 8 article: Model-based clustering of Gaussian copulas for mixed data. Communications in Statistics – Theory and Methods, December 2016
 9 article: Parametrizations, fixed and random effects. Journal of Multivariate Analysis, 154, February 2017, 162–176
 10 article: Learning general sparse additive models from point queries in high dimensions. Constructive Approximation, January 2019
11.2 Publications of the year
International journals
 11 article: Semiparametric estimation with spatially correlated recurrent events. Scandinavian Journal of Statistics, June 2020
 12 article: Partially Linear Spatial Probit Models. Annales de l'ISUP, December 2020
 13 article: Projection under pairwise distance controls. Communications in Statistics – Theory and Methods, 2020
 14 article: Change point detection of flood events using a functional data framework. Advances in Water Resources, 137, March 2020, 103522
 15 article: Gaussian-Based Visualization of Gaussian and Non-Gaussian-Based Clustering. Journal of Classification, July 2020
 16 article: A kernel discriminant analysis for spatially dependent data. Distributed and Parallel Databases, August 2020
 17 article: Multi-kernel unmixing and super-resolution using the Modified Matrix Pencil method. Journal of Fourier Analysis and Applications, 26(18), January 2020
 18 article: Provably robust estimation of modulo 1 samples of a smooth function with applications to phase unwrapping. Journal of Machine Learning Research, 21(32), January 2020, 1–77
 19 article: Ranking and synchronization from pairwise measurements via SVD. Journal of Machine Learning Research, 22(19), February 2021, 1–63
 20 article: Kernel regression estimation with errors-in-variables for random fields. Afrika Matematika, 31, 2020, 29–56
 21 article: From industry-wide parameters to aircraft-centric on-flight inference: improving aeronautics performance prediction with machine learning. Data-Centric Engineering, October 2020
 22 article: Detection and segmentation of erythrocytes in multispectral label-free blood smear images for automatic cell counting. Journal of Spectral Imaging, 9, Article ID a10, September 2020
 23 article: A novel laser-based method to measure the adsorption energy on carbonaceous surfaces. Carbon, 173, March 2021, 540–556
 24 article: PAC-Bayes and Domain Adaptation. Neurocomputing, 379, 2020, 379–397
 25 article: Kernel-Based Ensemble Learning in Python. Information, 11(2), February 2020, 63
 26 article: One-Dimensional Discrete Scan Statistics for Dependent Models and Some Related Problems. Mathematics, 8(4), April 2020, 576
 27 article: Model-based co-clustering for mixed type data. Computational Statistics and Data Analysis, 144, 2020, 106866
 28 article: Textual data summarization using the Self-Organized Co-Clustering model. Pattern Recognition, February 2020
 29 article: ordinalClust: An R Package to Analyze Ordinal Data. The R Journal, 12(2), January 2021
 30 article: On the local and global properties of the gravitational spheres of influence. Monthly Notices of the Royal Astronomical Society, 496(4), June 2020, 4287–429
 31 article: Estimating the number of usability problems affecting medical devices: modelling the discovery matrix. BMC Medical Research Methodology, 20(1), September 2020
 32 article: Multi-Partitions Subspace Clustering. Mathematics, 8(4), April 2020, 597
International peerreviewed conferences
 33 inproceedings: A binned technique for scalable model-based clustering on huge datasets. MBC2 – Models and Learning for Clustering and Classification (journal ADAC – Advances in Data Analysis and Classification), Catania, Italy, September 2020
 34 inproceedings: Clustering on multilayer graphs with missing values. Journées de Statistique de la SFdS, Nice, France, May 2020
 35 inproceedings: Revisiting clustering as matrix factorisation on the Stiefel manifold. LOD 2020 – the Sixth International Conference on Machine Learning, Optimisation and Data Science, Siena, Italy, July 2020

 36 inproceedings: Online k-means Clustering. AISTATS 2021 – The 24th International Conference on Artificial Intelligence and Statistics, Virtual, France, 2021
 37 inproceedings: Nonlinear aggregation of filters to improve image denoising. Computing Conference 2020, London, United Kingdom, July 2020
 38 inproceedings: Proteomic signature for early diagnosis of left ventricular remodeling after myocardial infarction. Printemps de la cardiologie 2020, Grenoble, France, October 2020
 39 inproceedings: PAC-Bayesian Bound for the Conditional Value at Risk. NeurIPS 2020, Vancouver / Virtual, Canada, December 2020
 40 inproceedings: PAC-Bayesian Contrastive Unsupervised Representation Learning. UAI 2020 – Conference on Uncertainty in Artificial Intelligence, Toronto, Canada, August 2020
 41 inproceedings: Improved PAC-Bayesian Bounds for Linear Regression. AAAI 2020 – Thirty-Fourth AAAI Conference on Artificial Intelligence, New York, United States, February 2020
 42 inproceedings: Target to Source Coordinate-wise Adaptation of Pre-trained Models. ECML PKDD 2020 – The European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, Ghent / Virtual, Belgium, September 2020
National peerreviewed Conferences
 43 inproceedings: Estimation of univariate Gaussian mixtures for huge raw datasets by using binned datasets. JDS 2020 – 52èmes Journées de Statistique de la Société Française de Statistique, Nice, France, May 2020
Conferences without proceedings
 44 inproceedings: Scan statistics for some dependent models. Applications. STATMOD2020 – Statistical Modeling with Applications, Bucharest, Romania, February 2021
 45 inproceedings: The contribution of Mathematical Sciences in supporting robust disease prevention and modelling in Africa. Virtual meeting, South Africa, December 2020
 46 inproceedings: A bumpy journey: exploring deep Gaussian mixture models. I Can't Believe It's Not Better @ NeurIPS 2020, Vancouver, Canada, December 2020
 47 inproceedings: Coclustering contraint pour le résumé de matrices document-terme. JdS 2020 – 52èmes Journées de Statistique de la Société Française de Statistique, Nice, France, May 2020
 48 inproceedings: Estimation du nombre de problèmes et détermination du nombre de sujets nécessaires dans les études d'utilisabilité : une approche bayésienne. Journées Biostatistiques 2020 – GDR « Statistiques & santé », Paris, France, October 2020
Scientific book chapters
 49 inbook: Clustering spatial functional data. In: Geostatistical Functional Data Analysis: Theory and Methods, editors Jorge Mateu and Ramon Giraldo, John Wiley and Sons, Chichester, ISBN: 9781119387848, January 2021
Edition (books, proceedings, special issue of a journal)
 50 book: Mathematical Modeling and Study of Random or Deterministic Phenomena. Wiley, February 2020
Doctoral dissertations and habilitation theses
 51 thesis: Designing and analyzing new early stopping rules for saving computational resources. Université de Lille; Inria, December 2020
 52 thesis: Contribution to model-based clustering of heterogeneous data. Université de Lille, January 2021
Reports & preprints
 53 misc: Differentiable PAC-Bayes Objectives with Partially Aggregated Neural Networks, June 2020
 54 report: Indicateurs de suivi de l'activité scientifique de l'Inria, Inria, December 2020
 55 misc: A PAC-Bayesian Perspective on Structured Prediction with Implicit Loss Embeddings, December 2020
 56 misc: Analyzing the discrepancy principle for kernelized spectral filter learning algorithms, April 2020
 57 misc: Regularized spectral methods for clustering signed networks, January 2021
 58 misc: An extension of the angular synchronization problem to the heterogeneous setting, January 2021
 59 misc: A Novel Unstained Blood Smears Multispectral Images Normalization. Application to Unstained Malaria Infected Blood Smear, February 2021
 60 misc: Functional spatial principal component analysis and application to demography, February 2021
 61 misc: Clustering DNA sequences for phylogenetic trees using a functional data framework, February 2021
 62 misc: An end-to-end data-driven optimisation framework for constrained trajectories, November 2020
 63 misc: Reject Inference Methods in Credit Scoring: A rational review, December 2020
 64 misc: Denoising modulo samples: k-NN regression and tightness of SDP relaxation, January 2021
 65 misc: Parameter-Wise Co-Clustering for High-Dimensional Data, September 2020
 66 misc: PAC-Bayes unleashed: generalisation bounds with unbounded losses, June 2020
 67 misc: Upper and Lower Bounds on the Performance of Kernel PCA, December 2020
 68 misc: Block clustering of Binary Data with Gaussian Covariables, October 2020
 69 misc: Cluster-Specific Predictions with Multi-Task Gaussian Processes, November 2020
 70 misc: MAGMA: Inference and Prediction with Multi-Task Gaussian Processes, July 2020
 71 misc: Simultaneous semiparametric estimation of clustering and regression, December 2020
 72 misc: cfda: an R Package for Categorical Functional Data Analysis, October 2020
 73 misc: Estimation of extreme tail index for β-mixing random fields, February 2021
 74 misc: An asymptotic approximation for the extended Bass diffusion model and application to pandemic outbreaks, February 2021
 75 misc: Error analysis for denoising smooth modulo signals on a graph, January 2021
 76 misc: Forecasting elections results via the voter model with stubborn nodes, September 2020
 77 misc: How opinions crystallise: an analysis of polarisation in the voter model, June 2020
Other scientific publications
 78 misc: Label switching in mixtures, Glasgow, United Kingdom, July 2021
 79 misc: Proteomic signature for early diagnosis of left ventricular remodeling after myocardial infarction, Grenoble / Virtual, France, October 2020
11.3 Cited publications
 80 incollection: Reliability theory for discrete-time semi-Markov systems. In: Semi-Markov Chains and Hidden Semi-Markov Models toward Applications, Springer, 2008, 1–30
 81 article: Estimation in the Mixture of Markov Chains Moving With Different Speeds. Journal of the American Statistical Association, 100(471), 2005, 1046–1053
 82 inproceedings: On mixtures of Markov chains. Proceedings of the 30th International Conference on Neural Information Processing Systems, Citeseer, 2016, 3449–3457
 83 inproceedings: Retrospective Multiple Change-Point Estimation with Kernels. 2007 IEEE/SP 14th Workshop on Statistical Signal Processing, 2007, 768–772
 84 article: Generalization in Deep Learning. CoRR, abs/1710.05468, 2017. URL: http://arxiv.org/abs/1710.05468
 85 article: Detecting multiple change-points in the mean of Gaussian process by model selection. Signal Processing, 85(4), 2005, 717–736. URL: https://www.sciencedirect.com/science/article/pii/S0165168404003196
 86 inproceedings: A Theoretical Analysis of Contrastive Unsupervised Representation Learning. Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9–15 June 2019, Long Beach, California, USA, 2019, 5628–5637. URL: http://proceedings.mlr.press/v97/saunshi19a.html
 87 article: Exact MLE and asymptotic properties for nonparametric semi-Markov models. Journal of Nonparametric Statistics, 23(3), 2011, 719–739