Keywords
Computer Science and Digital Science
- A3.1.4. Uncertain data
- A3.2.3. Inference
- A3.3.2. Data mining
- A3.3.3. Big data analysis
- A3.4.1. Supervised learning
- A3.4.2. Unsupervised learning
- A3.4.5. Bayesian methods
- A3.4.7. Kernel methods
- A5.2. Data visualization
- A5.9.2. Estimation, modeling
- A6.2.3. Probabilistic methods
- A6.2.4. Statistical methods
- A6.3.3. Data processing
- A9.2. Machine learning
Other Research Topics and Application Domains
- B2.2.3. Cancer
- B9.5.6. Data science
- B9.6.3. Economy, Finance
- B9.6.5. Sociology
1 Team members, visitors, external collaborators
Research Scientists
- Christophe Biernacki [Team leader, Inria, Senior Researcher, team leader until Nov 2020, HDR]
- Benjamin Guedj [Inria, Researcher]
- Hemant Tyagi [Inria, Researcher]
Faculty Members
- Cristian Preda [Team leader, Université de Lille, Professor, team leader from Dec 2020, HDR]
- Vlad Barbu [Université de Rouen, Associate Professor, until Feb 2020, HDR]
- Alain Celisse [Université de Lille, Associate Professor, HDR]
- Sophie Dabo-Niang [Université de Lille, Professor, HDR]
- Philippe Heinrich [Université de Lille, Associate Professor]
- Serge Iovleff [Université de Lille, Associate Professor]
- Guillemette Marot [Université de Lille, Associate Professor, HDR]
- Vincent Vandewalle [Université de Lille, Associate Professor, HDR]
Post-Doctoral Fellows
- Florent Dewez [Inria]
- Vera Shalaeva [Inria, until Jun 2020]
PhD Students
- Reuben Adams [University College London, United Kingdom, from Sep 2020]
- Filippo Antonazzo [Inria]
- Yaroslav Averyanov [Inria]
- Felix Biggs [University College London, United Kingdom]
- Rajeev Bopche [Inria, from Oct 2020]
- Guillaume Braun [Insee]
- Wilfried Heyse [Inserm]
- Eglantine Karlé [Inria, from Nov 2020]
- Etienne Kronert [Worldline, from Jul 2020]
- Arthur Leroy [Université Paris-Descartes, until Sep 2020]
- Issam Ali Moindjie [Inria, from Oct 2020]
- Axel Potier [Inria, from Jul 2020]
- Antonin Schrab [University College London, United Kingdom, from Sep 2020]
- Antoine Vendeville [University College London, United Kingdom]
- Luxin Zhang [Worldline, CIFRE]
Technical Staff
- Maxime Brunin [Inria, Engineer, from Jul 2020]
- Iheb Eladib [Inria, Engineer, until Feb 2020]
- Quentin Grimonprez [Inria, Engineer, until Sep 2020]
- Etienne Kronert [Inria, Engineer, from Feb 2020 until Jun 2020]
- Issam Ali Moindjie [Inria, Engineer, until Sep 2020]
- Arthur Talpaert [Inria, Engineer, until Sep 2020]
Interns and Apprentices
- Theophile Cantelobre [Inria, from Feb 2020 until Jul 2020]
- Issa Dabo [Inria, from Jun 2020 until Aug 2020]
- Cadmos Kahale-Abdou [Inria, from Jul 2020 until Oct 2020]
- Komlan Midodzi Noukpoape [Inria, from Apr 2020 until Sep 2020]
Administrative Assistant
- Anne Rejl [Inria]
Visiting Scientist
- Apoorv Vikram Singh [Indian Institute of Science, Bangalore, India, until Jan 2020]
External Collaborators
- Jean-Francois Bouin [DiagRAMS Technologies, until Mar 2020]
- Margot Correard [DiagRAMS Technologies, until Mar 2020]
2 Overall objectives
2.1 Context
In several respects, modern society has strengthened the need for statistical analysis, from both applied and theoretical points of view. This need stems from the easier availability of data thanks to technological breakthroughs (storage, transfer, computing); data are now so widespread that they are no longer limited to large human organizations. The more or less conscious goal of such data availability is the expectation of improving the quality of the age-old tasks of statistics, namely discovering new knowledge and making better predictions. These two central tasks can be referred to respectively as unsupervised learning and supervised learning, even if statistics is not limited to them and other names exist depending on the community. In a sense, it pursues the following hope: “more data for better quality and more numerous results”.
However, today's data are increasingly complex. They gather mixed type features (for instance continuous data mixed with categorical data), missing or partially missing items (like intervals) and numerous variables (high dimensional situation). As a consequence, the target “better quality and more numerous results” of the previous adage (both parts are important: “better quality” and also “more numerous”) cannot be reached in a somewhat “manual” way, but must inevitably rely on some theoretical formalization and guarantees. Indeed, data can be so numerous and so complex (data can live in quite abstract spaces) that the “empirical” statistician is quickly overwhelmed. Since data are by nature subject to randomness, the probabilistic framework is a very sensible theoretical environment to serve as a general guide for modern statistical analysis.
2.2 Goals
Modal is a project-team working on today's complex data sets (mixed data, missing data, high-dimensional data), for classical statistical targets (unsupervised learning, supervised learning, regression etc.) with approaches relying on the probabilistic framework. The latter can be tackled through both model-based methods (such as mixture models, a generic tool) and model-free methods (such as probabilistic bounds on empirical quantities). Furthermore, Modal is connected to the real world through applications, typically biological ones (some members have this skill), but many others are also considered since the application coverage of the Modal methodology is very large. It is also important to note that, in return, applications are often real opportunities for raising academic questions for the statistician (as is the case for some projects handled by the bilille platform and for some bilateral contracts of the team).
From the point of view of academic communities, Modal can be seen as belonging simultaneously to both the statistical learning and machine learning communities, as attested by its publications. In a sense, it is an opportunity to build a bridge between these two stochastic communities around a common but broad probabilistic framework.
3 Research program
3.1 Research axis 1: Unsupervised learning
Scientific challenges related to unsupervised learning are numerous, concerning the validity of the clustering outcome, the ability to manage different kinds of data, the handling of missing data, the dimensionality of the data set etc. Many of them are addressed by the team, leading to publications, often with the delivery of a specific package (sometimes upgraded to a software product or even to a platform grouping several software packages). Because of the variety of the scope, it involves nearly all the permanent team members, often with PhD students and some engineers. The related works are always embedded inside a probabilistic framework, typically model-based approaches but also model-free ones like PAC-Bayes (PAC stands for Probably Approximately Correct), because such a mathematical environment offers both a well-posed problem and a rigorous answer.
3.2 Research axis 2: Performance assessment
One main concern of the Modal team is to provide theoretical justifications for the procedures it designs. Such guarantees are important to avoid misleading conclusions resulting from any unsuitable use. For example, one ingredient in proving these guarantees is the use of the PAC framework, leading to finite-sample concentration inequalities. More precisely, contributions to PAC learning rely on classical empirical process theory and on PAC-Bayesian theory. The Modal team exploits such non-asymptotic tools to analyze the performance of, for instance, iterative algorithms (such as gradient descent), cross-validation estimators, online change-point detection procedures, ranking algorithms, matrix factorization techniques and clustering methods. The team also develops expertise in the formal dynamic study of algorithms related to mixture models (important models used in the previous unsupervised setting), such as degeneracy of the EM algorithm or label switching in the Gibbs algorithm.
3.3 Research axis 3: Functional data
Mainly due to technological advances, functional data are more and more widespread in many application domains. Functional data analysis (FDA) is concerned with the modeling of data such as curves, shapes, images or more complex mathematical objects, thought of as smooth realizations of a stochastic process (a data object valued in a space of possibly infinite dimension, such as a space of square-integrable functions). Time series are an emblematic example, even if FDA should not be limited to them (spectral data, spatial data etc.). Basically, FDA considers that data correspond to realizations of stochastic processes, usually assumed to take values in a metric, semi-metric, Hilbert or Banach space. One may consider functional data objects that are independent or dependent (in time or space) and of different types (qualitative, quantitative, ordinal, multivariate, time-dependent, spatial-dependent etc.). The last decade saw a dynamic literature on parametric and non-parametric FDA approaches, such as principal component analysis, clustering, regression and prediction, for different types of data and with applications to various domains.
3.4 Research axis 4: Applications motivating research
The fourth axis consists in translating real application issues into statistical problems that raise new (academic) challenges for the models developed in the Modal team. Cifre PhDs in industry and interdisciplinary projects with research teams in Health and Biology are at the core of this objective. The main originality of this objective lies in the use of statistics with complex data, including in particular ultra-high dimension problems. We focus on real applications which cannot be solved by classical data analysis.
4 Application domains
4.1 Economic world
The Modal team applies its research to the economic world through CIFRE PhD supervision with companies such as CACF (credit scoring), A-Volute (expert in 3D sound), Meilleur Taux (insurance comparator) and Worldline. It also has several contracts with companies such as COLAS, Nokia-Apsys/Airbus, Safety Line (through the PERF-AI consortium).
4.2 Biology
The second main application domain of the team is biology. Members of the team are involved in the supervision and scientific animation of bilille, the bioinformatics platform of Lille, and of the OncoLille Institute.
5 Highlights of the year
- Christophe Biernacki is now Deputy Scientific Director at Inria in charge of the national scientific domain “applied mathematics, computation and simulation”.
- Christophe Biernacki was president of the scientific committee of the JdS 2020.
- Benjamin Guedj has led the emerging Inria London Programme since 2019 and was appointed Scientific Director of the programme in September 2020. The partnership involves Inria and University College London (United Kingdom), with the official kickoff on February 1, 2021.
- Sophie Dabo-Niang was nominated in 2020 as a member of the Committee of Diversity of the International Mathematical Union (IMU).
- DiagRAMS Technologies, a software publisher dedicated to predictive maintenance, was created this year. This start-up relies on the research of the MODAL team, developing a data analysis solution to anticipate breakdowns and malfunctions on industrial equipment.
- Cristian Preda has been the head of the MODAL team since December 2020. Vincent Vandewalle is the deputy director of the team.
5.1 Awards
Wilfried Heyse has been awarded a prize at Spring of Cardiology for the best oral presentation of his poster 79.
Benjamin Guedj has obtained a best reviewer award (top 10% of the reviewers) for NeurIPS 2020.
Benjamin Guedj has co-authored a paper at NeurIPS 2020 which was selected for an oral presentation (top 3%) 39.
6 New software and platforms
6.1 New software
6.1.1 pycobra
- Keywords: Statistics, Data visualization, Machine learning
- Scientific Description:
pycobra is a Python library for ensemble learning, which serves as a toolkit for regression, classification, and visualisation. It is scikit-learn compatible and fits into the existing scikit-learn ecosystem.
pycobra offers a Python implementation of the COBRA algorithm introduced by Biau et al. (2016) for regression.
Another algorithm implemented is the EWA (Exponentially Weighted Aggregate) aggregation technique (among several other references, see the paper by Dalalyan and Tsybakov (2007)).
Apart from these two regression aggregation algorithms, pycobra implements a version of COBRA for classification. This procedure was introduced by Mojirsheibani (1999).
pycobra also offers various visualisation and diagnostic methods built on top of matplotlib, which let the user analyse and compare different regression machines with COBRA. The Visualisation class also lets the user apply some of the tools (such as Voronoi tessellations) to other visualisation problems, such as clustering.
- Functional Description:
pycobra is a Python library for ensemble learning, which serves as a toolkit for regression, classification, and visualisation. It is scikit-learn compatible and fits into the existing scikit-learn ecosystem.
pycobra offers a Python implementation of the COBRA algorithm introduced by Biau et al. (2016) for regression.
Another algorithm implemented is the EWA (Exponentially Weighted Aggregate) aggregation technique (among several other references, see the paper by Dalalyan and Tsybakov (2007)).
Apart from these two regression aggregation algorithms, pycobra implements a version of COBRA for classification. This procedure was introduced by Mojirsheibani (1999).
pycobra also offers various visualisation and diagnostic methods built on top of matplotlib, which let the user analyse and compare different regression machines with COBRA. The Visualisation class also lets the user apply some of the tools (such as Voronoi tessellations) to other visualisation problems, such as clustering.
- URL: https://github.com/bhargavvader/pycobra
- Publication: hal-01514059
- Contact: Benjamin Guedj
- Participants: Bhargav Srinivasa Desikan, Benjamin Guedj
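To make the aggregation idea behind COBRA more concrete, here is a minimal pure-Python sketch. It is not pycobra's API: the function name `cobra_predict`, its arguments and the toy numbers are all assumptions made for illustration. It implements the simplest consensus rule, where a held-out sample contributes to the prediction only if every base machine places it within `epsilon` of the query point, and the prediction is the average response of the retained samples.

```python
def cobra_predict(query_preds, held_out_preds, held_out_targets, epsilon):
    """Simplified COBRA-style aggregation (illustrative, not pycobra's API).

    query_preds: predictions of the M base machines at the query point.
    held_out_preds: one list of M machine predictions per held-out sample.
    held_out_targets: observed responses of the held-out samples.
    """
    selected = [
        y
        for preds, y in zip(held_out_preds, held_out_targets)
        # Keep a sample only if all machines agree it is close to the query.
        if all(abs(p - q) <= epsilon for p, q in zip(preds, query_preds))
    ]
    if not selected:
        return None  # no consensus neighbour; the caller must pick a fallback
    return sum(selected) / len(selected)


# Toy data: two hypothetical machines evaluated on four held-out samples.
held_out_preds = [[1.0, 1.1], [1.2, 0.9], [3.0, 3.2], [2.9, 3.1]]
held_out_targets = [1.0, 1.2, 3.0, 3.4]

close_query = cobra_predict([1.1, 1.0], held_out_preds, held_out_targets, epsilon=0.25)
far_query = cobra_predict([10.0, 10.0], held_out_preds, held_out_targets, epsilon=0.25)
print(close_query, far_query)
```

The base machines never see the query's response: they only define which held-out samples count as neighbours, which is why the rule can combine machines of very different natures.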
6.1.2 MixtComp.V4
- Keywords: Clustering, Statistics, Missing data, Mixed data
- Functional Description: MixtComp (Mixture Computation) is a model-based clustering package for mixed data originating from the Modal team (Inria Lille). It has been engineered around the idea of easy and quick integration of all new univariate models, under the conditional independence assumption. New models will eventually become available from research carried out by the Modal team or by other teams. Currently, the central architecture of MixtComp is built and its functionality has been field-tested through industry partnerships. Five basic models (Gaussian, Multinomial, Poisson, Weibull, NegativeBinomial) are implemented, as well as two advanced models (Functional and Rank). MixtComp has the ability to natively manage missing data (completely or by interval). MixtComp is used as an R package, but its internals are coded in C++ using state-of-the-art libraries for faster computation.
- Release Contributions: new I/O system; replacement of the regex library; improved initialization; stopping criteria for the algorithm; management of partially missing data for several models; user documentation; additional user features in R.
- Contact: Christophe Biernacki
- Participants: Christophe Biernacki, Vincent Kubicki, Matthieu Marbac-Lourdelle, Serge Iovleff, Quentin Grimonprez, Etienne Goffinet
- Partners: Université de Lille, CNRS
6.1.3 MASSICCC
- Name: Massive Clustering with Cloud Computing
- Keywords: Statistic analysis, Big data, Machine learning, Web Application
- Scientific Description: The web application lets users use several software packages developed by Inria directly in a web browser. Mixmod is a classification library for continuous and categorical data. MixtComp allows for missing data and a larger choice of data types. BlockCluster is a library for co-clustering of data. When using the web application, the user can first upload a data set, then configure a job using one of the libraries mentioned and start the execution of the job on a cluster. The results are then displayed directly in the browser, allowing for rapid understanding and interactive visualisation.
- Functional Description: The MASSICCC web application offers a simple and dynamic interface for analysing heterogeneous data with a web browser. Various software packages for statistical analysis are available (Mixmod, MixtComp, BlockCluster) which allow for supervised and unsupervised classification of large data sets.
- URL: https://massiccc.lille.inria.fr
- Contact: Christophe Biernacki
6.1.4 cfda
- Name: Categorical functional data analysis
- Keyword: Functional data
- Functional Description: The R package cfda performs descriptive statistics for categorical functional data, as well as dimension reduction and optimal encoding of states (an extension of multiple correspondence analysis to functional data).
- URL: https://github.com/modal-inria/cfda
- Contact: Cristian Preda
- Participants: Cristian Preda, Quentin Grimonprez, Vincent Vandewalle
- Partner: Université de Lille
6.1.5 PyRotor
- Name: Python Route Trajectory Optimiser
- Keywords: Optimization, Machine learning, Trajectory Modeling
- Scientific Description:
PyRotor is a Python implementation of the trajectory optimisation method introduced in the paper “An end-to-end data-driven optimisation framework for constrained trajectories”.
The method proposes trajectories optimizing a given criterion. Unlike classical approaches (such as optimal control), the method is based on the information contained in the available data. This makes it possible to restrict the search area to a neighborhood of the observed trajectories and to incorporate the correlations estimated from the data. This is achieved by means of a regularization term in the cost function. An iterative approach is also developed to verify additional constraints.
- Functional Description: PyRotor leverages available trajectory data to focus the search space and to estimate some properties which are then incorporated in the optimisation problem. This constrains, in a natural and simple way, the optimisation problem, whose solution inherits realistic patterns from the data. In particular, PyRotor does not require any knowledge of the dynamics of the system.
- News of the Year: Methodology development and implementation of the first results
- URL: https://pypi.org/project/pyrotor/
- Publication: hal-03024720
- Contact: Florent Dewez
- Participants: Florent Dewez, Benjamin Guedj, Arthur Talpaert, Vincent Vandewalle
6.2 New platforms
6.2.1 MASSICCC Platform
MASSICCC is a demonstration platform giving access, through a SaaS (software as a service) concept, to data analysis libraries developed at Inria. It allows obtaining results either directly through a website-specific display (specific and interactive visual outputs) or through an R data object download. It started in October 2015 for two years and is common to the Modal team (Inria Lille) and the Select team (Inria Saclay). In 2016, two packages were integrated: Mixmod and MixtComp (see the specific section about MixtComp). In 2017, the BlockCluster package was integrated, and particular attention to providing meaningful graphical outputs (for Mixmod, MixtComp and BlockCluster) directly in the web platform itself led to some specific developments. In 2019, a new version of the MixtComp software was developed. In 2020, Julien Vandaele joined the MODAL team as a research engineer to upgrade both the MixtComp software and the MASSICCC platform.
7 New results
7.1 Axis 1: Model-based Co-clustering for Ordinal Data of Different Dimensions
Participants: Christophe Biernacki.
This work has been motivated by a psychological survey on women affected by a breast tumor. Patients answered questionnaires at different moments of their treatment, with answers on an ordinal scale. The questions relate to aspects of their life called dimensions. To assist the psychologists in analyzing the results, it is useful to emphasize a structure in the dataset. Clustering achieves this by creating groups of individuals, each depicted by a representative of the group. From a psychological standpoint, it is also useful to observe how questions may be grouped. This is why a clustering should also be performed on the features, which is called a co-clustering problem. However, gathering questions that are not related to the same dimension does not make sense from a psychologist's point of view. Therefore, the present work performs a constrained co-clustering that prevents questions from different dimensions from being assembled in the same column-cluster. In addition, the evolution of co-clusters over time has been investigated. The method relies on a constrained Latent Block Model embedding a probability distribution for ordinal data. Parameter estimation relies on a Stochastic EM algorithm associated with a Gibbs sampler, and the ICL-BIC criterion is used for selecting the numbers of co-clusters. The resulting work was accepted in an international journal in 2019 and the related R package ordinalClust has been accepted this year in another international journal 29.
This is a joint work with Margot Selosse (PhD student) and Julien Jacques, both from Université de Lyon 2, and Florence Cousson-Gélie from Université Paul Valéry Montpellier 3.
7.2 Axis 1: Model-based Co-clustering for Mixed Type Data
Participants: Christophe Biernacki.
Over the decades, many studies have shown the importance of clustering to emphasize groups of observations. More recently, due to the emergence of high-dimensional datasets with a huge number of features, co-clustering techniques have emerged, offering several methods for simultaneously producing groups of observations and features. By synthesizing the dataset in blocks (the crossing of a row-cluster and a column-cluster), this technique can sometimes better summarize the data and its inherent structure. The Latent Block Model (LBM) is a well-known method for performing co-clustering. However, contexts with features of different types (here called mixed type datasets) are becoming more common. Unfortunately, the LBM is not directly applicable to this kind of dataset. The present work extends the usual LBM to the so-called Multiple Latent Block Model (MLBM), which is able to handle mixed type datasets. The inference is done through a Stochastic EM algorithm embedding a Gibbs sampler, and a model selection criterion is defined to choose the number of row and column clusters. This method was successfully used on simulated and real datasets. This work is now accepted in an international journal 27.
This is joint work with Margot Selosse (PhD student) and Julien Jacques, both from Université de Lyon 2.
7.3 Axis 1: Relaxing the Identically Distributed Assumption in Gaussian Co-clustering for High Dimensional Data
Participants: Christophe Biernacki.
A co-clustering model for continuous data that relaxes the identically distributed assumption within blocks of traditional co-clustering is presented. The proposed model, although allowing more flexibility, still maintains the very high degree of parsimony achieved by traditional co-clustering. A stochastic EM algorithm along with a Gibbs sampler is used for parameter estimation and an ICL criterion is used for model selection. Simulated and real datasets are used for illustration and comparison with traditional co-clustering. This work has been submitted to an international journal 65.
This is a joint work with Michael Gallaugher (PhD student) and Paul McNicholas, both from McMaster University (Canada). Michael Gallaugher visited Modal for three months in 2018.
7.4 Axis 1: Gaussian-based Visualization of Gaussian and non-Gaussian Model-based Clustering
Participants: Christophe Biernacki, Vincent Vandewalle.
A generic method is introduced to visualize in a Gaussian-like way, and onto
This is a joint work with Matthieu Marbac from ENSAI.
7.5 Axis 1: Dealing with Missing Data in Model-based Clustering through a MNAR Model
Participants: Christophe Biernacki.
Since the 90s, model-based clustering has been widely used to classify data. Nowadays, with the increase of available data, missing values are more frequent. Traditional ways to deal with them consist in obtaining a filled data set, either by discarding missing values or by imputing them. In the first case, some information is lost; in the second case, the final clustering purpose is not taken into account in the imputation step. Thus, both solutions risk blurring the clustering estimation result. Alternatively, we defend the need to embed the missingness mechanism directly within the clustering modeling step. There exist three types of missing data: missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR). In all situations, logistic regression is proposed as a natural and flexible candidate model. In particular, its flexibility allows us to design some meaningful parsimonious variants, such as dependency on the missing values or dependency on the cluster label. In this unified context, standard model selection criteria can be used to select between such different missing data mechanisms, simultaneously with the number of clusters. The practical interest of our proposal is illustrated on data derived from medical studies suffering from many missing data. Currently, a preprint is being finalized for submission to an international journal.
It is a joint work with Claire Boyer from Sorbonne Université, Gilles Celeux from Inria Saclay, Julie Josse from Inria Montpellier, Fabien Laporte from Institut Pasteur and Matthieu Marbac from ENSAI.
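The distinction between the missingness mechanisms mentioned above can be sketched with a small simulation in which the probability that a value is missing follows a logistic model. This is only an illustration under stated assumptions: the function names, the coefficients `alpha` and `beta`, the two-cluster Gaussian data and the mechanism labels "MNAR-value" and "MNAR-cluster" are all hypothetical and do not reproduce the model of the preprint.

```python
import math
import random


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def missing_probability(mechanism, y_value, x_observed, cluster, alpha=-1.0, beta=2.0):
    """Logistic model for P(y is missing); coefficients are illustrative."""
    if mechanism == "MCAR":          # depends on nothing
        return sigmoid(alpha)
    if mechanism == "MAR":           # depends only on a fully observed variable
        return sigmoid(alpha + beta * x_observed)
    if mechanism == "MNAR-value":    # depends on the possibly missing value itself
        return sigmoid(alpha + beta * y_value)
    if mechanism == "MNAR-cluster":  # depends on the latent cluster label
        return sigmoid(alpha + beta * cluster)
    raise ValueError(f"unknown mechanism: {mechanism}")


def simulate(mechanism, n=1000, seed=0):
    """Draw (cluster, x, y) triples and hide y according to the mechanism."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        cluster = rng.randint(0, 1)
        x = rng.gauss(0.0, 1.0)            # always observed
        y = rng.gauss(3.0 * cluster, 1.0)  # may be hidden
        p = missing_probability(mechanism, y, x, cluster)
        rows.append((cluster, x, None if rng.random() < p else y))
    return rows


rows = simulate("MNAR-cluster")
rate_by_cluster = {}
for c in (0, 1):
    ys = [y for cluster, _, y in rows if cluster == c]
    rate_by_cluster[c] = sum(y is None for y in ys) / len(ys)
print(rate_by_cluster)
```

Under the cluster-dependent variant, the missingness pattern itself carries information about the partition, which is the intuition behind modeling the mechanism jointly with the clustering rather than discarding or imputing values beforehand.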
7.6 Axis 1: Organized Co-clustering for Textual Data Synthesis
Participants: Christophe Biernacki.
Recently, different studies have demonstrated the interest of co-clustering, which simultaneously produces clusters of rows and columns. The present work introduces a novel co-clustering model for parsimoniously summarizing textual data in documents × terms format. Besides highlighting homogeneous coclusters - as other existing algorithms do - we also distinguish noisy coclusters from significant ones, which is particularly useful for sparse documents × terms matrices. Furthermore, our model proposes a structure among the significant coclusters and thus offers better interpretability to the user. By forcing a structure through row-clusters and column-clusters, this approach is competitive in terms of document clustering, and offers user-friendly results. The algorithm derived for the proposed method is a Stochastic EM algorithm embedding a Gibbs sampling step and the Poisson distribution. A paper has now been accepted in an international journal 28 and also in a national conference with international audience 47.
This is joint work with Margot Selosse (PhD student) and Julien Jacques, both from Université de Lyon 2.
7.7 Axis 1: Model-Based Co-clustering with Co-variables
Participants: Serge Iovleff.
This work has been motivated by an epidemiological and genetic survey of malaria disease in Senegal. Data were collected between 1990 and 2008. It is based on a latent block model taking into account the problem of grouping variables and clustering individuals by integrating information given by a set of co-variables. Numerical experiments on simulated data sets and an application on real genetic data highlight the interest of this approach. An article has been submitted to Journal of Classification and should incorporate “Major Revisions”.
7.8 Axis 1: Predictive Clustering
Participants: Christophe Biernacki, Vincent Vandewalle.
Many data sets, for instance in biostatistics, contain sets of variables which permit evaluating unobserved traits of the subjects (e.g. we ask questions about how many pizzas, hamburgers, chips etc. are eaten to learn how healthy the subjects' food habits are). Moreover, we often want to measure the relations between these unobserved traits and some target variables (e.g. obesity). Thus, a two-step procedure is often used: first, a clustering of the observations is performed on the sets of variables related to the same topic; second, the predictive model is fitted by plugging in the estimated partitions as covariates. Generally, the estimated partitions are not exactly equal to the true ones. We investigate the impact of these measurement errors on the estimators of the regression parameters, and we explain when this two-step procedure is consistent. We also present a specific EM algorithm which simultaneously estimates the parameters of the clustering and predictive models. This has led to the preprint 71, now submitted to an international journal.
It is a joint work with Matthieu Marbac from ENSAI and Mohammed Sedki from Université Paris-Saclay.
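The effect of plugging an imperfect partition into a regression can be illustrated with a toy simulation. The sketch below is an assumption-laden stand-in, not the method of the preprint: a tiny one-dimensional 2-means replaces the model-based clustering step, the regression is reduced to a difference of group means, and all names and numerical settings (`separation`, `true_effect`, sample size) are hypothetical.

```python
import random


def two_means_1d(values, iterations=20):
    """Tiny 1-D 2-means, a stand-in for the model-based clustering step."""
    c0, c1 = min(values), max(values)
    labels = [0] * len(values)
    for _ in range(iterations):
        labels = [0 if abs(v - c0) <= abs(v - c1) else 1 for v in values]
        group0 = [v for v, lab in zip(values, labels) if lab == 0]
        group1 = [v for v, lab in zip(values, labels) if lab == 1]
        if group0:
            c0 = sum(group0) / len(group0)
        if group1:
            c1 = sum(group1) / len(group1)
    return labels


def group_effect(labels, targets):
    """Least-squares slope of the target on a 0/1 covariate (difference of means)."""
    y0 = [y for lab, y in zip(labels, targets) if lab == 0]
    y1 = [y for lab, y in zip(labels, targets) if lab == 1]
    return sum(y1) / len(y1) - sum(y0) / len(y0)


def simulate(separation, n=2000, true_effect=2.0, seed=1):
    rng = random.Random(seed)
    truth = [rng.randint(0, 1) for _ in range(n)]
    # x is the observed proxy of the latent trait; y is the target variable.
    x = [rng.gauss(separation * t, 1.0) for t in truth]
    y = [true_effect * t + rng.gauss(0.0, 1.0) for t in truth]
    return x, y


estimates = {}
for separation in (8.0, 1.5):
    x, y = simulate(separation)
    estimates[separation] = group_effect(two_means_1d(x), y)
print(estimates)
```

When the clusters are well separated, the estimated partition is almost exact and the plug-in estimate is close to the true effect; when they overlap, misclassified subjects pull the two group means together and the estimate is attenuated, which is the kind of measurement-error effect a joint estimation aims to avoid.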
7.9 Axis 1: A Binned Technique for Scalable Model-based Clustering on Huge Datasets
Participants: Filippo Antonazzo, Christophe Biernacki.
Clustering is impacted by the regular increase of sample sizes, which provides the opportunity to reveal information previously out of scope. However, the volume of data leads to some issues related to the need for substantial computational resources and also to high energy consumption. Resorting to binned data depending on an adaptive grid is expected to give a proper answer to such green computing issues while not harming the quality of the related estimation. After a brief review of existing methods, a first application in the context of univariate model-based clustering is provided, with a numerical illustration of its advantages. Finally, an initial formalization of the multivariate extension is done, highlighting both issues and possible strategies. This work has been accepted to a national conference with international audience 43 and also to an international conference 33.
It is a joint work with Christine Keribin from Université Paris-Saclay.
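As a rough illustration of why binning reduces the computational load, the sketch below evaluates the log-likelihood of a univariate Gaussian mixture from bin counts only, so that the cost depends on the number of bins rather than on the sample size. It assumes a fixed grid (the adaptive grid of the work above is not reproduced), drops the multinomial constant, and uses hypothetical names and toy data.

```python
import math


def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def bin_counts(data, edges):
    """Count observations per bin [edges[b], edges[b+1]); values outside are ignored."""
    counts = [0] * (len(edges) - 1)
    for x in data:
        for b in range(len(counts)):
            if edges[b] <= x < edges[b + 1]:
                counts[b] += 1
                break
    return counts


def binned_log_likelihood(counts, edges, weights, means, sigmas):
    """Mixture log-likelihood given only bin counts (up to an additive constant)."""
    total = 0.0
    for b, n_b in enumerate(counts):
        if n_b == 0:
            continue
        # Probability mass that the mixture assigns to bin b.
        mass = sum(
            w * (normal_cdf(edges[b + 1], m, s) - normal_cdf(edges[b], m, s))
            for w, m, s in zip(weights, means, sigmas)
        )
        total += n_b * math.log(mass)
    return total


data = [-2.1, -1.9, -2.0, 2.0, 1.8, 2.2]
edges = [-4.0, -1.0, 1.0, 4.0]
counts = bin_counts(data, edges)

good = binned_log_likelihood(counts, edges, [0.5, 0.5], [-2.0, 2.0], [1.0, 1.0])
bad = binned_log_likelihood(counts, edges, [0.5, 0.5], [0.0, 0.0], [1.0, 1.0])
print(counts, good, bad)
```

Once the counts are computed, each likelihood evaluation touches only the bins, so an iterative fitting procedure built on this function would not need to revisit the raw observations.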
7.10 Axis 1: A Bumpy Journey: Exploring Deep Gaussian Mixture Models
Participants: Christophe Biernacki.
The deep Gaussian mixture model (DGMM) is a framework directly inspired by the finite mixture of factor analysers model (MFA) and the deep learning architecture composed of multiple layers. The MFA is a generative model that considers a data point as arising from a latent variable (termed the score) which is sampled from a standard multivariate Gaussian distribution and then transformed linearly. The linear transformation matrix (termed the loading matrix) is specific to a component in the finite mixture. The DGMM consists of stacking MFA layers, in the sense that the latent scores are no longer assumed to be drawn from a standard Gaussian, but rather are drawn from a mixture of factor analysers model. Thus the latent scores are at one point considered to be the input of an MFA and also to have latent scores themselves. Only the latent scores of the DGMM’s last layer are considered to be drawn from a standard multivariate Gaussian distribution. In recent years, the DGMM has gained prominence in the literature: intuitively, this model should be able to capture complex distributions more precisely than a simple Gaussian mixture model. We show in this work that while the DGMM is an original and novel idea, in certain cases it is challenging to infer its parameters. In addition, we give some insights into the probable reasons for this difficulty. Experimental results are provided on github: https://
This is a joint work with Margot Selosse (PhD student) and Julien Jacques, both from Université de Lyon 2, and also Isobel Claire Gormley from University College Dublin (Ireland).
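The generative mechanism described above, stacked MFA layers with standard Gaussian scores only at the deepest level, can be sketched as a small sampler. The dictionary layout of a layer, the function names and every parameter value below are assumptions made for the example; in particular, the diagonal noise term is the usual MFA convention and is not spelled out in the text above.

```python
import random


def sample_mfa_layer(z, layer, rng):
    """One MFA step: pick a component k, then return mu_k + Lambda_k z + noise."""
    k = rng.choices(range(len(layer["weights"])), weights=layer["weights"])[0]
    mu = layer["means"][k]
    loading = layer["loadings"][k]    # one row per output coordinate
    noise_sd = layer["noise_sd"][k]   # diagonal noise (assumed convention)
    return [
        mu[i]
        + sum(loading[i][j] * z[j] for j in range(len(z)))
        + rng.gauss(0.0, noise_sd[i])
        for i in range(len(mu))
    ]


def sample_dgmm(layers, latent_dim, rng):
    """Draw one observation; layers[0] is the observed layer, layers[-1] the deepest."""
    # Only the deepest scores are standard Gaussian.
    z = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
    for layer in reversed(layers):
        z = sample_mfa_layer(z, layer, rng)
    return z


observed_layer = {  # maps 2-D scores to 3-D observations
    "weights": [0.5, 0.5],
    "means": [[0.0, 0.0, 0.0], [4.0, 4.0, 4.0]],
    "loadings": [
        [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
        [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]],
    ],
    "noise_sd": [[0.1, 0.1, 0.1], [0.1, 0.1, 0.1]],
}
hidden_layer = {  # maps 1-D scores to 2-D scores
    "weights": [0.3, 0.7],
    "means": [[-1.0, 0.0], [1.0, 0.0]],
    "loadings": [[[1.0], [0.5]], [[0.2], [1.0]]],
    "noise_sd": [[0.2, 0.2], [0.2, 0.2]],
}

rng = random.Random(0)
x = sample_dgmm([observed_layer, hidden_layer], latent_dim=1, rng=rng)
print(x)
```

Sampling is straightforward; the difficulty discussed above concerns the reverse direction, since each observation hides one component choice and one score vector per layer.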
7.11 Axis 1: Multiple partition clustering subspaces
Participants: Vincent Vandewalle.
In model-based clustering, it is often supposed that only one clustering latent variable explains the heterogeneity of the whole dataset. However, in many cases several latent variables could explain the heterogeneity of the data at hand. Finding such class variables could result in a richer interpretation of the data. In the continuous data setting, a multi-partition model-based clustering is proposed. It assumes the existence of several latent clustering variables, each one explaining the heterogeneity of the data with respect to some clustering subspace. It allows the multi-partitions and the related subspaces to be found simultaneously. Parameters of the model are estimated through an EM algorithm relying on a probabilistic reinterpretation of the factorial discriminant analysis. A model choice strategy relying on the BIC criterion is proposed to select the number of subspaces and the number of clusters by subspace. The obtained results are thus several projections of the data, each one conveying its own clustering of the data.
This work is now published in 32.
7.12 Axis 1: Ranking and synchronization from pairwise measurements via SVD
Participants: Hemant Tyagi.
Given a measurement graph
This is joint work with Alexandre d'Aspremont (CNRS & ENS, Paris) and Mihai Cucuringu (University of Oxford, United Kingdom) and has now been published in an international journal 19.
7.13 Axis 1: Regularized spectral methods for clustering signed networks
Participants: Hemant Tyagi.
We study the problem of
This is joint work with Mihai Cucuringu (University of Oxford, United Kingdom), Apoorv Vikram Singh (NYU), Deborah Sulem (University of Oxford, United Kingdom). It was initiated when Apoorv Vikram Singh visited the MODAL team to work with Hemant Tyagi from Oct 2019-Jan 2020. It is currently under review in an international journal. A summary of the results was presented at the GCLR (Graphs and more Complex structures for Learning and Reasoning) workshop at AAAI 2021 (https://
7.14 Axis 1: An extension of the angular synchronization problem to the heterogeneous setting
Participants: Hemant Tyagi.
Given an undirected measurement graph
This is joint work with Mihai Cucuringu (University of Oxford, United Kingdom) and is currently under review in an international journal.
7.15 Axis 1&2: Clustering on Multilayer Graphs with Missing Values
Participants: Christophe Biernacki, Guillaume Braun, Hemant Tyagi.
Multilayer graph clustering has gained increasing interest over the last decade due to numerous applications in various fields. Several clustering methods have been proposed, but they all rely on the assumption that the network is fully observed. We propose a statistical framework to handle nodes that are missing on some layers, as well as a method to estimate the model parameters and to impute missing edge values.
This PhD work has recently begun and has led to a national conference paper with international audience 34. An extended version has been submitted and accepted to an international conference for 2021.
7.16 Axis 2: Denoising modulo samples: k-NN regression and tightness of SDP relaxation
Participants: Hemant Tyagi.
Many modern applications involve the acquisition of noisy modulo samples of a function
Recently, Cucuringu and Tyagi proposed an alternative way of denoising modulo 1 data which works with their representation on the unit complex circle. They formulated a smoothness regularized least squares problem on the product manifold of unit circles, where the smoothness is measured with respect to the Laplacian of a proximity graph
This is joint work with Michael Fanuel (KU Leuven). It is currently under review in an international journal and is undergoing revision.
7.17 Axis 2: Error analysis for denoising smooth modulo signals on a graph
Participants: Hemant Tyagi.
In many applications, we are given access to noisy modulo samples of a smooth function with the goal being to robustly unwrap the samples, i.e. to estimate the original samples of the function. In a recent work, Cucuringu and Tyagi proposed denoising the modulo samples by first representing them on the unit complex circle and then solving a smoothness regularized least squares problem – the smoothness measured w.r.t. the Laplacian of a suitable proximity graph
In this work, we analyse the (TRS) as well as an unconstrained relaxation of (QCQP). For both these estimators we provide a refined analysis in the setting of Gaussian noise and derive noise regimes where they provably denoise the modulo observations w.r.t. the
This is currently under review in an international journal, and is undergoing revision.
7.18 Axis 2: Multi-kernel unmixing and super-resolution using the Modified Matrix Pencil method
Participants: Hemant Tyagi.
Consider
This is joint work with Stephane Chretien (National Physical Laboratory, United Kingdom & Alan Turing Institute, London) and was mostly done while Hemant Tyagi was affiliated to the Alan Turing Institute. It has now been published in an international journal 17.
7.19 Axis 2: Provably robust estimation of modulo 1 samples of a smooth function with applications to phase unwrapping
Participants: Hemant Tyagi.
Consider an unknown smooth function
This is joint work with Mihai Cucuringu (University of Oxford, United Kingdom) and was mostly done while Hemant Tyagi was affiliated to the Alan Turing Institute. It has now been published in an international journal 18.
7.20 Axis 2: Pseudo-Bayesian learning with kernel Fourier transform as prior
Participants: Pascal Germain.
We revisit the kernel random Fourier features (RFF) method through the lens of the PAC-Bayesian theory. While the primary goal of RFF is to approximate a kernel, we look at the Fourier transform as a prior distribution over trigonometric hypotheses. It naturally suggests learning a posterior on these hypotheses. We derive generalization bounds that are optimized by learning a pseudo-posterior obtained from a closed-form expression, and corresponding learning algorithms.
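As a minimal illustration of the starting point above, the sketch below builds random Fourier features for a Gaussian kernel and checks that their inner product approximates the kernel. It only covers the classical RFF approximation, in which each sampled frequency defines one trigonometric hypothesis; the PAC-Bayesian pseudo-posterior over these hypotheses is not reproduced here. The function names and the choice of a Gaussian kernel with bandwidth `sigma` are assumptions made for the example.

```python
import math
import random


def sample_rff(dim, n_features, sigma, rng):
    """Draw frequencies from the Fourier transform of a Gaussian kernel.

    For k(x, y) = exp(-||x - y||^2 / (2 sigma^2)), the Fourier transform is a
    Gaussian with standard deviation 1 / sigma on each coordinate.
    """
    omegas = [[rng.gauss(0.0, 1.0 / sigma) for _ in range(dim)]
              for _ in range(n_features)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    return omegas, phases


def rff_map(x, omegas, phases):
    """Map x to sqrt(2 / D) * cos(<omega_j, x> + b_j), one hypothesis per j."""
    scale = math.sqrt(2.0 / len(omegas))
    return [scale * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
            for w, b in zip(omegas, phases)]


def gaussian_kernel(x, y, sigma):
    """Exact Gaussian kernel value, used as the reference."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


rng = random.Random(0)
omegas, phases = sample_rff(dim=2, n_features=5000, sigma=1.0, rng=rng)
x, y = [0.3, -0.5], [1.0, 0.2]
approx = sum(a * b for a, b in zip(rff_map(x, omegas, phases),
                                   rff_map(y, omegas, phases)))
exact = gaussian_kernel(x, y, sigma=1.0)
```

With enough features, `approx` should be close to `exact`; a uniform average over the sampled frequencies corresponds to using the prior itself, whereas the work above learns non-uniform weights.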
This joint work with Emilie Morvant from Université Jean Monnet de Saint-Etienne and Gaël Letarte from Laval University (Québec, Canada) was initiated in 2018, when Gaël Letarte was doing an internship at Inria, and led to a publication in the proceedings of the AISTATS 2019 conference. The same work has been presented as a poster in the “Workshop on Machine Learning with guarantees @ NeurIPS 2019”.
An extension of this work, co-authored with Léo Gautheron, Amaury Habrard, Marc Sebban, and Valentina Zantedeschi – all from Université Jean Monnet de Saint-Etienne – has been presented at the national conference CAp 2019. It is also the topic of a technical report.
7.21 Axis 2: Improved PAC-Bayesian Bounds for Linear Regression
Participants: Pascal Germain, Vera Shalaeva.
We improve the PAC-Bayesian error bound for linear regression provided in the literature. The improvements are two-fold. First, the proposed error bound is tighter, and converges to the generalization loss with a well-chosen temperature parameter. Second, the error bound also holds for training data that are not independently sampled. In particular, the error bound applies to certain time series generated by well-known classes of dynamical models, such as ARX models.
It is a joint work with Mihaly Petreczky and Alireza Fakhrizadeh Esfahani from Université de Lille. It has been accepted for publication as part of the AAAI 2020 conference 41.
7.22 Axis 2: Multiview Boosting by controlling the diversity and the accuracy of view-specific voters
Participants: Pascal Germain.
We propose a boosting-based multiview learning algorithm which iteratively learns i) weights over view-specific voters capturing view-specific information; and ii) weights over views by optimizing a PAC-Bayes multiview C-Bound that takes into account the accuracy of view-specific classifiers and the diversity between the views. We derive a generalization bound for this strategy following the PAC-Bayes theory, which is a suitable tool to deal with models expressed as a weighted combination over a set of voters.
It is a joint work with Emilie Morvant from Université Jean Monnet de Saint-Etienne and with Massih-Reza Amini of Université Grenoble-Alpes, and with Anil Goyal affiliated to both institutions. This work has been published in the journal Neurocomputing.
7.23 Axis 2: PAC-Bayes and Domain Adaptation
Participants: Pascal Germain.
In machine learning, Domain Adaptation (DA) arises when the distribution generating the test (target) data differs from the one generating the learning (source) data. It is well known that DA is a hard task even under strong assumptions, among which is covariate shift, where the source and target distributions diverge only in their marginals, i.e. they have the same labeling function. Another popular approach is to consider a hypothesis class that brings the two distributions closer while implying a low error for both tasks. This is a VC-dim approach that restricts the complexity of a hypothesis class in order to get good generalization. Instead, we propose a PAC-Bayesian approach that seeks suitable weights to be given to each hypothesis in order to build a majority vote. We prove a new DA bound in the PAC-Bayesian context. This leads us to design the first DA-PAC-Bayesian algorithm based on the minimization of the proposed bound. Doing so, we seek for a
This work has been published in the journal Neurocomputing 24.
It is a joint work with Emilie Morvant and Amaury Habrard from Université Jean Monnet de Saint-Etienne (France), and with François Laviolette from Laval University (Québec, Canada).
7.24 Axis 2: Interpreting Neural Networks as Majority Votes through the PAC-Bayesian Theory
Participants: Pascal Germain, Paul Viallard.
We propose a PAC-Bayesian theoretical study of the two-phase learning procedure of a neural network introduced by Kawaguchi et al. 84. In this procedure, a network is expressed as a weighted combination of all the paths of the network (from the input layer to the output one), which we reformulate as a PAC-Bayesian majority vote. Starting from this observation, their learning procedure consists in (1) learning a “prior” network for fixing some parameters, then (2) learning a “posterior” network by only allowing a modification of the weights over the paths of the prior network. This allows us to derive a PAC-Bayesian generalization bound that involves the empirical individual risks of the paths (known as the Gibbs risk) and the empirical diversity between pairs of paths. Note that, similarly to classical PAC-Bayesian bounds, our result involves a KL-divergence term between the “prior” network and the “posterior” network. We show that this term is computable by dynamic programming without assuming any distribution on the network weights.
This early result has been accepted as a poster presentation in the international workshop “Workshop on Machine Learning with guarantees @ NeurIPS 2019”.
This is a joint work with researchers from Université Jean Monnet de Saint-Etienne: Amaury Habrard, Emilie Morvant, and Rémi Emonet.
7.25 Axis 2: PAC-Bayesian Bound for the Conditional Value at Risk
Participants: Benjamin Guedj.
Conditional Value at Risk (CVaR) is a family of “coherent risk measures” which generalize the traditional mathematical expectation. Widely used in mathematical finance, it is garnering increasing interest in machine learning, e.g. as an alternate approach to regularization, and as a means for ensuring fairness. This paper presents a generalization bound for learning algorithms that minimize the CVaR of the empirical loss. The bound is of PAC-Bayesian type and is guaranteed to be small when the empirical CVaR is small. We achieve this by reducing the problem of estimating CVaR to that of merely estimating an expectation. This then enables us, as a by-product, to obtain concentration inequalities for CVaR even when the random variable in question is unbounded.
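To make the quantity concrete, here is a small sketch of an empirical CVaR computed from a sample of losses. It assumes the standard variational (Rockafellar-Uryasev) form CVaR_α[Z] = inf_μ { μ + E[(Z − μ)_+] / α } with α in (0, 1], so that α = 1 recovers the ordinary expectation; the function name is hypothetical and the PAC-Bayesian bound itself is not reproduced.

```python
def empirical_cvar(losses, alpha):
    """Empirical CVaR at level alpha in (0, 1] via the variational formula.

    The objective mu + mean(max(z - mu, 0)) / alpha is piecewise linear and
    convex in mu, so its minimum over the sample is attained at a sample point.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    n = len(losses)
    best = float("inf")
    for mu in losses:
        excess = sum(max(z - mu, 0.0) for z in losses) / n
        best = min(best, mu + excess / alpha)
    return best


losses = [1.0, 2.0, 3.0, 4.0]
tail_half = empirical_cvar(losses, alpha=0.5)   # average of the worst half
plain_mean = empirical_cvar(losses, alpha=1.0)  # reduces to the expectation
```

On this toy sample, the level 0.5 value is the average of the two largest losses, and the level 1 value coincides with the sample mean.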
Joint work with Mhammedi Zakaria (Australian National University) and Robert Williamson. Published: 39
7.26 Axis 2: PAC-Bayesian Contrastive Unsupervised Representation Learning
Participants: Benjamin Guedj, Pascal Germain.
Contrastive unsupervised representation learning (CURL) is the state-of-the-art technique to learn representations (as a set of features) from unlabelled data. While CURL has collected several empirical successes recently, theoretical understanding of its performance was still missing. In a recent work, Arora et al. 86 provide the first generalisation bounds for CURL, relying on a Rademacher complexity. We extend their framework to the flexible PAC-Bayes setting, which makes it possible to deal with the non-iid setting. We present PAC-Bayesian generalisation bounds for CURL, which are then used to derive a new representation learning algorithm. Numerical experiments on real-life datasets illustrate that our algorithm achieves competitive accuracy, and yields generalisation bounds with non-vacuous values.
Joint work with Kento Nozawa (University of Tokyo & RIKEN, Japan). Published: 40
7.27 Axis 2: Revisiting clustering as matrix factorisation on the Stiefel manifold
Participants: Benjamin Guedj.
This work studies clustering for possibly high dimensional data (e.g. images, time series, gene expression data, and many other settings), and rephrases it as low rank matrix estimation in the PAC-Bayesian framework. Our approach leverages the well known Burer-Monteiro factorisation strategy from large scale optimisation, in the context of low rank estimation. Moreover, our Burer-Monteiro factors are shown to lie on a Stiefel manifold. We propose a new generalized Bayesian estimator for this problem and prove novel prediction bounds for clustering. We also devise a componentwise Langevin sampler on the Stiefel manifold to compute this estimator.
Joint work with Stéphane Chrétien (Université Lyon 2). Published: 35
7.28 Axis 2: Kernel-Based Ensemble Learning in Python
Participants: Benjamin Guedj.
We propose a new supervised learning algorithm for classification and regression problems where two or more preliminary predictors are available. We introduce KernelCobra, a non-linear learning strategy for combining an arbitrary number of initial predictors. KernelCobra builds on the COBRA algorithm which combined estimators based on a notion of proximity of predictions on the training data. While the COBRA algorithm used a binary threshold to declare which training data were close and to be used, we generalise this idea by using a kernel to better encapsulate the proximity information. Such a smoothing kernel provides more representative weights to each of the training points which are used to build the aggregate and final predictor, and KernelCobra systematically outperforms the COBRA algorithm. While COBRA is intended for regression, KernelCobra deals with classification and regression. KernelCobra is included as part of the open source Python package Pycobra (0.2.4 and onward). Numerical experiments were undertaken to assess the performance (in terms of pure prediction and computational complexity) of KernelCobra on real-life and synthetic datasets.
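The idea of replacing a binary proximity threshold by a smoothing kernel can be sketched as follows. This is a simplified illustration and not the Pycobra API: the function name, the Gaussian-type kernel and the `bandwidth` parameter are assumptions, and the actual KernelCobra weights and hyperparameters may differ.

```python
import math


def kernel_aggregate(machines, train_x, train_y, query, bandwidth):
    """Aggregate preliminary predictors at `query`.

    Each training response y_i is weighted by a kernel of the squared distance
    between the machines' predictions at x_i and at the query point.
    """
    query_preds = [m(query) for m in machines]
    num, den = 0.0, 0.0
    for x_i, y_i in zip(train_x, train_y):
        sq_dist = sum((m(x_i) - q) ** 2 for m, q in zip(machines, query_preds))
        weight = math.exp(-bandwidth * sq_dist)
        num += weight * y_i
        den += weight
    return num / den


# Two toy preliminary predictors and a training set with y = x^2.
machines = [lambda x: 2.0 * x, lambda x: x + 1.0]
train_x = [0.0, 1.0, 2.0, 3.0]
train_y = [x ** 2 for x in train_x]
prediction = kernel_aggregate(machines, train_x, train_y, query=2.0, bandwidth=50.0)
```

A large `bandwidth` concentrates the weight on training points whose predictions nearly coincide with those at the query, which approaches the thresholding behaviour of COBRA; a small one spreads the weight over all training points.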
Published: 25
7.29 Axis 2: Non-linear aggregation of filters to improve image denoising
Participants: Benjamin Guedj.
We introduce a novel aggregation method to efficiently perform image denoising. Preliminary filters are aggregated in a non-linear fashion, using a new metric of pixel proximity based on how the pool of filters reaches a consensus. We provide a theoretical bound to support our aggregation scheme, illustrate its numerical performance, and show that the aggregate significantly outperforms each of the preliminary filters.
Joint work with Juliette Rengot, École des Ponts ParisTech.
Published: 37
7.30 Axis 2: Multiple change-points detection with reproducing kernels
Participants: Alain Celisse.
We tackle the change-point problem with data belonging to a general set. We build a penalty for choosing the number of change-points in the kernel-based method of Harchaoui and Cappé 83. This penalty generalizes the one proposed by Lebarbier 85 for a one-dimensional signal changing only through its mean. We prove a non-asymptotic oracle inequality for the proposed method, thanks to a new concentration result for some function of Hilbert-space valued random variables. Experiments on synthetic and real data illustrate the accuracy of our method, showing that it can detect changes in the whole distribution of data, even when the mean and variance are constant.
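The kernel least-squares criterion underlying this family of methods can be sketched for the simplest case of a single change-point. The sketch below is a simplification under stated assumptions: it uses a Gaussian kernel with a hand-picked bandwidth, searches for exactly one change-point, and therefore needs no penalty; the penalty for choosing the number of change-points, which is the contribution described above, is not implemented. The synthetic signal keeps a constant mean and changes only in spread.

```python
import math
import random


def gram_matrix(xs, bandwidth):
    """Gaussian kernel Gram matrix of a one-dimensional sample."""
    return [[math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2)) for b in xs]
            for a in xs]


def block_sums(gram):
    """prefix[t] = sum of gram[i][j] over all i, j < t (gram is symmetric)."""
    prefix = [0.0] * (len(gram) + 1)
    for t in range(len(gram)):
        prefix[t + 1] = prefix[t] + 2.0 * sum(gram[t][:t]) + gram[t][t]
    return prefix


def best_single_change_point(xs, bandwidth=1.0):
    """Return t minimising the within-segment kernel scatter of xs[:t], xs[t:]."""
    n = len(xs)
    gram = gram_matrix(xs, bandwidth)
    left = block_sums(gram)
    # Reversing both axes turns trailing blocks into leading blocks.
    right = block_sums([row[::-1] for row in gram[::-1]])
    trace = sum(gram[i][i] for i in range(n))
    best_t, best_cost = None, float("inf")
    for t in range(1, n):
        cost = trace - left[t] / t - right[n - t] / (n - t)
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t


rng = random.Random(1)
# Same mean on both sides; only the spread changes at index 100.
signal = [rng.gauss(0.0, 0.1) for _ in range(100)] + \
         [rng.gauss(0.0, 3.0) for _ in range(100)]
t_hat = best_single_change_point(signal, bandwidth=1.0)
```

Because the cost is expressed through kernel evaluations only, the same code would apply to any data type for which a kernel is available, which is the sense of "data belonging to a general set".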
Joint work with Sylvain Arlot (Orsay) and Zaïd Harchaoui (Seattle). This work has been accepted in JMLR.
7.31 Axis 2: Analysis of early stopping rules based on discrepancy principle
Participants: Alain Celisse.
We describe a general unified framework for analyzing the statistical performance of early stopping rules based on the minimum discrepancy principle (DP). Finite-sample bounds such as deviation or oracle inequalities are derived with high probability. Since it turns out that DP suffers some deficiencies when estimating smooth functions, refinements involving smoothing of the residuals are introduced and analyzed. Theoretical bounds are established in the fixed design setting under mild assumptions such as the boundedness of the kernel. When focusing on the smoothed discrepancy principle, such bounds are even extended to the random design setting by means of a new change-of-norm argument.
Joint work with Markus Reiß (Humboldt) and Martin Wahl (Humboldt). This work has already been presented several times in seminars.
7.32 Axis 3: Short-term air temperature forecasting using Nonparametric Functional Data Analysis and SARMA models
Participants: Sophie Dabo-Niang.
Air temperature is a significant meteorological variable that affects social activities and economic sectors. In this paper, a non-parametric and a parametric approach are used to forecast hourly air temperature up to 24 h in advance. The former is a regression model in the Functional Data Analysis framework. The nonlinear regression operator is estimated using a kernel function. The smoothing parameter is obtained by a cross-validation procedure and used for the selection of the optimal number of closest curves. The other method applied is a Seasonal Autoregressive Moving Average (SARMA) model, the order of which is determined by the Bayesian Information Criterion. The obtained forecasts are combined using weights calculated based on the forecast errors. The results show that SARMA has a better performance for the first 6 forecasted hours, after which the Non-Parametric Functional Data Analysis (NPFDA) model provides superior results. Forecast pooling improves the accuracy of the forecasts.
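The pooling step can be illustrated with a short sketch. The exact weighting scheme is not given above, so the example assumes weights inversely proportional to each model's mean squared error on past forecasts; the function names and the toy numbers are hypothetical.

```python
def inverse_mse_weights(errors_a, errors_b):
    """Weights proportional to 1 / MSE of each model's past forecast errors."""
    mse_a = sum(e * e for e in errors_a) / len(errors_a)
    mse_b = sum(e * e for e in errors_b) / len(errors_b)
    inv_a, inv_b = 1.0 / mse_a, 1.0 / mse_b
    total = inv_a + inv_b
    return inv_a / total, inv_b / total


def combine_forecasts(forecast_a, forecast_b, weight_a, weight_b):
    """Pointwise weighted average of two forecast series."""
    return [weight_a * a + weight_b * b for a, b in zip(forecast_a, forecast_b)]


# Toy past errors: model A (e.g. SARMA) was more accurate than model B (e.g. NPFDA).
w_a, w_b = inverse_mse_weights([1.0, -1.0], [2.0, -2.0])
pooled = combine_forecasts([10.0, 11.0], [20.0, 21.0], w_a, w_b)
```

In a setting like the one described, the weights could be computed separately for each forecast horizon, so that the model that is more accurate at short horizons dominates there and the other dominates later.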
It is a joint work with Stelian Curceac (Rothamsted Research, United Kingdom), Camille Ternynck (CERIM, Université de Lille), Taha B.M.J. Ouarda (INRS, Québec, Canada) and Fateh Chebana (INRS, Québec, Canada). This work has been published in the journal Environmental Modelling and Software.
7.33 Axis 3: Mathematical Modeling and Study of Random or Deterministic Phenomena
Participants: Sophie Dabo-Niang.
In order to identify mathematical modeling (including functional data analysis) and interdisciplinary research issues in evolutionary biology, epidemiology, epistemology, environmental and social sciences encountered by researchers in Mayotte, the first international conference on mathematical modeling (CIMOM’18) was held in Dembéni, Mayotte, from November 15 to 17, 2018, at the Centre Universitaire de Formation et de Recherche. The objective was to focus on mathematical research with interdisciplinarity. This contribution is a book that discusses key aspects of recent developments in applied mathematical analysis and modeling. It was written after the conference, where a call for book chapters was made. The chapters were written in the form of journal articles, with new results extending the talks given during the conference, and were reviewed by independent reviewers and the book publishers. The book highlights a wide range of applications in the fields of biological and environmental sciences, epidemiology and social perspectives. Each chapter examines selected research problems and presents a balanced mix of theory and applications on some selected topics. Particular emphasis is placed on presenting the fundamental developments in mathematical analysis and modeling and highlighting the latest developments in different fields of probability and statistics. The chapters are presented independently and contain enough references to allow the reader to explore the various topics presented.
It is a joint work with Solym Manou-Abi and Jean-Jacques Salone (Centre Universitaire de Mayotte). This book is to appear at Wiley (ISTE).
7.34 Axis 3: Categorical functional data analysis
Participants: Cristian Preda, Quentin Grimonprez, Vincent Vandewalle.
Research on functional data analysis is very active. The R package “fda” is the best-known one implementing methodology for functional data. To the best of our knowledge, and quite surprisingly, there is no recent research devoted to categorical functional data, despite its ability to model real situations in different fields of application: health and medicine (status of a patient over time), economy (status of the market), sociology (evolution of social status), and so on. We have developed a methodology to visualize categorical functional data, reduce their dimension and extract features from them. For this purpose, the cfda R package has been developed. This has led to the preprint 72, which will be submitted to an international journal.
7.35 Axis 3: Scan Statistics
Participants: Cristian Preda, Alexandru Amarioarei.
The one-dimensional discrete scan statistic is considered over sequences of random variables generated by block factor dependence models. Viewed as the maximum of a 1-dependent stationary sequence, the distribution of the scan statistic is approximated accurately and sharp bounds are provided. The longest increasing run statistic is related to the scan statistic and its distribution is studied. The moving average process is a particular case of block factor, and the distribution of the associated scan statistic is approximated. Numerical results are presented.
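For readers unfamiliar with the objects involved, the sketch below computes a one-dimensional discrete scan statistic, i.e. the maximum of the sums over all windows of a fixed length, and applies it to a moving average of order one built from an underlying sequence, which is a simple block factor model. Only the statistic itself is computed; the approximation of its distribution and the associated bounds are not reproduced. Function names are chosen for the example.

```python
def scan_statistic(xs, window):
    """Maximum over all contiguous windows of length `window` of the window sum."""
    if not 1 <= window <= len(xs):
        raise ValueError("window must be between 1 and len(xs)")
    current = sum(xs[:window])
    best = current
    for i in range(window, len(xs)):
        # Slide the window one step: add the new point, drop the oldest one.
        current += xs[i] - xs[i - window]
        best = max(best, current)
    return best


def moving_average_block_factor(xi, a=1.0, b=1.0):
    """2-block factor X_i = a * xi_i + b * xi_{i+1}, a 1-dependent sequence."""
    return [a * xi[i] + b * xi[i + 1] for i in range(len(xi) - 1)]


bernoulli_scan = scan_statistic([0, 1, 1, 0, 1, 1, 1, 0], window=3)
ma_sequence = moving_average_block_factor([1, 0, 2, 1])
ma_scan = scan_statistic(ma_sequence, window=2)
```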
7.36 Axis 3: Clustering categorical functional data
Participants: Cristian Preda, Vincent Vandewalle, Vlad Stefan Barbu.
The objective of this research direction was: (i) to propose possible modelling approaches of categorical functional data and (ii) to investigate the identifiability problem of such models. A first modelling framework is to consider that an observed functional data path represents a sample path of Markov process and thus
7.37 Axis 3: Estimation of right-censored categorical functional data
Participants: Cristian Preda, Vincent Vandewalle, Vlad Stefan Barbu.
As mentioned in Section 7.36, we are interested in modelling categorical functional data by means of semi-Markov processes. These processes generalize Markov processes, in the sense that the sojourn time in a state can be arbitrarily distributed, as opposed to the Markov case. For this reason, semi-Markov processes are flexible tools, better suited to concrete applications than Markov processes 80. As in any modelling framework, one crucial point is to obtain reliable estimators of the parameters of the model. A very important feature in many applications (e.g. survival analysis, reliability, etc.) is to take censored data into account. In the presence of right-censored sample paths, the estimation of semi-Markov processes in continuous time is still an open problem, while for discrete-time semi-Markov processes only one existing study, in a non-parametric setting 87, is available. For this framework, we have already established the main setting, and derived the form of the
7.38 Axis 4: Statistical analysis of high-throughput proteomic data
Participants: Guillemette Marot, Vincent Vandewalle, Wilfried Heyse.
In November 2019, Wilfried Heyse started a PhD thesis funded by INSERM and supervised by Christophe Bauters, Guillemette Marot and Vincent Vandewalle. The aim is to identify, early after myocardial infarction (MI), patients at high risk of developing left ventricular remodelling (LVR), which is quantified by imaging one year after MI, or patients at high risk of death. For that purpose, a high-throughput proteomic approach is used. This technology allows the measurement of 5000 proteins simultaneously. In parallel to these measurements, each corresponding to the concentration of a protein in a plasma sample collected from one patient at a specific time, echocardiographic and clinical information has been collected on each of the 200 patients. One of the main challenges is to take into account the variations of the biomarkers over time (several measurement times), in order to improve the understanding of the biological mechanisms involved in LVR or in the survival of the patient. Preliminary results have been presented in 38, 79.
This is a joint work with Florence Pinet and Christophe Bauters from INSERM.
7.39 Axis 4: Reject Inference Methods in Credit Scoring
Participants: Christophe Biernacki, Adrien Ehrhardt, Philippe Heinrich, Vincent Vandewalle.
The granting process of all credit institutions rejects applicants having a low credit score. Developing a scorecard, i.e. a correspondence table between a client’s characteristics and their score, requires a learning dataset in which the target variable good/bad borrower is known. Rejected applicants are de facto excluded from the process. This biased learning population might have deep consequences on the scorecard relevance. Some works, mostly empirical ones, try to exploit rejected applicants in the scorecard building process. This work proposes a rational criterion to evaluate the quality of a scoring model for the existing Reject Inference methods and uncovers their implicit mathematical hypotheses. It is shown that, up to now, no such Reject Inference method can guarantee a better credit scorecard. These conclusions are illustrated on simulated and real data from the French branch of Crédit Agricole Consumer Finance (CACF). This has led to the preprint 63, which is now under revision at an international journal.
This is a joint work with Sébastien Beben of Crédit Agricole Consumer Finance.
7.40 Axis 4: Usability study
Participants: Vincent Vandewalle.
Since 2018, Vincent Vandewalle has been working with Alexandre Caron and Benoît Dervaux on estimating the number of problems and the value of information in the field of usability. Based on a usability study of a medical device (e.g. an insulin pump), the objective is to determine the number of possible problems linked to the use of the device as well as their respective occurrence probabilities. Estimating this number and the different probabilities is essential to determine whether or not an additional usability study should be conducted, and to determine the number of users to be included in this study to maximize the expected benefits.
The discovery process can be modeled by a binary matrix whose number of columns depends on the number of defects discovered by users. In this framework, they have proposed a probabilistic model of this matrix. They have embedded this model in a Bayesian framework where the number of problems and the probabilities of discovery are considered as random variables. In this framework, the article 31 has been published. It shows the interest of the approach compared to state-of-the-art approaches in usability. Beyond point estimation, the approach also makes it possible to obtain the distribution of the number of problems and their respective probabilities given the discovery matrix.
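The structure of the discovery matrix can be illustrated by a small simulation. The sketch assumes that each user encounters each problem independently with a problem-specific probability; this independence assumption, the probabilities and the function name are chosen for illustration only, and the Bayesian estimation described above is not implemented.

```python
import random


def simulate_discovery(n_users, problem_probs, rng):
    """Simulate which user hits which problem, then keep only discovered problems.

    Returns the full users-by-problems binary matrix (unknown in practice) and
    the observed matrix, whose columns are the problems seen at least once.
    """
    full = [[1 if rng.random() < p else 0 for p in problem_probs]
            for _ in range(n_users)]
    discovered = [j for j in range(len(problem_probs))
                  if any(row[j] for row in full)]
    observed = [[row[j] for j in discovered] for row in full]
    return full, observed


rng = random.Random(42)
true_probs = [0.5, 0.3, 0.1, 0.05, 0.01]  # hypothetical occurrence probabilities
full_matrix, observed_matrix = simulate_discovery(8, true_probs, rng)
n_discovered = len(observed_matrix[0])
```

Rare problems often produce all-zero columns, which are dropped from the observed matrix: this is why the number of observed columns is itself random and underestimates the true number of problems.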
The proposed model also makes it possible to measure the value of additional information in relation to the discovery process. In this framework, they are writing a second paper and developing the R package useval, which will be available soon. This work has been presented in a conference 48.
This is a joint work with Alexandre Caron and Benoît Dervaux both from ULR 2694: METRICS.
7.41 Axis 4: Artificial intelligence for aviation
Participants: Florent Dewez, Benjamin Guedj, Arthur Talpaert, Vincent Vandewalle.
Since November 2018, Benjamin Guedj and Vincent Vandewalle have been participating in the European PERF-AI project (Enhance Aircraft Performance and Optimization through the utilization of Artificial Intelligence) in partnership with the company Safety Line. In particular, using data collected during flights involves developing Machine Learning models to optimize, for example, the aircraft's trajectory with respect to fuel consumption. In this context they have hired Florent Dewez (post-doctoral researcher) and Arthur Talpaert (engineer).
The article 21 is now published. It explains how, using flight recording data, it is possible to implement learning models on variables that have not been directly observed, and in particular to predict the drag and lift coefficients as a function of the angle and speed of the aircraft.
A second article, about the optimization of the aircraft's trajectory based on a consumption model learned from the data, is about to be submitted and is available as a preprint 62. The originality of the approach consists in decomposing the trajectory on a functional basis, and thus carrying out the optimization on the coefficients of the decomposition on this basis, rather than approaching the problem from the angle of optimal control. Furthermore, to guarantee compliance with aeronautical constraints, we have proposed an approach penalized by a deviation term from reference flights. A generic Python module (PyRotor) to solve such optimization problems in conjunction with the proposed approach has been developed.
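The principle of optimizing basis coefficients under a penalty towards a reference can be sketched on a toy problem. This is not the PyRotor API: the sketch assumes a monomial basis, a separable quadratic cost in the coefficients, and a squared-distance penalty to the coefficients of a single reference trajectory, all chosen so that the minimizer has a closed form. Every name and number below is hypothetical.

```python
def optimal_coefficients(quad, lin, reference, penalty):
    """Minimise sum_k (q_k c_k^2 + w_k c_k) + penalty * sum_k (c_k - r_k)^2.

    The objective is separable, so each coefficient has the closed form
    c_k = (2 * penalty * r_k - w_k) / (2 * q_k + 2 * penalty).
    """
    return [(2.0 * penalty * r - w) / (2.0 * q + 2.0 * penalty)
            for q, w, r in zip(quad, lin, reference)]


def trajectory(coeffs, times):
    """Evaluate y(t) = sum_k c_k t^k on a grid of times (monomial basis)."""
    return [sum(c * t ** k for k, c in enumerate(coeffs)) for t in times]


quad, lin = [1.0, 2.0], [-2.0, -4.0]  # toy consumption model
reference = [3.0, 5.0]                # coefficients of a reference flight
unconstrained = optimal_coefficients(quad, lin, reference, penalty=0.0)
near_reference = optimal_coefficients(quad, lin, reference, penalty=1e9)
path = trajectory(unconstrained, [0.0, 1.0, 2.0])
```

The penalty weight interpolates between the unconstrained optimum of the consumption model and the reference flight: a larger weight keeps the optimized trajectory closer to trajectories known to satisfy the operational constraints.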
7.42 Axis 4: Domain Adaptation from a Pre-trained Source Model
Participants: Christophe Biernacki, Pascal Germain, Luxin Zhang.
The traditional statistical learning paradigm assumes that train and test data follow the same distribution. This rarely holds in real-life applications. The domain adaptation paradigm proposes a variety of techniques to overcome this issue. Most of the works in this area seek either a latent space where source and target data share the same distribution, or a transformation of the source distribution to match the target one. Both strategies require learning a model on the transformed source data. An original scenario is studied where one is given a model that has been constructed using expertise on source data that is no longer accessible. To use this model directly on target data, we propose to learn a transformation from the target domain to the source domain. To our knowledge, this is a new perspective on domain adaptation. This learning problem is introduced and formalized. We study the assumptions and the sufficient conditions required to guarantee a good accuracy when using the source model directly on transformed target data. By pursuing this idea, a new domain adaptation method based on optimal transport is proposed. We evaluate our method on a fraud detection problem. This work has been accepted to an international conference 42.
It is a joint work with Yacine Kessaci from Worldline company.
7.43 Other: Projection Under Pairwise Control
Participants: Christophe Biernacki.
Visualization of high-dimensional and possibly complex (non-continuous for instance) data onto a low-dimensional space may be difficult. Several projection methods have already been proposed for displaying such high-dimensional structures on a lower-dimensional space, but the information lost is not always easy to use. Here, a new projection paradigm is presented to describe a non-linear projection method that takes into account the projection quality of each projected point in the reduced space, this quality being directly available in the same scale as this reduced space. More specifically, this novel method allows a straightforward visualization of data in ℝ² with a simple reading of the approximation quality, and thus provides a novel variant of dimensionality reduction. This work has now been accepted in an international journal 13.
It is a joint work with Hiba Alawieh and Nicolas Wicker, both from Université de Lille.
7.44 Other: On the Local and Global Properties of the Gravitational Spheres of Influence
Participants: Christophe Biernacki.
We revisit the concept of sphere of gravitational activity, to which we give both a geometrical and physical meaning. This study aims to refine this concept in a much broader context that could, for instance, be applied to exo-planetary problems (in a Galactic stellar disc-StarPlanets system) to define a first order “border” of a planetary system. The methods used in this paper rely on classical Celestial Mechanics and develop the equations of motion in the framework of the 3-body problem (e.g. Star-Planet-Satellite System. We start with the basic definition of planet’s sphere of activity as the region of space in which it is feasible to assume a planet as the central body and the Sun as the perturbing body when computing perturbations of the satellite’s motion. We then investigate the geometrical properties and physical meaning of the ratios of Solar accelerations (central and perturbing) and planetary accelerations (central and perturbing), and the boundaries they define. We clearly distinguish throughout the paper between the sphere of activity, the Chebotarev sphere (a particular case of the sphere of activity), Laplace sphere, and the Hill sphere. The last two are often wrongfully thought to be one and the same. Furthermore, taking a closer look and comparing the ratio of the star’s accelerations (central/perturbing) to that of the planetary acceleration (central/perturbing) as a function of the planeto-centric distance, we have identified different dynamical regimes which are presented in the semi-analytical analysis. This work has been published in an international journal 30.
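The difference between the Laplace sphere and the Hill sphere can be seen numerically with the usual textbook approximations for a planet of mass m on a near-circular orbit of radius a around a star of mass M: r_Hill ≈ a (m / 3M)^(1/3) and r_Laplace ≈ a (m / M)^(2/5). These are the standard first-order formulas, used here as an assumption; the refined definitions and dynamical regimes studied in the paper are not reproduced. The Earth-to-Sun mass ratio and the astronomical unit below are rounded values.

```python
def hill_radius(semi_major_axis, mass_ratio):
    """Textbook Hill radius a * (m / (3 M))^(1/3), with mass_ratio = m / M."""
    return semi_major_axis * (mass_ratio / 3.0) ** (1.0 / 3.0)


def laplace_radius(semi_major_axis, mass_ratio):
    """Textbook Laplace sphere-of-influence radius a * (m / M)^(2/5)."""
    return semi_major_axis * mass_ratio ** (2.0 / 5.0)


AU_KM = 1.495978707e8           # astronomical unit in kilometres
EARTH_TO_SUN_MASS = 3.003e-6    # rounded Earth/Sun mass ratio

r_hill = hill_radius(AU_KM, EARTH_TO_SUN_MASS)
r_laplace = laplace_radius(AU_KM, EARTH_TO_SUN_MASS)
```

For the Earth, these formulas give a Hill radius of roughly 1.5 million km and a Laplace radius of roughly 0.92 million km, so the two spheres differ by a large margin even in this familiar case.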
This a joint work with Damya Souami from Observatoire de Paris and with Jacky Cresson from Université de Pau et des Pays de l’Adour.
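As a rough numerical illustration of why the Laplace and Hill spheres should not be conflated, the following Python sketch evaluates the classical textbook radii for a planet of mass m on a circular orbit of radius a around a star of mass M: r_L = a (m/M)^(2/5) for the Laplace sphere and r_H = a (m/(3M))^(1/3) for the Hill sphere. These closed forms and the approximate Sun-Earth values are standard approximations assumed here for illustration; they are not the refined boundaries derived in the paper, and the function names are hypothetical.

```python
def laplace_radius(a, m_planet, m_star):
    """Classical Laplace sphere-of-influence radius: a * (m/M)^(2/5)."""
    return a * (m_planet / m_star) ** (2.0 / 5.0)


def hill_radius(a, m_planet, m_star):
    """Classical Hill radius for a circular orbit: a * (m/(3M))^(1/3)."""
    return a * (m_planet / (3.0 * m_star)) ** (1.0 / 3.0)


# Approximate Sun-Earth values (assumed for illustration only).
A_EARTH_KM = 1.496e8   # semi-major axis of the Earth's orbit, in km
M_EARTH_KG = 5.972e24  # mass of the Earth, in kg
M_SUN_KG = 1.989e30    # mass of the Sun, in kg

r_laplace = laplace_radius(A_EARTH_KM, M_EARTH_KG, M_SUN_KG)
r_hill = hill_radius(A_EARTH_KM, M_EARTH_KG, M_SUN_KG)

if __name__ == "__main__":
    print(f"Laplace sphere radius: {r_laplace:.3e} km")
    print(f"Hill sphere radius:    {r_hill:.3e} km")
    print(f"Hill / Laplace ratio:  {r_hill / r_laplace:.2f}")
```

With these inputs the Hill radius comes out noticeably larger than the Laplace radius, which is one simple way to see that the two spheres are distinct objects.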
8 Bilateral contracts and grants with industry
8.1 Bilateral contracts with industry
COLAS company
Participants: Christophe Biernacki.
COLAS is a world leader in the construction and maintenance of transport infrastructure. This bilateral contract aims at classifying mixed data obtained from sensors in a study of the aging of road surfacing. The challenge is to deal with data that are often missing (sensor failures) and correlated (sensor proximity).
PAY-BACK company
Participants: Christophe Biernacki.
PAY-BACK Group is an audit firm specializing in the analysis and reliability of transactions. This bilateral contract aims at predicting store sales both from past sales (time series) and by exploiting external covariates (of different types).
ADULM
Participants: Sophie Dabo-Niang, Cristian Preda.
The main goal of this project with the Lille Metropole Urban Development and Planning Agency (ADULM) is to design a tool for the Territorial Coherence Scheme (SCoT) to monitor urban developments and develop territorial observation.
8.2 Bilateral grants with industry
Worldline
Participants: Christophe Biernacki.
Worldline is the new world-class leader in the payments and transactional services industry, with a global reach. A PhD began in Feb. 2019 with Luxin Zhang under the supervision of Christophe Biernacki, Pascal Germain (Laval University, Canada) and Yacine Kessaci (Worldline) on the topic of domain adaptation from a pre-trained source model (with application to fraud detection in electronic payments).
ADEO
Participants: Christophe Biernacki, Vincent Vandewalle.
Adeo is No. 1 in Europe and No. 3 worldwide in the DIY market. A PhD began in Dec. 2020 with Axel Potier under the supervision of Christophe Biernacki, Vincent Vandewalle, Matthieu Marbac (ENSAI) and Julien Favre (ADEO) on the topic of sales forecasting for “slow movers” (items sold in low quantities).
EIT-Sysbooster: Nokia - Apsys/Airbus
Participants: Alain Celisse.
Nokia and Airbus are two world-renowned companies working in the communications and transport sectors, respectively. The purpose of this contract is to perform root cause analysis in order to ultimately reduce the number of failures.
9 Partnerships and cooperations
9.1 International initiatives
9.1.1 Inria International Labs
6PAC
Participants: Benjamin Guedj.
- Title: Making Probably Approximately Correct Learning Active, Sequential, Structure-aware, Efficient, Ideal and Safe
- Duration: 2018–2022
- Partners: Machine Learning Group, CWI (The Netherlands)
- Summary: This project is rooted in statistical learning theory, which can be viewed as the theoretical foundation of machine learning. The most common framework is a setup in which one is given n training examples, and the goal is to build a predictor that is efficient on new (similar) data. This efficiency should be supported by PAC (Probably Approximately Correct) guarantees, e.g. upper bounds on the excess risk of a predictor that hold with high probability. Such guarantees, however, often hold under stringent assumptions that are rarely met in real-life applications, e.g. independent, identically distributed data. More realistic modelling of data has triggered many research efforts in several directions: first, accommodating more general data (e.g. dependent or heavy-tailed), and second, sequential learning, in which the predictor can be built on the fly while new data is gathered. We believe that an even more realistic paradigm is active learning, a setup in which the learner actively requests data (possibly facing constraints such as storage, velocity, cost, etc.) and adapts its queries to optimize its performance. The 3-year objective of 6PAC (where 6 stands for Sequential, Active, Efficient, Structured, Ideal, Safe: the six research directions we intend to contribute to) is to pave the way to new PAC generalization and sample-complexity upper and lower bounds beyond batch learning. Our ambition is to contribute to several learning setups, ranging from sequential learning (where data streams are collected) to adaptive and active learning (where data streams are requested by the learning algorithm).
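To make the notion of a PAC guarantee concrete, here is a minimal Python sketch of the simplest textbook case, which is exactly the i.i.d. batch setting that 6PAC aims to go beyond. For a single fixed predictor with losses in [0, 1] and n i.i.d. examples, Hoeffding's inequality states that, with probability at least 1 - delta, the gap between true and empirical risk is at most sqrt(ln(2/delta) / (2n)). This bounds the generalization gap of one fixed predictor rather than the excess risk of a learned one, and the function names are illustrative assumptions, not part of the project.

```python
import math


def hoeffding_epsilon(n, delta):
    """Width of the two-sided Hoeffding bound for losses in [0, 1].

    With probability at least 1 - delta over n i.i.d. examples,
    |true risk - empirical risk| <= the returned value
    (valid for a single predictor fixed before seeing the data).
    """
    if n <= 0 or not 0.0 < delta < 1.0:
        raise ValueError("need n > 0 and 0 < delta < 1")
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))


def samples_needed(epsilon, delta):
    """Smallest n for which the Hoeffding bound width is at most epsilon."""
    if epsilon <= 0.0 or not 0.0 < delta < 1.0:
        raise ValueError("need epsilon > 0 and 0 < delta < 1")
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))


if __name__ == "__main__":
    print(hoeffding_epsilon(1000, 0.05))  # bound width for n = 1000
    print(samples_needed(0.05, 0.05))     # sample size for a 0.05-wide bound
```

The bound shrinks at rate 1/sqrt(n) and depends only logarithmically on delta. It relies on independence and boundedness, the kind of assumption the summary describes as rarely met in practice.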
9.1.2 Inria international partners
Benjamin Guedj leads the Inria London Programme, an Inria initiative to increase the volume of scientific collaborations with the UK, and in particular with the London region, with University College London (United Kingdom) as the primary partner.
More details at https://
9.2 International research visitors
9.2.1 Visits of international scientists
- Apoorv Vikram Singh (IISc Bangalore, India) visited Hemant Tyagi from Oct 2019 to Jan 2020 to work on a project related to clustering of signed networks. This was partially funded by the Turing Institute, London. Apoorv worked under the joint supervision of Hemant Tyagi and Mihai Cucuringu (University of Oxford, United Kingdom) during this period.
- Déborah Sulem (PhD student, University of Oxford, United Kingdom) visited Hemant Tyagi on January 13–15, 2020.
9.3 European initiatives
9.3.1 FP7 & H2020 Projects
H2020 FAIR
Participants: Guillemette Marot.
- Acronym: FAIR
- Project title: Flagellin aerosol therapy as an immunomodulatory adjunct to the antibiotic treatment of drug-resistant bacterial pneumonia
- Coordinator: JC. Sirard (Inserm, CIIL)
- Duration: 4 years (2020–2023)
- Partners: Inserm, Université de Lille, Free University of Berlin (Germany), Epithelix (Switzerland), Aerogen (Ireland), Statens Serum Institute (Denmark), CHRU Tours, Academic Medical Center of the University of Amsterdam (The Netherlands), University of Southampton (United Kingdom), European Respiratory Society (Switzerland)
- Abstract: The FAIR project aims at evaluating an alternative adjunct strategy to standard-of-care antibiotics for treating pneumonia caused by antibiotic-resistant bacteria: activation of the innate immune system in the airways. Guillemette Marot is involved in this H2020 project as scientific head of the bilille platform, and will supervise an engineer for one year on the integration of omics data.
H2020 PERF-AI
Participants: Florent Dewez, Benjamin Guedj, Arthur Talpaert, Vincent Vandewalle.
- Acronym: PERF-AI
- Project title: Enhance Aircraft Performance and Optimisation through utilisation of Artificial Intelligence
- Coordinator: Pierre Jouniaux (Safety-Line)
- Duration: 2 years (2018–2020)
- Partners: Safety-Line
- Abstract: PERF-AI will apply Machine Learning techniques on flight data (parametric & non-parametric approaches) to accurately measure actual aircraft performance throughout its lifecycle.
Within current airline operations, both at flight preparation (on-ground) & at flight management (in-air) levels, the trajectory is first planned, then managed by the Flight Management System (FMS) using a single manufacturer’s performance model that is the same for every aircraft of the same type, & a weather forecast that is computed long before the flight. This induces a lack of accuracy during the planning phase, with a flight route pre-established at specific altitudes & speeds to optimize fuel burn from take-off to landing, using aircraft performances that are not those of the real aircraft. Also, the actual flight will usually deviate from the original plan because of Air Traffic Control (ATC) constraints, adverse weather, wind changes & tactical re-routing, without the possibility for the flight crew, either using the FMS or through connected services, to tactically recompute the trajectory in order to continuously optimize the flight path. This is in particular due to the limitations of the performance databases that the current systems are using.
Hence, PERF-AI focuses on identifying adequate machine learning algorithms, testing their accuracy & capability to perform statistical analysis of flight data, & developing mathematical models to optimize real flight trajectories with respect to the actual aircraft performance, thus minimizing fuel consumption throughout the flight.
The consortium consists of Safety-Line & Inria, which have full expertise in Aircraft Performance & Data Science & are hence able to propose, test & validate different statistical models that will make it possible to accurately solve some optimization challenges & implement them in an operational environment.
PERF-AI total grant request to the CSJU is 568 550 € with total project duration of 24 months.
9.4 National initiatives
COVIDOM project
During the 1st lockdown in France, Christophe Biernacki supervised a task force composed of three Inria research teams (MODAL, STATIFY, TAU) for analysing data coming from the medical database COVIDOM of AP-HP concerning suspected COVID-19 patients. This project was included in the overall national Inria “mission COVID” initiative.
Programme of Investments for the Future (PIA)
Bilille is a member of the PIA “Infrastructures en biologie-santé”
IFB, French Institute of Bioinformatics (https://
RHU PreciNASH
Participants: Guillemette Marot.
- Acronym: PreciNASH
- Project title: Non-alcoholic steato-hepatitis (NASH) from disease stratification to novel therapeutic approaches
- Coordinator: François Pattou (Université de Lille, Inserm, CHRU Lille)
- Duration: 5 years
- Partners: FHU Integra and Sanofi
- Abstract: PreciNASH, a project coordinated by Pr. F. Pattou (UMR 859, EGID), aims at better understanding non-alcoholic steatohepatitis (NASH) and improving its diagnosis and care. In this RHU, Guillemette Marot supervises a 2-year post-doc, as her team ULR 2694 METRICS is a member of the FHU Integra. METRICS is involved in WP1 for the development of a clinical-biological model for the prediction of NASH. Other partners of the FHU are UMR 859, UMR 1011 and UMR 8199, these last three teams being part of the labex EGID (European Genomic Institute for Diabetes). Sanofi is the main industrial partner of the RHU PreciNASH. The whole project will last 5 years (2016–2021).
CNRS PEPS Blanc –– BayesRealForRNN project
Participants: Pascal Germain, Vera Shalaeva.
- Acronym: BayesRealForRNN
- Project title: PAC-Bayesian theory for recurrent neural networks: a control theoretic approach
- Coordinator: Mihaly Petreczky (CNRS, UMR 9189 CRIStAL, Université de Lille)
- Year: 2019
- Abstract: The project proposes to analyze the mathematical correctness of deep learning algorithms by combining techniques from control theory and PAC-Bayesian statistical theory. More precisely, the project proposes to concentrate on recurrent neural networks (RNNs), develop their structure theory using techniques from control theory, and then apply this structure theory to derive PAC-Bayesian error bounds for RNNs.
CNRS AMIES PEPS 2 — DiagChange project
Participants: Cristian Preda, Quentin Grimonprez.
- Acronym: DiagChange
- Year: 2019
- Abstract: The project proposes to study the detection of changes in distribution for multivariate signals in an industrial context. The project is in collaboration with the DiagRAMS start-up.
CNRS AMIES PEPS 1 — PIVISCoT
Participants: Sophie Dabo-Niang, Cristian Preda.
- Year: 2020
- Abstract: The project aims to create software for the Territorial Coherence Scheme (SCoT) in Lille in order to monitor urban developments and develop territorial observation.
AMIES PEPS 2 — MadiPa
Participants: Stéphane Girard, Serge Iovleff.
- Acronym: MadiPa
- Project title: Modèles Auto-associatifs pour la Dispersion de Polluants dans l’Atmosphère
- Duration: 18 months (starting in December 2019)
- Partners: Société Phimeca (http://phimeca.com/), Mistis team Inria Grenoble Rhône-Alpes
- Abstract: Our goal is to develop a method for predicting the dispersion of pollutants in the atmosphere from an initial emission map and meteorological data. A map of the probabilities of exceeding a critical threshold of pollutants will be estimated thanks to the construction of a meta-model: the large dimension of the problem is reduced by the use of auto-associative models, a non-linear extension of Principal Component Analysis.
9.4.1 ANR
APRIORI
Participants: Benjamin Guedj, Pascal Germain, Hemant Tyagi, Vera Shalaeva.
- Type: ANR PRC
- Acronym: APRIORI
- Project title: PAC-Bayesian theory and algorithms for deep learning and representation learning
- Coordinator: Emilie Morvant (Université Jean Monnet)
- Duration: 2019–2023
- Funding: 300k EUR
- Partners: MODAL, Laboratoire Hubert Curien (UMR CNRS 5516)
BEAGLE
Participants: Benjamin Guedj, Pascal Germain.
- Type: ANR JCJC
- Acronym: BEAGLE
- Duration: 2019–2023
- Project title: PAC-Bayesian theory and algorithms for agnostic learning
- Funding: 180k EUR
- Partners: Pierre Alquier (RIKEN AIP, Japan), Peter Grünwald (CWI, The Netherlands), Rémi Bardenet (UMR CRIStAL 9189)
SMILE
Participants: Christophe Biernacki, Vincent Vandewalle.
- Acronym: SMILE
- Duration: 2018–2022
- Project title: Statistical Modeling and Inference for unsupervised Learning at LargE-Scale
- Coordinator: Faicel Chamroukhi (LMNO, Université de Caen)
- Partners: MODAL, LMNO UMR CNRS 6139 (Caen), LMRS UMR CNRS 6085 (Rouen), LIS UMR CNRS 7020 (Toulon)
TheraSCUD2022
Participants: Guillemette Marot.
- Acronym: TheraSCUD2022
- Project title: Targeting the IL-20/IL-22 balance to restore pulmonary, intestinal and metabolic homeostasis after cigarette smoking and unhealthy diet
- Coordinator: P. Gosset (Institut Pasteur de Lille)
- Duration: 3 years (2017–2020)
- Partners: CIIL Institut Pasteur de Lille and UMR 1019 INRA Clermont-Ferrand
- Abstract: The TheraSCUD2022 project studies inflammatory disorders associated with cigarette smoking and unhealthy diet (SCUD). Guillemette Marot is involved in this ANR project as head of the bilille platform, and will supervise an engineer for one year on the integration of omics data.
9.4.2 Working groups
- Sophie Dabo-Niang belongs to the following working groups:
- STAFAV (STatistiques pour l'Afrique Francophone et Applications au Vivant)
- ERCIM Working Group on Computational and Methodological Statistics, Nonparametric Statistics Team
- Franco-African IRN (International Research Network) in Mathematics, funded by CNRS
- ONCOLille (Cancer Research Institute in Lille)
- Benjamin Guedj belongs to the following working groups (GdR) of CNRS:
- ISIS (local referee for Inria Lille - Nord Europe)
- MaDICS
- MASCOT-NUM (local referee for Inria Lille - Nord Europe)
- Guillemette Marot belongs to the StatOmique working group
9.5 Regional initiatives
9.5.1 bilille, the bioinformatics platform of Lille
Participants: Guillemette Marot, Maxime Brunin, Iheb Eladib.
bilille, the bioinformatics platform of Lille, officially joined UMS 2014/US 41 PLBS (Plateformes Lilloises en Biologie Santé) in January 2020. In 2020, Guillemette Marot co-headed the platform with Hélène Touzet (CNRS, CRIStAL). Inria employed 2 engineers for this platform:
- M. Brunin, who participated in the development of the visCorVar tool, which facilitates multi-block analysis for statistical integration of omics data, and in the analyses of the TheraSCUD2022 ANR project.
- I. Eladib, who participated in the development of tools for bilille cloud, in order to simplify and optimize its use.
More information about the platform is available at
https://
Collaborations of the year linked to bilille
Participants: Guillemette Marot.
Guillemette Marot has supervised the data analysis, or provided support in testing biostatistics tools, for the following research projects involving engineers from bilille (only the name of the principal investigator of each project is given, even if several partners are sometimes involved):
- CIIL, L. Poulin, InflammReg
- Infinite, V. Sobanski, Evapass
- U1011, Y. Sebti, Circaregen
- U1011, D. Dombrowicz, DeconImmunMetab
10 Dissemination
10.1 Promoting scientific activities
10.1.1 Scientific events: organisation
General chair, scientific chair
- Benjamin Guedj has been appointed (March 2020) general local chair of COLT 2022 to be held in London
- Hemant Tyagi is the organizer of the MODAL team scientific seminar
- Sophie Dabo-Niang is co-chair of the group Statistics, applied math and computer science of the Pan-African Scientific Research Council, funded by Princeton University (USA)
Member of the organizing committees
- Sophie Dabo-Niang is co-chair of the Organizing Committee of the Workshop 3rd Conference on Econometrics for Environment, December 2020, Lille.
10.1.2 Scientific events: selection
Christophe Biernacki was president of the scientific committee of JdS 2020, the annual national meeting of the French statistical society (SFdS).
Reviewer
- Sophie Dabo-Niang has reviewed several papers for several journals during 2020 including Spatial Statistics, JSPI, Metrika, JRSS C
- Benjamin Guedj has served as reviewer for most top-tier machine learning conferences, including AISTATS, ALT, COLT, ICML, NeurIPS
- Hemant Tyagi has reviewed for the following conferences during 2020: International Conference on Learning Representations (ICLR), International Conference on Machine Learning (ICML) and Symposium on Computational Geometry (SoCG)
- Christophe Biernacki has reviewed for the Cap2020 (Conférence sur l'Apprentissage Automatique) and also for several journals (IMAIAI, STCO, LSSP, SAM, GSCS, TNNLS, ESWA, JMIV, JCGS)
10.1.3 Journal
Member of the editorial boards
- Sophie Dabo-Niang is a member of the editorial boards of Revista Colombiana de Estadística and Journal of Statistical Modeling and Analytics
- Benjamin Guedj is a member of the Editorial Board of reviewers for the Journal of Machine Learning Research (JMLR), since June 2020 and an Associate Editor and member of the Editorial Board for the journal Information and Inference (Oxford), since March 2020
- Christophe Biernacki is an Associate Editor of the North-Western European Journal of Mathematics (NWEJM) and a Guest Editor for the Special Issue on Innovations in Model-Based Clustering and Classification of the journal Advances in Data Analysis and Classification (ADAC)
- Cristian Preda is an Associate Editor for Methodology and Computing in Applied Probability Journal (https://www.springer.com/journal/11009) and Romanian Journal of Mathematics and Computer Science (http://www.rjm-cs.ro)
Reviewing activities
- Hemant Tyagi has reviewed for the following journals during 2020: Journal of the Royal Statistical Society (JRSS), IEEE Open Journal of Signal Processing, Mathematical Reviews.
- Vincent Vandewalle has reviewed for the following journals during 2020: JCGS, Spatial Statistics, Methodology & Computing in Applied Probability.
10.1.4 Invited talks
Benjamin Guedj has given a number of scientific talks in seminars, including at
- Oxford University (United Kingdom)
- UCL (United Kingdom)
- The Alan Turing Institute (United Kingdom)
- RIKEN (Japan)
Sophie Dabo-Niang has been invited to:
- NEF (Next Einstein Forum) 2020, December 8-10, 2020. Panel on The contribution of Mathematical Sciences in supporting robust disease prevention and modelling in Africa.
- AIMS South-Africa webinar, November 4, 2020. Statistical modeling of Spatial Big data and Applications.
Hemant Tyagi:
- Cafe de Sciences, Inria Lille, January 2020.
- STADIUS seminar, KU Leuven, February 2020.
- Séminaire SAMM : Statistique, Analyse et Modélisation Multidisciplinaire, Université Paris 1, November 2020.
10.1.5 Leadership within the scientific community
Sophie Dabo-Niang is:
- Chair of the Committee for Developing Countries (CDC) of the EMS (European Mathematical Society), 2019-2022.
- Member of the executive committee and scientific officer of CIMPA
Guillemette Marot is scientific head of bilille, the bioinformatics platform of Lille. More information about the platform is available at
https://
10.1.6 Scientific expertise
Sophie Dabo-Niang serves as an expert for
- L'Oreal Women in Science Awards
- HCERES
10.2 Teaching - Supervision - Juries
10.2.1 Teaching
- Pascal Germain taught
- Master: Introduction aux réseaux de neurones, 15h, M2, Université de Lille, France
- Hemant Tyagi is teaching
- Master: Statistics I, 24h, M1, Centrale Lille, France (Nov. 2020 - 7 Jan. 2021)
- Master: Statistics II, 24h, M1, Centrale Lille, France (11 Jan. 2021 - 18 March 2021)
- Sophie Dabo-Niang is teaching
- Master: Spatial Statistics, 24h, M2, Université de Lille, France
- Master: Advanced Statistics, 24h, M2, Université de Lille, France
- Master: Multivariate Data Analyses, 24h, M2, Université de Lille, France
- Licence: Probability, 24h, L2, Université de Lille, France
- Licence: Multivariate Statistics, 24h, L3, Université de Lille, France
- Guillemette Marot is teaching
- Licence: Biostatistics, 15h, L1, Université de Lille (Faculty of Medicine), France
- Master: Biostatistics, 62h, M1, Université de Lille (Faculty of Medicine), France
- Master: Supervised classification, 34h, M1, Polytech'Lille, France
- Master: Biostatistics, 20h, M1, Université de Lille (Departments of Computer Science and Biology), France
- Master: Statistical analysis of omics data, 22h, M2, Université de Lille (Department of Mathematics), France
- Doctorat: Artificial intelligence and health, 7h, Université de Lille (Faculty of Medicine), France
- Cristian Preda is teaching
- Polytech'Lille engineer school: Linear Models, 48h.
- Polytech'Lille engineer school: Advanced statistics, 48h.
- Polytech'Lille engineer school: Biostatistics, 10h.
- Polytech'Lille engineer school: Supervised clustering, 24h. France
- Christophe Biernacki is teaching
- New Master Data Science: Statistics, 24h, M1, Université de Lille, France
- Benjamin Guedj is teaching
- Advanced machine learning (M2, 6h), University College London, United Kingdom
- Serge Iovleff is teaching
- Licence: Analyse et méthodes numériques, 56h, Université de Lille, DUT Informatique
- Licence: R.O. et aide à la décision, 32h, Université de Lille, DUT Informatique
- Vincent Vandewalle is teaching
- Licence: Probability, 60h, Université de Lille, DUT STID
- Licence: Case study in statistics, 45h, Université de Lille, DUT STID
- Licence: R programming, 45h, Université de Lille, DUT STID
- Licence: Supervised clustering, 32h, Université de Lille, DUT STID
- Licence: Analysis, 24h, Université de Lille, DUT STID
10.2.2 Supervision
PhD defense:
- Arthur Leroy, December 9th 2020 on “Apprentissage de données fonctionnelles par modèles multi-tâches : application à la prédiction de performances sportives”
- Yaroslav Averyanov, December 15th 2020, supervised by Alain Celisse and Cristian Preda on “Designing and analyzing new early stopping rules for saving computational resources”
- Margot Selosse, November 13th 2020, supervised by Christophe Biernacki and Julien Jacques on “Introducing parsimony to analyse complex data with model-based clustering”
PhD in progress:
- Axel Potier, Sale prediction for low turn-over products, November 2020, Christophe Biernacki, Matthieu Marbac, Vincent Vandewalle
- Felix Biggs, Generative models and kernels, University College London (United Kingdom), Sep 2019, Benjamin Guedj
- Antoine Vendeville, Learning on graph to stop the propagation of fake news, University College London (United Kingdom), Sep 2019, Benjamin Guedj
- Luxin Zhang, Domain adaptation from a pre-trained source model – Application to fraud detection in electronic payments, February 2019, Christophe Biernacki, Pascal Germain, Yacine Kessaci
- Paul Viallard, Interpreting representation learning through PAC-Bayes theory, September 2019, Amaury Habrard, Emilie Morvant, Pascal Germain
- Dang Khoi Pham, Planning and re-planning of nurses in an oncology department using a multi-objective and interdisciplinary approach, September 2016, Sophie Dabo-Niang
- Solange Doumun, Performance evaluation and contribution to the development of multispectral image analysis strategies for automatic and rapid diagnosis of malaria, December 2018, Sophie Dabo-Niang
- Alaa Ali Ayad, Statistical modeling of large spatial data and its applications in health, September 2018, Sophie Dabo-Niang
- Wilfried Heyse, Prise en compte de la structure temporelle dans l'analyse statistique de données protéomiques à haut débit, October 2019, Christophe Bauters, Guillemette Marot and Vincent Vandewalle
- Margot Selosse, October 2017, Christophe Biernacki and Julien Jacques
- Filippo Antonazzo, October 2019, Christophe Biernacki and Christine Keribin
- Eglantine Karlé, November 2020, Hemant Tyagi and Cristian Preda
- Guillaume Braun, January 2020, Christophe Biernacki and Hemant Tyagi
- Rajeev Bopche, September 2020, Christophe Biernacki and Martine Vaxillaire
- Antonin Schrab, September 2020, co-supervised by Arthur Gretton and Benjamin Guedj, University College London (United Kingdom)
- Reuben Adams, September 2020, co-supervised by John Shawe-Taylor and Benjamin Guedj, University College London (United Kingdom)
10.2.3 Juries
- Sophie Dabo-Niang acted as a reviewer and an examiner for PhD theses
- Benjamin Guedj has been the discussion leader for the licentiate thesis of Fredrik Hellström on December 16th, 2020, at Chalmers University (Sweden)
- Benjamin Guedj has been a member of 2 hiring panels for Inria permanent researchers
- Guillemette Marot acted as an examiner for the PhD thesis of Audrey Hulot, Nov 2020 (Université Paris-Saclay) and sat on a research engineer (IR) jury, Oct 2020 (Université de Lille)
- Christophe Biernacki acted as a reviewer for four PhD theses and as an examiner for two HDR defenses
- Vincent Vandewalle participated in an MC jury at Université d'Avignon, May 2020
- Cristian Preda acted as a referee for the HDR defense of Christophe Crambes, Université de Montpellier 2, June 30, 2020
- Cristian Preda acted as a referee for the HDR defense of Dan Lascu, November 19, 2020, Universitatea Ovidiu, Constanta (Romania)
11 Scientific production
11.1 Major publications
- 1 article Simpler PAC-Bayesian Bounds for Hostile Data Machine Learning 2018
- 2 article An R Package and C++ library for Latent block models: Theory, usage and applications Journal of Statistical Software 2016
- 3 article Unifying Data Units and Models in (Co-)Clustering Advances in Data Analysis and Classification 12 41 May 2018
- 4 article Optimal cross-validation in density estimation with the L2-loss The Annals of Statistics 42 5 2014, 1879--1910
- 5 article Nonparametric prediction in the multivariate spatial context Journal of Nonparametric Statistics 28 2 2016, 428-458
- 6 article The logic of transcriptional regulator recruitment architecture at cis-regulatory modules controlling liver functions Genome Research 27 6 June 2017, 985--996
- 7 inproceedings Dichotomize and Generalize: PAC-Bayesian Binary Activated Deep Neural Networks NeurIPS 2019 Vancouver, Canada December 2019
- 8 article Model-based clustering of Gaussian copulas for mixed data Communications in Statistics - Theory and Methods December 2016
- 9 article Parametrizations, fixed and random effects Journal of Multivariate Analysis 154 February 2017, 162--176
- 10 article Learning general sparse additive models from point queries in high dimensions Constructive Approximation January 2019
11.2 Publications of the year
International journals
- 11 article Semiparametric estimation with spatially correlated recurrent events Scandinavian Journal of Statistics June 2020
- 12 article Partially Linear Spatial Probit Models Annales de l'ISUP December 2020
- 13 article Projection under pairwise distance controls Communications in Statistics - Theory and Methods 2020
- 14 article Change point detection of flood events using a functional data framework Advances in Water Resources 137 March 2020, 103522
- 15 article Gaussian Based Visualization of Gaussian and Non-Gaussian Based Clustering Journal of Classification July 2020
- 16 article A kernel discriminant analysis for spatially dependent data Distributed and Parallel Databases August 2020
- 17 article Multi-kernel unmixing and super-resolution using the Modified Matrix Pencil method Journal of Fourier Analysis and Applications 26 18 January 2020
- 18 article Provably robust estimation of modulo 1 samples of a smooth function with applications to phase unwrapping Journal of Machine Learning Research 21 32 January 2020, 1-77
- 19 article Ranking and synchronization from pairwise measurements via SVD Journal of Machine Learning Research 22 19 February 2021, 1-63
- 20 article Kernel regression estimation with errors-in-variables for random fields Afrika Matematika 31 2020, 29–56
- 21 article From industry-wide parameters to aircraft-centric on-flight inference: improving aeronautics performance prediction with machine learning Data-Centric Engineering October 2020
- 22 article Detection and segmentation of erythrocytes in multispectral label-free blood smear images for automatic cell counting Journal of Spectral Imaging 9 Article ID a10 September 2020
- 23 article A novel laser-based method to measure the adsorption energy on carbonaceous surfaces Carbon 173 March 2021, 540-556
- 24 article PAC-Bayes and Domain Adaptation Neurocomputing 379 2020, 379-397
- 25 article Kernel-Based Ensemble Learning in Python Information 11 2 February 2020, 63
- 26 article One Dimensional Discrete Scan Statistics for Dependent Models and Some Related Problems Mathematics 8 4 April 2020, 576
- 27 article Model-based co-clustering for mixed type data Computational Statistics and Data Analysis 144 2020, 106866
- 28 article Textual data summarization using the Self-Organized Co-Clustering model Pattern Recognition February 2020
- 29 article ordinalClust: An R Package to Analyze Ordinal Data The R Journal 12 2 January 2021
- 30 article On the local and global properties of the gravitational spheres of influence Monthly Notices of the Royal Astronomical Society 496 4 June 2020, 4287–429
- 31 article Estimating the number of usability problems affecting medical devices: modelling the discovery matrix BMC Medical Research Methodology 20 1 September 2020
- 32 article Multi-Partitions Subspace Clustering Mathematics 8 4 April 2020, 597
International peer-reviewed conferences
- 33 inproceedings A binned technique for scalable model-based clustering on huge datasets MBC2 - Models and Learning for Clustering and Classification Journal ADAC - Advances in Data Analysis and Classification, Catania, Italy September 2020
- 34 inproceedings Clustering on multilayer graphs with missing values Journée de Statistique de la SFdS Nice, France May 2020
- 35 inproceedings Revisiting clustering as matrix factorisation on the Stiefel manifold LOD 2020 - the Sixth International Conference on Machine Learning, Optimisation and Data Science Siena, Italy July 2020
- 36 inproceedings Online k-means Clustering AISTATS 2021 - The 24th International Conference on Artificial Intelligence and Statistics Virtual, France 2021
- 37 inproceedings Non-linear aggregation of filters to improve image denoising Computing Conference 2020 London, United Kingdom July 2020
- 38 inproceedings Proteomic signature for early diagnosis of left ventricular remodeling after myocardial infarction Printemps de la cardiologie 2020 Grenoble, France October 2020
- 39 inproceedings PAC-Bayesian Bound for the Conditional Value at Risk NeurIPS 2020 Vancouver / Virtual, Canada December 2020
- 40 inproceedings PAC-Bayesian Contrastive Unsupervised Representation Learning UAI 2020 - Conference on Uncertainty in Artificial Intelligence Toronto, Canada August 2020
- 41 inproceedings Improved PAC-Bayesian Bounds for Linear Regression AAAI 2020 - Thirty-Fourth AAAI Conference on Artificial Intelligence New York, United States February 2020
- 42 inproceedings Target to Source Coordinate-wise Adaptation of Pre-trained Models ECML PKDD 2020 - The European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases Ghent / Virtual, Belgium September 2020
National peer-reviewed Conferences
- 43 inproceedings Estimation of univariate Gaussian mixtures for huge raw datasets by using binned datasets JDS 2020 - 52ème Journées de Statistiques de la Société Française de Statistique Nice, France May 2020
Conferences without proceedings
- 44 inproceedings Scan statistics for some dependent models. Applications. STATMOD2020 Statistical Modeling with Applications Bucharest, Romania February 2021
- 45 inproceedings The contribution of Mathematical Sciences in supporting robust disease prevention and modelling in Africa The contribution of Mathematical Sciences in supporting robust disease prevention and modelling in Africa Virtual Meeting, South Africa December 2020
- 46 inproceedings A bumpy journey: exploring deep Gaussian mixture models I Can't Believe It's Not Better @ NeurIPS 2020 Vancouver, Canada December 2020
- 47 inproceedings Co-clustering contraint pour le résumé de matrices document-terme JdS 2020 - 52èmes Journées de Statistique de la Société Française de Statistique Nice, France May 2020
- 48 inproceedings Estimation du nombre de problèmes et détermination du nombre de sujets nécessaires dans les études d’utilisabilité : une approche bayésienne Journées Biostatistiques 2020 - GDR « Statistiques & santé » Paris, France October 2020
Scientific book chapters
- 49 inbook Clustering spatial functional data Geostatistical Functional Data Analysis: Theory and Methods. Editors: Jorge Mateu, Ramon Giraldo Geostatistical Functional Data Analysis: Theory and Methods John Wiley and Sons, Chichester. ISBN: 978-1-119-38784-8 January 2021
Edition (books, proceedings, special issue of a journal)
- 50 book Mathematical Modeling and Study of Random or Deterministic Phenomena Wiley February 2020
Doctoral dissertations and habilitation theses
- 51 thesis Designing and analyzing new early stopping rules for saving computational resources Université de Lille; Inria December 2020
- 52 thesis Contribution to model-based clustering of heterogeneous data Université de Lille January 2021
Reports & preprints
- 53 misc Differentiable PAC-Bayes Objectives with Partially Aggregated Neural Networks June 2020
- 54 report Indicateurs de suivi de l'activité scientifique de l'Inria Inria December 2020
- 55 misc A PAC-Bayesian Perspective on Structured Prediction with Implicit Loss Embeddings December 2020
- 56 misc Analyzing the discrepancy principle for kernelized spectral filter learning algorithms April 2020
- 57 misc Regularized spectral methods for clustering signed networks January 2021
- 58 misc An extension of the angular synchronization problem to the heterogeneous setting January 2021
- 59 misc A Novel Unstained Blood Smears Multispectral Images Normalization. Application to Unstained Malaria Infected Blood Smear. February 2021
- 60 misc Functional spatial principal Component Analysis and Application to demography February 2021
- 61 misc Clustering DNA sequences for phylogenetic trees using a functional data framework February 2021
- 62 misc An end-to-end data-driven optimisation framework for constrained trajectories November 2020
- 63 misc Reject Inference Methods in Credit Scoring: A rational review December 2020
- 64 misc Denoising modulo samples: k-NN regression and tightness of SDP relaxation January 2021
- 65 misc Parameter-Wise Co-Clustering for High-Dimensional Data September 2020
- 66 misc PAC-Bayes unleashed: generalisation bounds with unbounded losses June 2020
- 67 misc Upper and Lower Bounds on the Performance of Kernel PCA December 2020
- 68 misc Block clustering of Binary Data with Gaussian Co-variables October 2020
- 69 misc Cluster-Specific Predictions with Multi-Task Gaussian Processes November 2020
- 70 misc MAGMA: Inference and Prediction with Multi-Task Gaussian Processes July 2020
- 71 misc Simultaneous semi-parametric estimation of clustering and regression December 2020
- 72 misc cfda: an R Package for Categorical Functional Data Analysis October 2020
- 73 misc Estimation of extreme tail index for β-mixing random fields February 2021
- 74 misc An asymptotic approximation for the extended Bass diffusion model and application to pandemic outbreaks February 2021
- 75 misc Error analysis for denoising smooth modulo signals on a graph January 2021
- 76 misc Forecasting elections results via the voter model with stubborn nodes September 2020
- 77 misc How opinions crystallise: an analysis of polarisation in the voter model June 2020
Other scientific publications
- 78 misc Label switching in mixtures Glasgow, United Kingdom, France July 2021
- 79 misc Proteomic signature for early diagnosis of left ventricular remodeling after myocardial infarction Grenoble / Virtual, France October 2020
11.3 Cited publications
- 80 incollection Reliability theory for discrete-time semi-Markov systems Semi-Markov Chains and Hidden Semi-Markov Models toward Applications Springer 2008, 1--30
- 81 article Estimation in the Mixture of Markov Chains Moving With Different Speeds Journal of the American Statistical Association 100(471) 2005, 1046--1053
- 82 inproceedings On mixtures of Markov chains Proceedings of the 30th International Conference on Neural Information Processing Systems Citeseer 2016, 3449--3457
- 83 inproceedings Retrospective Mutiple Change-Point Estimation with Kernels 2007 IEEE/SP 14th Workshop on Statistical Signal Processing 2007, 768--772
- 84 article Generalization in Deep Learning CoRR abs/1710.05468 2017, URL: http://arxiv.org/abs/1710.05468
- 85 article Detecting multiple change-points in the mean of Gaussian process by model selection Signal Processing 85(4) 2005, 717--736 URL: https://www.sciencedirect.com/science/article/pii/S0165168404003196
- 86 inproceedings A Theoretical Analysis of Contrastive Unsupervised Representation Learning Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA 2019, 5628--5637 URL: http://proceedings.mlr.press/v97/saunshi19a.html
- 87 article Exact MLE and asymptotic properties for nonparametric semi-Markov models Journal of Nonparametric Statistics 23(3) 2011, 719--739