Keywords
Computer Science and Digital Science
- A3.1.4. Uncertain data
- A3.2.3. Inference
- A3.3.2. Data mining
- A3.3.3. Big data analysis
- A3.4.1. Supervised learning
- A3.4.2. Unsupervised learning
- A3.4.5. Bayesian methods
- A3.4.7. Kernel methods
- A5.2. Data visualization
- A5.9.2. Estimation, modeling
- A6.2.3. Probabilistic methods
- A6.2.4. Statistical methods
- A6.3.3. Data processing
- A9.2. Machine learning
Other Research Topics and Application Domains
- B2.2.3. Cancer
- B9.5.6. Data science
- B9.6.3. Economy, Finance
- B9.6.5. Sociology
1 Team members, visitors, external collaborators
Research Scientists
- Christophe Biernacki [Inria, Senior Researcher, team leader until Nov 2020, HDR]
- Benjamin Guedj [Inria, Researcher]
- Hemant Tyagi [Inria, Researcher]
Faculty Members
- Cristian Preda [Université de Lille, Professor, team leader from Dec 2020, HDR]
- Vlad Barbu [Université de Rouen, Associate Professor, until Feb 2020, HDR]
- Alain Celisse [Université de Lille, Associate Professor, HDR]
- Sophie Dabo-Niang [Université de Lille, Professor, HDR]
- Philippe Heinrich [Université de Lille, Associate Professor]
- Serge Iovleff [Université de Lille, Associate Professor]
- Guillemette Marot [Université de Lille, Associate Professor, HDR]
- Vincent Vandewalle [Université de Lille, Associate Professor, HDR]
Post-Doctoral Fellows
- Florent Dewez [Inria]
- Vera Shalaeva [Inria, until Jun 2020]
PhD Students
- Reuben Adams [University College London, United Kingdom, from Sep 2020]
- Filippo Antonazzo [Inria]
- Yaroslav Averyanov [Inria]
- Felix Biggs [University College London, United Kingdom]
- Rajeev Bopche [Inria, from Oct 2020]
- Guillaume Braun [Insee]
- Wilfried Heyse [Inserm]
- Eglantine Karlé [Inria, from Nov 2020]
- Etienne Kronert [Worldline, from Jul 2020]
- Arthur Leroy [Université Paris-Descartes, until Sep 2020]
- Issam Ali Moindjie [Inria, from Oct 2020]
- Axel Potier [Inria, from Jul 2020]
- Antonin Schrab [University College London, United Kingdom, from Sep 2020]
- Antoine Vendeville [University College London, United Kingdom]
- Luxin Zhang [Worldline, CIFRE]
Technical Staff
- Maxime Brunin [Inria, Engineer, from Jul 2020]
- Iheb Eladib [Inria, Engineer, until Feb 2020]
- Quentin Grimonprez [Inria, Engineer, until Sep 2020]
- Etienne Kronert [Inria, Engineer, from Feb 2020 until Jun 2020]
- Issam Ali Moindjie [Inria, Engineer, until Sep 2020]
- Arthur Talpaert [Inria, Engineer, until Sep 2020]
Interns and Apprentices
- Theophile Cantelobre [Inria, from Feb 2020 until Jul 2020]
- Issa Dabo [Inria, from Jun 2020 until Aug 2020]
- Cadmos Kahale-Abdou [Inria, from Jul 2020 until Oct 2020]
- Komlan Midodzi Noukpoape [Inria, from Apr 2020 until Sep 2020]
Administrative Assistant
- Anne Rejl [Inria]
Visiting Scientist
- Apoorv Vikram Singh [Indian Institute of Science, Bangalore, India, until Jan 2020]
External Collaborators
- Jean-Francois Bouin [DiagRAMS Technologies, until Mar 2020]
- Margot Correard [DiagRAMS Technologies, until Mar 2020]
2 Overall objectives
2.1 Context
In several respects, modern society has strengthened the need for statistical analysis, from both an applied and a theoretical point of view. This trend originates in the easier availability of data thanks to technological breakthroughs (storage, transfer, computing); data are now so widespread that they are no longer limited to large human organizations. The more or less conscious goal of such data availability is the hope of improving on the age-old aims of statistics, namely discovering new knowledge and making better predictions. These two central tasks can be referred to as unsupervised learning and supervised learning respectively, even if statistics is not limited to them and other names exist depending on the community. In short, the field pursues the following hope: “more data for better quality and more numerous results”.
However, today's data are increasingly complex. They gather mixed-type features (for instance continuous data mixed with categorical data), missing or partially missing items (like intervals) and numerous variables (high-dimensional situations). As a consequence, the target “better quality and more numerous results” of the previous adage (both parts matter: “better quality” and also “more numerous”) cannot be reached in a somewhat “manual” way, but must inevitably rely on some theoretical formalization and guarantees. Indeed, data can be so numerous and so complex (they can live in quite abstract spaces) that the “empirical” statistician is quickly overwhelmed. Since data are by nature subject to randomness, the probabilistic framework is a very sensible theoretical environment to serve as a general guide for modern statistical analysis.
2.2 Goals
Modal is a project-team working on today's complex data sets (mixed data, missing data, high-dimensional data), for classical statistical targets (unsupervised learning, supervised learning, regression etc.) with approaches relying on the probabilistic framework. The latter can be tackled through both model-based methods (such as mixture models, a generic tool) and model-free methods (such as probabilistic bounds on empirical quantities). Furthermore, Modal is connected to the real world through applications, typically biological ones (several members have expertise in this field), but many others are also considered since the application coverage of the Modal methodology is very large. It is also important to note that, in return, applications are often real opportunities for initiating academic questioning for the statistician (as with some projects treated by the bilille platform and some bilateral contracts of the team).
From the point of view of academic communities, Modal can be seen as belonging simultaneously to both the statistical learning and machine learning ones, as attested by its publications. In a sense, this is an opportunity to build a bridge between these two stochastic communities around a common and large probabilistic framework.
3 Research program
3.1 Research axis 1: Unsupervised learning
Scientific challenges related to unsupervised learning are numerous, concerning the validity of the clustering outcome, the ability to manage different kinds of data, the handling of missing data, the dimensionality of the data set etc. Many of them are addressed by the team, leading to publications, often with delivery of a specific package (sometimes upgraded to a software product, or even a platform grouping several software packages). Because of the variety of the scope, this axis involves nearly all permanent team members, often with PhD students and some engineers. The related works are always embedded in a probabilistic framework, typically model-based approaches but also model-free ones like PAC-Bayes (PAC stands for Probably Approximately Correct), because such a mathematical environment offers both a well-posed problem and a rigorous answer.
3.2 Research axis 2: Performance assessment
One main concern of the Modal team is to provide theoretical justifications for the procedures it designs. Such guarantees are important to avoid misleading conclusions resulting from any unsuitable use. For example, one ingredient in proving these guarantees is the use of the PAC framework, leading to finite-sample concentration inequalities. More precisely, contributions to PAC learning rely on classical empirical process theory and PAC-Bayesian theory. The Modal team exploits such non-asymptotic tools to analyze the performance of iterative algorithms (such as gradient descent), cross-validation estimators, online change-point detection procedures, ranking algorithms, matrix factorization techniques and clustering methods, for instance. The team also develops expertise on the formal dynamic study of algorithms related to mixture models (important models used in the previous unsupervised setting), like degeneracy for the EM algorithm or label switching for the Gibbs algorithm.
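As a generic (not team-specific) illustration of the finite-sample concentration inequalities mentioned above, the following sketch checks Hoeffding's inequality empirically on Bernoulli samples; all parameter values are illustrative:

```python
import numpy as np

# Hoeffding: P(|sample mean - p| >= t) <= 2 exp(-2 n t^2) for [0,1]-valued i.i.d. data
rng = np.random.default_rng(0)
n, t, p = 200, 0.1, 0.3
trials = rng.binomial(1, p, size=(20000, n)).mean(axis=1)  # 20000 sample means
empirical = np.mean(np.abs(trials - p) >= t)               # observed deviation frequency
bound = 2 * np.exp(-2 * n * t ** 2)                        # finite-sample guarantee
```

The empirical deviation frequency always sits below the bound, which holds for every finite n rather than only asymptotically.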
3.3 Research axis 3: Functional data
Mainly due to technological advances, functional data are more and more widespread in many application domains. Functional data analysis (FDA) is concerned with the modeling of data such as curves, shapes, images or more complex mathematical objects, thought of as smooth realizations of a stochastic process (an infinite-dimensional data object valued in a space of possibly infinite dimension; e.g. the space of square-integrable functions). Time series are an emblematic example, even if FDA is not limited to them (spectral data, spatial data etc.). Basically, FDA considers that data correspond to realizations of stochastic processes, usually assumed to lie in a metric, semi-metric, Hilbert or Banach space. One may consider functional data objects that are independent or dependent (in time or space) and of different types (qualitative, quantitative, ordinal, multivariate, time-dependent, spatial-dependent etc.). The last decade has seen a dynamic literature on parametric and non-parametric FDA approaches for different types of data and applications to various domains, covering principal component analysis, clustering, regression and prediction.
3.4 Research axis 4: Applications motivating research
The fourth axis consists in translating real application issues into statistical problems raising new (academic) challenges for the models developed in the Modal team. CIFRE PhDs in industry and interdisciplinary projects with research teams in health and biology are at the core of this objective. The main originality of this objective lies in the use of statistics with complex data, including in particular ultra-high-dimensional problems. We focus on real applications which cannot be solved by classical data analysis.
4 Application domains
4.1 Economic world
The Modal team applies its research to the economic world through CIFRE PhD supervision with partners such as CACF (credit scoring), A-Volute (expert in 3D sound), Meilleur Taux (insurance comparator) and Worldline. It also has several contracts with companies such as COLAS, Nokia-Apsys/Airbus and Safety Line (through the PERF-AI consortium).
4.2 Biology
The second main application domain of the team is biology. Members of the team are involved in the supervision and scientific animation of bilille, the bioinformatics platform of Lille, and of the OncoLille Institute.
5 Highlights of the year
- Christophe Biernacki is now Deputy Scientific Director at Inria in charge of the national scientific domain “applied mathematics, computation and simulation”.
- Christophe Biernacki was president of the scientific committee of the JdS 2020.
- Benjamin Guedj has led the emerging Inria London Programme since 2019 and was appointed Scientific Director of the programme in September 2020. The partnership between Inria and University College London (United Kingdom) officially kicks off on February 1, 2021.
- Sophie Dabo-Niang was appointed in 2020 as a member of the Committee for Diversity of the International Mathematical Union (IMU).
- DiagRAMS Technologies, a software editor dedicated to predictive maintenance, has been created this year. This start-up relies on the research of the MODAL team, developing a data analysis solution to anticipate breakdowns and malfunctions on industrial equipment.
- Cristian Preda has been the head of the MODAL team since December 2020. Vincent Vandewalle is the deputy director of the team.
5.1 Awards
Wilfried Heyse was awarded the Spring of Cardiology prize for the best oral presentation of his poster 79.
Benjamin Guedj obtained a best reviewer award (top 10% of reviewers) for NeurIPS 2020.
Benjamin Guedj has co-authored a paper at NeurIPS 2020 which was selected for an oral presentation (top 3%) 39.
6 New software and platforms
6.1 New software
6.1.1 pycobra
- Keywords: Statistics, Data visualization, Machine learning
- Scientific Description:
pycobra is a python library for ensemble learning, which serves as a toolkit for regression, classification, and visualisation. It is scikit-learn compatible and fits into the existing scikit-learn ecosystem.
pycobra offers a python implementation of the COBRA algorithm introduced by Biau et al. (2016) for regression.
Another algorithm implemented is the EWA (Exponentially Weighted Aggregate) aggregation technique; among several other references, see the paper by Dalalyan and Tsybakov (2007).
Apart from these two regression aggregation algorithms, pycobra implements a version of COBRA for classification. This procedure has been introduced by Mojirsheibani (1999).
pycobra also offers various visualisation and diagnostic methods built on top of matplotlib, which let the user analyse and compare different regression machines with COBRA. The Visualisation class also lets you use some of the tools (such as Voronoi Tessellations) on other visualisation problems, such as clustering.
- Functional Description:
pycobra is a python library for ensemble learning, which serves as a toolkit for regression, classification, and visualisation. It is scikit-learn compatible and fits into the existing scikit-learn ecosystem.
pycobra offers a python implementation of the COBRA algorithm introduced by Biau et al. (2016) for regression.
Another algorithm implemented is the EWA (Exponentially Weighted Aggregate) aggregation technique; among several other references, see the paper by Dalalyan and Tsybakov (2007).
Apart from these two regression aggregation algorithms, pycobra implements a version of COBRA for classification. This procedure has been introduced by Mojirsheibani (1999).
pycobra also offers various visualisation and diagnostic methods built on top of matplotlib, which let the user analyse and compare different regression machines with COBRA. The Visualisation class also lets you use some of the tools (such as Voronoi Tessellations) on other visualisation problems, such as clustering.
- URL: https://github.com/bhargavvader/pycobra
- Publication: hal-01514059
- Contact: Benjamin Guedj
- Participants: Bhargav Srinivasa Desikan, Benjamin Guedj
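A minimal sketch of the COBRA combination rule that pycobra implements (this is not the pycobra API; function and variable names are illustrative):

```python
import numpy as np

def cobra_predict(preds_train, y_train, preds_new, eps=0.1):
    """COBRA combined regression (Biau et al., 2016), simplified:
    a query's prediction is the average of training responses over
    the points whose machine outputs all lie within eps of the query's.

    preds_train: (n, M) predictions of M machines on the training points
    preds_new:   (q, M) predictions of the same machines on query points
    """
    out = np.empty(len(preds_new))
    for i, p in enumerate(preds_new):
        close = np.all(np.abs(preds_train - p) <= eps, axis=1)
        # fall back to the global mean if no training point is retained
        out[i] = y_train[close].mean() if close.any() else y_train.mean()
    return out
```

In the full algorithm, agreement may be required for only a fraction of the machines, and eps is calibrated by data splitting; those refinements are omitted here.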
6.1.2 MixtComp.V4
- Keywords: Clustering, Statistics, Missing data, Mixed data
- Functional Description: MixtComp (Mixture Computation) is a model-based clustering package for mixed data originating from the Modal team (Inria Lille). It has been engineered around the idea of easy and quick integration of any new univariate model, under the conditional independence assumption. New models will eventually be available from research carried out by the Modal team or by other teams. Currently, the central architecture of MixtComp is built and its functionality has been field-tested through industry partnerships. Five basic models (Gaussian, Multinomial, Poisson, Weibull, NegativeBinomial) are implemented, as well as two advanced models (Functional and Rank). MixtComp can natively manage missing data (completely or by interval). MixtComp is used as an R package, but its internals are coded in C++ using state-of-the-art libraries for faster computation.
- Release Contributions: - New I/O system - Replacement of regex library - Improvement of initialization - Criteria for stopping the algorithm - Added management of partially missing data for several models - User documentation - Adding user features in R
- Contact: Christophe Biernacki
- Participants: Christophe Biernacki, Vincent Kubicki, Matthieu Marbac-Lourdelle, Serge Iovleff, Quentin Grimonprez, Etienne Goffinet
- Partners: Université de Lille, CNRS
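The way conditional independence lets a mixture model handle missing values natively can be sketched as follows; this is a simplified Gaussian-only EM, not MixtComp's actual implementation, and all names are illustrative:

```python
import numpy as np

def em_missing(X, K, n_iter=50):
    """EM for a diagonal Gaussian mixture where missing cells are NaN.

    Under conditional independence, each row's likelihood is a product
    of per-variable densities over its *observed* entries only, so
    missing cells are handled natively (no imputation step)."""
    n, d = X.shape
    obs = ~np.isnan(X)
    Xf = np.nan_to_num(X)                      # zeros at missing cells (masked below)
    pi = np.full(K, 1.0 / K)
    # naive spread-out initialisation; a k-means-style init would be better
    mu = Xf[np.linspace(0, n - 1, K).astype(int)].copy()
    var = np.ones((K, d))
    for _ in range(n_iter):
        # E-step: per-component log-density summed over observed cells only
        logr = np.tile(np.log(pi), (n, 1))
        for k in range(K):
            ll = -0.5 * (np.log(2 * np.pi * var[k]) + (Xf - mu[k]) ** 2 / var[k])
            logr[:, k] += np.where(obs, ll, 0.0).sum(axis=1)
        logr -= logr.max(axis=1, keepdims=True)
        r = np.exp(logr)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weighted statistics restricted to observed cells
        pi = r.mean(axis=0)
        for k in range(K):
            w = r[:, k][:, None] * obs
            mu[k] = (w * Xf).sum(axis=0) / w.sum(axis=0)
            var[k] = (w * (Xf - mu[k]) ** 2).sum(axis=0) / w.sum(axis=0) + 1e-6
    return pi, mu, var, r
```

The same masking idea extends to any univariate model whose density can be evaluated cell by cell, which is what makes new-model integration cheap under conditional independence.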
6.1.3 MASSICCC
- Name: Massive Clustering with Cloud Computing
- Keywords: Statistic analysis, Big data, Machine learning, Web Application
- Scientific Description: The web application lets users run several software packages developed by Inria directly in a web browser. Mixmod is a classification library for continuous and categorical data. MixtComp allows for missing data and a larger choice of data types. BlockCluster is a library for co-clustering of data. When using the web application, the user can first upload a data set, then configure a job using one of the libraries mentioned and start the execution of the job on a cluster. The results are then displayed directly in the browser, allowing for rapid understanding and interactive visualisation.
- Functional Description: The MASSICCC web application offers a simple and dynamic interface for analysing heterogeneous data with a web browser. Various software packages for statistical analysis are available (Mixmod, MixtComp, BlockCluster) which allow for supervised and unsupervised classification of large data sets.
- URL: https://massiccc.lille.inria.fr
- Contact: Christophe Biernacki
6.1.4 cfda
- Name: Categorical functional data analysis
- Keyword: Functional data
- Functional Description: The R package cfda performs: - descriptive statistics for categorical functional data - dimension reduction and optimal encoding of states (multiple correspondence analysis extended to functional data)
- URL: https://github.com/modal-inria/cfda
- Contact: Cristian Preda
- Participants: Cristian Preda, Quentin Grimonprez, Vincent Vandewalle
- Partner: Université de Lille
6.1.5 PyRotor
- Name: Python Route Trajectory Optimiser
- Keywords: Optimization, Machine learning, Trajectory Modeling
- Scientific Description:
PyRotor is a Python implementation of the trajectory optimisation method introduced in the paper: “An end-to-end data-driven optimisation framework for constrained trajectories”
The method proposes trajectories optimizing a given criterion. Unlike classical approaches (such as optimal control), the method is based on the information contained in the available data. This makes it possible to restrict the search area to a neighborhood of the observed trajectories and to incorporate the correlations estimated from the data, which is achieved by means of a regularization term in the cost function. An iterative approach is also developed to verify additional constraints.
- Functional Description: PyRotor leverages available trajectory data to focus the search space and to estimate some properties which are then incorporated in the optimisation problem. This constrains the optimisation problem in a natural and simple way, so that its solution inherits realistic patterns from the data. In particular, PyRotor does not require any knowledge of the dynamics of the system.
- News of the Year: Methodology development and implementation of the first results
- URL: https://pypi.org/project/pyrotor/
- Publication: hal-03024720
- Contact: Florent Dewez
- Participants: Florent Dewez, Benjamin Guedj, Arthur Talpaert, Vincent Vandewalle
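The data-driven penalisation idea can be sketched as follows; this is a toy quadratic criterion on synthetic trajectories, not PyRotor's actual code (which additionally handles constraints iteratively), and all names are illustrative:

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic observed trajectories discretised on a common time grid
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 20)
obs = np.sin(np.pi * t)[None, :] + 0.05 * rng.standard_normal((30, 20))
y_bar = obs.mean(axis=0)                        # mean observed trajectory

criterion = lambda y: np.sum(np.diff(y) ** 2)   # toy cost to minimise
lam = 0.1                                       # strength of the data-driven penalty

# Regularised cost: be efficient *and* stay close to observed behaviour,
# so the optimum inherits realistic patterns from the data
cost = lambda y: criterion(y) + lam * np.sum((y - y_bar) ** 2)
res = minimize(cost, y_bar)                     # start from the mean trajectory
```

The penalty plays the role of the regularization term described above: without it, the minimiser of the criterion alone could wander far from anything ever flown.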
6.2 New platforms
6.2.1 MASSICCC Platform
MASSICCC is a demonstration platform giving access, through a SaaS (software as a service) concept, to data analysis libraries developed at Inria. It allows obtaining results either directly through a website-specific display (specific and interactive visual outputs) or through an R data object download. It started in October 2015 for two years and is common to the Modal team (Inria Lille) and the Select team (Inria Saclay). In 2016, two packages were integrated: Mixmod and MixtComp (see the specific section about MixtComp). In 2017, the BlockCluster package was integrated, and particular attention to providing meaningful graphical outputs (for Mixmod, MixtComp and BlockCluster) directly in the web platform itself led to some specific developments. In 2019, a new version of the MixtComp software was developed. In 2020, Julien Vandaele joined the MODAL team as a research engineer to upgrade both the MixtComp software and the MASSICCC platform.
7 New results
7.1 Axis 1: Model-based Co-clustering for Ordinal Data of Different Dimensions
Participants: Christophe Biernacki.
This work has been motivated by a psychological survey on women affected by a breast tumor. Patients replied at different moments of their treatment to questionnaires with answers on an ordinal scale. The questions relate to aspects of their life called dimensions. To assist the psychologists in analyzing the results, it is useful to emphasize a structure in the dataset. A clustering method achieves that by creating groups of individuals depicted by a representative of each group. From a psychological standpoint, it is also useful to observe how questions may be grouped. This is why a clustering should also be performed on the features, which is called a co-clustering problem. However, gathering questions that are not related to the same dimension does not make sense from a psychologist's standpoint. Therefore, the present work performs a constrained co-clustering, preventing questions from different dimensions from being assembled in the same column-cluster. In addition, the evolution of co-clusters along time has been investigated. The method relies on a constrained Latent Block Model embedding a probability distribution for ordinal data. Parameter estimation relies on a Stochastic EM algorithm associated with a Gibbs sampler, and the ICL-BIC criterion is used for selecting the numbers of co-clusters. The resulting work was accepted in an international journal in 2019 and the related R package ordinalClust was accepted this year in another international journal 29.
This is a joint work with Margot Selosse (PhD student) and Julien Jacques, both from Université de Lyon 2, and Florence Cousson-Gélie from Université Paul Valéry Montpellier 3.
7.2 Axis 1: Model-based Co-clustering for Mixed Type Data
Participants: Christophe Biernacki.
Over the decades, many studies have shown the importance of clustering for emphasizing groups of observations. More recently, due to the emergence of high-dimensional datasets with a huge number of features, co-clustering techniques have emerged, with several methods for simultaneously producing groups of observations and features. By synthesizing the dataset into blocks (the crossing of a row-cluster and a column-cluster), this technique can sometimes better summarize the data and its inherent structure. The Latent Block Model (LBM) is a well-known method for performing co-clustering. However, contexts with features of different types (here called mixed-type datasets) are becoming more common, and the LBM is not directly applicable to this kind of dataset. The present work extends the usual LBM to the so-called Multiple Latent Block Model (MLBM), which is able to handle mixed-type datasets. The inference is done through a Stochastic EM algorithm embedding a Gibbs sampler, and a model selection criterion is defined to choose the number of row and column clusters. This method was successfully used on simulated and real datasets. This work is now accepted in an international journal 27.
This is joint work with Margot Selosse (PhD student) and Julien Jacques, both from Université de Lyon 2.
7.3 Axis 1: Relaxing the Identically Distributed Assumption in Gaussian Co-clustering for High Dimensional Data
Participants: Christophe Biernacki.
A co-clustering model for continuous data that relaxes the identically distributed assumption within blocks of traditional co-clustering is presented. The proposed model, although allowing more flexibility, still maintains the very high degree of parsimony achieved by traditional co-clustering. A stochastic EM algorithm along with a Gibbs sampler is used for parameter estimation and an ICL criterion is used for model selection. Simulated and real datasets are used for illustration and comparison with traditional co-clustering. This work has been submitted to an international journal 65.
This is a joint work with Michael Gallaugher (PhD student) and Paul McNicholas, both from McMaster University (Canada). Michael Gallaugher visited Modal for three months in 2018.
7.4 Axis 1: Gaussian-based Visualization of Gaussian and non-Gaussian Model-based Clustering
Participants: Christophe Biernacki, Vincent Vandewalle.
A generic method is introduced to visualize, in a Gaussian-like way, results of Gaussian or non-Gaussian model-based clustering. The key point is to explicitly force a spherical Gaussian mixture visualization to inherit the within-cluster overlap present in the initial clustering mixture. The result is a particularly user-friendly drawing of the clusters, allowing any practitioner to have a thorough overview of the potentially complex clustering result. An entropic measure informs about the quality of the drawn overlap, in comparison to the true one in the initial space. The proposed method is illustrated on four real data sets of different types (categorical, mixed, functional and network) and is implemented in the R package ClusVis. This work is now published in an international journal 15.
This is a joint work with Matthieu Marbac from ENSAI.
7.5 Axis 1: Dealing with Missing Data in Model-based Clustering through a MNAR Model
Participants: Christophe Biernacki.
Since the 90s, model-based clustering has been widely used to classify data. Nowadays, with the increase of available data, missing values are more frequent. Traditional ways to deal with them consist in obtaining a filled data set, either by discarding missing values or by imputing them. In the first case, some information is lost; in the second case, the final clustering purpose is not taken into account in the imputation step. Thus, both solutions risk blurring the clustering estimation result. Alternatively, we defend the need to embed the missingness mechanism directly within the clustering modeling step. There exist three types of missing data: missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR). In all situations, logistic regression is proposed as a natural and flexible candidate model. In particular, its flexibility allows us to design some meaningful parsimonious variants, such as dependency on the missing values or dependency on the cluster label. In this unified context, standard model selection criteria can be used to select between such different missing data mechanisms, simultaneously with the number of clusters. The practical interest of our proposal is illustrated on data derived from medical studies suffering from many missing data. Currently, a preprint is being finalized for submission to an international journal.
It is a joint work with Claire Boyer from Sorbonne Université, Gilles Celeux from Inria Saclay, Julie Josse from Inria Montpellier, Fabien Laporte from Institut Pasteur and Matthieu Marbac from ENSAI.
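Why the missingness mechanism cannot be ignored under MNAR can be illustrated with a logistic missingness model, as advocated above; this is a toy simulation, with all parameter values illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 10000)         # true values, mean 0

# MNAR: the probability of being missing depends on the value itself,
# here through a logistic model (larger values are more often missing)
p_miss = 1.0 / (1.0 + np.exp(-1.5 * x))
missing = rng.random(10000) < p_miss
x_obs = x[~missing]

# Discarding the missing entries now biases any downstream estimate:
# the mean of the observed values lies well below the true mean of 0.
```

Under MCAR the same discard-and-estimate strategy would be unbiased, which is exactly why the mechanism has to be modeled jointly with the clustering rather than handled by a preliminary deletion or imputation step.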
7.6 Axis 1: Organized Co-clustering for Textual Data Synthesis
Participants: Christophe Biernacki.
Recently, different studies have demonstrated the interest of co-clustering, which simultaneously produces clusters of rows and columns. The present work introduces a novel co-clustering model for parsimoniously summarizing textual data in documents × terms format. Besides highlighting homogeneous co-clusters, as other existing algorithms do, we also distinguish noisy co-clusters from significant ones, which is particularly useful for sparse documents × terms matrices. Furthermore, our model proposes a structure among the significant co-clusters and thus offers better interpretability to the user. By forcing a structure through row-clusters and column-clusters, this approach is competitive in terms of document clustering and offers user-friendly results. The algorithm derived for the proposed method is a Stochastic EM algorithm embedding a Gibbs sampling step and the Poisson distribution. A paper has now been accepted in an international journal 28 and also in a national conference with international audience 47.
This is joint work with Margot Selosse (PhD student) and Julien Jacques, both from Université de Lyon 2.
7.7 Axis 1: Model-Based Co-clustering with Co-variables
Participants: Serge Iovleff.
This work has been motivated by an epidemiological and genetic survey of malaria disease in Senegal. Data were collected between 1990 and 2008. It is based on a latent block model taking into account the problem of grouping variables and clustering individuals by integrating information given by a set of co-variables. Numerical experiments on simulated data sets and an application on real genetic data highlight the interest of this approach. An article has been submitted to the Journal of Classification and is being revised following a “Major Revisions” decision.
7.8 Axis 1: Predictive Clustering
Participants: Christophe Biernacki, Vincent Vandewalle.
Many data, for instance in biostatistics, contain sets of variables which permit evaluating unobserved traits of the subjects (e.g. questions about how many pizzas, hamburgers, chips etc. are eaten serve to assess how healthy the food habits of the subjects are). Moreover, we often want to measure the relations between these unobserved traits and some target variables (e.g. obesity). Thus, a two-step procedure is often used: first, a clustering of the observations is performed on the sets of variables related to the same topic; second, the predictive model is fitted by plugging the estimated partitions in as covariates. Generally, the estimated partitions are not exactly equal to the true ones. We investigate the impact of these measurement errors on the estimators of the regression parameters, and we explain when this two-step procedure is consistent. We also present a specific EM algorithm which simultaneously estimates the parameters of the clustering and predictive models. This has led to the preprint 71, now submitted to an international journal.
It is a joint work with Matthieu Marbac from ENSAI and Mohammed Sedki from Université Paris-Saclay.
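The two-step procedure discussed above can be sketched with scikit-learn on synthetic data; the paper's simultaneous EM estimation is not shown, and all names and numbers are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
z = rng.integers(0, 2, 300)                          # true latent trait (2 classes)
diet = rng.normal(3.0 * z[:, None], 1.0, (300, 4))   # 4 observed "diet" variables
y = 2.0 * z + rng.normal(0.0, 0.3, 300)              # target driven by the trait

# Step 1: cluster the topic-specific variables to estimate the latent trait
z_hat = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(diet)

# Step 2: plug the estimated partition in as a covariate of the predictor
reg = LinearRegression().fit(z_hat[:, None], y)
```

With well-separated clusters, the fitted coefficient is close to the true effect of the trait (up to the arbitrary sign of the cluster labels); when clusters overlap, the misclassification acts as a measurement error that attenuates it, which is the phenomenon studied in the paper.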
7.9 Axis 1: A Binned Technique for Scalable Model-based Clustering on Huge Datasets
Participants: Filippo Antonazzo, Christophe Biernacki.
Clustering is impacted by the regular increase of sample sizes, which provides an opportunity to reveal information previously out of scope. However, the volume of data leads to issues related to the need for many computational resources and also to high energy consumption. Resorting to binned data on an adaptive grid is expected to give a proper answer to such green computing issues while not harming the quality of the related estimation. After a brief review of existing methods, a first application in the context of univariate model-based clustering is provided, with a numerical illustration of its advantages. Finally, an initial formalization of the multivariate extension is given, highlighting both issues and possible strategies. This work has been accepted to a national conference with international audience 43 and also to an international conference 33.
It is a joint work with Christine Keribin from Université Paris-Saclay.
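The binned idea can be sketched for a univariate Gaussian mixture, where each bin centre acts as a single observation weighted by its count, so the EM cost depends on the number of bins rather than the sample size; this is illustrative code on a fixed grid, not the team's adaptive-grid implementation:

```python
import numpy as np

def binned_em(centers, counts, K=2, n_iter=100):
    """EM for a univariate Gaussian mixture fitted to binned data:
    each bin centre is one observation weighted by its relative count."""
    w = counts / counts.sum()
    mu = np.quantile(np.repeat(centers, counts), np.linspace(0.1, 0.9, K))
    var = np.full(K, np.var(centers))
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        dens = pi * np.exp(-0.5 * (centers[:, None] - mu) ** 2 / var) \
               / np.sqrt(2 * np.pi * var)
        r = dens / dens.sum(axis=1, keepdims=True)   # responsibilities per bin
        nk = (w[:, None] * r).sum(axis=0)            # weighted component masses
        pi = nk
        mu = (w[:, None] * r * centers[:, None]).sum(axis=0) / nk
        var = (w[:, None] * r * (centers[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-9
    return pi, mu, var

# usage: bin 10,000 draws from two Gaussians and fit on only 60 bin centres
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 5000), rng.normal(6.0, 1.0, 5000)])
counts, edges = np.histogram(x, bins=60)
centers = 0.5 * (edges[:-1] + edges[1:])
pi, mu, var = binned_em(centers, counts)
```

Each EM pass touches 60 weighted points instead of 10,000 raw ones, which is the computational (and energy) saving pursued, at the price of a small grouping bias controlled by the bin width.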
7.10 Axis 1: A Bumpy Journey: Exploring Deep Gaussian Mixture Models
Participants: Christophe Biernacki.
The deep Gaussian mixture model (DGMM) is a framework directly inspired by the finite mixture of factor analysers model (MFA) and by deep learning architectures composed of multiple layers. The MFA is a generative model that considers a data point as arising from a latent variable (termed the score) which is sampled from a standard multivariate Gaussian distribution and then transformed linearly. The linear transformation matrix (termed the loading matrix) is specific to a component in the finite mixture. The DGMM consists of stacking MFA layers, in the sense that the latent scores are no longer assumed to be drawn from a standard Gaussian, but rather from a mixture of factor analysers model. Thus the latent scores are at one point considered to be the input of an MFA and also to have latent scores themselves. Only the latent scores of the DGMM's last layer are considered to be drawn from a standard multivariate Gaussian distribution. In recent years, the DGMM has gained prominence in the literature: intuitively, this model should be able to capture complex distributions more precisely than a simple Gaussian mixture model. We show in this work that, while the DGMM is an original and novel idea, in certain cases it is challenging to infer its parameters. In addition, we give some insights into the probable reasons for this difficulty. Experimental results are provided on github: https://
This is a joint work with Margot Selosse (PhD student) and Julien Jacques, both from Université de Lyon 2, and also Isobel Claire Gormley from University College Dubin (Ireland).
7.11 Axis 1: Multiple partition clustering subspaces
Participants: Vincent Vandewalle.
In model-based clustering, it is often assumed that a single clustering latent variable explains the heterogeneity of the whole dataset. However, in many cases several latent variables may be needed to explain the heterogeneity of the data at hand. Finding such class variables could lead to a richer interpretation of the data. In the continuous data setting, a multi-partition model-based clustering approach is proposed. It assumes the existence of several latent clustering variables, each one explaining the heterogeneity of the data with respect to some clustering subspace. It makes it possible to simultaneously find the multi-partitions and the related subspaces. The parameters of the model are estimated through an EM algorithm relying on a probabilistic reinterpretation of factorial discriminant analysis. A model choice strategy relying on the BIC criterion is proposed to select the number of subspaces and the number of clusters per subspace. The obtained results are thus several projections of the data, each one conveying its own clustering of the data.
This work is now published in 32.
7.12 Axis 1: Ranking and synchronization from pairwise measurements via SVD
Participants: Hemant Tyagi.
Given a measurement graph and an unknown signal defined on its nodes, we investigate algorithms for recovering the signal from noisy measurements of its pairwise differences on the edges. This problem arises in a variety of applications, such as ranking teams in sports data and time synchronization of distributed networks. Framed in the context of ranking, the task is to recover the ranking of teams (induced by the latent signal) given a small subset of noisy pairwise rank offsets. We propose a simple SVD-based algorithmic pipeline for both the problem of time synchronization and ranking. We provide a detailed theoretical analysis in terms of robustness against both sampling sparsity and noise perturbations with outliers, using results from matrix perturbation and random matrix theory. Our theoretical findings are complemented by a detailed set of numerical experiments on both synthetic and real data, showcasing the competitiveness of our proposed algorithms with other state-of-the-art methods.

This is joint work with Alexandre d'Aspremont (CNRS & ENS, Paris) and Mihai Cucuringu (University of Oxford, United Kingdom) and has now been published in an international journal 19.
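The following is a simplified, hypothetical sketch in the spirit of an SVD-based pipeline (not the paper's exact algorithm): noisy pairwise offsets are denoised by projection onto the top-2 singular subspace of the offset matrix, which has rank 2 in the noiseless case, and scores are then read off as row means.

```python
import numpy as np

rng = np.random.default_rng(2)

n = 50
x = rng.normal(size=n)                     # latent skill values inducing the ranking
C = np.subtract.outer(x, x)                # noiseless matrix of offsets x_i - x_j
C += 0.1 * rng.normal(size=(n, n))         # additive measurement noise
C = (C - C.T) / 2                          # keep the matrix skew-symmetric

# Denoise: the clean offset matrix has rank 2, so project onto the
# top-2 singular subspace before reading off the scores.
U, s, Vt = np.linalg.svd(C)
C_denoised = U[:, :2] @ np.diag(s[:2]) @ Vt[:2, :]

# Row means of the offset matrix recover x up to a global shift.
x_hat = C_denoised.mean(axis=1)
ranking = np.argsort(-x_hat)
```

The recovered scores are only defined up to a global shift, which is irrelevant for the induced ranking.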
7.13 Axis 1: Regularized spectral methods for clustering signed networks
Participants: Hemant Tyagi.
We study the problem of k-way clustering in signed graphs. Considerable attention in recent years has been devoted to analyzing and modeling signed graphs, where the affinity measure between nodes takes either positive or negative values. Recently, Cucuringu et al. [CDGT 2019] proposed a spectral method, namely SPONGE (Signed Positive over Negative Generalized Eigenproblem), which casts the clustering task as a generalized eigenvalue problem optimizing a suitably defined objective function. This approach is motivated by social balance theory, where the clustering task aims to decompose a given network into disjoint groups, such that individuals within the same group are connected by as many positive edges as possible, while individuals from different groups are mainly connected by negative edges. Through extensive numerical simulations, SPONGE was shown to achieve state-of-the-art empirical performance. On the theoretical front, [CDGT 2019] analyzed SPONGE and the popular Signed Laplacian method under the setting of a Signed Stochastic Block Model (SSBM), for equal-sized clusters, in the regime where the graph is moderately dense. In this work, we build on the results in [CDGT 2019] on two fronts for the normalized versions of SPONGE and the Signed Laplacian. Firstly, for both algorithms, we extend the theoretical analysis in [CDGT 2019] to the general setting of unequal-sized clusters in the moderately dense regime. Secondly, we introduce regularized versions of both methods to handle sparse graphs – a regime where standard spectral methods underperform – and provide theoretical guarantees under the same SSBM model. To the best of our knowledge, regularized spectral methods have so far not been considered in the setting of clustering signed graphs. We complement our theoretical results with an extensive set of numerical experiments on synthetic data.
This is joint work with Mihai Cucuringu (University of Oxford, United Kingdom), Apoorv Vikram Singh (NYU) and Deborah Sulem (University of Oxford, United Kingdom). It was initiated when Apoorv Vikram Singh visited the MODAL team to work with Hemant Tyagi from October 2019 to January 2020. It is currently under review in an international journal. A summary of the results was presented at the GCLR (Graphs and more Complex structures for Learning and Reasoning) workshop at AAAI 2021.
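As a toy illustration of spectral clustering with the Signed Laplacian (the simpler of the two methods discussed; SPONGE solves a generalized eigenproblem instead), assuming a small planted two-cluster signed graph:

```python
import numpy as np

rng = np.random.default_rng(3)

# Planted signed graph: two clusters of 10 nodes, +1 edges within
# clusters and -1 edges across, with a few random sign flips as noise.
n = 20
labels = np.repeat([0, 1], n // 2)
A = np.where(np.equal.outer(labels, labels), 1.0, -1.0)
np.fill_diagonal(A, 0.0)
flip = rng.random((n, n)) < 0.05
flip = np.triu(flip, 1)
flip = flip | flip.T                        # symmetric set of flipped edges
A[flip] *= -1

# Signed Laplacian: degrees use absolute edge weights.
D = np.diag(np.abs(A).sum(axis=1))
L = D - A

# For k = 2 clusters, the eigenvector of the smallest eigenvalue of L
# separates the clusters by its sign pattern.
vals, vecs = np.linalg.eigh(L)
pred = (vecs[:, 0] > 0).astype(int)
accuracy = max((pred == labels).mean(), ((1 - pred) == labels).mean())
```

In the noiseless balanced case the bottom eigenvector is an exact cluster indicator; the sign flips mimic the noisy SSBM setting analyzed in the work.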
7.14 Axis 1: An extension of the angular synchronization problem to the heterogeneous setting
Participants: Hemant Tyagi.
Given an undirected measurement graph, the classical angular synchronization problem consists of recovering unknown angles from a collection of noisy pairwise offset measurements (differences of angles modulo 2π) observed on the edges of the graph. This problem arises in a variety of applications, including computer vision, time synchronization of distributed networks, and ranking from preference relationships. In this paper, we consider a generalization to the heterogeneous setting where there exist several unknown groups of angles. For each edge, we are given a noisy pairwise offset measurement generated by exactly one of the groups, without knowing which. This can be thought of as a natural extension of the angular synchronization problem to multiple groups of angles, where the measurement graph has an unknown edge-disjoint decomposition into subgraphs, one for each group. We propose a probabilistic generative model for this problem, along with a spectral algorithm for which we provide a detailed theoretical analysis in terms of robustness against both sampling sparsity and noise. The theoretical findings are complemented by a comprehensive set of numerical experiments, showcasing the efficacy of our algorithm under various parameter regimes. Finally, we consider an application of bi-synchronization to the graph realization problem, and provide along the way an iterative graph disentangling procedure that uncovers the constituent subgraphs; this procedure is of independent interest, as it is shown to improve the final recovery accuracy across all the experiments considered.
This is joint work with Mihai Cucuringu (University of Oxford, United Kingdom) and is currently under review in an international journal.
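A minimal sketch of spectral angular synchronization for a single group of angles (the homogeneous case; the heterogeneous extension above adds the group-disentangling step), with hypothetical sizes and noise levels:

```python
import numpy as np

rng = np.random.default_rng(4)

n = 40
theta = rng.uniform(0, 2 * np.pi, n)            # ground-truth angles

# Complete measurement graph: noisy pairwise offsets theta_i - theta_j,
# encoded as entries of a Hermitian matrix H.
noise = 0.05 * rng.normal(size=(n, n))
noise = noise - noise.T                          # keep offsets antisymmetric
H = np.exp(1j * (np.subtract.outer(theta, theta) + noise))

# Spectral synchronization: the top eigenvector of H recovers the
# angles up to a global rotation.
vals, vecs = np.linalg.eigh(H)
v = vecs[:, -1]                                  # eigenvector of largest eigenvalue
theta_hat = np.angle(v)

# Compare pairwise differences, which are invariant to the global shift.
diff_true = np.subtract.outer(theta, theta)
diff_hat = np.subtract.outer(theta_hat, theta_hat)
err = np.abs(np.exp(1j * diff_true) - np.exp(1j * diff_hat)).max()
```

The global rotation is unidentifiable, so accuracy is assessed on the pairwise offsets rather than on the angles themselves.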
7.15 Axis 1&2: Clustering on Multilayer Graphs with Missing Values
Participants: Christophe Biernacki, Guillaume Braun, Hemant Tyagi.
Clustering of multilayer graphs has gained increasing interest over the last decade, due to numerous applications in various fields. Several clustering methods have been proposed, but they all rely on the assumption that the network is fully observed. We propose a statistical framework to handle nodes that are missing in some layers, as well as a method to estimate the model parameters and impute missing edge values.
This PhD work has recently begun and has led to a national conference paper with an international audience 34. An extended version has been submitted to, and accepted at, an international conference for 2021.
7.16 Axis 2: Denoising modulo samples: k-NN regression and tightness of SDP relaxation
Participants: Hemant Tyagi.
Many modern applications involve the acquisition of noisy modulo 1 samples of a function f, with the goal being to recover estimates of the original samples of f. For a Lipschitz function f, suppose we are given noisy mod 1 samples on a uniform grid, where the noise terms are zero-mean i.i.d. Gaussian. We derive a two-stage algorithm that recovers estimates of the samples with a uniform error rate holding with high probability. The first stage involves embedding the points on the unit complex circle and obtaining denoised estimates via a kNN (nearest neighbor) estimator. The second stage involves a sequential unwrapping procedure which unwraps the denoised mod 1 estimates from the first stage.
Recently, Cucuringu and Tyagi proposed an alternative way of denoising modulo 1 data which works with their representation on the unit complex circle. They formulated a smoothness-regularized least squares problem on the product manifold of unit circles, where the smoothness is measured with respect to the Laplacian of a proximity graph built on the sampling points. This is a nonconvex quadratically constrained quadratic program (QCQP), hence they proposed solving its semidefinite program (SDP) based relaxation. We derive sufficient conditions under which the SDP is a tight relaxation of the QCQP; under these conditions, the global solution of the QCQP can be obtained in polynomial time.
This is joint work with Michael Fanuel (KU Leuven). It is currently under review in an international journal and is undergoing revision.
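A rough sketch of the two-stage kNN-plus-unwrapping idea from the first paragraph above, under hypothetical choices of function, grid and noise level (not the paper's tuned procedure):

```python
import numpy as np

rng = np.random.default_rng(5)

n = 500
x = np.linspace(0, 1, n)                        # uniform sampling grid
f = 2 * np.sin(2 * np.pi * x)                   # smooth (Lipschitz) function
y = np.mod(f + 0.03 * rng.normal(size=n), 1)    # noisy mod 1 samples

# Stage 1: embed the mod 1 samples on the unit circle and denoise
# them by averaging over neighbouring grid points on the circle.
z = np.exp(2j * np.pi * y)
k = 5
z_smooth = np.array([z[max(0, i - k):i + k + 1].mean() for i in range(n)])
y_denoised = np.mod(np.angle(z_smooth) / (2 * np.pi), 1)

# Stage 2: sequential unwrapping -- add the integer that keeps
# consecutive estimates close.
f_hat = np.empty(n)
f_hat[0] = y_denoised[0]
for i in range(1, n):
    f_hat[i] = y_denoised[i] + np.round(f_hat[i - 1] - y_denoised[i])

# Error up to the unavoidable global integer shift.
err = np.abs((f_hat - f) - np.round(np.mean(f_hat - f))).max()
```

Averaging on the circle rather than on the raw mod 1 values is what makes the smoothing robust at wrap-around points.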
7.17 Axis 2: Error analysis for denoising smooth modulo signals on a graph
Participants: Hemant Tyagi.
In many applications, we are given access to noisy modulo samples of a smooth function with the goal being to robustly unwrap the samples, i.e. to estimate the original samples of the function. In a recent work, Cucuringu and Tyagi proposed denoising the modulo samples by first representing them on the unit complex circle and then solving a smoothness regularized least squares problem – the smoothness measured w.r.t. the Laplacian of a suitable proximity graph – on the product manifold of unit circles. This problem is a quadratically constrained quadratic program (QCQP) which is nonconvex, hence they proposed solving its sphere-relaxation leading to a trust region subproblem (TRS). In terms of theoretical guarantees, error bounds were derived for (TRS). These bounds are however weak in general and do not really demonstrate the denoising performed by (TRS).
In this work, we analyse the (TRS) as well as an unconstrained relaxation of the (QCQP). For both estimators we provide a refined analysis in the setting of Gaussian noise and derive noise regimes where they provably denoise the modulo observations with respect to an appropriate norm. The analysis is performed in a general setting where the proximity graph is any connected graph.
This is currently under review in an international journal, and is undergoing revision.
7.18 Axis 2: Multi-kernel unmixing and super-resolution using the Modified Matrix Pencil method
Participants: Hemant Tyagi.
Consider several groups of point sources (spike trains), each group convolved with a point spread function at a group-specific scale. Our goal is to recover the source parameters (locations and amplitudes) of each group given samples, or Fourier samples, of the resulting mixture of convolutions. This problem is a generalization of the usual super-resolution setup, which corresponds to a single group; we call it the multi-kernel unmixing super-resolution problem. Assuming access to Fourier samples of the mixture, we derive an algorithm for estimating the source parameters of each group, along with precise non-asymptotic guarantees. Our approach involves estimating the group parameters sequentially, in order of increasing scale parameter. In particular, the estimation process at each stage involves (i) carefully sampling the tail of the Fourier transform of the signal, (ii) a deflation step wherein we subtract the contribution of the groups processed thus far from the obtained Fourier samples, and (iii) applying Moitra's modified Matrix Pencil method on a deconvolved version of the samples in (ii).
This is joint work with Stephane Chretien (National Physical Laboratory, United Kingdom & Alan Turing Institute, London) and was mostly done while Hemant Tyagi was affiliated to the Alan Turing Institute. It has now been published in an international journal 17.
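The classical single-kernel, noiseless Matrix Pencil step underlying stage (iii) can be sketched as follows; the spike locations and amplitudes are hypothetical:

```python
import numpy as np

# Two point sources with locations in [0,1) and positive amplitudes.
t_true = np.array([0.2, 0.7])
a_true = np.array([1.0, 0.5])
m = len(t_true)

# Noiseless Fourier samples f_k = sum_j a_j exp(2i*pi*k*t_j), k = 0..2m-1.
ks = np.arange(2 * m)
f = (a_true * np.exp(2j * np.pi * np.outer(ks, t_true))).sum(axis=1)

# Matrix pencil: the eigenvalues of H0^{-1} H1, built from shifted
# Hankel matrices of the Fourier samples, are exp(2i*pi*t_j).
H0 = np.array([[f[i + l] for l in range(m)] for i in range(m)])
H1 = np.array([[f[i + l + 1] for l in range(m)] for i in range(m)])
z = np.linalg.eigvals(np.linalg.solve(H0, H1))
t_hat = np.sort(np.mod(np.angle(z) / (2 * np.pi), 1))
```

Moitra's modified variant and the deflation across groups add robustness to noise and to the multiple scales, which this noiseless sketch omits.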
7.19 Axis 2: Provably robust estimation of modulo 1 samples of a smooth function with applications to phase unwrapping
Participants: Hemant Tyagi.
Consider an unknown smooth function for which we are given noisy mod 1 samples. Our goal is to recover smooth, robust estimates of the clean samples. We formulate a natural approach for solving this problem, which works with angular embeddings of the noisy mod 1 samples over the unit circle, inspired by the angular synchronization framework. This amounts to solving a smoothness-regularized least-squares problem – a quadratically constrained quadratic program (QCQP) – where the variables are constrained to lie on the unit circle. Our proposed approach is based on solving its relaxation, which is a trust-region subproblem and hence solvable efficiently. We provide theoretical guarantees demonstrating its robustness to noise for adversarial, as well as random Gaussian and Bernoulli noise models. To the best of our knowledge, these are the first such theoretical results for this problem. We demonstrate the robustness and efficiency of our proposed approach via extensive numerical simulations on synthetic data, along with a simple least-squares based solution for the unwrapping stage that recovers the original samples of the function (up to a global shift). It is shown to perform well at high levels of noise, when taking as input the denoised modulo 1 samples. Finally, we also consider two other approaches for denoising the modulo 1 samples that leverage tools from Riemannian optimization on manifolds, including a Burer-Monteiro approach for a semidefinite programming relaxation of our formulation. For the two-dimensional version of the problem, which has applications in synthetic aperture radar interferometry (InSAR), we are able to solve instances of real-world data with a million sample points in under 10 seconds, on a personal laptop.
This is joint work with Mihai Cucuringu (University of Oxford, United Kingdom) and was mostly done while Hemant Tyagi was affiliated to the Alan Turing Institute. It has now been published in an international journal 18.
7.20 Axis 2: Pseudo-Bayesian learning with kernel Fourier transform as prior
Participants: Pascal Germain.
We revisit the kernel random Fourier features (RFF) method through the lens of the PAC-Bayesian theory. While the primary goal of RFF is to approximate a kernel, we look at the Fourier transform as a prior distribution over trigonometric hypotheses. It naturally suggests learning a posterior on these hypotheses. We derive generalization bounds that are optimized by learning a pseudo-posterior obtained from a closed-form expression, and corresponding learning algorithms.
This joint work with Emilie Morvant from Université Jean Monnet de Saint-Etienne and Gaël Letarte from Laval University (Québec, Canada) was initiated in 2018, when Gaël Letarte was doing an internship at Inria, and led to a publication in the proceedings of the AISTATS 2019 conference. The same work has been presented as a poster at the “Workshop on Machine Learning with guarantees @ NeurIPS 2019”.
An extension of this work, co-authored with Léo Gautheron, Amaury Habrard, Marc Sebban, and Valentina Zantedeschi – all from Université Jean Monnet de Saint-Etienne – has been presented at the national conference CAp 2019. It is also the topic of a technical report.
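A minimal sketch of the random Fourier features construction that the work builds on (the RFF kernel approximation itself, not the PAC-Bayesian learning of a posterior over the trigonometric hypotheses); sizes and bandwidth are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(6)

def rff_features(X, n_features, sigma, rng):
    """Random Fourier features approximating the RBF kernel
    k(x, y) = exp(-||x - y||^2 / (2 sigma^2)).

    Frequencies are drawn from the Fourier transform of the kernel,
    the distribution that the PAC-Bayesian view treats as a prior."""
    d = X.shape[1]
    W = rng.normal(scale=1.0 / sigma, size=(d, n_features))
    b = rng.uniform(0, 2 * np.pi, n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

X = rng.normal(size=(30, 5))
Phi = rff_features(X, 20000, sigma=1.5, rng=rng)

# Compare the feature-map inner products with the exact kernel values.
K_exact = np.exp(-((X[:, None] - X[None, :]) ** 2).sum(-1) / (2 * 1.5 ** 2))
K_approx = Phi @ Phi.T
max_err = np.abs(K_exact - K_approx).max()
```

The cited work replaces this data-independent sampling distribution by a learned pseudo-posterior over the frequencies, with generalization guarantees.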
7.21 Axis 2: Improved PAC-Bayesian Bounds for Linear Regression
Participants: Pascal Germain, Vera Shalaeva.
We improve the PAC-Bayesian error bound for linear regression provided in the literature. The improvements are two-fold. First, the proposed error bound is tighter, and converges to the generalization loss with a well-chosen temperature parameter. Second, the error bound also holds for training data that are not independently sampled. In particular, the error bound applies to certain time series generated by well-known classes of dynamical models, such as ARX models.
It is a joint work with Mihaly Petreczky and Alireza Fakhrizadeh Esfahani from Université de Lille. It has been accepted for publication as part of the AAAI 2020 conference 41.
7.22 Axis 2: Multiview Boosting by controlling the diversity and the accuracy of view-specific voters
Participants: Pascal Germain.
We propose a boosting-based multiview learning algorithm which iteratively learns (i) weights over view-specific voters capturing view-specific information, and (ii) weights over views by optimizing a PAC-Bayes multiview C-Bound that takes into account the accuracy of view-specific classifiers and the diversity between the views. We derive a generalization bound for this strategy following the PAC-Bayes theory, which is a suitable tool to deal with models expressed as weighted combinations over a set of voters.
It is a joint work with Emilie Morvant from Université Jean Monnet de Saint-Etienne, Massih-Reza Amini from Université Grenoble-Alpes, and Anil Goyal, affiliated with both institutions. This work has been published in the journal Neurocomputing.
7.23 Axis 2: PAC-Bayes and Domain Adaptation
Participants: Pascal Germain.
In machine learning, Domain Adaptation (DA) arises when the distribution generating the test (target) data differs from the one generating the learning (source) data. It is well known that DA is a hard task even under strong assumptions, among which the covariate shift, where the source and target distributions diverge only in their marginals, i.e. they share the same labeling function. Another popular approach is to consider a hypothesis class that brings the two distributions closer while implying a low error for both tasks. This is a VC-dimension approach that restricts the complexity of a hypothesis class in order to get good generalization. Instead, we propose a PAC-Bayesian approach that seeks suitable weights to be given to each hypothesis in order to build a majority vote. We prove a new DA bound in the PAC-Bayesian context. This leads us to design the first DA-PAC-Bayesian algorithm based on the minimization of the proposed bound. Doing so, we seek a weighted majority vote that takes into account a trade-off between three quantities. The first two quantities are, as usual in the PAC-Bayesian approach, (a) the complexity of the majority vote (measured by a Kullback-Leibler divergence) and (b) its empirical risk (measured by the average errors on the source sample). The third quantity is (c) the capacity of the majority vote to distinguish some structural difference between the source and target samples.
This work has been published in the journal Neurocomputing 24.
It is a joint work with Emilie Morvant and Amaury Habrard from Université Jean Monnet de Saint-Etienne (France), and with François Laviolette from Laval University (Québec, Canada).
7.24 Axis 2: Interpreting Neural Networks as Majority Votes through the PAC-Bayesian Theory
Participants: Pascal Germain, Paul Viallard.
We propose a PAC-Bayesian theoretical study of the two-phase learning procedure of a neural network introduced by Kawaguchi et al. 84. In this procedure, a network is expressed as a weighted combination of all the paths of the network (from the input layer to the output one), which we reformulate as a PAC-Bayesian majority vote. Starting from this observation, their learning procedure consists in (1) learning a “prior” network to fix some parameters, then (2) learning a “posterior” network by only allowing a modification of the weights over the paths of the prior network. This allows us to derive a PAC-Bayesian generalization bound that involves the empirical individual risks of the paths (known as the Gibbs risk) and the empirical diversity between pairs of paths. Note that, similarly to classical PAC-Bayesian bounds, our result involves a KL-divergence term between a “prior” network and the “posterior” network. We show that this term is computable by dynamic programming without assuming any distribution on the network weights.
This early result has been accepted as a poster presentation at the international workshop “Workshop on Machine Learning with guarantees @ NeurIPS 2019”.
This is a joint work with researchers from Université Jean Monnet de Saint-Etienne: Amaury Habrard, Emilie Morvant, and Rémi Emonet.
7.25 Axis 2: PAC-Bayesian Bound for the Conditional Value at Risk
Participants: Benjamin Guedj.
Conditional Value at Risk (CVaR) is a family of “coherent risk measures” which generalize the traditional mathematical expectation. Widely used in mathematical finance, it is garnering increasing interest in machine learning, e.g. as an alternate approach to regularization, and as a means for ensuring fairness. This paper presents a generalization bound for learning algorithms that minimize the CVaR of the empirical loss. The bound is of PAC-Bayesian type and is guaranteed to be small when the empirical CVaR is small. We achieve this by reducing the problem of estimating CVaR to that of merely estimating an expectation. This then enables us, as a by-product, to obtain concentration inequalities for CVaR even when the random variable in question is unbounded.
Joint work with Zakaria Mhammedi (Australian National University) and Robert Williamson. Published: 39
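A small sketch of the empirical CVaR that the bound controls, using one common convention (the average of the worst alpha-fraction of losses); the numbers are illustrative:

```python
import numpy as np

def empirical_cvar(losses, alpha):
    """Empirical Conditional Value at Risk at level alpha: the average
    of the ceil(alpha * n) largest losses (one common convention)."""
    losses = np.sort(np.asarray(losses, dtype=float))[::-1]
    k = int(np.ceil(alpha * len(losses)))
    return losses[:k].mean()

losses = np.arange(1.0, 11.0)           # losses 1, 2, ..., 10
cvar_20 = empirical_cvar(losses, 0.2)   # mean of the two largest losses
```

At alpha = 1 the CVaR reduces to the ordinary empirical mean, which is the sense in which it generalizes the mathematical expectation.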
7.26 Axis 2: PAC-Bayesian Contrastive Unsupervised Representation Learning
Participants: Benjamin Guedj, Pascal Germain.
Contrastive unsupervised representation learning (CURL) is the state-of-the-art technique to learn representations (as a set of features) from unlabelled data. While CURL has collected several empirical successes recently, theoretical understanding of its performance was still missing. In a recent work, Arora et al. 86 provided the first generalisation bounds for CURL, relying on a Rademacher complexity. We extend their framework to the flexible PAC-Bayes setting, which allows us to deal with the non-iid setting. We present PAC-Bayesian generalisation bounds for CURL, which are then used to derive a new representation learning algorithm. Numerical experiments on real-life datasets illustrate that our algorithm achieves competitive accuracy, and yields generalisation bounds with non-vacuous values.
Joint work with Kento Nozawa (University of Tokyo & RIKEN, Japan). Published: 40
7.27 Axis 2: Revisiting clustering as matrix factorisation on the Stiefel manifold
Participants: Benjamin Guedj.
This work studies clustering for possibly high-dimensional data (e.g. images, time series, gene expression data, and many other settings), and rephrases it as low-rank matrix estimation in the PAC-Bayesian framework. Our approach leverages the well-known Burer-Monteiro factorisation strategy from large-scale optimisation, in the context of low-rank estimation. Moreover, our Burer-Monteiro factors are shown to lie on a Stiefel manifold. We propose a new generalized Bayesian estimator for this problem and prove novel prediction bounds for clustering. We also devise a componentwise Langevin sampler on the Stiefel manifold to compute this estimator.
Joint work with Stéphane Chrétien (Université Lyon 2). Published: 35
7.28 Axis 2: Kernel-Based Ensemble Learning in Python
Participants: Benjamin Guedj.
We propose a new supervised learning algorithm for classification and regression problems where two or more preliminary predictors are available. We introduce KernelCobra, a non-linear learning strategy for combining an arbitrary number of initial predictors. KernelCobra builds on the COBRA algorithm, which combined estimators based on a notion of proximity of predictions on the training data. While the COBRA algorithm used a binary threshold to declare which training points were close enough to be used, we generalise this idea by using a kernel to better encapsulate the proximity information. Such a smoothing kernel provides more representative weights to each of the training points used to build the aggregate and final predictor, and KernelCobra systematically outperforms the COBRA algorithm. While COBRA is intended for regression, KernelCobra handles both classification and regression. KernelCobra is included as part of the open source Python package Pycobra (0.2.4 and onward). Numerical experiments were undertaken to assess the performance (in terms of pure prediction and computational complexity) of KernelCobra on real-life and synthetic datasets.
Published: 25
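A condensed sketch of the KernelCobra idea (kernel-weighted aggregation driven by proximity of predictions), with hypothetical base predictors standing in for fitted models; this is not the Pycobra implementation:

```python
import numpy as np

rng = np.random.default_rng(7)

# Training data and two (hypothetical) base regressors.
x_train = rng.uniform(-3, 3, 200)
y_train = np.sin(x_train) + 0.1 * rng.normal(size=200)
predictors = [np.sin, lambda t: t - t ** 3 / 6]   # stand-ins for fitted models

# Predictions of every base predictor on the training points.
P_train = np.stack([p(x_train) for p in predictors], axis=1)

def kernel_cobra(x_new, bandwidth=0.3):
    """Combine base predictors by kernel-weighting training labels
    according to the proximity of prediction vectors (the original
    COBRA replaces the kernel by a hard 0/1 threshold)."""
    p_new = np.array([p(x_new) for p in predictors])
    d2 = ((P_train - p_new) ** 2).sum(axis=1)
    w = np.exp(-d2 / (2 * bandwidth ** 2))
    return (w * y_train).sum() / w.sum()

y_hat = kernel_cobra(1.0)
```

Note that proximity is measured in the space of predictions, not of inputs: training points on which the base predictors behave like they do at the query point receive the largest weights.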
7.29 Axis 2: Non-linear aggregation of filters to improve image denoising
Participants: Benjamin Guedj.
We introduce a novel aggregation method to efficiently perform image denoising. Preliminary filters are aggregated in a non-linear fashion, using a new metric of pixel proximity based on how the pool of filters reaches a consensus. We provide a theoretical bound to support our aggregation scheme, its numerical performance is illustrated and we show that the aggregate significantly outperforms each of the preliminary filters.
Joint work with Juliette Rengot, Ecole de Ponts, ParisTech.
Published: 37
7.30 Axis 2: Multiple change-points detection with reproducing kernels
Participants: Alain Celisse.
We tackle the change-point problem with data belonging to a general set. We build a penalty for choosing the number of change-points in the kernel-based method of Harchaoui and Cappé 83. This penalty generalizes the one proposed by Lebarbier 85 for a one-dimensional signal changing only through its mean. We prove a non-asymptotic oracle inequality for the proposed method, thanks to a new concentration result for some function of Hilbert-space valued random variables. Experiments on synthetic and real data illustrate the accuracy of our method, showing that it can detect changes in the whole distribution of data, even when the mean and variance are constant.
Joint work with Sylvain Arlot (Orsay) and Zaïd Harchaoui (Seattle). This work has been accepted for publication in JMLR.
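A toy sketch of kernel change-point detection for a single change-point, using a kernel least-squares segment cost in the spirit of Harchaoui and Cappé (the work above addresses the harder problem of selecting the number of change-points via a penalty):

```python
import numpy as np

rng = np.random.default_rng(8)

# A signal whose distribution changes at index 60 (mean shift).
x = np.concatenate([rng.normal(0, 1, 60), rng.normal(2, 1, 60)])
n = len(x)

K = np.exp(-np.subtract.outer(x, x) ** 2 / 2)   # Gaussian kernel matrix

def segment_cost(a, b):
    """Kernel least-squares cost of segment x[a:b]: trace of the
    within-segment kernel block minus its mean."""
    block = K[a:b, a:b]
    return np.trace(block) - block.sum() / (b - a)

# Best single change-point: minimize the total cost of the two segments
# (segments shorter than 5 points are excluded for stability).
costs = [segment_cost(0, t) + segment_cost(t, n) for t in range(5, n - 5)]
t_hat = int(np.argmin(costs)) + 5
```

Because the cost depends on the data only through the kernel, the same procedure detects changes in the whole distribution, not just in the mean.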
7.31 Axis 2: Analysis of early stopping rules based on discrepancy principle
Participants: Alain Celisse.
We describe a general unified framework for analyzing the statistical performance of early stopping rules based on the minimum discrepancy principle (DP). Finite-sample bounds such as deviation or oracle inequalities are derived with high probability. Since it turns out that DP suffers from some deficiencies when estimating smooth functions, refinements involving smoothing of the residuals are introduced and analyzed. Theoretical bounds are established in the fixed design setting under mild assumptions, such as the boundedness of the kernel. When focusing on the smoothed discrepancy principle, these bounds are even extended to the random design setting by means of a new change-of-norm argument.
Joint work with Markus Reiß (Humboldt) and Martin Wahl (Humboldt). This work has already been presented several times in seminars.
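A minimal sketch of the discrepancy principle for early stopping, here applied to a Landweber-type kernel iteration in a fixed design with a hypothetical signal and a known noise level:

```python
import numpy as np

rng = np.random.default_rng(9)

# Fixed-design regression: y = f + noise on a grid.
n, sigma = 100, 0.5
x = np.linspace(0, 1, n)
f = np.sin(2 * np.pi * x)
y = f + sigma * rng.normal(size=n)

# Landweber (gradient) iteration with a Gaussian kernel smoother.
K = np.exp(-np.subtract.outer(x, x) ** 2 / 0.005)
K /= np.linalg.eigvalsh(K).max()          # step-size normalization
f_hat = np.zeros(n)
for t in range(1, 1001):
    f_hat = f_hat + K @ (y - f_hat)
    # Discrepancy principle: stop once the residual norm drops to the
    # (assumed known) noise level sigma * sqrt(n).
    if np.linalg.norm(y - f_hat) <= sigma * np.sqrt(n):
        break
stopped_at = t
err = np.linalg.norm(f_hat - f) / np.sqrt(n)
```

Running the iteration to convergence would interpolate the noisy data; the rule stops as soon as the residual is statistically indistinguishable from pure noise.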
7.32 Axis 3: Short-term air temperature forecasting using Nonparametric Functional Data Analysis and SARMA models
Participants: Sophie Dabo-Niang.
Air temperature is a significant meteorological variable that affects social activities and economic sectors. In this paper, a non-parametric and a parametric approach are used to forecast hourly air temperature up to 24 h in advance. The former is a regression model in the Functional Data Analysis framework. The nonlinear regression operator is estimated using a kernel function. The smoothing parameter is obtained by a cross-validation procedure and used for the selection of the optimal number of closest curves. The other method applied is a Seasonal Autoregressive Moving Average (SARMA) model, the order of which is determined by the Bayesian Information Criterion. The obtained forecasts are combined using weights calculated based on the forecast errors. The results show that SARMA has a better performance for the first 6 forecasted hours, after which the Non-Parametric Functional Data Analysis (NPFDA) model provides superior results. Forecast pooling improves the accuracy of the forecasts.
It is a joint work with Stelian Curceac (Rothamsted Research, United Kingdom), Camille Ternynck (CERIM, Université de Lille), Taha B.M.J. Ouarda (INRS, Québec, Canada) and Fateh Chebana (INRS, Québec, Canada). This work has been published in the journal Environmental Modelling and Software.
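A small sketch of forecast pooling with weights computed from past forecast errors, using one standard convention (inverse mean squared error); the error series and forecast values below are illustrative, not taken from the study:

```python
import numpy as np

# Validation-period errors of two hypothetical forecasting models,
# standing in for the SARMA and NPFDA forecasters.
errors_sarma = np.array([0.8, 1.1, 0.9, 1.3])
errors_npfda = np.array([1.6, 1.4, 1.9, 1.5])

def inverse_mse_weights(*error_series):
    """Combination weights proportional to the inverse mean squared
    error of each model (one standard forecast-pooling rule)."""
    inv_mse = np.array([1.0 / np.mean(np.square(e)) for e in error_series])
    return inv_mse / inv_mse.sum()

w = inverse_mse_weights(errors_sarma, errors_npfda)

# Pooled forecast: weighted average of the individual forecasts.
forecast_sarma, forecast_npfda = 21.3, 22.1     # illustrative values
pooled = w[0] * forecast_sarma + w[1] * forecast_npfda
```

Recomputing the weights per forecast horizon would reproduce the behaviour reported above, where SARMA dominates early horizons and NPFDA the later ones.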
7.33 Axis 3: Mathematical Modeling and Study of Random or Deterministic Phenomena
Participants: Sophie Dabo-Niang.
In order to identify mathematical modeling (including functional data analysis) and interdisciplinary research issues in evolutionary biology, epidemiology, epistemology, environmental and social sciences encountered by researchers in Mayotte, the first international conference on mathematical modeling (CIMOM’18) was held in Dembéni, Mayotte, from November 15 to 17, 2018, at the Centre Universitaire de Formation et de Recherche. The objective was to focus on mathematical research with interdisciplinarity. This contribution is a book discussing key aspects of recent developments in applied mathematical analysis and modeling. It was written after the conference, where a call for book chapters was made. The chapters were written in the form of journal articles, with new results extending the talks given during the conference, and were reviewed by independent reviewers and the book publisher. The book highlights a wide range of applications in the fields of biological and environmental sciences, epidemiology and social perspectives. Each chapter examines selected research problems and presents a balanced mix of theory and applications on selected topics. Particular emphasis is placed on presenting the fundamental developments in mathematical analysis and modeling, and on highlighting the latest developments in different fields of probability and statistics. The chapters are presented independently and contain enough references to allow the reader to explore the various topics presented.
It is a joint work with Solym Manou-Abi and Jean-Jacques Salone (Centre Universitaire de Mayotte). This book is to appear with Wiley (ISTE).
7.34 Axis 3: Categorical functional data analysis
Participants: Cristian Preda, Quentin Grimonprez, Vincent Vandewalle.
Research on functional data analysis is very active. The R package “fda” is the best-known package implementing methodology for functional data. To the best of our knowledge, and quite surprisingly, there is no recent research devoted to categorical functional data, despite its ability to model real situations in different fields of application: health and medicine (status of a patient over time), economy (status of the market), sociology (evolution of social status), and so on. We have developed methodology to visualize, perform dimension reduction on, and extract features from categorical functional data. For this, the cfda R package has been developed. This has led to the preprint 72, which will be submitted to an international journal.
7.35 Axis 3: Scan Statistics
Participants: Cristian Preda, Alexandru Amarioarei.
The one-dimensional discrete scan statistic is considered over sequences of random variables generated by block-factor dependence models. Viewed as the maximum of a 1-dependent stationary sequence, the distribution of the scan statistic is approximated with high accuracy and sharp bounds are provided. The longest-increasing-run statistic is related to the scan statistic and its distribution is studied. The moving average process is a particular case of block factor, and the distribution of the associated scan statistic is approximated. Numerical results are presented.
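A minimal sketch of the one-dimensional discrete scan statistic (the maximum sum over a sliding window), evaluated on a hypothetical Bernoulli sequence with a planted elevated-rate stretch:

```python
import numpy as np

rng = np.random.default_rng(10)

def scan_statistic(x, m):
    """Discrete one-dimensional scan statistic: the maximum sum over
    all windows of m consecutive observations."""
    window_sums = np.convolve(x, np.ones(m), mode="valid")
    return window_sums.max()

# Bernoulli(0.1) background with a Bernoulli(0.9) stretch hidden inside.
x = rng.binomial(1, 0.1, 500).astype(float)
x[200:210] = rng.binomial(1, 0.9, 10)
s = scan_statistic(x, 10)
```

The distributional approximations studied in the work concern precisely the law of this maximum under dependence induced by block factors, which a plain i.i.d. analysis does not cover.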
7.36 Axis 3: Clustering categorical functional data
Participants: Cristian Preda, Vincent Vandewalle, Vlad Stefan Barbu.
The objective of this research direction was: (i) to propose possible modelling approaches for categorical functional data, and (ii) to investigate the identifiability of such models. A first modelling framework is to consider that an observed functional data path represents a sample path of a Markov process, and that the sample paths come from several different processes; consequently, we have a mixture of different Markov processes. A second modelling framework is to consider that the observed sample paths come from several semi-Markov processes. The parameter estimation is obtained through techniques based on the EM algorithm, while the selection of the number of classes is based on information criteria. An important problem is to determine the class membership of each sample path, but our main concern, which we have started to investigate, is the identifiability problem. As far as we have studied, it seems that the identifiability of this type of model cannot be obtained in general, but only by imposing restrictions on the parameters of the model, cf. 81, 82. Our work in progress is related to finding sufficiently general conditions that guarantee this identifiability.
7.37 Axis 3: Estimation of right-censored categorical functional data
Participants: Cristian Preda, Vincent Vandewalle, Vlad Stefan Barbu.
As mentioned in Section 7.36, we are interested in modelling categorical functional data by means of semi-Markov processes. These processes generalize Markov processes in the sense that the sojourn time in a state can be arbitrarily distributed, as opposed to the Markov case. For this reason, semi-Markov processes are flexible tools, better suited to concrete applications than Markov processes 80. As in any modelling framework, one crucial point is to obtain reliable estimators of the parameters of the model. A very important feature in many applications (e.g. survival analysis, reliability, etc.) is to take censored data into account. In the presence of right-censored sample paths, the estimation of semi-Markov processes in continuous time is still an open problem, while for discrete-time semi-Markov processes there is only one existing study, in a non-parametric setting 87. In this framework, we have already established the main setting and derived the form of the Q-function for the EM algorithm. Several choices have to be made, which open different research paths: parametric versus non-parametric estimation of the sojourn-time distributions, types of semi-Markov processes, mixtures of sojourn-time distributions, mixtures of semi-Markov processes, etc. The next step of our work is to implement this estimation algorithm and to investigate, calibrate and adapt it. Another feature that we have not yet considered, but which could be of great importance in some applications, is data censored under other censoring schemes, such as censoring at the beginning of the sample path or interval censoring.
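The defining feature above, arbitrary sojourn-time distributions, is easy to see in simulation. The sketch below (an illustration under assumed toy distributions, not the estimation algorithm under development) simulates a semi-Markov path up to a horizon, which naturally right-censors the last sojourn:

```python
import random

def simulate_semi_markov(P, sojourn, start, horizon, rng):
    """Simulate one sample path of a semi-Markov process up to `horizon`.

    The embedded jump chain follows transition matrix P (nested dicts),
    while the sojourn time in each state is drawn from a state-specific,
    arbitrary distribution -- the feature distinguishing semi-Markov from
    Markov processes. Returns (jump_time, state) pairs; the final sojourn
    is right-censored at `horizon`.
    """
    t, s, path = 0.0, start, []
    while t < horizon:
        path.append((t, s))
        t += sojourn[s](rng)           # arbitrary sojourn distribution
        u, acc = rng.random(), 0.0
        for nxt, p in P[s].items():    # draw the next state from P[s]
            acc += p
            if u <= acc:
                s = nxt
                break
    return path

rng = random.Random(0)
P = {"A": {"B": 1.0}, "B": {"A": 1.0}}
sojourn = {"A": lambda r: r.expovariate(1.0),  # exponential: Markov case
           "B": lambda r: 1.0 + r.random()}    # uniform on [1, 2]: not Markov
path = simulate_semi_markov(P, sojourn, "A", 10.0, rng)
```

Estimation reverses this picture: from censored paths like `path`, the EM algorithm must recover both the jump chain and the sojourn-time distributions.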
7.38 Axis 4: Statistical analysis of high-throughput proteomic data
Participants: Guillemette Marot, Vincent Vandewalle, Wilfried Heyse.
Since November 2019, Wilfried Heyse has been preparing a PhD thesis funded by INSERM and supervised by Christophe Bauters, Guillemette Marot and Vincent Vandewalle. The aim is to identify, early after myocardial infarction (MI), patients at high risk of developing left ventricular remodelling (LVR), which is quantified by imaging one year after MI, or patients at high risk of death. For that purpose, a high-throughput proteomic approach is used. This technology allows the measurement of 5000 proteins simultaneously. In parallel to these measures, which correspond to the concentration of a protein in a plasma sample collected from one patient at a specific time, echocardiographic and clinical information has been collected on each of the 200 patients. One of the main challenges is to take into account the variation of the biomarkers over time (several measurement times), in order to improve the understanding of the biological mechanisms involved in LVR or in the survival of the patient. Preliminary results have been presented in 38, 79.
This is a joint work with Florence Pinet and Christophe Bauters from INSERM.
7.39 Axis 4: Reject Inference Methods in Credit Scoring
Participants: Christophe Biernacki, Adrien Ehrhardt, Philippe Heinrich, Vincent Vandewalle.
The granting process of all credit institutions rejects applicants with a low credit score. Developing a scorecard, i.e. a correspondence table between a client’s characteristics and their score, requires a learning dataset in which the target variable good/bad borrower is known. Rejected applicants are de facto excluded from this process. This biased learning population might have deep consequences on the relevance of the scorecard. Some works, mostly empirical, try to exploit rejected applicants in the scorecard-building process. This work proposes a rational criterion to evaluate the quality of a scoring model for the existing Reject Inference methods, and uncovers their implicit mathematical hypotheses. It is shown that, up to now, no such Reject Inference method can guarantee a better credit scorecard. These conclusions are illustrated on simulated and real data from the French branch of Crédit Agricole Consumer Finance (CACF). This has led to the preprint 63, which is now in revision for an international journal.
This is a joint work with Sébastien Beben of Crédit Agricole Consumer Finance.
7.40 Axis 4: Usability study
Participants: Vincent Vandewalle.
Since 2018, Vincent Vandewalle has been working with Alexandre Caron and Benoît Dervaux on estimating the number of usability problems and the value of information in usability studies. Based on a usability study of a medical device (e.g. an insulin pump), the objective is to determine the number of possible problems linked to its use, as well as their respective occurrence probabilities. Estimating this number and these probabilities is essential to decide whether or not an additional usability study should be conducted, and to determine the number of users to include in that study to maximize the expected benefits.
The discovery process can be modeled by a binary matrix whose number of columns depends on the number of defects discovered by the users. In this framework, they have proposed a probabilistic model of this matrix and embedded it in a Bayesian framework where the number of problems and the discovery probabilities are treated as random variables. The resulting article 31 has been published. It shows the interest of the approach compared with the state-of-the-art approaches in usability. Beyond point estimation, the approach also makes it possible to obtain the distribution of the number of problems and of their respective probabilities given the discovery matrix.
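Their full model treats the number of problems and the discovery probabilities jointly; as a deliberately simplified illustration of the discovery-matrix format only (not the model of 31), each already-observed problem's discovery probability can be given an independent Beta posterior:

```python
def posterior_discovery_probs(matrix, a=1.0, b=1.0):
    """Posterior mean of each observed problem's discovery probability
    under independent Beta(a, b) priors, given an n_users x n_problems
    binary discovery matrix (rows = users, columns = problems)."""
    n_users = len(matrix)
    columns = zip(*matrix)  # one column per observed problem
    return [(sum(col) + a) / (n_users + a + b) for col in columns]

matrix = [[1, 0, 1],   # user 1 discovered problems 1 and 3
          [1, 1, 0],
          [0, 0, 1]]
print(posterior_discovery_probs(matrix))  # [0.6, 0.4, 0.6]
```

The genuinely hard part, which this sketch ignores, is the unknown number of columns: problems never discovered by any user do not appear in the matrix at all.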
The proposed model also makes it possible to measure the value of additional information about the discovery process. In this framework, they are writing a second paper and developing the R package useval, available soon. This work has been presented at a conference 48.
This is a joint work with Alexandre Caron and Benoît Dervaux both from ULR 2694: METRICS.
7.41 Axis 4: Artificial intelligence for aviation
Participants: Florent Dewez, Benjamin Guedj, Arthur Talpaert, Vincent Vandewalle.
Since November 2018, Benjamin Guedj and Vincent Vandewalle have been participating in the European PERF-AI project (Enhance Aircraft Performance and Optimization through the utilization of Artificial Intelligence) in partnership with the company Safety Line. The project uses data collected during flights to develop machine-learning models that optimize the aircraft's trajectory, for example with respect to fuel consumption. In this context they have hired Florent Dewez (post-doctoral researcher) and Arthur Talpaert (engineer).
The article 21 is now published. It explains how, using flight recording data, it is possible to implement learning models on variables that have not been directly observed, and in particular to predict the drag and lift coefficients as a function of the angle and speed of the aircraft.
A second article, about the optimization of the aircraft's trajectory based on a consumption model learned from the data, is about to be submitted and is available as a preprint 62. The originality of the approach consists in decomposing the trajectory on a functional basis and carrying out the optimization on the coefficients of this decomposition, rather than approaching the problem through optimal control. Furthermore, to guarantee compliance with aeronautical constraints, we have proposed an approach penalized by a deviation term from reference flights. A generic Python module (PyRotor) has been developed to solve such optimization problems in conjunction with the proposed approach.
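The core idea, replacing a function-valued unknown by a handful of basis coefficients, can be sketched generically. The cosine basis below is purely illustrative (it is not the basis used in the preprint or in PyRotor): a sampled trajectory is compressed to K coefficients, on which an optimizer could then act.

```python
from math import cos, pi

def basis_coeffs(x, K):
    """Project a sampled trajectory onto the first K cosine basis
    functions (a DCT-II), compressing it to K coefficients."""
    N = len(x)
    return [(1 / N if k == 0 else 2 / N)
            * sum(x[n] * cos(pi * k * (n + 0.5) / N) for n in range(N))
            for k in range(K)]

def reconstruct(coeffs, N):
    """Evaluate the trajectory encoded by the basis coefficients."""
    return [sum(c * cos(pi * k * (n + 0.5) / N)
                for k, c in enumerate(coeffs))
            for n in range(N)]

altitude = [0.0, 2.0, 3.0, 3.5, 3.0, 2.0]          # toy climb/descent profile
smooth = reconstruct(basis_coeffs(altitude, 3), 6)  # only 3 free coefficients
```

With K equal to the number of samples the reconstruction is exact; with small K the trajectory is smooth by construction, and constraints or penalties (such as the deviation term from reference flights) become functions of a low-dimensional coefficient vector.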
7.42 Axis 4: Domain Adaptation from a Pre-trained Source Model
Participants: Christophe Biernacki, Pascal Germain, Luxin Zhang.
The traditional statistical learning paradigm assumes consistency between the train and test data distributions. This rarely holds in real-life applications. The domain adaptation paradigm proposes a variety of techniques to overcome this issue. Most works in this area seek either a latent space where source and target data share the same distribution, or a transformation of the source distribution to match the target one. Both strategies require learning a model on the transformed source data. We study an original scenario in which one is given a model that was constructed using expertise on source data that is no longer accessible. To use this model directly on target data, we propose to learn a transformation from the target domain to the source domain. To the best of our knowledge, this is a new perspective on domain adaptation. We introduce and formalize this learning problem, and study the assumptions and sufficient conditions required to guarantee good accuracy when applying the source model directly to transformed target data. Pursuing this idea, we propose a new domain adaptation method based on optimal transport and evaluate it on a fraud detection problem. This work has been accepted at an international conference 42.
It is a joint work with Yacine Kessaci from the Worldline company.
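The target-to-source idea of Section 7.42 can be sketched in one dimension, where, for equal sample sizes, the empirical optimal transport map reduces to rank (quantile) matching. The threshold model and data below are invented for illustration; the actual method operates in higher dimension with a learned transport map.

```python
def ot_map_1d(target, source):
    """1-D empirical optimal transport with equal sample sizes: map each
    target sample to the source sample of the same rank."""
    assert len(target) == len(source)
    order = sorted(range(len(target)), key=lambda i: target[i])
    src_sorted = sorted(source)
    mapped = [0.0] * len(target)
    for rank, i in enumerate(order):
        mapped[i] = src_sorted[rank]
    return mapped

def source_model(x):
    """Pretrained scoring rule from the (now inaccessible) source domain."""
    return int(x > 10.0)

source = [8.0, 9.0, 11.0, 12.0]   # source-domain feature values
target = [0.8, 0.9, 1.1, 1.2]     # same phenomenon, different scale
mapped = ot_map_1d(target, source)
print([source_model(v) for v in mapped])  # [0, 0, 1, 1]
```

The point of the construction is that the frozen source model is applied unchanged: only the target data is moved, which is exactly the reversal of the usual source-to-target adaptation.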
7.43 Other: Projection Under Pairwise Control
Participants: Christophe Biernacki.
Visualizing high-dimensional and possibly complex (e.g. non-continuous) data in a low-dimensional space can be difficult. Several projection methods have already been proposed for displaying such high-dimensional structures in a lower-dimensional space, but the information lost in the projection is not always easy to use. Here, a new projection paradigm is presented: a non-linear projection method that takes into account the projection quality of each projected point in the reduced space, this quality being directly available on the same scale as the reduced space. More specifically, this novel method allows straightforward visualization of data in R2 with a simple reading of the approximation quality, and thus provides a novel variant of dimensionality reduction. This work has been accepted in an international journal 13.
It is a joint work with Hiba Alawieh and Nicolas Wicker, both from Université de Lille.
7.44 Other: On the Local and Global Properties of the Gravitational Spheres of Influence
Participants: Christophe Biernacki.
We revisit the concept of the sphere of gravitational activity, to which we give both a geometrical and a physical meaning. This study aims to refine this concept in a much broader context that could, for instance, be applied to exo-planetary problems (in a Galactic stellar disc-Star-Planets system) to define a first-order “border” of a planetary system. The methods used in this paper rely on classical Celestial Mechanics and develop the equations of motion in the framework of the 3-body problem (e.g. a Star-Planet-Satellite system). We start with the basic definition of a planet’s sphere of activity as the region of space in which it is feasible to assume the planet as the central body and the Sun as the perturbing body when computing perturbations of the satellite’s motion. We then investigate the geometrical properties and physical meaning of the ratios of Solar accelerations (central and perturbing) and planetary accelerations (central and perturbing), and the boundaries they define. We clearly distinguish throughout the paper between the sphere of activity, the Chebotarev sphere (a particular case of the sphere of activity), the Laplace sphere, and the Hill sphere; the last two are often wrongly thought to be one and the same. Furthermore, taking a closer look and comparing the ratio of the star’s accelerations (central/perturbing) to that of the planetary accelerations (central/perturbing) as a function of the planeto-centric distance, we have identified different dynamical regimes, which are presented in the semi-analytical analysis. This work has been published in an international journal 30.
This is a joint work with Damya Souami from the Observatoire de Paris and Jacky Cresson from the Université de Pau et des Pays de l’Adour.
8 Bilateral contracts and grants with industry
8.1 Bilateral contracts with industry
COLAS company
Participants: Christophe Biernacki.
COLAS is a world leader in the construction and maintenance of transport infrastructure. This bilateral contract aims at classifying mixed data obtained with sensors in a study of the aging of road surfacing. The challenge is to deal with many missing (sensor failures) and correlated (sensor proximity) data.
PAY-BACK company
Participants: Christophe Biernacki.
PAY-BACK Group is an audit firm specializing in the analysis and reliability of transactions. This bilateral contract aims at predicting store sales both from past sales (time series) and by exploiting external covariates (of different types).
ADULM
Participants: Sophie Dabo-Niang, Cristian Preda.
The main goal of this project with the Lille Metropole Urban Development and Planning Agency (ADULM) is to design a tool for the Territorial Coherence Scheme (SCoT) to monitor urban developments and develop territorial observation.
8.2 Bilateral grants with industry
Worldline
Participants: Christophe Biernacki.
Worldline is the new world-class leader in the payments and transactional services industry, with a global reach. A PhD began in Feb. 2019 with Luxin Zhang under the supervision of Christophe Biernacki, Pascal Germain (Laval University, Canada) and Yacine Kessaci (Worldline) on the topic of domain adaptation from a pre-trained source model (with application to fraud detection in electronic payments).
ADEO
Participants: Christophe Biernacki, Vincent Vandewalle.
Adeo is No. 1 in Europe and No. 3 worldwide in the DIY market. A PhD began in Dec. 2020 with Axel Potier under the supervision of Christophe Biernacki, Vincent Vandewalle, Matthieu Marbac (ENSAI) and Julien Favre (ADEO) on the topic of sales forecasting for “slow movers” (items sold in low quantities).
EIT-Sysbooster: Nokia - Apsys/Airbus
Participants: Alain Celisse.
Nokia and Airbus are two worldwide-known companies working in the communications and transport areas, respectively. The purpose of this contract is to perform root cause analysis in order to reduce the number of failures.
9 Partnerships and cooperations
9.1 International initiatives
9.1.1 Inria International Labs
6PAC
Participants: Benjamin Guedj.
- Title: Making Probably Approximately Correct Learning Active, Sequential, Structure-aware, Efficient, Ideal and Safe
- Duration: 2018–2022
- Partners: Machine Learning Group, CWI (The Netherlands)
- Summary: This project is rooted in statistical learning theory, which can be viewed as the theoretical foundation of machine learning. The most common framework is a setup in which one is given n training examples, and the goal is to build a predictor that would be efficient on new (similar) data. This efficiency should be supported by PAC (Probably Approximately Correct) guarantees, e.g. upper bounds on the excess risk of a predictor that hold with high probability. Such guarantees, however, often hold under stringent assumptions which are typically never met in real-life applications, e.g. independent, identically distributed data. More realistic modelling of data has triggered many research efforts in several directions: first, accommodating more general data (e.g. dependent, heavy-tailed); second, sequential learning, in which the predictor can be built on the fly while new data is gathered. We believe that an even more realistic paradigm is active learning, a setup in which the learner actively requests data (possibly facing constraints, such as storage, velocity, cost, etc.) and adapts its queries to optimize its performance. The three-year objective of 6PAC (where 6 stands for Sequential, Active, Efficient, Structured, Ideal, Safe — the six research directions we intend to contribute to) is to pave the way to new PAC generalization and sample-complexity upper and lower bounds, beyond batch learning. Our ambition is to contribute to several learning setups, ranging from sequential learning (where data streams are collected) to adaptive and active learning (where data streams are requested by the learning algorithm).
9.1.2 Inria international partners
Benjamin Guedj leads The Inria London Programme, an initiative from Inria to increase the volume of scientific collaborations with the UK and in particular with the London region, with the prime partnership with University College London (United Kingdom).
More details at https://
9.2 International research visitors
9.2.1 Visits of international scientists
- Apoorv Vikram Singh (IISc Bangalore, India) visited Hemant Tyagi from Oct 2019 to Jan 2020 to work on a project related to clustering of signed networks. This was partially funded by the Turing Institute, London. Apoorv worked under the joint supervision of Hemant Tyagi and Mihai Cucuringu (University of Oxford, United Kingdom) during this period.
- Déborah Sulem (PhD student, University of Oxford, United Kingdom) visited Hemant Tyagi on January 13–15, 2020.
9.3 European initiatives
9.3.1 FP7 & H2020 Projects
H2020 FAIR
Participants: Guillemette Marot.
- Acronym: FAIR
- Project title: Flagellin aerosol therapy as an immunomodulatory adjunct to the antibiotic treatment of drug-resistant bacterial pneumonia
- Coordinator: JC. Sirard (Inserm, CIIL)
- Duration: 4 years (2020–2023)
- Partners: Inserm, Université de Lille, Free University of Berlin (Germany), Epithelix (Switzerland), Aerogen (Ireland), Statens Serum Institute (Denmark), CHRU Tours, Academic Medical Center of the University of Amsterdam (The Netherlands), University of Southampton (United Kingdom), European Respiratory Society (Switzerland)
- Abstract: The FAIR project aims at evaluating an alternative adjunct strategy to standard-of-care antibiotics for treating pneumonia caused by antibiotic-resistant bacteria: activation of the innate immune system in the airways. Guillemette Marot is involved in this H2020 project as scientific head of the bilille platform, and will supervise a 1-year engineer on the integration of omic data.
H2020 PERF-AI
Participants: Florent Dewez, Benjamin Guedj, Arthur Talpaert, Vincent Vandewalle.
- Acronym: PERF-AI
- Project title: Enhance Aircraft Performance and Optimisation through utilisation of Artificial Intelligence
- Coordinator: Pierre Jouniaux (Safety-Line)
- Duration: 2 years (2018–2020)
- Partners: Safety-Line
-
Abstract: PERF-AI will apply machine learning techniques to flight data (parametric and non-parametric approaches) to accurately measure actual aircraft performance throughout its lifecycle.
Within current airline operations, both at the flight preparation (on-ground) and flight management (in-air) levels, the trajectory is first planned, then managed by the Flight Management System (FMS) using a single manufacturer’s performance model that is the same for every aircraft of the same type, and a weather forecast computed long before the flight. This induces a lack of accuracy during the planning phase: the flight route is pre-established at specific altitudes and speeds to optimize fuel burn, from take-off to landing, using aircraft performances that are not those of the real aircraft. Moreover, the actual flight will usually shift from the original plan because of Air Traffic Control (ATC) constraints, adverse weather, wind changes and tactical re-routing, with no possibility for the flight crew, either through the FMS or through connected services, to tactically recompute the trajectory in order to continuously optimize the flight path. This is in particular due to the limitations of the performance databases that current systems use.
Hence, PERF-AI focuses on identifying adequate machine learning algorithms, testing their accuracy and their capability to perform statistical analysis of flight data, and developing mathematical models to optimize real flight trajectories with respect to actual aircraft performance, thus minimizing fuel consumption throughout the flight.
The consortium consists of Safety-Line and Inria, with full expertise in aircraft performance and data science, hence able to propose, test and validate the different statistical models that will accurately solve these optimization challenges and implement them in an operational environment.
PERF-AI total grant request to the CSJU is 568 550 € with total project duration of 24 months.
9.4 National initiatives
COVIDOM project
During the 1st lockdown in France, Christophe Biernacki supervised a task force composed of three Inria research teams (MODAL, STATIFY, TAU) for analysing data coming from the medical database COVIDOM of AP-HP concerning suspected COVID-19 patients. This project was included in the overall national Inria “mission COVID” initiative.
Programme of Investments for the Future (PIA)
Bilille is a member of the PIA “Infrastructures en biologie-santé” IFB, the French Institute of Bioinformatics (https://
RHU PreciNASH
Participants: Guillemette Marot.
- Acronym: PreciNASH
- Project title: Non-alcoholic steato-hepatitis (NASH) from disease stratification to novel therapeutic approaches
- Coordinator: François Pattou (Université de Lille, Inserm, CHRU Lille)
- Duration: 5 years
- Partners: FHU Integra and Sanofi
- Abstract: PreciNASH, coordinated by Pr. F. Pattou (UMR 859, EGID), aims at better understanding non-alcoholic steato-hepatitis (NASH) and improving its diagnosis and care. In this RHU, Guillemette Marot supervises a 2-year post-doc, as her team ULR 2694 METRICS is a member of the FHU Integra. METRICS is involved in WP1 for the development of a clinical-biological model for the prediction of NASH. Other partners of the FHU are UMR 859, UMR 1011 and UMR 8199, the last three teams being part of the labex EGID (European Genomic Institute for Diabetes). Sanofi is the main industrial partner of the RHU PreciNASH. The whole project will last 5 years (2016–2021).
CNRS PEPS Blanc –– BayesRealForRNN project
Participants: Pascal Germain, Vera Shalaeva.
- Acronym: BayesRealForRNN
- Project title: PAC-Bayesian theory for recurrent neural networks: a control theoretic approach
- Coordinator: Mihaly Petreczky (CNRS, UMR 9189 CRIStAL, Université de Lille)
- Year: 2019
- Abstract: The project proposes to analyze the mathematical correctness of deep learning algorithms by combining techniques from control theory and PAC-Bayesian statistical theory. More precisely, it concentrates on recurrent neural networks (RNNs), develops their structure theory using techniques from control theory, and then applies this structure theory to derive PAC-Bayesian error bounds for RNNs.
CNRS AMIES PEPS 2 — DiagChange project
Participants: Cristian Preda, Quentin Grimonprez.
- Acronym: DiagChange
- Year: 2019
- Abstract: The project studies the detection of changes in distribution for multivariate signals in an industrial context. It is carried out in collaboration with the DiagRAMS start-up.
CNRS AMIES PEPS 1 — PIVISCoT
Participants: Sophie Dabo-Niang, Cristian Preda.
- Year: 2020
- Abstract: The project aims to create software for the Territorial Coherence Scheme (SCoT) in Lille in order to monitor urban developments and develop territorial observation.
AMIES PEPS 2 — MadiPa
Participants: Stéphane Girard, Serge Iovleff.
- Acronym: MadiPa
- Project title: Modèles Auto-associatifs pour la Dispersion de Polluants dans l’Atmosphère
- Duration: 18 months (started in December 2019)
- Partners: Société Phimeca (http://phimeca.com/), Mistis team, Inria Grenoble Rhône-Alpes
- Abstract: Our goal is to develop a method for predicting the dispersion of pollutants in the atmosphere from an initial emission map and meteorological data. A map of the probabilities of exceeding a critical threshold of pollutants will be estimated thanks to the construction of a meta-model: the large dimension of the problem is reduced by the use of auto-associative models, a non-linear extension of Principal Component Analysis.
9.4.1 ANR
APRIORI
Participants: Benjamin Guedj, Pascal Germain, Hemant Tyagi, Vera Shalaeva.
- Type: ANR PRC
- Acronym: APRIORI
- Project title: PAC-Bayesian theory and algorithms for deep learning and representation learning
- Coordinator: Emilie Morvant (Université Jean Monnet)
- Duration: 2019–2023
- Funding: 300k EUR
- Partners: MODAL, Laboratoire Hubert Curien (UMR CNRS 5516)
BEAGLE
Participants: Benjamin Guedj, Pascal Germain.
- Type: ANR JCJC
- Acronym: BEAGLE
- Duration: 2019–2023
- Project title: PAC-Bayesian theory and algorithms for agnostic learning
- Funding: 180k EUR
- Partners: Pierre Alquier (RIKEN AIP, Japan), Peter Grünwald (CWI, The Netherlands), Rémi Bardenet (UMR CRIStAL 9189)
SMILE
Participants: Christophe Biernacki, Vincent Vandewalle.
- Acronym: SMILE
- Duration: 2018–2022
- Project title: Statistical Modeling and Inference for unsupervised Learning at LargE-Scale
- Coordinator: Faicel Chamroukhi (LMNO, Université de Caen)
- Partners: MODAL, LMNO UMR CNRS 6139 (Caen), LMRS UMR CNRS 6085 (Rouen), LIS UMR CNRS 7020 (Toulon)
TheraSCUD2022
Participants: Guillemette Marot.
- Acronym: TheraSCUD2022
- Project title: Targeting the IL-20/IL-22 balance to restore pulmonary, intestinal and metabolic homeostasis after cigarette smoking and unhealthy diet
- Coordinator: P. Gosset (Institut Pasteur de Lille)
- Duration: 3 years (2017–2020)
- Partners: CIIL Institut Pasteur de Lille and UMR 1019 INRA Clermont-Ferrand
- Abstract: The TheraSCUD2022 project studies inflammatory disorders associated with cigarette smoking and unhealthy diet (SCUD). Guillemette Marot is involved in this ANR project as head of the bilille platform, and will supervise a 1-year engineer on the integration of omic data.
9.4.2 Working groups
- Sophie Dabo-Niang belongs to the following working groups:
- STAFAV (STatistiques pour l'Afrique Francophone et Applications au Vivant)
- ERCIM Working Group on computational and Methodological Statistics, Nonparametric Statistics Team
- Franco-African IRN (International Research Network) in Mathematics, funded by CNRS
- ONCOLille (Cancer Research Institute in Lille)
- Benjamin Guedj belongs to the following working groups (GdR) of CNRS:
- ISIS (local referee for Inria Lille - Nord Europe)
- MaDICS
- MASCOT-NUM (local referee for Inria Lille - Nord Europe)
- Guillemette Marot belongs to the StatOmique working group
9.5 Regional initiatives
9.5.1 bilille, the bioinformatics platform of Lille
Participants: Guillemette Marot, Maxime Brunin, Iheb Eladib.
bilille, the bioinformatics platform of Lille, officially joined UMS 2014/US 41 PLBS (Plateformes Lilloises en Biologie Santé) in January 2020. In 2020, Guillemette Marot co-headed the platform with Hélène Touzet (CNRS, CRIStAL). Inria employed 2 engineers for this platform:
- M. Brunin, who participated in the development of the visCorVar tool, which facilitates multi-block analysis for the statistical integration of omics data, and in the analyses of the TheraSCUD2022 ANR project.
- I. Eladib, who participated in the development of tools for bilille cloud, in order to simplify and optimize its use.
More information about the platform is available at
https://
Collaborations of the year linked to bilille
Participants: Guillemette Marot.
Guillemette Marot has supervised the data-analysis part of, or support in biostatistics tool testing for, the following research projects involving engineers from bilille (only the names of the principal investigators are given, even if several partners are sometimes involved):
- CIIL, L. Poulin, InflammReg
- Infinite, V. Sobanski, Evapass
- U1011, Y. Sebti, Circaregen
- U1011, D. Dombrowicz, DeconImmunMetab
10 Dissemination
10.1 Promoting scientific activities
10.1.1 Scientific events: organisation
General chair, scientific chair
- Benjamin Guedj has been appointed (March 2020) general local chair of COLT 2022 to be held in London
- Hemant Tyagi is the organizer of the MODAL team scientific seminar
- Sophie Dabo-Niang is co-chair of the group Statistics, applied math and computer science of Pan-African Scientific Research Council, funded by Princeton University (USA)
Member of the organizing committees
- Sophie Dabo-Niang is co-chair of the Organizing Committee of the 3rd Conference on Econometrics for Environment, December 2020, Lille.
10.1.2 Scientific events: selection
Christophe Biernacki has been president of the scientific committee of JdS 2020, the annual national meeting of the French statistical society (SFdS).
Reviewer
- Sophie Dabo-Niang has reviewed several papers for several journals during 2020 including Spatial Statistics, JSPI, Metrika, JRSS C
- Benjamin Guedj has served as reviewer for most top-tier machine learning conferences, including AISTATS, ALT, COLT, ICML, NeurIPS
- Hemant Tyagi has reviewed for the following conferences during 2020: International Conference on Learning Representations (ICLR), International Conference on Machine Learning (ICML) and Symposium on Computational Geometry (SoCG)
- Christophe Biernacki has reviewed for the Cap2020 (Conférence sur l'Apprentissage Automatique) and also for several journals (IMAIAI, STCO, LSSP, SAM, GSCS, TNNLS, ESWA, JMIV, JCGS)
10.1.3 Journal
Member of the editorial boards
- Sophie Dabo-Niang is a member of the editorial boards of Revista Colombiana de Estadística and the Journal of Statistical Modeling and Analytics
- Benjamin Guedj is a member of the Editorial Board of reviewers for the Journal of Machine Learning Research (JMLR), since June 2020 and an Associate Editor and member of the Editorial Board for the journal Information and Inference (Oxford), since March 2020
- Christophe Biernacki is an Associate Editor of the North-Western European Journal of Mathematics (NWEJM) and a Guest Editor for the Special Issue on Innovations in Model-Based Clustering and Classification of the journal Advances in Data Analysis and Classification (ADAC)
- Cristian Preda is an Associate Editor for Methodology and Computing in Applied Probability (https://www.springer.com/journal/11009) and the Romanian Journal of Mathematics and Computer Science (http://www.rjm-cs.ro)
Reviewing activities
- Hemant Tyagi has reviewed for the following journals during 2020: Journal of the Royal Statistical Society (JRSS), IEEE Open Journal of Signal Processing and Mathematical Reviews.
- Vincent Vandewalle has reviewed for the following journals during 2020: JCGS, Spatial Statistics, Methodology & Computing in Applied Probability.
10.1.4 Invited talks
Benjamin Guedj has given a number of scientific talks in seminars, including at
- Oxford University (United Kingdom)
- UCL (United Kingdom)
- The Alan Turing Institute (United Kingdom)
- RIKEN (Japan)
Sophie Dabo-Niang has been invited to:
- NEF (Next Einstein Forum) 2020, December 8–10, 2020. Panel on the contribution of mathematical sciences to supporting robust disease prevention and modelling in Africa.
- AIMS South-Africa webinar, November 4, 2020. Statistical modeling of Spatial Big data and Applications.
Hemant Tyagi has given talks at:
- Cafe de Sciences, Inria Lille, January 2020.
- STADIUS seminar, KU Leuven, February 2020.
- Séminaire SAMM : Statistique, Analyse et Modélisation Multidisciplinaire, Université Paris 1, November 2020.
10.1.5 Leadership within the scientific community
Sophie Dabo-Niang is:
- Chair of the Committee for Developing Countries (CDC) of the EMS (European Mathematical Society), 2019–2022
- Member of the executive committee and scientific officer of CIMPA
Guillemette Marot is scientific head of bilille, the bioinformatics platform of Lille. More information about the platform is available at
https://
10.1.6 Scientific expertise
Sophie Dabo-Niang has served as an expert for:
- the L'Oréal Women in Science Awards
- HCERES
10.2 Teaching - Supervision - Juries
10.2.1 Teaching
- Pascal Germain taught
- Master: Introduction aux réseaux de neurones, 15h, M2, Université de Lille, France
- Hemant Tyagi is teaching
- Master: Statistics I, 24h, M1, Centrale Lille, France (Nov. 2020 - 7 Jan. 2021)
- Master: Statistics II, 24h, M1, Centrale Lille, France (11 Jan. 2021 - 18 March 2021)
- Sophie Dabo-Niang is teaching
- Master: Spatial Statistics, 24h, M2, Université de Lille, France
- Master: Advanced Statistics, 24h, M2, Université de Lille, France
- Master: Multivariate Data Analyses, 24h, M2, Université de Lille, France
- Licence: Probability, 24h, L2, Université de Lille, France
- Licence: Multivariate Statistics, 24h, L3, Université de Lille, France
- Guillemette Marot is teaching
- Licence: Biostatistics, 15h, L1, Université de Lille (Faculty of Medicine), France
- Master: Biostatistics, 62h, M1, Université de Lille (Faculty of Medicine), France
- Master: Supervised classification, 34h, M1, Polytech'Lille, France
- Master: Biostatistics, 20h, M1, Université de Lille (Departments of Computer Science and Biology), France
- Master: Statistical analysis of omics data, 22h, M2, Université de Lille (Department of Mathematics), France
- Doctorat: Artificial intelligence and health, 7h, Université de Lille (Faculty of Medicine), France
- Cristian Preda is teaching
- Polytech'Lille engineering school: Linear Models, 48h, France
- Polytech'Lille engineering school: Advanced statistics, 48h, France
- Polytech'Lille engineering school: Biostatistics, 10h, France
- Polytech'Lille engineering school: Supervised clustering, 24h, France
- Christophe Biernacki is teaching
- New Master Data Science: Statistics, 24h, M1, Université de Lille, France
- Benjamin Guedj is teaching
- Advanced machine learning (M2, 6h), University College London, United Kingdom
- Serge Iovleff is teaching
- Licence: Analyse et méthodes numériques, 56h, Université de Lille, DUT Informatique
- Licence: R.O. et aide à la décision, 32h, Université de Lille, DUT Informatique
- Vincent Vandewalle is teaching
- Licence: Probability, 60h, Université de Lille, DUT STID
- Licence: Case study in statistics, 45h, Université de Lille, DUT STID
- Licence: R programming, 45h, Université de Lille, DUT STID
- Licence: Supervised clustering, 32h, Université de Lille, DUT STID
- Licence: Analysis, 24h, Université de Lille, DUT STID
10.2.2 Supervision
PhD defense:
- Arthur Leroy, December 9th 2020, on “Apprentissage de données fonctionnelles par modèles multi-tâches : application à la prédiction de performances sportives”
- Yaroslav Averyanov, December 15th 2020, supervised by Alain Celisse and Cristian Preda on “Designing and analyzing new early stopping rules for saving computational resources”
- Margot Selosse, November 13th 2020, supervised by Christophe Biernacki and Julien Jacques on “Introducing parsimony to analyse complex data with model-based clustering”
PhD in progress:
- Axel Potier, Sales prediction for low-turnover products, November 2020, Christophe Biernacki, Matthieu Marbac, Vincent Vandewalle
- Felix Biggs, Generative models and kernels, University College London (United Kingdom), Sep 2019, Benjamin Guedj
- Antoine Vendeville, Learning on graph to stop the propagation of fake news, University College London (United Kingdom), Sep 2019, Benjamin Guedj
- Luxin Zhang, Domain adaptation from a pre-trained source model – Application to fraud detection in electronic payments, February 2019, Christophe Biernacki, Pascal Germain, Yacine Kessac
- Paul Viallard, Interpreting representation learning through PAC-Bayes theory, September 2019, Amaury Habrard, Emilie Morvant, Pascal Germain
- Dang Khoi Pham, Planning and re-planning of nurses in an oncology department using a multi-objective and interdisciplinary approach, September 2016, Sophie Dabo-Niang
- Solange Doumun, Performance evaluation and contribution to the development of multispectral image analysis strategies for automatic and rapid diagnosis of malaria, December 2018, Sophie Dabo-Niang
- Alaa Ali Ayad, Statistical modeling of large spatial data and its applications in health, September 2018, Sophie Dabo-Niang
- Wilfried Heyse, Prise en compte de la structure temporelle dans l'analyse statistique de données protéomiques à haut débit, October 2019, Christophe Bauters, Guillemette Marot and Vincent Vandewalle
- Margot Selosse, October 2017, Christophe Biernacki and Julien Jacques
- Filippo Antonazzo, October 2019, Christophe Biernacki and Christine Keribin
- Eglantine Karle, November 2020, Hemant Tyagi and Cristian Preda
- Guillaume Braun, January 2020, Christophe Biernacki and Hemant Tyagi
- Rajeev Bopche, September 2020, Christophe Biernacki and Martine Vaxillaire
- Antonin Schrab, September 2020, co-supervised by Arthur Gretton and Benjamin Guedj, University College London (United Kingdom)
- Reuben Adams, September 2020, co-supervised by John Shawe-Taylor and Benjamin Guedj, University College London (United Kingdom)
10.2.3 Juries
- Sophie Dabo-Niang acted as a reviewer and an examiner for PhD theses
- Benjamin Guedj has been the discussion leader for the licentiate thesis of Fredrik Hellström on December 16th, 2020, at Chalmers University (Sweden)
- Benjamin Guedj has been a member of 2 hiring panels for Inria permanent researchers
- Guillemette Marot acted as an examiner for the PhD thesis of Audrey Hulot, Nov 2020 (Université Paris-Saclay) and served on a research engineer (IR) hiring jury, Oct 2020 (Université de Lille)
- Christophe Biernacki acted as a reviewer for four PhD theses and as an examiner for two HDR defenses
- Vincent Vandewalle served on a Maître de Conférences (MC) hiring jury at Université d'Avignon, May 2020
- Cristian Preda acted as a referee for the HDR defense of Christophe Crambes, Université de Montpellier 2, June 30, 2020
- Cristian Preda acted as a referee for the HDR defense of Dan Lascu, November 19, 2020, Universitatea Ovidius, Constanța (Romania)
11 Scientific production
11.1 Major publications
- 1 Article: Simpler PAC-Bayesian Bounds for Hostile Data. Machine Learning, 2018.
- 2 Article: An R Package and C++ library for Latent block models: Theory, usage and applications. Journal of Statistical Software, 2016.
- 3 Article: Unifying Data Units and Models in (Co-)Clustering. Advances in Data Analysis and Classification, 12, 41, May 2018.
- 4 Article: Optimal cross-validation in density estimation with the L2-loss. The Annals of Statistics, 42(5), 2014, 1879–1910.
- 5 Article: Nonparametric prediction in the multivariate spatial context. Journal of Nonparametric Statistics, 28(2), 2016, 428–458.
- 6 Article: The logic of transcriptional regulator recruitment architecture at cis-regulatory modules controlling liver functions. Genome Research, 27(6), June 2017, 985–996.
- 7 In proceedings: Dichotomize and Generalize: PAC-Bayesian Binary Activated Deep Neural Networks. NeurIPS 2019, Vancouver, Canada, December 2019.
- 8 Article: Model-based clustering of Gaussian copulas for mixed data. Communications in Statistics - Theory and Methods, December 2016.
- 9 Article: Parametrizations, fixed and random effects. Journal of Multivariate Analysis, 154, February 2017, 162–176.
- 10 Article: Learning general sparse additive models from point queries in high dimensions. Constructive Approximation, January 2019.
11.2 Publications of the year
International journals
International peer-reviewed conferences
National peer-reviewed Conferences
Conferences without proceedings
Scientific book chapters
Edition (books, proceedings, special issue of a journal)
Doctoral dissertations and habilitation theses
Reports & preprints
Other scientific publications
11.3 Cited publications
- 80 Book chapter: Reliability theory for discrete-time semi-Markov systems. In Semi-Markov Chains and Hidden Semi-Markov Models toward Applications, Springer, 2008, 1–30.
- 81 Article: Estimation in the Mixture of Markov Chains Moving With Different Speeds. Journal of the American Statistical Association, 100(471), 2005, 1046–1053.
- 82 In proceedings: On mixtures of Markov chains. Proceedings of the 30th International Conference on Neural Information Processing Systems, Citeseer, 2016, 3449–3457.
- 83 In proceedings: Retrospective Multiple Change-Point Estimation with Kernels. 2007 IEEE/SP 14th Workshop on Statistical Signal Processing, 2007, 768–772.
- 84 Article: Generalization in Deep Learning. CoRR, abs/1710.05468, 2017. URL: http://arxiv.org/abs/1710.05468
- 85 Article: Detecting multiple change-points in the mean of Gaussian process by model selection. Signal Processing, 85(4), 2005, 717–736. URL: https://www.sciencedirect.com/science/article/pii/S0165168404003196
- 86 In proceedings: A Theoretical Analysis of Contrastive Unsupervised Representation Learning. Proceedings of the 36th International Conference on Machine Learning (ICML 2019), Long Beach, California, USA, June 2019, 5628–5637. URL: http://proceedings.mlr.press/v97/saunshi19a.html
- 87 Article: Exact MLE and asymptotic properties for nonparametric semi-Markov models. Journal of Nonparametric Statistics, 23(3), 2011, 719–739.