MODAL - 2022 - Annual activity report

2022

Activity report

Project-Team

MODAL

RNSR: 201020969D

Research center

Inria Center at the University of Lille

In partnership with:

CNRS, Université de Lille

MOdel for Data Analysis and Learning

In collaboration with:

Laboratoire Paul Painlevé (LPP)

Domain

Applied Mathematics, Computation and Simulation

Theme

Optimization, machine learning and statistical methods

Creation of the Project-Team: 1 January 2012

Participants: Benjamin Guedj.

A learning method is self-certified if it uses all available data to simultaneously learn a predictor and certify its quality with a tight statistical certificate that is valid with high confidence on any random data point. Self-certified learning promises two major advantages for the machine learning community. First, it removes the need to hold out data for validation and testing, both for certifying the model's performance and for model selection. This could simplify the machine learning data pipeline; in addition, using all the available data for training could yield better representations of the underlying data distribution and, ultimately, more accurate models. Second, self-certified learning focuses on delivering performance certificates that are valid with high confidence and informative of the out-of-sample error. These properties are crucial for comparing machine learning models appropriately and for setting performance standards for the algorithmic governance of these models in the real world.

In this paper, we assess how close we are to achieving self-certification in neural networks. Recent work has shown that probabilistic neural networks trained by optimising PAC-Bayes generalisation bounds hold promise for self-certified learning, since they can leverage all the available data to learn a posterior and simultaneously certify its risk with tight statistical performance certificates. We empirically compare, on 4 classification datasets, test set generalisation bounds for deterministic predictors with a PAC-Bayes bound for randomised predictors obtained by a self-certified learning strategy (i.e. using all available data for training). We first show that both of these generalisation bounds are not too far from test set errors. We then show that in small data regimes, holding out data for the test set bounds adversely affects generalisation performance, whereas self-certified strategies based on PAC-Bayes bounds do not suffer from this drawback, which suggests that they might be a suitable choice for this regime. We also find that self-certified probabilistic neural networks learnt with PAC-Bayes inspired objectives lead to certificates that can be surprisingly competitive with commonly used test set bounds.

Keywords

Computer Science and Digital Science

Other Research Topics and Application Domains

1 Team members, visitors, external collaborators

Research Scientists

Faculty Members

Post-Doctoral Fellows

PhD Students

Technical Staff

Interns and Apprentices

Administrative Assistant

Visiting Scientists

External Collaborator

2 Overall objectives

2.1 Context

2.2 Goals

3 Research program

3.1 Research axis 1: Unsupervised learning

3.2 Research axis 2: Performance assessment

3.3 Research axis 3: Functional data

3.4 Research axis 4: Applications motivating research

4 Application domains

4.1 Economic world

4.2 Biology and health

5 Social and environmental responsibility

6 Highlights of the year

6.1 Awards

7 New software and platforms

7.1 New software

7.1.1 MixtComp.V4

7.1.2 cfda

7.1.3 ClusPred

7.1.4 visCorVar

7.1.5 metaRNASeq

7.1.6 HDSpatialScan

7.1.7 MLGL

7.2 New platforms

7.2.1 MASSICCC Platform

8 New results

8.1 Axis 1: Co-clustering as a (very) parsimonious clustering

8.2 Axis 1: Relaxing the identically distributed assumption in Gaussian co-clustering for high dimensional data

8.3 Axis 1: Dealing with Missing Data in Model-based Clustering through a MNAR Model

8.4 Axis 1: Predictive Clustering

8.5 Axis 1: A Binned Technique for Scalable Model-based Clustering on Huge Datasets

8.6 Axis 1: Forecasting elections results via the voter model with stubborn nodes

8.7 Axis 1: Sequential Learning of Principal Curves: Summarizing Data Streams on the Fly

8.8 Axis 1&2: An iterative clustering algorithm for the Contextual Stochastic Block Model with optimality guarantees

8.9 Axis 1&2: Seeded graph matching for the correlated Wigner model via the projected power method

8.10 Axis 1&2: Dynamic Ranking and Translation Synchronization

8.11 Axis 1&2: Minimax Optimal Clustering of Bipartite Graphs with a Generalized Power Method

8.12 Axis 2: Asymptotic efficiency of some nonparametric tests for location on hyperspheres

8.13 Axis 2: k-nearest neighbors prediction and classification for spatial data

8.14 Axis 2: Progress in Self-Certified Neural Networks

8.15 Axis 2: MMD Aggregated Two-Sample Test

8.16 Axis 2: Learning PAC-Bayes Priors for Probabilistic Neural Networks

8.17 Axis 2: On Margins and Derandomisation in PAC-Bayes

8.18 Axis 2: Learning Stochastic Majority Votes by Minimizing a PAC-Bayes Generalization Bound

8.19 Axis 2: Differentiable PAC–Bayes Objectives with Partially Aggregated Neural Networks

8.20 Axis 2: PAC-Bayes Unleashed: Generalisation Bounds with Unbounded Losses

8.21 Axis 2: Still No Free Lunches: The Price to Pay for Tighter PAC-Bayes Bounds

8.22 Axis 3: Non-parametric statistical analysis of spatially distributed functional data

8.23 Axis 3: Clustering spatial functional data

8.24 Axis 3: Regression models for spatially distributed autoregressive functional data

8.25 Axis 3: Investigating spatial scan statistics for multivariate functional data

8.26 Axis 3: PLS regression approach for multivariate functional data with different domains

8.27 Axis 3: Group lasso regression for spatially dependent functional data

8.28 Axis 4: Statistical analysis of high-throughput proteomic data

8.29 Axis 4: Multi-layer group Lasso

8.30 Axis 4: Statistical analysis of transcriptomics data

8.31 Axis 4: Statistical analysis of proteomic data with empirical Bayesian approaches

8.32 Axis 4: Multi-omics data analysis

8.33 Axis 4: Interpretable Domain Adaptation for Hidden Subdomain Alignment in the Context of Pre-trained Source Models

8.34 Axis 4: Interpretable Domain Adaptation Using Unsupervised Feature Selection on Pretrained Source Models

8.35 Axis 4: Single cell classification using statistical learning on mechanical properties measured by MEMS tweezers

8.36 Axis 4: Dimensionality Reduction and Bandwidth Selection for Spatial Kernel Discriminant Analysis

8.37 Axis 4: A kernel discriminant analysis for spatially dependent data

9 Bilateral contracts and grants with industry

9.1 Bilateral contracts with industry