Simpler PAC-Bayesian Bounds for Hostile Data

Christophe Biernacki: Researcher, Inria, Senior Researcher, HDR
Benjamin Guedj: Researcher, Inria, Researcher
Hemant Tyagi: Researcher, Inria, Researcher
Cristian Preda: Faculty member, Team leader, Université de Lille, Professor, HDR
Sophie Dabo: Faculty member, Univ Henri Poincaré, Professor
Guillemette Marot: Faculty member, Université de Lille, Associate Professor, HDR
Vincent Vandewalle: Faculty member, Université de Lille, Associate Professor, HDR
Ernesto Javier Araya Valdivia: Postdoc, Inria
Florent Dewez: Postdoc, Inria, until Jan 2021
Valentina Zantedeschi: Postdoc, Inria
Reuben Adams: PhD student, University College London
Filippo Antonazzo: PhD student, Inria
Felix Biggs: PhD student, University College London
Rajeev Bopche: PhD student, Inria, until May 2021
Guillaume Braun: PhD student, Institut national de la statistique et des études économiques
Theophile Cantelobre: PhD student, Inria, from Oct 2021
Maxime Haddouche: PhD student, Université de Lille, from Oct 2021
Wilfried Heyse: PhD student, INSERM
Eglantine Karlé: PhD student, Inria
Etienne Kronert: PhD student, Worldline
Issam Ali Moindjie: PhD student, Inria
Axel Potier: PhD student, Groupe Adeo, CIFRE
Antonin Schrab: PhD student, University College London
Antoine Vendeville: PhD student, University College London
Luxin Zhang: PhD student, Worldline, CIFRE
Maxime Brunin: Technical staff, Inria, Engineer, until Mar 2021
Ismat Yahia Chaib Draa: Technical staff, Société Alicante, Seclin, Engineer, from Sep 2021
Florent Dewez: Technical staff, Inria, Engineer, from Feb 2021 until Apr 2021
Myriam Benbahlouli: Intern, Inria, Apprentice, from Oct 2021
Claire Devisme: Intern, Inria, from Apr 2021 until Aug 2021
Ahoua Jean Marc Ehile: Intern, Inria, from Mar 2021 until Aug 2021
Elias Giraud-Audine: Intern, NC, Jul 2021
Valentin Kilian: Intern, École normale supérieure de Rennes, from May 2021 until Jul 2021
Ilyas Lebleu: Intern, École Normale Supérieure de Paris, from Apr 2021 until Aug 2021
Cecilia Alejandra Rivera Martinez: Intern, École Nationale Supérieure d'Arts et Métiers, from Mar 2021 until Aug 2021
Seydina Mouhamed Sow: Intern, Inria, from Mar 2021 until Aug 2021
Anne Rejl: Assistant, Inria
Alain Celisse: External collaborator, Univ Panthéon Sorbonne, HDR

Hemant Tyagi

We study the problem of $k$-way clustering in signed graphs. Considerable attention in recent years has been devoted to analyzing and modeling signed graphs, where the affinity measure between nodes takes either positive or negative values. Recently, Cucuringu et al. [CDGT 2019] proposed a spectral method, namely SPONGE (Signed Positive over Negative Generalized Eigenproblem), which casts the clustering task as a generalized eigenvalue problem optimizing a suitably defined objective function. This approach is motivated by social balance theory, where the clustering task aims to decompose a given network into disjoint groups, such that individuals within the same group are connected by as many positive edges as possible, while individuals from different groups are mainly connected by negative edges. Through extensive numerical simulations, SPONGE was shown to achieve state-of-the-art empirical performance. On the theoretical front, [CDGT 2019] analyzed SPONGE and the popular Signed Laplacian method under the setting of a Signed Stochastic Block Model (SSBM), for $k = 2$ equal-sized clusters, in the regime where the graph is moderately dense. In this work, we build on the results in [CDGT 2019] on two fronts for the normalized versions of SPONGE and the Signed Laplacian. Firstly, for both algorithms, we extend the theoretical analysis in [CDGT 2019] to the general setting of $k \geq 2$ unequal-sized clusters in the moderately dense regime. Secondly, we introduce regularized versions of both methods to handle sparse graphs, a regime where standard spectral methods underperform, and provide theoretical guarantees under the same SSBM model. To the best of our knowledge, regularized spectral methods have so far not been considered in the setting of clustering signed graphs. We complement our theoretical results with an extensive set of numerical experiments on synthetic data.

This is joint work with Mihai Cucuringu (University of Oxford, United Kingdom), Apoorv Vikram Singh (NYU) and Deborah Sulem (University of Oxford, United Kingdom). It was initiated when Apoorv Vikram Singh visited the MODAL team to work with Hemant Tyagi from October 2019 to January 2020. It has now been accepted for publication in the Journal of Machine Learning Research [12]. A summary of the results was presented at the GCLR (Graphs and more Complex structures for Learning and Reasoning) workshop at AAAI 2021.
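To make the SSBM setting of the abstract above concrete, here is a minimal Python sketch. It is not the authors' code. It assumes one common parametrisation of the model: each pair of nodes is joined with probability p, an edge is positive inside a cluster and negative across clusters, and each sign is then flipped independently with probability eta. The function and parameter names (sample_ssbm, signed_laplacian_form, sizes, p, eta, seed) are hypothetical. The second function evaluates the Signed Laplacian quadratic form, the sum over edges of |a_ij| (x_i - sgn(a_ij) x_j)^2.

```python
import random


def sample_ssbm(sizes, p, eta, seed=0):
    """Sample a signed adjacency matrix under an assumed SSBM parametrisation.

    sizes: list of cluster sizes (possibly unequal).
    p:     probability that an edge exists between any two nodes.
    eta:   probability that the sign of an existing edge is flipped.
    Returns (adj, labels), with adj a symmetric list of lists in {-1, 0, +1}.
    """
    rng = random.Random(seed)
    labels = [c for c, size in enumerate(sizes) for _ in range(size)]
    n = len(labels)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                # Noiseless sign: +1 inside a cluster, -1 across clusters.
                sign = 1 if labels[i] == labels[j] else -1
                if rng.random() < eta:
                    sign = -sign  # noisy edge
                adj[i][j] = adj[j][i] = sign
    return adj, labels


def signed_laplacian_form(adj, x):
    """Quadratic form x^T (D - A) x, with D the diagonal of absolute degrees."""
    n = len(adj)
    total = 0
    for i in range(n):
        for j in range(i + 1, n):
            a = adj[i][j]
            if a != 0:
                sgn = 1 if a > 0 else -1
                total += abs(a) * (x[i] - sgn * x[j]) ** 2
    return total


if __name__ == "__main__":
    adj, labels = sample_ssbm([5, 7], p=0.5, eta=0.1, seed=1)
    planted = [1 if c == 0 else -1 for c in labels]
    print("quadratic form at planted labels:", signed_laplacian_form(adj, planted))
```

For two clusters and eta = 0, the planted ±1 labelling makes every term vanish, so the quadratic form is exactly zero. Each flipped edge contributes 4, which gives a simple way to see why spectral methods look for vectors that make this form small.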

Benjamin Guedj

A learning method is self-certified if it uses all available data to simultaneously learn a predictor and certify its quality with a tight statistical certificate that is valid with high confidence on any random data point. Self-certified learning promises to bring two major advantages to the machine learning community. First, it avoids the need to hold out data for validation and testing, both for certifying the model’s performance and for model selection. This could simplify the machine learning data pipeline, and using all the available data for training could also lead to better representations of the underlying data distribution and ultimately to more accurate models. Second, self-certified learning focuses on delivering performance certificates that are valid with high confidence and are informative of the out-of-sample error, properties that are crucial for appropriately comparing machine learning models as well as setting performance standards for algorithmic governance of these models in the real world. In this paper, we assess how close we are to achieving self-certification in neural networks. In particular, recent work has shown that probabilistic neural networks trained by optimising PAC-Bayes generalisation bounds could bear promise towards achieving self-certified learning, since these can leverage all the available data to learn a posterior and simultaneously certify its risk with tight statistical performance certificates. In this work we empirically compare (on 4 classification datasets) test set generalisation bounds for deterministic predictors and a PAC-Bayes bound for randomised predictors obtained by a self-certified learning strategy (i.e. using all available data for training). We first show that both of these generalisation bounds are not too far from test set errors. We then show that in small data regimes, holding out data for the test set bounds adversely affects generalisation performance, while self-certified strategies based on PAC-Bayes bounds do not suffer from this drawback, which suggests that they might be a suitable choice in this regime. We also find that self-certified probabilistic neural networks learnt by PAC-Bayes inspired objectives lead to certificates that can be surprisingly competitive compared to commonly used test set bounds.

Accepted at the Bayesian Deep Learning workshop at NeurIPS 2021.
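As an illustration of the two kinds of certificate compared in the abstract above, here is a minimal Python sketch. It is not the procedure used in the paper. It assumes a PAC-Bayes-kl bound in the Maurer form, kl(emp_risk || risk) <= (KL(Q||P) + ln(2 sqrt(n) / delta)) / n, inverted numerically by bisection, and a Hoeffding bound for a held-out test set. All function names and the numbers in the demo are hypothetical.

```python
import math


def binary_kl(q, p):
    """KL divergence between Bernoulli(q) and Bernoulli(p)."""
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)  # keep the logarithms finite
    out = 0.0
    if q > 0.0:
        out += q * math.log(q / p)
    if q < 1.0:
        out += (1.0 - q) * math.log((1.0 - q) / (1.0 - p))
    return out


def kl_inverse_upper(q, c, tol=1e-10):
    """Largest p in [q, 1] with binary_kl(q, p) <= c, found by bisection."""
    lo, hi = q, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if binary_kl(q, mid) > c:
            hi = mid
        else:
            lo = mid
    return hi


def pac_bayes_kl_bound(emp_risk, kl_qp, n, delta):
    """Risk certificate for a randomised predictor trained on all n points.

    emp_risk: empirical risk of the posterior Q on the n training points.
    kl_qp:    KL(Q || P) between posterior and a data-free prior.
    """
    c = (kl_qp + math.log(2.0 * math.sqrt(n) / delta)) / n
    return kl_inverse_upper(emp_risk, c)


def hoeffding_test_bound(test_err, n_test, delta):
    """Risk certificate for a deterministic predictor from a held-out test set."""
    return min(1.0, test_err + math.sqrt(math.log(1.0 / delta) / (2.0 * n_test)))


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    print("PAC-Bayes-kl:", pac_bayes_kl_bound(0.05, kl_qp=10.0, n=10000, delta=0.05))
    print("Hoeffding:   ", hoeffding_test_bound(0.05, n_test=2000, delta=0.05))
```

The design point this sketch tries to convey is that the PAC-Bayes certificate is computed from the training sample itself, at the price of the KL(Q||P) term, whereas the test set certificate needs n_test points that are withheld from training.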

Partnerships and cooperations

ANR RHU and FHU

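An RHU (recherche hospitalo-universitaire, university hospital research) is an excellence programme funded by the PIA (Programme d'investissements d'avenir, the French Investments for the Future programme) and selected by the ANR. An FHU (fédération hospitalo-universitaire) is a federative project, and the FHU label is required to apply for an RHU.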

Regional initiatives

PhD in progress:

Eglantine Karlé, November 2020, Hemant Tyagi and Cristian Preda

Guillaume Braun, January 2020, Christophe Biernacki and Hemant Tyagi

Wilfried Heyse, 2019, Guillemette Marot and Vincent Vandewalle

Axel Potier, Sale prediction for low turn-over products, November 2020, Christophe Biernacki, Matthieu Marbac, Vincent Vandewalle

Etienne Kronert, Anomaly detection with reproducing kernels applied to the IT domain, September 2020, Alain Celisse and Cristian Preda.

Issam Ali Moindjie, Functional data analysis for the identification of biomarkers in EEG and MEG in premature infants and fetuses, October 2020, Sophie Dabo, Cristian Preda.

Luxin Zhang, Model Agnostic Domain Adaptation: application to Fraud Detection, February 2019, Christophe Biernacki, Pascal Germain, Yacine Kecassi

Filippo Antonazzo, Frugal Gaussian clustering of huge imbalanced datasets through a bin-marginal approach, October 2019, Christophe Biernacki, Christine Keribin

Clarisse Boinay, Anomaly detection and change point detection in contextual dynamic asynchronous graphs with applications in OT cybersecurity, December 2021, Christophe Biernacki, Cristian Preda

Felix Biggs, PAC-Bayes, deep neural networks and generative models. Started Sept 2019, University College London, supervisors Benjamin Guedj and John Shawe-Taylor.

Antoine Vendeville, Graph models for cybersecurity and information diffusion on networks. Started Sept 2019, University College London, supervisors Benjamin Guedj and Shi Zhou.

Antonin Schrab, PAC-Bayes, generative models and hypothesis testing. Started Sept 2020, University College London, supervisors Benjamin Guedj and Arthur Gretton.

Reuben Adams, PAC-Bayes theory and computational statistics. Started Sept 2020, University College London, supervisors Benjamin Guedj and John Shawe-Taylor.

Maxime Haddouche, PAC-Bayes, representation learning and online learning. Started Sept 2021, University College London, supervisors Benjamin Guedj and John Shawe-Taylor.

Théophile Cantelobre, PAC-Bayes, kernel methods and representation learning. Started Sept 2021, University College London, supervisors Benjamin Guedj, Alessandro Rudi and Carlo Ciliberto.

Mathieu Alain, PAC-Bayes and information theory. Started Sept 2021, University College London, supervisors Benjamin Guedj and Miguel Rodrigues.

Antoine Picard, Expert aggregation and multi-task learning: application to modelling the anaerobic digestion process for optimising organic waste management. Started Sept 2021, CIFRE Suez, supervisors Benjamin Guedj, Roman Moscoviz and Gilles Faÿ.