2024 Activity Report: Team SIERRA
Inria teams are typically groups of researchers working on the definition of a common project and objectives, with the goal of creating a project-team. Such project-teams may include other partners (universities or research institutions).
RNSR: 201120973D
- Research center: Inria Paris Centre
- In partnership with: Ecole normale supérieure de Paris, CNRS
- Team name: Machine Learning and Optimisation
- In collaboration with: Département d'Informatique de l'Ecole Normale Supérieure
- Domain: Applied Mathematics, Computation and Simulation
- Theme: Optimization, machine learning and statistical methods
Keywords
Computer Science and Digital Science
- A3.4. Machine learning and statistics
- A5.4. Computer vision
- A6.2. Scientific computing, Numerical Analysis & Optimization
- A7.1. Algorithms
- A8.2. Optimization
- A9.2. Machine learning
Other Research Topics and Application Domains
- B9.5.6. Data science
1 Team members, visitors, external collaborators
Research Scientists
- Francis Bach [Team leader, Inria, Senior Researcher, HDR]
- Michael Jordan [Fondation Inria]
- Alessandro Rudi [INRIA, Researcher, until Sep 2024]
- Umut Simsekli [INRIA, Researcher]
- Adrien Taylor [INRIA, Researcher]
- Alexandre d'Aspremont [CNRS, Senior Researcher]
Post-Doctoral Fellows
- Luc Brogat-Motte [Centrale Supelec, Post-Doctoral Fellow]
- Yurong Chen [INRIA, Post-Doctoral Fellow, from Oct 2024]
- Fajwel Fogel [ENS PARIS, Post-Doctoral Fellow]
- Maxime Haddouche [INRIA, Post-Doctoral Fellow, from Sep 2024]
- David Holzmuller [INRIA, Post-Doctoral Fellow]
- Frederik Kunstner [INRIA, Post-Doctoral Fellow, from Oct 2024]
- Anant Raj [INRIA, Post-Doctoral Fellow, until Sep 2024]
- Fabian Schaipp [INRIA, Post-Doctoral Fellow, from Sep 2024]
- Corbinian Schlosser [INRIA, Post-Doctoral Fellow]
- Yang Su [ENS PARIS, Post-Doctoral Fellow]
- Paul Viallard [INRIA, Post-Doctoral Fellow, until Jan 2024]
- Julien Weibel [INRIA, Post-Doctoral Fellow, from Dec 2024]
PhD Students
- Andrea Basteri [INRIA]
- Eugene Berta [INRIA, from May 2024]
- Gaspard Beugnot [INRIA, until Feb 2024]
- Pierre Boudart [INRIA]
- Nabil Boukir [INRIA, from Nov 2024]
- Sacha Braun [INRIA, from Sep 2024]
- Sarah Brood [ENS Paris]
- Arthur Calvi [CNRS]
- Theophile Cantelobre [INRIA, until Sep 2024]
- Aymeric Capitaine [CMAP]
- Benjamin Dupuis [INRIA]
- Bertille Follain [ENS PARIS, until Nov 2024]
- Alexandre Francois [INRIA, from Sep 2024]
- Etienne Gauthier [INRIA, from Sep 2024]
- Mahmoud Hegazy [CMAP, from Oct 2024]
- Marc Lambert [DGA]
- Clément Lezane [UNIV TWENTE]
- Simon Martin [ENS Paris]
- Céline Moucer [ENS PARIS-SACLAY]
- Benjamin Paul-Dubois-Taine [UNIV PARIS SACLAY]
- Antoine Scheid [CMAP]
- Dario Shariatian [INRIA]
- Lawrence Stewart [INRIA]
Interns and Apprentices
- Melih Barsbey [UNIV BOGAZICI, until Aug 2024]
- Daniel Einar Berg Thomsen [INRIA, from Nov 2024]
- Eliot Beyler [ENS PARIS, Intern, from Sep 2024]
- Eliot Beyler [ENS PARIS, Intern, from Apr 2024 until Jul 2024]
- Nabil Boukir [INRIA, Intern, from May 2024 until Sep 2024]
- Sacha Braun [INRIA, Intern, from Apr 2024 until Aug 2024]
- Clementine Chazal [INRIA, Intern, from Feb 2024 until Aug 2024]
- Léo Dana [ENS DE LYON, Intern, from Sep 2024]
- Alexandre Francois [INRIA, Intern, from Apr 2024 until Aug 2024]
- Etienne Gauthier [INRIA, Intern, from Apr 2024 until Aug 2024]
- Aaron Mishkin [INRIA, from Jul 2024]
Administrative Assistants
- Marina Kovacic [INRIA]
- Abigail Palma [INRIA, from Sep 2024]
Visiting Scientists
- Rayna Andreeva [UNIV EDIMBOURG, from Feb 2024 until Jun 2024]
- Ioan-Liviu Aolaritei [UNIV BERKELEY, from Oct 2024]
- Ioan-Liviu Aolaritei [ENS Paris, until Mar 2024]
- Eugene Berta [Inria, until Apr 2024]
- Laurent El Ghaoui [Inria International Chair, from Jun 2024 until Jun 2024]
- Sebastian Gregor Gruber [German Cancer Research Center (DKFZ), from Mar 2024 until May 2024]
- Steffen Grunewalder [UNIV NEWCASTLE, until Feb 2024]
- Antônio Horta Ribeiro [UNIV UPPSALA, from Nov 2024]
- Anne Rubbens [FNRS, from Apr 2024 until May 2024]
- Manu Upadhyaya [Lund University, from Sep 2024]
2 Overall objectives
2.1 Statement
Machine learning is a recent scientific domain, positioned between applied mathematics, statistics and computer science. Its goals are the optimization, control, and modeling of complex systems from examples. It applies to data from numerous engineering and scientific fields (e.g., vision, bioinformatics, neuroscience, audio processing, text processing, economy, finance, etc.), the ultimate goal being to derive general theories and algorithms allowing advances in each of these domains. Machine learning is characterized by the high quality and quantity of the exchanges between theory, algorithms and applications: interesting theoretical problems almost always emerge from applications, while theoretical analysis allows the understanding of why and when popular or successful algorithms do or do not work, and leads to proposing significant improvements.
Our academic positioning is exactly at the intersection between these three aspects—algorithms, theory and applications—and our main research goal is to make the link between theory and algorithms, and between algorithms and high-impact applications in various engineering and scientific fields.
3 Research program
Machine learning has emerged as a scientific domain of its own over the last 30 years, providing a useful abstraction of many problems and allowing exchanges of best practices between data-oriented scientific fields. Its main research areas currently include probabilistic models, supervised learning (including neural networks), unsupervised learning, reinforcement learning, and statistical learning theory. All of these are represented in the SIERRA team, but the main goals of the team are mostly related to supervised learning and optimization, their mutual interactions, and interdisciplinary collaborations. One particularity of the team is its strong focus on optimization (in particular convex optimization, with a growing body of recent work on non-convex problems), leading to contributions in optimization that go beyond the machine learning context. Moreover, we increasingly interact with other disciplines of applied mathematics (e.g., numerical analysis, control) and with economics.
We have divided our research effort into four axes:
- Optimization
- Statistical machine learning
- Machine learning in interaction
- Incentives and machine learning
4 Application domains
Machine learning research can be conducted from two main perspectives: the first one, which has been dominant in the last 30 years, is to design learning algorithms and theories which are as generic as possible, the goal being to make as few assumptions as possible regarding the problems to be solved and to let data speak for themselves. This has led to many interesting methodological developments and successful applications. However, we believe that this strategy has reached its limit for many application domains, such as computer vision, bioinformatics, neuro-imaging, text and audio processing, which leads to the second perspective our team is built on: Research in machine learning theory and algorithms should be driven by interdisciplinary collaborations, so that specific prior knowledge may be properly introduced into the learning process, in particular with the following fields:
- Computer vision: object recognition, object detection, image segmentation, image/video processing, computational photography. In collaboration with the Willow project-team.
- Bioinformatics: cancer diagnosis, protein function prediction, virtual screening.
- Text processing: document collection modeling, language models.
- Audio processing: source separation, speech/music processing.
- Climate science (satellite imaging).
5 Social and environmental responsibility
As a domain within applied mathematics and computer science, machine learning and artificial intelligence may contribute positively to the environment, for example by measuring the effects of climate change or by reducing the carbon footprint of other sciences and activities. But they may also contribute negatively, notably through the ever-increasing size of machine learning models. Within the team, we work on both aspects through our work on climate science and on frugal algorithms.
6 Highlights of the year
The team was re-created!
6.1 Awards
- Adrien Taylor: ERC Starting Grant (project CASPER)
- Francis Bach, Anant Raj: Roberto Tempo Best Paper Award at the IEEE Conference on Decision and Control
7 New software, platforms, open data
7.1 New software
7.1.1 PEPit
- Name: PEPit
- Keyword: Optimisation
- Functional Description:
PEPit is a Python package that simplifies access to worst-case analyses of a large family of first-order optimization methods, possibly involving gradient, projection, proximal, or linear optimization oracles, along with their approximate or Bregman variants. In short, PEPit enables computer-assisted worst-case analyses of first-order optimization methods. The key underlying idea is to cast the problem of performing a worst-case analysis, often referred to as a performance estimation problem (PEP), as a semidefinite program (SDP) which can be solved numerically. To do so, users are only required to write first-order methods nearly as they would have implemented them. The package then takes care of the SDP modelling, and the worst-case analysis is performed numerically via a standard solver (a minimal usage sketch is given below).
This software is primarily based on the works on performance estimation problems by Adrien Taylor. Compared to other scientific software, its maintenance is relatively low cost (we can do it ourselves, together with students involved in using those techniques). We plan to keep updating this software by incorporating recent advances of the community, with the clear long-term goal of making it a tool for teaching first-order optimization.
- URL:
- Contact: Adrien Taylor
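As an illustration, here is a minimal usage sketch in the spirit of the package's documented gradient-descent example (class and method names may differ slightly between PEPit versions): it numerically bounds the worst case of f(x_n) - f(x*) for gradient descent on L-smooth convex functions.

```python
# Minimal PEPit sketch (illustrative; API details may vary across versions).
from PEPit import PEP
from PEPit.functions import SmoothConvexFunction

L, n = 1.0, 5                                        # smoothness constant, iterations

problem = PEP()
f = problem.declare_function(SmoothConvexFunction, L=L)
xs = f.stationary_point()                            # an optimal point x*
fs = f(xs)                                           # optimal value f(x*)

x0 = problem.set_initial_point()
problem.set_initial_condition((x0 - xs) ** 2 <= 1)   # ||x0 - x*||^2 <= 1

x = x0
for _ in range(n):                                   # write the method as you would run it
    x = x - (1 / L) * f.gradient(x)

problem.set_performance_metric(f(x) - fs)            # worst-case value of f(x_n) - f(x*)
tau = problem.solve()                                # numerically solves the underlying SDP
print(tau)                                           # should be close to L / (4 n + 2)
```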
8 New results
8.1 Nonlinear conjugate gradient methods: worst-case convergence rates via computer-assisted analyses
In this work, we propose a computer-assisted approach to the analysis of the worst-case convergence of nonlinear conjugate gradient methods (NCGMs). Those methods are known for their generally good empirical performance for large-scale optimization, while having relatively incomplete analyses. Using our computer-assisted approach, we establish novel complexity bounds for the Polak-Ribière-Polyak (PRP) and the Fletcher-Reeves (FR) NCGMs for smooth strongly convex minimization. In particular, we construct mathematical proofs that establish the first non-asymptotic convergence bound for FR (which is historically the first developed NCGM), and a much improved non-asymptotic convergence bound for PRP. Additionally, we provide simple adversarial examples on which these methods do not perform better than gradient descent with exact line search, leaving very little room for improvement on the same class of problems.
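For reference, a minimal numerical sketch of the two updates analyzed here (PRP and FR), written for a quadratic objective so that the exact line search assumed in the analysis has a closed form; this only illustrates the standard NCGM template, not the computer-assisted proofs themselves.

```python
import numpy as np

def ncgm(A, b, x0, beta_rule="PRP", n_iter=50):
    """Nonlinear conjugate gradient on f(x) = 0.5 x'Ax - b'x (A symmetric PD),
    with exact line search (closed form for quadratics)."""
    x = x0.copy()
    g = A @ x - b                          # gradient of f at x
    d = -g                                 # first direction: steepest descent
    for _ in range(n_iter):
        alpha = -(g @ d) / (d @ (A @ d))   # exact minimizer of f along d
        x = x + alpha * d
        g_new = A @ x - b
        if beta_rule == "PRP":             # Polak-Ribiere-Polyak
            beta = g_new @ (g_new - g) / (g @ g)
        else:                              # Fletcher-Reeves
            beta = (g_new @ g_new) / (g @ g)
        d = -g_new + beta * d              # conjugate direction update
        g = g_new
    return x
```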
8.2 Automated tight Lyapunov analysis for first-order methods
In this work, we present a methodology for establishing the existence of quadratic Lyapunov inequalities for a wide range of first-order methods used to solve convex optimization problems. In particular, we consider (i) classes of optimization problems of finite-sum form with (possibly strongly) convex and possibly smooth functional components, (ii) first-order methods that can be written as a linear system on state-space form in feedback interconnection with the subdifferentials of the functional components of the objective function, and (iii) quadratic Lyapunov inequalities that can be used to draw convergence conclusions. We present a necessary and sufficient condition for the existence of a quadratic Lyapunov inequality within a predefined class of Lyapunov inequalities, which amounts to solving a small-sized semidefinite program. We showcase our methodology on several first-order methods that fit the framework. Most notably, our methodology allows us to significantly extend the region of parameter choices that allow for duality gap convergence in the Chambolle–Pock method when the linear operator is the identity mapping.
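Schematically, the certificates searched for are quadratic Lyapunov inequalities of the following form (notation ours; the Lyapunov functions considered in the paper may additionally involve function values), whose feasibility over a class of problems is checked by a small semidefinite program:

```latex
% Contraction certificate for the iterates \xi_k of the method:
V(\xi_{k+1}) \;\le\; \rho\, V(\xi_k) \quad \text{for all problems in the class},
\qquad V(\xi) = \xi^\top P\, \xi, \quad P \succeq 0,\ \rho \in (0,1].
```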
8.3 Generalization Bounds for Heavy-Tailed SDEs through the Fractional Fokker-Planck Equation
Understanding the generalization properties of heavy-tailed stochastic optimization algorithms has attracted increasing attention over the past years. While illuminating interesting aspects of stochastic optimizers by using heavy-tailed stochastic differential equations as proxies, prior works either provided expected generalization bounds, or introduced non-computable information theoretic terms. Addressing these drawbacks, in this work, we prove high-probability generalization bounds for heavy-tailed SDEs which do not contain any nontrivial information theoretic terms. To achieve this goal, we develop new proof techniques based on estimating the entropy flows associated with the so-called fractional Fokker-Planck equation (a partial differential equation that governs the evolution of the distribution of the corresponding heavy-tailed SDE). In addition to obtaining high-probability bounds, we show that our bounds have a better dependence on the dimension of parameters as compared to prior art. Our results further identify a phase transition phenomenon, which suggests that heavy tails can be either beneficial or harmful depending on the problem structure. We support our theory with experiments conducted in a variety of settings. Further information is in 7.
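For concreteness, a textbook form of the fractional Fokker-Planck equation referred to above, written for an SDE driven by symmetric α-stable Lévy noise (our notation, up to scaling conventions; α = 2 recovers the classical Fokker-Planck equation):

```latex
% Density p_t of  dX_t = b(X_t)\,dt + \sigma\, dL_t^{\alpha}  (L^{\alpha}: alpha-stable Levy process):
\partial_t p_t(x) \;=\; -\,\nabla \cdot \big( b(x)\, p_t(x) \big) \;-\; \sigma^{\alpha}\, (-\Delta)^{\alpha/2} p_t(x),
```

where (-Δ)^{α/2} denotes the fractional Laplacian.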
8.4 Implicit Compressibility of Overparametrized Neural Networks Trained with Heavy-Tailed SGD
Neural network compression has been an increasingly important subject, not only due to its practical relevance, but also due to its theoretical implications, as there is an explicit connection between compressibility and generalization error. Recent studies have shown that the choice of the hyperparameters of stochastic gradient descent (SGD) can have an effect on the compressibility of the learned parameter vector. These results, however, rely on unverifiable assumptions and the resulting theory does not provide a practical guideline due to its implicitness. In this study, we propose a simple modification for SGD, such that the outputs of the algorithm will be provably compressible without making any nontrivial assumptions. We consider a one-hidden-layer neural network trained with SGD, and show that if we inject additive heavy-tailed noise to the iterates at each iteration, for any compression rate, there exists a level of overparametrization such that the output of the algorithm will be compressible with high probability. To achieve this result, we make two main technical contributions: (i) we prove a 'propagation of chaos' result for a class of heavy-tailed stochastic differential equations, and (ii) we derive error estimates for their Euler discretization. Our experiments suggest that the proposed approach not only achieves increased compressibility with various models and datasets, but also leads to robust test performance under pruning, even in more realistic architectures that lie beyond our theoretical setting. Further information is in 24.
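A minimal sketch of the modification described above: plain minibatch SGD with additive symmetric α-stable (heavy-tailed) noise injected into the iterates. The gradient oracle, data, and hyperparameter values are placeholders, not the paper's exact setup.

```python
import numpy as np
from scipy.stats import levy_stable

def sgd_with_heavy_tailed_noise(grad_fn, w0, data, lr=0.01, alpha=1.8,
                                scale=1e-3, batch_size=32, n_iter=1000, seed=0):
    """SGD whose iterates are perturbed by additive symmetric alpha-stable noise
    (alpha < 2 gives heavy tails; alpha = 2 reduces to Gaussian noise)."""
    rng = np.random.default_rng(seed)
    w = w0.copy()
    n = len(data)
    for _ in range(n_iter):
        idx = rng.choice(n, size=batch_size, replace=False)
        g = grad_fn(w, [data[i] for i in idx])         # minibatch gradient
        noise = levy_stable.rvs(alpha, 0.0, scale=scale,
                                size=w.shape, random_state=rng)
        w = w - lr * g + noise                         # noisy SGD step
    return w
```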
8.5 Piecewise deterministic generative models
We introduce a novel class of generative models based on piecewise deterministic Markov processes (PDMPs), a family of non-diffusive stochastic processes consisting of deterministic motion and random jumps at random times. Similarly to diffusions, such Markov processes admit time reversals that turn out to be PDMPs as well. We apply this observation to three PDMPs considered in the literature: the Zig-Zag process, Bouncy Particle Sampler, and Randomised Hamiltonian Monte Carlo. For these three particular instances, we show that the jump rates and kernels of the corresponding time reversals admit explicit expressions depending on some conditional densities of the PDMP under consideration before and after a jump. Based on these results, we propose efficient training procedures to learn these characteristics and consider methods to approximately simulate the reverse process. Finally, we provide bounds in the total variation distance between the data distribution and the resulting distribution of our model in the case where the base distribution is the standard d-dimensional Gaussian distribution. Promising numerical simulations support further investigations into this class of models. Further information is in 3.
8.6 Topological Generalization Bounds for Discrete-Time Stochastic Optimization Algorithms
We present a novel set of rigorous and computationally efficient topology-based complexity notions that exhibit a strong correlation with the generalization gap in modern deep neural networks (DNNs). DNNs show remarkable generalization properties, yet the source of these capabilities remains elusive, defying the established statistical learning theory. Recent studies have revealed that properties of training trajectories can be indicative of generalization. Building on this insight, state-of-the-art methods have leveraged the topology of these trajectories, particularly their fractal dimension, to quantify generalization. Most existing works compute this quantity by assuming continuous- or infinite-time training dynamics, complicating the development of practical estimators capable of accurately predicting generalization without access to test data. In this paper, we respect the discrete-time nature of training trajectories and investigate the underlying topological quantities that can be amenable to topological data analysis tools. This leads to a new family of reliable topological complexity measures that provably bound the generalization error, eliminating the need for restrictive geometric assumptions. These measures are computationally friendly, enabling us to propose simple yet effective algorithms for computing generalization indices. Moreover, our flexible framework can be extended to different domains, tasks, and architectures. Our experimental results demonstrate that our new complexity measures correlate highly with generalization error in industry-standards architectures such as transformers and deep graph networks. Our approach consistently outperforms existing topological bounds across a wide range of datasets, models, and optimizers, highlighting the practical relevance and effectiveness of our complexity measures. Further information is in 1.
8.7 Heavy-Tail Phenomenon in Decentralized SGD
Recent theoretical studies have shown that heavy-tails can emerge in stochastic optimization due to 'multiplicative noise', even under surprisingly simple settings, such as linear regression with Gaussian data. While these studies have uncovered several interesting phenomena, they consider conventional stochastic optimization problems, which exclude decentralized settings that naturally arise in modern machine learning applications. In this paper, we study the emergence of heavy-tails in decentralized stochastic gradient descent (DE-SGD), and investigate the effect of decentralization on the tail behavior. We first show that, when the loss function at each computational node is twice continuously differentiable and strongly convex outside a compact region, the law of the DE-SGD iterates converges to a distribution with polynomially decaying (heavy) tails. To have a more explicit control on the tail exponent, we then consider the case where the loss at each node is a quadratic, and show that the tail-index can be estimated as a function of the step-size, batch-size, and the topological properties of the network of the computational nodes. Then, we provide theoretical and empirical results showing that DE-SGD has heavier tails than centralized SGD. We also compare DE-SGD to disconnected SGD where nodes distribute the data but do not communicate. Our theory uncovers an interesting interplay between the tails and the network structure: we identify two regimes of parameters (stepsize and network size), where DE-SGD can have lighter or heavier tails than disconnected SGD depending on the regime. Finally, to support our theoretical results, we provide numerical experiments conducted on both synthetic data and neural networks. Further information is in 8.
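The decentralized iteration studied here takes the usual gossip form (notation ours), where W is the doubly stochastic mixing matrix of the communication network and each node i holds its own loss f_i:

```latex
% Node i averages with its neighbors, then takes a local stochastic gradient step:
x_i^{(k+1)} \;=\; \sum_{j=1}^{N} W_{ij}\, x_j^{(k)} \;-\; \eta\, \widetilde{\nabla} f_i\big(x_i^{(k)}\big).
```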
8.8 Iteratively Reweighted Least Squares for Phase Unwrapping
The 2D phase unwrapping problem seeks to recover a phase image from its observation modulo 2π, and is a crucial step in a variety of imaging applications. In particular, it is one of the most time-consuming steps in the interferometric synthetic aperture radar (InSAR) pipeline. In this work we tackle the L1-norm phase unwrapping problem. In optimization terms, this is a simple sparsity-inducing problem, albeit in very large dimension. To solve this high-dimensional problem, we iteratively solve a series of numerically simpler weighted least squares problems, which are themselves solved using a preconditioned conjugate gradient method. Our algorithm guarantees a sublinear rate of convergence in function values, is simple to implement and can easily be ported to GPUs, where it significantly outperforms state-of-the-art phase unwrapping methods.
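A schematic version of the reweighted scheme described above, for a generic L1 regression problem min_x ||Ax - b||_1 (in the InSAR application, A and b encode the wrapped phase-gradient model; they are placeholders here, and the preconditioner is omitted):

```python
import numpy as np
from scipy.sparse.linalg import cg, LinearOperator

def irls_l1(A, b, n_outer=20, eps=1e-6):
    """Iteratively reweighted least squares for min_x ||A x - b||_1.
    A: (m, n) sparse matrix or array, b: (m,) observations."""
    m, n = A.shape
    x = np.zeros(n)
    for _ in range(n_outer):
        r = A @ x - b
        w = 1.0 / np.maximum(np.abs(r), eps)   # IRLS weights for the L1 loss
        # Weighted least-squares subproblem: A^T W A x = A^T W b, solved by CG.
        H = LinearOperator((n, n), matvec=lambda v: A.T @ (w * (A @ v)))
        x, _ = cg(H, A.T @ (w * b), x0=x)
    return x
```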
8.9 Frank-Wolfe meets Shapley-Folkman: a systematic approach for solving nonconvex separable problems with linear constraints
We consider separable nonconvex optimization problems under affine constraints. For these problems, the Shapley-Folkman theorem provides an upper bound on the duality gap as a function of the nonconvexity of the objective functions, but does not provide a systematic way to construct primal solutions satisfying that bound. In this work, we develop a two-stage approach to do so. The first stage approximates the optimal dual value with a large set of primal feasible solutions. In the second stage, this set is trimmed down to a primal solution by computing (approximate) Caratheodory representations. The main computational requirement of our method is tractability of the Fenchel conjugates of the component functions and their (sub)gradients. When the function domains are convex, the method recovers the classical duality gap bounds obtained via Shapley-Folkman. When the function domains are nonconvex, the method also recovers classical duality gap bounds from the literature, based on a more general notion of nonconvexity.
8.10 Open-Canopy: Towards Very High Resolution Forest Monitoring
Estimating canopy height and its changes at meter resolution from satellite imagery is a significant challenge in computer vision with critical environmental applications. However, the lack of open-access datasets at this resolution hinders the reproducibility and evaluation of models. We introduce Open-Canopy, the first open-access, country-scale benchmark for very high-resolution (1.5 m) canopy height estimation, covering over 87,000 km2 across France with 1.5 m resolution satellite imagery and aerial LiDAR data. Additionally, we present Open-Canopy-Δ, a benchmark for canopy height change detection between images from different years at tree level, a challenging task for current computer vision models. We evaluate state-of-the-art architectures on these benchmarks, highlighting significant challenges and opportunities for improvement. Our datasets and code are publicly available.
8.11 An Oblivious Stochastic Composite Optimization Algorithm for Eigenvalue Optimization Problems
Clément Lezane, Cristóbal Guzmán, Alexandre d'Aspremont.
In this work, we revisit the problem of solving large-scale semidefinite programs using randomized first-order methods and stochastic smoothing. We introduce two oblivious stochastic mirror descent algorithms based on a complementary composite setting. One algorithm is designed for non-smooth objectives, while an accelerated version is tailored for smooth objectives. Remarkably, both algorithms work without prior knowledge of the Lipschitz constant or smoothness of the objective function. We prove convergence rates for the non-smooth case with bounded oracles and for the L-smooth case with a feasible set of bounded diameter D, expressed in terms of the starting distance to an optimal solution and the stochastic oracle variance; such rates had previously only been obtained by assuming prior knowledge of the Lipschitz constant or of the starting distance to an optimal solution. We further show how to extend our framework to relative scale and demonstrate the efficiency and robustness of our methods on large-scale semidefinite programs.
8.12 Approximate Heavy Tails in Offline (Multi-Pass) Stochastic Gradient Descent
A recent line of empirical studies has demonstrated that SGD might exhibit a heavy-tailed behavior in practical settings, and the heaviness of the tails might correlate with the overall performance. In this work, we investigate the emergence of such heavy tails. Previous works on this problem only considered, up to our knowledge, online (also called single-pass) SGD, in which the emergence of heavy tails in theoretical findings is contingent upon access to an infinite amount of data. Hence, the underlying mechanism generating the reported heavy-tailed behavior in practical settings, where the amount of training data is finite, is still not well-understood. Our contribution aims to fill this gap. In particular, we show that the stationary distribution of offline (also called multi-pass) SGD exhibits ‘approximate’ power-law tails and the approximation error is controlled by how fast the empirical distribution of the training data converges to the true underlying data distribution in the Wasserstein metric. Our main takeaway is that, as the number of data points increases, offline SGD will behave increasingly ‘power-law-like’. To achieve this result, we first prove nonasymptotic Wasserstein convergence bounds for offline SGD to online SGD as the number of data points increases, which can be interesting on their own. Finally, we illustrate our theory on various experiments conducted on synthetic data and neural networks. Further details are in 16.
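In the notation used informally above, the two recursions being compared are (our notation) online SGD, which draws a fresh sample z_k from the data distribution μ at every step, and offline (multi-pass) SGD, which resamples from the n training points:

```latex
\theta_{k+1} = \theta_k - \eta\, \nabla \ell(\theta_k, z_k), \quad z_k \sim \mu,
\qquad\qquad
\hat{\theta}_{k+1} = \hat{\theta}_k - \eta\, \nabla \ell(\hat{\theta}_k, z_{i_k}), \quad i_k \sim \mathrm{Unif}\{1,\dots,n\}.
```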
8.13 Uniform-in-Time Wasserstein Stability Bounds for (Noisy) Stochastic Gradient Descent
Algorithmic stability is an important notion that has proven powerful for deriving generalization bounds for practical algorithms. The last decade has witnessed an increasing number of stability bounds for different algorithms applied on different classes of loss functions. While these bounds have illuminated various properties of optimization algorithms, the analysis of each case typically required a different proof technique with significantly different mathematical tools. In this study, we make a novel connection between learning theory and applied probability and introduce a unified guideline for proving Wasserstein stability bounds for stochastic optimization algorithms. We illustrate our approach on stochastic gradient descent (SGD) and we obtain time-uniform stability bounds (i.e., the bound does not increase with the number of iterations) for strongly convex losses and nonconvex losses with additive noise, where we recover similar results to the prior art or extend them to more general cases by using a single proof technique. Our approach is flexible and can be generalizable to other popular optimizers, as it mainly requires developing Lyapunov functions, which are often readily available in the literature. It also illustrates that ergodicity is an important component for obtaining time-uniform bounds – which might not be achieved for convex or non-convex losses unless additional noise is injected to the iterates. Finally, we slightly stretch our analysis technique and prove time-uniform bounds for SGD under convex and non-convex losses (without additional additive noise), which, to our knowledge, is novel. Further information is in 26.
8.14 Learning via Wasserstein-Based High Probability Generalisation Bounds
Minimising upper bounds on the population risk or the generalisation gap has been widely used in structural risk minimisation (SRM) – this is in particular at the core of PAC-Bayesian learning. Despite its successes and unfailing surge of interest in recent years, a limitation of the PAC-Bayesian framework is that most bounds involve a Kullback-Leibler (KL) divergence term (or its variations), which might exhibit erratic behavior and fail to capture the underlying geometric structure of the learning problem – hence restricting its use in practical applications. As a remedy, recent studies have attempted to replace the KL divergence in the PAC-Bayesian bounds with the Wasserstein distance. Even though these bounds alleviated the aforementioned issues to a certain extent, they either hold in expectation, are for bounded losses, or are nontrivial to minimize in an SRM framework. In this work, we contribute to this line of research and prove novel Wasserstein distance-based PAC-Bayesian generalisation bounds for both batch learning with independent and identically distributed (i.i.d.) data, and online learning with potentially non-i.i.d. data. Contrary to previous art, our bounds are stronger in the sense that (i) they hold with high probability, (ii) they apply to unbounded (potentially heavy-tailed) losses, and (iii) they lead to optimizable training objectives that can be used in SRM. As a result we derive novel Wasserstein-based PAC-Bayesian learning algorithms and we illustrate their empirical advantage on a variety of experiments. More information can be found in 23.
8.15 Generalization Guarantees via Algorithm-dependent Rademacher Complexity
Algorithm- and data-dependent generalization bounds are required to explain the generalization behavior of modern machine learning algorithms. In this context, there exists information theoretic generalization bounds that involve (various forms of) mutual information, as well as bounds based on hypothesis set stability. We propose a conceptually related, but technically distinct complexity measure to control generalization error, which is the empirical Rademacher complexity of an algorithm- and data-dependent hypothesis class. Combining standard properties of Rademacher complexity with the convenient structure of this class, we are able to (i) obtain novel bounds based on the finite fractal dimension, which (a) extend previous fractal dimension-type bounds from continuous to finite hypothesis classes, and (b) avoid a mutual information term that was required in prior work; (ii) we greatly simplify the proof of a recent dimension-independent generalization bound for stochastic gradient descent; and (iii) we easily recover results for VC classes and compression schemes, similar to approaches based on conditional mutual information. More information can be found in 21.
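The complexity measure in question is the standard empirical Rademacher complexity, evaluated on a hypothesis class H_S that is itself allowed to depend on the algorithm and on the sample S = (z_1, ..., z_n) (our notation):

```latex
\widehat{\mathcal{R}}_S(\mathcal{H}_S) \;=\; \mathbb{E}_{\sigma}\!\left[\, \sup_{h \in \mathcal{H}_S}\; \frac{1}{n} \sum_{i=1}^{n} \sigma_i\, \ell(h, z_i) \right],
\qquad \sigma_1,\dots,\sigma_n \ \text{i.i.d. uniform on } \{-1,+1\}.
```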
8.16 Generalization Bounds using Data-Dependent Fractal Dimensions
Providing generalization guarantees for modern neural networks has been a crucial task in statistical learning. Recently, several studies have attempted to analyze the generalization error in such settings by using tools from fractal geometry. While these works have successfully introduced new mathematical tools to apprehend generalization, they heavily rely on a Lipschitz continuity assumption, which in general does not hold for neural networks and might make the bounds vacuous. In this work, we address this issue and prove fractal geometry-based generalization bounds without requiring any Lipschitz assumption. To achieve this goal, we build up on a classical covering argument in learning theory and introduce a data-dependent fractal dimension. Despite introducing a significant amount of technical complications, this new notion lets us control the generalization error (over either fixed or random hypothesis spaces) along with certain mutual information (MI) terms. To provide a clearer interpretation to the newly introduced MI terms, as a next step, we introduce a notion of ‘geometric stability’ and link our bounds to the prior art. Finally, we make a rigorous connection between the proposed data-dependent dimension and topological data analysis tools, which then enables us to compute the dimension in a numerically efficient way. We support our theory with experiments conducted on various settings. More information can be found in 6.
8.17 Algorithmic Stability of Heavy-Tailed SGD with General Loss Functions
Heavy-tail phenomena in stochastic gradient descent (SGD) have been reported in several empirical studies. Experimental evidence in previous works suggests a strong interplay between the heaviness of the tails and generalization behavior of SGD. To address this empirical phenomena theoretically, several works have made strong topological and statistical assumptions to link the generalization error to heavy tails. Very recently, new generalization bounds have been proven, indicating a non-monotonic relationship between the generalization error and heavy tails, which is more pertinent to the reported empirical observations. While these bounds do not require additional topological assumptions given that SGD can be modeled using a heavy-tailed stochastic differential equation (SDE), they can only apply to simple quadratic problems. In this work, we build on this line of research and develop generalization bounds for a more general class of objective functions, which includes non-convex functions as well. Our approach is based on developing Wasserstein stability bounds for heavy-tailed SDEs and their discretizations, which we then convert to generalization bounds. Our results do not require any nontrivial assumptions; yet, they shed more light to the empirical observations, thanks to the generality of the loss functions. More information can be found in 19.
8.18 Algorithmic Stability of Heavy-Tailed Stochastic Gradient Descent on Least Squares
Heavy-tail phenomena in stochastic gradient descent (SGD) have been reported in several empirical studies. Experimental evidence in previous works suggests a strong interplay between the heaviness of the tails and generalization behavior of SGD. To address this empirical phenomena theoretically, several works have made strong topological and statistical assumptions to link the generalization error to heavy tails. Very recently, new generalization bounds have been proven, indicating a non-monotonic relationship between the generalization error and heavy tails, which is more pertinent to the reported empirical observations. While these bounds do not require additional topological assumptions given that SGD can be modeled using a heavy-tailed stochastic differential equation (SDE), they can only apply to simple quadratic problems. In this work, we build on this line of research and develop generalization bounds for a more general class of objective functions, which includes non-convex functions as well. Our approach is based on developing Wasserstein stability bounds for heavy-tailed SDEs and their discretizations, which we then convert to generalization bounds. Our results do not require any nontrivial assumptions; yet, they shed more light to the empirical observations, thanks to the generality of the loss functions. More information can be found in 17.
8.19 Cyclic and Randomized Stepsizes Invoke Heavier Tails in SGD than Constant Stepsize
Cyclic and randomized stepsizes are widely used in the deep learning practice and can often outperform standard stepsize choices such as constant stepsize in SGD. Despite their empirical success, not much is currently known about when and why they can theoretically improve the generalization performance. We consider a general class of Markovian stepsizes for learning, which contain i.i.d. random stepsize, cyclic stepsize as well as the constant stepsize as special cases, and motivated by the literature which shows that heaviness of the tails (measured by the so-called “tail-index”) in the SGD iterates is correlated with generalization, we study tail-index and provide a number of theoretical results that demonstrate how the tail-index varies on the stepsize scheduling. Our results bring a new understanding of the benefits of cyclic and randomized stepsizes compared to constant stepsize in terms of the tail behavior. We illustrate our theory on linear regression experiments and show through deep learning experiments that Markovian stepsizes can achieve even a heavier tail and be a viable alternative to cyclic and i.i.d. randomized stepsize rules. More information can be found in 9.
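For illustration, a sketch of the three stepsize schedules compared above (constant, cyclic, i.i.d. random), each a special case of the Markovian schedules analyzed in the paper; the grid of values and the SGD loop are placeholders.

```python
import itertools
import numpy as np

def constant_stepsizes(eta=0.1):
    while True:
        yield eta

def cyclic_stepsizes(etas=(0.02, 0.05, 0.1, 0.05)):
    yield from itertools.cycle(etas)            # deterministic cycle over the grid

def iid_random_stepsizes(etas=(0.02, 0.05, 0.1), seed=0):
    rng = np.random.default_rng(seed)
    while True:
        yield rng.choice(etas)                  # i.i.d. uniform over the grid

def sgd(grad_fn, w0, stepsizes, n_iter=1000):
    w = w0.copy()
    for _, eta in zip(range(n_iter), stepsizes):
        w = w - eta * grad_fn(w)                # stochastic gradient step
    return w
```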
8.21 Vision Transformers, a new approach for high-resolution and large-scale mapping of canopy heights
Accurate and timely monitoring of forest canopy heights is critical for assessing forest dynamics, biodiversity, carbon sequestration as well as forest degradation and deforestation. Recent advances in deep learning techniques, coupled with the vast amount of spaceborne remote sensing data offer an unprecedented opportunity to map canopy height at high spatial and temporal resolutions. Current techniques for wall-to-wall canopy height mapping correlate remotely sensed 2D information from optical and radar sensors to the vertical structure of trees using LiDAR measurements. While studies using deep learning algorithms have shown promising performances for the accurate mapping of canopy heights, they have limitations due to the type of architectures and loss functions employed. Moreover, mapping canopy heights over tropical forests remains poorly studied, and the accurate height estimation of tall canopies is a challenge due to signal saturation from optical and radar sensors, persistent cloud covers and sometimes the limited penetration capabilities of LiDARs. Here, we map heights at 10 m resolution across the diverse landscape of Ghana with a new vision transformer (ViT) model optimized concurrently with a classification (discrete) and a regression (continuous) loss function. This model achieves better accuracy than previously used convolution-based approaches (ConvNets) optimized with only a continuous loss function. The ViT model results show that our proposed discrete/continuous loss significantly increases the sensitivity for very tall trees (i.e., > 35 m), for which other approaches show saturation effects. The height maps generated by the ViT also have a better ground sampling distance and better sensitivity to sparse vegetation in comparison to a convolutional model. Our ViT model has an RMSE of 3.12 m against a reference dataset, while the ConvNet model has an RMSE of 4.3 m.
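A schematic PyTorch version of the joint discrete/continuous objective described above; the binning, loss weighting, and prediction heads are illustrative placeholders, not the exact configuration used in the paper.

```python
import torch
import torch.nn.functional as F

def hybrid_canopy_loss(pred_logits, pred_height, target_height,
                       bin_edges, w_cls=1.0, w_reg=1.0):
    """Combined discrete (classification over height bins) and continuous
    (regression) loss for canopy-height prediction.
    pred_logits: (B, n_bins, H, W) with n_bins = len(bin_edges) + 1,
    pred_height, target_height: (B, H, W), bin_edges: 1-D tensor of boundaries."""
    target_bin = torch.bucketize(target_height, bin_edges)   # discretized heights
    cls_loss = F.cross_entropy(pred_logits, target_bin)
    reg_loss = F.smooth_l1_loss(pred_height, target_height)  # robust regression term
    return w_cls * cls_loss + w_reg * reg_loss
```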
8.22 Non-Parametric Learning of Stochastic Differential Equations with Fast Rates of Convergence.
In 27, we propose a novel non-parametric learning paradigm for the identification of drift and diffusion coefficients of multi-dimensional non-linear stochastic differential equations, which relies upon discrete-time observations of the state. The key idea essentially consists of fitting an RKHS-based approximation of the corresponding Fokker-Planck equation to such observations, yielding theoretical estimates of non-asymptotic learning rates which, unlike previous works, become increasingly tighter when the regularity of the unknown drift and diffusion coefficients becomes higher. Since our method is kernel-based, offline pre-processing may be profitably leveraged to enable an efficient numerical implementation, offering an excellent balance between precision and computational complexity.
8.23 Structured Prediction in Online Learning.
In 69, we study a theoretical and algorithmic framework for structured prediction in the online learning setting. The problem of structured prediction, i.e., estimating a function whose output space lacks a vectorial structure, is well studied in the literature of supervised statistical learning. We show that our algorithm is a generalisation of optimal algorithms from the supervised learning setting, and achieves the same excess risk upper bound also when data are not i.i.d. Moreover, we consider a second algorithm designed especially for non-stationary data distributions, including adversarial data. We bound its stochastic regret as a function of the variation of the data distributions.
8.24 Closed-form Filtering for Non-linear Systems
Sequential Bayesian Filtering aims to estimate the current state distribution of a Hidden Markov Model, given the past observations. The problem is well known to be intractable for most application domains, except in notable cases such as the tabular setting or linear dynamical systems with Gaussian noise. In 71, we propose a new class of filters based on Gaussian PSD Models, which offer several advantages in terms of density approximation and computational efficiency. We show that filtering can be efficiently performed in closed form when transitions and observations are Gaussian PSD Models. When the transition and observations are approximated by Gaussian PSD Models, we show that our proposed estimator enjoys strong theoretical guarantees, with estimation error that depends on the quality of the approximation and is adaptive to the regularity of the transition probabilities. In particular, we identify regimes in which our proposed filter attains a prescribed total-variation error with memory and computational complexity (including the offline learning step) that compare favorably with the complexity of sampling methods such as particle filtering.
8.26 Naive Feature Selection: A Nearly Tight Convex Relaxation for Sparse Naive Bayes
Due to its linear complexity, naive Bayes classification remains an attractive supervised learning method, especially in very large-scale settings. We propose a sparse version of naive Bayes, which can be used for feature selection. This leads to a combinatorial maximum-likelihood problem, for which we provide an exact solution in the case of binary data, or a bound in the multinomial case. We prove that our bound becomes tight as the marginal contribution of additional features decreases. Both binary and multinomial sparse models are solvable in time almost linear in problem size, representing a very small extra relative cost compared to the classical naive Bayes. Numerical experiments on text data show that the naive Bayes feature selection method is as statistically effective as state-of-the-art feature selection methods such as recursive feature elimination, l1-penalized logistic regression and LASSO, while being orders of magnitude faster. For a large data set with 1.6 million training points and about 12 million features, and with a non-optimized CPU implementation, our sparse naive Bayes model can be trained in less than 15 seconds.
8.27 Physics-informed kernel learning
Physics-informed machine learning typically integrates physical priors into the learning process by minimizing a loss function that includes both a data-driven term and a partial differential equation (PDE) regularization. In 49, building on the formulation of the problem as a kernel regression task, we use Fourier methods to approximate the associated kernel, and propose a tractable estimator that minimizes the physics-informed risk function. We refer to this approach as physics-informed kernel learning (PIKL). This framework provides theoretical guarantees, enabling the quantification of the physical prior’s impact on convergence speed. We demonstrate the numerical performance of the PIKL estimator through simulations, both in the context of hybrid modeling and in solving PDEs. In particular, we show that PIKL can outperform physics-informed neural networks in terms of both accuracy and computation time. Additionally, we identify cases where PIKL surpasses traditional PDE solvers, particularly in scenarios with noisy boundary conditions.
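Schematically, the physics-informed risk minimized by such estimators combines a data-fit term with a PDE-residual penalty (generic form, our notation; D denotes the differential operator of the PDE D(f) = 0 on the domain Ω), with the minimization carried out over an RKHS whose kernel PIKL approximates by Fourier methods:

```latex
\widehat{R}_{\lambda}(f) \;=\; \frac{1}{n} \sum_{i=1}^{n} \big( f(x_i) - y_i \big)^2
\;+\; \lambda \int_{\Omega} \big( \mathcal{D}(f)(x) \big)^2 \, dx .
```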
8.28 Variational Dynamic Programming for Stochastic Optimal Control
In 53, we consider the problem of stochastic optimal control, where the state-feedback control policies take the form of a probability distribution and where a penalty on the entropy is added. By viewing the cost function as a Kullback-Leibler (KL) divergence between two joint distributions, we bring the tools from variational inference to bear on our optimal control problem. This allows for deriving a dynamic programming principle, where the value function is defined as a KL divergence again. We then resort to Gaussian distributions to approximate the control policies and apply the theory to control affine nonlinear systems with quadratic costs. This results in closed-form recursive updates, which generalize LQR control and the backward Riccati equation. We illustrate this novel method on the simple problem of stabilizing an inverted pendulum.
8.29 Variational Inference on the Boolean Hypercube with the Quantum Entropy
In 68, we derive variational inference upper bounds on the log-partition function of pairwise Markov random fields on the Boolean hypercube, based on quantum relaxations of the Kullback-Leibler divergence. We then propose an efficient algorithm to compute these bounds based on primal-dual optimization. An improvement of these bounds through the use of “hierarchies,” similar to sum-of-squares (SoS) hierarchies, is proposed, and we present a greedy algorithm to select among these relaxations. We carry out extensive numerical experiments and compare with state-of-the-art methods for this inference problem.
8.30 Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference
Large Language Models (LLMs) have unlocked new capabilities and applications; however, evaluating the alignment with human preferences still poses significant challenges. To address this issue, we introduce Chatbot Arena, an open platform for evaluating LLMs based on human preferences. Our methodology employs a pairwise comparison approach and leverages input from a diverse user base through crowdsourcing. The platform has been operational for several months, amassing over 240K votes. This paper describes the platform, analyzes the data we have collected so far, and explains the tried-and-true statistical methods we are using for efficient and accurate evaluation and ranking of models. We confirm that the crowdsourced questions are sufficiently diverse and discriminating and that the crowdsourced human votes are in good agreement with those of expert raters. These analyses collectively establish a robust foundation for the credibility of Chatbot Arena. Because of its unique value and openness, Chatbot Arena has emerged as one of the most referenced LLM leaderboards, widely cited by leading LLM developers and companies. Our demo is publicly available at https://chat.lmsys.org.
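Rankings from such pairwise human votes are typically obtained by fitting a Bradley-Terry model to the win counts; the sketch below shows that estimation step in a minimal form (illustrative only, not the platform's exact pipeline, and unregularized, so it assumes no model wins or loses all of its comparisons).

```python
import numpy as np
from scipy.optimize import minimize

def bradley_terry_scores(wins):
    """Fit Bradley-Terry strengths theta from a pairwise win-count matrix,
    where wins[i, j] is the number of times model i beat model j."""
    n = wins.shape[0]

    def neg_log_likelihood(theta):
        diff = theta[:, None] - theta[None, :]   # theta_i - theta_j
        log_p = -np.logaddexp(0.0, -diff)        # log P(i beats j) = log sigmoid(diff)
        return -np.sum(wins * log_p)

    res = minimize(neg_log_likelihood, np.zeros(n), method="L-BFGS-B")
    return res.x - res.x.mean()                  # scores are defined up to a shift
```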
8.31 Fair Allocation in Dynamic Mechanism Design
We consider a dynamic mechanism design problem where an auctioneer sells an indivisible good to groups of buyers in every round, for a total of T rounds. The auctioneer aims to maximize their discounted overall revenue while adhering to a fairness constraint that guarantees a minimum average allocation for each group. We begin by studying the static case (T=1) and establish that the optimal mechanism involves two types of subsidization: one that increases the overall probability of allocation to all buyers, and another that favors the groups which otherwise have a lower probability of winning the item. We then extend our results to the dynamic case by characterizing a set of recursive functions that determine the optimal allocation and payments in each round. Notably, our results establish that in the dynamic case, the seller, on the one hand, commits to a participation bonus to incentivize truth-telling, and on the other hand, charges an entry fee for every round. Moreover, the optimal allocation once more involves subsidization, whose extent depends on the difference in future utilities for both the seller and buyers when allocating the item to one group versus the others. Finally, we present an approximation scheme to solve the recursive equations and determine an approximately optimal and fair allocation efficiently.
9 Bilateral contracts and grants with industry
9.1 Bilateral grants with industry
- Chair “Marchés et Apprentissage” (Markets and Learning), held by Michael Jordan within the Fondation Inria and launched in July 2024, in partnership with Air Liquide, BNP Paribas Asset Management Europe, EDF, Orange, and SNCF.
- Francis Bach: Co-advised PhD student with Meta.
10 Partnerships and cooperations
10.1 International initiatives
10.1.1 Associate Teams in the framework of an Inria International Lab or in the framework of an Inria International Program
- FOAM: First-Order Accelerated Methods for Machine Learning
- Duration: 2020 - 2024
- Coordinator: Cristobal Guzman (crguzmanp@mat.uc.cl)
- Partners: Pontificia Universidad Católica de Chile, Santiago (Chile)
- Inria contact: Alexandre d'Aspremont
- Summary:
Our main interest is to investigate novel and improved convergence results for first-order iterative methods for saddle points, variational inequalities and fixed points, under the lens of PEP. Our interest in improving first-order methods is also deeply related to applications in machine learning. Particularly in sparsity-oriented inverse problems, optimization methods are the workhorse behind state-of-the-art results. On some of these problems, a set of new hypotheses and theoretical results shows improved complexity bounds for problems with good recovery guarantees, and we plan to extend these new performance bounds to the variational framework.
10.2 International research visitors
10.2.1 Visits of international scientists
Inria International Chair
Participants: Laurent El Ghaoui.
Other international visits to the team
Participants: Sebastian Gruber.
10.3 European initiatives
10.3.1 Horizon Europe
DYNASTY
DYNASTY project on cordis.europa.eu
- Title: Dynamics-Aware Theory of Deep Learning
- Duration: From October 1, 2022 to September 30, 2027
- Partners: INSTITUT NATIONAL DE RECHERCHE EN INFORMATIQUE ET AUTOMATIQUE (INRIA), France
- Inria contact: Umut Simsekli
- Coordinator:
- Summary:
The recent advances in deep learning (DL) have transformed many scientific domains and have had major impacts on industry and society. Despite their success, DL methods do not obey most of the wisdoms of statistical learning theory, and the vast majority of the current DL techniques mainly stand as poorly understood black-box algorithms.
Even though DL theory has been a very active research field in the past few years, there is a significant gap between the current theory and practice: (i) the current theory often becomes vacuous for models with a large number of parameters (which is typical in DL), and (ii) it cannot capture the interaction between data, architecture, training algorithm and its hyper-parameters, which can have drastic effects on the overall performance. Due to this lack of theoretical understanding, designing new DL systems has been dominated by ad-hoc, 'trial-and-error' approaches.
The main objective of this proposal is to develop a mathematically sound and practically relevant theory for DL, which will ultimately serve as the basis of a software library that provides practical tools for DL practitioners. In particular, (i) we will develop error bounds that closely reflect the true empirical performance, by explicitly incorporating the dynamics aspect of training, (ii) we will develop new model selection, training, and compression algorithms with reduced time/memory/storage complexity, by exploiting the developed theory.
To achieve the expected breakthroughs, we will develop a novel theoretical framework, which will enable tight analysis of learning algorithms in the lens of dynamical systems theory. The outcomes will help relieve DL from being a black-box system and avoid the heuristic design process. We will produce comprehensive open-source software tools adapted to all popular DL libraries, and test the developed algorithms on a wide range of real applications arising in computer vision, audio/music/natural language processing.
CASPER
CASPER project on cordis.europa.eu
- Title: Systematic and computer-aided performance certification for numerical optimization
- Duration: From November 1, 2024 to October 31, 2029
- Partners: INSTITUT NATIONAL DE RECHERCHE EN INFORMATIQUE ET AUTOMATIQUE (INRIA), France
- Inria contact: Adrien Taylor
- Coordinator:
- Summary:
Numerical optimization is a fundamental tool with a growing impact in many disciplines from science to industry. Many of its successes are due to theoretical advances, which are key to developing trust in numerical algorithms. While trust is non-negotiable in many applications, the complexity level of modern and future problems makes it very hard for theory to keep up with efficient proposals. Arguably worse, while both theory and experimental practice are key to the field, their respective recommendations often conflict with each other and the gap between theory and practice gets embarrassingly large.
The main objective of this proposal is to push forward the theoretical foundations of algorithmic optimization to drastically reduce the gap between fundamental theoretical understanding and practical scenarios. To achieve this, we will develop principled and systematic approaches to algorithmic analyses, as well as computer-aided performance certification tools. Whereas my recent works show that such techniques already allow going far beyond the surprisingly few classical templates for algorithmic analysis, they currently have very limited applicability beyond simple scenarios. We will largely broaden the techniques to develop and study modern algorithms with working guarantees that can (i) scale to unprecedented problem and data sizes, (ii) adapt to common problem structures, and (iii) be deployed on modern massively parallel computing environments. Along the way, this project will allow for simplified certification and validation of existing theory, an absolute necessity in this era of massive scientific production.
Outcomes of CASPER will include symbolic and numerical algorithmic certification and development tools, as well as algorithms with unprecedented working guarantees. The tools will be released as open-source libraries, and the algorithms will be validated on key benchmarks that include challenging machine learning and robotic tasks.
10.3.2 H2020 projects
REAL
REAL project on cordis.europa.eu
- Title: Reliable and cost-effective large scale machine learning
- Duration: From April 1, 2021 to March 31, 2026
- Partners: INSTITUT NATIONAL DE RECHERCHE EN INFORMATIQUE ET AUTOMATIQUE (INRIA), France
- Inria contact: Alessandro Rudi
- Coordinator:
- Summary:
In the last decade, machine learning (ML) has become a fundamental tool with a growing impact in many disciplines, from science to industry. However, the scenario is now changing: data are growing exponentially faster than computational resources (the post-Moore's-law era), and ML algorithms are becoming crucial building blocks in complex systems for decision making, engineering, and science. Current machine learning is not suitable for this new scenario, both from a theoretical and a practical viewpoint: (a) the lack of cost-effectiveness of the algorithms directly impacts the economic/energetic costs of large-scale ML, making it barely affordable for universities or research institutes; (b) the lack of reliability of the predictions critically affects the safety of the systems where ML is employed. To deal with the challenges posed by this new scenario, REAL will lay the foundations of a solid theoretical and algorithmic framework for reliable and cost-effective large-scale machine learning on modern computational architectures. In particular, REAL will extend the classical ML framework to provide algorithms with two additional guarantees: (a) the predictions will be reliable, i.e., endowed with explicit bounds on their uncertainty guaranteed by the theory; (b) the algorithms will be cost-effective, i.e., they will be naturally adaptive to the new architectures and will provably achieve the desired reliability and accuracy level using the minimum possible computational resources. The algorithms resulting from REAL will be released as open-source libraries for distributed and multi-GPU settings, and their effectiveness will be extensively tested on key benchmarks from computer vision, natural language processing, audio processing, and bioinformatics. The methods and techniques developed in this project will help machine learning take the next step and become a safe, effective, and fundamental tool in science and engineering for large-scale data problems.
NN-OVEROPT
NN-OVEROPT project on cordis.europa.eu
- Title: Neural Network: An Overparametrization Perspective
- Duration: From November 1, 2021 to October 31, 2024
- Partners:
  - INSTITUT NATIONAL DE RECHERCHE EN INFORMATIQUE ET AUTOMATIQUE (INRIA), France
  - THE BOARD OF TRUSTEES OF THE UNIVERSITY OF ILLINOIS (UNIVERSITY OF ILLINOIS), United States
- Inria contact: Francis Bach
- Coordinator:
- Summary:
In recent times, overparametrized models, where the number of model parameters far exceeds the number of available training samples, have become the methods of choice for learning problems, and neural networks are amongst the most popular overparametrized methods used heavily in practice. It has been discovered recently that overparametrization surprisingly improves the optimization landscape of a complex non-convex problem, namely the training of neural networks, and also has positive effects on generalization performance. Despite the improved empirical performance of overparametrized models like neural networks, the theoretical understanding of these models is quite limited, which hinders progress in the field. Any progress in understanding the optimization as well as the generalization aspects of these complex models, especially neural networks, will lead to major technical advances in machine learning and artificial intelligence. During the Marie Sklodowska-Curie Actions Individual Fellowship - Global Fellowship (MSCA-IF-GF), I plan to study the optimization problem arising while training overparametrized neural networks, as well as generalization in overparametrized neural networks. The end goal of this project is to provide a better theoretical understanding of the optimization landscape encountered when training overparametrized models, and thereby to design better optimization algorithms for training and to study the universal approximation guarantees of overparametrized models. We also aim to study the implicit bias induced by optimization algorithms when training overparametrized complex models. To achieve these objectives, I will use tools from traditional optimization theory, statistical learning theory, gradient flows, as well as statistical physics.
11 Dissemination
11.1 Promoting scientific activities
11.1.1 Scientific events: organisation
Member of the organizing committees
- Adrien Taylor
- Workshop organizer at Robotics: Science and Systems ("Frontiers of Optimization for Robotics")
- Organizer of 4 sessions of talks (16 invited speakers) at EUROPT 2024.
- Umut Simsekli
- Co-organizer of the Séminaire Parisien de Statistique
- Alexandre d'Aspremont
- Co-organizer of the Physics of AI workshop at Les Houches.
11.1.2 Scientific events: selection
Member of conference program committees
- M. Jordan: Program Committee, International Congress of Mathematicians
Reviewer
- Adrien Taylor:
- Reviewer for the Conference on Learning Theory (COLT)
- Reviewer for Robotics: Science and Systems (RSS)
- Umut Simsekli
- Area chair for NeurIPS 2024
- Area chair for Algorithmic Learning Theory 2025
11.1.3 Journal
Member of the editorial boards
- Alexandre d'Aspremont: Section editor for SIAM Journal on the Mathematics of Data Science (SIMODS).
Reviewer - reviewing activities
- Adrien Taylor:
- Reviewer for Mathematical Programming Series A
- Reviewer for Mathematical Programming Series B
- Reviewer for SIAM Journal on Optimization (SIOPT)
- Reviewer for Optimization Letters
- Reviewer for Journal of Machine Learning Research (JMLR)
- Reviewer for Journal of the Association for Computing Machinery (JACM)
- Umut Simsekli
- Reviewer for Applied Probability Journals
- Alessandro Rudi
- Reviewer for the Journal of Machine Learning Research
- Reviewer for Constructive Approximation
11.1.4 Invited talks
- Adrien Taylor
- Optimization seminar in the DAO team
- (declined for ecological reasons, December 2024) Invited plenary talk at the Opt-ML NeurIPS workshop in Vancouver.
- (declined for ecological reasons, July 2024) Invited speaker at the DOML workshop in Tokyo.
- (declined for ecological reasons, July 2024) Invited talk at the International Symposium on Mathematical Programming (ISMP), Montréal.
- (June 2024) Invited talk at the European Conference on Operational Research (EURO), Copenhagen.
- (June 2024) Invited talk at the One World Optimization Seminar, Vienna.
- (April 2024) Invited talk at the Applied Algebra and Analysis Seminar, Bremen.
- Umut Simsekli
- (April 2024) Invited lecturer at the Isaac Newton Institute (Cambridge University) programme called 'Heavy Tails in Machine Learning'
- (April 2024) Invited talk at University of Oxford
- (November 2024) Invited talk at ENSAE
- Alexandre d'Aspremont
- Invited talk, Mathematics of Data, Institute for Mathematical Sciences (IMS), Singapore.
- Francis Bach
- Invited talk, Statistical Methods for Post Genomic Data 2024
- Invited seminars, Ecole Polytechnique Fédérale de Lausanne
- Colloquium, Institut Mathématique de Toulouse
- Keynote speaker, International Symposium on Biomedical Imaging
- Invited seminars, University of Warwick
- CIME summer school, Italy
- Michael Jordan
- Bruno De Finetti Lecture, ISBA, Venice, Italy, 7/4/24
- Keynote Speaker, joint ISBA-Fusion 2024 Workshop, Venice, Italy, 7/8/24
- Keynote Speaker, North American Economics and AI+ML Meeting of the Econometric Society, Ithaca, NY, 8/14/24
- Keynote Speaker, Conference on the Bund, Shanghai, China, 9/8/24
- Plenary Speaker, Trieste Science Festival, Trieste, Italy, 9/28/24
- Keynote Speaker, AI-ML Systems Conference, Baton Rouge, LA, 10/9/24
- Keynote Speaker, NETGCOOP Conference, Inria Lille, France, 10/11/24
- Keynote Speaker, RECSYS Conference, Bari, Italy, 10/15/24
- Colloquium Speaker, Harvard Statistics, Cambridge, MA, 11/4/24
11.1.5 Scientific expertise
- F. Bach: Member of the Scientific Council of the Société Informatique de France, since 2022.
11.2 Teaching - Supervision - Juries
11.2.1 Teaching
- Master: Alexandre d'Aspremont, Optimisation convexe: modélisation, algorithmes et applications, lectures 21h (2011-present), Master M2 MVA, ENS PS.
- Master: Francis Bach, Learning theory from first principles, 27h, Master M2 MASH, Université Paris Dauphine PSL, France.
- Master: Alessandro Rudi, Umut Simsekli, Introduction to Machine Learning, 52h, L3, ENS, Paris.
- Master: Alessandro Rudi, Kernel Methods, 10h, Master M2 MVA, ENS PS.
- Master: Adrien Taylor, Convex Optimization, 21h, M1, ENS, Paris.
- Master: Adrien Taylor, Optimisation convexe stochastique (invited), 3h, Master M2 MVA, ENS PS.
- Master: Adrien Taylor, Optimization and deep learning, 12h, Ecole Polytechnique, Palaiseau, France.
- Master: Umut Simsekli, Deep Learning, 21h, M2, Ecole Polytechnique, Palaiseau, France.
- Master: Francis Bach, Learning theory from first principles, 27h, Master M2 IASD, PSL Research University, France.
11.2.2 Supervision
- Adrien Taylor:
- PhD defense: Antoine Bambade
- PhD defense: Céline Moucer (co-advised with Francis Bach)
- PhD defense: Baptiste Goujaud
- new PhD student (starting 10/2024): Roland Andrews
- new PhD student (starting 10/2024): Weijia Wang
- new intern (starting 11/2024): Daniel Berg Thomsen
- Umut Simsekli
- PhD Student: Benjamin Dupuis
- PhD Student: Dario Shariatian (with Alain Durmus)
- Postdoc: Fabian Schaipp (with Adrien Taylor and Francis Bach)
- Postdoc: Maxime Haddouche
- Alexandre d'Aspremont:
- Benjamin Dubois-Taine, PhD student
- Sarah Brood, PhD student
- Arthur Calvi, PhD student
- Yang Su, Postdoc
- Fajwel Fogel, Postdoc
- Alessandro Rudi
- PhD defense: Gaspard Beugnot
- PhD defense: Theophile Cantelobre
- Francis Bach
- PhD defense: Bertille Follain
- PhD defense: Marc Lambert
- PhD in progress: Lawrence Stewart
- PhD in progress: Simon Martin
- New PhD student: Juliette Decugis, co-advised with Gabriel Synnaeve and Taco Cohen (Meta)
- New PhD student: Alexandre François, co-advised with Antonio Orvieto (ELLIS Institute, Tübingen)
- New Post-doc: Frederik Kunstner
- New PhD student: Eugène Berta (co-advised with Michael Jordan)
- New PhD student: Sacha Braun (co-advised with Michael Jordan)
- Michael Jordan
- New PhD student: Nabil Boukir (co-advised with Francis Bach)
- New PhD student: Etienne Gauthier (co-advised with Francis Bach)
- New Post-doc: Yurong Chen
11.3 Popularization
11.3.1 Participation in Live events
- F. Bach: Public lecture, Ambassade de France, Rome, Italy
12 Scientific production
12.1 Major publications
- 1. Topological Generalization Bounds for Discrete-Time Stochastic Optimization Algorithms. Advances in Neural Information Processing Systems (PMLR), Vancouver, Canada, 2024. HAL.
- 2. Approximation Bounds for Sparse Programs. SIAM Journal on Mathematics of Data Science 4(2), June 2022, 514-530. HAL, DOI.
- 3. Piecewise deterministic generative models. Advances in Neural Information Processing Systems (PMLR), Vancouver, Canada, 2024. HAL.
- 4. Measuring dissimilarity with diffeomorphism invariance. February 2022. HAL, DOI.
- 5. Optimal Complexity and Certification of Bregman First-Order Methods. Mathematical Programming 194(1), July 2022, 41-83. HAL, DOI.
- 6. Generalization Bounds using Data-Dependent Fractal Dimensions. International Conference on Machine Learning (ICML 2023), Proceedings of Machine Learning Research, Honolulu, United States, July 2023. HAL.
- 7. Generalization Bounds for Heavy-Tailed SDEs through the Fractional Fokker-Planck Equation. International Conference on Machine Learning (PMLR), Vienna, Austria, 2024. HAL.
- 8. Heavy-Tail Phenomenon in Decentralized SGD. IISE Transactions, 2024. HAL.
- 9. Cyclic and Randomized Stepsizes Invoke Heavier Tails in SGD than Constant Stepsize. Transactions on Machine Learning Research, 2023. HAL.
- 10. Generalization Bounds using Lower Tail Exponents in Stochastic Optimizers. International Conference on Machine Learning, Baltimore, United States, 2022. HAL.
- 11. Generalized Sliced Probability Metrics. ICASSP 2022 - IEEE International Conference on Acoustics, Speech and Signal Processing, Singapore, May 2022, 4513-4517. HAL, DOI.
- 12. Global assessment of oil and gas methane ultra-emitters. Science 375(6580), February 2022, 557-561. HAL, DOI.
- 13. Chaotic Regularization and Heavy-Tailed Limits for Deterministic Gradient Descent. Advances in Neural Information Processing Systems, New Orleans, United States, 2022. HAL.
- 14. Non-parametric Models for Non-negative Functions. Working paper or preprint, July 2020. HAL.
- 15. Generalization Bounds for Stochastic Gradient Descent via Localized ε-Covers. Advances in Neural Information Processing Systems, Baltimore, United States, September 2022. HAL.
- 16. Krunoslav Lehman Pavasovic, Alain Durmus and Umut Simsekli. Approximate Heavy Tails in Offline (Multi-Pass) Stochastic Gradient Descent. Advances in Neural Information Processing Systems, October 2023. HAL.
- 17. Algorithmic Stability of Heavy-Tailed Stochastic Gradient Descent on Least Squares. Algorithmic Learning Theory, Singapore, 2023. HAL.
- 18. Anant Raj, Umut Şimşekli and Alessandro Rudi. Efficient Sampling of Stochastic Differential Equations with Positive Semi-Definite Models. Advances in Neural Information Processing Systems, 2023. HAL.
- 19. Algorithmic Stability of Heavy-Tailed SGD with General Loss Functions. International Conference on Machine Learning, Honolulu, United States, 2023. HAL.
- 20. Sharpness, Restart and Acceleration. SIAM Journal on Optimization 30(1), October 2020, 262-289. HAL, DOI.
- 21. Generalization Guarantees via Algorithm-dependent Rademacher Complexity. Conference on Learning Theory, Bangalore (virtual event), India, July 2023. HAL.
- 22. Rate-Distortion Theoretic Generalization Bounds for Stochastic Learning Algorithms. COLT 2022 - 35th Annual Conference on Learning Theory, Proceedings of Machine Learning Research vol. 178, London, United Kingdom, July 2022. HAL.
- 23. Learning via Wasserstein-Based High Probability Generalisation Bounds. NeurIPS 2023 - Thirty-seventh Conference on Neural Information Processing Systems, New Orleans, United States, June 2023. HAL, DOI.
- 24. Implicit Compressibility of Overparametrized Neural Networks Trained with Heavy-Tailed SGD. International Conference on Machine Learning (PMLR), Vienna, Austria, 2024. HAL.
- 25. Non-Convex Optimization with Certificates and Fast Rates Through Kernel Sums of Squares. April 2022. HAL, DOI.
- 26. Lingjiong Zhu, Mert Gurbuzbalaban, Anant Raj and Umut Simsekli. Uniform-in-Time Wasserstein Stability Bounds for (Noisy) Stochastic Gradient Descent. Advances in Neural Information Processing Systems, 2023. HAL.
12.2 Publications of the year
International journals
International peer-reviewed conferences
Conferences without proceedings
Doctoral dissertations and habilitation theses
Reports & preprints
12.3 Cited publications
- 90. Non-Parametric Learning of Stochastic Differential Equations with Fast Rates of Convergence. Foundations of Computational Mathematics, March 2025. HAL.