Keywords
Computer Science and Digital Science
 A3.4. Machine learning and statistics
 A5.4. Computer vision
 A6.2. Scientific computing, Numerical Analysis & Optimization
 A7.1. Algorithms
 A8.2. Optimization
 A9.2. Machine learning
Other Research Topics and Application Domains
 B9.5.6. Data science
1 Team members, visitors, external collaborators
Research Scientists
 Francis Bach [Team leader, INRIA, Senior Researcher, HDR]
 Alessandro Rudi [INRIA, Researcher]
 Umut Simsekli [INRIA, Researcher]
 Alexandre d'Aspremont [CNRS, Senior Researcher]
PostDoctoral Fellows
 PierreCyril Aubin [INRIA]
 Hans Kersting [INRIA]
 Ziad Kobeissi [ILB]
 Clément Mantoux [INRIA]
 Anant Raj [INRIA  Marie Curie Fellowship]
 Blake Woodworth [DIENS]
PhD Students
 Andrea Basteri [Inria, from Oct 2022]
 Eloise Berthier [DGA]
 Gaspard Beugnot [INRIA]
 Theophile Cantelobre [INRIA]
 Bertille Follain [ENS PARIS]
 Gautier Izacard [CNRS]
 Remi Jezequel [ENS PARIS]
 Marc Lambert [DGA]
 Clément Lezane [UNIV TWENTE]
 Ulysse MarteauFerey [INRIA]
 Gregoire Mialon [Inria, until Jan 2022]
 Céline Moucer [ENS PARISSACLAY]
 Benjamin PaulDuboisTaine [UNIV PARIS SACLAY]
 Lawrence Stewart [INRIA]
Technical Staff
 Loic Esteve [INRIA]
Interns and Apprentices
 Benjamin Dupuis [Inria, from Sep 2022]
 David Holzmuller [UNIV TU BERLIN, from Mar 2022 until Sep 2022]
 Krunoslav Lehman Pavasovic [Inria, from Oct 2022]
 Sarah Sachs [University of Amsterdam, from Nov 2022]
Administrative Assistants
 Helene Bessin Rousseau [INRIA]
 Helene Milome [INRIA]
Visiting Scientists
 Silvere Bonnabel [UNIV NOUVELLECALEDONIE, from Sep 2022]
 Steffen Grunewalder [UNIV NEWCASTLE, from Sep 2022]
 Cristobal Guzman [INRIA, from Sep 2022]
2 Overall objectives
2.1 Statement
Machine learning is a recent scientific domain, positioned between applied mathematics, statistics and computer science. Its goals are the optimization, control, and modelisation of complex systems from examples. It applies to data from numerous engineering and scientific fields (e.g., vision, bioinformatics, neuroscience, audio processing, text processing, economy, finance, etc.), the ultimate goal being to derive general theories and algorithms allowing advances in each of these domains. Machine learning is characterized by the high quality and quantity of the exchanges between theory, algorithms and applications: interesting theoretical problems almost always emerge from applications, while theoretical analysis allows the understanding of why and when popular or successful algorithms do or do not work, and leads to proposing significant improvements.
Our academic positioning is exactly at the intersection between these three aspects—algorithms, theory and applications—and our main research goal is to make the link between theory and algorithms, and between algorithms and highimpact applications in various engineering and scientific fields, in particular computer vision, bioinformatics, audio processing, and text processing.
3 Research program
Machine learning has emerged as its own scientific domain in the last 30 years, providing a good abstraction of many problems and allowing exchanges of best practices between data oriented scientific fields. Among its main research areas, there are currently probabilistic models, supervised learning (including neural networks), unsupervised learning, reinforcement learning, and statistical learning theory. All of these are represented in the SIERRA team, but the main goals of the team are mostly related to supervised learning and optimization, and their mutual interactions, as well as with interdisciplinary collaborations. One particularity of the team is the strong focus on optimization (in particular convex optimization, but with more works in the nonconvex world recently), leading to contributions in optimization which go beyond the machine learning context.
We have thus divided our research effort in three parts:
 Convex optimization
 Nonconvex optimization
 Machine learning.
4 Application domains
Machine learning research can be conducted from two main perspectives: the first one, which has been dominant in the last 30 years, is to design learning algorithms and theories which are as generic as possible, the goal being to make as few assumptions as possible regarding the problems to be solved and to let data speak for themselves. This has led to many interesting methodological developments and successful applications. However, we believe that this strategy has reached its limit for many application domains, such as computer vision, bioinformatics, neuroimaging, text and audio processing, which leads to the second perspective our team is built on: Research in machine learning theory and algorithms should be driven by interdisciplinary collaborations, so that specific prior knowledge may be properly introduced into the learning process, in particular with the following fields:
 Computer vision: object recognition, object detection, image segmentation, image/video processing, computational photography. In collaboration with the Willow projectteam.
 Bioinformatics: cancer diagnosis, protein function prediction, virtual screening.
 Text processing: document collection modeling, language models.
 Audio processing: source separation, speech/music processing.
 Climate science (satellite imaging).
5 Highlights of the year
5.1 Awards
 Starting investigator ERC grant Dynasty (U. Simsekli)
 ICASSP 2022 best paper award (U. Simsekli)
 Médaille des Assises des Mathématiques (F. Bach)
6 New software and platforms
6.1 New software
6.1.1 PEPit

Name:
PEPit

Keyword:
Optimisation

Functional Description:
PEPit is a Python package aiming at simplifying the access to worstcase analyses of a large family of firstorder optimization methods possibly involving gradient, projection, proximal, or linear optimization oracles, along with their approximate, or Bregman variants. In short, PEPit is a package enabling computerassisted worstcase analyses of firstorder optimization methods. The key underlying idea is to cast the problem of performing a worstcase analysis, often referred to as a performance estimation problem (PEP), as a semidefinite program (SDP) which can be solved numerically. For doing that, the package users are only required to write firstorder methods nearly as they would have implemented them. The package then takes care of the SDP modelling parts, and the worstcase analysis is performed numerically via a standard solver.
This software is primarily based on the works on performance estimation problems by Adrien Taylor. Compared to other scientific software, its maintenance is relatively low cost (we can do it ourself, together with students involved in using those techniques). We plan to continue updating this software by incorporating recent advances of the community, and with the clear long term idea of making it a tool for teaching firstorder optimization.
 URL:

Contact:
Adrien Taylor
7 New results
7.1 Information theory through kernel methods
We consider the analysis of probability distributions through their associated covariance operators from reproducing kernel Hilbert spaces. We show that the von Neumann entropy and relative entropy of these operators are intimately related to the usual notions of Shannon entropy and relative entropy, and share many of their properties. They come together with efficient estimation algorithms from various oracles on the probability distributions. We also consider product spaces and show that for tensor product kernels, we can define notions of mutual information and joint entropies, which can then characterize independence perfectly, but only partially conditional independence. We finally show how these new notions of relative entropy lead to new upperbounds on log partition functions, that can be used together with convex optimization within variational inference methods, providing a new family of probabilistic inference methods. 52
7.2 Optimal Algorithms for Stochastic Complementary Composite Minimization
Inspired by regularization techniques in statistics and machine learning, we study complementary composite minimization in the stochastic setting. This problem corresponds to the minimization of the sum of a (weakly) smooth function endowed with a stochastic firstorder oracle, and a structured uniformly convex (possibly nonsmooth and nonLipschitz) regularization term. Despite intensive work on closely related settings, prior to our work no complexity bounds for this problem were known. We close this gap by providing novel excess risk bounds, both in expectation and with high probability. Our algorithms are nearly optimal, which we prove via novel lower complexity bounds for this class of problems. We conclude by providing numerical results comparing our methods to the state of the art.
7.3 SuperAcceleration with Cyclical Stepsizes
We develop a convergencerate analysis of momentum with cyclical stepsizes. We show that under some assumption on the spectral gap of Hessians in machine learning, cyclical stepsizes are provably faster than constant stepsizes. More precisely, we develop a convergence rate analysis for quadratic objectives that provides optimal parameters and shows that cyclical learning rates can improve upon traditional lower complexity bounds. We further propose a systematic approach to design optimal first order methods for quadratic minimization with a given spectral structure. Finally, we provide a local convergence rate analysis beyond quadratic minimization for the proposed methods and illustrate our findings through benchmarks on least squares and logistic regression problems.
7.4 An optimal gradient method for smooth strongly convex minimization
We present an optimal gradient method for smooth strongly convex optimization. The method is optimal in the sense that its worstcase bound on the distance to an optimal point exactly matches the lower bound on the oracle complexity for the class of problems, meaning that no blackbox firstorder method can have a better worstcase guarantee without further assumptions on the class of problems at hand. In addition, we provide a constructive recipe for obtaining the algorithmic parameters of the method and illustrate that it can be used for deriving methods for other optimality criteria as well.
7.5 PEPit: computerassisted worstcase analyses of firstorder optimization methods in Python
PEPit is a Python package aiming at simplifying the access to worstcase analyses of a large family of firstorder optimization methods possibly involving gradient, projection, proximal, or linear optimization oracles, along with their approximate, or Bregman variants. In short, PEPit is a package enabling computerassisted worstcase analyses of firstorder optimization methods. The key underlying idea is to cast the problem of performing a worstcase analysis, often referred to as a performance estimation problem (PEP), as a semidefinite program (SDP) which can be solved numerically. For doing that, the package users are only required to write firstorder methods nearly as they would have implemented them. The package then takes care of the SDP modelling parts, and the worstcase analysis is performed numerically via a standard solver.
7.6 LastIterate Convergence of Optimistic Gradient Method for Monotone Variational Inequalities
The Past Extragradient (PEG) [Popov, 1980] method, also known as the Optimistic Gradient method, has known a recent gain in interest in the optimization community with the emergence of variational inequality formulations for machine learning. Recently, in the unconstrained case, Golowich et al. [2020] proved that a $O(1/N)$ lastiterate convergence rate in terms of the squared norm of the operator can be achieved for Lipschitz and monotone operators with a Lipschitz Jacobian. In this work, by introducing a novel analysis through potential functions, we show that (i) this $O(1/N)$ lastiterate convergence can be achieved without any assumption on the Jacobian of the operator, and (ii) it can be extended to the constrained case, which was not derived before even under Lipschitzness of the Jacobian. The proof is significantly different from the one known from Golowich et al. [2020], and its discovery was computeraided. Those results close the open question of the last iterate convergence of PEG for monotone variational inequalities.
7.7 Fast Stochastic Composite Minimization and an Accelerated FrankWolfe Algorithm under Parallelization
We consider the problem of minimizing the sum of two convex functions. One of those functions has Lipschitzcontinuous gradients, and can be accessed via stochastic oracles, whereas the other is “simple”. We provide a Bregmantype algorithm with accelerated convergence in function values to a ball containing the minimum. The radius of this ball depends on problemdependent constants, including the variance of the stochastic oracle. We further show that this algorithmic setup naturally leads to a variant of FrankWolfe achieving acceleration under parallelization. More precisely, when minimizing a smooth convex function on a bounded domain, we show that one can achieve an $\u03f5$ primaldual gap (in expectation) in $\tilde{O}(1/\sqrt{\u03f5})$ iterations, by only accessing gradients of the original function and a linear maximization oracle with $O(1/\sqrt{\u03f5})$ computing units in parallel. We illustrate this fast convergence on synthetic numerical experiments.
7.8 VectorValued LeastSquares Regression under Output Regularity Assumptions.
In 19 we propose and analyse a reducedrank method for solving leastsquares regression problems with infinite dimensional output. We derive learning bounds for our method, and study under which setting statistical performance is improved in comparison to fullrank method. Our analysis extends the interest of reducedrank regression beyond the standard lowrank setting to more general output regularity assumptions. We illustrate our theoretical insights on synthetic leastsquares problems. Then, we propose a surrogate structured prediction method derived from this reducedrank method. We assess its benefits on three different problems: image reconstruction, multilabel classification, and metabolite identification.
7.9 On the Benefits of Large Learning Rates for Kernel Methods
In 30, we study an intriguing phenomenon related to the good generalization performance of estimators obtained by using large learning rates within gradient descent algorithms. First observed in the deep learning literature, we show that such a phenomenon can be precisely characterized in the context of kernel methods, even though the resulting optimization problem is convex. Specifically, we consider the minimization of a quadratic objective in a separable Hilbert space, and show that with early stopping, the choice of learning rate influences the spectral decomposition of the obtained solution on the Hessian's eigenvectors. This extends an intuition described by Nakkiran (2020) on a twodimensional toy problem to realistic learning scenarios such as kernel ridge regression. While large learning rates may be proven beneficial as soon as there is a mismatch between the train and test objectives, we further explain why it already occurs in classification tasks without assuming any particular mismatch between train and test data distributions.
7.10 Active Labeling: Streaming Stochastic Gradients
The workhorse of machine learning is stochastic gradient descent. To access stochastic gradients, it is common to consider iteratively input/output pairs of a training dataset. Interestingly, it appears that one does not need full supervision to access stochastic gradients, which is the main motivation of this paper. After formalizing the "active labeling" problem, which focuses on active learning with partial supervision, we provide a streaming technique that provably minimizes the ratio of generalization error over the number of samples. We illustrate our technique in depth for robust regression. The results are presented in 31
7.11 Sampling from Arbitrary Functions via PSD Models
In many areas of applied statistics and machine learning, generating an arbitrary number of independent and identically distributed (i.i.d.) samples from a given distribution is a key task. When the distribution is known only through evaluations of the density, current methods either scale badly with the dimension or require very involved implementations. Instead, we take a twostep approach by first modeling the probability distribution and then sampling from that model. We use the recently introduced class of positive semidefinite (PSD) models, which have been shown to be efficient for approximating probability densities. We show that these models can approximate a large class of densities concisely using few evaluations, and present a simple algorithm to effectively sample from these models. We also present preliminary empirical results to illustrate our assertions. The results are presented in 37.
7.12 On the Consistency of MaxMargin Losses
The foundational concept of MaxMargin in machine learning is illposed for output spaces with more than two labels such as in structured prediction. In this paper, we show that the MaxMargin loss can only be consistent to the classification task under highly restrictive assumptions on the discrete loss measuring the error between outputs. These conditions are satisfied by distances defined in tree graphs, for which we prove consistency, thus being the first losses shown to be consistent for MaxMargin beyond the binary setting. We finally address these limitations by correcting the concept of MaxMargin and introducing the RestrictedMaxMargin, where the maximization of the lossaugmented scores is maintained, but performed over a subset of the original domain. The resulting loss is also a generalization of the binary support vector machine and it is consistent under milder conditions on the discrete loss. The results are presented in the paper 45.
7.13 Exponential convergence of sumofsquares hierarchies for trigonometric polynomials
In 54 we consider the unconstrained optimization of multivariate trigonometric polynomials by the sumofsquares hierarchy of lower bounds. We first show a convergence rate of $O(1/{s}^{2})$ for the relaxation with degree s without any assumption on the trigonometric polynomial to minimize. Second, when the polynomial has a finite number of global minimizers with invertible Hessians at these minimizers, we show an exponential convergence rate with explicit constants. Our results also apply to the minimization of regular multivariate polynomials on the hypercube.
7.14 Measuring dissimilarity with diffeomorphism invariance
Measures of similarity (or dissimilarity) are a key ingredient to many machine learning algorithms. We introduce DID, a pairwise dissimilarity measure applicable to a wide range of data spaces, which leverages the data's internal structure to be invariant to diffeomorphisms. We prove that DID enjoys properties which make it relevant for theoretical study and practical use. By representing each datum as a function, DID is defined as the solution to an optimization problem in a Reproducing Kernel Hilbert Space and can be expressed in closedform. In practice, it can be efficiently approximated via Nyström sampling. Empirical experiments support the merits of DID. The results are presented in the paper 58.
7.15 NonConvex Optimization with Certificates and Fast Rates Through Kernel Sums of Squares
In 41 We consider potentially nonconvex optimization problems, for which optimal rates of approximation depend on the dimension of the parameter space and the smoothness of the function to be optimized. In this paper, we propose an algorithm that achieves close to optimal a priori computational guarantees, while also providing a posteriori certificates of optimality. Our general formulation builds on infinitedimensional sumsofsquares and Fourier analysis, and is instantiated on the minimization of multivariate periodic functions.
7.16 Generalization Bounds using Lower Tail Exponents in Stochastic Optimizers
Despite the ubiquitous use of stochastic optimization algorithms in machine learning, the precise impact of these algorithms and their dynamics on generalization performance in realistic nonconvex settings is still poorly understood. While recent work has revealed connections between generalization and heavytailed behavior in stochastic optimization, this work mainly relied on continuoustime approximations; and a rigorous treatment for the original discretetime iterations is yet to be performed. To bridge this gap, we present novel bounds linking generalization to the lower tail exponent of the transition kernel associated with the optimizer around a local minimum, in both discrete and continuoustime settings. To achieve this, we first prove a data and algorithmdependent generalization bound in terms of the celebrated FerniqueTalagrand functional applied to the trajectory of the optimizer. Then, we specialize this result by exploiting the Markovian structure of stochastic optimizers, and derive bounds in terms of their (datadependent) transition kernels. We support our theory with empirical results from a variety of neural networks, showing correlations between generalization error and lower tail exponents. The results are presented in the paper 32.
7.17 RateDistortion Theoretic Generalization Bounds for Stochastic Learning Algorithms
Understanding generalization in modern machine learning settings has been one of the major challenges in statistical learning theory. In this context, recent years have witnessed the development of various generalization bounds suggesting different complexity notions such as the mutual information between the data sample and the algorithm output, compressibility of the hypothesis space, and the fractal dimension of the hypothesis space. While these bounds have illuminated the problem at hand from different angles, their suggested complexity notions might appear seemingly unrelated, thereby restricting their highlevel impact. In this study, we prove novel generalization bounds through the lens of ratedistortion theory, and explicitly relate the concepts of mutual information, compressibility, and fractal dimensions in a single mathematical framework. Our approach consists of (i) defining a generalized notion of compressibility by using source coding concepts, and (ii) showing that the 'compression error rate' can be linked to the generalization error both in expectation and with high probability. We show that in the 'lossless compression' setting, we recover and improve existing mutual informationbased bounds, whereas a 'lossy compression' scheme allows us to link generalization to the ratedistortion dimensiona particular notion of fractal dimension. Our results bring a more unified perspective on generalization and open up several future research directions. The results are presented in the paper 46.
7.18 Chaotic Regularization and HeavyTailed Limits for Deterministic Gradient Descent
Recent studies have shown that gradient descent (GD) can achieve improved generalization when its dynamics exhibits a chaotic behavior. However, to obtain the desired effect, the stepsize should be chosen sufficiently large, a task which is problem dependent and can be difficult in practice. In this study, we incorporate a chaotic component to GD in a controlled manner, and introduce multiscale perturbed GD (MPGD), a novel optimization framework where the GD recursion is augmented with chaotic perturbations that evolve via an independent dynamical system. We analyze MPGD from three different angles: (i) By building up on recent advances in rough paths theory, we show that, under appropriate assumptions, as the stepsize decreases, the MPGD recursion converges weakly to a stochastic differential equation (SDE) driven by a heavytailed Lévystable process. (ii) By making connections to recently developed generalization bounds for heavytailed processes, we derive a generalization bound for the limiting SDE and relate the worstcase generalization error over the trajectories of the process to the parameters of MPGD. (iii) We analyze the implicit regularization effect brought by the dynamical regularization and show that, in the weak perturbation regime, MPGD introduces terms that penalize the Hessian of the loss function. Empirical results are provided to demonstrate the advantages of MPGD. The results are presented in the paper 36.
7.19 Generalized Sliced Probability Metrics
Sliced probability metrics have become increasingly popular in machine learning, and they play a quintessential role in various applications, including statistical hypothesis testing and generative modeling. However, in a practical setting, the convergence behavior of the algorithms built upon these distances have not been well established, except for a few specific cases. In this paper, we introduce a new family of sliced probability metrics, namely Generalized Sliced Probability Metrics (GSPMs), based on the idea of slicing highdimensional distributions into a set of their onedimensional marginals. We show that GSPMs are true metrics, and they are related to the Maximum Mean Discrepancy (MMD). Exploiting this relationship, we consider GSPMbased gradient flows and show that, under mild assumptions, the gradient flow converges to the global optimum. Finally, we demonstrate that various choices of GSPMs lead to new positive definite kernels that could be used in the MMD formulation while providing a unique integral geometric interpretation. We illustrate the application of GSPMs in gradient flows. The results are presented in the paper 33.
7.20 Generalization Bounds for Stochastic Gradient Descent via Localized epsilonCovers
In 40, we propose a new covering technique localized for the trajectories of SGD. This localization provides an algorithmspecific complexity measured by the covering number, which can have dimensionindependent cardinality in contrast to standard uniform covering arguments that result in exponential dimension dependency. Based on this localized construction, we show that if the objective function is a finite perturbation of a piecewise strongly convex and smooth function with $P$ pieces, i.e. nonconvex and nonsmooth in general, the generalization error can be upper bounded by $O\left(\sqrt{(lognlog(nP\left)\right)/n}\right)$, where $n$ is the number of data samples. In particular, this rate is independent of dimension and does not require early stopping and decaying step size. Finally, we employ these results in various contexts and derive generalization bounds for multiindex linear models, multiclass support vector machines, and $K$means clustering for both hard and soft label setups, improving the known stateoftheart rates.
8 Bilateral contracts and grants with industry
8.1 Bilateral grants with industry
 Alexandre d’Aspremont, Francis Bach, Martin Jaggi (EPFL): Google Focused award.
 Francis Bach: Gift from Facebook AI Research.
 Alexandre d’Aspremont: fondation AXA, "Mécénat scientifique", optimisation & machine learning.
9 Partnerships and cooperations
9.1 International initiatives
9.1.1 Associate Teams in the framework of an Inria International Lab or in the framework of an Inria International Program
FOAM

Title:
FirstOrder Accelerated Methods for Machine Learning.

Duration:
2020 >

Coordinator:
Cristobal Guzman (crguzmanp@mat.uc.cl)

Partners:
 Pontificia Universidad Católica de Chile Santiago (Chili)

Inria contact:
Alexandre d'Aspremont

Summary:
Our main interest is to investigate novel and improved convergence results for firstorder iterative methods for saddlepoints, variational inequalities and fixed points, under the lens of PEP. Our interest in improving firstorder methods is also deeply related with applications in machine learning. Particularly in sparsityoriented inverse problems, optimization methods are the workhorse for state of the art results. On some of these problems, a set of new hypothesis and theoretical results shows improved complexity bounds for problems with good recovery guarantees and we plan to extend these new performance bounds to the variational framework.
4TUNE
 Title: Adaptive, Efficient, Provable and Flexible Tuning for Machine Learning
 Duration: 2020 >
 Coordinator: Peter Grünwald (pdg@cwi.nl)
 Partners: CWI
 Inria contact: Pierre Gaillard
 The longterm goal of 4TUNE is to push adaptive machine learning to the next level. We aim to develop refined methods, going beyond traditional worstcase analysis, for exploiting structure in the learning problem at hand. We will develop new theory and design sophisticated algorithms for the core tasks of statistical learning and individual sequence prediction. We are especially interested in understanding the connections between these tasks and developing unified methods for both. We will also investigate adaptivity to nonstandard patterns encountered in embedded learning tasks, in particular in iterative equilibrium computations.
9.2 International research visitors
9.2.1 Visits of international scientists
 Sarah Sachs  PhD Student visiting from University of Amsterdam
 Mert Gurbuzbalaban  Associate Professor visiting from Rutgers University
9.3 European initiatives
9.3.1 Horizon Europe
DYNASTY
DYNASTY project on cordis.europa.eu

Title:
DynamicsAware Theory of Deep Learning

Duration:
From October 1, 2022 to September 30, 2027

Partners:
 INSTITUT NATIONAL DE RECHERCHE EN INFORMATIQUE ET AUTOMATIQUE (INRIA), France

Inria contact:
Umut SIMSEKLI
 Coordinator:

Summary:
The recent advances in deep learning (DL) have transformed many scientific domains and have had major impacts on industry and society. Despite their success, DL methods do not obey most of the wisdoms of statistical learning theory, and the vast majority of the current DL techniques mainly stand as poorly understood blackbox algorithms.
Even though DL theory has been a very active research field in the past few years, there is a significant gap between the current theory and practice: (i) the current theory often becomes vacuous for models with large number of parameters (which is typical in DL), and (ii) it cannot capture the interaction between data, architecture, training algorithm and its hyperparameters, which can have drastic effects on the overall performance. Due to this lack of theoretical understanding, designing new DL systems has been dominantly performed by adhoc, 'trialanderror' approaches.
The main objective of this proposal is to develop a mathematically sound and practically relevant theory for DL, which will ultimately serve as the basis of a software library that provides practical tools for DL practitioners. In particular, (i) we will develop error bounds that closely reflect the true empirical performance, by explicitly incorporating the dynamics aspect of training, (ii) we will develop new model selection, training, and compression algorithms with reduced time/memory/storage complexity, by exploiting the developed theory.
To achieve the expected breakthroughs, we will develop a novel theoretical framework, which will enable tight analysis of learning algorithms in the lens of dynamical systems theory. The outcomes will help relieve DL from being a blackbox system and avoid the heuristic design process. We will produce comprehensive opensource software tools adapted to all popular DL libraries, and test the developed algorithms on a wide range of real applications arising in computer vision, audio/music/natural language processing.
9.3.2 H2020 projects
SEQUOIA
SEQUOIA project on cordis.europa.eu

Title:
Robust algorithms for learning from modern data

Duration:
From September 1, 2017 to August 31, 2023

Partners:
 INSTITUT NATIONAL DE RECHERCHE EN INFORMATIQUE ET AUTOMATIQUE (INRIA), France

Inria contact:
Francis BACH
 Coordinator:

Summary:
Machine learning is needed and used everywhere, from science to industry, with a growing impact on many disciplines. While first successes were due at least in part to simple supervised learning algorithms used primarily as black boxes on mediumscale problems, modern data pose new challenges. Scalability is an important issue of course: with large amounts of data, many current problems far exceed the capabilities of existing algorithms despite sophisticated computing architectures. But beyond this, the core classical model of supervised machine learning, with the usual assumptions of independent and identically distributed data, or welldefined features, outputs and loss functions, has reached its theoretical and practical limits.
Given this new setting, existing optimizationbased algorithms are not adapted. The main objective of this proposal is to push the frontiers of supervised machine learning, in terms of (a) scalability to data with massive numbers of observations, features, and tasks, (b) adaptability to modern computing environments, in particular for parallel and distributed processing, (c) provable adaptivity and robustness to problem and hardware specifications, and (d) robustness to nonconvexities inherent in machine learning problems.
To achieve the expected breakthroughs, we will design a novel generation of learning algorithms amenable to a tight convergence analysis with realistic assumptions and efficient implementations. They will help transition machine learning algorithms towards the same widespread robust use as numerical linear algebra libraries. Outcomes of the research described in this proposal will include algorithms that come with strong convergence guarantees and are welltested on reallife benchmarks coming from computer vision, bioinformatics, audio processing and natural language processing. For both distributed and nondistributed settings, we will release opensource software, adapted to widely available computing platforms.
REAL
REAL project on cordis.europa.eu

Title:
Reliable and costeffective large scale machine learning

Duration:
From April 1, 2021 to March 31, 2026

Partners:
 INSTITUT NATIONAL DE RECHERCHE EN INFORMATIQUE ET AUTOMATIQUE (INRIA), France

Inria contact:
Alessandro Rudi
 Coordinator:

Summary:
In the last decade, machine learning (ML) has become a fundamental tool with a growing impact in many disciplines, from science to industry. However, nowadays, the scenario is changing: data are exponentially growing compared to the computational resources (post Moore's law era), and ML algorithms are becoming crucial building blocks in complex systems for decision making, engineering, science. Current machine learning is not suitable for the new scenario, both from a theoretical and a practical viewpoint: (a) the lack of costeffectiveness of the algorithms impacts directly the economic/energetic costs of large scale ML, making it barely affordable by universities or research institutes; (b) the lack of reliability of the predictions affects critically the safety of the systems where ML is employed. To deal with the challenges posed by the new scenario, REAL will lay the foundations of a solid theoretical and algorithmic framework for reliable and costeffective large scale machine learning on modern computational architectures. In particular, REAL will extend the classical ML framework to provide algorithms with two additional guarantees: (a) the predictions will be reliable, i.e., endowed with explicit bounds on their uncertainty guaranteed by the theory; (b) the algorithms will be costeffective, i.e., they will be naturally adaptive to the new architectures and will provably achieve the desired reliability and accuracy level, by using minimum possible computational resources. The algorithms resulting from REAL will be released as opensource libraries for distributed and multiGPU settings, and their effectiveness will be extensively tested on key benchmarks from computer vision, natural language processing, audio processing, and bioinformatics. The methods and the techniques developed in this project will help machine learning to take the next step and become a safe, effective, and fundamental tool in science and engineering for large scale data problems.
NNOVEROPT
NNOVEROPT project on cordis.europa.eu

Title:
Neural Network : An Overparametrization Perspective

Duration:
From November 1, 2021 to October 31, 2024

Partners:
 INSTITUT NATIONAL DE RECHERCHE EN INFORMATIQUE ET AUTOMATIQUE (INRIA), France
 THE BOARD OF TRUSTEES OF THE UNIVERSITY OF ILLINOIS (UNIVERSITY OF ILLINOIS), United States

Inria contact:
Francis Bach
 Coordinator:

Summary:
In recent times, overparametrized models where the number of model parameters far exceeds the number of training samples available are the methods of choice for learning problems and neural networks are amongst the most popular overparametrized methods used heavily in practice. It has been discovered recently that overparametrization surprisingly improves the optimization landscape of a complex nonconvex problem, i.e., the training of neural networks, and also has positive effects on the generalization performance. Despite improved empirical performance of overparametrized models like neural networks, the theoretical understanding of these models is quite limited which hinders the progress of the field in the right direction. Any progress in the understanding of the optimization as well as generalization aspects for theses complex models especially neural networks will lead to big technical advancement in the field of machine learning and artificial intelligence. During the Marie SklodowskaCurie Actions Individual FellowshipGlobal Fellowship (MSCAIFGF), I plan to study the optimization problem arising while training overparametrized neural networks and generalization in overparametrized neural networks. The end goal for this project is to provide better theoretical understanding of the optimization landscape while training overparametrized models as a result of which to provide better optimization algorithms for training as well as to study the universal approximation guarantees of overparametrized models. We also aim to study the implicit bias induced by optimization algorithms while training overparametrized complex models. To achieve the objective discussed above, I will be using tools from traditional optimization theory, statistical learning theory, gradient flows, as well as from statistical physics.
9.4 National initiatives
 Alexandre d'Aspremont: IRIS, PSL “Science des données, données de la science”.
10 Dissemination
10.1 Promoting scientific activities
10.1.1 Scientific events: organisation
Member of the organizing committees
 F. Bach: Workshop on on Uncertainty Quantification,Erwin Schrödinger International Institute for Mathematics and Physics (ESI), coorganizer, June 2022.
Member of the conference program committees
 A. Rudi: area chair, ICML 2022.
 U. Simsekli: area chair, ICML and NeurIPS 2022.
10.1.2 Journal
Member of the editorial boards
 A. d'Aspremont, Associate Editor, SIAM Journal on Optimization.
 A. d'Aspremont, Section Editor, SIAM Journal on the Mathematics of Data Science.
 F. Bach, coeditorinchief, Journal of Machine Learning Research
 F. Bach: Series Editor, Adaptive Computation and Machine Learning, MIT Press, since 2016.
Reviewer  reviewing activities
 Adrien Taylor: reviewer for “SIAM Journal on Optimization”.
 Adrien Taylor: reviewer for “Journal of Optimization Theory and Applications”.
 Adrien Taylor: reviewer for “IMA Journal on Numerical Analysis”.
 Adrien Taylor: reviewer for “Conference on Learning Theory 2022”.
 Adrien Taylor: reviewer for “Journal of Machine Learning Research”.
 A. Rudi: reviewer for “Journal of Machine Learning Research”.
 U. Simsekli: reviewer for “Journal of Applied Probability”.
10.1.3 Invited talks
 A. d'Aspremont, ISMP 2022, semiplenary.
 Adrien Taylor: invited talk at TradeOPT Workshop on Algorithmic and Continuous Optimization, LouvainlaNeuve.
 Adrien Taylor: invited talk at Workshop on Conic Linear Optimization for Computerassisted Proofs, Oberwolfach.
 Adrien Taylor: invited talk at Rice University (ECE speaker seminar series, online).
 A. Rudi: CzechFrench Workshop in AI, Prague, Sept 2022
 A. Rudi:ELLIS Theory Workshop, Arenzano, Italy, June 2022
 A. Rudi:Colloquium PRAIRIE, Paris, May 2022
 A. Rudi: Seminaire Parisien de Statistique — Institut Henri Poincare ́, Paris, April 2022
 A. Rudi: Workshop MASMODE, SMAI, Paris, March 2022
 A. Rudi: Seminars of IA and Mathematics by Centro Nazionale di Ricerche, March 2022
 A. Rudi: Seminaire Parisien d’Optimisation — Institut Henri Poincare ́, Paris, March 2022
 A. Rudi: SDCE Seminar Series  Alan Turing Institute, London, March 2022
 A. Rudi: Unione Matematica Italiana, Colloquium Prisma, Feb. 2022
 U. Simsekli: The Statistics and Machine Learning Thematic Seminar, University of Amsterdam, Nov. 2022
 U. Simsekli: On Future Trends and Opportunities for Monte Carlo methods, Warsaw, Dec. 2022
 F. Bach: Workshop on optimization, CIRM, Luminy, May 2022
 F. Bach: Workshop on probabilistic programming, Collège de France, July 2022.
 F. Bach: International Congress of Mathematicians, online, July 2022, invited speaker.
 F. Bach: ECMLPKDD, Grenoble, invited speaker, September 2022.
 F. Bach: Workshop on optimization, CIRM, Luminy, October 2022
10.1.4 Leadership within the scientific community
 F. Bach: president of the board of ICML
 F. Bach: Member of the Scientific Council of the Société Informatique de France, since 2022.
10.1.5 Scientific expertise
 F. Bach: Deputy Scientific director, Inria Paris
10.2 Teaching  Supervision  Juries
10.2.1 Teaching
 Master: Alexandre d'Aspremont, Optimisation convexe: modélisation, algorithmes et applications cours magistraux 21h (2011Present), Master M2 MVA, ENS PS.
 Master : Francis Bach, Optimisation et apprentissage statistique, 20h, Master M2 (Mathématiques de l'aléatoire), Université ParisSud, France.
 Master : Francis Bach, Learning theory from first principles, 27h, Master M2 MASH, Université Paris Dauphine PSL, France.
 Master: Alessandro Rudi, Umut Simsekli. Introduction to Machine Learning, 52h, L3, ENS, Paris.
10.2.2 Supervision
 PhD in progress: Antoine Bambade supervised by Jean Ponce (WILLOW), Justin Carpentier (WILLOW), and Adrien Taylor.
 PhD in progress: Baptiste Goujaud supervised by Eric Moulines (École Polytechnique), Aymeric Dieuleveut (École Polytechnique), and Adrien Taylor.
 PhD in progress: Céline Moucer supervised by Francis Bach and Adrien Taylor.
 PhD in progress: Theophile Cantelobre supervised by Benjamin Guedji (INRIA), Carlo Ciliberto (UCL), A. Rudi.
 PhD in progress: Gaspard Beugnot supervised by Julien Mairal (INRIA) and A. Rudi.
 PhD in progress: Andrea Basteri supervised by A. Rudi.
 PhD in progress: Bertille Follain supervised by F. Bach and U. Simsekli.
 PhD in progress: Marc Lambert supervised by F. Bach and S. Bonnabel.
 PhD in progress: Ivan Lerner, coadvised with Anita Burgun et Antoine Neuraz.
 PhD in progress: Lawrence Stewart, coadvised by Francis Bach and JeanPhilippe Vert.
 PhD in progress: Gautier Izacard, coadvised by Alexandre d'Aspremont and Edouard Grave (Meta).
 PhD in progress: Rémi Jezequel, coadvised by Alessandro Rudi and Pierre Gaillard (Inria Grenoble)
 PhD in progress: Clément Lezane, coadvised by Alexandre d'Aspremont and Cristobal Guzman.
 PhD defended: Vivien Cabannes, supervised by Francis Bach and Alessandro Rudi.
 PhD defended: Eloise Berthier, supervised by Francis Bach.
 PhD defended: Ulysse Marteau Ferey, supervised by Alessandro Rudi and Francis Bach.
 PhD defended: Grégoire Mialon, Sample Selection Methods, 2018, Alexandre d'Aspremont (joint with Julien Mairal)
 PHD defended: Theo Ryffel, supervised by Francis Bach and David Pointcheval.
10.2.3 Juries
 A. d'Aspremont. PhD exam, David Martinez, Oxford.
 A. d'Aspremont. PhD exam, Guillaume Braun, U. Lille.
 A. d'Aspremont. HdR, Andrea Simonetto, ENSTA.
 A. Rudi. PhD exam, Jonas Wacker, Eurecom.
 A. Rudi. PhD exam, Alex Lambert, Telecom Paris.
 A. Rudi, PhD exam, Belhal Karimi, Ecole Polytechnique.
 U. Simsekli, PhD exam, Thibault Séjourné, ENS Paris
 U. Simsekli, PhD exam, Merve Gürel, ETH Zurich
 F. Bach, PhD exam, El Mehdi Achour, Université de Toulouse
 F. Bach, PhD exam, Daniel LeJeune, Rice University
10.3 Popularization
10.3.1 Interventions
 A. d'Aspremont. Vivatech
 A. d'Aspremont. AI for Finance workshop, palais Brongniart.
11 Scientific production
11.1 Major publications
 1 articleApproximation Bounds for Sparse Programs.SIAM Journal on Mathematics of Data Science42June 2022, 514530
 2 miscMeasuring dissimilarity with diffeomorphism invariance.February 2022
 3 articleOptimal Complexity and Certification of Bregman FirstOrder Methods.Mathematical Programming1941July 2022, 4183
 4 inproceedingsGeneralization Bounds using Lower Tail Exponents in Stochastic Optimizers.International Conference on Machine LearningBaltimore, United States2022
 5 inproceedingsGeneralized Sliced Probability Metrics.ICASSP 2022  2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)Singapore, SingaporeIEEEMay 2022, 45134517
 6 articleGlobal assessment of oil and gas methane ultraemitters.Science3756580February 2022, 557561
 7 inproceedingsChaotic Regularization and HeavyTailed Limits for Deterministic Gradient Descent.Advances in Neural Processing SystemsNew Orleans, United States2022
 8 unpublishedNonparametric Models for Nonnegative Functions.July 2020, working paper or preprint

9
inproceedingsGeneralization Bounds for Stochastic Gradient Descent via Localized
$$ Covers.Advances in Neural Processing SystemsBaltimore, United StatesSeptember 2022  10 articleSharpness, Restart and Acceleration.SIAM Journal on Optimization301October 2020, 262289
 11 inproceedingsRateDistortion Theoretic Generalization Bounds for Stochastic Learning Algorithms.COLT 2022  35th Annual Conference on Learning Theory178Proceedings of Machine Learning ResearchLondon, United KingdomJuly 2022
 12 miscNonConvex Optimization with Certificates and Fast Rates Through Kernel Sums of Squares.April 2022
11.2 Publications of the year
International journals
 13 articleApproximation Bounds for Sparse Programs.SIAM Journal on Mathematics of Data Science42June 2022, 514530
 14 articleThe reproducing kernel Hilbert spaces underlying linear SDE Estimation, Kalman filtering and their relation to optimal control.Pure and Applied Functional Analysis2022
 15 articleA note on approximate accelerated forwardbackward methods with absolute and relative errors, and possibly strongly convex objectives.Open Journal of Mathematical OptimizationJanuary 2022
 16 articleConvergence of a Constrained Vector Extrapolation Scheme.SIAM Journal on Mathematics of Data ScienceJuly 2022
 17 articleNonlinear Acceleration of Momentum and PrimalDual Algorithms.Mathematical Programming, Series AFebruary 2022, 138
 18 articleOnline hierarchical forecasting for power consumption data.International Journal of Forecasting381January 2022, 339351
 19 articleVectorValued LeastSquares Regression under Output Regularity Assumptions.Journal of Machine Learning Research2022
 20 articleOptimal Complexity and Certification of Bregman FirstOrder Methods.Mathematical Programming1941July 2022, 4183
 21 articleOn the oracle complexity of smooth strongly convex minimization.Journal of ComplexityFebruary 2022
 22 articleRobust Distributed Accelerated Stochastic Gradient Methods for MultiAgent Networks.Journal of Machine Learning Research2022
 23 articleStable Bounds on the Duality Gap of Separable Nonconvex Optimization Problems.Mathematics of Operations ResearchJuly 2022
 24 articleRestarting FrankWolfe.Journal of Optimization Theory and Applications1923March 2022, 799829
 25 articleThe limitedmemory recursive variational Gaussian approximation (LRVGA).Statistics and Computing2022
 26 articleGlobal assessment of oil and gas methane ultraemitters.Science3756580February 2022, 557561
 27 articleAn optimal gradient method for smooth convex minimization.Mathematical Programming, Series A2022
International peerreviewed conferences
 28 inproceedingsOperatorvalued Kernels and Control of Infinite dimensional Dynamic Systems.CDC 2022  61st IEEE Conference on Decision and ControlCancun, Mexico2022
 29 inproceedingsInfiniteDimensional SumsofSquares for Optimal Control.61st IEEE Conference on Decision and ControlCancun, MexicoDecember 2022
 30 inproceedingsOn the Benefits of Large Learning Rates for Kernel Methods.COLT 2022  35th Annual Conference on Learning Theory178Proceedings of Thirty Fifth Conference on Learning TheoryLondon, United KingdomJuly 2022, 254282
 31 inproceedingsActive Labeling: Streaming Stochastic Gradients.NeurIPS 2022  36th Conference on Neural Information Processing SystemsNew Orleans, United StatesNovember 2022
 32 inproceedingsGeneralization Bounds using Lower Tail Exponents in Stochastic Optimizers.ICML 2022  39th International Conference on Machine LearningBaltimore, United States2022
 33 inproceedingsGeneralized Sliced Probability Metrics.ICASSP 2022  2022 IEEE International Conference on Acoustics, Speech and Signal ProcessingSingapore, SingaporeIEEEMay 2022, 45134517
 34 inproceedingsThe continuousdiscrete variational Kalman filter (CDVKF).61st IEEE Conference on Decision and ControlCancun, MexicoDecember 2022
 35 inproceedingsVariational inference via Wasserstein gradient flows.NeurIPS 2022  Thirtysixth Conference on Neural Information Processing SystemsNouvelle Orléans (Louisiane), United StatesNovember 2022
 36 inproceedingsChaotic Regularization and HeavyTailed Limits for Deterministic Gradient Descent.NeurIPS 2022  Thirtysixth Conference on Neural Information Processing SystemsNew Orleans, United StatesNovember 2022
 37 inproceedingsSampling from Arbitrary Functions via PSD Models.AISTATS 2022  25th International Conference on Artificial Intelligence and StatisticsValencia (virtual), SpainMarch 2022
 38 inproceedingsAsynchronous SGD Beats Minibatch SGD Under Arbitrary Delays.NeurIPS 2022  Thirtysixth Conference on Neural Information Processing SystemsAdvances in Neural Information Processing SystemsNew Orleans, United StatesNovember 2022
 39 inproceedingsAnticorrelated Noise Injection for Improved Generalization.ICML 2022  39th International Conference on Machine LearningBaltimore, United StatesJuly 2022

40
inproceedingsGeneralization Bounds for Stochastic Gradient Descent via Localized
$$ Covers.NeurIPS 2022  Thirtysixth Conference on Neural Information Processing SystemsAdvances in Neural Processing SystemsBaltimore, United StatesSeptember 2022  41 inproceedingsNonConvex Optimization with Certificates and Fast Rates Through Kernel Sums of Squares.Conference on Learning TheoryConference on Learning Theory 2022178London, United KingdomPMLRJuly 2022, 46204642
Conferences without proceedings
 42 inproceedingsGradient Descent on Infinitely Wide Neural Networks: Global Convergence and Generalization.International Congress of MathematiciansSaintPetersbourg, Russia2022
 43 inproceedingsPROXQP: Yet another Quadratic Programming Solver for Robotics and beyond.RSS 2022  Robotics: Science and SystemsNew York, United StatesJune 2022
 44 inproceedingsLastIterate Convergence of Optimistic Gradient Method for Monotone Variational Inequalities.NeurIPS 2022  Thirtysixth Conference on Neural Information Processing SystemsAdvances in Neural Information Processing Systems 35New orleans, United StatesNovember 2022
 45 inproceedingsOn the Consistency of MaxMargin Losses.AISTATS 2022  25th International Conference on Artificial Intelligence and StatisticsValència, SpainMarch 2022
 46 inproceedingsRateDistortion Theoretic Generalization Bounds for Stochastic Learning Algorithms.COLT 2022  35th Annual Conference on Learning Theory178Proceedings of Machine Learning ResearchLondon, United KingdomJuly 2022
Doctoral dissertations and habilitation theses
 47 thesisEfficient Algorithms for Control and Reinforcement Learning.Université PSL (Paris Sciences & Lettres)October 2022
 48 thesisFrom weakly supervised learning to active labeling.Ecole Normale Supérieure (ENS)July 2022
 49 thesisModelling functions with kernels: from logistic regression to global optimisation.PSLSeptember 2022
 50 thesisOn Inductive Biases for Machine Learning in Data Constrained Settings.Ecole Normale Supérieure (ENS), Paris, FRA.January 2022
 51 thesisCryptography for PrivacyPreserving Machine Learning.PSLJune 2022
Reports & preprints
 52 miscInformation Theory with Kernel Methods.August 2022
 53 miscOn the relationship between multivariate splines and infinitelywide neural networks.February 2023
 54 miscExponential convergence of sumofsquares hierarchies for trigonometric polynomials.January 2023
 55 miscSumofSquares Relaxations for Information Theory and Variational Inference.August 2022
 56 miscA Nonasymptotic Analysis of Nonparametric TemporalDifference Learning.May 2022
 57 miscAcceleration of Gossip Algorithms through the EulerPoissonDarboux Equation.February 2022
 58 miscMeasuring dissimilarity with diffeomorphism invariance.February 2022
 59 miscDecoding speech from noninvasive brain recordings.October 2022
 60 miscFast Stochastic Composite Minimization and an Accelerated FrankWolfe Algorithm under Parallelization.September 2022
 61 misc Are Largescale Datasets Necessary for SelfSupervised Pretraining? February 2022
 62 miscPEPit: computerassisted worstcase analyses of firstorder optimization methods in Python.September 2022
 63 miscOptimal firstorder methods for convex functions with a quadratic upper bound.September 2022
 64 miscEfficient and NearOptimal Online Portfolio Selection.September 2022
 65 miscOn a VarianceReduction Correction for the TemporalDifference Learning in the Stochastic Continuous Setting.June 2022
 66 miscDifferentiable Collision Detection: a Randomized Smoothing Approach.September 2022
 67 miscA systematic approach to Lyapunov analyses of continuoustime models in convex optimization.May 2022
 68 miscHigh resolution assessment of coal mining methane emissions by satellite in Shanxi, China.November 2022
 69 miscDifferential Privacy Guarantees for Stochastic Gradient Langevin Dynamics.February 2022
 70 miscRegression as Classification: Influence of Task Formulation on Neural Network Features.November 2022
 71 miscTwo Losses Are Better Than One: Faster Optimization Using a Cheaper Proxy.February 2023