Machine learning is a recent scientific domain, positioned between applied mathematics, statistics and computer science. Its goals are the optimization, control, and modeling of complex systems from examples. It applies to data from numerous engineering and scientific fields (e.g., vision, bioinformatics, neuroscience, audio processing, text processing, economics, finance), the ultimate goal being to derive general theories and algorithms allowing advances in each of these domains. Machine learning is characterized by the high quality and quantity of the exchanges between theory, algorithms and applications: interesting theoretical problems almost always emerge from applications, while theoretical analysis allows the understanding of why and when popular or successful algorithms do or do not work, and leads to proposing significant improvements.

Our academic positioning is exactly at the intersection between these three aspects—algorithms, theory and applications—and our main research goal is to make the link between theory and algorithms, and between algorithms and high-impact applications in various engineering and scientific fields, in particular computer vision, bioinformatics, audio processing, text processing and neuro-imaging.

Machine learning is now a vast field of research and the team focuses on the following aspects: supervised learning (kernel methods, calibration), unsupervised learning (matrix factorization, statistical tests), parsimony (structured sparsity, theory and algorithms), and optimization (convex optimization, bandit learning). These four research axes are strongly interdependent, and the interplay between them is key to successful practical applications.

This part of our research focuses on methods where, given a set of examples of input/output pairs, the goal is to predict the output for a new input, with research on kernel methods, calibration methods, and multi-task learning.

We focus here on methods where no output is given and the goal is to find structure of certain known types (e.g., discrete or low-dimensional) in the data, with a focus on matrix factorization, statistical tests, dimension reduction, and semi-supervised learning.

The concept of parsimony is central to many areas of science. In the context of statistical machine learning, this takes the form of variable or feature selection. The team focuses primarily on structured sparsity, with theoretical and algorithmic contributions.

Optimization in all its forms is central to machine learning, as many of its theoretical frameworks are based at least in part on empirical risk minimization. The team focuses primarily on convex and bandit optimization, with a particular focus on large-scale optimization.

Machine learning research can be conducted from two main perspectives: the first one, which has been dominant in the last 30 years, is to design learning algorithms and theories which are as generic as possible, the goal being to make as few assumptions as possible regarding the problems to be solved and to let data speak for themselves. This has led to many interesting methodological developments and successful applications. However, we believe that this strategy has reached its limit for many application domains, such as computer vision, bioinformatics, neuro-imaging, text and audio processing, which leads to the second perspective our team is built on: Research in machine learning theory and algorithms should be driven by interdisciplinary collaborations, so that specific prior knowledge may be properly introduced into the learning process, in particular with the following fields:

Computer vision: object recognition, object detection, image segmentation, image/video processing, computational photography. In collaboration with the Willow project-team.

Bioinformatics: cancer diagnosis, protein function prediction, virtual screening. In collaboration with Institut Curie.

Text processing: document collection modeling, language models.

Audio processing: source separation, speech/music processing.

Neuro-imaging: brain-computer interface (fMRI, EEG, MEG).

Damien Scieur, Prix de thèse PSL-ADELI

Francis Bach, Prix Jean-Jacques Moreau

Keyword: Optimization

Functional Description: A C++/Python code implementing the methods in the paper "Breaking the Nonsmooth Barrier: A Scalable Parallel Method for Composite Optimization", F. Pedregosa, R. Leblond and S. Lacoste-Julien, Advances in Neural Information Processing Systems (NIPS) 2017. Due to their simplicity and excellent performance, parallel asynchronous variants of stochastic gradient descent have become popular methods to solve a wide range of large-scale optimization problems on multi-core architectures. Yet, despite their practical success, support for nonsmooth objectives is still lacking, making them unsuitable for many problems of interest in machine learning, such as the Lasso, group Lasso or empirical risk minimization with convex constraints. In this work, we propose and analyze ProxASAGA, a fully asynchronous sparse method inspired by SAGA, a variance reduced incremental gradient algorithm. The proposed method is easy to implement and significantly outperforms the state of the art on several nonsmooth, large-scale problems. We prove that our method achieves a theoretical linear speedup with respect to the sequential version under assumptions on the sparsity of gradients and block-separability of the proximal term. Empirical benchmarks on a multi-core architecture illustrate practical speedups of up to 12x on a 20-core machine.
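To make the building block behind this description concrete, here is a minimal sketch of a sequential proximal SAGA iteration applied to the Lasso. It is not the released C++/Python code and it is not asynchronous: it only illustrates the variance-reduced gradient estimate followed by a proximal (soft-thresholding) step. The function names, the zero-initialized gradient memory, the 1/(3L) step size and the toy data are assumptions made for the illustration.

```python
import random


def soft_threshold(z, t):
    """Proximal operator of t * |.| applied to a scalar z."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def prox_saga_lasso(A, b, lam, n_iter=2000, seed=0):
    """Sequential proximal SAGA (illustrative) for
    min_x (1/n) * sum_i 0.5 * (a_i . x - b_i)^2 + lam * ||x||_1.
    """
    rng = random.Random(seed)
    n, d = len(A), len(A[0])
    # Assumed step size 1 / (3 L), L being the largest per-sample smoothness constant.
    L = max(sum(a_j * a_j for a_j in row) for row in A)
    step = 1.0 / (3.0 * L)
    x = [0.0] * d
    memory = [[0.0] * d for _ in range(n)]  # last gradient stored for each sample
    avg = [0.0] * d                         # running average of the stored gradients
    for _ in range(n_iter):
        i = rng.randrange(n)
        residual = sum(a_j * x_j for a_j, x_j in zip(A[i], x)) - b[i]
        grad = [a_j * residual for a_j in A[i]]
        # Variance-reduced gradient estimate, followed by the proximal step.
        v = [g - m + s for g, m, s in zip(grad, memory[i], avg)]
        x = [soft_threshold(x_j - step * v_j, step * lam) for x_j, v_j in zip(x, v)]
        # Refresh the running average and the memory slot of sample i.
        avg = [s + (g - m) / n for s, g, m in zip(avg, grad, memory[i])]
        memory[i] = grad
    return x


# Toy problem whose minimizer can be computed by hand: x = (1.5, 0).
A = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
b = [2.0, 2.0, 0.1, 0.1]
x_hat = prox_saga_lasso(A, b, lam=0.25)
```

On this toy problem the objective separates into two scalar problems, so the first coordinate can be checked against 1.5 and the second is thresholded to zero by the l1 penalty.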

Contact: Fabian Pedregosa

Keyword: Computer vision

Functional Description: Code for the paper Joint Discovery of Object States and Manipulation Actions, ICCV 2017: Many human activities involve object manipulations aiming to modify the object state. Examples of common state changes include full/empty bottle, open/closed door, and attached/detached car wheel. In this work, we seek to automatically discover the states of objects and the associated manipulation actions. Given a set of videos for a particular task, we propose a joint model that learns to identify object states and to localize state-modifying actions. Our model is formulated as a discriminative clustering cost with constraints. We assume a consistent temporal order for the changes in object states and manipulation actions, and introduce new optimization techniques to learn model parameters without additional supervision. We demonstrate successful discovery of seven manipulation actions and corresponding object states on a new dataset of videos depicting real-life object manipulations. We show that our joint formulation results in an improvement of object state discovery by action recognition and vice versa.

Participants: Jean-Baptiste Alayrac, Josef Sivic, Ivan Laptev and Simon Lacoste-Julien

Contact: Jean-Baptiste Alayrac

Publication: Joint Discovery of Object States and Manipulation Actions

We describe fast gradient methods for solving the symmetric nonnegative matrix factorization problem (SymNMF). We use recent results on non-Euclidean gradient methods and show that the SymNMF problem is smooth relative to a well-chosen Bregman divergence. This approach provides a simple hyperparameter-free method which comes with theoretical convergence guarantees. We also discuss accelerated variants. Numerical experiments on clustering problems show that our algorithm scales well and reaches both state of the art convergence speed and clustering accuracy for SymNMF methods.

Due to its linear complexity, naive Bayes classification remains an attractive supervised learning method, especially in very large-scale settings. We propose a sparse version of naive Bayes, which can be used for feature selection. This leads to a combinatorial maximum-likelihood problem, for which we provide an exact solution in the case of binary data, or a bound in the multinomial case. We prove that our bound becomes tight as the marginal contribution of additional features decreases. Both binary and multinomial sparse models are solvable in time almost linear in problem size, representing a very small extra relative cost compared to the classical naive Bayes. Numerical experiments on text data show that the naive Bayes feature selection method is as statistically effective as state-of-the-art feature selection methods such as recursive feature elimination, l1-penalized logistic regression and LASSO, while being orders of magnitude faster. For a large data set with more than 1.6 million training points and about 12 million features, and with a non-optimized CPU implementation, our sparse naive Bayes model can be trained in less than 15 seconds.

The problem of estimating Wasserstein distances in high-dimensional spaces suffers from the curse of dimensionality: one needs a number of samples that is exponential in the dimension for the distance between the two samples to be comparable to that between the two measures. Therefore, regularizing the optimal transport (OT) problem is crucial when using Wasserstein distances in machine learning. One of the greatest achievements of the OT literature in recent years lies in regularity theory: one can prove under suitable hypotheses that the OT map between two measures is Lipschitz, or, equivalently when studying 2-Wasserstein distances, that the Brenier convex potential (whose gradient yields an optimal map) is a smooth function. We propose in this work to go backwards, and to adopt regularity as a regularization tool instead. We propose algorithms working on discrete measures that can recover nearly optimal transport maps that have small distortion, or, equivalently, nearly optimal Brenier potentials that are strongly convex and smooth. For univariate measures, we show that computing these potentials is equivalent to solving an isotonic regression problem under Lipschitz and strong monotonicity constraints. For multivariate measures, the problem boils down to a non-convex QCQP. We show that this QCQP can be lifted to a semidefinite program. Most importantly, these potentials and their gradients can be evaluated on the measures themselves, but can more generally be evaluated on any new point by solving a QP each time. Building on these two formulations, we propose practical algorithms to estimate and evaluate transport maps, and illustrate their performance statistically as well as visually on a color transfer task.

Given a measurement graph

Accelerated algorithms for minimizing smooth strongly convex functions usually require knowledge of the strong convexity parameter mu. In the case of an unknown mu, current adaptive techniques are based on restart schemes. When the optimal value

Modern large-scale finite-sum optimization relies on two key aspects: distribution and stochastic updates. For smooth and strongly convex problems, existing decentralized algorithms are slower than modern accelerated variance-reduced stochastic algorithms when run on a single machine, and are therefore not efficient. Centralized algorithms are fast, but their scaling is limited by global aggregation steps that result in communication bottlenecks. In this work, we propose an efficient Accelerated Decentralized stochastic algorithm for Finite Sums named ADFS, which uses local stochastic proximal updates and randomized pairwise communications between nodes. On n machines, ADFS learns from

In a series of recent theoretical works, it was shown that strongly overparameterized neural networks trained with gradient-based methods could converge exponentially fast to zero training loss, with their parameters hardly varying. In this work, we show that this “lazy training” phenomenon is not specific to overparameterized neural networks, and is due to a choice of scaling, often implicit, that makes the model behave as its linearization around the initialization, thus yielding a model equivalent to learning with positive-definite kernels. Through a theoretical analysis, we exhibit various situations where this phenomenon arises in non-convex optimization and we provide bounds on the distance between the lazy and linearized optimization paths. Our numerical experiments bring a critical note, as we observe that the performance of commonly used non-linear deep convolutional neural networks in computer vision degrades when trained in the lazy regime. This makes it unlikely that “lazy training” is behind the many successes of neural networks in difficult high dimensional tasks.

When optimizing over-parameterized models, such as deep neural networks, a large set of parameters can achieve zero training error. In such cases, the choice of the optimization algorithm and its respective hyper-parameters introduces biases that will lead to convergence to specific minimizers of the objective. Consequently, this choice can be considered as an implicit regularization for the training of over-parameterized models. In this work, we push this idea further by studying the discrete gradient dynamics of the training of a two-layer linear network with the least-squares loss. Using a time rescaling, we show that, with a vanishing initialization and a small enough step size, this dynamics sequentially learns the solutions of a reduced-rank regression with a gradually increasing rank.

We develop efficient algorithms to train

Modern machine learning focuses on highly expressive models that are able to fit or interpolate the data completely, resulting in zero training loss. For such models, we show that the stochastic gradients of common loss functions satisfy a strong growth condition. Under this condition, we prove that constant step-size stochastic gradient descent (SGD) with Nesterov acceleration matches the convergence rate of the deterministic accelerated method for both convex and strongly-convex functions. We also show that this condition implies that SGD can find a first-order stationary point as efficiently as full gradient descent in non-convex settings. Under interpolation, we further show that all smooth loss functions with a finite-sum structure satisfy a weaker growth condition. Given this weaker condition, we prove that SGD with a constant step-size attains the deterministic convergence rate in both the strongly-convex and convex settings. Under additional assumptions, the above results enable us to prove an
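The interpolation regime described above can be illustrated on a small scale. The sketch below runs plain SGD (without Nesterov acceleration) with a constant step size on a consistent least-squares problem, where a single parameter vector fits every sample exactly. The function names, the step size 1 / L_max and the toy data are assumptions chosen for the illustration; this is not the method analyzed in the paper.

```python
import random


def mean_squared_loss(A, b, x):
    """Average of 0.5 * (a_i . x - b_i)^2 over all samples."""
    total = 0.0
    for row, target in zip(A, b):
        residual = sum(a_j * x_j for a_j, x_j in zip(row, x)) - target
        total += 0.5 * residual * residual
    return total / len(A)


def sgd_constant_step(A, b, n_iter=1000, seed=0):
    """Constant step-size SGD on a finite-sum least-squares objective."""
    rng = random.Random(seed)
    n, d = len(A), len(A[0])
    # Assumed step size: inverse of the largest per-sample smoothness constant.
    L_max = max(sum(a_j * a_j for a_j in row) for row in A)
    step = 1.0 / L_max
    x = [0.0] * d
    for _ in range(n_iter):
        i = rng.randrange(n)
        residual = sum(a_j * x_j for a_j, x_j in zip(A[i], x)) - b[i]
        # Stochastic gradient of the i-th term is residual * a_i.
        x = [x_j - step * residual * a_j for x_j, a_j in zip(x, A[i])]
    return x


# Consistent system: b = A x_star, so x_star interpolates all samples.
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
x_star = [1.0, -2.0]
b = [1.0, -2.0, -1.0]
x_sgd = sgd_constant_step(A, b)
```

Because every stochastic gradient vanishes at the interpolating solution, the constant step size does not need to decay for the iterates to converge, which is the mechanism the growth conditions formalize.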

In this project, we study large-scale convex optimization algorithms based on the Newton method applied to regularized generalized self-concordant losses, which include logistic regression and softmax regression. We first prove that our new simple scheme based on a sequence of problems with decreasing regularization parameters is globally convergent, and that this convergence is linear with a constant factor which scales only logarithmically with the condition number. In the parametric setting, we obtain an algorithm with the same scaling as regular first-order methods but with an improved behavior, in particular in ill-conditioned problems. Second, in the nonparametric machine learning setting, we provide an explicit algorithm combining the previous scheme with Nyström projection techniques, and prove that it achieves optimal generalization bounds with a time complexity of order

We are interested in a framework of online learning with kernels for low-dimensional but large-scale and potentially adversarial datasets. We study the computational and theoretical performance of online variations of kernel Ridge regression. Despite its simplicity, the algorithm we study is the first to achieve the optimal regret for a wide range of kernels with a per-round complexity of order

In this work, we provide an estimator for the covariance matrix of a heavy-tailed multivariate distribution. We prove that the proposed estimator admits an *affine-invariant* bound of the form

in high probability, where S is the unknown covariance matrix, and

We consider learning methods based on the regularization of a convex empirical risk by a squared Hilbertian norm, a setting that includes linear predictors and non-linear predictors through positive-definite kernels. In order to go beyond the generic analysis leading to convergence rates of the excess risk as

Poincaré inequalities are ubiquitous in probability and analysis and have various applications in statistics (concentration of measure, rate of convergence of Markov chains). The Poincaré constant, for which the inequality is tight, is related to the typical convergence rate of diffusions to their equilibrium measure. In this paper, we show both theoretically and experimentally that, given sufficiently many samples of a measure, we can estimate its Poincaré constant. As a by-product of the estimation of the Poincaré constant, we derive an algorithm that captures a low-dimensional representation of the data by finding directions which are difficult to sample. These directions are of crucial importance for sampling or in fields like molecular dynamics, where they are called reaction coordinates. Knowing them can help overcome computational bottlenecks, with a simple conditioning step, by using importance sampling techniques.

We provide a novel computer-assisted technique for systematically analyzing first-order methods for optimization. In contrast with previous works, the approach is particularly suited for handling sublinear convergence rates and stochastic oracles. The technique relies on semidefinite programming and potential functions. It allows simultaneously obtaining worst-case guarantees on the behavior of those algorithms, and assisting in choosing appropriate parameters for tuning their worst-case performances. The technique also benefits from comfortable tightness guarantees, meaning that unsatisfactory results can be improved only by changing the setting. We use the approach for analyzing deterministic and stochastic first-order methods under different assumptions on the nature of the stochastic noise. Among others, we treat unstructured noise with bounded variance, different noise models arising in over-parametrized expectation minimization problems, and randomized block-coordinate descent schemes.

We provide a lower bound showing that the O(1/k) convergence rate of the NoLips method (a.k.a. Bregman Gradient) is optimal for the class of functions satisfying the h-smoothness assumption. This assumption, also known as relative smoothness, appeared in the recent developments around the Bregman Gradient method, where acceleration remained an open issue. On the way, we show how to constructively obtain the corresponding worst-case functions by extending the computer-assisted performance estimation framework of Drori and Teboulle (Mathematical Programming, 2014) to Bregman first-order methods, and to handle the classes of differentiable and strictly convex functions.
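For readers unfamiliar with the Bregman Gradient (NoLips) iteration, the sketch below shows one concrete instance. Everything specific in it is an assumption chosen for illustration and does not come from the text above: the kernel is Burg's entropy h(x) = -sum_j log(x_j), the objective is a Poisson-type (Kullback-Leibler) data-fit term, the relative smoothness constant is taken as L = sum(b), a known result for this pair when A and b have nonnegative entries, and the step size is 1/(2L). The update solves grad h(x_new) = grad h(x) - step * grad f(x) in closed form.

```python
import math


def poisson_objective(A, b, x):
    """f(x) = sum_i [ (Ax)_i - b_i * log((Ax)_i) ], defined for Ax > 0."""
    total = 0.0
    for row, b_i in zip(A, b):
        ax_i = sum(a_ij * x_j for a_ij, x_j in zip(row, x))
        total += ax_i - b_i * math.log(ax_i)
    return total


def bregman_gradient_burg(A, b, x0, n_iter=200):
    """Bregman gradient steps with Burg's entropy as the kernel (illustrative)."""
    m, d = len(A), len(x0)
    L = sum(b)            # assumed relative smoothness constant
    step = 1.0 / (2.0 * L)
    x = list(x0)
    for _ in range(n_iter):
        Ax = [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]
        grad = [sum(A[i][j] * (1.0 - b[i] / Ax[i]) for i in range(m))
                for j in range(d)]
        # Mirror step: -1/x_new_j = -1/x_j - step * grad_j,
        # which gives x_new_j = x_j / (1 + step * x_j * grad_j).
        x = [x_j / (1.0 + step * x_j * g_j) for x_j, g_j in zip(x, grad)]
    return x


# With A equal to the identity, the minimizer is x = b.
A = [[1.0, 0.0], [0.0, 1.0]]
b = [2.0, 3.0]
x_breg = bregman_gradient_burg(A, b, [1.0, 1.0])
```

The iterates stay strictly positive without any projection, since the kernel acts as a barrier for the nonnegativity constraint; this is the practical appeal of choosing h to match the geometry of f.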

We describe a novel constructive technique for devising efficient first-order methods for a wide range of large-scale convex minimization settings, including smooth, non-smooth, and strongly convex minimization. The technique builds upon a certain variant of the conjugate gradient method to construct a family of methods such that a) all methods in the family share the same worst-case guarantee as the base conjugate gradient method, and b) the family includes a fixed-step first-order method. We demonstrate the effectiveness of the approach by deriving optimal methods for the smooth and non-smooth cases, including new methods that forego knowledge of the problem parameters at the cost of a one-dimensional line search per iteration, and a universal method for the union of these classes that requires a three-dimensional search per iteration. In the strongly convex case, we show how numerical tools can be used to perform the construction, and show that the resulting method offers an improved worst-case bound compared to Nesterov's celebrated fast gradient method.

Microsoft Research: “Structured Large-Scale Machine Learning”. Machine learning is now ubiquitous in industry, science, engineering, and personal life. While early successes were obtained by applying off-the-shelf techniques, there are two main challenges faced by machine learning in the “big data” era: structure and scale. The project proposes to explore three axes, from theoretical, algorithmic and practical perspectives: (1) large-scale convex optimization, (2) large-scale combinatorial optimization and (3) sequential decision making for structured data. The project involves two Inria sites (Paris and Grenoble) and four MSR sites (Cambridge, New England, Redmond, New York). Project website: http://www.msr-inria.fr/projects/structured-large-scale-machine-learning/.

Alexandre d’Aspremont, Francis Bach, Martin Jaggi (EPFL): Google Focused award.

Francis Bach: Gift from Facebook AI Research.

Alexandre d’Aspremont: fondation AXA, "Mécénat scientifique", optimisation & machine learning.

Alexandre d'Aspremont: IRIS, PSL “Science des données, données de la science”.

ERC Sequoia Title: Robust algorithms for learning from modern data

Program: H2020

Type: ERC

Duration: 2017-2022

Coordinator: Inria

Inria contact: Francis Bach

Abstract: Machine learning is needed and used everywhere, from science to industry, with a growing impact on many disciplines. While first successes were due at least in part to simple supervised learning algorithms used primarily as black boxes on medium-scale problems, modern data pose new challenges. Scalability is an important issue of course: with large amounts of data, many current problems far exceed the capabilities of existing algorithms despite sophisticated computing architectures. But beyond this, the core classical model of supervised machine learning, with the usual assumptions of independent and identically distributed data, or well-defined features, outputs and loss functions, has reached its theoretical and practical limits. Given this new setting, existing optimization-based algorithms are not adapted. The main objective of this project is to push the frontiers of supervised machine learning, in terms of (a) scalability to data with massive numbers of observations, features, and tasks, (b) adaptability to modern computing environments, in particular for parallel and distributed processing, (c) provable adaptivity and robustness to problem and hardware specifications, and (d) robustness to non-convexities inherent in machine learning problems. To achieve the expected breakthroughs, we will design a novel generation of learning algorithms amenable to a tight convergence analysis with realistic assumptions and efficient implementations. They will help transition machine learning algorithms towards the same widespread robust use as numerical linear algebra libraries. Outcomes of the research described in this proposal will include algorithms that come with strong convergence guarantees and are well-tested on real-life benchmarks coming from computer vision, bioinformatics, audio processing and natural language processing. For both distributed and non-distributed settings, we will release open-source software, adapted to widely available computing platforms.

Sebastian Pokutta from TU & Zuse Institute, Berlin, December 2019.

Cristóbal Guzmán from Universidad Católica de Chile, July 2019.

Quentin Berthet from University of Cambridge, from Feb 2019 until Apr 2019.

Eduard Gorbunov from Moscow Institute of Physics and Technology, Oct 2019.

Song Mei, from Stanford University, from Sep 2019 until Oct 2019.

Anant Raj, from M.P.I. Tubingen, from Oct 2019.

Aadirupa Saha, from Indian Institute of Technology, Bangalore, from Nov 2019

Adrien Taylor, ICCOPT session organizer: *Performance Estimation of First-Order Methods* (with F. Glineur), *Splitting Methods and Applications (Part I)* (with P. Giselsson and E. Ryu), *Splitting Methods and Applications (Part III)* (with P. Giselsson and E. Ryu).

Alexandre d'Aspremont, co-organizer, Les Houches Workshop on Optimization and Machine Learning, March 2019.

Senior Area Chair, NeurIPS conference 2019 (Francis Bach).

Most of the team referees for the major machine learning conferences such as NIPS, AISTATS, ICML.

Alexandre d’Aspremont: SIAM Journal on Optimization, Associate Editor

Alexandre d’Aspremont: SIAM Journal on the Mathematics of Data Science, Associate Editor

Alexandre d’Aspremont: Mathematical Programming B, Associate Editor

Alexandre d’Aspremont: Mathematics of Operations Research, Associate Editor

Francis Bach: Journal of Machine Learning Research, co-editor-in-chief

Francis Bach: Information and Inference, Associate Editor

Francis Bach: Electronic Journal of Statistics, Associate Editor.

Francis Bach: Mathematical Programming, Associate Editor

Francis Bach: Foundations of Computational Mathematics, Associate Editor

Adrien Taylor, reviewer for Automatica.

Adrien Taylor, reviewer for SIAM Journal on Numerical Analysis (SINUM).

Adrien Taylor, reviewer for SIAM Journal on Optimization (SIOPT).

Adrien Taylor, reviewer for Mathematical Programming (MPA).

Adrien Taylor, reviewer for International Conference on Machine Learning 2019 (ICML19).

Alexandre d'Aspremont, ICCOPT 2019, Berlin, August 2019.

Alexandre d'Aspremont, CIMI Workshop, Toulouse, October 2019.

Alexandre d'Aspremont, France-Germany-Swiss Optimization conference, Nice, September 2019.

Alexandre d'Aspremont, BIRS Workshop, Oaxaca, Oct. 2019.

Alexandre d'Aspremont, SAMPTA, Bordeaux, July 2019.

Francis Bach, Optimization workshop, Les Houches, March 2019.

Francis Bach, Oberwolfach, Germany, May 2019.

Francis Bach, AI Global summit, Geneva, Switzerland, May 2019.

Francis Bach, ETH Data science seminar, Zurich, Switzerland, June 2019.

Francis Bach, ETH Imaging workshop, Zurich, Switzerland, June 2019.

Francis Bach, ICIAM invited session, Valencia, Spain, July 2019.

Francis Bach, Workshop on covariance operators, Berlin, Germany, September 2019.

Francis Bach, GAMM Workshop COMinDS2019, Berlin, Germany, October 2019.

Francis Bach, DIMS workshop, Leipzig, Germany, November 2019.

Francis Bach, Conference on Decision and Control, Nice, December 2019.

Adrien Taylor, SPOT optimization seminar, Toulouse, February 2019.

Adrien Taylor, Optimization workshop, Les Houches, March 2019.

Adrien Taylor, CWI Network & Optimization Seminar, Amsterdam, April 2019.

Adrien Taylor, Summer school on Optimization (MIPT & HSE), Moscow, June 2019.

Adrien Taylor, ICCOPT 2019, Berlin, August 2019.

Alessandro Rudi, *Recent developments on kernel methods*, RKM 2019 (Sept. 2019, London, UK)

Alessandro Rudi, *Data, Learning and Inference meeting*, DALI 2019 (Sept. 2019, San Sebastian, Spain)

Alessandro Rudi, *32nd European Meeting of Statisticians*, EMS 2019 (July 2019, Palermo, Italy)

Alessandro Rudi, *Applied Inverse Problems Conference*, AIP 2019 (July 2019, Grenoble, France)

Alessandro Rudi, *Imaging and Machine Learning Conference* (April 2019, Paris, France)

Alessandro Rudi, *Seminar on the Mathematics of Imaging* (March 2019, Paris, France)

Alexandre d'Aspremont. Scientific advisory board, Vivienne Investissement.

Alexandre d'Aspremont. Scientific lead, IRIS PSL, "Sciences des données, données de la science".

Francis Bach, Deputy Scientific Delegate for Inria Paris research center, member of the Inria Evaluation Committee.

Master: Alexandre d'Aspremont, Optimisation Combinatoire et Convexe, with Zhentao Li (2015-Present), lectures, 30h, Master M1, ENS Paris.

Master: Alexandre d'Aspremont, Optimisation convexe: modélisation, algorithmes et applications, lectures, 21h (2011-Present), Master M2 MVA, ENS PS.

Summer school: Francis Bach, optimization for machine learning, 12 hours, AIMS Master, Kigali, Rwanda.

Summer school: Francis Bach, optimization for machine learning, 4 hours, D3S summer school, Ecole Polytechnique.

Master: Francis Bach, Optimisation et apprentissage statistique, 20h, Master M2 (Mathématiques de l'aléatoire), Université Paris-Sud, France.

Master: Pierre Gaillard, Alessandro Rudi, Introduction to Machine Learning, 52h, L3, ENS, Paris.

PhD in progress: Thomas Kerdreux, New Complexity Bounds for Frank Wolfe, 2017, Alexandre d'Aspremont

PhD in progress: Radu-Alexandru Dragomir, Bregman Gradient Methods, 2018, Alexandre d'Aspremont

PhD in progress: Mathieu Barré, Accelerated Polyak Methods, 2018, Alexandre d'Aspremont

PhD in progress: Grégoire Mialon, Sample Selection Methods, 2018, Alexandre d'Aspremont

PhD in progress: Raphaël Berthier, started September 2017, supervised by Francis Bach and Pierre Gaillard.

PhD in progress: Loucas Pillaud-Vivien, supervised by Francis Bach and Alessandro Rudi.

PhD in progress: Alexandre Défossez, supervised by Francis Bach and Léon Bottou (Facebook AI Research).

PhD in progress: Alex Nowak-Vila, supervised by Francis Bach and Alessandro Rudi.

PhD in progress: Ulysse Marteau Ferey, supervised by Francis Bach and Alessandro Rudi.

PhD in progress: Vivien Cabannes, supervised by Francis Bach and Alessandro Rudi.

PhD in progress: Eloise Berthier, supervised by Francis Bach.

PhD in progress: Theo Ryffel, supervised by Francis Bach and David Pointcheval.

PhD in progress: Margaux Brégère, supervised by Pierre Gaillard and Gilles Stoltz (Université Paris-Sud).

PhD in progress: Rémi Jezequel, supervised by Pierre Gaillard and Alessandro Rudi.

PhD defended: Dmitry Babichev, co-advised by Francis Bach and Anatoli Juditsky, defended February 22, 2019

PhD defended: Tatiana Shpakova, advised by Francis Bach, defended February 21, 2019

HDR Pierre Weiss, IMT Toulouse, September 2019 (Alexandre d'Aspremont).

HDR Rémi Flamary, Université de Nice, November 2019 (Francis Bach).

Participation in "Fête de la Science" (with the Apprenti Illustrateur).