<?xml version="1.0" encoding="utf-8"?>
<raweb xmlns:xlink="http://www.w3.org/1999/xlink" xml:lang="en" year="2018">
  <identification id="sierra" isproject="true">
    <shortname>SIERRA</shortname>
    <projectName>Statistical Machine Learning and Parsimony</projectName>
    <theme-de-recherche>Optimization, machine learning and statistical methods</theme-de-recherche>
    <domaine-de-recherche>Applied Mathematics, Computation and Simulation</domaine-de-recherche>
    <urlTeam>http://www.di.ens.fr/sierra/</urlTeam>
    <structure_exterieure type="Labs">
      <libelle>Département d'Informatique de l'Ecole Normale Supérieure</libelle>
    </structure_exterieure>
    <structure_exterieure type="Organism">
      <libelle>CNRS</libelle>
    </structure_exterieure>
    <structure_exterieure type="Organism">
      <libelle>Ecole normale supérieure de Paris</libelle>
    </structure_exterieure>
    <header_dates_team>Creation of the Team: 2011 January 01, updated into Project-Team: 2012 January 01</header_dates_team>
    <LeTypeProjet>Project-Team</LeTypeProjet>
    <keywordsSdN>
      <term>A1.2.8. - Network security</term>
      <term>A3.4. - Machine learning and statistics</term>
      <term>A5.4. - Computer vision</term>
      <term>A6.2. - Scientific computing, Numerical Analysis &amp; Optimization</term>
      <term>A7.1. - Algorithms</term>
      <term>A8.2. - Optimization</term>
      <term>A9.2. - Machine learning</term>
    </keywordsSdN>
    <keywordsSecteurs>
      <term>B9.5.6. - Data science</term>
    </keywordsSecteurs>
    <UR name="Paris"/>
  </identification>
  <team id="uid1">
    <person key="sierra-2018-idp112912">
      <firstname>Francis</firstname>
      <lastname>Bach</lastname>
      <categoryPro>Chercheur</categoryPro>
      <research-centre>Paris</research-centre>
      <moreinfo>Team leader, Inria, Senior Researcher</moreinfo>
      <hdr>oui</hdr>
    </person>
    <person key="sierra-2018-idp115824">
      <firstname>Alexandre</firstname>
      <lastname>d'Aspremont</lastname>
      <categoryPro>Chercheur</categoryPro>
      <research-centre>Paris</research-centre>
      <moreinfo>CNRS, Senior Researcher</moreinfo>
      <hdr>oui</hdr>
    </person>
    <person key="sierra-2018-idp118672">
      <firstname>Pierre</firstname>
      <lastname>Gaillard</lastname>
      <categoryPro>Chercheur</categoryPro>
      <research-centre>Paris</research-centre>
      <moreinfo>Inria, Researcher</moreinfo>
    </person>
    <person key="sierra-2018-idp121136">
      <firstname>Remi</firstname>
      <lastname>Leblond</lastname>
      <categoryPro>PhD</categoryPro>
      <research-centre>Paris</research-centre>
      <moreinfo>Inria, Researcher, until Aug 2018</moreinfo>
    </person>
    <person key="sierra-2018-idp123584">
      <firstname>Alessandro</firstname>
      <lastname>Rudi</lastname>
      <categoryPro>Chercheur</categoryPro>
      <research-centre>Paris</research-centre>
      <moreinfo>Inria, Starting Research Position</moreinfo>
    </person>
    <person key="sierra-2018-idp126064">
      <firstname>Lenaic</firstname>
      <lastname>Chizat</lastname>
      <categoryPro>PostDoc</categoryPro>
      <research-centre>Paris</research-centre>
      <moreinfo>Inria, until Nov 2018</moreinfo>
    </person>
    <person key="sierra-2018-idp128528">
      <firstname>Pierre Yves</firstname>
      <lastname>Massé</lastname>
      <categoryPro>PostDoc</categoryPro>
      <research-centre>Paris</research-centre>
      <moreinfo>Université Technique de Prague, from Apr 2018</moreinfo>
    </person>
    <person key="sierra-2018-idp130992">
      <firstname>Dmitrii</firstname>
      <lastname>Ostrovskii</lastname>
      <categoryPro>PostDoc</categoryPro>
      <research-centre>Paris</research-centre>
      <moreinfo>Inria, from Feb 2018</moreinfo>
    </person>
    <person key="sierra-2018-idp133456">
      <firstname>Adrien</firstname>
      <lastname>Taylor</lastname>
      <categoryPro>PostDoc</categoryPro>
      <research-centre>Paris</research-centre>
      <moreinfo>Inria</moreinfo>
    </person>
    <person key="sierra-2018-idp135920">
      <firstname>Dmitry</firstname>
      <lastname>Babichev</lastname>
      <categoryPro>PhD</categoryPro>
      <research-centre>Paris</research-centre>
      <moreinfo>Inria</moreinfo>
    </person>
    <person key="sierra-2018-idp138352">
      <firstname>Mathieu</firstname>
      <lastname>Barré</lastname>
      <categoryPro>PhD</categoryPro>
      <research-centre>Paris</research-centre>
      <moreinfo>Ecole Normale Supérieure Paris, from Sep 2018</moreinfo>
    </person>
    <person key="sierra-2018-idp140864">
      <firstname>Raphaël</firstname>
      <lastname>Berthier</lastname>
      <categoryPro>PhD</categoryPro>
      <research-centre>Paris</research-centre>
      <moreinfo>Inria, from Oct 2018</moreinfo>
    </person>
    <person key="sierra-2018-idp143296">
      <firstname>Anaël</firstname>
      <lastname>Bonneton</lastname>
      <categoryPro>PhD</categoryPro>
      <research-centre>Paris</research-centre>
      <moreinfo>Ecole Normale Supérieure Paris</moreinfo>
    </person>
    <person key="sierra-2018-idp145776">
      <firstname>Margaux</firstname>
      <lastname>Brégère</lastname>
      <categoryPro>PhD</categoryPro>
      <research-centre>Paris</research-centre>
      <moreinfo>EDF</moreinfo>
    </person>
    <person key="sierra-2018-idp148176">
      <firstname>Alexandre</firstname>
      <lastname>Défossez</lastname>
      <categoryPro>PhD</categoryPro>
      <research-centre>Paris</research-centre>
      <moreinfo>Facebook</moreinfo>
    </person>
    <person key="sierra-2018-idp150608">
      <firstname>Radu Alexandru</firstname>
      <lastname>Dragomir</lastname>
      <categoryPro>PhD</categoryPro>
      <research-centre>Paris</research-centre>
      <moreinfo>Ecole polytechnique, from Sep 2018</moreinfo>
    </person>
    <person key="sierra-2018-idp153056">
      <firstname>Thomas</firstname>
      <lastname>Kerdreux</lastname>
      <categoryPro>PhD</categoryPro>
      <research-centre>Paris</research-centre>
      <moreinfo>Ecole polytechnique</moreinfo>
    </person>
    <person key="thoth-2018-idp164736">
      <firstname>Gregoire</firstname>
      <lastname>Mialon</lastname>
      <categoryPro>PhD</categoryPro>
      <research-centre>Paris</research-centre>
      <moreinfo>Inria, from Oct 2018</moreinfo>
    </person>
    <person key="sierra-2018-idp157920">
      <firstname>Loucas</firstname>
      <lastname>Pillaud Vivien</lastname>
      <categoryPro>PhD</categoryPro>
      <research-centre>Paris</research-centre>
      <moreinfo>Ministère de l'Ecologie, de l'Energie, du Développement durable et de la Mer</moreinfo>
    </person>
    <person key="sierra-2018-idp160496">
      <firstname>Antoine</firstname>
      <lastname>Recanati</lastname>
      <categoryPro>PhD</categoryPro>
      <research-centre>Paris</research-centre>
      <moreinfo>CNRS, until Sep 2018</moreinfo>
    </person>
    <person key="sierra-2018-idp162928">
      <firstname>Damien</firstname>
      <lastname>Scieur</lastname>
      <categoryPro>PhD</categoryPro>
      <research-centre>Paris</research-centre>
      <moreinfo>Inria, until Aug 2018</moreinfo>
    </person>
    <person key="sierra-2018-idp165360">
      <firstname>Tatiana</firstname>
      <lastname>Shpakova</lastname>
      <categoryPro>PhD</categoryPro>
      <research-centre>Paris</research-centre>
      <moreinfo>Inria</moreinfo>
    </person>
    <person key="sierra-2018-idp167792">
      <firstname>Loïc</firstname>
      <lastname>Estève</lastname>
      <categoryPro>Technique</categoryPro>
      <research-centre>Paris</research-centre>
      <moreinfo>Inria, from Apr 2018</moreinfo>
    </person>
    <person key="dyogene-2018-idp171968">
      <firstname>Hadrien</firstname>
      <lastname>Hendrikx</lastname>
      <categoryPro>Technique</categoryPro>
      <research-centre>Paris</research-centre>
      <moreinfo>Inria, from Apr 2018 until Sep 2018</moreinfo>
    </person>
    <person key="sierra-2018-idp138352">
      <firstname>Mathieu</firstname>
      <lastname>Barré</lastname>
      <categoryPro>Stagiaire</categoryPro>
      <research-centre>Paris</research-centre>
      <moreinfo>Inria, from Apr 2018 until Sep 2018</moreinfo>
    </person>
    <person key="sierra-2018-idp140864">
      <firstname>Raphaël</firstname>
      <lastname>Berthier</lastname>
      <categoryPro>Stagiaire</categoryPro>
      <research-centre>Paris</research-centre>
      <moreinfo>Ecole Normale Supérieure Paris, until Sep 2018</moreinfo>
    </person>
    <person key="sierra-2018-idp177760">
      <firstname>Vivien</firstname>
      <lastname>Cabannes</lastname>
      <categoryPro>Stagiaire</categoryPro>
      <research-centre>Paris</research-centre>
      <moreinfo>Univ Vincennes-Saint Denis, from Sep 2018</moreinfo>
    </person>
    <person key="sierra-2018-idp180256">
      <firstname>Florentin</firstname>
      <lastname>Guth</lastname>
      <categoryPro>Stagiaire</categoryPro>
      <research-centre>Paris</research-centre>
      <moreinfo>Ecole Normale Supérieure Paris, from Feb 2018 until Mar 2018</moreinfo>
    </person>
    <person key="sierra-2018-idp182832">
      <firstname>Remi</firstname>
      <lastname>Jézequel</lastname>
      <categoryPro>Stagiaire</categoryPro>
      <research-centre>Paris</research-centre>
      <moreinfo>Ecole Normale Supérieure Paris, from Oct 2018</moreinfo>
    </person>
    <person key="sierra-2018-idp185376">
      <firstname>Ulysse</firstname>
      <lastname>Marteau Ferey</lastname>
      <categoryPro>Stagiaire</categoryPro>
      <research-centre>Paris</research-centre>
      <moreinfo>Ecole Normale Supérieure Paris, from Apr 2018</moreinfo>
    </person>
    <person key="dyogene-2018-idp204064">
      <firstname>Helene</firstname>
      <lastname>Bessin Rousseau</lastname>
      <categoryPro>Assistant</categoryPro>
      <research-centre>Paris</research-centre>
      <moreinfo>Inria, from Mar 2018</moreinfo>
    </person>
    <person key="willow-2018-idp181920">
      <firstname>Sabrine</firstname>
      <lastname>Boumizy</lastname>
      <categoryPro>Assistant</categoryPro>
      <research-centre>Paris</research-centre>
      <moreinfo>Inria, until Feb 2018</moreinfo>
    </person>
    <person key="valda-2018-idp164000">
      <firstname>Sandrine</firstname>
      <lastname>Verges</lastname>
      <categoryPro>Assistant</categoryPro>
      <research-centre>Paris</research-centre>
      <moreinfo>Inria, until Jan 2018</moreinfo>
    </person>
    <person key="sierra-2018-idp195312">
      <firstname>Vijaya</firstname>
      <lastname>Bollapragada</lastname>
      <categoryPro>Visiteur</categoryPro>
      <research-centre>Paris</research-centre>
      <moreinfo>Northwestern University, from Apr 2018 until Jul 2018</moreinfo>
    </person>
    <person key="sierra-2018-idp197808">
      <firstname>Aaron</firstname>
      <lastname>Defazio</lastname>
      <categoryPro>Visiteur</categoryPro>
      <research-centre>Paris</research-centre>
      <moreinfo>Facebook Research, until Feb 2018</moreinfo>
    </person>
    <person key="sierra-2018-idp200288">
      <firstname>Gauthier</firstname>
      <lastname>Gidel</lastname>
      <categoryPro>Visiteur</categoryPro>
      <research-centre>Paris</research-centre>
      <moreinfo>University of Montreal, Jan 2018</moreinfo>
    </person>
    <person key="sierra-2018-idp202768">
      <firstname>Achintya</firstname>
      <lastname>Kundu</lastname>
      <categoryPro>Visiteur</categoryPro>
      <research-centre>Paris</research-centre>
      <moreinfo>Ecole d'ingénieurs, from Jun 2018 until Aug 2018</moreinfo>
    </person>
    <person key="thoth-2018-idp164736">
      <firstname>Gregoire</firstname>
      <lastname>Mialon</lastname>
      <categoryPro>Visiteur</categoryPro>
      <research-centre>Paris</research-centre>
      <moreinfo>Inria, Sep 2018</moreinfo>
    </person>
    <person key="sierra-2018-idp207776">
      <firstname>Sharan</firstname>
      <lastname>Vaswani</lastname>
      <categoryPro>Visiteur</categoryPro>
      <research-centre>Paris</research-centre>
      <moreinfo>University of British Columbia, from Apr 2018 until Jul 2018</moreinfo>
    </person>
    <person key="sierra-2018-idp210288">
      <firstname>Simon</firstname>
      <lastname>Lacoste-Julien</lastname>
      <categoryPro>Visiteur</categoryPro>
      <research-centre>Paris</research-centre>
      <moreinfo>University of Montreal, Aug 2018</moreinfo>
    </person>
    <person key="sierra-2018-idp212768">
      <firstname>Alex</firstname>
      <lastname>Nowak Vila</lastname>
      <categoryPro>PhD</categoryPro>
      <research-centre>Paris</research-centre>
      <moreinfo>Inria, from Oct 2018</moreinfo>
    </person>
    <person key="sierra-2018-idp212768">
      <firstname>Alex</firstname>
      <lastname>Nowak Vila</lastname>
      <categoryPro>Stagiaire</categoryPro>
      <research-centre>Paris</research-centre>
      <moreinfo>Inria, from Apr 2018 until Sep 2018</moreinfo>
    </person>
  </team>
  <presentation id="uid2">
    <bodyTitle>Overall Objectives</bodyTitle>
    <subsection id="uid3" level="1">
      <bodyTitle>Statement</bodyTitle>
      <p>Machine learning is a recent scientific domain, positioned between
applied mathematics, statistics and computer science. Its goals are
the optimization, control, and modeling of complex systems from
examples. It applies to data from numerous engineering and scientific
fields (e.g., vision, bioinformatics, neuroscience, audio processing,
text processing, economy, finance, etc.), the ultimate goal being to
derive general theories and algorithms allowing advances in each of
these domains. Machine learning is characterized by the high quality
and quantity of the exchanges between theory, algorithms and
applications: interesting theoretical problems almost always emerge
from applications, while theoretical analysis allows the understanding
of why and when popular or successful algorithms do or do not work,
and leads to proposing significant improvements.</p>
      <p>Our academic positioning is exactly at the intersection between these
three aspects—algorithms, theory and applications—and our main
research goal is to make the link between theory and algorithms, and
between algorithms and high-impact applications in various engineering
and scientific fields, in particular computer vision, bioinformatics,
audio processing, text processing and neuro-imaging.</p>
      <p>Machine learning is now a vast field of research and the team focuses
on the following aspects: supervised learning (kernel methods,
calibration), unsupervised learning (matrix factorization, statistical
tests), parsimony (structured sparsity, theory and algorithms), and
optimization (convex optimization, bandit learning). These four
research axes are strongly interdependent, and the interplay between
them is key to successful practical applications.</p>
    </subsection>
  </presentation>
  <fondements id="uid4">
    <bodyTitle>Research Program</bodyTitle>
    <subsection id="uid5" level="1">
      <bodyTitle>Supervised Learning</bodyTitle>
      <p>This part of our research focuses on methods where, given a set of
examples of input/output pairs, the goal is to predict the output
for a new input, with research on kernel methods, calibration methods,
and multi-task learning.
</p>
    </subsection>
    <subsection id="uid6" level="1">
      <bodyTitle>Unsupervised Learning</bodyTitle>
      <p>We focus here on methods where no output is given and the goal is to
find structure of certain known types (e.g., discrete or
low-dimensional) in the data, with a focus on matrix factorization,
statistical tests, dimension reduction, and semi-supervised learning.
</p>
    </subsection>
    <subsection id="uid7" level="1">
      <bodyTitle>Parsimony</bodyTitle>
      <p>The concept of parsimony is central to many areas of science. In the
context of statistical machine learning, this takes the form of
variable or feature selection. The team focuses primarily on
structured sparsity, with theoretical and algorithmic contributions.
</p>
    </subsection>
    <subsection id="uid8" level="1">
      <bodyTitle>Optimization</bodyTitle>
      <p>Optimization in all its forms is central to machine learning, as many
of its theoretical frameworks are based at least in part on
empirical risk minimization. The team focuses primarily on convex and
bandit optimization, with a particular focus on large-scale optimization.
</p>
    </subsection>
  </fondements>
  <domaine id="uid9">
    <bodyTitle>Application Domains</bodyTitle>
    <subsection id="uid10" level="1">
      <bodyTitle>Applications for Machine Learning</bodyTitle>
      <p>Machine learning research can be conducted from two main perspectives: the first one, which has been
dominant in the last 30 years, is to design learning algorithms and theories which are as generic as possible, the
goal being to make as few assumptions as possible regarding the problems to be solved and to let data speak
for themselves. This has led to many interesting methodological developments and successful applications.
However, we believe that this strategy has reached its limit for many application domains, such as computer
vision, bioinformatics, neuro-imaging, text and audio processing, which leads to the second perspective our
team is built on: Research in machine learning theory and algorithms should be driven by interdisciplinary
collaborations, so that specific prior knowledge may be properly introduced into the learning process, in
particular with the following fields:</p>
      <simplelist>
        <li id="uid11">
          <p noindent="true">Computer vision: object recognition, object detection, image segmentation, image/video processing, computational photography. In collaboration with the Willow project-team.</p>
        </li>
        <li id="uid12">
          <p noindent="true">Bioinformatics: cancer diagnosis, protein function prediction, virtual screening. In collaboration with Institut Curie.</p>
        </li>
        <li id="uid13">
          <p noindent="true">Text processing: document collection modeling, language models.</p>
        </li>
        <li id="uid14">
          <p noindent="true">Audio processing: source separation, speech/music processing.</p>
        </li>
        <li id="uid15">
          <p noindent="true">Neuro-imaging: brain-computer interface (fMRI, EEG, MEG).</p>
        </li>
      </simplelist>
    </subsection>
  </domaine>
  <highlights id="uid16">
    <bodyTitle>Highlights of the Year</bodyTitle>
    <subsection id="uid17" level="1">
      <bodyTitle>Highlights of the Year</bodyTitle>
      <subsection id="uid18" level="2">
        <bodyTitle>Awards</bodyTitle>
        <sanspuceslist>
          <li id="uid19">
            <p noindent="true">Francis Bach, Lagrange Prize in Continuous Optimization, Society for Industrial and Applied Mathematics 2018</p>
          </li>
          <li id="uid20">
            <p noindent="true">Francis Bach, Best Paper Award, NeurIPS 2018.</p>
          </li>
          <li id="uid21">
            <p noindent="true">Francis Bach included in the report <i>Highly cited researchers, year 2018</i>, Clarivate Analytics, 2018</p>
          </li>
          <li id="uid22">
            <p noindent="true">Nicolas Flammarion, PhD thesis award in the <i>Programme Gaspard Monge</i>, Fondation Mathématique Jacques Hadamard, 2018.</p>
          </li>
          <li id="uid23">
            <p noindent="true">Adrien Taylor, Tucker Prize (finalist) 2018 (dissertation prize by the Mathematical Optimization Society for 2015-2017).</p>
          </li>
          <li id="uid24">
            <p noindent="true">Adrien Taylor, IBM/FNRS innovation award 2018 (dissertation prize for
original contributions to informatics).</p>
          </li>
          <li id="uid25">
            <p noindent="true">Adrien Taylor, Icteam thesis award 2018 (dissertation award by the icteam
institute of UCLouvain, Belgium).</p>
          </li>
          <li id="uid26">
            <p noindent="true">Adrien Taylor, Best paper award 2018 from the journal Optimization Letters for the paper <i>On the worst-case complexity of the gradient method with exact line search for smooth strongly convex functions</i>, Etienne De Klerk, François Glineur, Adrien Taylor.</p>
          </li>
        </sanspuceslist>
      </subsection>
    </subsection>
  </highlights>
  <logiciels id="uid27">
    <bodyTitle>New Software and Platforms</bodyTitle>
    <subsection id="uid28" level="1">
      <bodyTitle>ProxASAGA</bodyTitle>
      <p><span class="smallcap" align="left">Keyword:</span> Optimization</p>
      <p noindent="true"><span class="smallcap" align="left">Functional Description:</span> A C++/Python code implementing the methods in the paper "Breaking the Nonsmooth Barrier: A Scalable Parallel Method for Composite Optimization", F. Pedregosa, R. Leblond and S. Lacoste-Julien, Advances in Neural Information Processing Systems (NIPS) 2017. Due to their simplicity and excellent performance, parallel asynchronous variants of stochastic gradient descent have become popular methods to solve a wide range of large-scale optimization problems on multi-core architectures. Yet, despite their practical success, support for nonsmooth objectives is still lacking, making them unsuitable for many problems of interest in machine learning, such as the Lasso, group Lasso or empirical risk minimization with convex constraints. In this work, we propose and analyze ProxASAGA, a fully asynchronous sparse method inspired by SAGA, a variance reduced incremental gradient algorithm. The proposed method is easy to implement and significantly outperforms the state of the art on several nonsmooth, large-scale problems. We prove that our method achieves a theoretical linear speedup with respect to the sequential version under assumptions on the sparsity of gradients and block-separability of the proximal term. Empirical benchmarks on a multi-core architecture illustrate practical speedups of up to 12x on a 20-core machine.</p>
      <simplelist>
        <li id="uid29">
          <p noindent="true">Contact: Fabian Pedregosa</p>
        </li>
        <li id="uid30">
          <p noindent="true">URL: <ref xlink:href="https://github.com/fabianp/ProxASAGA" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>github.<allowbreak/>com/<allowbreak/>fabianp/<allowbreak/>ProxASAGA</ref></p>
        </li>
      </simplelist>
    </subsection>
    <subsection id="uid31" level="1">
      <bodyTitle>object-states-action</bodyTitle>
      <p><span class="smallcap" align="left">Keyword:</span> Computer vision</p>
      <p noindent="true"><span class="smallcap" align="left">Functional Description:</span> Code for the paper Joint Discovery of Object States and Manipulation Actions, ICCV 2017: Many human activities involve object manipulations
aiming to modify the object state. Examples of common
state changes include full/empty bottle, open/closed door,
and attached/detached car wheel. In this work, we seek to
automatically discover the states of objects and the associated
manipulation actions. Given a set of videos for a particular
task, we propose a joint model that learns to identify
object states and to localize state-modifying actions. Our
model is formulated as a discriminative clustering cost with
constraints. We assume a consistent temporal order for the
changes in object states and manipulation actions, and introduce
new optimization techniques to learn model parameters
without additional supervision. We demonstrate successful
discovery of seven manipulation actions and corresponding
object states on a new dataset of videos depicting
real-life object manipulations. We show that our joint formulation
results in an improvement of object state discovery
by action recognition and vice versa.</p>
      <simplelist>
        <li id="uid32">
          <p noindent="true">Participants: Jean-Baptiste Alayrac, Josef Sivic, Ivan Laptev and Simon Lacoste-Julien</p>
        </li>
        <li id="uid33">
          <p noindent="true">Contact: Jean-Baptiste Alayrac</p>
        </li>
        <li id="uid34">
          <p noindent="true">Publication: <ref xlink:href="https://hal.inria.fr/hal-01676084" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">Joint Discovery of Object States and Manipulation Actions</ref></p>
        </li>
        <li id="uid35">
          <p noindent="true">URL: <ref xlink:href="https://github.com/jalayrac/object-states-action" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>github.<allowbreak/>com/<allowbreak/>jalayrac/<allowbreak/>object-states-action</ref></p>
        </li>
      </simplelist>
    </subsection>
  </logiciels>
  <resultats id="uid36">
    <bodyTitle>New Results</bodyTitle>
    <subsection id="uid37" level="1">
      <bodyTitle>On the Global Convergence of Gradient Descent for Over-parameterized Models using Optimal Transport</bodyTitle>
      <p>Many tasks in machine learning and signal processing can be solved by minimizing a convex function of a measure. This includes sparse spikes deconvolution or training a neural network with a single hidden layer. For these problems, in <ref xlink:href="#sierra-2018-bid0" location="biblio" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest"/> we study a simple minimization method: the unknown measure is discretized into a mixture of particles and a continuous-time gradient descent is performed on their weights and positions. This is an idealization of the usual way to train neural networks with a large hidden layer. We show that, when initialized correctly and in the many-particle limit, this gradient flow, although non-convex, converges to global minimizers. The proof involves Wasserstein gradient flows, a by-product of optimal transport theory. Numerical experiments show that this asymptotic behavior is already at play for a reasonable number of particles, even in high dimension.
</p>
    </subsection>
    <subsection id="uid38" level="1">
      <bodyTitle>Sharp Analysis of Learning with Discrete Losses</bodyTitle>
      <p>In <ref xlink:href="#sierra-2018-bid1" location="biblio" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest"/>, we study a least-squares framework to systematically design learning algorithms
for discrete losses, with quantitative characterizations in terms of statistical and
computational complexity. In particular we improve existing results by providing explicit
dependence on the number of labels for a wide class of losses and faster learning rates in
conditions of low-noise.
Theoretical results are complemented with experiments on real datasets, showing the
effectiveness of the proposed general approach.
</p>
    </subsection>
    <subsection id="uid39" level="1">
      <bodyTitle>Gossip of Statistical Observations using Orthogonal Polynomials</bodyTitle>
      <p>Consider a network of agents connected by communication links, where each agent holds a real value. The gossip problem consists in estimating the average of the values diffused in the network in a distributed manner. Current techniques for gossiping are designed to deal with worst-case scenarios, which is irrelevant in applications to distributed statistical learning and denoising in sensor networks. In <ref xlink:href="#sierra-2018-bid2" location="biblio" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest"/>, we design second-order gossip methods tailor-made for the case where the real values are i.i.d. samples from the same distribution. In some regular network structures, we are able to prove optimality of our methods, and simulations suggest that they are efficient in a wide range of random networks. Our approach of gossip stems from a new acceleration framework using the family of orthogonal polynomials with respect to the spectral measure of the network graph.
</p>
    </subsection>
    <subsection id="uid40" level="1">
      <bodyTitle>Marginal Weighted Maximum Log-likelihood for Efficient Learning of Perturb-and-Map models</bodyTitle>
      <p>In <ref xlink:href="#sierra-2018-bid3" location="biblio" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest"/>, we consider the structured-output prediction problem through probabilistic approaches and generalize the “perturb-and-MAP” framework to more challenging weighted Hamming losses, which are crucial in applications. While in principle our approach is a straightforward marginalization, it requires solving many related MAP inference problems. We show that for log-supermodular pairwise models these operations can be performed efficiently using the machinery of dynamic graph cuts. We also propose to use double stochastic gradient descent, both on the data and on the perturbations, for efficient learning. Our framework can naturally take weak supervision (e.g., partial labels) into account. We conduct a set of experiments on medium-scale character recognition and image segmentation, showing the benefits of our algorithms.
</p>
    </subsection>
    <subsection id="uid41" level="1">
      <bodyTitle>Slice inverse regression with score functions</bodyTitle>
      <p>In <ref xlink:href="#sierra-2018-bid4" location="biblio" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest"/>, we consider non-linear regression problems where we assume that the response depends non-linearly on a linear projection of the covariates. We propose score function extensions to sliced inverse regression problems, both for the first- order and second-order score functions. We show that they provably improve estimation in the population case over the non-sliced versions and we study finite sample estimators and their consistency given the exact score functions. We also propose to learn the score function as well, in two steps, i.e., first learning the score function and then learning the effective dimension reduction space, or directly, by solving a convex optimization problem regularized by the nuclear norm. We illustrate our results on a series of experiments.
</p>
    </subsection>
    <subsection id="uid42" level="1">
      <bodyTitle>Constant Step Size Stochastic Gradient Descent for Probabilistic Modeling</bodyTitle>
      <p>Stochastic gradient methods enable learning probabilistic models from large amounts of data. While large step-sizes (learning rates) have shown to be best for least-squares (e.g., Gaussian noise) once combined with parameter averaging, these are not leading to convergent algorithms in general. In this paper, we consider generalized linear models, that is, conditional models based on exponential families. In <ref xlink:href="#sierra-2018-bid5" location="biblio" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest"/>, we propose averaging moment parameters instead of natural parameters for constant-step-size stochastic gradient descent. For finite-dimensional models, we show that this can sometimes (and surprisingly) lead to better predictions than the best linear model. For infinite-dimensional models, we show that it always converges to optimal predictions, while averaging natural parameters never does. We illustrate our findings with simulations on synthetic data and classical benchmarks with many observations.
</p>
    </subsection>
    <subsection id="uid43" level="1">
      <bodyTitle>Nonlinear Acceleration of Momentum and Primal-Dual Algorithms</bodyTitle>
      <p>In <ref xlink:href="#sierra-2018-bid6" location="biblio" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest"/>, we describe a convergence acceleration scheme for multistep optimization algorithms. The extrapolated solution is written as a nonlinear average of the iterates produced by the original optimization algorithm. Our scheme does not need the underlying fixed-point operator to be symmetric, hence handles e.g. algorithms with momentum terms such as Nesterov's accelerated method, or primal-dual methods. The weights are computed via a simple linear system and we analyze performance in both online and offline modes. We use Crouzeix's conjecture to show that acceleration performance is controlled by the solution of a Chebyshev problem on the numerical range of a non-symmetric operator modelling the behavior of iterates near the optimum. Numerical experiments are detailed on image processing problems, logistic regression and neural network training for CIFAR10 and ImageNet.
</p>
    </subsection>
    <subsection id="uid44" level="1">
      <bodyTitle>Nonlinear Acceleration of Deep Neural Networks</bodyTitle>
      <p>Regularized nonlinear acceleration (RNA) is a generic extrapolation scheme for optimization methods, with marginal computational overhead. It aims to improve convergence using only the iterates of simple iterative algorithms. However, so far its application to optimization was theoretically limited to gradient descent and other single-step algorithms. Here, we adapt RNA to a much broader setting including stochastic gradient with momentum and Nesterov's fast gradient. In <ref xlink:href="#sierra-2018-bid7" location="biblio" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest"/>, we use it to train deep neural networks, and empirically observe that extrapolated networks are more accurate, especially in the early iterations. A straightforward application of our algorithm when training ResNet-152 on ImageNet produces a top-1 test error of 20.88, improving the reference classification pipeline by 0.8. Furthermore, the code runs offline in this case, so it never negatively affects performance.
</p>
    </subsection>
    <subsection id="uid45" level="1">
      <bodyTitle>Nonlinear Acceleration of CNNs</bodyTitle>
      <p>The Regularized Nonlinear Acceleration (RNA) algorithm is an acceleration method capable of improving the rate of convergence of many optimization schemes such as gradient descent, SAGA or SVRG. Until now, its analysis has been limited to convex problems, but empirical observations show that RNA may be extended to wider settings. In <ref xlink:href="#sierra-2018-bid8" location="biblio" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest"/>, we investigate further the benefits of RNA when applied to neural networks, in particular for the task of image recognition on CIFAR10 and ImageNet. With very few modifications of existing frameworks, RNA slightly improves the optimization process of CNNs, after training.
</p>
    </subsection>
    <subsection id="uid46" level="1">
      <bodyTitle>Robust Seriation and Applications To Cancer Genomics</bodyTitle>
      <p>The seriation problem seeks to reorder a set of elements given pairwise similarity information,
so that elements with higher similarity are closer in the resulting sequence. When a global ordering consistent
with the similarity information exists, an exact spectral solution recovers it in the noiseless case and seriation
is equivalent to the combinatorial 2-SUM problem over permutations, for which several relaxations have been
derived. However, in applications such as DNA assembly, similarity values are often heavily corrupted, and
the solution of 2-SUM may no longer yield an approximate serial structure on the elements. In <ref xlink:href="#sierra-2018-bid9" location="biblio" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest"/>, we introduce the
robust seriation problem and show that it is equivalent to a modified 2-SUM problem for a class of similarity
matrices modeling those observed in DNA assembly. We explore several relaxations of this modified 2-SUM
problem and compare them empirically on both synthetic matrices and real DNA data. We then introduce
the problem of seriation with duplications, which is a generalization of seriation motivated by applications
to cancer genome reconstruction. We propose an algorithm involving robust seriation to solve it, and present
preliminary results on synthetic data sets.
</p>
    </subsection>
    <subsection id="uid47" level="1">
      <bodyTitle>Reconstructing Latent Orderings by Spectral Clustering</bodyTitle>
      <p>Spectral clustering uses a graph Laplacian spectral embedding to enhance the
cluster structure of some data sets. When the embedding is one dimensional, it
can be used to sort the items (spectral ordering). A number of empirical results
also suggest that a multidimensional Laplacian embedding enhances the latent
ordering of the data, if any. This also extends to circular orderings, a case where
unidimensional embeddings fail. In <ref xlink:href="#sierra-2018-bid10" location="biblio" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest"/>, we tackle the task of retrieving linear and circular
orderings in a unifying framework, and show how a latent ordering on the data
translates into a filamentary structure on the Laplacian embedding. We propose
a method to recover it, illustrated with numerical experiments on synthetic data
and real DNA sequencing data.
</p>
    </subsection>
    <subsection id="uid48" level="1">
      <bodyTitle>Lyapunov Functions for First-Order Methods: Tight Automated Convergence Guarantees</bodyTitle>
      <p>In <ref xlink:href="#sierra-2018-bid11" location="biblio" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest"/>, we present a novel way of generating Lyapunov
functions for proving linear convergence rates
of first-order optimization methods. Our approach
provably obtains the fastest linear convergence
rate that can be verified by a quadratic
Lyapunov function (with given states), and only
relies on solving a small-sized semidefinite program.
Our approach combines the advantages of
performance estimation problems and integral quadratic
constraints,
and relies on convex interpolation.
</p>
    </subsection>
    <subsection id="uid49" level="1">
      <bodyTitle>Efficient First-order Methods for Convex Minimization:
a Constructive Approach</bodyTitle>
      <p>In <ref xlink:href="#sierra-2018-bid12" location="biblio" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest"/>, we describe a novel constructive technique for devising efficient first-order methods for a wide range of large-scale convex minimization settings, including smooth, non-smooth, and strongly convex minimization. The design technique takes a method performing a series of subspace-searches and constructs a family of methods that share the same worst-case guarantees as the original method, and includes a fixed-step first-order method. We show that this technique yields optimal methods in the smooth and non-smooth cases and derive new methods for these cases, including methods that forego knowledge of the problem parameters, at the cost of a one-dimensional line search per iteration. In the strongly convex case, we show how numerical tools can be used to perform the construction, and show that the resulting method offers an improved convergence rate compared to Nesterov's celebrated fast gradient method.
</p>
    </subsection>
    <subsection id="uid50" level="1">
      <bodyTitle>Operator Splitting Performance Estimation: Tight contraction factors and optimal parameter selection</bodyTitle>
      <p>In <ref xlink:href="#sierra-2018-bid13" location="biblio" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest"/>, we propose a methodology for studying the performance of common splitting methods through semidefinite programming. We prove tightness of the methodology and demonstrate its value by presenting two applications of it. First, we use the methodology as a tool for computer-assisted proofs to prove tight analytical contraction factors for Douglas–Rachford splitting that are likely too complicated for a human to find bare-handed. Second, we use the methodology as an algorithmic tool to computationally select the optimal splitting method parameters by solving a series of semidefinite programs.
</p>
    </subsection>
    <subsection id="uid51" level="1">
      <bodyTitle>Finite-sample Analysis of M-estimators using Self-concordance</bodyTitle>
      <p>In <ref xlink:href="#sierra-2018-bid14" location="biblio" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest"/>, we demonstrate how <i>self-concordance</i> of the loss allows to obtain asymptotically optimal rates for <formula type="inline"><math xmlns="http://www.w3.org/1998/Math/MathML" overflow="scroll"><mi>M</mi></math></formula>-estimators in finite-sample regimes.
We consider two classes of losses:
(i) self-concordant losses, i.e., whose third derivative is uniformly bounded with the <formula type="inline"><math xmlns="http://www.w3.org/1998/Math/MathML" overflow="scroll"><mrow><mn>3</mn><mo>/</mo><mn>2</mn></mrow></math></formula> power of the second;
(ii) <i>pseudo</i> self-concordant losses, for which the power is removed.
These classes contain some losses arising in generalized linear models, including the logistic loss; in addition, the second class includes some common pseudo-Huber losses.
Our results consist in establishing the <i>critical sample size</i> sufficient to reach the asymptotically optimal excess risk in both cases. Denoting <formula type="inline"><math xmlns="http://www.w3.org/1998/Math/MathML" overflow="scroll"><mi>d</mi></math></formula> the parameter dimension, and <formula type="inline"><math xmlns="http://www.w3.org/1998/Math/MathML" overflow="scroll"><msub><mi>d</mi><mi>e</mi></msub></math></formula> the effective dimension taking into account possible model misspecification, we find the critical sample size to be <formula type="inline"><math xmlns="http://www.w3.org/1998/Math/MathML" overflow="scroll"><mrow><mi>O</mi><mo>(</mo><msub><mi>d</mi><mi>e</mi></msub><mo>·</mo><mi>d</mi><mo>)</mo></mrow></math></formula> for the first class of losses, and <formula type="inline"><math xmlns="http://www.w3.org/1998/Math/MathML" overflow="scroll"><mrow><mi>O</mi><mo>(</mo><mi>ρ</mi><mo>·</mo><msub><mi>d</mi><mi>e</mi></msub><mo>·</mo><mi>d</mi><mo>)</mo></mrow></math></formula> for the second class, where <formula type="inline"><math xmlns="http://www.w3.org/1998/Math/MathML" overflow="scroll"><mi>ρ</mi></math></formula> is the problem-dependent parameter that characterizes the risk curvature at the best predictor <formula type="inline"><math xmlns="http://www.w3.org/1998/Math/MathML" overflow="scroll"><msub><mi>θ</mi><mo>*</mo></msub></math></formula>.
In contrast to the existing results, we only impose <i>local</i> assumptions on the data distribution, assuming that the calibrated design, i.e., the design scaled with the square root of the second derivative of the loss, is subgaussian at the best predictor.
Moreover, we obtain the improved bounds on the critical sample size, scaling <i>near-linearly</i> in <formula type="inline"><math xmlns="http://www.w3.org/1998/Math/MathML" overflow="scroll"><mrow><mo movablelimits="true" form="prefix">max</mo><mo>(</mo><msub><mi>d</mi><mi>e</mi></msub><mo>,</mo><mi>d</mi><mo>)</mo></mrow></math></formula>, under the extra assumption that the calibrated design is subgaussian in the Dikin ellipsoid of <formula type="inline"><math xmlns="http://www.w3.org/1998/Math/MathML" overflow="scroll"><msub><mi>θ</mi><mo>*</mo></msub></math></formula>.
Motivated by these findings, we construct canonically self-concordant analogues of the Huber and logistic losses with improved statistical properties.
Finally, we extend some of the above results to <formula type="inline"><math xmlns="http://www.w3.org/1998/Math/MathML" overflow="scroll"><msub><mi>ℓ</mi><mn>1</mn></msub></math></formula>-penalized <formula type="inline"><math xmlns="http://www.w3.org/1998/Math/MathML" overflow="scroll"><mi>M</mi></math></formula>-estimators in high-dimensional setups.
</p>
    </subsection>
    <subsection id="uid52" level="1">
      <bodyTitle>Uniform regret bounds over <formula type="inline"><math xmlns="http://www.w3.org/1998/Math/MathML" overflow="scroll"><msup><mi>R</mi><mi>d</mi></msup></math></formula> for the sequential linear regression problem with the square loss</bodyTitle>
      <p>In <ref xlink:href="#sierra-2018-bid15" location="biblio" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest"/> we consider the setting of online linear regression for arbitrary deterministic sequences,
with the square loss. We are interested in obtaining regret bounds that hold uniformly over all vectors in <formula type="inline"><math xmlns="http://www.w3.org/1998/Math/MathML" overflow="scroll"><msup><mi>R</mi><mi>d</mi></msup></math></formula>. When the feature sequence is known at the beginning of the game, previous work provided closed-form regret bounds of <formula type="inline"><math xmlns="http://www.w3.org/1998/Math/MathML" overflow="scroll"><mrow><mn>2</mn><mi>d</mi><msup><mi>B</mi><mn>2</mn></msup><mo form="prefix">ln</mo><mi>T</mi><mo>+</mo><mi>O</mi><mrow><mo>(</mo><mn>1</mn><mo>)</mo></mrow></mrow></math></formula>, where <formula type="inline"><math xmlns="http://www.w3.org/1998/Math/MathML" overflow="scroll"><mi>T</mi></math></formula> is the number of rounds and <formula type="inline"><math xmlns="http://www.w3.org/1998/Math/MathML" overflow="scroll"><mi>B</mi></math></formula> is a bound on the observations. Instead, we derive bounds with an optimal constant of 1 in front of the <formula type="inline"><math xmlns="http://www.w3.org/1998/Math/MathML" overflow="scroll"><mrow><mi>d</mi><msup><mi>B</mi><mn>2</mn></msup><mo form="prefix">ln</mo><mi>T</mi></mrow></math></formula> term. In the case of sequentially revealed features, we also derive an asymptotic regret bound of <formula type="inline"><math xmlns="http://www.w3.org/1998/Math/MathML" overflow="scroll"><mrow><mi>d</mi><msup><mi>B</mi><mn>2</mn></msup><mo form="prefix">ln</mo><mi>T</mi></mrow></math></formula> for any individual sequence of features and bounded observations. All our algorithms are variants of the online nonlinear ridge regression forecaster, either with a data-dependent regularization or with almost no regularization.
</p>
    </subsection>
    <subsection id="uid53" level="1">
      <bodyTitle>Efficient online algorithms for fast-rate regret bounds under sparsity.</bodyTitle>
      <p>In <ref xlink:href="#sierra-2018-bid16" location="biblio" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest"/> we consider the problem of online convex optimization in two different settings: arbitrary and i.i.d. sequence of convex loss functions. In both settings, we provide efficient algorithms whose cumulative excess risks are controlled with fast-rate sparse bounds.
First, the excess risk bounds depend on the sparsity of the objective rather than on the dimension of the parameter space. Second, their rates are faster than the slow-rate <formula type="inline"><math xmlns="http://www.w3.org/1998/Math/MathML" overflow="scroll"><mrow><mn>1</mn><mo>/</mo><msqrt><mi>T</mi></msqrt></mrow></math></formula>.
</p>
    </subsection>
    <subsection id="uid54" level="1">
      <bodyTitle>Exponential convergence of testing error for stochastic gradient methods</bodyTitle>
      <p>In <ref xlink:href="#sierra-2018-bid17" location="biblio" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest"/>, we consider binary classification problems with positive definite kernels and square loss, and study the convergence rates of stochastic gradient methods. We show that while the excess testing loss (squared loss) converges slowly to zero as the number of observations (and thus iterations) goes to infinity, the testing error (classification error) converges exponentially fast if low-noise conditions are assumed.
</p>
    </subsection>
    <subsection id="uid55" level="1">
      <bodyTitle>Statistical Optimality of Stochastic Gradient Descent on Hard Learning Problems through Multiple Passes</bodyTitle>
      <p>In <ref xlink:href="#sierra-2018-bid18" location="biblio" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest"/>, we consider stochastic gradient descent (SGD) for least-squares regression with potentially several passes over the data. While several passes have been widely reported to perform practically better in terms of predictive performance on unseen data, the existing theoretical analysis of SGD suggests that a single pass is statistically optimal. While this is true for low-dimensional easy problems, we show that for hard problems, multiple passes lead to statistically optimal predictions while single pass does not; we also show that in these hard models, the optimal number of passes over the data increases with sample size. In order to define the notion of hardness and show that our predictive performances are optimal, we consider potentially infinite-dimensional models and notions typically associated to kernel methods, namely, the decay of eigenvalues of the covariance matrix of the features and the complexity of the optimal predictor as measured through the covariance matrix. We illustrate our results on synthetic experiments with non-linear kernel methods and on a classical benchmark with a linear model.
</p>
    </subsection>
    <subsection id="uid56" level="1">
      <bodyTitle>Central Limit Theorem for stationary Fleming–Viot particle systems in finite spaces</bodyTitle>
      <p>In <ref xlink:href="#sierra-2018-bid19" location="biblio" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest"/>, we consider the Fleming–Viot particle system associated with a continuous-time Markov chain in a finite space. Assuming irreducibility, it is known that the particle system possesses a unique stationary distribution, under which its empirical measure converges to the quasistationary distribution of the Markov chain. We complement this Law of Large Numbers with a Central Limit Theorem. Our proof essentially relies on elementary computations on the infinitesimal generator of the Fleming–Viot particle system, and involves the so-called <formula type="inline"><math xmlns="http://www.w3.org/1998/Math/MathML" overflow="scroll"><mi>π</mi></math></formula>-return process in the expression of the asymptotic variance. Our work can be seen as an infinite-time version, in the setting of finite space Markov chains, of results by Del Moral and Miclo [ESAIM: Probab. Statist., 2003] and Cérou, Delyon, Guyader and Rousset [arXiv:1611.00515, arXiv:1709.06771].
</p>
    </subsection>
    <subsection id="uid57" level="1">
      <bodyTitle>SeaRNN: Improved RNN training through Global-Local Losses</bodyTitle>
      <p>In <ref xlink:href="#sierra-2018-bid20" location="biblio" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest"/>, we propose SEARNN, a novel training algorithm for recurrent neural networks (RNNs) inspired by the “learning to search” (L2S) approach to structured prediction. RNNs have been widely successful in structured prediction applications such as machine translation or parsing, and are commonly trained using maximum likelihood estimation (MLE). Unfortunately, this training loss is not always an appropriate surrogate for the test error: by only maximizing the ground truth probability, it fails to exploit the wealth of information offered by structured losses. Further, it introduces discrepancies between training and predicting (such as exposure bias) that may hurt test performance. Instead, SEARNN leverages test-alike search space exploration to introduce global-local losses that are closer to the test error. We first demonstrate improved performance over MLE on two different tasks: OCR and spelling correction. Then, we propose a subsampling strategy to enable SEARNN to scale to large vocabulary sizes. This allows us to validate the benefits of our approach on a machine translation task.
</p>
    </subsection>
    <subsection id="uid58" level="1">
      <bodyTitle>Improved asynchronous parallel optimization analysis for stochastic incremental methods</bodyTitle>
      <p>As datasets continue to increase in size and multi-core computer architectures are developed, asynchronous parallel optimization algorithms become more and more essential to the field of Machine Learning. Unfortunately, conducting the theoretical analysis of asynchronous methods is difficult, notably due to the introduction of delay and inconsistency in inherently sequential algorithms. Handling these issues often requires resorting to simplifying but unrealistic assumptions. Through a novel perspective, in <ref xlink:href="#sierra-2018-bid21" location="biblio" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest"/> we revisit and clarify a subtle but important technical issue present in a large fraction of the recent convergence rate proofs for asynchronous parallel optimization algorithms, and propose a simplification of the recently introduced "perturbed iterate" framework that resolves it. We demonstrate the usefulness of our new framework by analyzing three distinct asynchronous parallel incremental optimization algorithms: Hogwild (asynchronous SGD), KROMAGNON (asynchronous SVRG) and ASAGA, a novel asynchronous parallel version of the incremental gradient algorithm SAGA that enjoys fast linear convergence rates. We are able to both remove problematic assumptions and obtain better theoretical results. Notably, we prove that ASAGA and KROMAGNON can obtain a theoretical linear speedup on multi-core systems even without sparsity assumptions. We present results of an implementation on a 40-core architecture illustrating the practical speedups as well as the hardware overhead. Finally, we investigate the overlap constant, an ill-understood but central quantity for the theoretical analysis of asynchronous parallel algorithms. We find that it encompasses much more complexity than suggested in previous work, and often is order-of-magnitude bigger than traditionally thought.
</p>
    </subsection>
    <subsection id="uid59" level="1">
      <bodyTitle>Asynchronous optimisation for Machine Learning</bodyTitle>
      <p>The impressive breakthroughs of the last two decades in the field of machine learning can be in large part attributed to the explosion of computing power and available data. These two limiting factors have been replaced by a new bottleneck: algorithms. The focus of this thesis <ref xlink:href="#sierra-2018-bid22" location="biblio" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest"/> is thus on introducing novel methods that can take advantage of high data quantity and computing power. We present two independent contributions.</p>
      <p>First, we develop and analyze novel fast optimization algorithms which take advantage of the advances in parallel computing architecture and can handle vast amounts of data. We introduce a new framework of analysis for asynchronous parallel incremental algorithms, which enable correct and simple proofs. We then demonstrate its usefulness by performing the convergence analysis for several methods, including two novel algorithms.</p>
      <p>Asaga is a sparse asynchronous parallel variant of the variance-reduced algorithm Saga which enjoys fast linear convergence rates on smooth and strongly convex objectives. We prove that it can be linearly faster than its sequential counterpart, even without sparsity assumptions.</p>
      <p>ProxAsaga is an extension of Asaga to the more general setting where the regularizer can be non-smooth. We prove that it can also achieve a linear speedup.
We provide extensive experiments comparing our new algorithms to the current state-of-art.</p>
      <p>Second, we introduce new methods for complex structured prediction tasks. We focus on recurrent neural networks (RNNs), whose traditional training algorithm – based on maximum likelihood estimation (MLE) – suffers from several issues. The associated surrogate training loss notably ignores the information contained in structured losses and introduces discrepancies between train and test times that may hurt performance.</p>
      <p>To alleviate these problems, we propose SeaRNN, a novel training algorithm for RNNs inspired by the “learning to search” approach to structured prediction. SeaRNN leverages test-alike search space exploration to introduce global-local losses that are closer to the test error than the MLE objective.</p>
      <p>We demonstrate improved performance over MLE on three challenging tasks, and provide several subsampling strategies to enable SeaRNN to scale to large-scale tasks, such as machine translation. Finally, after contrasting the behavior of SeaRNN models to MLE models, we conduct an in-depth comparison of our new approach to the related work.
</p>
    </subsection>
    <subsection id="uid60" level="1">
      <bodyTitle><formula type="inline"><math xmlns="http://www.w3.org/1998/Math/MathML" overflow="scroll"><msup><mi>M</mi><mo>*</mo></msup></math></formula>-Regularized Dictionary Learning</bodyTitle>
      <p>In <ref xlink:href="#sierra-2018-bid23" location="biblio" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest"/>, we derive a performance measure for dictionaries in compressed sensing, based on the <formula type="inline"><math xmlns="http://www.w3.org/1998/Math/MathML" overflow="scroll"><msup><mi>M</mi><mo>*</mo></msup></math></formula> of the corresponding norm. We use this measure to regularize dictionary learning algorithms and study the performance of our methods on both compression and inpainting experiments.
</p>
    </subsection>
    <subsection id="uid61" level="1">
      <bodyTitle>Optimal Algorithms for Non-Smooth Distributed Optimization in Networks</bodyTitle>
      <p>In <ref xlink:href="#sierra-2018-bid24" location="biblio" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest"/>, we consider the distributed optimization of non-smooth convex functions using a network of computing units. We investigate this problem under two regularity assumptions: (1) the Lipschitz continuity of the global objective function, and (2) the Lipschitz continuity of local individual functions. Under the local regularity assumption, we provide the first optimal first-order decentralized algorithm called multi-step primal-dual (MSPD) and its corresponding optimal convergence rate. A notable aspect of this result is that, for non-smooth functions, while the dominant term of the error is in <formula type="inline"><math xmlns="http://www.w3.org/1998/Math/MathML" overflow="scroll"><mrow><mi>O</mi><mo>(</mo><mn>1</mn><mo>/</mo><msqrt><mi>t</mi></msqrt><mo>)</mo></mrow></math></formula>, the structure of the communication network only impacts a second-order term in <formula type="inline"><math xmlns="http://www.w3.org/1998/Math/MathML" overflow="scroll"><mrow><mi>O</mi><mo>(</mo><mn>1</mn><mo>/</mo><mi>t</mi><mo>)</mo></mrow></math></formula>, where <formula type="inline"><math xmlns="http://www.w3.org/1998/Math/MathML" overflow="scroll"><mi>t</mi></math></formula> is time. In other words, the error due to limits in communication resources decreases at a fast rate even in the case of non-strongly-convex objective functions. 
Under the global regularity assumption, we provide a simple yet efficient algorithm called distributed randomized smoothing (DRS) based on a local smoothing of the objective function, and show that DRS is within a <formula type="inline"><math xmlns="http://www.w3.org/1998/Math/MathML" overflow="scroll"><msup><mi>d</mi><mrow><mn>1</mn><mo>/</mo><mn>4</mn></mrow></msup></math></formula> multiplicative factor of the optimal convergence rate, where <formula type="inline"><math xmlns="http://www.w3.org/1998/Math/MathML" overflow="scroll"><mi>d</mi></math></formula> is the underlying dimension.
</p>
    </subsection>
    <subsection id="uid62" level="1">
      <bodyTitle>Relating Leverage Scores and Density using Regularized Christoffel Functions</bodyTitle>
      <p>Statistical leverage scores emerged as a fundamental tool for matrix sketching and column sampling with applications to low rank approximation, regression, random feature learning and quadrature. Yet, the very nature of this quantity is barely understood. Borrowing ideas from the orthogonal polynomial literature, we introduce in <ref xlink:href="#sierra-2018-bid25" location="biblio" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest"/> the regularized Christoffel function associated to a positive definite kernel. This uncovers a variational formulation for leverage scores for kernel methods and allows to elucidate their relationships with the chosen kernel as well as population density. Our main result quantitatively describes a decreasing relation between leverage score and population density for a broad class of kernels on Euclidean spaces. Numerical simulations support our findings.
</p>
    </subsection>
    <subsection id="uid63" level="1">
      <bodyTitle>Averaging Stochastic Gradient Descent on Riemannian Manifolds</bodyTitle>
      <p>In <ref xlink:href="#sierra-2018-bid26" location="biblio" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest"/> we consider the minimization of a function defined on a Riemannian manifold <formula type="inline"><math xmlns="http://www.w3.org/1998/Math/MathML" overflow="scroll"><mi>M</mi></math></formula> accessible only through unbiased estimates of its gradients. We develop a geometric framework to transform a sequence of slowly converging iterates generated from stochastic gradient descent (SGD) on <formula type="inline"><math xmlns="http://www.w3.org/1998/Math/MathML" overflow="scroll"><mi>M</mi></math></formula> to an averaged iterate sequence with a robust and fast <formula type="inline"><math xmlns="http://www.w3.org/1998/Math/MathML" overflow="scroll"><mrow><mi>O</mi><mo>(</mo><mn>1</mn><mo>/</mo><mi>n</mi><mo>)</mo></mrow></math></formula> convergence rate. We then present an application of our framework to geodesically-strongly-convex (and possibly Euclidean non-convex) problems. Finally, we demonstrate how these ideas apply to the case of streaming <formula type="inline"><math xmlns="http://www.w3.org/1998/Math/MathML" overflow="scroll"><mi>k</mi></math></formula>-PCA, where we show how to accelerate the slow rate of the randomized power method (without requiring knowledge of the eigengap) into a robust algorithm achieving the optimal rate of convergence.
</p>
    </subsection>
    <subsection id="uid64" level="1">
      <bodyTitle>Localized Structured Prediction</bodyTitle>
      <p>Key to structured prediction is exploiting the problem structure to simplify the learning process. A major challenge arises when data exhibit a local structure (e.g., are made by "parts") that can be leveraged to better approximate the relation between (parts of) the input and (parts of) the output. Recent literature on signal processing, and in particular computer vision, has shown that capturing these aspects is indeed essential to achieve state-of-the-art performance. While such algorithms are typically derived on a case-by-case basis, in <ref xlink:href="#sierra-2018-bid27" location="biblio" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest"/> we propose the first theoretical framework to deal with part-based data from a general perspective. We derive a novel approach to deal with these problems and study its generalization properties within the setting of statistical learning theory. Our analysis is novel in that it explicitly quantifies the benefits of leveraging the part-based structure of the problem with respect to the learning rates of the proposed estimator.
</p>
    </subsection>
    <subsection id="uid65" level="1">
      <bodyTitle>Optimal rates for spectral algorithms with least-squares regression over Hilbert spaces</bodyTitle>
      <p>In <ref xlink:href="#sierra-2018-bid28" location="biblio" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest"/>, we study regression problems over a separable Hilbert space with the square loss, covering non-parametric regression over a reproducing kernel Hilbert space. We investigate a class of spectral-regularized algorithms, including ridge regression, principal component analysis, and gradient methods. We prove optimal, high-probability convergence results in terms of variants of norms for the studied algorithms, considering a capacity assumption on the hypothesis space and a general source condition on the target function. Consequently, we obtain almost sure convergence results with optimal rates. Our results improve and generalize previous results, filling a theoretical gap for the non-attainable cases.
</p>
    </subsection>
    <subsection id="uid66" level="1">
      <bodyTitle>Differential Properties of Sinkhorn Approximation for Learning with Wasserstein Distance</bodyTitle>
      <p>Applications of optimal transport have recently gained remarkable attention thanks to the computational advantages of entropic regularization. However, in most situations the Sinkhorn approximation of the Wasserstein distance is replaced by a regularized version that is less accurate but easy to differentiate. In <ref xlink:href="#sierra-2018-bid29" location="biblio" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest"/> we characterize the differential properties of the original Sinkhorn distance, proving that it enjoys the same smoothness as its regularized version and we explicitly provide an efficient algorithm to compute its gradient. We show that this result benefits both theory and applications: on one hand, high order smoothness confers statistical guarantees to learning with Wasserstein approximations. On the other hand, the gradient formula allows us to efficiently solve learning and optimization problems in practice. Promising preliminary experiments complement our analysis.
</p>
    </subsection>
    <subsection id="uid67" level="1">
      <bodyTitle>Learning with SGD and Random Features</bodyTitle>
      <p>Sketching and stochastic gradient methods are arguably the most common techniques to derive efficient large scale learning algorithms. In <ref xlink:href="#sierra-2018-bid30" location="biblio" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest"/>, we investigate their application in the context of nonparametric statistical learning. More precisely, we study the estimator defined by stochastic gradient with mini batches and random features. The latter can be seen as form of nonlinear sketching and used to define approximate kernel methods. The considered estimator is not explicitly penalized/constrained and regularization is implicit. Indeed, our study highlights how different parameters, such as number of features, iterations, step-size and mini-batch size control the learning properties of the solutions. We do this by deriving optimal finite sample bounds, under standard assumptions. The obtained results are corroborated and illustrated by numerical experiments.
</p>
    </subsection>
    <subsection id="uid68" level="1">
      <bodyTitle>Manifold Structured Prediction</bodyTitle>
      <p>Structured prediction provides a general framework to deal with supervised problems where the outputs have semantically rich structure. While classical approaches consider finite, albeit potentially huge, output spaces, in <ref xlink:href="#sierra-2018-bid31" location="biblio" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest"/> we discuss how structured prediction can be extended to a continuous scenario. Specifically, we study a structured prediction approach to manifold valued regression. We characterize a class of problems for which the considered approach is statistically consistent and study how geometric optimization can be used to compute the corresponding estimator. Promising experimental results on both simulated and real data complete our study.
</p>
    </subsection>
    <subsection id="uid69" level="1">
      <bodyTitle>On Fast Leverage Score Sampling and Optimal Learning</bodyTitle>
      <p>Leverage score sampling provides an appealing way to perform approximate computations for large matrices. Indeed, it allows to derive faithful approximations with a complexity adapted to the problem at hand. Yet, performing leverage scores sampling is a challenge in its own right requiring further approximations. In <ref xlink:href="#sierra-2018-bid32" location="biblio" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest"/>, we study the problem of leverage score sampling for positive definite matrices defined by a kernel. Our contribution is twofold. First we provide a novel algorithm for leverage score sampling and second, we exploit the proposed method in statistical learning by deriving a novel solver for kernel ridge regression. Our main technical contribution is showing that the proposed algorithms are currently the most efficient and accurate for these problems.
</p>
    </subsection>
    <subsection id="uid70" level="1">
      <bodyTitle>Accelerated Decentralized Optimization with Local Updates for Smooth and Strongly Convex Objectives</bodyTitle>
      <p>In <ref xlink:href="#sierra-2018-bid33" location="biblio" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest"/>, we study the problem of minimizing a sum of smooth and
strongly convex functions split over the nodes of a network in a
decentralized fashion. We propose a decentralized
accelerated algorithm that only requires local synchrony. Its rate
depends on the condition number <formula type="inline"><math xmlns="http://www.w3.org/1998/Math/MathML" overflow="scroll"><mi>κ</mi></math></formula> of the local functions as well
as the network topology and delays. Under mild assumptions on the
topology of the graph, our algorithm takes a time <formula type="inline"><math xmlns="http://www.w3.org/1998/Math/MathML" overflow="scroll"><mrow><mi>O</mi><mo>(</mo><mrow><mo>(</mo><msub><mi>τ</mi><mo movablelimits="true" form="prefix">max</mo></msub><mo>+</mo><msub><mi>Δ</mi><mo movablelimits="true" form="prefix">max</mo></msub><mo>)</mo></mrow><msqrt><mrow><mi>κ</mi><mo>/</mo><mi>γ</mi></mrow></msqrt><mo form="prefix">ln</mo><mrow><mo>(</mo><msup><mi>ϵ</mi><mrow><mo>-</mo><mn>1</mn></mrow></msup><mo>)</mo></mrow><mo>)</mo></mrow></math></formula> to reach a
precision <formula type="inline"><math xmlns="http://www.w3.org/1998/Math/MathML" overflow="scroll"><mi>ϵ</mi></math></formula> where <formula type="inline"><math xmlns="http://www.w3.org/1998/Math/MathML" overflow="scroll"><mi>γ</mi></math></formula> is the spectral gap of the graph,
<formula type="inline"><math xmlns="http://www.w3.org/1998/Math/MathML" overflow="scroll"><msub><mi>τ</mi><mo movablelimits="true" form="prefix">max</mo></msub></math></formula> the maximum communication delay and <formula type="inline"><math xmlns="http://www.w3.org/1998/Math/MathML" overflow="scroll"><msub><mi>Δ</mi><mo movablelimits="true" form="prefix">max</mo></msub></math></formula> the
maximum computation time. Therefore, it matches the rate of
SSDA, which is optimal when <formula type="inline"><math xmlns="http://www.w3.org/1998/Math/MathML" overflow="scroll"><mrow><msub><mi>τ</mi><mo movablelimits="true" form="prefix">max</mo></msub><mo>=</mo><mi>Ω</mi><mfenced separators="" open="(" close=")"><msub><mi>Δ</mi><mo movablelimits="true" form="prefix">max</mo></msub></mfenced></mrow></math></formula>. Applying our algorithm to quadratic local
functions leads to an accelerated randomized gossip algorithm of rate
<formula type="inline"><math xmlns="http://www.w3.org/1998/Math/MathML" overflow="scroll"><mrow><mi>O</mi><mo>(</mo><msqrt><mrow><msub><mi>θ</mi><mi> gossip </mi></msub><mo>/</mo><mi>n</mi></mrow></msqrt><mo>)</mo></mrow></math></formula> where <formula type="inline"><math xmlns="http://www.w3.org/1998/Math/MathML" overflow="scroll"><msub><mi>θ</mi><mi> gossip </mi></msub></math></formula> is the
rate of the standard randomized gossip. To
the best of our knowledge, it is the first asynchronous gossip algorithm
with a provably improved rate of convergence of the second moment of the
error. We illustrate these results with experiments in idealized settings.
</p>
    </subsection>
  </resultats>
  <contrats id="uid71">
    <bodyTitle>Bilateral Contracts and Grants with Industry</bodyTitle>
    <subsection id="uid72" level="1">
      <bodyTitle>Bilateral Contracts with Industry</bodyTitle>
      <p>Microsoft Research: “Structured Large-Scale Machine Learning”. Machine learning is now ubiquitous in
industry, science, engineering, and personal life. While early successes were obtained by applying off-the-shelf techniques, there are two main challenges faced by machine learning in the “big data” era: structure and
scale. The project proposes to explore three axes, from theoretical, algorithmic and practical perspectives: (1)
large-scale convex optimization, (2) large-scale combinatorial optimization and (3) sequential decision making
for structured data. The project involves two Inria sites (Paris and Grenoble) and four MSR sites (Cambridge,
New England, Redmond, New York). Project website: <ref xlink:href="http://www.msr-inria.fr/projects/structured-large-scale-machine-learning/" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">http://<allowbreak/>www.<allowbreak/>msr-inria.<allowbreak/>fr/<allowbreak/>projects/<allowbreak/>structured-large-scale-machine-learning/</ref>.
</p>
    </subsection>
    <subsection id="uid73" level="1">
      <bodyTitle>Bilateral Grants with Industry</bodyTitle>
      <simplelist>
        <li id="uid74">
          <p noindent="true">Alexandre d’Aspremont, Francis Bach, Martin Jaggi (EPFL): Google Focused award.</p>
        </li>
        <li id="uid75">
          <p noindent="true">Francis Bach: Gift from Facebook AI Research.</p>
        </li>
        <li id="uid76">
          <p noindent="true">Alexandre d’Aspremont: AXA, "mécénat scientifique, chaire Havas-Dauphine", machine learning.</p>
        </li>
      </simplelist>
    </subsection>
  </contrats>
  <partenariat id="uid77">
    <bodyTitle>Partnerships and Cooperations</bodyTitle>
    <subsection id="uid78" level="1">
      <bodyTitle>National Initiatives</bodyTitle>
      <sanspuceslist>
        <li id="uid79">
          <p noindent="true">Alexandre d'Aspremont: IRIS, PSL “Science des données, données de la science”.</p>
        </li>
      </sanspuceslist>
    </subsection>
    <subsection id="uid80" level="1">
      <bodyTitle>European Initiatives</bodyTitle>
      <simplelist>
        <li id="uid81">
          <p noindent="true">
            <b>ITN Spartan</b>
          </p>
          <p noindent="true">Title: Sparse Representations and Compressed Sensing Training Network</p>
          <p noindent="true">Type: FP7</p>
          <p noindent="true">Instrument: Initial Training Network</p>
          <p noindent="true">Duration: October 2014 to October 2018</p>
          <p noindent="true">Coordinator: Mark Plumbley (University of Surrey)</p>
          <p noindent="true">Inria contact: Francis Bach</p>
          <p noindent="true">Abstract: The SpaRTaN Initial Training Network will train a new generation of interdisciplinary
researchers in sparse representations and compressed sensing, contributing to Europe’s leading role
in scientific innovation. By bringing together leading academic and industry groups with expertise in
sparse representations, compressed sensing, machine learning and optimisation, and with an interest
in applications such as hyperspectral imaging, audio signal processing and video analytics, this
project will create an interdisciplinary, trans-national and inter-sectorial training network to enhance
mobility and training of researchers in this area. SpaRTaN is funded under the FP7-PEOPLE-2013-ITN
call and is part of the Marie Curie Actions — Initial Training Networks (ITN) funding scheme:
Project number - 607290</p>
        </li>
        <li id="uid82">
          <p noindent="true">
            <b>ITN Macsenet</b>
          </p>
          <p noindent="true">Title: Machine Sensing Training Network</p>
          <p noindent="true">Type: H2020</p>
          <p noindent="true">Instrument: Initial Training Network</p>
          <p noindent="true">Duration: January 2015 - January 2019</p>
          <p noindent="true">Coordinator: Mark Plumbley (University of Surrey)</p>
          <p noindent="true">Inria contact: Francis Bach</p>
          <p noindent="true">Abstract: The aim of this Innovative Training Network is to train a new generation of creative,
entrepreneurial and innovative early stage researchers (ESRs) in the research area of measurement
and estimation of signals using knowledge or data about the underlying structure. We will develop
new robust and efficient Machine Sensing theory and algorithms, together with methods for a wide range
of signals, including: advanced brain imaging; inverse imaging problems; audio and music signals;
and non-traditional signals such as signals on graphs. We will apply these methods to real-world
problems, through work with non-Academic partners, and disseminate the results of this research
to a wide range of academic and non-academic audiences, including through publications, data,
software and public engagement events. MacSeNet is funded under the H2020-MSCA-ITN-2014
call and is part of the Marie Skłodowska-Curie Actions — Innovative Training Networks (ITN)
funding scheme.</p>
        </li>
        <li id="uid83">
          <p noindent="true"><b>ERC Sequoia</b>
Title: Robust algorithms for learning from modern data</p>
          <p noindent="true">Programm: H2020</p>
          <p noindent="true">Type: ERC</p>
          <p noindent="true">Duration: 2017-2022</p>
          <p noindent="true">Coordinator: Inria</p>
          <p noindent="true">Inria contact: Francis Bach</p>
          <p noindent="true">Abstract: Machine learning is needed and used everywhere, from science to industry, with a growing
impact on many disciplines. While first successes were due at least in part to simple supervised
learning algorithms used primarily as black boxes on medium-scale problems, modern data pose
new challenges. Scalability is an important issue of course: with large amounts of data, many
current problems far exceed the capabilities of existing algorithms despite sophisticated computing
architectures. But beyond this, the core classical model of supervised machine learning, with
the usual assumptions of independent and identically distributed data, or well-defined features,
outputs and loss functions, has reached its theoretical and practical limits. Given this new setting,
existing optimization-based algorithms are not adapted. The main objective of this project is to
push the frontiers of supervised machine learning, in terms of (a) scalability to data with massive numbers of observations, features, and tasks, (b) adaptability to modern computing environments,
in particular for parallel and distributed processing, (c) provable adaptivity and robustness to
problem and hardware specifications, and (d) robustness to non-convexities inherent in machine
learning problems. To achieve the expected breakthroughs, we will design a novel generation of
learning algorithms amenable to a tight convergence analysis with realistic assumptions and efficient
implementations. They will help transition machine learning algorithms towards the same
widespread robust use as numerical linear algebra libraries. Outcomes of the research described in this
proposal will include algorithms that come with strong convergence guarantees and are well-tested
on real-life benchmarks coming from computer vision, bioinformatics, audio processing and natural
language processing. For both distributed and non-distributed settings, we will release open-source
software, adapted to widely available computing platforms.</p>
        </li>
      </simplelist>
    </subsection>
    <subsection id="uid84" level="1">
      <bodyTitle>International Initiatives</bodyTitle>
      <subsection id="uid85" level="2">
        <bodyTitle>
          <ref xlink:href="http://mllab.csa.iisc.ernet.in/indo-french.html" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">BigFOKS2 </ref>
        </bodyTitle>
        <sanspuceslist>
          <li id="uid86">
            <p noindent="true">Title: Learning from Big Data: First-Order methods for Kernels and Submodular functions</p>
          </li>
          <li id="uid87">
            <p noindent="true">International Partner (Institution - Laboratory - Researcher):</p>
            <sanspuceslist>
              <li id="uid88">
                <p noindent="true">IISc Bangalore (India)
- Computer Science Department - Chiranjib Bhattacharyya</p>
              </li>
            </sanspuceslist>
          </li>
          <li id="uid89">
            <p noindent="true">Start year: 2016</p>
          </li>
          <li id="uid90">
            <p noindent="true">See also: <ref xlink:href="http://mllab.csa.iisc.ernet.in/indo-french.html" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">mllab.<allowbreak/>csa.<allowbreak/>iisc.<allowbreak/>ernet.<allowbreak/>in/<allowbreak/>indo-french.<allowbreak/>html</ref></p>
          </li>
          <li id="uid91">
            <p noindent="true">Recent advances in sensor technologies have resulted in large amounts of data being generated in a wide array of scientific disciplines. Deriving models from such large datasets, often known as “Big Data”, is one of the important challenges facing many engineering and scientific disciplines. In this proposal we investigate the problem of learning supervised models from Big Data, which has immediate applications in Computational Biology, Computer vision, Natural language processing, Web, E-commerce, etc., where specific structure is often present and hard to take into account with current algorithms. Our focus will be on the algorithmic aspects. Often supervised learning problems can be cast as convex programs. The goal of this proposal will be to derive first-order methods which can be effective for solving such convex programs arising in the Big-Data setting. Keeping this broad goal in mind we investigate two foundational problems which are not well addressed in existing literature. The first problem investigates Stochastic Gradient Descent Algorithms in the context of First-order methods for designing algorithms for Kernel based prediction functions on Large Datasets. The second problem involves solving discrete optimization problems arising in Submodular formulations in Machine Learning, for which first-order methods have not reached the level of speed required for practical applications (notably in computer vision).</p>
          </li>
        </sanspuceslist>
      </subsection>
    </subsection>
    <subsection id="uid92" level="1">
      <bodyTitle>International Research Visitors</bodyTitle>
      <simplelist>
        <li id="uid93">
          <p noindent="true">Vijaya Bollapragada from Northwestern University, Chicago, IL, United States, Apr - Jul 2018.</p>
        </li>
        <li id="uid94">
          <p noindent="true">Aaron De Fazio from Facebook Research NY, New York, United States, Feb 2018.</p>
        </li>
        <li id="uid95">
          <p noindent="true">Gauthier Gidel from University of Montreal - MILA, Montreal, Canada, Jan 2018.</p>
        </li>
        <li id="uid96">
            <p noindent="true">Sharan Vaswani from University of British Columbia, Vancouver, Canada, Apr - Jul 2018.</p>
        </li>
        <li id="uid97">
          <p noindent="true">Simon Lacoste-Julien from University of Montreal - MILA, Montreal, Canada, Aug 2018.</p>
        </li>
      </simplelist>
    </subsection>
  </partenariat>
  <diffusion id="uid98">
    <bodyTitle>Dissemination</bodyTitle>
    <subsection id="uid99" level="1">
      <bodyTitle>Promoting Scientific Activities</bodyTitle>
      <subsection id="uid100" level="2">
        <bodyTitle>Scientific Events Organisation</bodyTitle>
        <subsection id="uid101" level="3">
          <bodyTitle>General Chair, Scientific Chair</bodyTitle>
          <sanspuceslist>
            <li id="uid102">
              <p noindent="true">F. Bach: General Chair of ICML 2018 (Stockholm)</p>
            </li>
          </sanspuceslist>
        </subsection>
        <subsection id="uid103" level="3">
          <bodyTitle>Member of the Organizing Committees</bodyTitle>
          <sanspuceslist>
            <li id="uid104">
              <p noindent="true">Adrian Taylor, Session Organizer: <i>Computer-assisted analyses of optimization algorithms I &amp; II</i>, International Symposium on Mathematical Programming, July 2018.</p>
            </li>
            <li id="uid105">
              <p noindent="true">F. Bach: Co-organization of the workshop “Horizon Maths 2018 : Intelligence Artificielle”, November 23, 2018</p>
            </li>
          </sanspuceslist>
        </subsection>
      </subsection>
      <subsection id="uid106" level="2">
        <bodyTitle>Scientific Events Selection</bodyTitle>
        <subsection id="uid107" level="3">
          <bodyTitle>Chair of Conference Program Committees</bodyTitle>
          <sanspuceslist>
            <li id="uid108">
              <p noindent="true">F. Bach: Program Chair of the Journées de Statistiques (Saclay)</p>
            </li>
          </sanspuceslist>
        </subsection>
        <subsection id="uid109" level="3">
          <bodyTitle>Reviewer</bodyTitle>
          <sanspuceslist>
            <li id="uid110">
              <p noindent="true">Conference on Learning Theory (COLT 2018): Pierre Gaillard, Alessandro Rudi</p>
            </li>
            <li id="uid111">
              <p noindent="true">Symposium on Discrete Algorithms (SODA 2019): Adrien Taylor</p>
            </li>
            <li id="uid112">
              <p noindent="true">Neural Information Processing Systems (NIPS 2018): Pierre Gaillard, Alessandro Rudi</p>
            </li>
            <li id="uid113">
              <p noindent="true">Conference on Learning Theory (COLT 2018): Pierre Gaillard, Alessandro Rudi, Adrien Taylor</p>
            </li>
            <li id="uid114">
              <p noindent="true">Symposium on Discrete Algorithms (SODA 2019): Adrien Taylor</p>
            </li>
            <li id="uid115">
              <p noindent="true">International Conference of Machine Learning (ICML 2018): Pierre Gaillard, Alessandro Rudi</p>
            </li>
          </sanspuceslist>
        </subsection>
      </subsection>
      <subsection id="uid116" level="2">
        <bodyTitle>Journal</bodyTitle>
        <subsection id="uid117" level="3">
          <bodyTitle>Member of the Editorial Boards</bodyTitle>
          <sanspuceslist>
            <li id="uid118">
              <p noindent="true">F. Bach: Journal of Machine Learning Research, co-editor-in-chief</p>
            </li>
            <li id="uid119">
              <p noindent="true">F. Bach: Information and Inference, Associate Editor.</p>
            </li>
            <li id="uid120">
              <p noindent="true">F. Bach: Electronic Journal of Statistics, Associate Editor.</p>
            </li>
            <li id="uid121">
              <p noindent="true">F. Bach: Mathematical Programming, Associate Editor.</p>
            </li>
            <li id="uid122">
              <p noindent="true">F. Bach: Foundations of Computational Mathematics, Associate Editor.</p>
            </li>
            <li id="uid123">
              <p noindent="true">A. d’Aspremont: SIAM Journal on Optimization, Associate editor</p>
            </li>
            <li id="uid124">
              <p noindent="true">A. d’Aspremont: SIAM Journal on the Mathematics of Data Science, Associate Editor</p>
            </li>
            <li id="uid125">
              <p noindent="true">A. d’Aspremont: Mathematical Programming, Associate Editor</p>
            </li>
          </sanspuceslist>
        </subsection>
        <subsection id="uid126" level="3">
          <bodyTitle>Reviewer - Reviewing Activities</bodyTitle>
          <sanspuceslist>
            <li id="uid127">
              <p noindent="true">SIAM Journal on Optimization: Adrien Taylor</p>
            </li>
            <li id="uid128">
              <p noindent="true">Mathematical Programming: Adrien Taylor</p>
            </li>
            <li id="uid129">
              <p noindent="true">Journal of Optimization Theory and Applications: Adrien Taylor</p>
            </li>
            <li id="uid130">
              <p noindent="true">Journal of Machine Learning Research: Pierre Gaillard, Alessandro Rudi</p>
            </li>
            <li id="uid131">
              <p noindent="true">Applied Computational Harmonic Analysis: Alessandro Rudi</p>
            </li>
          </sanspuceslist>
        </subsection>
      </subsection>
      <subsection id="uid132" level="2">
        <bodyTitle>Invited Talks</bodyTitle>
        <sanspuceslist>
          <li id="uid133">
            <p noindent="true">F. Bach, Trends in Optimization Seminar, University of Washington, November 2018.</p>
          </li>
          <li id="uid134">
            <p noindent="true">Pierre Gaillard. <i>Distributed averaging of observations in a graph: the gossip problem</i>. MNL Conference, Paris, November 2018.</p>
          </li>
          <li id="uid135">
            <p noindent="true">Adrien Taylor, <i>Analysis and design of first-order methods via semidefinite
programming</i>, Séminaire Parisien d'Optimisation (SPO), Paris (France), November 2018.</p>
          </li>
          <li id="uid136">
            <p noindent="true">F. Bach, Frontier Research and Artificial Intelligence, European Research Council, Brussels, October 2018.</p>
          </li>
          <li id="uid137">
            <p noindent="true">F. Bach, IDSS Distinguished Speaker Seminar, MIT, October 2018.</p>
          </li>
          <li id="uid138">
            <p noindent="true">F. Bach, Mathematical Institute Colloquium, Oxford, October 2018.</p>
          </li>
          <li id="uid139">
            <p noindent="true">Adrien Taylor, <i>Convex Interpolation and Performance Estimation of First-order
Methods for Convex Optimization</i>, IBM/FNRS innovation award, Brussels (Belgium), October 2018.</p>
          </li>
          <li id="uid140">
            <p noindent="true">F. Bach, Workshop on Structural Inference in High-Dimensional Models, Moscow, September 2018.</p>
          </li>
          <li id="uid141">
            <p noindent="true">F. Bach, Symposium on Mathematical Programming (ISMP), Bordeaux, plenary talk, July 2018.</p>
          </li>
          <li id="uid142">
            <p noindent="true">Alexandre d'Aspremont, <i>Sharpness, Restart and Compressed Sensing Performance</i>, ISMP 2018, Bordeaux, July 2018.</p>
          </li>
          <li id="uid143">
            <p noindent="true">Alessandro Rudi, <i>FALKON: An optimal method for large scale learning with statistical guarantees</i>, ISMP 2018, Bordeaux, July 2018.</p>
          </li>
          <li id="uid144">
            <p noindent="true">Adrien Taylor, <i>Computer-assisted Lyapunov-based worst-case analyses of first-order
methods</i>, International Symposium on Mathematical Programming, Bordeaux (France), July 2018.</p>
          </li>
          <li id="uid145">
            <p noindent="true">F. Bach, SIAM Conference on Imaging Science, Bologna, Italy, invited talk, June 2018.</p>
          </li>
          <li id="uid146">
            <p noindent="true">Pierre Gaillard. <i>Online prediction of arbitrary time-series with application to electricity consumption</i>. Conference on nonstationarity. Cergy Pontoise University. June 2018.</p>
          </li>
          <li id="uid147">
            <p noindent="true">Adrien Taylor, <i>Convex Interpolation and Performance Estimation of First-order Methods for Convex Optimization</i>, International Symposium on Mathematical Programming: Tucker prize finalist, Bordeaux (France), July 2018.</p>
          </li>
          <li id="uid148">
            <p noindent="true">Alexandre d'Aspremont, <i>An approximate Shapley-Folkman Theorem</i>, Isaac Newton Institute, Cambridge, June 2018.</p>
          </li>
          <li id="uid149">
            <p noindent="true">F. Bach, Workshop on Future challenges in statistical scalability, Newton Institute, Cambridge, UK, June 2018.</p>
          </li>
          <li id="uid150">
            <p noindent="true">Adrien Taylor, <i>Automated design of first-order optimization methods</i>, Operation Research Seminar, UCLouvain, Louvain-la-Neuve (Belgium), May 2018.</p>
          </li>
          <li id="uid151">
            <p noindent="true">Adrien Taylor, <i>Automated design of first-order optimization methods</i>, LCCC Control Seminar, Lund University, Lund (Sweden), May 2018.</p>
          </li>
          <li id="uid152">
            <p noindent="true">Pierre Gaillard. <i>Distributed learning with orthogonal polynomials</i>. Inria DGA meetup. May 2018.</p>
          </li>
          <li id="uid153">
            <p noindent="true">F. Bach, Workshop on Optimisation and Machine Learning in Economics, London, March 2018.</p>
          </li>
          <li id="uid154">
            <p noindent="true">Pierre Gaillard. <i>An overview of Artificial Intelligence</i>. Hackaton. PSL University. March 2018.</p>
          </li>
          <li id="uid155">
            <p noindent="true">Alexandre d'Aspremont, <i>Regularized Nonlinear Acceleration</i>, US and Mexico Workshop on Optimization and its Applications, Jan 2018.</p>
          </li>
          <li id="uid156">
            <p noindent="true">Alessandro Rudi, <i>Learning with Random Features</i>, Isaac Newton Institute, Cambridge, Jan 2018.</p>
          </li>
          <li id="uid157">
            <p noindent="true">Pierre Gaillard. <i>Online nonparametric regression with adversarial data.</i> Smile seminar. Paris. Jan 2018.</p>
          </li>
        </sanspuceslist>
      </subsection>
    </subsection>
    <subsection id="uid158" level="1">
      <bodyTitle>Teaching - Supervision - Juries</bodyTitle>
      <subsection id="uid159" level="2">
        <bodyTitle>Teaching</bodyTitle>
        <sanspuceslist>
          <li id="uid160">
            <p noindent="true">F. Bach (together with N. Chopin), <i>Graphical models</i>, 30h, Master M2 (MVA), ENS Cachan, France.</p>
          </li>
          <li id="uid161">
            <p noindent="true">F. Bach, <i>Optimisation et apprentissage statistique</i>, 20h, Master M2 (Mathématiques de l'aléatoire), Université Paris-Sud, France.</p>
          </li>
          <li id="uid162">
            <p noindent="true">Alexandre d'Aspremont, <i>Optimisation Combinatoire et Convexe</i>, avec Zhentao Li, (2015-Present) cours magistraux 30h, Master M1, ENS Paris.</p>
          </li>
          <li id="uid163">
            <p noindent="true">Alexandre d'Aspremont, <i>Optimisation convexe: modélisation, algorithmes et applications</i> cours magistraux 21h (2011-Present), Master M2 MVA, ENS PS.</p>
          </li>
          <li id="uid164">
            <p noindent="true">F. Bach and P. Gaillard, <i>Apprentissage statistique</i>, 35h, Master M1, Ecole Normale Supérieure, France.</p>
          </li>
          <li id="uid165">
            <p noindent="true">P. Gaillard (together with V. Perchet), <i>Prediction of individual sequences</i>, 21h, Master M2 MVA, ENS Cachan, France.</p>
          </li>
          <li id="uid166">
            <p noindent="true">Gregoire Mialon, Python for Machine Learning, 21h, M2 MASH, Dauphine-ENS-PSL, Paris.</p>
          </li>
        </sanspuceslist>
      </subsection>
      <subsection id="uid167" level="2">
        <bodyTitle>Supervision</bodyTitle>
        <sanspuceslist>
          <li id="uid168">
            <p noindent="true">Anaël Bonneton, PhD defended in July 2018, co-advised by Francis Bach, located in Agence nationale de la sécurité des systèmes d’information (ANSSI).</p>
          </li>
          <li id="uid169">
            <p noindent="true">Damien Scieur, PhD defended in September 2018. <i>Sur l'accélération des méthodes d’optimisation</i>, supervised by Alexandre d'Aspremont and Francis Bach.</p>
          </li>
          <li id="uid170">
            <p noindent="true">Jean-Baptiste Alayrac, PhD defended in September 2018, <i>Structured Learning from Videos and Language</i>,
supervised by Simon Lacoste-Julien, Josef Sivic and Ivan Laptev.</p>
          </li>
          <li id="uid171">
            <p noindent="true">Antoine Recanati, PhD defended in November 2018, <i>Application du problème de sériation au séquençage de l’ADN et autres relaxations convexes appliquées en bioinformatique</i>, supervised by Alexandre d'Aspremont.</p>
          </li>
          <li id="uid172">
            <p noindent="true">Rémi Leblond, PhD defended in November 2018, <i>Asynchronous Optimization for Machine Learning</i>, supervised by Simon Lacoste-Julien.</p>
          </li>
          <li id="uid173">
            <p noindent="true">Mathieu Barre, PhD in progress <i>Méthodes d'extrapolation, au-delà de la convexité</i>, supervised by Alexandre d'Aspremont.</p>
          </li>
          <li id="uid174">
            <p noindent="true">Grégoire Mialon, PhD in progress <i>Algorithmes d'optimisation, méthodes de régularisation et architectures pour les réseaux de neurones profonds dans un contexte où les données labellisées sont rares</i>, supervised by Alexandre d'Aspremont.</p>
          </li>
          <li id="uid175">
            <p noindent="true">Radu-Alexandru Dragomir, PhD in progress <i>Non-Euclidean first-order methods</i>, supervised by Alexandre d'Aspremont and Jérôme Bolte.</p>
          </li>
          <li id="uid176">
            <p noindent="true">Thomas Kerdreux, PhD in progress <i>Optimisation and machine learning</i>, supervised by Alexandre d'Aspremont.</p>
          </li>
          <li id="uid177">
            <p noindent="true">Margaux Brégère, PhD in progress started September 2017, supervised by Pierre Gaillard, Gilles Stoltz and Yannig Goude (EDF R&amp;D).</p>
          </li>
          <li id="uid178">
            <p noindent="true">Raphaël Berthier, PhD in progress started September 2017, supervised by Francis Bach and Pierre Gaillard.</p>
          </li>
          <li id="uid179">
            <p noindent="true">Loucas Pillaud-Vivien, PhD in progress, supervised by Francis Bach and Alessandro Rudi.</p>
          </li>
          <li id="uid180">
            <p noindent="true">Alex Nowak, PhD in progress, supervised by Francis Bach and Alessandro Rudi.</p>
          </li>
          <li id="uid181">
            <p noindent="true">Ulysse Marteau Ferey, PhD in progress, supervised by Francis Bach and Alessandro Rudi.</p>
          </li>
          <li id="uid182">
            <p noindent="true">Dmitry Babichev, PhD in progress, started in September 2015, co-advised by Francis Bach and Anatoly Judistky (Univ. Grenoble).</p>
          </li>
          <li id="uid183">
            <p noindent="true">Tatiana Shpakova, PhD in progress, started September 2015, advised by Francis Bach.</p>
          </li>
        </sanspuceslist>
      </subsection>
      <subsection id="uid184" level="2">
        <bodyTitle>Juries</bodyTitle>
        <sanspuceslist>
          <li id="uid185">
            <p noindent="true">Alexandre d'Aspremont, Habilitation à diriger des recherches of Thomas Bruls, Genoscope, Université d’Evry.</p>
          </li>
        </sanspuceslist>
      </subsection>
    </subsection>
    <subsection id="uid186" level="1">
      <bodyTitle>Popularization</bodyTitle>
      <subsection id="uid187" level="2">
        <bodyTitle>Creation of media or tools for science outreach</bodyTitle>
        <p>Design and implementation of a demonstration for the permanent exhibit at Palais de la Découverte: “L’apprenti illustrateur” (J.-B. Alayrac, F. Bach)
</p>
      </subsection>
    </subsection>
  </diffusion>
  <biblio id="bibliography" html="bibliography" numero="10" titre="Bibliography">
    
    <biblStruct id="sierra-2018-bid34" type="phdthesis" rend="year" n="cite:alayrac:tel-01885412">
      <identifiant type="hal" value="tel-01885412"/>
      <monogr>
        <title level="m">Structured Learning from Videos and Language</title>
        <author>
          <persName key="willow-2018-idp128128">
            <foreName>Jean-Baptiste</foreName>
            <surname>Alayrac</surname>
            <initial>J.-B.</initial>
          </persName>
        </author>
        <imprint>
          <publisher>
            <orgName type="school">Ecole normale supérieure - ENS PARIS</orgName>
          </publisher>
          <dateStruct>
            <month>September</month>
            <year>2018</year>
          </dateStruct>
          <ref xlink:href="https://hal.inria.fr/tel-01885412" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>hal.<allowbreak/>inria.<allowbreak/>fr/<allowbreak/>tel-01885412</ref>
        </imprint>
      </monogr>
      <note type="typdoc">Theses</note>
    </biblStruct>
    
    <biblStruct id="sierra-2018-bid36" type="phdthesis" rend="year" n="cite:beaugnon:tel-01888971">
      <identifiant type="hal" value="tel-01888971"/>
      <monogr>
        <title level="m">Expert-in-the-Loop Supervised Learning for Computer Security Detection Systems</title>
        <author>
          <persName>
            <foreName>Anaël</foreName>
            <surname>Beaugnon</surname>
            <initial>A.</initial>
          </persName>
        </author>
        <imprint>
          <publisher>
            <orgName type="school">PSL Research University</orgName>
          </publisher>
          <dateStruct>
            <month>June</month>
            <year>2018</year>
          </dateStruct>
          <ref xlink:href="https://hal.archives-ouvertes.fr/tel-01888971" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>hal.<allowbreak/>archives-ouvertes.<allowbreak/>fr/<allowbreak/>tel-01888971</ref>
        </imprint>
      </monogr>
      <note type="typdoc">Theses</note>
    </biblStruct>
    
    <biblStruct id="sierra-2018-bid22" type="phdthesis" rend="year" n="cite:leblond:tel-01950576">
      <identifiant type="hal" value="tel-01950576"/>
      <monogr>
        <title level="m">Asynchronous Optimization for Machine Learning</title>
        <author>
          <persName key="sierra-2018-idp121136">
            <foreName>Rémi</foreName>
            <surname>Leblond</surname>
            <initial>R.</initial>
          </persName>
        </author>
        <imprint>
          <publisher>
            <orgName type="school">Ecole Normale Superieure de Paris - ENS Paris</orgName>
          </publisher>
          <dateStruct>
            <month>November</month>
            <year>2018</year>
          </dateStruct>
          <ref xlink:href="https://hal.inria.fr/tel-01950576" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>hal.<allowbreak/>inria.<allowbreak/>fr/<allowbreak/>tel-01950576</ref>
        </imprint>
      </monogr>
      <note type="typdoc">Theses</note>
    </biblStruct>
    
    <biblStruct id="sierra-2018-bid51" type="phdthesis" rend="year" n="cite:recanati:tel-01984368">
      <identifiant type="hal" value="tel-01984368"/>
      <monogr>
        <title level="m">Relaxations of the Seriation problem and applications to de novo genome assembly</title>
        <author>
          <persName key="sierra-2018-idp160496">
            <foreName>Antoine</foreName>
            <surname>Recanati</surname>
            <initial>A.</initial>
          </persName>
        </author>
        <imprint>
          <publisher>
            <orgName type="school">PSL Research University</orgName>
          </publisher>
          <dateStruct>
            <month>November</month>
            <year>2018</year>
          </dateStruct>
          <ref xlink:href="https://hal.archives-ouvertes.fr/tel-01984368" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>hal.<allowbreak/>archives-ouvertes.<allowbreak/>fr/<allowbreak/>tel-01984368</ref>
        </imprint>
      </monogr>
      <note type="typdoc">Theses</note>
    </biblStruct>
    
    <biblStruct id="sierra-2018-bid39" type="phdthesis" rend="year" n="cite:scieur:tel-01887163">
      <identifiant type="hal" value="tel-01887163"/>
      <monogr>
        <title level="m">Acceleration in Optimization</title>
        <author>
          <persName key="sierra-2018-idp162928">
            <foreName>Damien</foreName>
            <surname>Scieur</surname>
            <initial>D.</initial>
          </persName>
        </author>
        <imprint>
          <publisher>
            <orgName type="school">PSL Research University</orgName>
          </publisher>
          <dateStruct>
            <month>September</month>
            <year>2018</year>
          </dateStruct>
          <ref xlink:href="https://hal.archives-ouvertes.fr/tel-01887163" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>hal.<allowbreak/>archives-ouvertes.<allowbreak/>fr/<allowbreak/>tel-01887163</ref>
        </imprint>
      </monogr>
      <note type="typdoc">Theses</note>
    </biblStruct>
    
    <biblStruct id="sierra-2018-bid4" type="article" rend="year" n="cite:babichev:hal-01388498">
      <identifiant type="doi" value="10.1214/18-EJS1428"/>
      <identifiant type="hal" value="hal-01388498"/>
      <analytic>
        <title level="a">Slice inverse regression with score functions</title>
        <author>
          <persName key="sierra-2018-idp135920">
            <foreName>Dmitry</foreName>
            <surname>Babichev</surname>
            <initial>D.</initial>
          </persName>
          <persName key="sierra-2018-idp112912">
            <foreName>Francis</foreName>
            <surname>Bach</surname>
            <initial>F.</initial>
          </persName>
        </author>
      </analytic>
      <monogr x-scientific-popularization="no" x-editorial-board="yes" x-international-audience="yes" id="rid00518">
        <idno type="issn">1935-7524</idno>
        <title level="j">Electronic journal of statistics </title>
        <imprint>
          <biblScope type="volume">Volume 12, Number 1 (2018)</biblScope>
          <dateStruct>
            <month>May</month>
            <year>2018</year>
          </dateStruct>
          <biblScope type="pages">1507-1543</biblScope>
          <ref xlink:href="https://hal.inria.fr/hal-01388498" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>hal.<allowbreak/>inria.<allowbreak/>fr/<allowbreak/>hal-01388498</ref>
        </imprint>
      </monogr>
    </biblStruct>
    
    <biblStruct id="sierra-2018-bid43" type="article" rend="year" n="cite:bach:hal-01222319">
      <identifiant type="hal" value="hal-01222319"/>
      <analytic>
        <title level="a">Submodular Functions: from Discrete to Continuous Domains</title>
        <author>
          <persName key="sierra-2018-idp112912">
            <foreName>Francis</foreName>
            <surname>Bach</surname>
            <initial>F.</initial>
          </persName>
        </author>
      </analytic>
      <monogr x-scientific-popularization="no" x-editorial-board="yes" x-international-audience="yes" id="rid01393">
        <idno type="issn">0025-5610</idno>
        <title level="j">Mathematical Programming, Series A</title>
        <imprint>
          <dateStruct>
            <year>2018</year>
          </dateStruct>
          <ref xlink:href="https://hal.archives-ouvertes.fr/hal-01222319" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>hal.<allowbreak/>archives-ouvertes.<allowbreak/>fr/<allowbreak/>hal-01222319</ref>
        </imprint>
      </monogr>
      <note type="bnote">
        <ref xlink:href="https://arxiv.org/abs/1511.00394" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>arxiv.<allowbreak/>org/<allowbreak/>abs/<allowbreak/>1511.<allowbreak/>00394</ref>
      </note>
    </biblStruct>
    
    <biblStruct id="sierra-2018-bid53" type="article" rend="year" n="cite:daspremont:hal-01927392">
      <identifiant type="doi" value="10.1137/17M1116842"/>
      <identifiant type="hal" value="hal-01927392"/>
      <analytic>
        <title level="a">Optimal Affine-Invariant Smooth Minimization Algorithms</title>
        <author>
          <persName>
            <foreName>Alexandre</foreName>
            <surname>D'Aspremont</surname>
            <initial>A.</initial>
          </persName>
          <persName>
            <foreName>Cristobal</foreName>
            <surname>Guzman</surname>
            <initial>C.</initial>
          </persName>
          <persName>
            <foreName>Martin</foreName>
            <surname>Jaggi</surname>
            <initial>M.</initial>
          </persName>
        </author>
      </analytic>
      <monogr x-scientific-popularization="no" x-editorial-board="yes" x-international-audience="yes" id="rid01738">
        <idno type="issn">1052-6234</idno>
        <title level="j">SIAM Journal on Optimization</title>
        <imprint>
          <biblScope type="volume">28</biblScope>
          <biblScope type="number">3</biblScope>
          <dateStruct>
            <month>July</month>
            <year>2018</year>
          </dateStruct>
          <biblScope type="pages">2384 - 2405</biblScope>
          <ref xlink:href="https://hal.archives-ouvertes.fr/hal-01927392" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>hal.<allowbreak/>archives-ouvertes.<allowbreak/>fr/<allowbreak/>hal-01927392</ref>
        </imprint>
      </monogr>
    </biblStruct>
    
    <biblStruct id="sierra-2018-bid54" type="article" rend="year" n="cite:garreau:hal-01416704">
      <identifiant type="hal" value="hal-01416704"/>
      <analytic>
        <title level="a">Consistent change-point detection with kernels</title>
        <author>
          <persName>
            <foreName>Damien</foreName>
            <surname>Garreau</surname>
            <initial>D.</initial>
          </persName>
          <persName key="select-2018-idp148944">
            <foreName>Sylvain</foreName>
            <surname>Arlot</surname>
            <initial>S.</initial>
          </persName>
        </author>
      </analytic>
      <monogr x-scientific-popularization="no" x-editorial-board="yes" x-international-audience="yes" id="rid00518">
        <idno type="issn">1935-7524</idno>
        <title level="j">Electronic journal of statistics </title>
        <imprint>
          <biblScope type="volume">12</biblScope>
          <biblScope type="number">2</biblScope>
          <dateStruct>
            <month>December</month>
            <year>2018</year>
          </dateStruct>
          <biblScope type="pages">4440-4486</biblScope>
          <ref xlink:href="https://hal.archives-ouvertes.fr/hal-01416704" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>hal.<allowbreak/>archives-ouvertes.<allowbreak/>fr/<allowbreak/>hal-01416704</ref>
        </imprint>
      </monogr>
      <note type="bnote">
        <ref xlink:href="https://arxiv.org/abs/1612.04740" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>arxiv.<allowbreak/>org/<allowbreak/>abs/<allowbreak/>1612.<allowbreak/>04740</ref>
      </note>
    </biblStruct>
    
    <biblStruct id="sierra-2018-bid21" type="article" rend="year" n="cite:leblond:hal-01950558">
      <identifiant type="hal" value="hal-01950558"/>
      <analytic>
        <title level="a">Improved asynchronous parallel optimization analysis for stochastic incremental methods</title>
        <author>
          <persName key="sierra-2018-idp121136">
            <foreName>Rémi</foreName>
            <surname>Leblond</surname>
            <initial>R.</initial>
          </persName>
          <persName>
            <foreName>Fabian</foreName>
            <surname>Pedregosa</surname>
            <initial>F.</initial>
          </persName>
          <persName key="sierra-2018-idp210288">
            <foreName>Simon</foreName>
            <surname>Lacoste-Julien</surname>
            <initial>S.</initial>
          </persName>
        </author>
      </analytic>
      <monogr x-scientific-popularization="no" x-editorial-board="yes" x-international-audience="yes" id="rid01187">
        <idno type="issn">1532-4435</idno>
        <title level="j">Journal of Machine Learning Research (JMLR)</title>
        <imprint>
          <dateStruct>
            <year>2018</year>
          </dateStruct>
          <ref xlink:href="https://hal.inria.fr/hal-01950558" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>hal.<allowbreak/>inria.<allowbreak/>fr/<allowbreak/>hal-01950558</ref>
        </imprint>
      </monogr>
    </biblStruct>
    
    <biblStruct id="sierra-2018-bid19" type="article" rend="year" n="cite:lelievre:hal-01812120">
      <identifiant type="doi" value="10.30757/ALEA.v15-43"/>
      <identifiant type="hal" value="hal-01812120"/>
      <analytic>
        <title level="a">Central Limit Theorem for stationary Fleming–Viot particle systems in finite spaces</title>
        <author>
          <persName key="matherials-2018-idp130960">
            <foreName>Tony</foreName>
            <surname>Lelievre</surname>
            <initial>T.</initial>
          </persName>
          <persName>
            <foreName>Loucas</foreName>
            <surname>Pillaud-Vivien</surname>
            <initial>L.</initial>
          </persName>
          <persName>
            <foreName>Julien</foreName>
            <surname>Reygner</surname>
            <initial>J.</initial>
          </persName>
        </author>
      </analytic>
      <monogr x-scientific-popularization="no" x-editorial-board="yes" x-international-audience="yes" id="rid01336">
        <idno type="issn">1980-0436</idno>
        <title level="j">ALEA : Latin American Journal of Probability and Mathematical Statistics</title>
        <imprint>
          <biblScope type="volume">15</biblScope>
          <dateStruct>
            <month>September</month>
            <year>2018</year>
          </dateStruct>
          <biblScope type="pages">1163-1182</biblScope>
          <ref xlink:href="https://hal-enpc.archives-ouvertes.fr/hal-01812120" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>hal-enpc.<allowbreak/>archives-ouvertes.<allowbreak/>fr/<allowbreak/>hal-01812120</ref>
        </imprint>
      </monogr>
      <note type="bnote">
        <ref xlink:href="https://arxiv.org/abs/1806.04490" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>arxiv.<allowbreak/>org/<allowbreak/>abs/<allowbreak/>1806.<allowbreak/>04490</ref>
      </note>
    </biblStruct>
    
    <biblStruct id="sierra-2018-bid28" type="article" rend="year" n="cite:lin:hal-01958890">
      <identifiant type="hal" value="hal-01958890"/>
      <analytic>
        <title level="a">Optimal rates for spectral algorithms with least-squares regression over Hilbert spaces</title>
        <author>
          <persName>
            <foreName>Junhong</foreName>
            <surname>Lin</surname>
            <initial>J.</initial>
          </persName>
          <persName key="sierra-2018-idp123584">
            <foreName>Alessandro</foreName>
            <surname>Rudi</surname>
            <initial>A.</initial>
          </persName>
          <persName>
            <foreName>Lorenzo</foreName>
            <surname>Rosasco</surname>
            <initial>L.</initial>
          </persName>
          <persName>
            <foreName>Volkan</foreName>
            <surname>Cevher</surname>
            <initial>V.</initial>
          </persName>
        </author>
      </analytic>
      <monogr x-scientific-popularization="no" x-editorial-board="yes" x-international-audience="yes" id="rid00177">
        <idno type="issn">1063-5203</idno>
        <title level="j">Applied and Computational Harmonic Analysis</title>
        <imprint>
          <dateStruct>
            <month>October</month>
            <year>2018</year>
          </dateStruct>
          <ref xlink:href="https://hal.inria.fr/hal-01958890" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>hal.<allowbreak/>inria.<allowbreak/>fr/<allowbreak/>hal-01958890</ref>
        </imprint>
      </monogr>
    </biblStruct>
    
    <biblStruct id="sierra-2018-bid55" type="article" rend="year" n="cite:schatz:hal-01888735">
      <identifiant type="doi" value="10.1121/1.5037615"/>
      <identifiant type="hal" value="hal-01888735"/>
      <analytic>
        <title level="a">Evaluating automatic speech recognition systems as quantitative models of cross-lingual phonetic category perception</title>
        <author>
          <persName>
            <foreName>Thomas</foreName>
            <surname>Schatz</surname>
            <initial>T.</initial>
          </persName>
          <persName key="sierra-2018-idp112912">
            <foreName>Francis</foreName>
            <surname>Bach</surname>
            <initial>F.</initial>
          </persName>
          <persName key="coml-2018-idp149024">
            <foreName>Emmanuel</foreName>
            <surname>Dupoux</surname>
            <initial>E.</initial>
          </persName>
        </author>
      </analytic>
      <monogr x-scientific-popularization="no" x-editorial-board="yes" x-international-audience="yes" id="rid01300">
        <idno type="issn">0001-4966</idno>
        <title level="j">Journal of the Acoustical Society of America</title>
        <imprint>
          <biblScope type="volume">143</biblScope>
          <biblScope type="number">5</biblScope>
          <dateStruct>
            <month>May</month>
            <year>2018</year>
          </dateStruct>
          <biblScope type="pages">EL372 - EL378</biblScope>
          <ref xlink:href="https://hal.archives-ouvertes.fr/hal-01888735" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>hal.<allowbreak/>archives-ouvertes.<allowbreak/>fr/<allowbreak/>hal-01888735</ref>
        </imprint>
      </monogr>
    </biblStruct>
    
    <biblStruct id="sierra-2018-bid5" type="inproceedings" rend="year" n="cite:babichev:hal-01929810">
      <identifiant type="hal" value="hal-01929810"/>
      <analytic>
        <title level="a">Constant Step Size Stochastic Gradient Descent for Probabilistic Modeling</title>
        <author>
          <persName key="sierra-2018-idp135920">
            <foreName>Dmitry</foreName>
            <surname>Babichev</surname>
            <initial>D.</initial>
          </persName>
          <persName key="sierra-2018-idp112912">
            <foreName>Francis</foreName>
            <surname>Bach</surname>
            <initial>F.</initial>
          </persName>
        </author>
      </analytic>
      <monogr x-scientific-popularization="no" x-international-audience="yes" x-proceedings="yes" x-invited-conference="no" x-editorial-board="yes">
        <title level="m">UAI 2018 - Conference on Uncertainty in Artificial Intelligence</title>
        <loc>Monterey, United States</loc>
        <imprint>
          <dateStruct>
            <month>August</month>
            <year>2018</year>
          </dateStruct>
          <ref xlink:href="https://hal.inria.fr/hal-01929810" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>hal.<allowbreak/>inria.<allowbreak/>fr/<allowbreak/>hal-01929810</ref>
        </imprint>
        <meeting id="cid49628">
          <title>Conference on Uncertainty in Artificial Intelligence</title>
          <num>2018</num>
          <abbr type="sigle">UAI</abbr>
        </meeting>
      </monogr>
      <note type="bnote">
        <ref xlink:href="https://arxiv.org/abs/1804.05567" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>arxiv.<allowbreak/>org/<allowbreak/>abs/<allowbreak/>1804.<allowbreak/>05567</ref>
      </note>
    </biblStruct>
    
    <biblStruct id="sierra-2018-bid50" type="inproceedings" rend="year" n="cite:bach:hal-01569934">
      <identifiant type="hal" value="hal-01569934"/>
      <analytic>
        <title level="a">Efficient Algorithms for Non-convex Isotonic Regression through Submodular Optimization</title>
        <author>
          <persName key="sierra-2018-idp112912">
            <foreName>Francis</foreName>
            <surname>Bach</surname>
            <initial>F.</initial>
          </persName>
        </author>
      </analytic>
      <monogr x-scientific-popularization="no" x-international-audience="yes" x-proceedings="no" x-invited-conference="no" x-editorial-board="yes">
        <title level="m">Advances in Neural Information Processing Systems</title>
        <loc>Montreal, Canada</loc>
        <imprint>
          <dateStruct>
            <month>December</month>
            <year>2018</year>
          </dateStruct>
          <ref xlink:href="https://hal.archives-ouvertes.fr/hal-01569934" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>hal.<allowbreak/>archives-ouvertes.<allowbreak/>fr/<allowbreak/>hal-01569934</ref>
        </imprint>
        <meeting id="cid29560">
          <title>Annual Conference on Neural Information Processing Systems</title>
          <num>21</num>
          <abbr type="sigle">NIPS</abbr>
        </meeting>
      </monogr>
      <note type="bnote">
        <ref xlink:href="https://arxiv.org/abs/1707.09157" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>arxiv.<allowbreak/>org/<allowbreak/>abs/<allowbreak/>1707.<allowbreak/>09157</ref>
      </note>
    </biblStruct>
    
    <biblStruct id="sierra-2018-bid35" type="inproceedings" rend="year" n="cite:beaugnon:hal-01888983">
      <identifiant type="hal" value="hal-01888983"/>
      <analytic>
        <title level="a">End-to-End Active Learning for Computer Security Experts</title>
        <author>
          <persName>
            <foreName>Anaël</foreName>
            <surname>Beaugnon</surname>
            <initial>A.</initial>
          </persName>
          <persName>
            <foreName>Pierre</foreName>
            <surname>Chifflier</surname>
            <initial>P.</initial>
          </persName>
          <persName key="sierra-2018-idp112912">
            <foreName>Francis</foreName>
            <surname>Bach</surname>
            <initial>F.</initial>
          </persName>
        </author>
      </analytic>
      <monogr x-scientific-popularization="no" x-international-audience="yes" x-proceedings="no" x-invited-conference="no" x-editorial-board="yes">
        <title level="m">KDD Workshop on Interactive Data Exploration and Analytics (IDEA)</title>
        <loc>Londres, United Kingdom</loc>
        <imprint>
          <dateStruct>
            <month>August</month>
            <year>2018</year>
          </dateStruct>
          <ref xlink:href="https://hal.archives-ouvertes.fr/hal-01888983" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>hal.<allowbreak/>archives-ouvertes.<allowbreak/>fr/<allowbreak/>hal-01888983</ref>
        </imprint>
        <meeting id="cid626032">
          <title>Workshop on Interactive Data Exploration and Analytics</title>
          <num>2018</num>
          <abbr type="sigle">IDEA</abbr>
        </meeting>
      </monogr>
    </biblStruct>
    
    <biblStruct id="sierra-2018-bid45" type="inproceedings" rend="year" n="cite:beaugnon:hal-01888976">
      <identifiant type="hal" value="hal-01888976"/>
      <analytic>
        <title level="a">End-to-End Active Learning for Computer Security Experts</title>
        <author>
          <persName>
            <foreName>Anaël</foreName>
            <surname>Beaugnon</surname>
            <initial>A.</initial>
          </persName>
          <persName>
            <foreName>Pierre</foreName>
            <surname>Chifflier</surname>
            <initial>P.</initial>
          </persName>
          <persName key="sierra-2018-idp112912">
            <foreName>Francis</foreName>
            <surname>Bach</surname>
            <initial>F.</initial>
          </persName>
        </author>
      </analytic>
      <monogr x-scientific-popularization="no" x-international-audience="yes" x-proceedings="no" x-invited-conference="no" x-editorial-board="yes">
        <title level="m">AAAI Workshop on Artificial Intelligence for Cyber Security (AICS)</title>
        <loc>New Orleans, United States</loc>
        <imprint>
          <dateStruct>
            <month>February</month>
            <year>2018</year>
          </dateStruct>
          <ref xlink:href="https://hal.archives-ouvertes.fr/hal-01888976" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>hal.<allowbreak/>archives-ouvertes.<allowbreak/>fr/<allowbreak/>hal-01888976</ref>
        </imprint>
        <meeting id="cid626033">
          <title>Workshop on Artificial Intelligence for Cyber Security</title>
          <num>2018</num>
          <abbr type="sigle">AICS</abbr>
        </meeting>
      </monogr>
    </biblStruct>
    
    <biblStruct id="sierra-2018-bid30" type="inproceedings" rend="year" n="cite:carratino:hal-01958906">
      <identifiant type="hal" value="hal-01958906"/>
      <analytic>
        <title level="a">Learning with SGD and Random Features</title>
        <author>
          <persName>
            <foreName>Luigi</foreName>
            <surname>Carratino</surname>
            <initial>L.</initial>
          </persName>
          <persName key="sierra-2018-idp123584">
            <foreName>Alessandro</foreName>
            <surname>Rudi</surname>
            <initial>A.</initial>
          </persName>
          <persName>
            <foreName>Lorenzo</foreName>
            <surname>Rosasco</surname>
            <initial>L.</initial>
          </persName>
        </author>
      </analytic>
      <monogr x-scientific-popularization="no" x-international-audience="yes" x-proceedings="yes" x-invited-conference="no" x-editorial-board="yes">
        <title level="m">Advances in Neural Information Processing Systems</title>
        <loc>Montreal, Canada</loc>
        <imprint>
          <dateStruct>
            <month>December</month>
            <year>2018</year>
          </dateStruct>
          <biblScope type="pages">10213–10224</biblScope>
          <ref xlink:href="https://hal.archives-ouvertes.fr/hal-01958906" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>hal.<allowbreak/>archives-ouvertes.<allowbreak/>fr/<allowbreak/>hal-01958906</ref>
        </imprint>
        <meeting id="cid29560">
          <title>Annual Conference on Neural Information Processing Systems</title>
          <num>21</num>
          <abbr type="sigle">NIPS</abbr>
        </meeting>
      </monogr>
      <note type="bnote"><ref xlink:href="https://arxiv.org/abs/1807.06343" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>arxiv.<allowbreak/>org/<allowbreak/>abs/<allowbreak/>1807.<allowbreak/>06343</ref> - Spotlight</note>
    </biblStruct>
    
    <biblStruct id="sierra-2018-bid0" type="inproceedings" rend="year" n="cite:chizat:hal-01798792">
      <identifiant type="hal" value="hal-01798792"/>
      <analytic>
        <title level="a">On the Global Convergence of Gradient Descent for Over-parameterized Models using Optimal Transport</title>
        <author>
          <persName key="sierra-2018-idp126064">
            <foreName>Lenaic</foreName>
            <surname>Chizat</surname>
            <initial>L.</initial>
          </persName>
          <persName key="sierra-2018-idp112912">
            <foreName>Francis</foreName>
            <surname>Bach</surname>
            <initial>F.</initial>
          </persName>
        </author>
      </analytic>
      <monogr x-scientific-popularization="no" x-international-audience="yes" x-proceedings="no" x-invited-conference="no" x-editorial-board="yes">
        <title level="m">Advances in Neural Information Processing Systems (NIPS)</title>
        <loc>Montréal, Canada</loc>
        <imprint>
          <dateStruct>
            <month>December</month>
            <year>2018</year>
          </dateStruct>
          <ref xlink:href="https://hal.archives-ouvertes.fr/hal-01798792" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>hal.<allowbreak/>archives-ouvertes.<allowbreak/>fr/<allowbreak/>hal-01798792</ref>
        </imprint>
        <meeting id="cid29560">
          <title>Annual Conference on Neural Information Processing Systems</title>
          <num>32</num>
          <abbr type="sigle">NIPS</abbr>
        </meeting>
      </monogr>
      <note type="bnote">
        <ref xlink:href="https://arxiv.org/abs/1805.09545" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>arxiv.<allowbreak/>org/<allowbreak/>abs/<allowbreak/>1805.<allowbreak/>09545</ref>
      </note>
    </biblStruct>
    
    <biblStruct id="sierra-2018-bid52" type="inproceedings" rend="year" n="cite:defossez:hal-01899949">
      <identifiant type="hal" value="hal-01899949"/>
      <analytic>
        <title level="a">SING: Symbol-to-Instrument Neural Generator</title>
        <author>
          <persName key="sierra-2018-idp148176">
            <foreName>Alexandre</foreName>
            <surname>Défossez</surname>
            <initial>A.</initial>
          </persName>
          <persName key="coml-2018-idp164544">
            <foreName>Neil</foreName>
            <surname>Zeghidour</surname>
            <initial>N.</initial>
          </persName>
          <persName>
            <foreName>Nicolas</foreName>
            <surname>Usunier</surname>
            <initial>N.</initial>
          </persName>
          <persName>
            <foreName>Léon</foreName>
            <surname>Bottou</surname>
            <initial>L.</initial>
          </persName>
          <persName key="sierra-2018-idp112912">
            <foreName>Francis</foreName>
            <surname>Bach</surname>
            <initial>F.</initial>
          </persName>
        </author>
      </analytic>
      <monogr x-scientific-popularization="no" x-international-audience="yes" x-proceedings="no" x-invited-conference="no" x-editorial-board="yes">
        <title level="m">Conference on Neural Information Processing Systems (NIPS)</title>
        <loc>Montréal, Canada</loc>
        <imprint>
          <dateStruct>
            <month>December</month>
            <year>2018</year>
          </dateStruct>
          <ref xlink:href="https://hal.archives-ouvertes.fr/hal-01899949" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>hal.<allowbreak/>archives-ouvertes.<allowbreak/>fr/<allowbreak/>hal-01899949</ref>
        </imprint>
        <meeting id="cid29560">
          <title>Annual Conference on Neural Information Processing Systems</title>
          <num>32</num>
          <abbr type="sigle">NIPS</abbr>
        </meeting>
      </monogr>
      <note type="bnote">
        <ref xlink:href="https://arxiv.org/abs/1810.09785" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>arxiv.<allowbreak/>org/<allowbreak/>abs/<allowbreak/>1810.<allowbreak/>09785</ref>
      </note>
    </biblStruct>
    
    <biblStruct id="sierra-2018-bid44" type="inproceedings" rend="year" n="cite:gower:hal-01652152">
      <identifiant type="doi" value="10.07462"/>
      <identifiant type="hal" value="hal-01652152"/>
      <analytic>
        <title level="a">Tracking the gradients using the Hessian: A new look at variance reducing stochastic methods</title>
        <author>
          <persName>
            <foreName>Robert M.</foreName>
            <surname>Gower</surname>
            <initial>R. M.</initial>
          </persName>
          <persName>
            <foreName>Nicolas</foreName>
            <surname>Le Roux</surname>
            <initial>N.</initial>
          </persName>
          <persName key="sierra-2018-idp112912">
            <foreName>Francis</foreName>
            <surname>Bach</surname>
            <initial>F.</initial>
          </persName>
        </author>
      </analytic>
      <monogr x-scientific-popularization="no" x-international-audience="yes" x-proceedings="no" x-invited-conference="no" x-editorial-board="yes">
        <title level="m">International Conference on Artificial Intelligence and Statistics (AISTATS)</title>
        <loc>Canary Islands, Spain</loc>
        <imprint>
          <dateStruct>
            <year>2018</year>
          </dateStruct>
          <ref xlink:href="https://hal.archives-ouvertes.fr/hal-01652152" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>hal.<allowbreak/>archives-ouvertes.<allowbreak/>fr/<allowbreak/>hal-01652152</ref>
        </imprint>
        <meeting id="cid388734">
          <title>International Conference on Artificial Intelligence and Statistics</title>
          <num>21</num>
          <abbr type="sigle">AISTATS</abbr>
        </meeting>
      </monogr>
      <note type="bnote"><ref xlink:href="https://arxiv.org/abs/1710.07462" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>arxiv.<allowbreak/>org/<allowbreak/>abs/<allowbreak/>1710.<allowbreak/>07462</ref> - 17 pages, 2 figures, 1 table</note>
    </biblStruct>
    
    <biblStruct id="sierra-2018-bid41" type="inproceedings" rend="year" n="cite:halabi:hal-01652151">
      <identifiant type="doi" value="10.06273"/>
      <identifiant type="hal" value="hal-01652151"/>
      <analytic>
        <title level="a">Combinatorial Penalties: Which structures are preserved by convex relaxations?</title>
        <author>
          <persName>
            <foreName>Marwa El</foreName>
            <surname>Halabi</surname>
            <initial>M. E.</initial>
          </persName>
          <persName key="sierra-2018-idp112912">
            <foreName>Francis</foreName>
            <surname>Bach</surname>
            <initial>F.</initial>
          </persName>
          <persName>
            <foreName>Volkan</foreName>
            <surname>Cevher</surname>
            <initial>V.</initial>
          </persName>
        </author>
      </analytic>
      <monogr x-scientific-popularization="no" x-international-audience="yes" x-proceedings="no" x-invited-conference="no" x-editorial-board="yes">
        <title level="m">AISTATS 2018 - 22nd International Conference on Artificial Intelligence and Statistics</title>
        <loc>Canary Islands, Spain</loc>
        <imprint>
          <dateStruct>
            <month>April</month>
            <year>2018</year>
          </dateStruct>
          <ref xlink:href="https://hal.archives-ouvertes.fr/hal-01652151" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>hal.<allowbreak/>archives-ouvertes.<allowbreak/>fr/<allowbreak/>hal-01652151</ref>
        </imprint>
        <meeting id="cid388734">
          <title>International Conference on Artificial Intelligence and Statistics</title>
          <num>21</num>
          <abbr type="sigle">AISTATS</abbr>
        </meeting>
      </monogr>
      <note type="bnote">
        <ref xlink:href="https://arxiv.org/abs/1710.06273" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>arxiv.<allowbreak/>org/<allowbreak/>abs/<allowbreak/>1710.<allowbreak/>06273</ref>
      </note>
    </biblStruct>
    
    <biblStruct id="sierra-2018-bid37" type="inproceedings" rend="year" n="cite:kerdreux:hal-01927391">
      <identifiant type="hal" value="hal-01927391"/>
      <analytic>
        <title level="a">Frank-Wolfe with Subsampling Oracle</title>
        <author>
          <persName key="sierra-2018-idp153056">
            <foreName>Thomas</foreName>
            <surname>Kerdreux</surname>
            <initial>T.</initial>
          </persName>
          <persName>
            <foreName>Fabian</foreName>
            <surname>Pedregosa</surname>
            <initial>F.</initial>
          </persName>
          <persName>
            <foreName>Alexandre</foreName>
            <surname>D'Aspremont</surname>
            <initial>A.</initial>
          </persName>
        </author>
      </analytic>
      <monogr x-scientific-popularization="no" x-international-audience="yes" x-proceedings="no" x-invited-conference="no" x-editorial-board="yes">
        <title level="m">ICML 2018 - 35th International Conference on Machine Learning</title>
        <loc>Stockholm, Sweden</loc>
        <imprint>
          <dateStruct>
            <month>July</month>
            <year>2018</year>
          </dateStruct>
          <ref xlink:href="https://hal.archives-ouvertes.fr/hal-01927391" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>hal.<allowbreak/>archives-ouvertes.<allowbreak/>fr/<allowbreak/>hal-01927391</ref>
        </imprint>
        <meeting id="cid32516">
          <title>International Conference on Machine Learning</title>
          <num>35</num>
          <abbr type="sigle">ICML</abbr>
        </meeting>
      </monogr>
      <note type="bnote">
        <ref xlink:href="https://arxiv.org/abs/1803.07348" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>arxiv.<allowbreak/>org/<allowbreak/>abs/<allowbreak/>1803.<allowbreak/>07348</ref>
      </note>
    </biblStruct>
    
    <biblStruct id="sierra-2018-bid42" type="inproceedings" rend="year" n="cite:kundu:hal-01652149">
      <identifiant type="doi" value="10.06465"/>
      <identifiant type="hal" value="hal-01652149"/>
      <analytic>
        <title level="a">Convex optimization over intersection of simple sets: improved convergence rate guarantees via an exact penalty approach</title>
        <author>
          <persName key="sierra-2018-idp202768">
            <foreName>Achintya</foreName>
            <surname>Kundu</surname>
            <initial>A.</initial>
          </persName>
          <persName key="sierra-2018-idp112912">
            <foreName>Francis</foreName>
            <surname>Bach</surname>
            <initial>F.</initial>
          </persName>
          <persName>
            <foreName>Chiranjib</foreName>
            <surname>Bhattacharyya</surname>
            <initial>C.</initial>
          </persName>
        </author>
      </analytic>
      <monogr x-scientific-popularization="no" x-international-audience="yes" x-proceedings="no" x-invited-conference="no" x-editorial-board="yes">
        <title level="m">AISTATS 2018 - 22nd International Conference on Artificial Intelligence and Statistics</title>
        <loc>Canary Islands, Spain</loc>
        <imprint>
          <dateStruct>
            <month>April</month>
            <year>2018</year>
          </dateStruct>
          <ref xlink:href="https://hal.archives-ouvertes.fr/hal-01652149" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>hal.<allowbreak/>archives-ouvertes.<allowbreak/>fr/<allowbreak/>hal-01652149</ref>
        </imprint>
        <meeting id="cid388734">
          <title>International Conference on Artificial Intelligence and Statistics</title>
          <num>21</num>
          <abbr type="sigle">AISTATS</abbr>
        </meeting>
      </monogr>
      <note type="bnote">
        <ref xlink:href="https://arxiv.org/abs/1710.06465" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>arxiv.<allowbreak/>org/<allowbreak/>abs/<allowbreak/>1710.<allowbreak/>06465</ref>
      </note>
    </biblStruct>
    
    <biblStruct id="sierra-2018-bid20" type="inproceedings" rend="year" n="cite:leblond:hal-01950555">
      <identifiant type="hal" value="hal-01950555"/>
      <analytic>
        <title level="a">SeaRNN: Training RNNs with Global-Local Losses</title>
        <author>
          <persName key="sierra-2018-idp121136">
            <foreName>Rémi</foreName>
            <surname>Leblond</surname>
            <initial>R.</initial>
          </persName>
          <persName key="willow-2018-idp128128">
            <foreName>Jean-Baptiste</foreName>
            <surname>Alayrac</surname>
            <initial>J.-B.</initial>
          </persName>
          <persName>
            <foreName>Anton</foreName>
            <surname>Osokin</surname>
            <initial>A.</initial>
          </persName>
          <persName key="sierra-2018-idp210288">
            <foreName>Simon</foreName>
            <surname>Lacoste-Julien</surname>
            <initial>S.</initial>
          </persName>
        </author>
      </analytic>
      <monogr x-scientific-popularization="no" x-international-audience="yes" x-proceedings="yes" x-invited-conference="no" x-editorial-board="yes">
        <title level="m">ICLR 2018 : 6th International Conference on Learning Representations</title>
        <loc>Vancouver, Canada</loc>
        <imprint>
          <dateStruct>
            <month>April</month>
            <year>2018</year>
          </dateStruct>
          <ref xlink:href="https://hal.inria.fr/hal-01950555" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>hal.<allowbreak/>inria.<allowbreak/>fr/<allowbreak/>hal-01950555</ref>
        </imprint>
        <meeting id="cid624026">
          <title>International Conference on Learning Representations</title>
          <num>6</num>
          <abbr type="sigle">ICLR</abbr>
        </meeting>
      </monogr>
    </biblStruct>
    
    <biblStruct id="sierra-2018-bid29" type="inproceedings" rend="year" n="cite:luise:hal-01958887">
      <identifiant type="hal" value="hal-01958887"/>
      <analytic>
        <title level="a">Differential Properties of Sinkhorn Approximation for Learning with Wasserstein Distance</title>
        <author>
          <persName>
            <foreName>Giulia</foreName>
            <surname>Luise</surname>
            <initial>G.</initial>
          </persName>
          <persName key="sierra-2018-idp123584">
            <foreName>Alessandro</foreName>
            <surname>Rudi</surname>
            <initial>A.</initial>
          </persName>
          <persName>
            <foreName>Massimiliano</foreName>
            <surname>Pontil</surname>
            <initial>M.</initial>
          </persName>
          <persName>
            <foreName>Carlo</foreName>
            <surname>Ciliberto</surname>
            <initial>C.</initial>
          </persName>
        </author>
      </analytic>
      <monogr x-scientific-popularization="no" x-international-audience="yes" x-proceedings="yes" x-invited-conference="no" x-editorial-board="yes">
        <title level="m">NIPS 2018 - Advances in Neural Information Processing Systems</title>
        <loc>Montreal, Canada</loc>
        <imprint>
          <dateStruct>
            <month>December</month>
            <year>2018</year>
          </dateStruct>
          <biblScope type="pages">5864-5874</biblScope>
          <ref xlink:href="https://hal.inria.fr/hal-01958887" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>hal.<allowbreak/>inria.<allowbreak/>fr/<allowbreak/>hal-01958887</ref>
        </imprint>
        <meeting id="cid29560">
          <title>Annual Conference on Neural Information Processing Systems</title>
          <num>32</num>
          <abbr type="sigle">NIPS</abbr>
        </meeting>
      </monogr>
      <note type="bnote"><ref xlink:href="https://arxiv.org/abs/1805.11897" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>arxiv.<allowbreak/>org/<allowbreak/>abs/<allowbreak/>1805.<allowbreak/>11897</ref> - 26 pages, 4 figures</note>
    </biblStruct>
    
    <biblStruct id="sierra-2018-bid25" type="inproceedings" rend="year" n="cite:pauwels:hal-01796591">
      <identifiant type="hal" value="hal-01796591"/>
      <analytic>
        <title level="a">Relating Leverage Scores and Density using Regularized Christoffel Functions</title>
        <author>
          <persName>
            <foreName>Edouard</foreName>
            <surname>Pauwels</surname>
            <initial>E.</initial>
          </persName>
          <persName key="sierra-2018-idp112912">
            <foreName>Francis</foreName>
            <surname>Bach</surname>
            <initial>F.</initial>
          </persName>
          <persName>
            <foreName>Jean-Philippe</foreName>
            <surname>Vert</surname>
            <initial>J.-P.</initial>
          </persName>
        </author>
      </analytic>
      <monogr x-scientific-popularization="no" x-international-audience="yes" x-proceedings="no" x-invited-conference="no" x-editorial-board="yes">
        <title level="m">Neural Information Processing Systems</title>
        <loc>Montréal, Canada</loc>
        <imprint>
          <dateStruct>
            <month>December</month>
            <year>2018</year>
          </dateStruct>
          <ref xlink:href="https://hal.archives-ouvertes.fr/hal-01796591" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>hal.<allowbreak/>archives-ouvertes.<allowbreak/>fr/<allowbreak/>hal-01796591</ref>
        </imprint>
        <meeting id="cid29560">
          <title>Annual Conference on Neural Information Processing Systems</title>
          <num>32</num>
          <abbr type="sigle">NIPS</abbr>
        </meeting>
      </monogr>
    </biblStruct>
    
    <biblStruct id="sierra-2018-bid17" type="inproceedings" rend="year" n="cite:pillaudvivien:hal-01662278">
      <identifiant type="hal" value="hal-01662278"/>
      <analytic>
        <title level="a">Exponential convergence of testing error for stochastic gradient methods</title>
        <author>
          <persName>
            <foreName>Loucas</foreName>
            <surname>Pillaud-Vivien</surname>
            <initial>L.</initial>
          </persName>
          <persName key="sierra-2018-idp123584">
            <foreName>Alessandro</foreName>
            <surname>Rudi</surname>
            <initial>A.</initial>
          </persName>
          <persName key="sierra-2018-idp112912">
            <foreName>Francis</foreName>
            <surname>Bach</surname>
            <initial>F.</initial>
          </persName>
        </author>
      </analytic>
      <monogr x-scientific-popularization="no" x-international-audience="yes" x-proceedings="no" x-invited-conference="no" x-editorial-board="yes">
        <title level="m">Conference on Learning Theory (COLT)</title>
        <loc>Stockholm, Sweden</loc>
        <imprint>
          <dateStruct>
            <month>July</month>
            <year>2018</year>
          </dateStruct>
          <ref xlink:href="https://hal.archives-ouvertes.fr/hal-01662278" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>hal.<allowbreak/>archives-ouvertes.<allowbreak/>fr/<allowbreak/>hal-01662278</ref>
        </imprint>
        <meeting id="cid29437">
          <title>Annual Conference on Learning Theory</title>
          <num>31</num>
          <abbr type="sigle">COLT</abbr>
        </meeting>
      </monogr>
      <note type="bnote">
        <ref xlink:href="https://arxiv.org/abs/1712.04755" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>arxiv.<allowbreak/>org/<allowbreak/>abs/<allowbreak/>1712.<allowbreak/>04755</ref>
      </note>
    </biblStruct>
    
    <biblStruct id="sierra-2018-bid18" type="inproceedings" rend="year" n="cite:pillaudvivien:hal-01799116">
      <identifiant type="hal" value="hal-01799116"/>
      <analytic>
        <title level="a">Statistical Optimality of Stochastic Gradient Descent on Hard Learning Problems through Multiple Passes</title>
        <author>
          <persName>
            <foreName>Loucas</foreName>
            <surname>Pillaud-Vivien</surname>
            <initial>L.</initial>
          </persName>
          <persName key="sierra-2018-idp123584">
            <foreName>Alessandro</foreName>
            <surname>Rudi</surname>
            <initial>A.</initial>
          </persName>
          <persName key="sierra-2018-idp112912">
            <foreName>Francis</foreName>
            <surname>Bach</surname>
            <initial>F.</initial>
          </persName>
        </author>
      </analytic>
      <monogr x-scientific-popularization="no" x-international-audience="yes" x-proceedings="no" x-invited-conference="no" x-editorial-board="yes">
        <title level="m">Neural Information Processing Systems (NeurIPS)</title>
        <loc>Montréal, Canada</loc>
        <imprint>
          <dateStruct>
            <month>December</month>
            <year>2018</year>
          </dateStruct>
          <ref xlink:href="https://hal.archives-ouvertes.fr/hal-01799116" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>hal.<allowbreak/>archives-ouvertes.<allowbreak/>fr/<allowbreak/>hal-01799116</ref>
        </imprint>
        <meeting id="cid29560">
          <title>Annual Conference on Neural Information Processing Systems</title>
          <num>32</num>
          <abbr type="sigle">NIPS</abbr>
        </meeting>
      </monogr>
      <note type="bnote">
        <ref xlink:href="https://arxiv.org/abs/1805.10074" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>arxiv.<allowbreak/>org/<allowbreak/>abs/<allowbreak/>1805.<allowbreak/>10074</ref>
      </note>
    </biblStruct>
    
    <biblStruct id="sierra-2018-bid40" type="inproceedings" rend="year" n="cite:reddi:hal-01652150">
      <identifiant type="hal" value="hal-01652150"/>
      <analytic>
        <title level="a">A Generic Approach for Escaping Saddle points</title>
        <author>
          <persName>
            <foreName>Sashank J</foreName>
            <surname>Reddi</surname>
            <initial>S. J.</initial>
          </persName>
          <persName>
            <foreName>Manzil</foreName>
            <surname>Zaheer</surname>
            <initial>M.</initial>
          </persName>
          <persName>
            <foreName>Suvrit</foreName>
            <surname>Sra</surname>
            <initial>S.</initial>
          </persName>
          <persName>
            <foreName>Barnabas</foreName>
            <surname>Poczos</surname>
            <initial>B.</initial>
          </persName>
          <persName key="sierra-2018-idp112912">
            <foreName>Francis</foreName>
            <surname>Bach</surname>
            <initial>F.</initial>
          </persName>
          <persName>
            <foreName>Ruslan</foreName>
            <surname>Salakhutdinov</surname>
            <initial>R.</initial>
          </persName>
          <persName>
            <foreName>Alexander J</foreName>
            <surname>Smola</surname>
            <initial>A. J.</initial>
          </persName>
        </author>
      </analytic>
      <monogr x-scientific-popularization="no" x-international-audience="yes" x-proceedings="no" x-invited-conference="no" x-editorial-board="yes">
        <title level="m">AISTATS 2018 - 22nd International Conference on Artificial Intelligence and Statistics</title>
        <loc>Canary Islands, Spain</loc>
        <imprint>
          <dateStruct>
            <month>April</month>
            <year>2018</year>
          </dateStruct>
          <ref xlink:href="https://hal.archives-ouvertes.fr/hal-01652150" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>hal.<allowbreak/>archives-ouvertes.<allowbreak/>fr/<allowbreak/>hal-01652150</ref>
        </imprint>
        <meeting id="cid388734">
          <title>International Conference on Artificial Intelligence and Statistics</title>
          <num>21</num>
          <abbr type="sigle">AISTATS</abbr>
        </meeting>
      </monogr>
      <note type="bnote">
        <ref xlink:href="https://arxiv.org/abs/1709.01434" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>arxiv.<allowbreak/>org/<allowbreak/>abs/<allowbreak/>1709.<allowbreak/>01434</ref>
      </note>
    </biblStruct>
    
    <biblStruct id="sierra-2018-bid32" type="inproceedings" rend="year" n="cite:rudi:hal-01958879">
      <identifiant type="hal" value="hal-01958879"/>
      <analytic>
        <title level="a">On Fast Leverage Score Sampling and Optimal Learning</title>
        <author>
          <persName key="sierra-2018-idp123584">
            <foreName>Alessandro</foreName>
            <surname>Rudi</surname>
            <initial>A.</initial>
          </persName>
          <persName>
            <foreName>Daniele</foreName>
            <surname>Calandriello</surname>
            <initial>D.</initial>
          </persName>
          <persName>
            <foreName>Luigi</foreName>
            <surname>Carratino</surname>
            <initial>L.</initial>
          </persName>
          <persName>
            <foreName>Lorenzo</foreName>
            <surname>Rosasco</surname>
            <initial>L.</initial>
          </persName>
        </author>
      </analytic>
      <monogr x-scientific-popularization="no" x-international-audience="yes" x-proceedings="yes" x-invited-conference="no" x-editorial-board="yes">
        <title level="m">NeurIPS 2018 - Thirty-second Conference on Neural Information Processing Systems</title>
        <loc>Montreal, Canada</loc>
        <title level="s">Advances in Neural Information Processing Systems - NIPS-2018</title>
        <imprint>
          <biblScope type="volume">31</biblScope>
          <dateStruct>
            <month>December</month>
            <year>2018</year>
          </dateStruct>
          <biblScope type="pages">5677–5687</biblScope>
          <ref xlink:href="https://hal.inria.fr/hal-01958879" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>hal.<allowbreak/>inria.<allowbreak/>fr/<allowbreak/>hal-01958879</ref>
        </imprint>
        <meeting id="cid29560">
          <title>Annual Conference on Neural Information Processing Systems</title>
          <num>32</num>
          <abbr type="sigle">NIPS</abbr>
        </meeting>
      </monogr>
      <note type="bnote">
        <ref xlink:href="https://arxiv.org/abs/1810.13258" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>arxiv.<allowbreak/>org/<allowbreak/>abs/<allowbreak/>1810.<allowbreak/>13258</ref>
      </note>
    </biblStruct>
    
    <biblStruct id="sierra-2018-bid31" type="inproceedings" rend="year" n="cite:rudi:hal-01958900">
      <identifiant type="hal" value="hal-01958900"/>
      <analytic>
        <title level="a">Manifold Structured Prediction</title>
        <author>
          <persName key="sierra-2018-idp123584">
            <foreName>Alessandro</foreName>
            <surname>Rudi</surname>
            <initial>A.</initial>
          </persName>
          <persName>
            <foreName>Carlo</foreName>
            <surname>Ciliberto</surname>
            <initial>C.</initial>
          </persName>
          <persName>
            <foreName>Gian Maria</foreName>
            <surname>Marconi</surname>
            <initial>G. M.</initial>
          </persName>
          <persName>
            <foreName>Lorenzo</foreName>
            <surname>Rosasco</surname>
            <initial>L.</initial>
          </persName>
        </author>
      </analytic>
      <monogr x-scientific-popularization="no" x-international-audience="yes" x-proceedings="yes" x-invited-conference="no" x-editorial-board="yes">
        <title level="m">NIPS 2018 - Neural Information Processing Systems Conference</title>
        <loc>Montreal, Canada</loc>
        <title level="s">Advances in Neural Information Processing Systems</title>
        <imprint>
          <biblScope type="volume">31</biblScope>
          <dateStruct>
            <month>December</month>
            <year>2018</year>
          </dateStruct>
          <biblScope type="pages">5615-5626</biblScope>
          <ref xlink:href="https://hal.archives-ouvertes.fr/hal-01958900" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>hal.<allowbreak/>archives-ouvertes.<allowbreak/>fr/<allowbreak/>hal-01958900</ref>
        </imprint>
        <meeting id="cid29560">
          <title>Annual Conference on Neural Information Processing Systems</title>
          <num>32</num>
          <abbr type="sigle">NIPS</abbr>
        </meeting>
      </monogr>
      <note type="bnote">
        <ref xlink:href="https://arxiv.org/abs/1806.09908" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>arxiv.<allowbreak/>org/<allowbreak/>abs/<allowbreak/>1806.<allowbreak/>09908</ref>
      </note>
    </biblStruct>
    
    <biblStruct id="sierra-2018-bid24" type="inproceedings" rend="year" n="cite:scaman:hal-01957013">
      <identifiant type="hal" value="hal-01957013"/>
      <analytic>
        <title level="a">Optimal Algorithms for Non-Smooth Distributed Optimization in Networks</title>
        <author>
          <persName>
            <foreName>Kevin</foreName>
            <surname>Scaman</surname>
            <initial>K.</initial>
          </persName>
          <persName key="sierra-2018-idp112912">
            <foreName>Francis</foreName>
            <surname>Bach</surname>
            <initial>F.</initial>
          </persName>
          <persName>
            <foreName>Sébastien</foreName>
            <surname>Bubeck</surname>
            <initial>S.</initial>
          </persName>
          <persName>
            <foreName>Yin Tat</foreName>
            <surname>Lee</surname>
            <initial>Y. T.</initial>
          </persName>
          <persName key="dyogene-2018-idp161840">
            <foreName>Laurent</foreName>
            <surname>Massoulié</surname>
            <initial>L.</initial>
          </persName>
        </author>
      </analytic>
      <monogr x-scientific-popularization="no" x-international-audience="yes" x-proceedings="no" x-invited-conference="no" x-editorial-board="yes">
        <title level="m">Advances In Neural Information Processing systems</title>
        <loc>Montreal, Canada</loc>
        <imprint>
          <dateStruct>
            <month>December</month>
            <year>2018</year>
          </dateStruct>
          <ref xlink:href="https://hal.archives-ouvertes.fr/hal-01957013" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>hal.<allowbreak/>archives-ouvertes.<allowbreak/>fr/<allowbreak/>hal-01957013</ref>
        </imprint>
        <meeting id="cid29560">
          <title>Annual Conference on Neural Information Processing Systems</title>
          <num>32</num>
          <abbr type="sigle">NIPS</abbr>
        </meeting>
      </monogr>
      <note type="bnote"><ref xlink:href="https://arxiv.org/abs/1806.00291" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>arxiv.<allowbreak/>org/<allowbreak/>abs/<allowbreak/>1806.<allowbreak/>00291</ref> - 17 pages</note>
    </biblStruct>
    
    <biblStruct id="sierra-2018-bid8" type="inproceedings" rend="year" n="cite:scieur:hal-01805251">
      <identifiant type="hal" value="hal-01805251"/>
      <analytic>
        <title level="a">Nonlinear Acceleration of CNNs</title>
        <author>
          <persName key="sierra-2018-idp162928">
            <foreName>Damien</foreName>
            <surname>Scieur</surname>
            <initial>D.</initial>
          </persName>
          <persName key="galen-post-2018-idp120560">
            <foreName>Edouard</foreName>
            <surname>Oyallon</surname>
            <initial>E.</initial>
          </persName>
          <persName>
            <foreName>Alexandre</foreName>
            <surname>d'Aspremont</surname>
            <initial>A.</initial>
          </persName>
          <persName key="sierra-2018-idp112912">
            <foreName>Francis</foreName>
            <surname>Bach</surname>
            <initial>F.</initial>
          </persName>
        </author>
      </analytic>
      <monogr x-scientific-popularization="no" x-international-audience="yes" x-proceedings="no" x-invited-conference="no" x-editorial-board="yes">
        <title level="m">ICLR Workshop track</title>
        <loc>Vancouver, Canada</loc>
        <imprint>
          <dateStruct>
            <month>April</month>
            <year>2018</year>
          </dateStruct>
          <ref xlink:href="https://hal.archives-ouvertes.fr/hal-01805251" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>hal.<allowbreak/>archives-ouvertes.<allowbreak/>fr/<allowbreak/>hal-01805251</ref>
        </imprint>
        <meeting id="cid624026">
          <title>International Conference on Learning Representations</title>
          <num>6</num>
          <abbr type="sigle">ICLR</abbr>
        </meeting>
      </monogr>
    </biblStruct>
    
    <biblStruct id="sierra-2018-bid3" type="inproceedings" rend="year" n="cite:shpakova:hal-01939549">
      <identifiant type="hal" value="hal-01939549"/>
      <analytic>
        <title level="a">Marginal Weighted Maximum Log-likelihood for Efficient Learning of Perturb-and-Map models</title>
        <author>
          <persName key="sierra-2018-idp165360">
            <foreName>Tatiana</foreName>
            <surname>Shpakova</surname>
            <initial>T.</initial>
          </persName>
          <persName key="sierra-2018-idp112912">
            <foreName>Francis</foreName>
            <surname>Bach</surname>
            <initial>F.</initial>
          </persName>
          <persName>
            <foreName>Anton</foreName>
            <surname>Osokin</surname>
            <initial>A.</initial>
          </persName>
        </author>
      </analytic>
      <monogr x-scientific-popularization="no" x-international-audience="yes" x-proceedings="yes" x-invited-conference="no" x-editorial-board="yes">
        <title level="m">UAI 2018 - Conference on Uncertainty in Artificial Intelligence 2018</title>
        <loc>Monterey, United States</loc>
        <imprint>
          <dateStruct>
            <month>August</month>
            <year>2018</year>
          </dateStruct>
          <ref xlink:href="https://hal.inria.fr/hal-01939549" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>hal.<allowbreak/>inria.<allowbreak/>fr/<allowbreak/>hal-01939549</ref>
        </imprint>
        <meeting id="cid49628">
          <title>Conference on Uncertainty in Artificial Intelligence</title>
          <num>2018</num>
          <abbr type="sigle">UAI</abbr>
        </meeting>
      </monogr>
      <note type="bnote">
        <ref xlink:href="https://arxiv.org/abs/1811.08725" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>arxiv.<allowbreak/>org/<allowbreak/>abs/<allowbreak/>1811.<allowbreak/>08725</ref>
      </note>
    </biblStruct>
    
    <biblStruct id="sierra-2018-bid11" type="inproceedings" rend="year" n="cite:taylor:hal-01902068">
      <identifiant type="hal" value="hal-01902068"/>
      <analytic>
        <title level="a">Lyapunov Functions for First-Order Methods: Tight Automated Convergence Guarantees</title>
        <author>
          <persName key="sierra-2018-idp133456">
            <foreName>Adrien B.</foreName>
            <surname>Taylor</surname>
            <initial>A. B.</initial>
          </persName>
          <persName>
            <foreName>Bryan</foreName>
            <surname>Van Scoy</surname>
            <initial>B.</initial>
          </persName>
          <persName>
            <foreName>Laurent</foreName>
            <surname>Lessard</surname>
            <initial>L.</initial>
          </persName>
        </author>
      </analytic>
      <monogr x-scientific-popularization="no" x-international-audience="yes" x-proceedings="yes" x-invited-conference="no" x-editorial-board="yes">
        <title level="m">Proceedings of the 35th International Conference on Machine Learning. PMLR 80:4897-4906</title>
        <loc>Stockholm, Sweden</loc>
        <imprint>
          <dateStruct>
            <month>July</month>
            <year>2018</year>
          </dateStruct>
          <ref xlink:href="https://hal.inria.fr/hal-01902068" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>hal.<allowbreak/>inria.<allowbreak/>fr/<allowbreak/>hal-01902068</ref>
        </imprint>
        <meeting id="cid32516">
          <title>International Conference on Machine Learning</title>
          <num>35</num>
          <abbr type="sigle">ICML</abbr>
        </meeting>
      </monogr>
      <note type="bnote">
        <ref xlink:href="https://arxiv.org/abs/1803.06073" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>arxiv.<allowbreak/>org/<allowbreak/>abs/<allowbreak/>1803.<allowbreak/>06073</ref>
      </note>
    </biblStruct>
    
    <biblStruct id="sierra-2018-bid26" type="inproceedings" rend="year" n="cite:tripuraneni:hal-01957015">
      <identifiant type="hal" value="hal-01957015"/>
      <analytic>
        <title level="a">Averaging Stochastic Gradient Descent on Riemannian Manifolds</title>
        <author>
          <persName>
            <foreName>Nilesh</foreName>
            <surname>Tripuraneni</surname>
            <initial>N.</initial>
          </persName>
          <persName>
            <foreName>Nicolas</foreName>
            <surname>Flammarion</surname>
            <initial>N.</initial>
          </persName>
          <persName key="sierra-2018-idp112912">
            <foreName>Francis</foreName>
            <surname>Bach</surname>
            <initial>F.</initial>
          </persName>
          <persName>
            <foreName>Michael I.</foreName>
            <surname>Jordan</surname>
            <initial>M. I.</initial>
          </persName>
        </author>
      </analytic>
      <monogr x-scientific-popularization="no" x-international-audience="yes" x-proceedings="no" x-invited-conference="no" x-editorial-board="yes">
        <title level="m">Computational Learning Theory (COLT)</title>
        <loc>Stockholm, Sweden</loc>
        <imprint>
          <dateStruct>
            <month>July</month>
            <year>2018</year>
          </dateStruct>
          <ref xlink:href="https://hal.archives-ouvertes.fr/hal-01957015" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>hal.<allowbreak/>archives-ouvertes.<allowbreak/>fr/<allowbreak/>hal-01957015</ref>
        </imprint>
        <meeting id="cid29437">
          <title>Annual Conference on Learning Theory</title>
          <num>31</num>
          <abbr type="sigle">COLT</abbr>
        </meeting>
      </monogr>
      <note type="bnote"><ref xlink:href="https://arxiv.org/abs/1802.09128" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>arxiv.<allowbreak/>org/<allowbreak/>abs/<allowbreak/>1802.<allowbreak/>09128</ref> - COLT 2018</note>
    </biblStruct>
    
    <biblStruct id="sierra-2018-bid23" type="unpublished" rend="year" n="cite:barre:hal-01897496">
      <identifiant type="arxiv" value="1810.02748"/>
      <identifiant type="hal" value="hal-01897496"/>
      <monogr>
        <title level="m"><formula type="inline"><math xmlns="http://www.w3.org/1998/Math/MathML" overflow="scroll"><msup><mi>𝐌</mi><mo>*</mo></msup></math></formula>-Regularized Dictionary Learning</title>
        <author>
          <persName key="sierra-2018-idp138352">
            <foreName>Mathieu</foreName>
            <surname>Barré</surname>
            <initial>M.</initial>
          </persName>
          <persName>
            <foreName>Alexandre</foreName>
            <surname>d'Aspremont</surname>
            <initial>A.</initial>
          </persName>
        </author>
        <imprint>
          <dateStruct>
            <month>October</month>
            <year>2018</year>
          </dateStruct>
          <ref xlink:href="https://hal.archives-ouvertes.fr/hal-01897496" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>hal.<allowbreak/>archives-ouvertes.<allowbreak/>fr/<allowbreak/>hal-01897496</ref>
        </imprint>
      </monogr>
      <note type="bnote"><ref xlink:href="https://arxiv.org/abs/1810.02748" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>arxiv.<allowbreak/>org/<allowbreak/>abs/<allowbreak/>1810.<allowbreak/>02748</ref> - working paper or preprint</note>
    </biblStruct>
    
    <biblStruct id="sierra-2018-bid2" type="unpublished" rend="year" n="cite:berthier:hal-01797016">
      <identifiant type="hal" value="hal-01797016"/>
      <monogr>
        <title level="m">Gossip of Statistical Observations using Orthogonal Polynomials</title>
        <author>
          <persName key="sierra-2018-idp140864">
            <foreName>Raphaël</foreName>
            <surname>Berthier</surname>
            <initial>R.</initial>
          </persName>
          <persName key="sierra-2018-idp112912">
            <foreName>Francis</foreName>
            <surname>Bach</surname>
            <initial>F.</initial>
          </persName>
          <persName key="sierra-2018-idp118672">
            <foreName>Pierre</foreName>
            <surname>Gaillard</surname>
            <initial>P.</initial>
          </persName>
        </author>
        <imprint>
          <dateStruct>
            <month>May</month>
            <year>2018</year>
          </dateStruct>
          <ref xlink:href="https://hal.archives-ouvertes.fr/hal-01797016" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>hal.<allowbreak/>archives-ouvertes.<allowbreak/>fr/<allowbreak/>hal-01797016</ref>
        </imprint>
      </monogr>
      <note type="bnote"><ref xlink:href="https://arxiv.org/abs/1805.08531" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>arxiv.<allowbreak/>org/<allowbreak/>abs/<allowbreak/>1805.<allowbreak/>08531</ref> - working paper or preprint</note>
    </biblStruct>
    
    <biblStruct id="sierra-2018-bid6" type="unpublished" rend="year" n="cite:bollapragada:hal-01893921">
      <identifiant type="arxiv" value="1810.04539"/>
      <identifiant type="hal" value="hal-01893921"/>
      <monogr>
        <title level="m">Nonlinear Acceleration of Momentum and Primal-Dual Algorithms</title>
        <author>
          <persName>
            <foreName>Raghu</foreName>
            <surname>Bollapragada</surname>
            <initial>R.</initial>
          </persName>
          <persName key="sierra-2018-idp162928">
            <foreName>Damien</foreName>
            <surname>Scieur</surname>
            <initial>D.</initial>
          </persName>
          <persName key="sierra-2018-idp115824">
            <foreName>Alexandre</foreName>
            <surname>d'Aspremont</surname>
            <initial>A.</initial>
          </persName>
        </author>
        <imprint>
          <dateStruct>
            <month>October</month>
            <year>2018</year>
          </dateStruct>
          <ref xlink:href="https://hal.archives-ouvertes.fr/hal-01893921" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>hal.<allowbreak/>archives-ouvertes.<allowbreak/>fr/<allowbreak/>hal-01893921</ref>
        </imprint>
      </monogr>
      <note type="bnote"><ref xlink:href="https://arxiv.org/abs/1810.04539" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>arxiv.<allowbreak/>org/<allowbreak/>abs/<allowbreak/>1810.<allowbreak/>04539</ref> - working paper or preprint</note>
    </biblStruct>
    
    <biblStruct id="sierra-2018-bid47" type="unpublished" rend="year" n="cite:chizat:hal-01945578">
      <identifiant type="hal" value="hal-01945578"/>
      <monogr>
        <title level="m">A Note on Lazy Training in Supervised Differentiable Programming</title>
        <author>
          <persName key="sierra-2018-idp126064">
            <foreName>Lenaic</foreName>
            <surname>Chizat</surname>
            <initial>L.</initial>
          </persName>
          <persName key="sierra-2018-idp112912">
            <foreName>Francis</foreName>
            <surname>Bach</surname>
            <initial>F.</initial>
          </persName>
        </author>
        <imprint>
          <dateStruct>
            <month>December</month>
            <year>2018</year>
          </dateStruct>
          <ref xlink:href="https://hal.inria.fr/hal-01945578" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>hal.<allowbreak/>inria.<allowbreak/>fr/<allowbreak/>hal-01945578</ref>
        </imprint>
      </monogr>
      <note type="bnote"><ref xlink:href="https://arxiv.org/abs/1812.07956" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>arxiv.<allowbreak/>org/<allowbreak/>abs/<allowbreak/>1812.<allowbreak/>07956</ref> - working paper or preprint</note>
    </biblStruct>
    
    <biblStruct id="sierra-2018-bid27" type="unpublished" rend="year" n="cite:ciliberto:hal-01958863">
      <identifiant type="hal" value="hal-01958863"/>
      <monogr>
        <title level="m">Localized Structured Prediction</title>
        <author>
          <persName>
            <foreName>Carlo</foreName>
            <surname>Ciliberto</surname>
            <initial>C.</initial>
          </persName>
          <persName key="sierra-2018-idp112912">
            <foreName>Francis</foreName>
            <surname>Bach</surname>
            <initial>F.</initial>
          </persName>
          <persName key="sierra-2018-idp123584">
            <foreName>Alessandro</foreName>
            <surname>Rudi</surname>
            <initial>A.</initial>
          </persName>
        </author>
        <imprint>
          <dateStruct>
            <month>December</month>
            <year>2018</year>
          </dateStruct>
          <ref xlink:href="https://hal.inria.fr/hal-01958863" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>hal.<allowbreak/>inria.<allowbreak/>fr/<allowbreak/>hal-01958863</ref>
        </imprint>
      </monogr>
      <note type="bnote"><ref xlink:href="https://arxiv.org/abs/1806.02402" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>arxiv.<allowbreak/>org/<allowbreak/>abs/<allowbreak/>1806.<allowbreak/>02402</ref> - 53 pages, 7 figures, 1 algorithm</note>
    </biblStruct>
    
    <biblStruct id="sierra-2018-bid38" type="unpublished" rend="year" n="cite:dieuleveut:hal-01565514">
      <identifiant type="hal" value="hal-01565514"/>
      <monogr>
        <title level="m">Bridging the Gap between Constant Step Size Stochastic Gradient Descent and Markov Chains</title>
        <author>
          <persName>
            <foreName>Aymeric</foreName>
            <surname>Dieuleveut</surname>
            <initial>A.</initial>
          </persName>
          <persName>
            <foreName>Alain</foreName>
            <surname>Durmus</surname>
            <initial>A.</initial>
          </persName>
          <persName key="sierra-2018-idp112912">
            <foreName>Francis</foreName>
            <surname>Bach</surname>
            <initial>F.</initial>
          </persName>
        </author>
        <imprint>
          <dateStruct>
            <month>April</month>
            <year>2018</year>
          </dateStruct>
          <ref xlink:href="https://hal.archives-ouvertes.fr/hal-01565514" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>hal.<allowbreak/>archives-ouvertes.<allowbreak/>fr/<allowbreak/>hal-01565514</ref>
        </imprint>
      </monogr>
      <note type="bnote"><ref xlink:href="https://arxiv.org/abs/1707.06386" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>arxiv.<allowbreak/>org/<allowbreak/>abs/<allowbreak/>1707.<allowbreak/>06386</ref> - working paper or preprint</note>
    </biblStruct>
    
    <biblStruct id="sierra-2018-bid12" type="unpublished" rend="year" n="cite:drori:hal-01902048">
      <identifiant type="hal" value="hal-01902048"/>
      <monogr>
        <title level="m">Efficient First-order Methods for Convex Minimization: a Constructive Approach</title>
        <author>
          <persName>
            <foreName>Yoel</foreName>
            <surname>Drori</surname>
            <initial>Y.</initial>
          </persName>
          <persName key="sierra-2018-idp133456">
            <foreName>Adrien B.</foreName>
            <surname>Taylor</surname>
            <initial>A. B.</initial>
          </persName>
        </author>
        <imprint>
          <dateStruct>
            <month>October</month>
            <year>2018</year>
          </dateStruct>
          <ref xlink:href="https://hal.inria.fr/hal-01902048" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>hal.<allowbreak/>inria.<allowbreak/>fr/<allowbreak/>hal-01902048</ref>
        </imprint>
      </monogr>
      <note type="bnote"><ref xlink:href="https://arxiv.org/abs/1803.05676" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>arxiv.<allowbreak/>org/<allowbreak/>abs/<allowbreak/>1803.<allowbreak/>05676</ref> - Code available at https://github.com/AdrienTaylor/GreedyMethods</note>
    </biblStruct>
    
    <biblStruct id="sierra-2018-bid15" type="unpublished" rend="year" n="cite:gaillard:hal-01802004">
      <identifiant type="hal" value="hal-01802004"/>
      <monogr>
        <title level="m">Uniform regret bounds over <formula type="inline"><math xmlns="http://www.w3.org/1998/Math/MathML" overflow="scroll"><msup><mi>R</mi><mi>d</mi></msup></math></formula> for the sequential linear regression problem with the square loss</title>
        <author>
          <persName key="sierra-2018-idp118672">
            <foreName>Pierre</foreName>
            <surname>Gaillard</surname>
            <initial>P.</initial>
          </persName>
          <persName>
            <foreName>Sébastien</foreName>
            <surname>Gerchinovitz</surname>
            <initial>S.</initial>
          </persName>
          <persName>
            <foreName>Malo</foreName>
            <surname>Huard</surname>
            <initial>M.</initial>
          </persName>
          <persName>
            <foreName>Gilles</foreName>
            <surname>Stoltz</surname>
            <initial>G.</initial>
          </persName>
        </author>
        <imprint>
          <dateStruct>
            <month>February</month>
            <year>2018</year>
          </dateStruct>
          <ref xlink:href="https://hal.archives-ouvertes.fr/hal-01802004" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>hal.<allowbreak/>archives-ouvertes.<allowbreak/>fr/<allowbreak/>hal-01802004</ref>
        </imprint>
      </monogr>
      <note type="bnote"><ref xlink:href="https://arxiv.org/abs/1805.11386" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>arxiv.<allowbreak/>org/<allowbreak/>abs/<allowbreak/>1805.<allowbreak/>11386</ref> - working paper or preprint</note>
    </biblStruct>
    
    <biblStruct id="sierra-2018-bid16" type="unpublished" rend="year" n="cite:gaillard:hal-01798201">
      <identifiant type="hal" value="hal-01798201"/>
      <monogr>
        <title level="m">Efficient online algorithms for fast-rate regret bounds under sparsity</title>
        <author>
          <persName key="sierra-2018-idp118672">
            <foreName>Pierre</foreName>
            <surname>Gaillard</surname>
            <initial>P.</initial>
          </persName>
          <persName>
            <foreName>Olivier</foreName>
            <surname>Wintenberger</surname>
            <initial>O.</initial>
          </persName>
        </author>
        <imprint>
          <dateStruct>
            <month>May</month>
            <year>2018</year>
          </dateStruct>
          <ref xlink:href="https://hal.archives-ouvertes.fr/hal-01798201" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>hal.<allowbreak/>archives-ouvertes.<allowbreak/>fr/<allowbreak/>hal-01798201</ref>
        </imprint>
      </monogr>
      <note type="bnote"><ref xlink:href="https://arxiv.org/abs/1805.09174" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>arxiv.<allowbreak/>org/<allowbreak/>abs/<allowbreak/>1805.<allowbreak/>09174</ref> - working paper or preprint</note>
    </biblStruct>
    
    <biblStruct id="sierra-2018-bid33" type="unpublished" rend="year" n="cite:hendrikx:hal-01893568">
      <identifiant type="hal" value="hal-01893568"/>
      <monogr>
        <title level="m">Accelerated decentralized optimization with local updates for smooth and strongly convex objectives</title>
        <author>
          <persName key="dyogene-2018-idp171968">
            <foreName>Hadrien</foreName>
            <surname>Hendrikx</surname>
            <initial>H.</initial>
          </persName>
          <persName key="sierra-2018-idp112912">
            <foreName>Francis</foreName>
            <surname>Bach</surname>
            <initial>F.</initial>
          </persName>
          <persName key="dyogene-2018-idp161840">
            <foreName>Laurent</foreName>
            <surname>Massoulié</surname>
            <initial>L.</initial>
          </persName>
        </author>
        <imprint>
          <dateStruct>
            <month>October</month>
            <year>2018</year>
          </dateStruct>
          <ref xlink:href="https://hal.inria.fr/hal-01893568" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>hal.<allowbreak/>inria.<allowbreak/>fr/<allowbreak/>hal-01893568</ref>
        </imprint>
      </monogr>
      <note type="bnote">working paper or preprint</note>
    </biblStruct>
    
    <biblStruct id="sierra-2018-bid46" type="unpublished" rend="year" n="cite:kerdreux:hal-01893922">
      <identifiant type="arxiv" value="1810.02429"/>
      <identifiant type="hal" value="hal-01893922"/>
      <monogr>
        <title level="m">Restarting Frank-Wolfe</title>
        <author>
          <persName key="sierra-2018-idp153056">
            <foreName>Thomas</foreName>
            <surname>Kerdreux</surname>
            <initial>T.</initial>
          </persName>
          <persName key="sierra-2018-idp115824">
            <foreName>Alexandre</foreName>
            <surname>d'Aspremont</surname>
            <initial>A.</initial>
          </persName>
          <persName>
            <foreName>Sebastian</foreName>
            <surname>Pokutta</surname>
            <initial>S.</initial>
          </persName>
        </author>
        <imprint>
          <dateStruct>
            <month>October</month>
            <year>2018</year>
          </dateStruct>
          <ref xlink:href="https://hal.archives-ouvertes.fr/hal-01893922" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>hal.<allowbreak/>archives-ouvertes.<allowbreak/>fr/<allowbreak/>hal-01893922</ref>
        </imprint>
      </monogr>
      <note type="bnote"><ref xlink:href="https://arxiv.org/abs/1810.02429" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>arxiv.<allowbreak/>org/<allowbreak/>abs/<allowbreak/>1810.<allowbreak/>02429</ref> - working paper or preprint</note>
    </biblStruct>
    
    <biblStruct id="sierra-2018-bid1" type="unpublished" rend="year" n="cite:nowakvila:hal-01893006">
      <identifiant type="hal" value="hal-01893006"/>
      <monogr>
        <title level="m">Sharp Analysis of Learning with Discrete Losses</title>
        <author>
          <persName>
            <foreName>Alex</foreName>
            <surname>Nowak-Vila</surname>
            <initial>A.</initial>
          </persName>
          <persName key="sierra-2018-idp112912">
            <foreName>Francis</foreName>
            <surname>Bach</surname>
            <initial>F.</initial>
          </persName>
          <persName key="sierra-2018-idp123584">
            <foreName>Alessandro</foreName>
            <surname>Rudi</surname>
            <initial>A.</initial>
          </persName>
        </author>
        <imprint>
          <dateStruct>
            <month>October</month>
            <year>2018</year>
          </dateStruct>
          <ref xlink:href="https://hal.archives-ouvertes.fr/hal-01893006" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>hal.<allowbreak/>archives-ouvertes.<allowbreak/>fr/<allowbreak/>hal-01893006</ref>
        </imprint>
      </monogr>
      <note type="bnote"><ref xlink:href="https://arxiv.org/abs/1810.06839" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>arxiv.<allowbreak/>org/<allowbreak/>abs/<allowbreak/>1810.<allowbreak/>06839</ref> - working paper or preprint</note>
    </biblStruct>
    
    <biblStruct id="sierra-2018-bid14" type="unpublished" rend="year" n="cite:ostrovskii:hal-01895127">
      <identifiant type="hal" value="hal-01895127"/>
      <monogr>
        <title level="m">Finite-sample Analysis of M-estimators using Self-concordance</title>
        <author>
          <persName key="sierra-2018-idp130992">
            <foreName>Dmitrii M.</foreName>
            <surname>Ostrovskii</surname>
            <initial>D. M.</initial>
          </persName>
          <persName key="sierra-2018-idp112912">
            <foreName>Francis</foreName>
            <surname>Bach</surname>
            <initial>F.</initial>
          </persName>
        </author>
        <imprint>
          <dateStruct>
            <month>October</month>
            <year>2018</year>
          </dateStruct>
          <ref xlink:href="https://hal.archives-ouvertes.fr/hal-01895127" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>hal.<allowbreak/>archives-ouvertes.<allowbreak/>fr/<allowbreak/>hal-01895127</ref>
        </imprint>
      </monogr>
      <note type="bnote"><ref xlink:href="https://arxiv.org/abs/1810.06838" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>arxiv.<allowbreak/>org/<allowbreak/>abs/<allowbreak/>1810.<allowbreak/>06838</ref> - working paper or preprint</note>
    </biblStruct>
    
    <biblStruct id="sierra-2018-bid10" type="unpublished" rend="year" n="cite:recanati:hal-01846269">
      <identifiant type="hal" value="hal-01846269"/>
      <monogr>
        <title level="m">Reconstructing Latent Orderings by Spectral Clustering</title>
        <author>
          <persName key="sierra-2018-idp160496">
            <foreName>Antoine</foreName>
            <surname>Recanati</surname>
            <initial>A.</initial>
          </persName>
          <persName key="sierra-2018-idp153056">
            <foreName>Thomas</foreName>
            <surname>Kerdreux</surname>
            <initial>T.</initial>
          </persName>
          <persName>
            <foreName>Alexandre</foreName>
            <surname>d'Aspremont</surname>
            <initial>A.</initial>
          </persName>
        </author>
        <imprint>
          <dateStruct>
            <month>July</month>
            <year>2018</year>
          </dateStruct>
          <ref xlink:href="https://hal.archives-ouvertes.fr/hal-01846269" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>hal.<allowbreak/>archives-ouvertes.<allowbreak/>fr/<allowbreak/>hal-01846269</ref>
        </imprint>
      </monogr>
      <note type="bnote"><ref xlink:href="https://arxiv.org/abs/1807.07122" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>arxiv.<allowbreak/>org/<allowbreak/>abs/<allowbreak/>1807.<allowbreak/>07122</ref> - working paper or preprint</note>
    </biblStruct>
    
    <biblStruct id="sierra-2018-bid9" type="unpublished" rend="year" n="cite:recanati:hal-01851960">
      <identifiant type="hal" value="hal-01851960"/>
      <monogr>
        <title level="m">Robust Seriation and Applications to Cancer Genomics</title>
        <author>
          <persName key="sierra-2018-idp160496">
            <foreName>Antoine</foreName>
            <surname>Recanati</surname>
            <initial>A.</initial>
          </persName>
          <persName>
            <foreName>Nicolas</foreName>
            <surname>Servant</surname>
            <initial>N.</initial>
          </persName>
          <persName>
            <foreName>Jean-Philippe</foreName>
            <surname>Vert</surname>
            <initial>J.-P.</initial>
          </persName>
          <persName>
            <foreName>Alexandre</foreName>
            <surname>d'Aspremont</surname>
            <initial>A.</initial>
          </persName>
        </author>
        <imprint>
          <dateStruct>
            <month>July</month>
            <year>2018</year>
          </dateStruct>
          <ref xlink:href="https://hal.archives-ouvertes.fr/hal-01851960" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>hal.<allowbreak/>archives-ouvertes.<allowbreak/>fr/<allowbreak/>hal-01851960</ref>
        </imprint>
      </monogr>
      <note type="bnote"><ref xlink:href="https://arxiv.org/abs/1806.00664" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>arxiv.<allowbreak/>org/<allowbreak/>abs/<allowbreak/>1806.<allowbreak/>00664</ref> - working paper or preprint</note>
    </biblStruct>
    
    <biblStruct id="sierra-2018-bid13" type="unpublished" rend="year" n="cite:ryu:hal-01943622">
      <identifiant type="hal" value="hal-01943622"/>
      <monogr>
        <title level="m">Operator Splitting Performance Estimation: Tight contraction factors and optimal parameter selection</title>
        <author>
          <persName>
            <foreName>Ernest K.</foreName>
            <surname>Ryu</surname>
            <initial>E. K.</initial>
          </persName>
          <persName key="sierra-2018-idp133456">
            <foreName>Adrien B.</foreName>
            <surname>Taylor</surname>
            <initial>A. B.</initial>
          </persName>
          <persName>
            <foreName>Carolina</foreName>
            <surname>Bergeling</surname>
            <initial>C.</initial>
          </persName>
          <persName>
            <foreName>Pontus</foreName>
            <surname>Giselsson</surname>
            <initial>P.</initial>
          </persName>
        </author>
        <imprint>
          <dateStruct>
            <month>December</month>
            <year>2018</year>
          </dateStruct>
          <ref xlink:href="https://hal.inria.fr/hal-01943622" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>hal.<allowbreak/>inria.<allowbreak/>fr/<allowbreak/>hal-01943622</ref>
        </imprint>
      </monogr>
      <note type="bnote"><ref xlink:href="https://arxiv.org/abs/1812.00146" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>arxiv.<allowbreak/>org/<allowbreak/>abs/<allowbreak/>1812.<allowbreak/>00146</ref> - working paper or preprint</note>
    </biblStruct>
    
    <biblStruct id="sierra-2018-bid7" type="unpublished" rend="year" n="cite:scieur:hal-01799269">
      <identifiant type="hal" value="hal-01799269"/>
      <monogr>
        <title level="m">Nonlinear Acceleration of Deep Neural Networks</title>
        <author>
          <persName key="sierra-2018-idp162928">
            <foreName>Damien</foreName>
            <surname>Scieur</surname>
            <initial>D.</initial>
          </persName>
          <persName key="galen-post-2018-idp120560">
            <foreName>Edouard</foreName>
            <surname>Oyallon</surname>
            <initial>E.</initial>
          </persName>
          <persName>
            <foreName>Alexandre</foreName>
            <surname>d'Aspremont</surname>
            <initial>A.</initial>
          </persName>
          <persName key="sierra-2018-idp112912">
            <foreName>Francis</foreName>
            <surname>Bach</surname>
            <initial>F.</initial>
          </persName>
        </author>
        <imprint>
          <dateStruct>
            <month>May</month>
            <year>2018</year>
          </dateStruct>
          <ref xlink:href="https://hal.archives-ouvertes.fr/hal-01799269" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>hal.<allowbreak/>archives-ouvertes.<allowbreak/>fr/<allowbreak/>hal-01799269</ref>
        </imprint>
      </monogr>
      <note type="bnote">working paper or preprint</note>
    </biblStruct>
    
    <biblStruct id="sierra-2018-bid48" type="unpublished" rend="year" n="cite:tang:hal-01889990">
      <identifiant type="hal" value="hal-01889990"/>
      <monogr>
        <title level="m">Structure-Adaptive Accelerated Coordinate Descent</title>
        <author>
          <persName>
            <foreName>Junqi</foreName>
            <surname>Tang</surname>
            <initial>J.</initial>
          </persName>
          <persName>
            <foreName>Mohammad</foreName>
            <surname>Golbabaee</surname>
            <initial>M.</initial>
          </persName>
          <persName key="sierra-2018-idp112912">
            <foreName>Francis</foreName>
            <surname>Bach</surname>
            <initial>F.</initial>
          </persName>
          <persName>
            <foreName>Mike E.</foreName>
            <surname>Davies</surname>
            <initial>M. E.</initial>
          </persName>
        </author>
        <imprint>
          <dateStruct>
            <month>October</month>
            <year>2018</year>
          </dateStruct>
          <ref xlink:href="https://hal.archives-ouvertes.fr/hal-01889990" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>hal.<allowbreak/>archives-ouvertes.<allowbreak/>fr/<allowbreak/>hal-01889990</ref>
        </imprint>
      </monogr>
      <note type="bnote">working paper or preprint</note>
    </biblStruct>
    
    <biblStruct id="sierra-2018-bid49" type="unpublished" rend="year" n="cite:vu:hal-01980339">
      <identifiant type="hal" value="hal-01980339"/>
      <monogr>
        <title level="m">Tube-CNN: Modeling temporal evolution of appearance for object detection in video</title>
        <author>
          <persName key="willow-2018-idp162256">
            <foreName>Tuan-Hung</foreName>
            <surname>Vu</surname>
            <initial>T.-H.</initial>
          </persName>
          <persName>
            <foreName>Anton</foreName>
            <surname>Osokin</surname>
            <initial>A.</initial>
          </persName>
          <persName key="willow-2018-idp114960">
            <foreName>Ivan</foreName>
            <surname>Laptev</surname>
            <initial>I.</initial>
          </persName>
        </author>
        <imprint>
          <dateStruct>
            <month>January</month>
            <year>2019</year>
          </dateStruct>
          <ref xlink:href="https://hal.archives-ouvertes.fr/hal-01980339" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>hal.<allowbreak/>archives-ouvertes.<allowbreak/>fr/<allowbreak/>hal-01980339</ref>
        </imprint>
      </monogr>
      <note type="bnote"><ref xlink:href="https://arxiv.org/abs/1812.02619" location="extern" xlink:type="simple" xlink:show="replace" xlink:actuate="onRequest">https://<allowbreak/>arxiv.<allowbreak/>org/<allowbreak/>abs/<allowbreak/>1812.<allowbreak/>02619</ref> - 13 pages, 8 figures, technical report</note>
    </biblStruct>
  </biblio>
</raweb>
