Machine learning is a recent scientific domain, positioned between applied mathematics, statistics and computer science. Its goals are the optimization, control, and modeling of complex systems from examples. It applies to data from numerous engineering and scientific fields (e.g., vision, bioinformatics, neuroscience, audio processing, text processing, economics, finance, etc.), the ultimate goal being to derive general theories and algorithms allowing advances in each of these domains. Machine learning is characterized by the high quality and quantity of the exchanges between theory, algorithms and applications: interesting theoretical problems almost always emerge from applications, while theoretical analysis allows the understanding of why and when popular or successful algorithms do or do not work, and leads to significant improvements.

Our academic positioning is exactly at the intersection between these three aspects—algorithms, theory and applications—and our main research goal is to make the link between theory and algorithms, and between algorithms and high-impact applications in various engineering and scientific fields, in particular computer vision, bioinformatics, audio processing, text processing and neuro-imaging.

Machine learning is now a vast field of research and the team focuses on the following aspects: supervised learning (kernel methods, calibration), unsupervised learning (matrix factorization, statistical tests), parsimony (structured sparsity, theory and algorithms), and optimization (convex optimization, bandit learning). These four research axes are strongly interdependent, and the interplay between them is key to successful practical applications.

This part of our research focuses on methods where, given a set of examples of input/output pairs, the goal is to predict the output for a new input, with research on kernel methods, calibration methods, and multi-task learning.

We focus here on methods where no output is given and the goal is to find structure of certain known types (e.g., discrete or low-dimensional) in the data, with a focus on matrix factorization, statistical tests, dimension reduction, and semi-supervised learning.

The concept of parsimony is central to many areas of science. In the context of statistical machine learning, this takes the form of variable or feature selection. The team focuses primarily on structured sparsity, with theoretical and algorithmic contributions.

Optimization in all its forms is central to machine learning, as many of its theoretical frameworks are based at least in part on empirical risk minimization. The team focuses primarily on convex and bandit optimization, with a particular focus on large-scale optimization.

Machine learning research can be conducted from two main perspectives. The first one, which has been dominant in the last 30 years, is to design learning algorithms and theories which are as generic as possible, the goal being to make as few assumptions as possible regarding the problems to be solved and to let data speak for themselves. This has led to many interesting methodological developments and successful applications. However, we believe that this strategy has reached its limits for many application domains, such as computer vision, bioinformatics, neuro-imaging, text and audio processing. This leads to the second perspective on which our team is built: research in machine learning theory and algorithms should be driven by interdisciplinary collaborations, so that specific prior knowledge may be properly introduced into the learning process, in particular with the following fields:

Computer vision: object recognition, object detection, image segmentation, image/video processing, computational photography. In collaboration with the Willow project-team.

Bioinformatics: cancer diagnosis, protein function prediction, virtual screening. In collaboration with Institut Curie.

Text processing: document collection modeling, language models.

Audio processing: source separation, speech/music processing.

Neuro-imaging: brain-computer interface (fMRI, EEG, MEG).

Keyword: Optimization

Functional Description: A C++/Python code implementing the methods in the paper "Breaking the Nonsmooth Barrier: A Scalable Parallel Method for Composite Optimization", F. Pedregosa, R. Leblond and S. Lacoste-Julien, Advances in Neural Information Processing Systems (NIPS) 2017. Due to their simplicity and excellent performance, parallel asynchronous variants of stochastic gradient descent have become popular methods to solve a wide range of large-scale optimization problems on multi-core architectures. Yet, despite their practical success, support for nonsmooth objectives is still lacking, making them unsuitable for many problems of interest in machine learning, such as the Lasso, group Lasso or empirical risk minimization with convex constraints. In this work, we propose and analyze ProxASAGA, a fully asynchronous sparse method inspired by SAGA, a variance reduced incremental gradient algorithm. The proposed method is easy to implement and significantly outperforms the state of the art on several nonsmooth, large-scale problems. We prove that our method achieves a theoretical linear speedup with respect to the sequential version under assumptions on the sparsity of gradients and block-separability of the proximal term. Empirical benchmarks on a multi-core architecture illustrate practical speedups of up to 12x on a 20-core machine.

Contact: Fabian Pedregosa

Keyword: Computer vision

Functional Description: Code for the paper Joint Discovery of Object States and Manipulation Actions, ICCV 2017: Many human activities involve object manipulations aiming to modify the object state. Examples of common state changes include full/empty bottle, open/closed door, and attached/detached car wheel. In this work, we seek to automatically discover the states of objects and the associated manipulation actions. Given a set of videos for a particular task, we propose a joint model that learns to identify object states and to localize state-modifying actions. Our model is formulated as a discriminative clustering cost with constraints. We assume a consistent temporal order for the changes in object states and manipulation actions, and introduce new optimization techniques to learn model parameters without additional supervision. We demonstrate successful discovery of seven manipulation actions and corresponding object states on a new dataset of videos depicting real-life object manipulations. We show that our joint formulation results in an improvement of object state discovery by action recognition and vice versa.

Contact: Jean-Baptiste Alayrac

The Wasserstein distance between two probability measures on a metric space is a measure of closeness with applications in statistics, probability, and machine learning. In this work, we consider the fundamental question of how quickly the empirical measure obtained from n independent samples of a given measure converges to that measure in Wasserstein distance.
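In one dimension, the W1 distance between two empirical measures with the same number of atoms has a simple closed form via sorted samples, which gives a quick illustrative look at this convergence question. The sampling setup below is our own toy construction, not an experiment from the work above.

```python
import numpy as np

def w1_empirical(x, y):
    # In 1-D, the W1 distance between two empirical measures with the same
    # number of atoms is the mean absolute difference of sorted samples.
    return float(np.mean(np.abs(np.sort(x) - np.sort(y))))

rng = np.random.default_rng(0)
# Two independent n-sample empirical measures of the same Gaussian:
# their W1 distance shrinks as n grows.
d_small = w1_empirical(rng.normal(size=50), rng.normal(size=50))
d_large = w1_empirical(rng.normal(size=5000), rng.normal(size=5000))
print(d_small, d_large)
```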

The Łojasiewicz inequality shows that sharpness bounds on the minimum of convex optimization problems hold almost generically. Sharpness directly controls the performance of restart schemes. The constants quantifying these error bounds are of course unobservable, but we show that optimal restart strategies are robust, and searching for the best scheme only increases the complexity by a logarithmic factor compared to the optimal bound. Overall then, restart schemes generically accelerate accelerated methods.
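The flavor of such a grid search over restart periods can be sketched as follows (a minimal illustration on a hypothetical ill-conditioned quadratic, not the scheme analyzed in the work above; the log-spaced grid mirrors the logarithmic search overhead):

```python
import numpy as np

def accelerated_gradient(grad, x0, L, iters, restart_every=None):
    # Nesterov's accelerated gradient with optional fixed-period restarts.
    x, y, t = x0.copy(), x0.copy(), 1.0
    for k in range(iters):
        if restart_every and k > 0 and k % restart_every == 0:
            y, t = x.copy(), 1.0          # reset momentum at each restart
        x_new = y - grad(y) / L
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x_new + ((t - 1.0) / t_new) * (x_new - x)
        x, t = x_new, t_new
    return x

# Toy problem: f(x) = 0.5 x^T A x with an ill-conditioned diagonal A.
A = np.diag(np.linspace(1e-2, 1.0, 50))
grad = lambda z: A @ z
x0 = np.ones(50)
# Log-spaced grid of restart periods: trying them all only adds a
# logarithmic factor over the best single scheme.
candidates = [None, 10, 20, 40, 80, 160]
errs = {p: np.linalg.norm(accelerated_gradient(grad, x0, L=1.0, iters=200,
                                               restart_every=p))
        for p in candidates}
print(errs)
```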

Zepeda and Pérez have recently demonstrated the promise of the exemplar SVM (ESVM) as a feature encoder for image retrieval. The paper extends this approach in several directions: We first show that replacing the hinge loss by the square loss in the ESVM cost function significantly reduces encoding time with negligible effect on accuracy. We call this model the square-loss exemplar machine, or SLEM. We then introduce a kernelized SLEM which can be implemented efficiently through low-rank matrix decomposition, and displays improved performance. Both SLEM variants exploit the fact that the negative examples are fixed, so most of the SLEM computational complexity is relegated to an offline process independent of the positive examples. Our experiments establish the performance and computational advantages of our approach using a large array of base features and standard image retrieval datasets.
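The square-loss idea can be sketched in a few lines: with the square loss, the exemplar classifier is a ridge regression with a closed-form solution. The data and regularization below are our own synthetic illustration, not the released implementation.

```python
import numpy as np

def slem_encode(x_pos, X_neg, lam=1.0):
    # Ridge regression separating one positive example (label +1) from a
    # fixed pool of negatives (label -1); the weight vector serves as the
    # feature encoding of x_pos.
    X = np.vstack([x_pos[None, :], X_neg])
    y = np.concatenate([[1.0], -np.ones(len(X_neg))])
    d = X.shape[1]
    # Since the negatives are fixed, the Gram matrix X_neg.T @ X_neg can be
    # factored once offline and reused for every new positive example.
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(1)
X_neg = rng.normal(size=(100, 8))   # fixed negative pool (synthetic)
x = rng.normal(size=8)              # one query / positive example
w = slem_encode(x, X_neg)
print(w.shape)
```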

Due to their simplicity and excellent performance, parallel asynchronous variants of stochastic gradient descent have become popular methods to solve a wide range of large-scale optimization problems on multi-core architectures. Yet, despite their practical success, support for nonsmooth objectives is still lacking, making them unsuitable for many problems of interest in machine learning, such as the Lasso, group Lasso or empirical risk minimization with convex constraints. In this work, we propose and analyze ProxASAGA, a fully asynchronous sparse method inspired by SAGA, a variance reduced incremental gradient algorithm. The proposed method is easy to implement and significantly outperforms the state of the art on several nonsmooth, large-scale problems. We prove that our method achieves a theoretical linear speedup with respect to the sequential version under assumptions on the sparsity of gradients and block-separability of the proximal term. Empirical benchmarks on a multi-core architecture illustrate practical speedups of up to 12x on a 20-core machine.
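The sequential proximal SAGA update that ProxASAGA parallelizes can be sketched as follows on the Lasso. This is a minimal illustration under our own problem sizes and step-size choice, not the released code, and it omits the asynchronous sparse updates that are the actual contribution.

```python
import numpy as np

def soft_threshold(v, t):
    # Proximal operator of t * ||.||_1.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_saga_lasso(A, b, lam, epochs=100, seed=0):
    # min_x 1/(2n) ||A x - b||^2 + lam ||x||_1
    n, d = A.shape
    step = 1.0 / (3.0 * np.max(np.sum(A ** 2, axis=1)))  # 1 / (3 L_max)
    x = np.zeros(d)
    table = A * (A @ x - b)[:, None]   # stored per-sample gradients
    avg = table.mean(axis=0)
    rng = np.random.default_rng(seed)
    for _ in range(epochs * n):
        i = rng.integers(n)
        g = A[i] * (A[i] @ x - b[i])
        v = g - table[i] + avg         # variance-reduced gradient estimate
        x = soft_threshold(x - step * v, step * lam)
        avg += (g - table[i]) / n
        table[i] = g
    return x

rng = np.random.default_rng(1)
A = rng.normal(size=(40, 10))
x_true = np.zeros(10)
x_true[:3] = 1.0
b = A @ x_true
x_hat = prox_saga_lasso(A, b, lam=0.01)
print(np.round(x_hat, 2))
```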

Extrapolation methods use the last few iterates of an optimization algorithm to produce a better estimate of the optimum. They were shown to achieve optimal convergence rates in a deterministic setting using simple gradient iterates. In this work, we study extrapolation methods in a stochastic setting, where the iterates are produced by either a simple or an accelerated stochastic gradient algorithm. We first derive convergence bounds for arbitrary, potentially biased perturbations, then produce asymptotic bounds using the ratio between the variance of the noise and the accuracy of the current point. Finally, we apply this acceleration technique to stochastic algorithms such as SGD, SAGA, SVRG and Katyusha in different settings, and show significant performance gains.

A central challenge to using first-order methods for optimizing nonconvex problems is the presence of saddle points. First-order methods often get stuck at saddle points, greatly deteriorating their performance. Typically, to escape from saddles one has to use second-order methods. However, most works on second-order methods rely extensively on expensive Hessian-based computations, making them impractical in large-scale settings. To tackle this challenge, we introduce a generic framework that minimizes Hessian-based computations while at the same time provably converging to second-order critical points. Our framework carefully alternates between a first-order and a second-order subroutine, using the latter only close to saddle points, and yields convergence results competitive with the state of the art. Empirical results suggest that our strategy also enjoys good practical performance. (Collaboration with Sashank Reddi, Manzil Zaheer, Suvrit Sra, Barnabas Poczos, Ruslan Salakhutdinov, and Alexander Smola)
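A minimal caricature of such a hybrid scheme, on a toy function of our own choosing (not the framework of the paper): cheap gradient steps everywhere, with a negative-curvature step taken only when the gradient is small and a saddle is suspected.

```python
import numpy as np

def escape_saddle(grad, hess, x0, step=0.1, tol=1e-3, iters=200):
    # Cheap gradient steps; only when the gradient is small (a possible
    # saddle), compute an eigendecomposition and follow negative curvature.
    x = x0.copy()
    for _ in range(iters):
        g = grad(x)
        if np.linalg.norm(g) > tol:
            x = x - step * g                  # first-order subroutine
        else:
            w, V = np.linalg.eigh(hess(x))
            if w[0] >= -tol:                  # approx. second-order point
                return x
            x = x + step * V[:, 0]            # second-order subroutine
    return x

# Toy function with a saddle at the origin: f(x, y) = x^2 - y^2 + y^4 / 4,
# whose minima lie at (0, +sqrt(2)) and (0, -sqrt(2)).
grad = lambda p: np.array([2.0 * p[0], -2.0 * p[1] + p[1] ** 3])
hess = lambda p: np.array([[2.0, 0.0], [0.0, -2.0 + 3.0 * p[1] ** 2]])
x = escape_saddle(grad, hess, np.array([1e-8, 0.0]))
print(np.round(x, 3))
```

Started next to the saddle, plain gradient descent would barely move, while the curvature step pushes the iterate off the saddle and down to a minimum.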

Many of the ordinal regression models that have been proposed in the literature can be seen as methods that minimize a convex surrogate of the zero-one, absolute, or squared loss functions. A key property that allows one to study the statistical implications of such approximations is Fisher consistency. Fisher consistency is a desirable property for surrogate loss functions and implies that in the population setting, i.e., if the probability distribution that generates the data were available, optimization of the surrogate would yield the best possible model. We characterize the Fisher consistency of a rich family of surrogate loss functions used in the context of ordinal regression, including support vector ordinal regression, ORBoosting and least absolute deviation. We show that, for a family of surrogate loss functions that subsumes support vector ordinal regression and ORBoosting, consistency can be fully characterized by the derivative of a real-valued function at zero, as happens for convex margin-based surrogates in binary classification. We also derive excess risk bounds for a surrogate of the absolute error that generalize existing risk bounds for binary classification. Finally, our analysis suggests a novel surrogate of the squared error loss. We compare this novel surrogate with competing approaches on 9 different datasets. Our method proves highly competitive in practice, outperforming the least squares loss on 7 out of 9 datasets.
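One common member of this family of surrogates is the all-threshold construction, sketched here with a logistic base loss: the absolute error for k ordered labels is upper-bounded by a sum of binary losses, one per threshold. The label, thresholds and scores below are illustrative choices of ours, not the paper's experimental setup.

```python
import numpy as np

def all_threshold_surrogate(score, label, k):
    # Sum of logistic losses, one per threshold t = 1..k-1, with sign +1
    # if label > t and -1 otherwise: a convex surrogate of the absolute
    # error |label - predicted level|.
    t = np.arange(1, k)
    s = np.where(label > t, 1.0, -1.0)
    return float(np.sum(np.log1p(np.exp(-s * (score - t)))))

# The surrogate is smallest when the score falls in its label's interval:
# for label 2 out of k = 4 levels, a score near the (1, 2] band wins.
losses = [all_threshold_surrogate(s, label=2, k=4) for s in (0.0, 2.0, 4.0)]
print(losses)
```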

Microsoft Research: “Structured Large-Scale Machine Learning”. Machine learning is now ubiquitous in industry, science, engineering, and personal life. While early successes were obtained by applying off-the-shelf techniques, there are two main challenges faced by machine learning in the “big data” era: structure and scale. The project proposes to explore three axes, from theoretical, algorithmic and practical perspectives: (1) large-scale convex optimization, (2) large-scale combinatorial optimization and (3) sequential decision making for structured data. The project involves two Inria sites (Paris and Grenoble) and four MSR sites (Cambridge, New England, Redmond, New York). Project website: http://

A. d’Aspremont: AXA, "mécénat scientifique, chaire Havas-Dauphine", machine learning.

F. Bach: Gift from Facebook AI Research.

A. d'Aspremont: IRIS, PSL “Science des données, données de la science”.

**ITN Spartan**

Title: Sparse Representations and Compressed Sensing Training Network

Type: FP7

Instrument: Initial Training Network

Duration: October 2014 to October 2018

Coordinator: Mark Plumbley (University of Surrey)

Inria contact: Francis Bach

Abstract: The SpaRTaN Initial Training Network will train a new generation of interdisciplinary researchers in sparse representations and compressed sensing, contributing to Europe’s leading role in scientific innovation. By bringing together leading academic and industry groups with expertise in sparse representations, compressed sensing, machine learning and optimisation, and with an interest in applications such as hyperspectral imaging, audio signal processing and video analytics, this project will create an interdisciplinary, trans-national and inter-sectorial training network to enhance mobility and training of researchers in this area. SpaRTaN is funded under the FP7-PEOPLE-2013-ITN call and is part of the Marie Curie Actions — Initial Training Networks (ITN) funding scheme: Project number 607290.

**ITN Macsenet**

Title: Machine Sensing Training Network

Type: H2020

Instrument: Initial Training Network

Duration: January 2015 - January 2019

Coordinator: Mark Plumbley (University of Surrey)

Inria contact: Francis Bach

Abstract: The aim of this Innovative Training Network is to train a new generation of creative, entrepreneurial and innovative early stage researchers (ESRs) in the research area of measurement and estimation of signals using knowledge or data about the underlying structure. We will develop new robust and efficient Machine Sensing theory and algorithms, together with methods for a wide range of signals, including: advanced brain imaging; inverse imaging problems; audio and music signals; and non-traditional signals such as signals on graphs. We will apply these methods to real-world problems, through work with non-academic partners, and disseminate the results of this research to a wide range of academic and non-academic audiences, including through publications, data, software and public engagement events. MacSeNet is funded under the H2020-MSCA-ITN-2014 call and is part of the Marie Skłodowska-Curie Actions — Innovative Training Networks (ITN) funding scheme.

**ERC Sequoia**

Title: Robust algorithms for learning from modern data

Program: H2020

Type: ERC

Duration: 2017-2022

Coordinator: Inria

Inria contact: Francis BACH

Abstract: Machine learning is needed and used everywhere, from science to industry, with a growing impact on many disciplines. While first successes were due at least in part to simple supervised learning algorithms used primarily as black boxes on medium-scale problems, modern data pose new challenges. Scalability is an important issue of course: with large amounts of data, many current problems far exceed the capabilities of existing algorithms despite sophisticated computing architectures. But beyond this, the core classical model of supervised machine learning, with the usual assumptions of independent and identically distributed data, or well-defined features, outputs and loss functions, has reached its theoretical and practical limits. Given this new setting, existing optimization-based algorithms are not adapted. The main objective of this project is to push the frontiers of supervised machine learning, in terms of (a) scalability to data with massive numbers of observations, features, and tasks, (b) adaptability to modern computing environments, in particular for parallel and distributed processing, (c) provable adaptivity and robustness to problem and hardware specifications, and (d) robustness to non-convexities inherent in machine learning problems. To achieve the expected breakthroughs, we will design a novel generation of learning algorithms amenable to a tight convergence analysis with realistic assumptions and efficient implementations. They will help transition machine learning algorithms towards the same widespread robust use as numerical linear algebra libraries. Outcomes of the research described in this proposal will include algorithms that come with strong convergence guarantees and are well-tested on real-life benchmarks coming from computer vision, bioinformatics, audio processing and natural language processing.
For both distributed and non-distributed settings, we will release open-source software, adapted to widely available computing platforms.

Title: Learning from Big Data: First-Order methods for Kernels and Submodular functions

International Partner (Institution - Laboratory - Researcher):

IISc Bangalore (India) - Computer Science Department - Chiranjib Bhattacharyya

Start year: 2016

Recent advances in sensor technologies have resulted in large amounts of data being generated in a wide array of scientific disciplines. Deriving models from such large datasets, often known as “Big Data”, is one of the important challenges facing many engineering and scientific disciplines. In this proposal we investigate the problem of learning supervised models from Big Data, which has immediate applications in computational biology, computer vision, natural language processing, the Web, e-commerce, etc., where specific structure is often present and hard to take into account with current algorithms. Our focus will be on the algorithmic aspects. Often supervised learning problems can be cast as convex programs. The goal of this proposal is to derive first-order methods which can be effective for solving such convex programs arising in the Big-Data setting. With this broad goal in mind, we investigate two foundational problems which are not well addressed in the existing literature. The first problem investigates stochastic gradient descent algorithms, in the context of first-order methods, for designing kernel-based prediction functions on large datasets. The second problem involves solving discrete optimization problems arising in submodular formulations in machine learning, for which first-order methods have not reached the level of speed required for practical applications (notably in computer vision).

Marwa El Halabi, from Jan. until Apr. 2017, EPFL, Lausanne, Switzerland

Jonathan Weed, from Mar. 2017 until May 2017, MIT, US

Alfredo Zermini, from Mar 2017 until June 2017, University of Surrey, UK

Billy Tang, from Sept. 2017 until Dec. 2017, University of Edinburgh, UK

P. Germain and F. Bach: co-organization of NIPS workshop: "(Almost) 50 Shades of Bayesian Learning: PAC-Bayesian trends and insights" https://

A. d'Aspremont: co-organization of the workshop: “Optimization and Statistical Learning”, Les Houches, France

F. Bach: Senior Area chair for NIPS 2017

F. Bach: Action Editor, Journal of Machine Learning Research.

F. Bach: Information and Inference, Associate Editor.

F. Bach: Electronic Journal of Statistics, Associate Editor.

F. Bach: Mathematical Programming, Associate Editor.

F. Bach: Foundations of Computational Mathematics, Associate Editor.

A. d'Aspremont: SIAM Journal on Optimization, Associate Editor.

F. Bach: Workshop on Shape, Images and Optimization, Muenster, Germany, invited talk, February 2017

F. Bach: SIAM conference on Optimization, Vancouver, Canada, invited tutorial, May 2017

F. Bach: LCCC workshop on Large-Scale and Distributed Optimization, Lund, Sweden, invited talk, June 2017

F. Bach: Summer school on Structured Regularization for High-Dimensional Data Analysis, Paris, invited talk, June 2017

F. Bach: FOCM Barcelona, two invited talks in special sessions, July 2017

F. Bach: European Signal Processing conference (EUSIPCO), Kos, Greece, keynote speaker, August 2017

F. Bach: StatMathAppli 2017, Frejus, mini-course on optimization, September 2017

F. Bach: 2017 ERNSI Workshop on System Identification, Lyon, invited plenary talk, September 2017

F. Bach: New-York University, Data science seminar, October 2017

F. Bach: Workshop on Generative models, parameter learning and sparsity, Cambridge, UK, invited talk, November 2017

F. Bach: NIPS workshops, two invited talks, Long Beach, CA, December 2017

A. d'Aspremont: “Regularized Nonlinear Acceleration”

GdR MOA, Bordeaux.

GdR MEGA, Paris.

SIAM Optimization conference

Oxford computational math seminar

Alan Turing institute

A. d'Aspremont: “Sharpness, Restart and Acceleration”. Foundations of Computational Mathematics, Barcelona.

P. Germain: “Generalization of the PAC-Bayesian Theory, and Applications to Semi-Supervised Learning”, Modal Seminars, Lille, France, January 2017

P. Germain: “Theory Driven Domain Adaptation Algorithm”, Google Brain TechTalk, Mountain View (CA), USA, April 2017

P. Gaillard: “Sparse acceleration of exponential weights”

Seminar of the SEQUEL project-team, Lille, February 2017

49e Journées Françaises de Statistique, Avignon, June 2017

P. Gaillard: “Obtaining sparse and fast convergence rates online under Bernstein condition”, CWI-Inria Workshop, September 2017

P. Gaillard: “Online nonparametric learning”

Cambridge Statistics Seminar, October 2017

Statistics Seminar of the University Aix-Marseille, December 2017

Master: A. d'Aspremont, “Optimization”, 21h, M1, Ecole Normale Supérieure, France

Master: A. d'Aspremont, “Optimization”, 21h, M2 (MVA), ENS Cachan, France

Master: F. Bach and P. Gaillard, “Apprentissage statistique”, 35h, M1, Ecole Normale Supérieure, France.

Master: F. Bach (together with G. Obozinski), “Graphical models”, 30h, M2 (MVA), ENS Cachan, France.

Master: F. Bach, “Optimisation et apprentissage statistique”, 20h, M2 (Mathématiques de l'aléatoire), Université Paris-Sud, France.

Master: F. Pedregosa (together with Fajwel Fogel), “Introduction to scikit-learn”, M2 (MASH), Université Paris-Dauphine, France.

PhD: Nicolas Flammarion, July 2017, co-directed by Alexandre d'Aspremont and Francis Bach.

PhD: Aymeric Dieuleveut, September 2017, directed by Francis Bach.

PhD: Christophe Dupuy, June 2017, directed by Francis Bach.

PhD: Rafael Rezende, December 2017, co-advised by Francis Bach and Jean Ponce.

PhD: Vincent Roulet, December 2017, directed by Alexandre d'Aspremont.

PhD in progress: Damien Scieur, started September 2015, co-directed by Alexandre d'Aspremont and Francis Bach

PhD in progress: Antoine Recanati, started September 2015, directed by Alexandre d'Aspremont

PhD in progress: Anaël Bonneton, started December 2014, co-advised by Francis Bach, located in Agence nationale de la sécurité des systèmes d’information (ANSSI).

PhD in progress: Dmitry Babichev, started September 2015, co-advised by Francis Bach and Anatoly Judistky (Univ. Grenoble).

PhD in progress: Tatiana Shpakova, started September 2015, advised by Francis Bach.

PhD in progress: Loucas Pillaud-Vivie, started September 2017, co-directed by Alessandro Rudi and Francis Bach

PhD in progress: Margaux Brégère, started September 2017, co-advised by Pierre Gaillard, Gilles Stoltz and Yannig Goude (EDF R&D)

A. d'Aspremont: Paris Science et Data, PSL & Inria.

A. d'Aspremont: Journée innovation défense

P. Gaillard: testimony for EDF fellows day