Machine learning is a recent scientific domain, positioned between applied mathematics, statistics and computer science. Its goals are the optimization, control, and modeling of complex systems from examples. It applies to data from numerous engineering and scientific fields (e.g., vision, bioinformatics, neuroscience, audio processing, text processing, economics, finance), the ultimate goal being to derive general theories and algorithms allowing advances in each of these domains. Machine learning is characterized by the high quality and quantity of the exchanges between theory, algorithms and applications: interesting theoretical problems almost always emerge from applications, while theoretical analysis allows the understanding of why and when popular or successful algorithms do or do not work, and leads to significant improvements.

Our academic positioning is exactly at the intersection between these three aspects—algorithms, theory and applications—and our main research goal is to make the link between theory and algorithms, and between algorithms and high-impact applications in various engineering and scientific fields, in particular computer vision, bioinformatics, audio processing, text processing and neuro-imaging.

Machine learning is now a vast field of research and the team focuses on the following aspects: supervised learning (kernel methods, calibration), unsupervised learning (matrix factorization, statistical tests), parsimony (structured sparsity, theory and algorithms), and optimization (convex optimization, bandit learning). These four research axes are strongly interdependent, and the interplay between them is key to successful practical applications.

This part of our research focuses on methods where, given a set of examples of input/output pairs, the goal is to predict the output for a new input, with research on kernel methods, calibration methods, and multi-task learning.

We focus here on methods where no output is given and the goal is to find structure of certain known types (e.g., discrete or low-dimensional) in the data, with a focus on matrix factorization, statistical tests, dimension reduction, and semi-supervised learning.

The concept of parsimony is central to many areas of science. In the context of statistical machine learning, this takes the form of variable or feature selection. The team focuses primarily on structured sparsity, with theoretical and algorithmic contributions.

Optimization in all its forms is central to machine learning, as many of its theoretical frameworks are based at least in part on empirical risk minimization. The team focuses primarily on convex and bandit optimization, with a particular focus on large-scale optimization.

Machine learning research can be conducted from two main perspectives. The first, which has been dominant in the last 30 years, is to design learning algorithms and theories that are as generic as possible, the goal being to make as few assumptions as possible regarding the problems to be solved and to let the data speak for themselves. This has led to many interesting methodological developments and successful applications. However, we believe that this strategy has reached its limit for many application domains, such as computer vision, bioinformatics, neuro-imaging, and text and audio processing, which leads to the second perspective our team is built on: research in machine learning theory and algorithms should be driven by interdisciplinary collaborations, so that specific prior knowledge may be properly introduced into the learning process, in particular with the following fields:

Computer vision: object recognition, object detection, image segmentation, image/video processing, computational photography. In collaboration with the Willow project-team.

Bioinformatics: cancer diagnosis, protein function prediction, virtual screening. In collaboration with Institut Curie.

Text processing: document collection modeling, language models.

Audio processing: source separation, speech/music processing.

Neuro-imaging: brain-computer interface (fMRI, EEG, MEG).

Francis Bach, Lagrange Prize in Continuous Optimization, Society for Industrial and Applied Mathematics 2018

Francis Bach, Best Paper Award, NeurIPS 2018.

Francis Bach included in the report *Highly cited researchers, year 2018*, Clarivate Analytics, 2018

Nicolas Flammarion, PhD thesis award in the *Programme Gaspard Monge*, Fondation Mathématique Jacques Hadamard, 2018.

Adrien Taylor, Tucker Prize (finalist) 2018 (dissertation prize by the Mathematical Optimization Society for 2015-2017).

Adrien Taylor, IBM/FNRS innovation award 2018 (dissertation prize for original contributions to informatics).

Adrien Taylor, ICTEAM thesis award 2018 (dissertation award by the ICTEAM institute of UCLouvain, Belgium).

Adrien Taylor, Best paper award 2018 from the journal Optimization Letters for the paper *On the worst-case complexity of the gradient method with exact line search for smooth strongly convex functions*, Etienne De Klerk, François Glineur, Adrien Taylor.

Keyword: Optimization

Functional Description: A C++/Python code implementing the methods in the paper "Breaking the Nonsmooth Barrier: A Scalable Parallel Method for Composite Optimization", F. Pedregosa, R. Leblond and S. Lacoste-Julien, Advances in Neural Information Processing Systems (NIPS) 2017. Due to their simplicity and excellent performance, parallel asynchronous variants of stochastic gradient descent have become popular methods to solve a wide range of large-scale optimization problems on multi-core architectures. Yet, despite their practical success, support for nonsmooth objectives is still lacking, making them unsuitable for many problems of interest in machine learning, such as the Lasso, group Lasso or empirical risk minimization with convex constraints. In this work, we propose and analyze ProxASAGA, a fully asynchronous sparse method inspired by SAGA, a variance reduced incremental gradient algorithm. The proposed method is easy to implement and significantly outperforms the state of the art on several nonsmooth, large-scale problems. We prove that our method achieves a theoretical linear speedup with respect to the sequential version under assumptions on the sparsity of gradients and block-separability of the proximal term. Empirical benchmarks on a multi-core architecture illustrate practical speedups of up to 12x on a 20-core machine.

Contact: Fabian Pedregosa
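The description above centers on asynchronous proximal SAGA updates. As a rough illustration, here is a minimal *sequential* proximal-SAGA sketch on a lasso problem; ProxASAGA itself runs updates of this form asynchronously across cores, and all names, data and parameters below are illustrative rather than taken from the released package:

```python
import numpy as np

def prox_l1(x, t):
    # soft-thresholding: proximal operator of t * ||x||_1
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def prox_saga(A, b, lam, step, n_iter, seed=0):
    # Sequential proximal SAGA for the lasso:
    #   min_x (1/2n) ||Ax - b||^2 + lam ||x||_1
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    grads = np.zeros((n, d))            # table of stored per-sample gradients
    g_avg = grads.mean(axis=0)
    for _ in range(n_iter):
        i = rng.integers(n)
        g_new = A[i] * (A[i] @ x - b[i])            # gradient of sample i
        # SAGA direction: new gradient - stored gradient + table average
        x = prox_l1(x - step * (g_new - grads[i] + g_avg), step * lam)
        g_avg += (g_new - grads[i]) / n             # keep the average in sync
        grads[i] = g_new
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 10))
x_true = np.zeros(10)
x_true[:3] = 1.0
b = A @ x_true + 0.01 * rng.standard_normal(50)
x_hat = prox_saga(A, b, lam=0.01, step=0.01, n_iter=5000)
obj = 0.5 * np.mean((A @ x_hat - b) ** 2) + 0.01 * np.abs(x_hat).sum()
```

The asynchronous variant additionally exploits sparsity of the per-sample gradients and block-separability of the proximal term, which is what makes lock-free parallel execution safe.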

Keyword: Computer vision

Functional Description: Code for the paper Joint Discovery of Object States and Manipulation Actions, ICCV 2017: Many human activities involve object manipulations aiming to modify the object state. Examples of common state changes include full/empty bottle, open/closed door, and attached/detached car wheel. In this work, we seek to automatically discover the states of objects and the associated manipulation actions. Given a set of videos for a particular task, we propose a joint model that learns to identify object states and to localize state-modifying actions. Our model is formulated as a discriminative clustering cost with constraints. We assume a consistent temporal order for the changes in object states and manipulation actions, and introduce new optimization techniques to learn model parameters without additional supervision. We demonstrate successful discovery of seven manipulation actions and corresponding object states on a new dataset of videos depicting real-life object manipulations. We show that our joint formulation results in an improvement of object state discovery by action recognition and vice versa.

Participants: Jean-Baptiste Alayrac, Josef Sivic, Ivan Laptev and Simon Lacoste-Julien

Contact: Jean-Baptiste Alayrac

Publication: Joint Discovery of Object States and Manipulation Actions

Many tasks in machine learning and signal processing can be solved by minimizing a convex function of a measure. This includes sparse spikes deconvolution and training a neural network with a single hidden layer. For these problems, we study a simple minimization method: the unknown measure is discretized into a mixture of particles, and a continuous-time gradient descent is performed on their weights and positions. This is an idealization of the usual way to train neural networks with a large hidden layer. We show that, when initialized correctly and in the many-particle limit, this gradient flow, although non-convex, converges to global minimizers. The proof involves Wasserstein gradient flows, a by-product of optimal transport theory. Numerical experiments show that this asymptotic behavior is already at play for a reasonable number of particles, even in high dimension.
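A minimal discretized sketch of this idea, assuming a toy sparse-deconvolution objective: plain (discrete-time) gradient descent on the weights and positions of a mixture of particles. The bump shape, grid and all parameters are illustrative:

```python
import numpy as np

def particle_gradient_descent(y, grid, n_particles=50, step=0.005, n_iter=5000, seed=0):
    # Gradient descent on weights and positions of a mixture of particles,
    # minimizing F(w, theta) = 0.5 * || sum_i w_i * phi(theta_i) - y ||^2,
    # a discretization of a convex function of a measure.
    # phi(theta): a Gaussian bump centered at theta, sampled on a grid.
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0.0, 1.0, n_particles)      # spread-out initialization
    w = np.full(n_particles, 1.0 / n_particles)
    for _ in range(n_iter):
        Phi = np.exp(-((grid[:, None] - theta[None, :]) ** 2) / 0.02)
        r = Phi @ w - y                             # residual on the grid
        dPhi = Phi * (grid[:, None] - theta[None, :]) / 0.01
        grad_w = Phi.T @ r                          # gradient in the weights
        grad_theta = w * (dPhi.T @ r)               # gradient in the positions
        w -= step * grad_w
        theta -= step * grad_theta
    Phi = np.exp(-((grid[:, None] - theta[None, :]) ** 2) / 0.02)
    return w, theta, 0.5 * ((Phi @ w - y) ** 2).sum()

grid = np.linspace(0.0, 1.0, 50)
# target signal: two spikes at 0.3 and 0.7, convolved with the same bump
y = np.exp(-((grid - 0.3) ** 2) / 0.02) + 0.5 * np.exp(-((grid - 0.7) ** 2) / 0.02)
w, theta, loss = particle_gradient_descent(y, grid)
```

The "many-particle, spread-out initialization" of the analysis corresponds here to starting with many particles uniformly covering the domain.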

Consider a network of agents connected by communication links, where each agent holds a real value. The gossip problem consists in estimating the average of the values diffused in the network in a distributed manner. Current techniques for gossiping are designed to deal with worst-case scenarios, which is irrelevant in applications to distributed statistical learning and denoising in sensor networks. We design second-order gossip methods tailor-made for the case where the real values are i.i.d. samples from the same distribution. For some regular network structures, we are able to prove optimality of our methods, and simulations suggest that they are efficient in a wide range of random networks. Our approach to gossip stems from a new acceleration framework using the family of orthogonal polynomials with respect to the spectral measure of the network graph.
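For context, here is a minimal sketch of the standard first-order, synchronous gossip baseline that the polynomial-based methods above accelerate: each node repeatedly averages its value with its neighbors'. The ring topology and round count are illustrative:

```python
import numpy as np

def gossip_average(values, neighbors, n_rounds):
    # Synchronous gossip: each node replaces its value by the average of
    # its own value and its neighbors' values, round after round.
    x = np.array(values, dtype=float)
    for _ in range(n_rounds):
        x = np.array([(x[i] + x[nbrs].sum()) / (1 + len(nbrs))
                      for i, nbrs in enumerate(neighbors)])
    return x

# ring of 8 nodes, each connected to its two neighbors
n = 8
neighbors = [np.array([(i - 1) % n, (i + 1) % n]) for i in range(n)]
values = np.arange(n, dtype=float)      # true average is 3.5
x = gossip_average(values, neighbors, n_rounds=200)
```

On a regular graph this averaging matrix is doubly stochastic, so the network average is preserved at every round while the values contract toward it; the accelerated methods replace these simple powers of the gossip matrix with better-chosen polynomials.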

Stochastic gradient methods enable learning probabilistic models from large amounts of data. While large step-sizes (learning rates) have been shown to be best for least-squares (e.g., Gaussian noise) once combined with parameter averaging, they do not lead to convergent algorithms in general. We consider generalized linear models, that is, conditional models based on exponential families, and propose averaging moment parameters instead of natural parameters for constant-step-size stochastic gradient descent. For finite-dimensional models, we show that this can sometimes (and surprisingly) lead to better predictions than the best linear model. For infinite-dimensional models, we show that it always converges to optimal predictions, while averaging natural parameters never does. We illustrate our findings with simulations on synthetic data and classical benchmarks with many observations.
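A minimal sketch of the idea for logistic regression (an exponential-family conditional model): run constant-step-size SGD and keep two running averages, one of the natural parameters (the classical choice) and one of the moment parameters, i.e. the predicted probabilities along the iterates. The synthetic data and all parameters below are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logistic_averaged(X, y, step, n_iter, seed=0):
    # Constant-step-size SGD for logistic regression, returning:
    # - avg_theta: averaged natural parameters
    # - avg_mu:    averaged moment parameters (predicted probabilities
    #              on the data, averaged along the iterates)
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    theta_sum = np.zeros(d)
    mu_sum = np.zeros(n)
    for _ in range(n_iter):
        i = rng.integers(n)
        theta -= step * (sigmoid(X[i] @ theta) - y[i]) * X[i]
        theta_sum += theta
        mu_sum += sigmoid(X @ theta)
    return theta_sum / n_iter, mu_sum / n_iter

rng = np.random.default_rng(1)
X = np.c_[np.ones(200), rng.standard_normal(200)]
theta_star = np.array([0.5, 2.0])
y = (rng.random(200) < sigmoid(X @ theta_star)).astype(float)
avg_theta, avg_mu = sgd_logistic_averaged(X, y, step=0.5, n_iter=5000)
acc = np.mean((avg_mu > 0.5) == (y > 0.5))
```

With a constant step-size the iterates themselves keep oscillating, which is why averaging (of one parametrization or the other) matters.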

Regularized nonlinear acceleration (RNA) is a generic extrapolation scheme for optimization methods, with marginal computational overhead. It aims to improve convergence using only the iterates of simple iterative algorithms. However, so far its application to optimization was theoretically limited to gradient descent and other single-step algorithms. Here, we adapt RNA to a much broader setting including stochastic gradient with momentum and Nesterov's fast gradient. We use it to train deep neural networks, and empirically observe that extrapolated networks are more accurate, especially in the early iterations. A straightforward application of our algorithm when training ResNet-152 on ImageNet produces a top-1 test error of 20.88%, improving on the reference classification pipeline by 0.8 points. Furthermore, the extrapolation runs offline in this case, so it never negatively affects performance.
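A minimal sketch of the RNA extrapolation step itself, applied to plain gradient-descent iterates on a toy ill-conditioned quadratic (the regularization constant and setup are illustrative):

```python
import numpy as np

def rna(xs, reg=1e-8):
    # Regularized Nonlinear Acceleration: combine past iterates x_0..x_k
    # with weights c (summing to one) chosen to minimize the norm of the
    # combined residuals, with Tikhonov regularization for stability.
    X = np.array(xs)
    R = X[1:] - X[:-1]                  # residuals r_i = x_{i+1} - x_i
    M = R @ R.T
    M = M / np.linalg.norm(M) + reg * np.eye(len(M))
    c = np.linalg.solve(M, np.ones(len(M)))
    c /= c.sum()
    return c @ X[:-1]                   # extrapolated point

# gradient descent on an ill-conditioned quadratic f(x) = 0.5 x^T A x
A = np.diag([1.0, 0.01])
f = lambda z: 0.5 * z @ A @ z
x = np.array([1.0, 1.0])
xs = [x.copy()]
for _ in range(10):
    x = x - 0.9 * (A @ x)               # plain gradient step, slow on A
    xs.append(x.copy())
x_rna = rna(xs)
```

Because the scheme only consumes iterates, it can indeed be run offline on snapshots of any optimizer, which is the property exploited above for neural networks.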

The Regularized Nonlinear Acceleration (RNA) algorithm is an acceleration method capable of improving the rate of convergence of many optimization schemes such as gradient descent, SAGA or SVRG. Until now, its analysis has been limited to convex problems, but empirical observations show that RNA may be extended to wider settings. We investigate further the benefits of RNA when applied to neural networks, in particular for the task of image recognition on CIFAR10 and ImageNet. With very few modifications of existing frameworks, RNA slightly improves the optimization process of CNNs, after training.

The seriation problem seeks to reorder a set of elements given pairwise similarity information, so that elements with higher similarity are closer in the resulting sequence. When a global ordering consistent with the similarity information exists, an exact spectral solution recovers it in the noiseless case, and seriation is equivalent to the combinatorial 2-SUM problem over permutations, for which several relaxations have been derived. However, in applications such as DNA assembly, similarity values are often heavily corrupted, and the solution of 2-SUM may no longer yield an approximate serial structure on the elements. We introduce the robust seriation problem and show that it is equivalent to a modified 2-SUM problem for a class of similarity matrices modeling those observed in DNA assembly. We explore several relaxations of this modified 2-SUM problem and compare them empirically on both synthetic matrices and real DNA data. We then introduce the problem of seriation with duplications, a generalization of seriation motivated by applications to cancer genome reconstruction. We propose an algorithm involving robust seriation to solve it, and present preliminary results on synthetic data sets.
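For background, the exact spectral solution in the noiseless case can be sketched in a few lines: sort the elements by the entries of the Fiedler vector of the similarity Laplacian. This is the textbook construction, not the robust variants studied here; the banded similarity matrix is illustrative:

```python
import numpy as np

def spectral_ordering(S):
    # Spectral seriation: sort elements by the entries of the Fiedler
    # vector (eigenvector for the second-smallest eigenvalue) of the
    # Laplacian of the similarity matrix. Exact for noiseless serial
    # (Robinson) similarity matrices.
    L = np.diag(S.sum(axis=1)) - S
    fiedler = np.linalg.eigh(L)[1][:, 1]
    return np.argsort(fiedler)

n = 20
# banded similarity: items close in the latent order are more similar
dist = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
S_true = np.maximum(0, 3 - dist)
perm = np.random.default_rng(0).permutation(n)
S_shuffled = S_true[np.ix_(perm, perm)]
order = spectral_ordering(S_shuffled)
recovered = perm[order]     # latent positions, in the recovered order
```

On heavily corrupted similarities this spectral solution degrades, which is precisely the regime the robust 2-SUM formulations address.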

Spectral clustering uses a graph Laplacian spectral embedding to enhance the cluster structure of some data sets. When the embedding is one-dimensional, it can be used to sort the items (spectral ordering). A number of empirical results also suggest that a multidimensional Laplacian embedding enhances the latent ordering of the data, if any. This also extends to circular orderings, a case where unidimensional embeddings fail. We tackle the task of retrieving linear and circular orderings in a unifying framework, and show how a latent ordering on the data translates into a filamentary structure on the Laplacian embedding. We propose a method to recover it, illustrated with numerical experiments on synthetic data and real DNA sequencing data.

As datasets continue to increase in size and multi-core computer architectures are developed, asynchronous parallel optimization algorithms become more and more essential to the field of machine learning. Unfortunately, conducting the theoretical analysis of asynchronous methods is difficult, notably due to the introduction of delay and inconsistency in inherently sequential algorithms. Handling these issues often requires resorting to simplifying but unrealistic assumptions. Through a novel perspective, we revisit and clarify a subtle but important technical issue present in a large fraction of the recent convergence rate proofs for asynchronous parallel optimization algorithms, and propose a simplification of the recently introduced "perturbed iterate" framework that resolves it. We demonstrate the usefulness of our new framework by analyzing three distinct asynchronous parallel incremental optimization algorithms: Hogwild (asynchronous SGD), KROMAGNON (asynchronous SVRG) and ASAGA, a novel asynchronous parallel version of the incremental gradient algorithm SAGA that enjoys fast linear convergence rates. We are able to both remove problematic assumptions and obtain better theoretical results. Notably, we prove that ASAGA and KROMAGNON can obtain a theoretical linear speedup on multi-core systems even without sparsity assumptions. We present results of an implementation on a 40-core architecture illustrating the practical speedups as well as the hardware overhead. Finally, we investigate the overlap constant, an ill-understood but central quantity for the theoretical analysis of asynchronous parallel algorithms. We find that it encompasses much more complexity than suggested in previous work, and is often orders of magnitude bigger than traditionally thought.

The impressive breakthroughs of the last two decades in the field of machine learning can be in large part attributed to the explosion of computing power and available data. These two limiting factors have been replaced by a new bottleneck: algorithms. The focus of this thesis is thus on introducing novel methods that can take advantage of high data quantity and computing power. We present two independent contributions.

First, we develop and analyze novel fast optimization algorithms which take advantage of the advances in parallel computing architecture and can handle vast amounts of data. We introduce a new framework of analysis for asynchronous parallel incremental algorithms, which enables correct and simple proofs. We then demonstrate its usefulness by performing the convergence analysis for several methods, including two novel algorithms.

Asaga is a sparse asynchronous parallel variant of the variance-reduced algorithm Saga which enjoys fast linear convergence rates on smooth and strongly convex objectives. We prove that it can be linearly faster than its sequential counterpart, even without sparsity assumptions.

ProxAsaga is an extension of Asaga to the more general setting where the regularizer can be non-smooth. We prove that it can also achieve a linear speedup. We provide extensive experiments comparing our new algorithms to the current state of the art.

Second, we introduce new methods for complex structured prediction tasks. We focus on recurrent neural networks (RNNs), whose traditional training algorithm, based on maximum likelihood estimation (MLE), suffers from several issues. The associated surrogate training loss notably ignores the information contained in structured losses and introduces discrepancies between train and test times that may hurt performance.

To alleviate these problems, we propose SeaRNN, a novel training algorithm for RNNs inspired by the “learning to search” approach to structured prediction. SeaRNN leverages test-alike search space exploration to introduce global-local losses that are closer to the test error than the MLE objective.

We demonstrate improved performance over MLE on three challenging tasks, and provide several subsampling strategies to enable SeaRNN to scale to large-scale tasks, such as machine translation. Finally, after contrasting the behavior of SeaRNN models to MLE models, we conduct an in-depth comparison of our new approach to the related work.

Statistical leverage scores emerged as a fundamental tool for matrix sketching and column sampling, with applications to low-rank approximation, regression, random feature learning and quadrature. Yet, the very nature of this quantity is barely understood. Borrowing ideas from the orthogonal polynomial literature, we introduce the regularized Christoffel function associated to a positive definite kernel. This uncovers a variational formulation for leverage scores for kernel methods and allows us to elucidate their relationships with the chosen kernel as well as the population density. Our main result quantitatively describes a decreasing relation between leverage score and population density for a broad class of kernels on Euclidean spaces. Numerical simulations support our findings.

Key to structured prediction is exploiting the problem structure to simplify the learning process. A major challenge arises when data exhibit a local structure (e.g., are made of "parts") that can be leveraged to better approximate the relation between (parts of) the input and (parts of) the output. Recent literature on signal processing, and in particular computer vision, has shown that capturing these aspects is indeed essential to achieve state-of-the-art performance. While such algorithms are typically derived on a case-by-case basis, we propose the first theoretical framework to deal with part-based data from a general perspective. We derive a novel approach to deal with these problems and study its generalization properties within the setting of statistical learning theory. Our analysis is novel in that it explicitly quantifies the benefits of leveraging the part-based structure of the problem with respect to the learning rates of the proposed estimator.

Applications of optimal transport have recently gained remarkable attention thanks to the computational advantages of entropic regularization. However, in most situations the Sinkhorn approximation of the Wasserstein distance is replaced by a regularized version that is less accurate but easier to differentiate. We characterize the differential properties of the original Sinkhorn distance, proving that it enjoys the same smoothness as its regularized version, and we explicitly provide an efficient algorithm to compute its gradient. We show that this result benefits both theory and applications: on the one hand, high-order smoothness confers statistical guarantees to learning with Wasserstein approximations; on the other hand, the gradient formula allows us to efficiently solve learning and optimization problems in practice. Promising preliminary experiments complement our analysis.

Sketching and stochastic gradient methods are arguably the most common techniques to derive efficient large-scale learning algorithms. We investigate their application in the context of nonparametric statistical learning. More precisely, we study the estimator defined by stochastic gradient with mini-batches and random features. The latter can be seen as a form of nonlinear sketching and can be used to define approximate kernel methods. The considered estimator is not explicitly penalized/constrained and regularization is implicit. Indeed, our study highlights how different parameters, such as the number of features, the number of iterations, the step-size and the mini-batch size, control the learning properties of the solutions. We do this by deriving optimal finite-sample bounds, under standard assumptions. The obtained results are corroborated and illustrated by numerical experiments.
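A minimal sketch of such an estimator: random Fourier features (a nonlinear sketch approximating a Gaussian kernel) combined with mini-batch SGD and iterate averaging, with no explicit penalty. All parameter choices below are illustrative:

```python
import numpy as np

def random_fourier_features(X, n_feat, gamma, seed=0):
    # Random Fourier features approximating the Gaussian kernel
    # exp(-gamma * ||x - x'||^2): a simple form of nonlinear sketching.
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], n_feat)) * np.sqrt(2.0 * gamma)
    b = rng.uniform(0.0, 2.0 * np.pi, n_feat)
    return np.sqrt(2.0 / n_feat) * np.cos(X @ W + b)

def averaged_sgd(Phi, y, step, n_epochs, batch, seed=0):
    # Mini-batch SGD with iterate averaging and no explicit penalty:
    # regularization is implicit, controlled by the step-size, the number
    # of features, the mini-batch size and the number of passes.
    rng = np.random.default_rng(seed)
    n, d = Phi.shape
    w = np.zeros(d)
    w_avg = np.zeros(d)
    n_updates = 0
    for _ in range(n_epochs):
        for idx in rng.permutation(n).reshape(-1, batch):
            w -= step * Phi[idx].T @ (Phi[idx] @ w - y[idx]) / batch
            w_avg += w
            n_updates += 1
    return w_avg / n_updates

rng = np.random.default_rng(2)
X = rng.uniform(-1.0, 1.0, (200, 1))
y = np.sin(3.0 * X[:, 0]) + 0.1 * rng.standard_normal(200)
Phi = random_fourier_features(X, n_feat=100, gamma=2.0)
w = averaged_sgd(Phi, y, step=0.25, n_epochs=50, batch=5)
mse = np.mean((Phi @ w - y) ** 2)
```

Here the number of features, passes, step-size and batch size play the role of the regularization parameter, as in the analysis above.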

Structured prediction provides a general framework to deal with supervised problems where the outputs have semantically rich structure. While classical approaches consider finite, albeit potentially huge, output spaces, we discuss how structured prediction can be extended to a continuous scenario. Specifically, we study a structured prediction approach to manifold-valued regression. We characterize a class of problems for which the considered approach is statistically consistent and study how geometric optimization can be used to compute the corresponding estimator. Promising experimental results on both simulated and real data complete our study.

Leverage score sampling provides an appealing way to perform approximate computations for large matrices. Indeed, it allows one to derive faithful approximations with a complexity adapted to the problem at hand. Yet, performing leverage score sampling is a challenge in its own right, requiring further approximations. We study the problem of leverage score sampling for positive definite matrices defined by a kernel. Our contribution is twofold: first, we provide a novel algorithm for leverage score sampling; second, we exploit the proposed method in statistical learning by deriving a novel solver for kernel ridge regression. Our main technical contribution is showing that the proposed algorithms are currently the most efficient and accurate for these problems.
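For background, the exact (cubic-cost) ridge leverage scores of a kernel matrix can be computed directly as follows; the contribution above is precisely to approximate them efficiently. The kernel and data are illustrative:

```python
import numpy as np

def ridge_leverage_scores(K, lam):
    # Ridge leverage scores of an n x n kernel matrix K:
    #   l_i(lam) = ( K (K + n * lam * I)^{-1} )_{ii}
    # Sampling columns proportionally to these scores yields accurate
    # Nystrom-type approximations of K.
    n = K.shape[0]
    return np.diag(np.linalg.solve(K + n * lam * np.eye(n), K))

rng = np.random.default_rng(3)
X = rng.standard_normal((30, 2))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq)                         # Gaussian kernel matrix
scores = ridge_leverage_scores(K, lam=1e-2)
d_eff = scores.sum()                    # effective dimension at level lam
```

Each score lies in (0, 1), and their sum (the effective dimension) governs how many columns need to be sampled for a faithful approximation.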

Microsoft Research: “Structured Large-Scale Machine Learning”. Machine learning is now ubiquitous in industry, science, engineering, and personal life. While early successes were obtained by applying off-the-shelf techniques, there are two main challenges faced by machine learning in the “big data” era: structure and scale. The project proposes to explore three axes, from theoretical, algorithmic and practical perspectives: (1) large-scale convex optimization, (2) large-scale combinatorial optimization and (3) sequential decision making for structured data. The project involves two Inria sites (Paris and Grenoble) and four MSR sites (Cambridge, New England, Redmond, New York). Project website: http://

Alexandre d’Aspremont, Francis Bach, Martin Jaggi (EPFL): Google Focused award.

Francis Bach: Gift from Facebook AI Research.

Alexandre d’Aspremont: AXA, "mécénat scientifique, chaire Havas-Dauphine", machine learning.

Alexandre d'Aspremont: IRIS, PSL “Science des données, données de la science”.

**ITN Spartan**

Title: Sparse Representations and Compressed Sensing Training Network

Type: FP7

Instrument: Initial Training Network

Duration: October 2014 to October 2018

Coordinator: Mark Plumbley (University of Surrey)

Inria contact: Francis Bach

Abstract: The SpaRTaN Initial Training Network will train a new generation of interdisciplinary researchers in sparse representations and compressed sensing, contributing to Europe’s leading role in scientific innovation. By bringing together leading academic and industry groups with expertise in sparse representations, compressed sensing, machine learning and optimisation, and with an interest in applications such as hyperspectral imaging, audio signal processing and video analytics, this project will create an interdisciplinary, trans-national and inter-sectorial training network to enhance mobility and training of researchers in this area. SpaRTaN is funded under the FP7-PEOPLE-2013-ITN call and is part of the Marie Curie Actions — Initial Training Networks (ITN) funding scheme: Project number 607290.

**ITN Macsenet**

Title: Machine Sensing Training Network

Type: H2020

Instrument: Initial Training Network

Duration: January 2015 - January 2019

Coordinator: Mark Plumbley (University of Surrey)

Inria contact: Francis Bach

Abstract: The aim of this Innovative Training Network is to train a new generation of creative, entrepreneurial and innovative early stage researchers (ESRs) in the research area of measurement and estimation of signals using knowledge or data about the underlying structure. We will develop new robust and efficient Machine Sensing theory and algorithms, together with methods for a wide range of signals, including: advanced brain imaging; inverse imaging problems; audio and music signals; and non-traditional signals such as signals on graphs. We will apply these methods to real-world problems, through work with non-academic partners, and disseminate the results of this research to a wide range of academic and non-academic audiences, including through publications, data, software and public engagement events. MacSeNet is funded under the H2020-MSCA-ITN-2014 call and is part of the Marie Skłodowska-Curie Actions — Innovative Training Networks (ITN) funding scheme.

**ERC Sequoia**

Title: Robust algorithms for learning from modern data

Program: H2020

Type: ERC

Duration: 2017-2022

Coordinator: Inria

Inria contact: Francis Bach

Abstract: Machine learning is needed and used everywhere, from science to industry, with a growing impact on many disciplines. While first successes were due at least in part to simple supervised learning algorithms used primarily as black boxes on medium-scale problems, modern data pose new challenges. Scalability is an important issue of course: with large amounts of data, many current problems far exceed the capabilities of existing algorithms despite sophisticated computing architectures. But beyond this, the core classical model of supervised machine learning, with the usual assumptions of independent and identically distributed data, or well-defined features, outputs and loss functions, has reached its theoretical and practical limits. Given this new setting, existing optimization-based algorithms are not adapted. The main objective of this project is to push the frontiers of supervised machine learning, in terms of (a) scalability to data with massive numbers of observations, features, and tasks, (b) adaptability to modern computing environments, in particular for parallel and distributed processing, (c) provable adaptivity and robustness to problem and hardware specifications, and (d) robustness to non-convexities inherent in machine learning problems. To achieve the expected breakthroughs, we will design a novel generation of learning algorithms amenable to a tight convergence analysis with realistic assumptions and efficient implementations. They will help transition machine learning algorithms towards the same widespread robust use as numerical linear algebra libraries. Outcomes of the research described in this proposal will include algorithms that come with strong convergence guarantees and are well-tested on real-life benchmarks coming from computer vision, bioinformatics, audio processing and natural language processing.
For both distributed and non-distributed settings, we will release open-source software, adapted to widely available computing platforms.

Title: Learning from Big Data: First-Order methods for Kernels and Submodular functions

International Partner (Institution - Laboratory - Researcher):

IISc Bangalore (India) - Computer Science Department - Chiranjib Bhattacharyya

Start year: 2016

See also: mllab.

Recent advances in sensor technologies have resulted in large amounts of data being generated in a wide array of scientific disciplines. Deriving models from such large datasets, often known as “Big Data”, is one of the important challenges facing many engineering and scientific disciplines. In this proposal we investigate the problem of learning supervised models from Big Data, which has immediate applications in Computational Biology, Computer vision, Natural language processing, Web, E-commerce, etc., where specific structure is often present and hard to take into account with current algorithms. Our focus will be on the algorithmic aspects. Often supervised learning problems can be cast as convex programs. The goal of this proposal will be to derive first-order methods which can be effective for solving such convex programs arising in the Big-Data setting. Keeping this broad goal in mind we investigate two foundational problems which are not well addressed in existing literature. The first problem investigates Stochastic Gradient Descent Algorithms in the context of First-order methods for designing algorithms for Kernel based prediction functions on Large Datasets. The second problem involves solving discrete optimization problems arising in Submodular formulations in Machine Learning, for which first-order methods have not reached the level of speed required for practical applications (notably in computer vision).

Vijaya Bollapragada from Northwestern University, Chicago, IL, United States, Apr - Jul 2018.

Aaron Defazio from Facebook Research NY, New York, United States, Feb 2018.

Gauthier Gidel from University of Montreal - MILA, Montreal, Canada, Jan 2018.

Sharan Vaswani from University of British Columbia, Vancouver, Canada, Apr - Jul 2018

Simon Lacoste-Julien from University of Montreal - MILA, Montreal, Canada, Aug 2018.

F. Bach: General Chair of ICML 2018 (Stockholm)

Adrien Taylor, Session Organizer: *Computer-assisted analyses of optimization algorithms I & II*, International Symposium on Mathematical Programming, July 2018.

F. Bach: Co-organization of the workshop “Horizon Maths 2018 : Intelligence Artificielle”, November 23, 2018

F. Bach: Program Chair of the Journées de Statistiques (Saclay)

Conference on Learning Theory (COLT 2018): Pierre Gaillard, Alessandro Rudi

Symposium on Discrete Algorithms (SODA 2019): Adrien Taylor

Neural Information Processing Systems (NIPS 2018): Pierre Gaillard, Alessandro Rudi

Conference on Learning Theory (COLT 2018): Pierre Gaillard, Alessandro Rudi, Adrien Taylor

Symposium on Discrete Algorithms (SODA 2019): Adrien Taylor

International Conference on Machine Learning (ICML 2018): Pierre Gaillard, Alessandro Rudi

F. Bach: Journal of Machine Learning Research, co-editor-in-chief

F. Bach: Information and Inference, Associate Editor.

F. Bach: Electronic Journal of Statistics, Associate Editor.

F. Bach: Mathematical Programming, Associate Editor.

F. Bach: Foundations of Computational Mathematics, Associate Editor.

A. d’Aspremont: SIAM Journal on Optimization, Associate Editor.

A. d’Aspremont: SIAM Journal on the Mathematics of Data Science, Associate Editor.

A. d’Aspremont: Mathematical Programming, Associate Editor.

SIAM Journal on Optimization: Adrien Taylor

Mathematical Programming: Adrien Taylor

Journal of Optimization Theory and Algorithms: Adrien Taylor

Journal of Machine Learning Research: Pierre Gaillard, Alessandro Rudi

Applied Computational Harmonic Analysis: Alessandro Rudi

F. Bach, Trends in Optimization Seminar, University of Washington, November 2018.

Pierre Gaillard. *Distributed averaging of observations in a graph: the gossip problem*. MNL Conference, Paris, November 2018.

Adrien Taylor, *Analysis and design of first-order methods via semidefinite programming*, Séminaire Parisien d'Optimisation (SPO), Paris (France), November 2018.

F. Bach, Frontier Research and Artificial Intelligence, European Research Council, Brussels, October 2018.

F. Bach, IDSS Distinguished Speaker Seminar, MIT, October 2018.

F. Bach, Mathematical Institute Colloquium, Oxford, October 2018.

Adrien Taylor, *Convex Interpolation and Performance Estimation of First-order Methods for Convex Optimization*, IBM/FNRS innovation award, Brussels (Belgium), October 2018.

F. Bach, Workshop on Structural Inference in High-Dimensional Models, Moscow, September 2018.

F. Bach, International Symposium on Mathematical Programming (ISMP), Bordeaux, plenary talk, July 2018.

Alexandre d'Aspremont, *Sharpness, Restart and Compressed Sensing Performance*, ISMP 2018, Bordeaux, July 2018.

Alessandro Rudi, *FALKON: An optimal method for large scale learning with statistical guarantees*, ISMP 2018, Bordeaux, July 2018.

Adrien Taylor, *Computer-assisted Lyapunov-based worst-case analyses of first-order methods*, International Symposium on Mathematical Programming, Bordeaux (France), July 2018.

F. Bach, SIAM Conference on Imaging Science, Bologna, Italy, invited talk, June 2018.

Pierre Gaillard. *Online prediction of arbitrary time-series with application to electricity consumption*. Conference on nonstationarity. Cergy-Pontoise University. June 2018.

Adrien Taylor, *Convex Interpolation and Performance Estimation of First-order Methods for Convex Optimization*, International Symposium on Mathematical Programming: Tucker prize finalist, Bordeaux (France), July 2018.

Alexandre d'Aspremont, *An approximate Shapley-Folkman Theorem*, Isaac Newton Institute, Cambridge, June 2018.

F. Bach, Workshop on Future challenges in statistical scalability, Isaac Newton Institute, Cambridge, UK, June 2018.

Adrien Taylor, *Automated design of first-order optimization methods*, Operations Research Seminar, UCLouvain, Louvain-la-Neuve (Belgium), May 2018.

Adrien Taylor, *Automated design of first-order optimization methods*, LCCC Control Seminar, Lund University, Lund (Sweden), May 2018.

Pierre Gaillard. *Distributed learning with orthogonal polynomials*. Inria DGA meetup. May 2018.

F. Bach, Workshop on Optimisation and Machine Learning in Economics, London, March 2018.

Pierre Gaillard. *An overview of Artificial Intelligence*. Hackathon. PSL University. March 2018.

Alexandre d'Aspremont, *Regularized Nonlinear Acceleration*, US and Mexico Workshop on Optimization and its Applications, Jan 2018.

Alessandro Rudi, *Learning with Random Features*, Isaac Newton Institute, Cambridge, Jan 2018.

Pierre Gaillard. *Online nonparametric regression with adversarial data.* Smile seminar. Paris. Jan 2018.

F. Bach (together with N. Chopin), *Graphical models*, 30h, Master M2 (MVA), ENS Cachan, France.

F. Bach, *Optimisation et apprentissage statistique*, 20h, Master M2 (Mathématiques de l'aléatoire), Université Paris-Sud, France.

Alexandre d'Aspremont, *Optimisation Combinatoire et Convexe*, with Zhentao Li (2015-present), 30h lectures, Master M1, ENS Paris.

Alexandre d'Aspremont, *Optimisation convexe: modélisation, algorithmes et applications*, 21h lectures (2011-present), Master M2 MVA, ENS PS.

F. Bach and P. Gaillard, *Apprentissage statistique*, 35h, Master M1, Ecole Normale Supérieure, France.

P. Gaillard (together with V. Perchet), *Prediction of individual sequences*, 21h, Master M2 MVA, ENS Cachan, France.

Grégoire Mialon, *Python for Machine Learning*, 21h, Master M2 MASH, Dauphine-ENS-PSL, Paris.

Anaël Bonneton, PhD defended in July 2018, co-advised by Francis Bach, hosted at the Agence nationale de la sécurité des systèmes d’information (ANSSI).

Damien Scieur, PhD defended in September 2018, *Sur l'accélération des méthodes d’optimisation*, supervised by Alexandre d'Aspremont and Francis Bach.

Jean-Baptiste Alayrac, PhD defended in September 2018, *Structured Learning from Videos and Language*, supervised by Simon Lacoste-Julien, Josef Sivic and Ivan Laptev.

Antoine Recanati, PhD defended in November 2018, *Application du problème de sériation au séquençage de l’ADN et autres relaxations convexes appliquées en bioinformatique*, supervised by Alexandre d'Aspremont.

Rémi Leblond, PhD defended in November 2018, *Asynchronous Optimization for Machine Learning*, supervised by Simon Lacoste-Julien.

Mathieu Barre, PhD in progress, *Méthodes d'extrapolation, au-delà de la convexité*, supervised by Alexandre d'Aspremont.

Grégoire Mialon, PhD in progress, *Algorithmes d'optimisation, méthodes de régularisation et architectures pour les réseaux de neurones profonds dans un contexte où les données labellisées sont rares*, supervised by Alexandre d'Aspremont.

Radu-Alexandru Dragomir, PhD in progress, *Non-Euclidean first-order methods*, supervised by Alexandre d'Aspremont and Jérôme Bolte.

Thomas Kerdreux, PhD in progress, *Optimisation and machine learning*, supervised by Alexandre d'Aspremont.

Margaux Brégère, PhD in progress started September 2017, supervised by Pierre Gaillard, Gilles Stoltz and Yannig Goude (EDF R&D).

Raphaël Berthier, PhD in progress started September 2017, supervised by Francis Bach and Pierre Gaillard.

Loucas Pillaud-Vivien, PhD in progress, supervised by Francis Bach and Alessandro Rudi.

Alex Nowak, PhD in progress, supervised by Francis Bach and Alessandro Rudi.

Ulysse Marteau Ferey, PhD in progress, supervised by Francis Bach and Alessandro Rudi.

Dmitry Babichev, PhD in progress, started in September 2015, co-advised by Francis Bach and Anatoli Juditsky (Univ. Grenoble).

Tatiana Shpakova, PhD in progress, started September 2015, advised by Francis Bach.

Alexandre d'Aspremont: Habilitation à diriger des recherches jury of Thomas Bruls, Genoscope, Université d’Evry.

Design and implementation of a demonstration for the permanent exhibit at Palais de la Découverte: “L’apprenti illustrateur” (J.-B. Alayrac, F. Bach)