Machine learning is a recent scientific domain, positioned between applied mathematics, statistics and computer science. Its goals are the optimization, control, and modeling of complex systems from examples. It applies to data from numerous engineering and scientific fields (e.g., vision, bioinformatics, neuroscience, audio processing, text processing, economics, finance, etc.), the ultimate goal being to derive general theories and algorithms allowing advances in each of these domains. Machine learning is characterized by the high quality and quantity of the exchanges between theory, algorithms and applications: interesting theoretical problems almost always emerge from applications, while theoretical analysis allows the understanding of why and when popular or successful algorithms do or do not work, and leads to the proposal of significant improvements.

Our academic positioning is exactly at the intersection between these three aspects—algorithms, theory and applications—and our main research goal is to make the link between theory and algorithms, and between algorithms and high-impact applications in various engineering and scientific fields, in particular computer vision, bioinformatics, audio processing, text processing and neuro-imaging.

Machine learning is now a vast field of research and the team focuses on the following aspects: supervised learning (kernel methods, calibration), unsupervised learning (matrix factorization, statistical tests), parsimony (structured sparsity, theory and algorithms), and optimization (convex optimization, bandit learning). These four research axes are strongly interdependent, and the interplay between them is key to successful practical applications.

Visit of Prof. Michael Jordan (U.C. Berkeley) and of his research group.

Recruitment of two researchers: Alexandre d'Aspremont (DR2 CNRS) and Simon Lacoste-Julien (Inria Starting researcher position).

Start of a collaboration with Microsoft Research (within the joint MSR/Inria lab).

This part of our research focuses on methods where, given a set of examples of input/output pairs, the goal is to predict the output for a new input, with research on kernel methods, calibration methods, and multi-task learning.

We focus here on methods where no output is given and the goal is to find structure of certain known types (e.g., discrete or low-dimensional) in the data, with a focus on matrix factorization, statistical tests, dimension reduction, and semi-supervised learning.

The concept of parsimony is central to many areas of science. In the context of statistical machine learning, this takes the form of variable or feature selection. The team focuses primarily on structured sparsity, with theoretical and algorithmic contributions (this is the main topic of the ERC starting investigator grant awarded to F. Bach).

Optimization in all its forms is central to machine learning, as many of its theoretical frameworks are based at least in part on empirical risk minimization. The team focuses primarily on convex and bandit optimization, with a particular focus on large-scale optimization.

Machine learning research can be conducted from two main perspectives: the first one, which has been dominant in the last 30 years, is to design learning algorithms and theories which are as generic as possible, the goal being to make as few assumptions as possible regarding the problems to be solved and to let data speak for themselves. This has led to many interesting methodological developments and successful applications. However, we believe that this strategy has reached its limit for many application domains, such as computer vision, bioinformatics, neuro-imaging, text and audio processing, which leads to the second perspective our team is built on: Research in machine learning theory and algorithms should be driven by interdisciplinary collaborations, so that specific prior knowledge may be properly introduced into the learning process, in particular with the following fields:

Computer vision: object recognition, object detection, image segmentation, image/video processing, computational photography. In collaboration with the Willow project-team.

Bioinformatics: cancer diagnosis, protein function prediction, virtual screening. In collaboration with Institut Curie.

Text processing: document collection modeling, language models.

Audio processing: source separation, speech/music processing. In collaboration with Telecom Paristech.

Neuro-imaging: brain-computer interface (fMRI, EEG, MEG). In collaboration with the Parietal project-team.

This year, our research has focused on new application domains within natural language processing (NLP), with our first two publications in leading NLP conferences. We have worked on large-scale semantic role labelling (E. Grave, F. Bach, G. Obozinski), where we use syntactic dependency trees and representations learned from large corpora (e.g., 14.7 million sentences, 310 million tokens). We also extended our original work on structured sparsity to language models (F. Bach, A. Nelakanti, in collaboration with Xerox), in order to predict a word given *all* previous words, with a potentially infinite feature space organized with structured regularization.

SPAMS (SPArse Modeling Software) is an optimization toolbox for solving various sparse estimation problems: dictionary learning and matrix factorization, sparse decomposition problems, and structured sparse decomposition problems. It is developed by Julien Mairal (former Willow PhD student, co-advised by F. Bach and J. Ponce), in collaboration with Francis Bach (Inria), Jean Ponce (Ecole Normale Supérieure), Guillermo Sapiro (University of Minnesota), Rodolphe Jenatton (Inria) and Guillaume Obozinski (Inria). It is coded in C++ with a Matlab interface. This year, interfaces for R and Python have been developed by Jean-Paul Chieze (Inria engineer). The software currently records 650 downloads and between 1500 and 2000 page visits per month. See http://

BCFWstruct is a Matlab implementation of the Block-Coordinate Frank-Wolfe solver for Structural SVMs. See the ICML 2013 paper of the same name.

Participants outside of Sierra: Martin Jaggi (Centre de Mathématiques Appliquées, Ecole Polytechnique); Patrick Pletscher (Machine Learning Laboratory, ETH Zurich)

SAG: Minimizing Finite Sums with the Stochastic Average Gradient.

The SAG code contains C implementations (via Matlab mex files) of the stochastic average gradient (SAG) method detailed below, as well as several related methods, for the problem of L2-regularized logistic regression with a finite training set.

The specific methods available in the package are:

- SGD: the stochastic gradient method with (user-supplied) step sizes, (optional) projection step, and (optional) (weighted) averaging.
- ASGD: a variant of the above that supports fewer features, but efficiently implements uniform averaging on sparse data sets.
- PCD: a basic primal coordinate descent method with step sizes set according to the (user-supplied) Lipschitz constants.
- DCA: a dual coordinate ascent method with a high-accuracy numerical line search.
- SAG: the stochastic average gradient method with a (user-supplied) constant step size.
- SAGlineSearch: the stochastic average gradient method with the line search described in the paper.
- SAG-LipschitzLS: the stochastic average gradient method with the line search and adaptive non-uniform sampling strategy described in the paper.

We showed that HRF estimation improves the sensitivity of fMRI encoding and decoding models, and we proposed a new approach for the estimation of hemodynamic response functions (HRF) from fMRI data. This software is an implementation of the methods described in the paper.

Collaboration with: Martin Jaggi (Centre de Mathématiques Appliquées, Ecole Polytechnique), Patrick Pletscher (Machine Learning Laboratory, ETH Zurich).

The primary contribution of this work is the analysis of a new algorithm that we call the *stochastic average gradient* (SAG) method, a randomized variant of the incremental aggregated gradient (IAG) method. The SAG method has the low iteration cost of SG methods, but achieves the convergence rates stated above for the FG method. The SAG iterations take the form
$$x^{k+1} = x^k - \frac{\alpha_k}{n} \sum_{i=1}^n y_i^k, \qquad y_i^k = \begin{cases} f_i'(x^k) & \text{if } i = i_k, \\ y_i^{k-1} & \text{otherwise,} \end{cases}$$
where at each iteration a random index $i_k$ is selected and only the corresponding gradient is recomputed. We show that *the SAG iterations have an $O(1/k)$ convergence rate for convex objectives and a linear convergence rate for strongly-convex objectives*, like the FG method. That is, by keeping a memory of the most recent gradient value computed for each index, the SAG method combines the convergence rates of FG methods with the iteration cost of SG methods.

In this work we consider optimizing the sum of a finite number of smooth convex functions. We write our problem as
$$\min_{x \in \mathbb{R}^p} \; g(x) := \frac{1}{n} \sum_{i=1}^n f_i(x),$$
where we assume that each $f_i$ is convex and each gradient $f_i'$ is Lipschitz-continuous with constant $L$.

*Deterministic gradient* methods for problems of this form use the iteration
$$x^{k+1} = x^k - \frac{\alpha_k}{n} \sum_{i=1}^n f_i'(x^k),$$
for a sequence of step sizes $\alpha_k$, while *stochastic gradient* methods use the iteration
$$x^{k+1} = x^k - \alpha_k f_{i_k}'(x^k),$$
for an individual data sample $i_k$ selected uniformly at random from $\{1, \dots, n\}$.

The stochastic gradient method is appealing because the cost of its iterations is *independent of* $n$, but to guarantee convergence its step sizes must decrease, which leads to a sublinear convergence rate. In contrast, the deterministic gradient method with a *constant* step size has a smaller error when $g$ is *strongly* convex, meaning that
$$g(x) \geq g(y) + g'(y)^\top (x - y) + \frac{\mu}{2} \|x - y\|^2,$$
for all $x$ and $y$ and some $\mu > 0$; strong convexity yields a *linear* convergence rate. In particular, the deterministic method satisfies
$$g(x^k) - g(x^\ast) = O(\rho^k),$$
for some $\rho < 1$. We show that if the individual gradients $f_i'$ are Lipschitz-continuous, the SAG iterations achieve the $O(1/k)$ rate of deterministic methods, and a linear rate when $g$ is strongly convex.
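As a toy illustration of the SAG update above, the following sketch runs the method on a handful of one-dimensional least-squares terms. The function name, data and step size are illustrative choices, not the tuned settings of the paper or the released package.

```python
import random

# Minimal SAG sketch on n one-dimensional least-squares terms
# f_i(x) = 0.5 * (a[i] * x - b[i])**2, so f_i'(x) = a[i] * (a[i] * x - b[i]).

def sag(a, b, step=0.1, iters=2000, seed=0):
    n = len(a)
    rng = random.Random(seed)
    x = 0.0
    grads = [0.0] * n         # memory of the last gradient seen for each i
    grad_sum = 0.0            # running sum of the stored gradients
    for _ in range(iters):
        i = rng.randrange(n)  # one random index per iteration
        g = a[i] * (a[i] * x - b[i])
        grad_sum += g - grads[i]   # update the aggregate in O(1)
        grads[i] = g
        x -= step * grad_sum / n   # step uses the average stored gradient
    return x

a = [1.0, 2.0, 3.0]
b = [1.0, 2.0, 3.0]
# Here the minimizer of (1/n) sum_i f_i is x* = sum(a_i b_i) / sum(a_i^2) = 1.
print(sag(a, b))  # close to 1.0
```

Note how each iteration touches a single gradient, as in SG methods, while the step uses an average over all of them, as in FG methods.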

Large-scale machine learning problems are becoming ubiquitous in many areas of science and engineering. Faced with large amounts of data, practitioners typically prefer algorithms that process each observation only once, or a few times. Stochastic approximation algorithms such as stochastic gradient descent (SGD) and its variants, although introduced more than sixty years ago, still remain the most widely used and studied methods in this context. In this work, we consider the stochastic approximation problem where a convex function has to be minimized, given only the knowledge of unbiased estimates of its gradients at certain points, a framework which includes machine learning methods based on the minimization of the empirical risk. We focus on problems without strong convexity, for which all previously known algorithms achieve a convergence rate for function values of $O(1/\sqrt{n})$ after $n$ iterations. We show that, for least-squares regression, *averaged* stochastic gradient descent *with constant step-size* achieves the faster rate of $O(1/n)$. For logistic regression, this rate is achieved by a simple novel stochastic gradient algorithm that (a) constructs successive local quadratic approximations of the loss functions, while (b) preserving the same running-time complexity as stochastic gradient descent. For these algorithms, we provide a non-asymptotic analysis of the generalization error (in expectation, and also in high probability for least-squares), and run extensive experiments showing that they often outperform existing approaches.
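The constant-step-size averaged-SGD scheme discussed above can be sketched on a synthetic streaming least-squares problem. The data, step size and function names below are illustrative assumptions, not the experimental setup of the paper.

```python
import random

# Averaged SGD with a *constant* step size on streaming least-squares data:
# each observation is (a_t, b_t) with b_t = a_t * x_true + noise.

def averaged_sgd(stream, step=0.05):
    x = 0.0
    x_bar = 0.0
    for t, (a, b) in enumerate(stream, start=1):
        g = a * (a * x - b)       # stochastic gradient of 0.5*(a*x - b)^2
        x -= step * g             # constant step size, no decay
        x_bar += (x - x_bar) / t  # online (Polyak-Ruppert) averaging
    return x_bar

rng = random.Random(0)
x_true = 2.0
inputs = [rng.gauss(0, 1) for _ in range(20000)]
stream = [(a, a * x_true + 0.1 * rng.gauss(0, 1)) for a in inputs]
print(averaged_sgd(stream))  # close to x_true
```

The un-averaged iterate oscillates in a neighborhood of the optimum because the step size never decays; it is the averaging that produces the fast-converging estimate.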

Large, streaming data sets are increasingly the norm in science and technology. Simple descriptive statistics can often be readily computed with a constant number of operations for each data point in the streaming setting, without the need to revisit past data or have advance knowledge of future data. But these time and memory restrictions are not generally available for the complex, hierarchical models that practitioners often have in mind when they collect large data sets. Significant progress on scalable learning procedures has been made in recent years. But the underlying models remain simple, and the inferential framework is generally non-Bayesian. The advantages of the Bayesian paradigm (e.g., hierarchical modeling, coherent treatment of uncertainty) currently seem out of reach in the Big Data setting.

An exception to this statement is provided by Hoffman et al. (2010),
who have shown that a class of approximation methods known as
*variational Bayes* (VB) can be usefully deployed for large-scale
data sets. They have applied their approach, referred to as
*stochastic variational inference* (SVI), to the domain of topic
modeling of document collections, an area with a major need for
scalable inference algorithms. VB traditionally uses the variational
lower bound on the marginal likelihood as an objective function, and
the idea of SVI is to apply a variant of stochastic gradient descent
to this objective. Notably, this objective is based on the conceptual
existence of a full data set involving D data points (i.e., documents
in the topic model setting), for a fixed value of D. Although the
stochastic gradient is computed for a single, small subset of data
points (documents) at a time, the posterior being targeted is a posterior
for D data points. This value of D must be specified in advance and is
used by the algorithm at each step. Posteriors for D' data points, for
D' not equal to D, are not obtained as part of the analysis.

We view this lack of a link between the number of documents that have been processed thus far and the posterior that is being targeted as undesirable in many settings involving streaming data. In this project we aim at an approximate Bayesian inference algorithm that is scalable like SVI but is also truly a streaming procedure, in that it yields an approximate posterior for each processed collection of D' data points, and not just a pre-specified "final" number of data points D. To that end, we return to the classical perspective of Bayesian updating, where the recursive application of Bayes' theorem provides a sequence of posteriors, not a sequence of approximations to a fixed posterior. To this classical recursive perspective we bring the VB framework; our updates need not be exact Bayesian updates but rather may be approximations such as VB.
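As a toy illustration of this recursive perspective, the sketch below performs streaming Bayesian updates in a conjugate Beta-Bernoulli model. Here the updates are exact conjugate updates; in streaming VB they would be variational approximations. All names and data are illustrative.

```python
# Streaming Bayesian updating: the posterior after each mini-batch is a
# Beta(alpha, beta) distribution over a coin bias, and no total number of
# data points D is fixed in advance.

def update(posterior, minibatch):
    alpha, beta = posterior
    heads = sum(minibatch)
    return (alpha + heads, beta + len(minibatch) - heads)

posterior = (1.0, 1.0)                       # Beta(1, 1) prior
batches = [[1, 1, 0], [1, 0, 1, 1], [0, 1]]  # data arrives in mini-batches
for batch in batches:
    posterior = update(posterior, batch)     # valid posterior at every step
    alpha, beta = posterior
    print(alpha / (alpha + beta))            # current posterior mean
```

After each mini-batch the pair `(alpha, beta)` is a legitimate posterior for exactly the data seen so far, which is the streaming property discussed above.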

Although the empirical success of SVI is the main motivation for our work, we are also motivated by recent developments in computer architectures, which permit distributed and asynchronous computations in addition to streaming computations. A streaming VB algorithm naturally lends itself to distributed and asynchronous implementations.

Seriation seeks to reconstruct a linear order between variables using unsorted similarity information. It has direct applications in archeology and shotgun gene sequencing, for example. In this work we prove the equivalence between seriation and the combinatorial 2-sum problem (a quadratic minimization problem over permutations) over a class of similarity matrices. The seriation problem can be solved exactly by a spectral algorithm in the noiseless case, and we produce a convex relaxation of the 2-sum problem to improve the robustness of solutions in a noisy setting. This relaxation also allows us to impose additional structural constraints on the solution, in order to solve semi-supervised seriation problems. We performed numerical experiments on archeological data, Markov chains and gene sequences.
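In the noiseless case, the spectral algorithm mentioned above amounts to sorting items by the Fiedler vector of the Laplacian of the similarity matrix. The sketch below illustrates this on a small synthetic matrix; the data and function names are illustrative, not the paper's experiments.

```python
import numpy as np

# Noiseless spectral seriation sketch: given a similarity matrix whose
# hidden order makes it banded (Robinson), sort items by the Fiedler
# vector of the graph Laplacian to recover a serial order (up to reversal).

def spectral_seriation(S):
    L = np.diag(S.sum(axis=1)) - S  # graph Laplacian
    vals, vecs = np.linalg.eigh(L)  # eigh returns ascending eigenvalues
    fiedler = vecs[:, 1]            # eigenvector of 2nd-smallest eigenvalue
    return np.argsort(fiedler)

# Similarity decaying with distance in the true order 0..4, then shuffled.
n = 5
true = np.arange(n)
S_ordered = np.exp(-np.abs(true[:, None] - true[None, :]).astype(float))
perm = np.array([2, 0, 4, 1, 3])
S = S_ordered[np.ix_(perm, perm)]
order = spectral_seriation(S)
print(perm[order])  # the sequence 0..4 or its reversal
```

The sign of the Fiedler vector is arbitrary, so the recovered order is only defined up to reversal, as in the theory.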

Phase retrieval seeks to reconstruct a complex signal, given a number of observations on the *magnitude* of linear measurements, i.e. solve
$$\text{find } x \quad \text{such that} \quad |Ax| = b,$$
in the variable $x \in \mathbb{C}^p$, where $A \in \mathbb{C}^{n \times p}$ and $b \in \mathbb{R}^n$.

In segmentation models, the number of change-points is typically chosen using a penalized cost function. In this work we propose to learn the penalty and its constants from databases of signals with weak change-point annotations. We propose a convex relaxation for the resulting interval regression problem, and solve it using accelerated proximal gradient methods. We show that this method achieves state-of-the-art change-point detection in a database of annotated DNA copy number profiles from neuroblastoma tumors.
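For background, change-point selection with a penalized cost can be sketched as a simple dynamic program over segment boundaries. Here the penalty is fixed by hand, whereas the work above learns it from annotations; the names and data are illustrative.

```python
# Penalized change-point sketch: dynamic programming minimizing the sum of
# per-segment squared residuals plus `penalty` per segment.

def segment(y, penalty):
    n = len(y)
    prefix, prefix2 = [0.0], [0.0]
    for v in y:
        prefix.append(prefix[-1] + v)
        prefix2.append(prefix2[-1] + v * v)

    def sse(i, j):  # squared error of fitting a mean on y[i:j]
        s, s2, m = prefix[j] - prefix[i], prefix2[j] - prefix2[i], j - i
        return s2 - s * s / m

    best = [0.0] + [float("inf")] * n
    last = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):
            c = best[i] + sse(i, j) + penalty
            if c < best[j]:
                best[j], last[j] = c, i
    cps = []
    j = n
    while j > 0:            # backtrack through optimal segment starts
        cps.append(last[j])
        j = last[j]
    return sorted(cps)[1:]  # drop the leading 0

y = [0, 0, 0, 0, 5, 5, 5, 5]
print(segment(y, penalty=1.0))  # → [4]
```

A larger penalty yields fewer change-points; choosing it well is precisely the model selection problem addressed above.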

Optimizing submodular functions has been an active area of research, with applications in graph-cut-based image segmentation, sensor placement, and document summarization.
A set function $F: 2^V \to \mathbb{R}$, defined on subsets of a ground set $V$, is *submodular* if and only if
$$F(A) + F(B) \geq F(A \cup B) + F(A \cap B) \quad \text{for all } A, B \subseteq V.$$

Submodular functions form an interesting class of discrete functions because minimizing a submodular function can be done in polynomial time, while maximization, although NP-hard, admits constant-factor approximation algorithms. In this paper, our ultimate goal is to provide the first (to the best of our knowledge) generic convex relaxation of submodular function maximization, with a hierarchy of complexities related to known combinatorial hierarchies such as the Sherali-Adams hierarchy. Beyond the graphical model tools that we are going to develop, having convex relaxations may be interesting for several reasons: (1) they can lead to better solutions, (2) they provide online bounds that may be used within branch-and-bound optimization and (3) they ease the use of such combinatorial optimization problems within a structured prediction framework.

We make the following contributions:

For any directed acyclic graph

We propose an algorithm to maximize submodular functions by maximizing the bound

We propose extensions to constrained problems and maximizing the difference of submodular functions, which include all possible set functions.

We illustrate our results on small-scale experiments.
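For background on the maximization side, the classical greedy algorithm achieving the constant-factor $(1 - 1/e)$ approximation for *monotone* submodular functions can be sketched on a toy coverage function. The data and names below are illustrative, and this is the standard greedy baseline, not the convex relaxation proposed above.

```python
# Greedy maximization of a monotone submodular coverage function
# F(S) = |union of the sets indexed by S| under a cardinality constraint.

sets = {0: {1, 2, 3}, 1: {3, 4}, 2: {4, 5, 6}, 3: {1, 6}}

def F(S):
    covered = set()
    for i in S:
        covered |= sets[i]
    return len(covered)

def greedy(k):
    S = []
    for _ in range(k):
        gains = {i: F(S + [i]) - F(S) for i in sets if i not in S}
        S.append(max(gains, key=gains.get))  # pick the largest marginal gain
    return S

print(greedy(2), F(greedy(2)))  # → [0, 2] 6
```

Diminishing marginal gains (submodularity) are exactly what makes this myopic strategy provably near-optimal.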

Recently, it has become evident that submodularity naturally captures widely occurring concepts in machine learning, signal processing and computer vision. Consequently, there is a need for efficient optimization procedures for submodular functions, especially for minimization problems. While general submodular minimization is challenging, we propose in this work a new method that exploits existing decomposability of submodular functions. In contrast to previous approaches, our method is neither approximate, nor impractical, nor does it need any cumbersome parameter tuning. Moreover, it is easy to implement and parallelize. A key component of our method is a formulation of the discrete submodular minimization problem as a continuous best-approximation problem that is solved through a sequence of reflections, and its solution can be easily thresholded to obtain an optimal discrete solution. This method solves *both* the continuous and discrete formulations of the problem, and therefore has applications in learning, inference, and reconstruction.
In our experiments, we illustrate the benefits of our method on two image segmentation tasks.

Graphical models provide a versatile set of tools for probabilistic modeling of large collections of interdependent variables. They are defined by graphs that encode the conditional independences among the random variables, together with potential functions or conditional probability distributions that encode the specific local interactions leading to globally well-defined probability distributions.

In many domains such as computer vision, natural language processing or bioinformatics, the structure of the graph follows naturally from the constraints of the problem at hand. In other situations, it might be desirable to estimate this structure from a set of observations. Doing so allows (a) a statistical fit of rich probability distributions that can be considered for further use, and (b) the discovery of structural relationships between different variables. In the former case, distributions with tractable inference are often desirable, i.e., distributions for which the run-time complexity of inference does not scale exponentially in the number of variables in the model. The simplest constraint ensuring tractability is to impose tree-structured graphs. However, these distributions are not rich enough, and following earlier work we consider models whose *treewidth* is bounded, not simply by one (i.e., trees), but by a small constant.
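For the treewidth-one case, the classical Chow-Liu algorithm learns the best tree-structured model by building a maximum-weight spanning tree on pairwise empirical mutual informations. The sketch below illustrates it on toy binary data; names and data are illustrative.

```python
import math
from itertools import combinations

# Chow-Liu sketch: treewidth-1 structure learning from joint samples of
# binary variables, via a maximum-weight spanning tree (Kruskal) whose
# edge weights are empirical pairwise mutual informations.

def mutual_info(data, i, j):
    n = len(data)
    pij, pi, pj = {}, {}, {}
    for row in data:
        pij[(row[i], row[j])] = pij.get((row[i], row[j]), 0) + 1
        pi[row[i]] = pi.get(row[i], 0) + 1
        pj[row[j]] = pj.get(row[j], 0) + 1
    mi = 0.0
    for (a, b), c in pij.items():
        mi += (c / n) * math.log((c / n) / ((pi[a] / n) * (pj[b] / n)))
    return mi

def chow_liu(data, d):
    edges = sorted(((mutual_info(data, i, j), i, j)
                    for i, j in combinations(range(d), 2)), reverse=True)
    parent = list(range(d))
    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x
    tree = []
    for w, i, j in edges:  # greedily add the heaviest non-cycle edges
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.append((i, j))
    return tree

# X0 is coupled to X1, and X1 to X2; expect the chain edges (0,1) and (1,2).
data = [(0,0,0), (1,1,1), (0,0,1), (1,1,0), (0,1,1), (1,0,0), (0,0,0), (1,1,1)]
print(sorted(chow_liu(data, 3)))  # → [(0, 1), (1, 2)]
```

The bounded-treewidth setting studied in this work generalizes this spanning-tree picture through the graphic and hypergraphic matroids.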

Beyond the possibility of fitting tractable distributions (for which probabilistic inference has linear complexity in the number of variables), learning bounded-treewidth graphical models is key to designing approximate inference algorithms for graphs with higher treewidth. Indeed, approximating general distributions by tractable distributions is a common tool in variational inference. However, in practice, the complexity of variational distributions is often limited to trees (i.e., treewidth one).

We make the following contributions:

We provide a novel convex relaxation for learning bounded-treewidth decomposable graphical models from data in polynomial time. This is achieved by posing the problem as a combinatorial optimization problem, which is relaxed to a convex optimization problem that involves the graphic and hypergraphic matroids.

We show how a supergradient ascent method may be used to solve the dual optimization problem, using greedy algorithms as inner loops on the two matroids. Each iteration has a run-time complexity that is polynomial in the number of variables.

We compare our approach to state-of-the-art methods on synthetic datasets and classical benchmarks showing the gains of the novel convex approach.

Unsupervised partitioning problems are ubiquitous in machine learning and other data-oriented fields such as computer vision, bioinformatics or signal processing. They include (a) traditional *unsupervised clustering* problems, with the classical K-means algorithm, hierarchical linkage methods and spectral clustering, (b) *unsupervised image segmentation* problems where two neighboring pixels are encouraged to be in the same cluster, with mean-shift techniques or normalized cuts, and (c) *change-point detection* problems adapted to multivariate sequences (such as video) where segments are composed of contiguous elements, with typical window-based algorithms and various methods looking for a change in the mean of the features.

All the algorithms mentioned above rely on a specific distance (or more generally a similarity measure) on the space of configurations. A good metric is crucial to the performance of these partitioning algorithms and its choice is heavily problem-dependent. While the choice of such a metric was originally tackled manually (often by trial and error), recent work has considered learning such a metric directly from data. Without any supervision, the problem is ill-posed, and methods based on generative models may learn a metric or reduce dimensionality, but typically with no guarantee that they lead to better partitions. In this paper, we follow earlier work and consider the goal of learning a metric for potentially several partitioning problems sharing the same metric, assuming that several fully or partially labelled partitioned datasets are available during the learning phase. While such labelled datasets are typically expensive to produce, there are several scenarios where these datasets have already been built, often for evaluation purposes. These occur in video segmentation tasks, image segmentation tasks, as well as change-point detection tasks in bioinformatics.

We consider partitioning problems based explicitly or implicitly on the minimization of Euclidean distortions, which include K-means, spectral clustering and normalized cuts, and mean-based change-point detection. We make the following contributions:

We review and unify several partitioning algorithms, and cast them as the maximization of a linear function of a rescaled equivalence matrix, which can be solved by algorithms based on spectral relaxations or dynamic programming.

Given fully labelled datasets, we cast the metric learning problem as a large-margin structured prediction problem, with proper definition of regularizers, losses and efficient loss-augmented inference.

Given partially labelled datasets, we propose an algorithm, iterating between labeling the full datasets given a metric and learning a metric given the fully labelled datasets. We also consider extensions that allow changes in the full distribution of univariate time series (rather than changes only in the mean), with application to bioinformatics.

We provide experiments where we show how learning the metric may significantly improve the partitioning performance in synthetic examples, video segmentation and image segmentation problems.

Increasing the sample size is the most common way to improve the performance of statistical estimators. In some cases (for instance, in customer data analysis or molecule binding problems), having access to new data may be impossible, often due to experimental limitations. One way to circumvent those constraints is to use datasets from several related (and, hopefully, "similar") problems, as if they gave additional (in some sense) observations on the initial problem. The statistical methods using this heuristic are called "multi-task" techniques, as opposed to "single-task" techniques, where every problem is treated one at a time. In this paper, we study kernel ridge regression in a multi-task framework and try to understand when multi-task learning can improve over single-task learning.

The first trace of a multi-task estimator can be found in the work of Charles Stein. In that work, Stein showed that the usual maximum-likelihood estimator of the mean of a Gaussian vector (of dimension larger than 3, every dimension representing here a task) is not admissible, that is, there exists another estimator that has a lower risk for every parameter. He showed the existence of an estimator that uniformly attains a lower quadratic risk by shrinking the estimators along the different dimensions towards an arbitrary point. An explicit form of such an estimator was later given, yielding the famous James-Stein estimator. This phenomenon, now known as "Stein's paradox", was widely studied in the following years and the behaviour of this estimator was confirmed by empirical studies. This first example clearly shows the goals of the multi-task procedure: an advantage is gained by borrowing information from different tasks (here, by shrinking the estimators along the different dimensions towards a common point), the improvement being scored by the global (averaged) squared risk. Therefore, this procedure does not guarantee individual gains on every task, but a global improvement on the sum of those task-wise risks.
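The James-Stein shrinkage estimator mentioned above is simple to state; the sketch below applies the positive-part variant and checks by Monte-Carlo that shrinkage lowers the averaged quadratic risk. The mean vector and sample sizes are illustrative choices.

```python
import random

# Positive-part James-Stein estimator for a Gaussian observation y of
# dimension d >= 3 with identity covariance, shrinking towards zero.

def james_stein(y):
    d = len(y)
    norm2 = sum(v * v for v in y)
    factor = max(0.0, 1.0 - (d - 2) / norm2)
    return [factor * v for v in y]

# Monte-Carlo comparison of quadratic risks for a fixed mean vector.
rng = random.Random(0)
mean = [0.5, -0.3, 0.2, 0.1, -0.4]
mle_risk = js_risk = 0.0
trials = 5000
for _ in range(trials):
    y = [m + rng.gauss(0, 1) for m in mean]
    js = james_stein(y)
    mle_risk += sum((a - m) ** 2 for a, m in zip(y, mean)) / trials
    js_risk += sum((a - m) ** 2 for a, m in zip(js, mean)) / trials
print(mle_risk > js_risk)  # shrinkage lowers the averaged risk
```

As in the discussion above, the gain is on the *summed* risk over dimensions (tasks); no single coordinate is guaranteed to be estimated better.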

We consider an estimator satisfying an *oracle inequality*, that is, an estimator whose risk matches (up to constants) the best possible one, the *oracle risk*.
Thus, it suffices to compare the oracle risks of the multi-task procedure and of the single-task one to provide an answer to this question.

We study the oracle multi-task risk and compare it to the oracle single-task risk. We then find situations where the multi-task oracle provably has a lower risk than the single-task oracle. This allows us to better understand which situations favor the multi-task procedure and which do not. After having defined our model, we write down the risk of a general multi-task ridge estimator and see that it admits a convenient decomposition using two key elements: the mean of the tasks and the resulting variance. This decomposition allows us to optimize this risk and get a precise estimate of the oracle risk, in settings where the ridge estimator is known to be minimax optimal. We then explore several arrangements of the tasks that give the latter multi-task rates, study their single-task oracle risk and compare it to their respective multi-task rates. This allows us to distinguish several situations, depending on whether the multi-task oracle outperforms its single-task counterpart, underperforms it, or whether both behave similarly. We also show that, in the cases favorable to the multi-task oracle detailed above, a previously proposed estimator behaves accordingly and achieves a lower risk than the single-task oracle. We finally study settings where we can no longer explicitly study the oracle risk, by running simulations, and we show that the multi-task oracle retains the same virtues and disadvantages as before.

Kernel methods, such as the support vector machine or kernel ridge regression, are now widely used in many areas of science and engineering, such as computer vision or bioinformatics. However, kernel methods typically suffer from at least quadratic running-time complexity in the number of observations. In this work, we analyse low-rank approximations of the kernel matrix obtained by random column sampling, and show that the rank required to preserve statistical performance is governed by the *degrees of freedom* associated with the problem, a quantity which is classically used in the statistical analysis of such methods, and is often seen as the implicit number of parameters of non-parametric estimators. This result enables simple algorithms that have sub-quadratic running-time complexity, but provably exhibit the same *predictive performance* as existing algorithms, for any given problem instance, and not only in worst-case situations.
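One common family of such sub-quadratic algorithms relies on column sampling (Nyström-type approximations). The sketch below illustrates the idea for kernel ridge regression in a subset-of-regressors form; the kernel, sampling scheme and sizes are illustrative assumptions, not the precise scheme analysed in this work.

```python
import numpy as np

# Column-sampling sketch for kernel ridge regression: keep m << n landmark
# points and solve only an (m x m) linear system instead of an (n x n) one.

def rbf(X, Z, gamma=1.0):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def nystrom_krr_fit(X, y, landmarks, lam=1e-3):
    K_nm = rbf(X, landmarks)             # (n x m) sampled kernel columns
    K_mm = rbf(landmarks, landmarks)     # (m x m) landmark kernel
    # Subset-of-regressors normal equations, solved in the landmark space.
    return np.linalg.solve(K_nm.T @ K_nm + lam * K_mm, K_nm.T @ y)

def nystrom_krr_predict(Xq, landmarks, beta):
    return rbf(Xq, landmarks) @ beta

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
y = np.sin(3 * X[:, 0])
landmarks = X[::20]                      # m = 10 sampled landmark points
beta = nystrom_krr_fit(X, y, landmarks)
pred = nystrom_krr_predict(X, landmarks, beta)
print(float(np.mean((pred - y) ** 2)))   # small training error
```

The number of landmarks needed to keep this error small is, informally, the quantity that the degrees-of-freedom analysis above makes precise.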

This work proves that significant improvements in recovery of brain activation patterns can be made by estimating the form of the Hemodynamic Response Function instead of using a canonical form for this response.

Language models can be formalized as log-linear regression models where the input features represent previously observed contexts up to a certain length.

Language models are crucial parts of advanced natural language processing pipelines, such as speech recognition, machine translation, or information retrieval. When a sequence of symbols is observed, a language model predicts the probability of occurrence of the next symbol in the sequence. Models based on so-called back-off smoothing have shown good predictive power. In particular, Kneser-Ney (KN) smoothing and its variants are still achieving state-of-the-art results more than a decade after they were originally proposed. Smoothing methods are in fact clever heuristics that require tuning parameters in an ad-hoc fashion. Hence, more principled ways of learning language models have been proposed, based on maximum entropy, conditional random fields, or a Bayesian approach.

We focus on penalized maximum likelihood estimation in log-linear models.
In contrast to language models penalized with *unstructured* norms such as the $\ell_1$ or $\ell_2$ norms, we use *tree-structured* norms. Structured penalties have been successfully applied to various NLP tasks, including chunking and named entity recognition, but not to language modeling. Such penalties are particularly well-suited to this problem as they mimic the nested nature of word contexts. However, existing optimization techniques do not scale to large contexts.

We show that structured tree norms provide an efficient framework for language modeling.
Furthermore, we give the first memory- *and* time-efficient learning algorithm for generalized linear language models with such structured penalties.

Natural graphs, such as social networks, email graphs, or instant messaging patterns, have become pervasive through the internet. These graphs are massive, often containing hundreds of millions of nodes and billions of edges. While some theoretical models have been proposed to study such graphs, their analysis is still difficult due to the scale and nature of the data. We propose a framework for large-scale graph decomposition and inference. To resolve the scale, our framework is distributed so that the data are partitioned over a shared-nothing set of machines. We propose a novel factorization technique that relies on partitioning a graph so as to minimize the number of neighboring vertices rather than edges across partitions. Our decomposition is based on a streaming algorithm. It is network-aware, as it adapts to the network topology of the underlying computational hardware. We use local copies of the variables and an efficient asynchronous communication protocol to synchronize the replicated values in order to perform most of the computation without having to incur the cost of network communication. On a graph of 200 million vertices and 10 billion edges, derived from an email communication network, our algorithm retains convergence properties while allowing for almost linear scalability in the number of computers.

In this work we introduce a new framework for the evaluation of speech representations in zero-resource settings, which extends and complements previous work by Carlin, Jansen and Hermansky. In particular, we replace their Same/Different discrimination task by several Minimal-Pair ABX (MP-ABX) tasks. We explain the analytical advantages of this new framework and apply it to decompose the standard signal processing pipelines for computing PLP and MFC coefficients. This method enables us to confirm and quantify a variety of well-known and not-so-well-known results in a single framework.

Speech recognition technology crucially rests on adequate speech features for encoding input data. Several such features have been proposed and studied (MFCCs, PLPs, etc.), but they are often evaluated indirectly using complex tasks like phone classification or word identification. Such an evaluation technique suffers from several limitations. First, it requires a large enough annotated corpus in order to train the classifier/recognizer. Such a resource may not be available in all languages or dialects (the so-called "zero or limited resource" setting). Second, supervised classifiers may be too powerful and may compensate for potential defects of speech features (for instance noisy/unreliable channels); such defects remain problematic for unsupervised learning techniques. Finally, the particular statistical assumptions of the classifier (linear, Gaussian, etc.) may not be suited to specific speech features (for instance sparse neural codes as in Hermansky). It is therefore important to replace these complex evaluation schemes by simpler ones that tap more directly into the properties of the speech features.

We extend and complement the framework proposed by Carlin, Jansen and Hermansky for the evaluation of speech features in zero-resource settings. This framework uses a Same-Different word discrimination task that does not depend on phonetically labelled data, nor on training a classifier. It assumes a speech corpus segmented into words, and derives a word-by-word acoustic distance matrix computed by comparing every word with every other one using Dynamic Time Warping (DTW). Carlin et al. then compute an average precision score which is used to evaluate speech features (the higher the average precision, the better the features).
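The distance-matrix step can be sketched as follows, assuming scalar frames for readability (real acoustic features such as MFCCs are vectors, and both function names are our own):

```python
def dtw_distance(x, y):
    """Dynamic Time Warping distance between two sequences of frames
    (scalar frames here; real acoustic features are vectors)."""
    n, m = len(x), len(y)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])            # local frame distance
            cost[i][j] = d + min(cost[i - 1][j],       # skip a frame of x
                                 cost[i][j - 1],       # skip a frame of y
                                 cost[i - 1][j - 1])   # align both frames
    return cost[n][m]

def distance_matrix(words):
    """Word-by-word acoustic distances: compare every word with every
    other one using DTW, as in Carlin et al."""
    return [[dtw_distance(w1, w2) for w2 in words] for w1 in words]
```

DTW matters here because two utterances of the same word rarely have the same duration; the warping path lets frames be stretched or compressed before frame-wise distances are accumulated.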

We explore an extension of this technique through Minimal-Pair ABX tasks (MP-ABX tasks) tested on a phonetically balanced corpus. This improves the interpretability of the Carlin et al. evaluation results in three ways. First, the Same/Different task requires the computation of a ROC curve in order to derive average precision. In contrast, the ABX task is a discrimination task classically used in psychophysics, which allows for the direct computation of an error rate or a d' measure; these are easier to interpret than average precision and involve no assumption about ROC curves. Second, the Same/Different task compares *sets of words*, and as a result is influenced by the mix of similar versus distinct words, or short versus long words, in the corpus. The ABX task, in contrast, is computed on *word pairs*, and therefore makes it possible to carry out linguistically precise comparisons, as in word *minimal pairs*, i.e. words differing by only one phoneme. Variants of the task make it possible to study phoneme discrimination across talkers and/or phonetic contexts, as well as talker discrimination across phonemes. Because it is more controlled and provides a parameter- and model-free metric, the MP-ABX error rate also makes it possible to compare performance across databases or across languages. Third, we compute bootstrap-based estimates of the variability of our performance measures, which allows us to derive confidence intervals for the error rates, and tests of the significance of the difference between the error rates obtained with different representations.
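The ABX decision rule itself reduces to comparing two distances per triple; a toy sketch of the error-rate computation (our own simplified formulation, ignoring the talker/context controls and the d' measure):

```python
def abx_error_rate(dist, labels, items):
    """Minimal-Pair ABX: for every triple (A, B, X) where A and X share a
    label and B carries a different label, count an error whenever X is
    closer to B than to A (ties count as half an error)."""
    errors, total = 0.0, 0
    n = len(items)
    for ia in range(n):
        for ib in range(n):
            for ix in range(n):
                if ia == ib or ix in (ia, ib):
                    continue
                if labels[ia] != labels[ix] or labels[ib] == labels[ix]:
                    continue
                total += 1
                da = dist(items[ia], items[ix])
                db = dist(items[ib], items[ix])
                if da > db:
                    errors += 1.0
                elif da == db:
                    errors += 0.5
    return errors / total if total else 0.0
```

With a good representation, within-category distances are small and the error rate approaches 0; chance level is 0.5, which is what makes the measure directly interpretable without a ROC curve.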

Most competitive learning methods for computational linguistics are supervised, and thus require labeled examples, which are expensive to obtain. Moreover, these techniques suffer from data scarcity: many words appear only a small number of times, or even not at all, in the training data. It therefore helps considerably to first learn word clusters on a large amount of unlabeled data, which is cheap to obtain, and then to use these clusters as features for the supervised task. This scheme has proven effective for various tasks such as named entity recognition, syntactic chunking or syntactic dependency parsing. It was also successfully applied to transfer learning of multilingual structure.

The most commonly used clustering method for semi-supervised learning is known as Brown clustering. While it remains one of the most efficient word representation methods, Brown clustering has two limitations that we address in this work. First, since it is a hard clustering method, it ignores homonymy. Second, it does not take into account syntactic relations between words, which seem crucial for inducing semantic classes. Our goal is thus to propose a method for semantic class induction which takes both syntax and homonymy into account, and then to study their effects on semantic class learning.

We start by introducing a new unsupervised method for semantic class induction. This is achieved by defining a generative model of sentences with latent variables, which aims at capturing the semantic roles of words. We require our method to be scalable, in order to learn models on large datasets containing tens of millions of sentences. More precisely, we make the following contributions:

We introduce a generative model of sentences, based on dependency trees, which can be seen as a generalization of Brown clustering,

We describe a fast approximate inference algorithm, based on message passing and online EM for scaling to large datasets. It allowed us to learn models with 512 latent states on a dataset with hundreds of millions of tokens in less than two days on a single core,

We learn models on two datasets, Wikipedia articles about musicians and the NYT corpus, and evaluate them on two semi-supervised tasks, namely supersense tagging and named entity recognition.

Most natural language processing systems based on machine learning are not robust to domain shift. For example, a state-of-the-art syntactic dependency parser trained on Wall Street Journal sentences has an absolute drop in performance of more than ten points when tested on textual data from the Web. An efficient solution to make these methods more robust to domain shift is to first learn a word representation using large amounts of unlabeled data from both domains, and then use this representation as features in a supervised learning algorithm. In this paper, we propose to use hidden Markov models to learn word representations for part-of-speech tagging. In particular, we study the influence of using data from the source, the target or both domains to learn the representation and the different ways to represent words using an HMM.

Nowadays, most natural language processing systems are based on supervised machine learning. Despite the great successes obtained by those techniques, they unfortunately still suffer from important limitations. One of them is their sensitivity to domain shift: for example, a state-of-the-art part-of-speech tagger trained on the Wall Street Journal section of the Penn treebank achieves an accuracy of

One explanation for this drop in performance is the large lexical difference that exists across domains. This results in many out-of-vocabulary (OOV) words in the test data, *i.e.*, words of the test data that were not observed in the training set. For example, more than

Labeling enough data to obtain a high accuracy for each new domain is not a viable solution. Indeed, it is expensive to label data for natural language processing, because it requires expert knowledge in linguistics. Thus, there is an important need for transfer learning, and more precisely for domain adaptation, in computational linguistics. A common solution consists in using large quantities of unlabeled data, from both source and target domains, in order to learn a good word representation. This representation is then used as features to train a supervised classifier that is more robust to domain shift. Depending on how much data from the source and the target domains is used, this method can be viewed as performing semi-supervised learning or domain adaptation. The goal is to reduce the impact of out-of-vocabulary words on performance. This scheme was first proposed to reduce data sparsity for named entity recognition, before being applied to domain adaptation for part-of-speech tagging or syntactic parsing.

Hidden Markov models have already been considered in previous work to learn word representations for domain adaptation or semi-supervised learning. Our contributions are mostly experimental: we compare different word representations that can be obtained from an HMM, and study the effect of training the unsupervised HMM on the source, the target or both domains. While previous work mostly uses Viterbi decoding to obtain word representations from an HMM, we empirically show that posterior distributions over latent classes give better results.
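The posterior representations can be sketched with a standard scaled forward-backward pass. This is a generic textbook implementation written for illustration, not the paper's code; the parameter shapes and names are our assumptions.

```python
import numpy as np

def token_posteriors(obs, pi, A, B):
    """Scaled forward-backward pass: posterior distribution over the K
    latent HMM classes for each token of a sentence.
    obs: token ids; pi: initial probs (K,); A: transitions (K, K);
    B: emission probs (K, V)."""
    T, K = len(obs), len(pi)
    alpha = np.zeros((T, K))
    beta = np.zeros((T, K))
    alpha[0] = pi * B[:, obs[0]]
    alpha[0] /= alpha[0].sum()                 # rescale for stability
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        alpha[t] /= alpha[t].sum()
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
        beta[t] /= beta[t].sum()
    post = alpha * beta
    return post / post.sum(axis=1, keepdims=True)
```

Averaging these per-token posteriors over all occurrences of a word type yields a dense, soft representation of that word, in contrast with the one-hot codes produced by Viterbi decoding.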

Collaboration with: Konstantina Palla, Alex Davies, Zoubin Ghahramani (Machine Learning Group, Department of Engineering, University of Cambridge), Gjergji Kasneci (Max Planck Institut für Informatik), Thore Graepel (Microsoft Research Cambridge)

The Internet has enabled the creation of a growing number of large-scale knowledge bases in a variety of domains containing complementary information. Tools for automatically aligning these knowledge bases would make it possible to unify many sources of structured knowledge and answer complex queries. However, the efficient alignment of large-scale knowledge bases still poses a considerable challenge. Here, we present Simple Greedy Matching (SiGMa), a simple algorithm for aligning knowledge bases with millions of entities and facts. SiGMa is an iterative propagation algorithm that leverages both the structural information from the relationship graph and flexible similarity measures between entity properties in a greedy local search, which makes it scalable. Despite its greedy nature, our experiments indicate that SiGMa can efficiently match some of the world's largest knowledge bases with high accuracy. We provide additional experiments on benchmark datasets which demonstrate that SiGMa can outperform state-of-the-art approaches both in accuracy and efficiency.
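A toy sketch of the greedy propagation loop follows. It is heavily simplified relative to SiGMa: undirected unlabeled edges, a user-supplied property-similarity function, and no candidate generation via inverted indices; all names are illustrative.

```python
import heapq
from collections import defaultdict

def sigma_align(edges1, edges2, sim, seeds):
    """Greedy iterative alignment: keep a priority queue of candidate
    entity pairs scored by property similarity plus the number of
    already-matched neighbour pairs; commit the best pair, then push
    the neighbourhood of the new match as fresh candidates."""
    nbrs1, nbrs2 = defaultdict(set), defaultdict(set)
    for u, v in edges1:
        nbrs1[u].add(v)
        nbrs1[v].add(u)
    for u, v in edges2:
        nbrs2[u].add(v)
        nbrs2[v].add(u)

    matched = dict(seeds)                      # KB1 entity -> KB2 entity
    matched_rev = {v: k for k, v in matched.items()}

    def score(e1, e2):
        structural = sum(1 for n in nbrs1[e1] if matched.get(n) in nbrs2[e2])
        return sim(e1, e2) + structural

    heap = []
    def push_candidates(e1, e2):
        for n1 in nbrs1[e1]:
            for n2 in nbrs2[e2]:
                if n1 not in matched and n2 not in matched_rev:
                    heapq.heappush(heap, (-score(n1, n2), n1, n2))

    for e1, e2 in list(matched.items()):
        push_candidates(e1, e2)
    while heap:
        _, e1, e2 = heapq.heappop(heap)
        if e1 in matched or e2 in matched_rev:
            continue                           # stale candidate, skip
        matched[e1], matched_rev[e2] = e2, e1
        push_candidates(e1, e2)
    return matched
```

The key design choice is that candidates are only generated in the neighbourhood of existing matches, so the search stays local and the cost grows with the number of matched pairs rather than with the full cross-product of entities.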

Technicolor: “Tensor factorization algorithms for recommendation systems”.

Xerox: CIFRE PhD student “IMAGE2TXT: From images to text”.

Microsoft Research: “Structured Large-Scale Machine Learning”. Machine learning is now ubiquitous in industry, science, engineering, and personal life. While early successes were obtained by applying off-the-shelf techniques, there are two main challenges faced by machine learning in the “big data” era: structure and scale. The project proposes to explore three axes, from theoretical, algorithmic and practical perspectives: (1) large-scale convex optimization, (2) large-scale combinatorial optimization and (3) sequential decision making for structured data. The project involves two Inria sites (Paris-Rocquencourt and Grenoble) and four MSR sites (Cambridge, New England, Redmond, New York).

Google Research Award: “Large scale adaptive machine learning with finite data sets”

S. Arlot, member of the ANR Calibration project

Title: Statistical calibration

Coordinator: University Paris Dauphine

Leader: Vincent Rivoirard

Other members: 34 members, mostly among CEREMADE (Paris Dauphine), Laboratoire Jean-Alexandre Dieudonné (Université de Nice) and Laboratoire de Mathématiques de l'Université Paris Sud

Instrument: ANR Blanc

Duration: Jan 2012 - Dec 2015

Total funding: 240 000 euros

S. Arlot, F. Bach, members of the "Gargantua" project

Title: Big data: machine learning and mathematical optimization for massive datasets

Coordinator: Laboratoire Jean Kuntzmann (UMR 5224)

Leader: Zaid Harchaoui

Other members: 13 members: S. Arlot, F. Bach and researchers from the Laboratoire Jean Kuntzmann, the Laboratoire d'Informatique de Grenoble (Université Joseph Fourier) and the Laboratoire Paul Painlevé (Université Lille 1).

Instrument: CNRS MASTODONS challenge (défi MASTODONS)

Duration: May 2013 - Dec 2013 (may be renewed for 2014)

Total funding: 30 000 euros for 2013

Webpage: http://

Type: IDEAS

Instrument: ERC Starting Grant

Duration: December 2009 - November 2014

Coordinator: Inria (France)

Abstract: Machine learning is now a core part of many research domains, where the abundance of data has forced researchers to rely on automated processing of information. The main current paradigm of application of machine learning techniques consists in two sequential stages: in the representation phase, practitioners first build a large set of features and potential responses for model building or prediction. Then, in the learning phase, off-the-shelf algorithms are used to solve the appropriate data processing tasks. While this has led to significant advances in many domains, the potential of machine learning techniques is far from being reached.

Type: IDEAS

Instrument: ERC Starting Grant

Duration: May 2011 - May 2016

Coordinator: CNRS

Abstract: Interior point algorithms and a dramatic growth in computing power have revolutionized optimization in the last two decades. Highly nonlinear problems which were previously thought intractable are now routinely solved at reasonable scales. Semidefinite programs (i.e. linear programs on the cone of positive semidefinite matrices) are a perfect example of this trend: reasonably large, highly nonlinear but convex eigenvalue optimization problems are now solved efficiently by reliable numerical packages. This in turn means that a wide array of new applications for semidefinite programming have been discovered, mimicking the early development of linear programming. To cite only a few examples, semidefinite programs have been used to solve collaborative filtering problems (e.g. make personalized movie recommendations), approximate the solution of combinatorial programs, optimize the mixing rate of Markov chains over networks, infer dependence patterns from multivariate time series or produce optimal kernels in classification problems. These new applications also come with radically different algorithmic requirements. While interior point methods solve relatively small problems with a high precision, most recent applications of semidefinite programming in statistical learning for example form very large-scale problems with comparatively low precision targets, programs for which current algorithms cannot form even a single iteration. This proposal seeks to break this limit on problem size by deriving reliable first-order algorithms for solving large-scale semidefinite programs with a significantly lower cost per iteration, using for example subsampling techniques to considerably reduce the cost of forming gradients. 
Beyond these algorithmic challenges, the proposed research will focus heavily on applications of convex programming to statistical learning and signal processing theory where optimization and duality results quantify the statistical performance of coding or variable selection algorithms for example. Finally, another central goal of this work will be to produce efficient, customized algorithms for some key problems arising in machine learning and statistics.

Title: Fast Statistical Analysis of Web Data via Sparse Learning

Inria principal investigator: Francis Bach

International Partner (Institution - Laboratory - Researcher):

University of California Berkeley (United States) - EECS and IEOR Departments - Francis Bach

Duration: 2011 - 2013

See also: http://

The goal of the proposed research is to provide web-based tools for the analysis and visualization of large corpora of text documents, with a focus on databases of news articles. We intend to use advanced algorithms, drawing on recent progress in machine learning and statistics, to allow a user to quickly produce a short summary and an associated timeline showing how a certain topic is described in news media. We are also interested in unsupervised learning techniques that allow a user to understand the differences between several news sources, topics or documents.

Michael Jordan (U.C. Berkeley) spent one year in our team, until the summer of 2013, funded by the Fondation Sciences Mathématiques de Paris and Inria.

F. Bach: Journal of Machine Learning Research, Action Editor.

F. Bach: IEEE Transactions on Pattern Analysis and Machine Intelligence, Associate Editor.

F. Bach: Information and Inference, Associate Editor.

F. Bach: SIAM Journal on Imaging Sciences, Associate Editor.

F. Bach: International Journal of Computer Vision, Associate Editor

A. d'Aspremont: Optimization Methods and Software

A. d'Aspremont: SIAM Journal on Optimization

F. Bach: International Conference on Machine Learning, 2013

F. Bach: Neural Information Processing Systems, 2013

S. Arlot, member of the program committee of the Second Workshop on Industry & Practices for Forecasting (WIPFOR), EDF R&D, Clamart. 5-7 June 2013.

A. d'Aspremont was co-organizer of the workshop on optimization and machine learning at Les Houches in January 2013.

F. Bach organized a workshop on "Big data: theoretical and practical challenges", May 14-15, 2013, at the Institut Henri Poincaré (co-organized with Michael Jordan), funded by the Fondation Sciences Mathématiques de Paris and Inria.

F. Bach and Michael Jordan co-organized the "Fête Parisienne in Computation, Inference and Optimization: A Young Researchers' Forum", a workshop organized in the framework of the Schlumberger Chair for mathematical sciences at IHÉS, March 20, 2013. http://

F. Bach also co-organized the workshop "Succinct Data Representations and Applications" (Theoretical Foundations of Big Data program), Simons Institute, Berkeley, September 2013.

S. Arlot is a member of the board for the entrance exam of the École Normale Supérieure (mathematics, voie B/L).

A. d'Aspremont is a member of the scientific committee of the programme Gaspard Monge pour l'optimisation (PGMO).

A. d'Aspremont is a member of the scientific committee of Thales Alenia Space.

S. Arlot, "Kernel change-point detection", Workshop "Non-stationarity in Statistics and Risk Management" (CIRM, Marseille, January, 21-25, 2013).

S. Arlot, "Sélection de modèles par validation croisée et sélection de paramètres pour la régression ridge et le Lasso", Groupe de Travail Neurospin-Select (Saclay, February, 20, 2013).

S. Arlot, "Optimal model selection with V-fold cross-validation: how should V be chosen?", Fête Parisienne in Computation, Inference and Optimization: A Young Researchers' Forum (IHES, Bures-sur-Yvette, March, 20, 2013).

S. Arlot, "Kernel change-point detection", Groupe de Travail de Statistique de Jussieu (Paris, November, 11, 2013).

S. Arlot, "Analyse du biais de forêts purement aléatoires", Séminaire de l'Equipe de Probabilités et Statistiques (Institut Elie Cartan, Nancy, November, 28, 2013).

S. Arlot, "Optimal data-driven estimator selection with minimal penalties", keynote lecture, Workshop "Mathematical Statistics with Applications in Mind" (CIRM, Marseille, December, 9-13, 2013).

Simon Lacoste-Julien, "Harnessing the structure of data for discriminative machine learning":

Department of Statistics, University of Oxford, February 2013

Intelligent Systems Lab Amsterdam, University of Amsterdam, February 2013

Département d'informatique, Université de Sherbrooke, April 2013

School of Computer Science, McGill University, April 2013

Département d'Informatique, École Normale Supérieure, April 2013

"Block-Coordinate Frank-Wolfe Optimization for Structured SVMs"

ICML, Atlanta, USA, June 2013

ICCOPT, Lisbon, Portugal, July 2013

Simon Lacoste-Julien, "SiGMa: Simple Greedy Matching for Aligning Large Knowledge Bases", KDD, Chicago, August 2013

Simon Lacoste-Julien, "Frank-Wolfe optimization insights in machine learning"

Département d'informatique, Université de Sherbrooke, August 2013

SMILE seminar, Paris, November 2013

SAIL meeting, UC Berkeley, December 2013

CILVR Lab, New York University, December 2013

Machine Learning Lab, Columbia University, December 2013

Reasoning and Learning Lab, McGill University, December 2013

Simon Lacoste-Julien, "Making Sense of Big Data", CaFFEET, Stanford University, November

Michael Jordan, Keynote Speaker, ACM Conference on Knowledge Discovery and Data Mining (SIGKDD), Beijing, China, 8/15/12

Michael Jordan, Keynote Speaker, 21st Century Computing Conference, Tianjin, China, 10/25/12

Michael Jordan, Keynote Speaker, ICONIP, Doha, Qatar, 11/12/12

Michael Jordan, Invited Speaker, SAMSI Workshop on Massive Data Analysis, 9/9/12

Michael Jordan, Invited Speaker, Méthodes Bayésiennes non Paramétriques pour le Traitement du Signal et des Images, Telecom ParisTech, Paris, France, 9/8/12

Michael Jordan, Invited Speaker, Séminaire Parisien de Statistique, Paris, France, 9/17/12

Michael Jordan, Invited Speaker, Workshop on Random Matrices and their Applications, Paris, France, 10/9/12

Michael Jordan, Colloquium, Département d'Informatique, École Normale Supérieure, 10/2/12

Michael Jordan, Vincent Meyer Colloquium, Israel Institute of Technology, 11/5/12

Michael Jordan, Invited Speaker, Workshop on Optimization and Statistical Learning, Les Houches, France, 1/8/13

Michael Jordan, Harry Nyquist Lecture, Department of Electrical Engineering, Yale, 1/23/13

Michael Jordan, Invited Speaker, Simons Workshop on Big Data, New York, 1/24/13

Michael Jordan, Keynote Speaker, Workshop on Nonsmooth Optimization in Machine Learning, Liege, Belgium, 3/4/13

Michael Jordan, Keynote Speaker, StatLearn Workshop, Bordeaux, France, 4/8/13

Michael Jordan, Lecture Series, Ecole Nationale de la Statistique et de l'Administration, Paris, 5/13

Michael Jordan, Keynote Speaker, Amazon Machine Learning Conference, Seattle, 4/28/13

Michael Jordan, Keynote Speaker, Bayesian Nonparametrics Workshop, Amsterdam, 6/10/13

Michael Jordan, Invited Speaker, Workshop on High-Dimensional Statistics, Moscow, 6/26/13

Michael Jordan, Distinguished Lecture, Department of Statistics, University of Oxford, 5/7/13

Michael Jordan, Colloquium, Department of Statistics, University of Cambridge, 5/10/13

Michael Jordan, Invited Speaker, GdR ISIS Conference, Telecom ParisTech, Paris, 5/16/13

Matthieu Solnon, "Analysis of the oracle risk in multi-task kernel ridge regression", Colloque Statistique Mathématique et Applications, Fréjus, France.

Mark Schmidt, "Opening up the black box: Faster methods for non-smooth and big-data optimization problems". Invited talk at DeepMind Technologies, London (June 2013).

Mark Schmidt, "Linearly-Convergent Stochastic-Gradient Methods". Invited talk at Paris 6, Paris (June 2013).

Mark Schmidt, "Minimizing Finite Sums with the Stochastic Average Gradient Algorithm". Invited talk at ICCOPT, Lisbon (July 2013).

Edouard Grave, Alpage, Inria / Paris 7, May 2013

Edouard Grave, Criteo, September 2013

Edouard Grave, Laboratoire de Science Cognitive et Psycholinguistique, EHESS / ENS / CNRS, November 2013

Alexandre d'Aspremont, "Convex Relaxations for Permutation Problems"

Workshop on Succinct Data Representations and Applications, Simons Institute, Berkeley, Sept. 2013.

Workshop MAORI, Ecole Polytechnique, Nov. 2013.

Alexandre d'Aspremont, "Phase Retrieval, MAXCUT and Complex Semidefinite Programming"

GdT CEREMADE, Paris Dauphine, April 2013.

Journée du GdR ISIS, Telecom, May 2013.

Journée du GdR MOA, June 2013.

Alexandre d'Aspremont, "Approximation Bounds for Sparse PCA"

Workshop on Structured families of functions and applications, Oberwolfach, February 2013.

PACM seminar, Princeton, USA, February 2013.

Séminaire ENSAE, France, April 2013.

Big Data workshop, IHP, May 2013.

Alexandre d'Aspremont, "An Optimal Affine Invariant Smooth Minimization Algorithm", International Workshop on Statistical Learning, Moscow, June 2013.

F. Bach: Optimization and Statistical Learning, January 6 - 11, 2013. Les Houches, France (Invited presentation)

F. Bach: International Biomedical and Astronomical Signal Processing (BASP) Frontiers workshop, January 2013 (Invited presentation)

F. Bach: Convex Relaxation Methods for Geometric Problems in Scientific Computing, IPAM, Los Angeles, February 2013 (Invited presentation)

F. Bach: Nonsmooth optimization in machine learning. March 04, 2013, University of Liège (Invited presentation)

F. Bach: Microsoft Research Machine Learning Summit: April 22-24, 2013 (Invited presentation)

F. Bach: International Workshop on Advances in Regularization, Optimization, Kernel Methods and Support Vector Machines: theory and applications, July 8 - 10, 2013, Leuven, Belgium (Invited presentation)

F. Bach: European Conference on Data Analysis, Luxembourg, July 2013 (Invited presentation)

F. Bach: European Meeting of Statisticians (EMS), Budapest, Hungary, 20-25 July 2013 (Invited presentation)

F. Bach: Fourth Cargese Workshop on Combinatorial Optimization. Institut d'Etudes Scientifiques de Cargèse, Corsica (France). September 30 - October 5, 2013 (Invited presentation)

F. Bach: 9èmes Journées Nationales de la Recherche en Robotique, Annecy, October 16-18, 2013 (invited presentation)

F. Bach: Radboud University, Nijmegen, the Netherlands, November 29, 2013 (Seminar)

F. Bach: GlobalSIP: IEEE Global Conference on Signal and Information Processing, December 3-5, 2013 (invited presentation)

F. Bach: NIPS workshops, December 2013 (2 invited presentations)

Licence: F. Bach, G. Obozinski, R. Lajugie: “Apprentissage statistique”, 35h, École Normale Supérieure, “Math-Info” track, first year.

Mastère: S. Arlot and F. Bach, "Statistical learning", 24h, Mastère M2, Université Paris-Sud, France.

Mastère: F. Bach, G. Obozinski, Introduction aux modèles graphiques (30h), Master MVA (École Normale Supérieure de Cachan).

Doctorat: S. Arlot, "Classification and statistical machine learning", 1h tutorial for the CEMRACS 2013, Marseille, France.

Licence: A. d'Aspremont, "Optimisation", 36h, L3, ENSAE, France.

Master M2: A. d'Aspremont, "Convex Optimization, Algorithms and Applications", 27h, M2, ENS Cachan, France.

Master M2: A. d'Aspremont, "Optimisation et simulation numérique.", 14h, M2, Paris Sud (Orsay), France.

PhD: Matthieu Solnon, "Multi-task statistical learning", UPMC, November 25, 2013. Advisors: S. Arlot and F. Bach.

PhD in progress: Fajwel Fogel, "Optimisation et Apprentissage", September 2012, A. d'Aspremont.

PhD in progress: Bamdev Mishra

PhD in progress: Loic Landrieu

PhD in progress: Sesh Kumar, "Optimization and submodular functions", May 2012, F. Bach.

PhD in progress: Edouard Grave, "A Markovian approach to distributional semantics", F. Bach, G. Obozinski (defended January 20, 2014).

PhD in progress: Anil Nelakanti, "Structured sparsity and language models", F. Bach, G. Obozinski (to be defended February 11, 2014).

PhD in progress: Rémi Lajugie, September 2012, S. Arlot and F. Bach.

A. d'Aspremont was a member of the PhD committee of Nicholas Boumal's thesis at the Université Catholique de Louvain.

F. Bach was a member of the PhD committees of Clément Calauzènes (UPMC), Azadeh Khaleghi (Inria Lille) and Yaoliang Yu (University of Alberta).

F. Bach was a member of the HDR committees of Ivan Laptev (Inria-ENS) and Pawan Kumar (École Centrale).

A. d'Aspremont, Journée Big Data, ENSAI, Rennes, November 2013.