The team is located at Ecole normale supérieure, 45 rue d'Ulm, Paris.

We are a research team on machine learning, with an emphasis on statistical methods. Processing huge amounts of complex data has created a need for statistical methods that remain valid under very weak hypotheses, in very high-dimensional spaces. Our aim is to contribute to a robust, adaptive, computationally efficient and, desirably, non-asymptotic theory of statistics that can be profitable to learning.

Our theoretical studies bear on the following mathematical tools:

regression models used for supervised learning, from different perspectives: the PAC-Bayesian approach to generalization bounds; robust estimators; model selection and model aggregation;

sparse models of prediction;

interactions between unsupervised learning, information theory and adaptive data representation;

individual sequence theory;

multi-armed bandit problems (possibly indexed by a continuous set).

We are involved in the following applications:

the improvement of prediction through the on-line aggregation of predictors, with an emphasis on the forecasting of air quality, electricity consumption, and production data of oil reservoirs;

natural image analysis, and more precisely the use of unsupervised learning in data representation;

computational linguistics;

statistical inference on biological and neurobiological data.

The most obvious contribution of statistics to machine learning is to
consider the supervised learning scenario as a special case of regression
estimation: given a sample of input-output pairs, the goal is to estimate the
function mapping inputs to expected outputs.

One of the specialties of the team in this direction is to use PAC-Bayes inequalities to combine thresholded exponential moment inequalities. The name of this theory comes from its founder, David McAllester, and may be misleading: its cornerstone is rather made of non-asymptotic entropy inequalities and a perturbative approach to parameter estimation. The team has made major contributions to the theory, first focused on classification, then on regression. It introduced the idea of combining the PAC-Bayesian approach with the use of thresholded exponential moments, in order to derive bounds under very weak assumptions on the noise.
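To give the flavor of these results, here is a classical McAllester-type PAC-Bayesian bound for losses in [0, 1] (a textbook statement given for illustration, not the team's sharpest variant): with probability at least $1-\delta$ over an i.i.d. sample of size $n$, simultaneously for all posterior distributions $\rho$ over the parameter space,

```latex
\mathbb{E}_{\theta \sim \rho}\, R(\theta)
  \;\le\;
\mathbb{E}_{\theta \sim \rho}\, r_n(\theta)
  \;+\;
\sqrt{\frac{\mathrm{KL}(\rho \,\|\, \pi) + \ln\!\big(2\sqrt{n}/\delta\big)}{2n}}\,,
```

where $R$ and $r_n$ denote the expected and empirical risks, $\pi$ is a prior distribution fixed before seeing the data, and $\mathrm{KL}$ is the Kullback-Leibler divergence. The thresholded exponential moment technique mentioned above serves to replace the boundedness assumption on the loss by much weaker moment conditions.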

Another line of research in regression estimation is the use
of sparse models, and their link with
ℓ1-penalized convex procedures such as the Lasso.

For instance, the Lasso yields *sparse* solutions (the estimate has only a few nonzero coordinates),
which usually have a clear interpretation in many settings (e.g., the influence or lack of influence of some variables).
In addition, unlike exhaustive model selection over all subsets of variables, it is *computationally feasible* for high-dimensional data.
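To make the sparsity mechanism concrete, here is a minimal sketch (illustrative code, not the team's implementation) of the Lasso solved by proximal gradient descent (ISTA); the soft-thresholding step is exactly what sets small coordinates to zero:

```python
import numpy as np

def soft_threshold(v, t):
    """Soft-thresholding, the proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """Minimize (1/2n) ||y - X beta||^2 + lam * ||beta||_1 by ISTA."""
    n, d = X.shape
    step = n / np.linalg.norm(X, 2) ** 2  # 1 / Lipschitz constant of the gradient
    beta = np.zeros(d)
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n   # gradient of the quadratic part
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta
```

Larger values of `lam` yield sparser estimates; in theory `lam` is taken of order sqrt(log d / n), in practice it is often calibrated by cross-validation.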

The next brick of our scientific foundations explains why and how, in certain cases, we may formulate
absolutely no assumption on the data.

We are concerned here with *sequential prediction* of outcomes, given some base
predictions formed by *experts*. We distinguish two settings, depending on how
the sequence of outcomes is generated: it is either

the realization of some stationary process,

or is not modeled at all as the realization of any underlying stochastic process (these sequences
are called *individual sequences*).

The aim is to predict almost as well as the best expert. Typical good forecasters maintain one weight per expert, update these weights depending on past performance, and output at each step the corresponding weighted linear combination of the experts' advice.

The difference between the cumulative prediction error of the forecaster and that of the best expert is called the regret. The goal here is to upper bound the regret by as small a quantity as possible.
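The weighted-average strategy described above can be sketched as follows (an illustrative implementation of the exponentially weighted average forecaster for the square loss, with the classical tuning of the learning rate for losses in [0, 1]; the team's work studies refined variants):

```python
import numpy as np

def ewa_forecast(expert_preds, outcomes, eta=None):
    """Exponentially weighted average forecaster.

    expert_preds: (T, K) array, expert_preds[t, k] = advice of expert k at time t
    outcomes:     (T,) array of realized outcomes, assumed here to lie in [0, 1]
    Returns the forecaster's predictions and the final weight vector.
    """
    T, K = expert_preds.shape
    if eta is None:
        eta = np.sqrt(8.0 * np.log(K) / T)  # classical tuning for losses in [0, 1]
    cum_loss = np.zeros(K)
    preds = np.zeros(T)
    for t in range(T):
        w = np.exp(-eta * (cum_loss - cum_loss.min()))  # shifted for stability
        w /= w.sum()
        preds[t] = w @ expert_preds[t]                  # weighted combination
        cum_loss += (expert_preds[t] - outcomes[t]) ** 2
    w = np.exp(-eta * (cum_loss - cum_loss.min()))
    return preds, w / w.sum()
```

With this tuning, the classical analysis bounds the regret by a term of order sqrt(T log K), vanishing per round as T grows.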

We are interested in settings in which the feedback obtained on the predictions is limited, in the sense that it does not fully reveal what actually happened.

This is also a sequential problem in which some regret is to be minimized.

However, this problem is a stochastic one: the reward of each arm is drawn from an unknown distribution, and the arms may be numerous, possibly indexed by a continuous set.
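As an illustration of the stochastic setting with finitely many arms, the classical UCB1 index policy balances exploration and exploitation through a confidence bonus (a textbook sketch; the continuous-arm case requires additional discretization or smoothness arguments):

```python
import numpy as np

def ucb1(arm_means, T, rng):
    """Run UCB1 for T rounds on Bernoulli arms with the given means.

    At each round, pull the arm maximizing
        empirical mean + sqrt(2 * log t / number of pulls),
    and return the pull counts of each arm.
    """
    K = len(arm_means)
    counts = np.zeros(K, dtype=int)
    sums = np.zeros(K)
    for t in range(T):
        if t < K:
            arm = t  # pull each arm once to initialize the indices
        else:
            ucb = sums / counts + np.sqrt(2.0 * np.log(t + 1) / counts)
            arm = int(np.argmax(ucb))
        reward = float(rng.random() < arm_means[arm])  # Bernoulli draw
        counts[arm] += 1
        sums[arm] += reward
    return counts
```

The classical analysis shows that each suboptimal arm is pulled only O(log T) times, so the regret grows logarithmically in T.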

Approachability is the ability to control random walks of vector payoffs. At each round, the first player obtains a vector payoff, depending on his action and on the action of the opponent player. The aim is to ensure that the average of the vector payoffs converges to some given convex set. Necessary and sufficient conditions for such strategies to exist were obtained by Blackwell and others, both in the full-information and in the bandit cases.
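For reference, Blackwell's characterization in the full-information case states that a closed convex set $\mathcal{C} \subseteq \mathbb{R}^d$ is approachable if and only if every mixed action of the opponent can be answered in expectation inside $\mathcal{C}$:

```latex
\forall\, q \in \Delta(\mathcal{B}), \ \exists\, p \in \Delta(\mathcal{A}) : \quad
m(p, q) \;=\; \sum_{a \in \mathcal{A}} \sum_{b \in \mathcal{B}} p(a)\, q(b)\, m(a, b) \;\in\; \mathcal{C},
```

where $m(a,b)$ denotes the vector payoff of the action pair $(a,b)$ and $\Delta(\cdot)$ the set of probability distributions over a finite action set.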

Some of these results can be extended to the case of games with signals (games with partial monitoring), where at each round the only feedback obtained by the first player is a random signal drawn according to a distribution that depends on the action profile taken by the two players, while the opponent player still has full monitoring.

Our partner is EDF R&D. The goal is to aggregate in a sequential fashion the forecasts made by some (about 20) base experts in order to predict the electricity consumption at a global level (that of all French customers) at a half-hourly step. We need to abide by some operational constraints: the predictions must be made at noon for the next 24 hours (i.e., for the next 48 half-hourly rounds).

Our partner is the Inria project-team CLIME (Paris-Rocquencourt). The goal is to aggregate in a sequential fashion the forecasts made by some (about 100) base experts in order to output field predictions of the concentrations of some pollutants (typically, ozone) over Europe. The results have been, and will continue to be, transferred to the public operator INERIS, which uses them in an operational way.

Our partner is IFP Energies nouvelles. The goal is to aggregate in a sequential fashion the forecasts made by some (about 100) base experts in order to predict some behaviors (gas/oil ratio, cumulative oil extracted, water cut) of the exploitation of some oil wells.

Our partner is the start-up Safety Line. The purpose of this application is to investigate statistical learning strategies for mining massive data sets originating from high-frequency aircraft recordings, in order to improve safety.

We propose and study new language models that bridge the gap between models oriented towards the statistical analysis of large corpora and grammars oriented towards the description of syntactic features as understood by academic experts. We have conceived a new kind of grammar, based on some cut and paste mechanism and some label aggregation principle, that can be fully learnt from a corpus. We are currently testing this model and studying its mathematical properties.

The question is how interactions between neurons can be detected. A mathematical model is given by multivariate Hawkes processes. Lasso-type methods can then be used to estimate the interaction functions in a nonparametric setting with fast algorithms, providing inference on the unitary event activity of individual neurons.
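In this model, the conditional spiking intensity of neuron $i$ in a network of $M$ neurons takes the standard multivariate Hawkes form

```latex
\lambda_i(t) \;=\; \nu_i \;+\; \sum_{j=1}^{M} \int_{0}^{t^-} h_{ij}(t - s)\, \mathrm{d}N_j(s),
```

where $\nu_i$ is the spontaneous rate of neuron $i$, $N_j$ is the point process of spikes of neuron $j$, and $h_{ij}$ is the interaction function measuring the influence of neuron $j$ on neuron $i$; the Lasso step estimates the functions $h_{ij}$ expanded on a finite dictionary.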

We do not discuss here the contributions provided by , , , , , , , since they were achieved in 2011 or earlier (but only published this year due to the reviewing and publishing process). Also, the book (whose first edition was published in 2009) was augmented and revised for its second edition, published this year.

We wrote extended journal versions of some conference papers discussed in previous annual activity reports; they correspond to references , , .

Approximate Bayesian Computation (ABC for short) is a family of computational techniques which offer an almost automated solution in situations where evaluating the posterior likelihood is computationally prohibitive, or whenever suitable likelihoods are not available. In the paper , Gérard Biau and his coauthors analyze the procedure from the point of view of k-nearest neighbor estimation.

In , Vincent Rivoirard and Judith Rousseau study the asymptotic posterior distribution of linear functionals of the density by deriving general conditions to obtain a semi-parametric version of the Bernstein-von Mises theorem. The special case of the cumulative distribution function evaluated at a specific point is considered in detail. In particular, they show that for infinite-dimensional exponential families, under quite general assumptions, the asymptotic posterior distribution of the functional can be either Gaussian or a mixture of Gaussian distributions with different centering points. This illustrates the positive but also the negative phenomena that can occur in the study of Bernstein-von Mises results. In , Vincent Rivoirard and Judith Rousseau use convergence rates on Besov spaces established in .

We applied our sequential aggregation techniques to a new data set, with IFP Energies nouvelles as a partner. The goal was to aggregate in a sequential fashion the forecasts made by some (about 100) base experts in order to predict some behaviors (gas/oil ratio, cumulative oil extracted, water cut) of the exploitation of some oil wells. Results were obtained with the help of an intern, Charles-Pierre Astolfi, and are described in the technical report (to be transformed into a regular journal / conference paper next year).

We now know that a good part of the statistical performance of regression and classification algorithms relies on the metric chosen to represent the proximity between the data points. Throughout his work, Gérard Biau became convinced that, well beyond the traditional distances, (dis)similarities and other reproducing-kernel metrics, it is now necessary to attempt to define proximities generated by the sample itself. These metrics are inevitably random, and force us to rethink the nature of the estimates, as shown for example in the preliminary article .

In her PhD started in September 2012, Ilaria Giulini uses dimension free estimates of the principal components of an i.i.d. sample of points in a Reproducing Kernel Hilbert Space to derive new unsupervised clustering algorithms based on the idea of dimension reduction by nonlinear coordinate smoothing along aggregated principal components. The dimension free estimates are obtained using PAC-Bayes bounds derived from thresholded exponential moments.

Semiparametric nonlinear mixed-effects models (SNMMs) have been proposed as an extension of nonlinear mixed-effects models (NLMMs). These models are a good compromise, retaining nice features of both parametric and nonparametric models and resulting in more flexible models than standard parametric NLMMs. In , Vincent Rivoirard and his coauthors propose new estimation strategies in SNMMs. They propose a Lasso-type method to estimate the unknown nonlinear function, and derive oracle inequalities for this nonparametric estimator. They combine the two approaches in a general estimation procedure, which they illustrate with simulations and through the analysis of a real data set of price evolution in on-line auctions.

In a forthcoming paper, Olivier Catoni and Thomas Mainguy study a new statistical model to learn the syntactic structure of natural languages from a training set made of written sentences. This model learns a new type of stochastic grammar and defines a statistical model on sentences. Global constraints are enforced, which sets the approach apart from the family of Markov models. On the other hand, the grammar model generates outputs through a split-and-merge stochastic process that is more elaborate than the production rules defining a context-free grammar. Experiments made on small corpora are very encouraging. Working on large corpora will require speeding up the algorithms that implement the model, as well as some code optimization.

Gérard Biau finished supervising the PhD thesis of Benoît Patra, which took place
until March 2012 within an industrial contract (“thèse CIFRE”) with Lokad.com (http://

Gérard Biau has been supervising the PhD thesis of Baptiste Gregorutti since December 2011,
within an industrial contract (“thèse CIFRE”) with Safety Line (http://

Gilles Stoltz has been supervising the PhD thesis of Pierre Gaillard, which has been taking place
since September 2012 within an industrial contract (“thèse CIFRE”) with EDF R&D (http://

Gilles Stoltz supervised the M.Sc. internship of Charles-Pierre Astolfi, which took place within a
collaboration with IFP Energies nouvelles (http://

ANR project in the conception and simulation track: EXPLO/RA (involves Emilien Joly, Pierre Gaillard, Sébastien Gerchinovitz,
Gilles Stoltz; see http://

ANR project in the blank program: Parcimonie (involves Sébastien Gerchinovitz, Vincent Rivoirard, Gilles Stoltz;
see http://

ANR project in the blank program: Calibration (involves Vincent Rivoirard, who is the coordinator; see https://

Thanks to the PASCAL European network of Excellence (http://

We have some international collaborations, with

Karine Bertin, University of Valparaiso, Chile;

Luc Devroye, McGill University, Canada;

Shie Mannor, Technion, Israel.

In particular, Pierre Gaillard spent 5 months working with Shie Mannor from January to May 2012.

Gilles Stoltz was the co-chair of the program committee of the 23rd International Conference on Algorithmic Learning Theory (ALT'12); see the edited volume . He was also a member of the program committee of the 25th Conference on Learning Theory (COLT'12).

We (co-)organized the following seminars:

Statistical machine learning in Paris – SMILE (Gérard Biau, Gilles Stoltz; see http://

Parisian seminar of statistics at IHP (Vincent Rivoirard; see https://

Gérard Biau serves as an Associate Editor for the
journals *Annales de l'ISUP*, *ESAIM: Probability and Statistics* and *International Statistical Review*.

Olivier Catoni has been a member of the editorial committee of the joint series of monographs “Mathématiques et Applications” between Springer and SMAI until June 2012.

All permanent members of the team reviewed several journal papers during the year.

Vincent Rivoirard is a member of the Board of SFdS.

Gérard Biau was elected a member of the national board of French universities (CNU) within the applied mathematics section (number 26).

Olivier Catoni is a member of the doctoral commission in mathematics of University Pierre et Marie Curie.

All permanent members of the team participated in several recruitment committees for assistant or full professor positions in universities.

Gérard Biau was elected a member of the Institut Universitaire de France (IUF).

Licence: Vincent Rivoirard, Statistics, 39h, L2 level, Université Paris-Dauphine, France

Licence: Olivier Catoni and Gilles Stoltz, Machine learning, 20h, L3 level, Ecole normale supérieure, France

Licence: Gérard Biau, Probability theory, 40h, L3 level, ISUP – Université Pierre et Marie Curie, France

Licence: Gilles Stoltz, Statistics for today's citizens and tomorrow's managers, 40h, L3 level, HEC Paris, France

Master: Gérard Biau, Mathematical statistics, 30h, M1 level, Ecole normale supérieure, France

Master: Vincent Rivoirard, Nonparametric statistics, 8h, M1 level, Ecole normale supérieure, France

Master: Vincent Rivoirard, Nonparametric statistics, 35h, M1 level, Université Paris-Dauphine, France

Master: Vincent Rivoirard, Classification and high-dimensional statistics, 18h, M2 level, Université Paris-Sud, France

Master: Gilles Stoltz, Statistics and information theory, 10h, M2 level, Université Paris-Sud, France

Master: Vincent Rivoirard, Methods for regression models, 21h, M2 level, Université Paris-Dauphine, France

Master: Vincent Rivoirard, Nonparametric Bayesian statistics, 21h, M2 level, Université Paris-Dauphine, France

Master: Gilles Stoltz, oral examiner in probability and statistics for the agrégation de mathématiques, France

PhD in progress : Thomas Mainguy, Statistical learning in computational linguistics, since September 2010, supervised by Olivier Catoni

PhD in progress : Emilien Joly, Phase transition of optimal risk and detection of contamination, since September 2011, supervised by Gábor Lugosi and co-supervised by Gilles Stoltz

PhD in progress : Pierre Gaillard, Aggregation of specialized predictors for the forecasting of electricity consumption, since September 2011, supervised by Gilles Stoltz

PhD in progress : Ilaria Giulini, Dimension free PAC-Bayes bounds for the Gram matrix and unsupervised clustering on the sphere of a Reproducing Kernel Hilbert space, since September 2012, supervised by Olivier Catoni

PhD in progress : Paul Baudin, Robust aggregation of predictors for the forecasting of air quality, with measures of uncertainties, since October 2012, supervised by Gilles Stoltz and co-supervised by Vivien Mallet

Several other PhDs in progress: Gérard Biau and Vincent Rivoirard (co-)supervise several other PhD students who are not members of our project-team (Benoît Patra, Clément Levrard, Benjamin Guedj, Svetlana Gribkova, Baptiste Gregorutti, Erwan Scornet and Nedjmeddine Allab for Gérard Biau; Laure Sansonnet for Vincent Rivoirard)

MSc theses: Gilles Stoltz supervised the MSc theses of Charles-Pierre Astolfi (MVA, ENS Cachan) and Paul Baudin (MVA, ENS Cachan)

Gérard Biau was a reviewer for the following PhD defenses:

Ekaterina Sergienko, Université Toulouse III, November 2012

Christophe Denis, Université Paris Descartes, November 2012

Emmanuel Onzon, Université Paris VI, November 2012

and a jury member of the following PhD defenses:

Mohamed Achibi, Université Paris VI, July 2012

Moïse Jérémie, Université Paris VI, September 2012

Virgile Caron, Université Paris VI, October 2012

Caroline Meynet, Université Paris-Sud, November 2012

Nicolas Jégou, Université Rennes 2, November 2012

Sarah Ouadah, Université Paris VI, December 2012

Sylvain Girard, Ecole Nationale Supérieure des Mines de Paris, December 2012

Vincent Rivoirard was a jury member for the following habilitation defenses:

Olivier Wintenberger, Université Paris-Dauphine, November 2012

Céline Vial, Université Claude Bernard Lyon 1, December 2012

Gérard Biau was a reviewer for the following habilitation defense:

Céline Vial, Université Claude Bernard Lyon 1, December 2012

and a jury member of the following habilitation defense:

Fadoua Balabdaoui, Université Paris-Dauphine, May 2012

Gilles Stoltz gave a lecture to an audience of students of “classes préparatoires” at Mathematic Park (http://