The team was created on July 1st, 2009 and became an INRIA project on January 1st, 2010.

We are a research team on machine learning, with an emphasis on statistical methods. Processing huge amounts of complex data has created a need for statistical methods which could remain valid under very weak hypotheses, in very high dimensional spaces. Our aim is to contribute to a robust, adaptive, computationally efficient and desirably non-asymptotic theory of statistics which could be profitable to learning.

Our theoretical studies bear on the following mathematical tools:

regression models used for supervised learning, from different perspectives: the PAC-Bayesian approach to generalization bounds; robust estimators; model selection and model aggregation;

sparse models of prediction and

interactions between unsupervised learning, information theory and adaptive data representation;

individual sequence theory;

multi-armed bandit problems (possibly indexed by a continuous set).

We are involved in the following applications:

improving prediction through the on-line aggregation of predictors applied to air quality control, electricity consumption, stock management in the retail supply chain;

natural image analysis, and more precisely the use of unsupervised learning in data representation;

computational linguistics;

statistical inference on biological data.

The most obvious contribution of statistics to machine learning is to consider the supervised learning scenario as a special case of regression estimation: given

One of the specialties of the team in this direction is to use PAC-Bayes inequalities to combine thresholded exponential moment inequalities. The name of this theory comes from its founder, David McAllester, and may be misleading. Indeed, its cornerstone is rather made of non-asymptotic entropy inequalities, and a perturbative approach to parameter estimation. The team has made major contributions to the theory, first focused on classification, then on regression. It has introduced the idea of combining the PAC-Bayesian approach with the use of thresholded exponential moments, in order to derive bounds under very weak assumptions on the noise.

Another line of research in regression estimation is the use of sparse models, and its link with

For instance, the Lasso uses a
*sparse* solutions (the estimate has only a few nonzero coordinates), which usually have a clear interpretation in many settings (e.g., the influence or lack of influence of some
variables). In addition, unlike
*computationally feasible* for high-dimensional data.
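The sparsity property mentioned above can be illustrated with a minimal sketch. It assumes the Lasso objective is normalized as (1/(2n))·||y − Xb||² + lam·||b||₁ and uses cyclic coordinate descent, one standard solver among several; the function names, the normalization and the toy data are illustrative assumptions, not the team's implementation.

```python
def soft_threshold(z, t):
    """Shrink z towards zero by t; values within [-t, t] become exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimize (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 (assumed normalization)
    by cyclic coordinate descent. X is a list of rows, y a list of outcomes."""
    n, d = len(X), len(X[0])
    b = [0.0] * d
    residual = list(y)  # equals y - X b, with b = 0 initially
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(d)]
    for _ in range(n_iter):
        for j in range(d):
            if col_sq[j] == 0.0:
                continue  # an all-zero column carries no information
            # Correlation of column j with the residual computed without coordinate j
            rho = sum(X[i][j] * (residual[i] + X[i][j] * b[j]) for i in range(n)) / n
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_bj - b[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                b[j] = new_bj
    return b


# Toy data (hypothetical): the second variable has only a weak influence on y.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
y = [2.0, 0.1, 2.0, 0.1]
coefficients = lasso_coordinate_descent(X, y, lam=0.1)
print(coefficients)  # the second coordinate is set exactly to zero
```

The soft-thresholding step is what produces exact zeros: any coordinate whose correlation with the residual stays below `lam` in absolute value is removed from the estimate.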

The next brick of our scientific foundations explains why and how, in certain cases, we may formulate absolutely no assumption on the data

We are concerned here with
*sequential prediction* of outcomes, given some base predictions formed by
*experts*. We distinguish two settings, depending on how the sequence of outcomes is generated: it is either

the realization of some stationary process,

or is not modeled at all as the realization of any underlying stochastic process (these sequences are called
*individual sequences*).

The aim is to predict almost as well as the best expert. Typical good forecasters maintain one weight per expert, update these weights depending on past performance, and output at each step the corresponding weighted linear combination of the experts' advice.

The difference between the cumulative prediction error of the forecaster and the one of the best expert is called the regret. The game consists here of upper bounding the regret by a quantity as small as possible.
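The weight-update scheme and the regret described above can be sketched as follows. This is a minimal illustration using the exponentially weighted average forecaster, one classical strategy of this kind; the choice of the square loss, the learning rate `eta`, the function name and the toy data are assumptions made for the example and do not reproduce any specific algorithm of the team.

```python
import math


def exponentially_weighted_forecaster(expert_predictions, outcomes, eta):
    """Aggregate expert predictions sequentially under the square loss.

    expert_predictions[t][k] is the prediction of expert k at round t.
    Returns the list of forecasts and the regret with respect to the best expert.
    """
    n_experts = len(expert_predictions[0])
    cumulative_losses = [0.0] * n_experts
    forecaster_loss = 0.0
    forecasts = []
    for predictions, outcome in zip(expert_predictions, outcomes):
        # Weights decrease exponentially with past cumulative loss;
        # subtracting the minimum only rescales them and avoids underflow.
        best_so_far = min(cumulative_losses)
        weights = [math.exp(-eta * (loss - best_so_far)) for loss in cumulative_losses]
        total = sum(weights)
        forecast = sum(w * p for w, p in zip(weights, predictions)) / total
        forecasts.append(forecast)
        # The outcome is revealed only after the forecast has been made.
        forecaster_loss += (forecast - outcome) ** 2
        for k, p in enumerate(predictions):
            cumulative_losses[k] += (p - outcome) ** 2
    regret = forecaster_loss - min(cumulative_losses)
    return forecasts, regret


# Toy run (hypothetical data): two constant experts, 20 binary outcomes.
predictions = [[0.0, 1.0]] * 20
outcomes = [1.0, 1.0, 0.0, 1.0] * 5
eta = math.sqrt(8 * math.log(2) / 20)
forecasts, regret = exponentially_weighted_forecaster(predictions, outcomes, eta)
print(forecasts[0], regret)
```

For convex losses with values in [0, 1], this forecaster with N experts over T rounds is known to have regret at most ln(N)/eta + eta·T/8, whatever the sequence of outcomes; the tuning of `eta` used in the toy run minimizes that bound.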

We are interested here in settings in which the feedback obtained on the predictions is limited, in the sense that it does not fully reveal what actually happened.

This is also a sequential problem in which some regret is to be minimized.

However, this problem is a stochastic problem: a large number of arms, possibly indexed by a continuous set like

Approachability is the ability to control random walks. At each round, a vector payoff is obtained by the first player, depending on his action and on the action of the opponent player. The aim is to ensure that the average of the vector payoffs converges to some convex set. Necessary and sufficient conditions were obtained by Blackwell and others to ensure that such strategies exist, both in the full information and in the bandit cases.

Some of these results can be extended to the case of games with signals (games with partial monitoring), where at each round the only feedback obtained by the first player is a random signal drawn according to a distribution that depends on the action profile taken by the two players, while the opponent player still has full monitoring.

Our partner is EDF R&D. The goal is to aggregate in a sequential fashion the forecasts made by some (about 20) base experts in order to predict the electricity consumption at a global level (that of all French customers) at a half-hourly step. We need to abide by some operational constraints: the predictions need to be made at noon for the next 24 hours (i.e., for the next 48 time rounds).

Our partner is the INRIA project-team CLIME (Paris-Rocquencourt). The goal is to aggregate in a sequential fashion the forecasts made by some (about 100) base experts in order to output field predictions of the concentration of some pollutants (typically ozone) over Europe. The results were and will be transferred to the public operator INERIS, which uses and will use them in an operational way.

Our partner is the start-up Lokad.com. The purpose of this application is to investigate nonparametric expert-oriented strategies for time series prediction from a practical perspective.

The aim is to propose and study new language models that bridge the gap between models oriented towards statistical analysis of large corpora and grammars oriented towards the description of syntactic features as understood by academic experts.

We have here two specific applications in mind.

One is about understanding how the transcription of the human genome is performed: transcription regulatory elements need to be identified. A natural model is provided by multivariate Hawkes processes, but their implementation requires excessive computational time. Lasso-type methods may overcome this numerical issue.

The second is about estimating the division rate of a size-structured population in a nonparametric setting. The size of the system evolves according to a transport-fragmentation equation: each individual grows with a given transport rate and splits into two offspring of the same size, following a binary fragmentation process with an unknown division rate that depends on its size.

We do not discuss here several contributions that were achieved in 2009 or earlier (but only published this year due to long queues in publication tracks of journals). was revised but is still under review.

Sébastien Gerchinovitz and Jia Yuan Yu continued the work initiated by the former in the above-mentioned conference paper; they derived from the sparsity results in individual sequences presented therein the minimax optimal rates of aggregation for individual sequences on

Other results were obtained in a stochastic framework, where input–output pairs are given by i.i.d. variables; they are described in the technical report. Let

Last but not least, we mention the edited book, which provides a modern overview of high-dimensional estimation.

Some of the results cited below are summarized or stated as open problems in the habilitation thesis.

We achieved three contributions. The first is described in the conference paper: it revisits asymptotically optimal results of Lai and Robbins and of Burnetas and Katehakis in a non-asymptotic way. The second is stated in the journal article and is concerned with obtaining fast convergence rates for the regret in the case of a continuum of arms (of course under some regularity and topological assumptions on the mean-payoff function

This line of research is in collaboration with the Geometrica project-team (INRIA Saclay). As the latter says:

Due to the fast evolution of data acquisition devices and computational power, scientists in many areas are demanding efficient algorithmic tools for analyzing, manipulating and visualizing more and more complex shapes or complex systems from approximating data. Many of the existing algorithmic solutions which come with little theoretical guarantees provide unsatisfactory and/or unpredictable results. Since these algorithms take as input discrete geometric data, it is mandatory to develop concepts that are rich enough to robustly and correctly approximate continuous shapes and their geometric properties by discrete models. Ensuring the correctness of geometric estimations and approximations on discrete data is a sensitive problem in many applications.

Thus, motivated by a broad range of potential applications in topological and geometric inference, we introduce in
a weighted version of the

We still keep an eye on more traditional mathematical statistics; in particular, the technical report
takes place within this field. It shows, for a large class of
distributions and large samples, that estimates of the variance

Gérard Biau has been supervising the PhD thesis of Benoît Patra, which takes place within an industrial contract (“thèse CIFRE”) with Lokad.com (
http://

We (co-)organized the following seminars:

Statistical machine learning in Paris – SMILE (Gérard Biau, Gilles Stoltz; see
http://

Parisian seminar of statistics at IHP (Vincent Rivoirard; see
https://

Grants:

ANR project in the conception and simulation track: EXPLO/RA (involves Emilien Joly, Sébastien Gerchinovitz, Gilles Stoltz, Jia Yuan Yu; see
http://

ANR project in the blank program: Parcimonie (involves Sébastien Gerchinovitz, Vincent Rivoirard, Gilles Stoltz; see
http://

Two other ANR blank projects each involve only one member of the team: Banhdits (Vincent Rivoirard), CLARA (Gérard Biau).

Thanks to the PASCAL European network of Excellence (
http://

We have some international collaborations, mostly on a one-to-one basis, with

Karine Bertin, University of Valparaiso, Chile;

Luc Devroye, McGill University, Canada;

Shie Mannor, Technion, Israel.

Gérard Biau serves as an Associate Editor for the journals
*Annales de l'ISUP*,
*ESAIM: Probability and Statistics* and
*International Statistical Review*.

Olivier Catoni is a member of the editorial committee of the joint series of monographs “Mathématiques et Applications” between Springer and SMAI.

All permanent members of the team reviewed several journal papers during the year.

We wrote reports on PhD (1 by Gilles Stoltz) and habilitation (1 by Olivier Catoni) theses.

We were examiners for other PhD (4 by Gérard Biau, 1 by Olivier Catoni) and habilitation (1 by Gérard Biau) defenses.

Vincent Rivoirard was elected to the Conseil de la SFdS.

Gérard Biau was elected member of the national council of French universities (CNU) within the applied mathematics section (number 26).

Olivier Catoni is a member of the doctoral commission in mathematics of Universities Pierre et Marie Curie and Paris Diderot.

All permanent members of the team participated in several recruitment committees for assistant or full professors in universities.

Gilles Stoltz was a member of the program committee of the 24th Conference on Learning Theory (COLT'11); Vincent Rivoirard was a member of the program committee of the Journées de la SFdS 2011.

In December 2011, Gérard Biau organized with Pierre Alquier an invited session entitled “High-dimensional statistics, sparsity and applications” at the 4th International Conference of the ERCIM Working Group on Computing & Statistics.

Gilles Stoltz participated in a meeting [Rencontres S'Cube] between a group of 4 professional mathematicians and a general audience; the theme was “Perdre ou gagner, peut-on prévoir ?” and the meeting took place in Gif-sur-Yvette, in May 2011.

The permanent members of the team (Gérard Biau, Olivier Catoni, Vincent Rivoirard, and Gilles Stoltz) taught the following classes.

Licence : Statistics, 39h, level L2, Université Paris-Dauphine, by Vincent Rivoirard

Licence : Learning, 20h, level L3, Ecole normale supérieure, by Olivier Catoni and Gilles Stoltz

Licence : Probability theory, 40h, level L3, ISUP – Université Pierre et Marie Curie, by Gérard Biau

Licence : Statistics for today's citizens and tomorrow's managers, 40h, level L3, HEC Paris, by Gilles Stoltz

Master : Working group in statistics, 12h, level M1, Ecole normale supérieure, by Gérard Biau, Olivier Catoni and Gilles Stoltz

Master : Mathematical statistics, 30h, level M1, Ecole normale supérieure, by Gérard Biau

Master : Nonparametric statistics, 8h, level M1, Ecole normale supérieure, by Vincent Rivoirard

Master : Nonparametric statistics, 35h, level M1, Université Paris-Dauphine, by Vincent Rivoirard

Master : Classification and high-dimensional statistics, 18h, level M2, Université Paris-Sud, by Vincent Rivoirard

Master : Statistics and information theory, 10h, level M2, Université Paris-Sud, by Gilles Stoltz

Master : Statistical learning, 36h, level M2, Université Pierre et Marie Curie, by Gérard Biau

Master : Methods for regression models, 21h, level M2, Université Paris-Dauphine, by Vincent Rivoirard

Master : Nonparametric Bayesian statistics, 21h, level M2, Université Paris-Dauphine, by Vincent Rivoirard

Master : Examiner for the oral examination in probability and statistics of the agrégation de mathématiques, by Gilles Stoltz

PhD & HdR

HdR : Gilles Stoltz,
*Contributions à la prévision séquentielle de suites arbitraires : applications à la théorie des jeux répétés et études empiriques des performances de l'agrégation d'experts*,
Université Paris-Sud; defended on February 3, 2011

PhD : Sébastien Gerchinovitz,
*Prédiction de suites individuelles et cadre statistique classique : étude de quelques liens autour de la régression parcimonieuse et des techniques d'agrégation*, Université
Paris-Sud; defended on December 12, 2011; supervised by Gilles Stoltz

PhD in progress : Thomas Mainguy,
*Modèles statistiques pour la linguistique computationnelle*, since September 2009, supervised by Olivier Catoni

PhD in progress : Pierre Gaillard, since September 2011, supervised by Gilles Stoltz

PhD in progress : Emilien Joly, since September 2011, supervised by Gábor Lugosi and co-supervised by Gilles Stoltz

Several other PhDs in progress : Gérard Biau and Vincent Rivoirard [co-]supervise[d] several other PhD students who are not members of our project-team (Benjamin Auder, Aurélie Fischer, Benoît Patra, Clément Levrard, Benjamin Guedj, Svetlana Gribkova and Baptiste Gregorutti for Gérard Biau, and Laure Sansonnet for Vincent Rivoirard)

MSc thesis: Pierre Gaillard (Master MVA, ENS Cachan) was supervised by Gilles Stoltz for his MSc thesis, whose subject was the use of aggregation techniques (based on random forests and/or stemming from the theory of the prediction of individual sequences) for the forecasting of electricity consumption.