In its final year, the team was reduced in size. Olivier Catoni and his two PhD students focussed on the study of new statistical models for corpus linguistics and on dimension-free bounds for the estimation of the Gram matrix of an i.i.d. sample (possibly in an infinite-dimensional Hilbert space) and its application to Principal Component Analysis.

We recall hereafter the themes that were more broadly studied during the lifespan of the project.

We are a research team on machine learning, with an emphasis on statistical methods. Processing huge amounts of complex data has created a need for statistical methods that remain valid under very weak hypotheses, in very high-dimensional spaces. Our aim is to contribute to a robust, adaptive, computationally efficient and, where possible, non-asymptotic theory of statistics that is useful for learning.

Our theoretical studies bear on the following mathematical tools:

regression models used for supervised learning, from different perspectives: the PAC-Bayesian approach to generalization bounds; robust estimators; model selection and model aggregation;

sparse models of prediction and

interactions between unsupervised learning, information theory and adaptive data representation;

individual sequence theory;

multi-armed bandit problems (possibly indexed by a continuous set);

statistical modeling applied to linguistics, statistical inference of grammar models.

We are involved in the following applications:

the improvement of prediction through the on-line aggregation of predictors, with an emphasis on the forecasting of air quality, electricity consumption, production data of oil reservoirs, exchange rates;

natural image analysis, and more precisely the use of unsupervised learning in data representation;

computational linguistics;

statistical inference on biological and neurobiological data.
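
The on-line aggregation of predictors mentioned in the first application above can be illustrated by a minimal sketch of the exponentially weighted average forecaster, a standard aggregation rule. This is an illustration under stated assumptions, not necessarily the method used by the team: the function name, the square loss and the learning rate `eta` are hypothetical choices made for the example.

```python
import math


def exponentially_weighted_forecasts(expert_predictions, outcomes, eta=1.0):
    """Aggregate expert predictions on-line with exponential weights.

    expert_predictions[t][k] is the prediction of expert k at round t.
    outcomes[t] is revealed only after the aggregated forecast of round t.
    The square loss and the value of eta are assumptions of this sketch.
    """
    n_experts = len(expert_predictions[0])
    cumulative_losses = [0.0] * n_experts
    forecasts = []
    for predictions, outcome in zip(expert_predictions, outcomes):
        # Weights use past losses only; shifting by the minimum avoids underflow.
        best = min(cumulative_losses)
        weights = [math.exp(-eta * (loss - best)) for loss in cumulative_losses]
        total = sum(weights)
        forecast = sum(w * p for w, p in zip(weights, predictions)) / total
        forecasts.append(forecast)
        # The outcome is now known: update each expert's cumulative square loss.
        for k, p in enumerate(predictions):
            cumulative_losses[k] += (p - outcome) ** 2
    return forecasts


# Two constant experts (always 0 and always 1) while the outcome is always 1:
# the aggregated forecast starts at the uniform average and moves towards 1.
demo_forecasts = exponentially_weighted_forecasts(
    [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]], [1.0, 1.0, 1.0], eta=1.0
)
```

The weights at each round depend only on losses observed in earlier rounds, which is what makes the procedure on-line: no outcome is used before it has been revealed.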

The most obvious contribution of statistics to machine learning is to consider the supervised learning scenario as a special case of regression estimation: given

One of the specialties of the team in this direction is to use PAC-Bayes inequalities to combine thresholded exponential moment inequalities. The name of this theory comes from its founder, David McAllester, and may be misleading. Indeed, its cornerstone is rather a set of non-asymptotic entropy inequalities, together with a perturbative approach to parameter estimation. The team has made major contributions to the theory, first focussed on classification, then on regression and on principal component analysis of a random sample of points in high dimension. It has introduced the idea of combining the PAC-Bayesian approach with the use of thresholded exponential moments, in order to derive bounds under very weak assumptions on the noise.
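
As a reminder of what the entropy inequalities mentioned above look like, the usual starting point of PAC-Bayesian arguments is the convex duality between the log-Laplace transform and the Kullback-Leibler divergence. The notation below (parameter space $\Theta$, prior $\pi$, posterior $\rho$, function $h$, confidence level $\epsilon$) is chosen for this illustration and is not taken from the team's publications.

```latex
% Duality formula: for a probability measure \pi on \Theta and a bounded
% measurable function h, the supremum being taken over all probability
% measures \rho on \Theta, with \mathcal{K} the Kullback-Leibler divergence.
\[
  \log \int_{\Theta} \exp\bigl( h(\theta) \bigr) \, \mathrm{d}\pi(\theta)
  = \sup_{\rho} \Bigl\{ \int_{\Theta} h(\theta) \, \mathrm{d}\rho(\theta)
      - \mathcal{K}(\rho, \pi) \Bigr\}.
\]
% Consequence: if h depends on the sample and satisfies
% \mathbb{E} \int_{\Theta} \exp( h(\theta) ) \, \mathrm{d}\pi(\theta) \leq 1,
% then by Markov's inequality, with probability at least 1 - \epsilon,
% simultaneously for all posteriors \rho,
\[
  \int_{\Theta} h(\theta) \, \mathrm{d}\rho(\theta)
  \leq \mathcal{K}(\rho, \pi) + \log\bigl( \epsilon^{-1} \bigr).
\]
```

The bound holds uniformly over posteriors $\rho$, which may therefore depend on the sample. This uniformity is what makes the perturbative approach to parameter estimation possible.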

Thomas Mainguy and Olivier Catoni studied a new statistical model for natural language modeling, called Markov substitute processes. This model is based on a set of conditional independence properties that are more general than the Markov field assumption. It has connections with context free grammars and forms a collection of exponential families having for this reason nice estimation properties.

Ilaria Giulini and Olivier Catoni continued their study of dimension-free bounds for the estimation of the Gram matrix and, more generally, of the expectation of a random symmetric matrix from an i.i.d. sample. This study, based on PAC-Bayes bounds, leads both to new robust estimators, with applications to Principal Component Analysis in high or even infinite dimension, and to new bounds for the usual empirical Gram matrix estimate. Dimension-free bounds are important for obtaining new results on Kernel PCA. Applications to density estimation and to spectral clustering were also studied.
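
For orientation, the usual empirical Gram matrix estimate and its link with Principal Component Analysis can be sketched as follows. This is a minimal illustration of the classical estimate only, not of the robust estimators developed in the study. The function names are hypothetical, and power iteration is used here merely as a simple way to approximate the leading principal direction.

```python
import math
import random


def empirical_gram(sample):
    """Return the empirical Gram matrix (1/n) * sum_i x_i x_i^T as a list of rows."""
    n = len(sample)
    d = len(sample[0])
    gram = [[0.0] * d for _ in range(d)]
    for x in sample:
        for j in range(d):
            for k in range(d):
                gram[j][k] += x[j] * x[k] / n
    return gram


def leading_eigenpair(matrix, iterations=500, seed=0):
    """Approximate the largest eigenvalue and a unit eigenvector by power iteration.

    Assumes a symmetric positive semi-definite matrix, as is the case for
    a Gram matrix.
    """
    rng = random.Random(seed)
    d = len(matrix)
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(c * c for c in v))
    v = [c / norm for c in v]
    for _ in range(iterations):
        w = [sum(matrix[j][k] * v[k] for k in range(d)) for j in range(d)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            break  # the matrix maps v to zero; nothing more to iterate on
        v = [c / norm for c in w]
    # Rayleigh quotient of the unit vector v.
    eigenvalue = sum(
        v[j] * sum(matrix[j][k] * v[k] for k in range(d)) for j in range(d)
    )
    return eigenvalue, v


# Toy sample whose largest spread lies along the first coordinate axis.
toy_sample = [[2.0, 0.0], [-2.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
toy_gram = empirical_gram(toy_sample)
top_value, top_direction = leading_eigenpair(toy_gram)
```

On this toy sample the Gram matrix is diagonal, so the leading principal direction is the first coordinate axis, up to sign.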

ANR project in the blank program: Calibration (2012–2015; involves Vincent Rivoirard, who is the coordinator; see https://

**E-learning**

Video conferencing at IFCAM (Indo-French Centre for Applied Mathematics), Summer School on Applied Mathematics, Indian Institute of Science, Bangalore (July 2014). Olivier Catoni gave a three-hour presentation on PAC-Bayes bounds applied to statistical learning. The lecture is still available on the author's web page.

PhD: Thomas Mainguy, Markov Substitute Processes, a statistical model for linguistics, Université Pierre et Marie Curie, supervised by Olivier Catoni (defended on December 11, 2014).

PhD in progress: Ilaria Giulini, data analysis in high dimension, started in September 2012, supervised by Olivier Catoni.