Overall Objectives

During its last year, the team was reduced in size. Olivier Catoni and his two PhD students focused on two topics: new statistical models for corpus linguistics, and dimension-free bounds for the estimation of the Gram matrix of an i.i.d. sample (possibly in an infinite-dimensional Hilbert space), with application to Principal Component Analysis.

We recall below the themes that were studied more broadly during the lifespan of the project.

We are a research team on machine learning, with an emphasis on statistical methods. Processing huge amounts of complex data has created a need for statistical methods that remain valid under very weak hypotheses, in very high-dimensional spaces. Our aim is to contribute to a robust, adaptive, computationally efficient and, ideally, non-asymptotic theory of statistics that can benefit learning.

Our theoretical studies bear on the following mathematical tools:

  • regression models used for supervised learning, from different perspectives: the PAC-Bayesian approach to generalization bounds; robust estimators; model selection and model aggregation;

  • sparse models of prediction and ℓ1-regularization;

  • interactions between unsupervised learning, information theory and adaptive data representation;

  • individual sequence theory;

  • multi-armed bandit problems (possibly indexed by a continuous set);

  • statistical modeling applied to linguistics, statistical inference of grammar models.

We are involved in the following applications:

  • the improvement of prediction through the on-line aggregation of predictors, with an emphasis on the forecasting of air quality, electricity consumption, production data of oil reservoirs, exchange rates;

  • natural image analysis, and more precisely the use of unsupervised learning in data representation;

  • computational linguistics;

  • statistical inference on biological and neurobiological data.