## Section: New Results

### Model selection in Regression and Classification

Participants : Gilles Celeux, Serge Cohen, Clément Levrard, Erwan Le Pennec, Pascal Massart, Nelo Molter Magalhaes, Lucie Montuelle.

Unsupervised segmentation is closely related to unsupervised classification, with an added spatial aspect. Functional data are acquired at points of a spatial domain, and the goal is to segment this domain into homogeneous regions. Applications include hyperspectral images in conservation sciences, fMRI data and, more generally, all spatialized functional data. Erwan Le Pennec and Lucie Montuelle focus on how to handle the spatial component, from both the theoretical and the practical points of view. In particular, they study the choice of the number of clusters. Furthermore, as functional data require heavy computation, numerically efficient algorithms are needed. With Serge Cohen and an X intern, some progress has been made on the use of logistic weights in the hyperspectral setting.

Lucie Montuelle has studied a mixture of Gaussian regressions in which the mixing proportions are modeled using logistic weights. A model selection procedure based on maximum likelihood estimators has been proposed, supported by a theoretical guarantee. Numerical experiments have been conducted for regression mixtures with parametric logistic weights, using EM and Newton algorithms. This work is published in the Electronic Journal of Statistics.
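
To make the model concrete, the following minimal sketch evaluates the conditional density of a mixture of Gaussian regressions with logistic weights. It assumes a scalar covariate, affine scores and affine regression means; the function names and the parametrization (`a`, `b`, `intercepts`, `slopes`, `sigmas`) are illustrative assumptions and are not taken from the published work.

```python
import numpy as np

def logistic_weights(x, a, b):
    """Mixing proportions pi_k(x) as a softmax of affine scores a_k + b_k * x.

    The affine form of the scores is an assumption made for this sketch.
    """
    scores = a + b * x
    scores = scores - scores.max()      # stabilise the exponentials
    w = np.exp(scores)
    return w / w.sum()

def conditional_density(y, x, a, b, intercepts, slopes, sigmas):
    """Density of y given x: sum_k pi_k(x) * N(y; intercept_k + slope_k * x, sigma_k^2)."""
    pi = logistic_weights(x, a, b)
    means = intercepts + slopes * x
    gauss = np.exp(-0.5 * ((y - means) / sigmas) ** 2) / (sigmas * np.sqrt(2.0 * np.pi))
    return float(np.sum(pi * gauss))
```

With equal scores at a given `x`, the weights are uniform and the density reduces to an equally weighted Gaussian mixture; an EM or Newton-type algorithm would maximize the likelihood built from this density.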

Another subject considered by Erwan Le Pennec and Lucie Montuelle was the derivation of oracle inequalities in deviation for model selection aggregation in the fixed design regression framework. Exponential weights are widely used but are sub-optimal in this setting. To derive such inequalities, they aggregate linear estimators and penalize the Stein's unbiased risk estimate used in the exponential weights. Furthermore, if the infinity norm of the regression function is known and taken into account in the penalty, then a sharp oracle inequality is available. PAC-Bayesian tools and concentration inequalities play a key role in this work. These results may be found in a preprint on arXiv or in Lucie Montuelle's PhD thesis.
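
As an illustration of the aggregation scheme, here is a minimal sketch of exponential weights built on Stein's unbiased risk estimate for linear estimators $AY$. It assumes Gaussian noise with known variance `sigma2`, a uniform prior over the estimators and a user-chosen temperature `beta`; the `penalties` argument is a placeholder for an additional penalty and does not reproduce the specific penalty studied in the work above. All names are hypothetical.

```python
import numpy as np

def sure(y, A, sigma2):
    """Stein's unbiased risk estimate of the linear estimator A @ y.

    Assumes Gaussian noise with known variance sigma2.
    """
    n = y.size
    residual = y - A @ y
    return float(residual @ residual + 2.0 * sigma2 * np.trace(A) - n * sigma2)

def exponential_weights(y, smoothers, sigma2, beta, penalties=None):
    """Aggregate linear estimators with weights proportional to exp(-(SURE + pen) / beta)."""
    risks = np.array([sure(y, A, sigma2) for A in smoothers])
    if penalties is not None:
        risks = risks + np.asarray(penalties, dtype=float)
    logits = -(risks - risks.min()) / beta   # shift for numerical stability
    w = np.exp(logits)
    w /= w.sum()
    estimate = sum(wk * (A @ y) for wk, A in zip(w, smoothers))
    return w, estimate
```

The aggregate is a convex combination of the individual estimators, and estimators with a smaller (penalized) risk estimate receive a larger weight.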

In collaboration with Sylvain Arlot, Matthieu Lerasle and Patricia Reynaud-Bouret (CNRS), Nelo Molter Magalhaes considers the estimator selection problem with the ${L}^{2}$ loss. They provide a theoretical minimal penalty and a theoretical optimal penalty. They also define practical cross-validation procedures and provide non-asymptotic, first-order optimal results for these procedures.

Emilie Devijver and Pascal Massart focused on the Lasso for high-dimensional finite mixture regression models. An ${\ell}_{1}$ oracle inequality has been obtained for this estimator in this model, for a specific regularization parameter. Moreover, for maximum likelihood estimators restricted to relevant variables and to low rank, theoretical results have been proved to support the methodology.

Pascal Massart and Clément Levrard continue their work on the properties of the $k$-means algorithm in collaboration with Gérard Biau (Université Paris 6). Most of the work achieved this year was devoted to deriving fast convergence rates for the $k$-means quantizer of a source distribution in the high-dimensional case. It has been proved that the margin condition for vector quantization introduced last year can be extended to the infinite-dimensional Hilbert case, and that this condition is sufficient for the source distribution to satisfy some natural properties, such as the finiteness of the set of optimal quantizers. When this condition is satisfied, a dimension-free fast convergence rate can be derived. In addition, this margin condition provides theoretical guarantees for methods combining $k$-means and variable selection through a Lasso-type procedure. The implementation of this procedure is still in progress; however, early experiments show that it can retrieve the active variables in the Gaussian mixture case.
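
For readers unfamiliar with the objects involved, the sketch below computes the empirical distortion of a codebook and performs one iteration of the standard Lloyd update used by the $k$-means algorithm. It is a generic illustration in finite dimension, not the Lasso-type procedure mentioned above; the function names are hypothetical.

```python
import numpy as np

def squared_distances(points, codebook):
    """Matrix of squared Euclidean distances between points and code points."""
    return ((points[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)

def distortion(points, codebook):
    """Empirical distortion: mean squared distance to the nearest code point."""
    return float(squared_distances(points, codebook).min(axis=1).mean())

def lloyd_step(points, codebook):
    """One Lloyd iteration: assign points to the nearest code point, then recenter."""
    codebook = np.asarray(codebook, dtype=float)
    labels = squared_distances(points, codebook).argmin(axis=1)
    new_codebook = codebook.copy()
    for k in range(codebook.shape[0]):
        members = points[labels == k]
        if len(members) > 0:              # keep the old code point for empty cells
            new_codebook[k] = members.mean(axis=0)
    return new_codebook
```

Each Lloyd iteration cannot increase the empirical distortion, which is the quantity whose excess over the optimal distortion the convergence rates control.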

Among selection methods for nonparametric estimators, a recent one is the Goldenshluger-Lepski procedure. This method proposes a data-driven choice of $m$ to select an estimator among a collection $(\widehat{s}_m)_{m\in M}$. The selected $\widehat{m}$ is chosen as a minimiser of $B(m)+V(m)$, where $B(m)=\sup\left\{\left[\|\widehat{s}_m-\widehat{s}_{m'}\|-V(m')\right]_+,\ m'\in M\right\}$ and $V(m)$ is a penalty term to be suitably chosen. Previous results have established oracle inequalities ensuring that, if $V(m)$ is large enough, the final estimator $\widehat{s}_{\widehat{m}}$ is almost as efficient as the best one in the collection. The aim of the work of Claire Lacour and Pascal Massart was to give a practical way to calibrate $V(m)$. To do this, they have exhibited an explosion phenomenon: if $V$ is chosen smaller than some critical value $V_0$, the risk $\|s-\widehat{s}_{\widehat{m}}\|$ is proven to increase dramatically, whereas for $V>V_0$ this risk is quasi-optimal. Simulations have corroborated this behavior.
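
The selection rule defined above can be sketched as follows, assuming each estimator is represented by its values on a common grid and that the norm is the Euclidean norm on that grid. The function name and the array layout are assumptions made for illustration.

```python
import numpy as np

def goldenshluger_lepski(estimates, penalties):
    """Return the index minimising B(m) + V(m) and the criterion values.

    estimates: array of shape (n_models, n_points), one row per estimator.
    penalties: array of shape (n_models,), the values V(m).
    """
    estimates = np.asarray(estimates, dtype=float)
    penalties = np.asarray(penalties, dtype=float)
    # Pairwise distances ||s_m - s_m'|| between all estimators
    diffs = estimates[:, None, :] - estimates[None, :, :]
    dist = np.linalg.norm(diffs, axis=2)
    # B(m) = max over m' of the positive part of dist(m, m') - V(m')
    B = np.maximum(dist - penalties[None, :], 0.0).max(axis=1)
    criterion = B + penalties
    return int(np.argmin(criterion)), criterion
```

Multiplying `penalties` by a constant and tracking the selected index as the constant varies is one simple way to look for the explosion phenomenon in simulations.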

The well-documented and consistent variable selection procedure in model-based cluster analysis and classification that Cathy Maugis (INSA Toulouse) designed during her PhD thesis in Select makes use of stepwise algorithms, which are painfully slow in high dimensions. In order to circumvent this drawback, Gilles Celeux, in collaboration with Mohammed Sedki (Université Paris XI) and Cathy Maugis, proposed to sort the variables using a lasso-like penalization adapted to the Gaussian mixture model context. Using this ranking to select the variables, they avoid the combinatorial problem of stepwise procedures. After tests on challenging simulated and real data sets, their algorithm has been finalised and shows good performance.

In collaboration with Jean-Michel Marin (Université de Montpellier) and Olivier Gascuel (LIRMM), Gilles Celeux has continued research aiming to select a short list of models rather than a single model. A model is declared compatible with the data, and hence kept in the short list, using a $p$-value derived from the Kullback-Leibler distance between the model and the empirical distribution. The Kullback-Leibler distances at hand are estimated through nonparametric and parametric bootstrap procedures.
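
The general idea of a bootstrap $p$-value based on a Kullback-Leibler distance can be sketched as follows. This is a simplified illustration, not the procedure developed in this work: it assumes a fully specified discrete model with known cell probabilities `probs` and uses a parametric bootstrap only; the function names, `n_boot` and `seed` are hypothetical.

```python
import numpy as np

def kl_to_model(counts, probs):
    """Kullback-Leibler divergence from the empirical distribution to the model."""
    counts = np.asarray(counts, dtype=float)
    probs = np.asarray(probs, dtype=float)
    freq = counts / counts.sum()
    mask = freq > 0                       # cells with zero frequency contribute 0
    return float(np.sum(freq[mask] * np.log(freq[mask] / probs[mask])))

def bootstrap_p_value(counts, probs, n_boot=200, seed=0):
    """Parametric bootstrap p-value for the observed divergence."""
    rng = np.random.default_rng(seed)
    n = int(np.sum(counts))
    observed = kl_to_model(counts, probs)
    # Simulate data sets from the model and recompute the divergence
    simulated = rng.multinomial(n, probs, size=n_boot)
    boot = np.array([kl_to_model(c, probs) for c in simulated])
    return float((1 + np.sum(boot >= observed)) / (n_boot + 1))
```

Under these assumptions, a short list would keep every candidate model whose $p$-value exceeds a chosen level; with estimated parameters, the model would need to be refitted on each bootstrap sample.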