Section: New Results

Model selection in Regression and Classification

Participants : Gilles Celeux, Mohammed El Anbari, Clément Levrard, Robin Genuer, Erwan Le Pennec, Lucie Montuelle, Pascal Massart, Caroline Meynet, Jean-Michel Poggi.

Erwan Le Pennec continues his work with Serge Cohen (IPANEMA Soleil) on hyperspectral image segmentation based on a spatialized Gaussian mixture model. They have derived, and implemented within MIXMOD, an efficient minimization algorithm combining the EM algorithm, dynamic programming and model selection [37]. They have applied this technique to the analysis of ancient materials [9]. This scheme is supported by theoretical work on conditional density estimation [40]. In the framework of her PhD, Lucie Montuelle has studied an extension of this model to spatially varying logistic weights.

In collaboration with Marie-Laure Martin-Magniette (URGV and UMR AgroParisTech/INRA MIA 518) and Cathy Maugis (INSA Toulouse), the variable selection procedure for model-based clustering and supervised classification has been extended to high-dimensional data sets through a backward selection procedure, which is more efficient than the previous forward selection procedure in this context [17]. Moreover, the advantage of the model-based approach over a geometrical approach for selecting variables in clustering has been shown [13]. These variable selection procedures are used in particular for genomics applications, as the result of a collaboration with researchers at URGV (Evry Genopole).

Caroline Meynet provided an ℓ1-oracle inequality satisfied by the Lasso estimator with the Kullback-Leibler loss in the framework of a finite mixture of Gaussian regressions model for high-dimensional heterogeneous data, where the number of covariates may be much larger than the sample size. In particular, she has given a condition on the regularization parameter of the Lasso under which such an oracle inequality holds. This oracle inequality extends the ℓ1-oracle inequality established by Massart and Meynet [16] in the homogeneous Gaussian linear regression case. It is deduced from a model selection theorem for ℓ1-penalized maximum likelihood conditional density estimation in finite mixtures of Gaussian regressions, which is inspired by Vapnik's method of structural risk minimization and by the theory of model selection for maximum likelihood estimators developed by Massart.
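For orientation, a generic form of the ℓ1-penalized maximum likelihood (Lasso) estimator in a finite mixture of Gaussian regressions can be written as follows. The notation is an assumption made here for illustration (K components, mixing weights π_k, regression vectors β_k in R^p, variances σ_k², candidate set S, regularization parameter λ); the exact parametrization and penalty used in the cited work may differ.

```latex
% Illustrative notation only: K, \pi_k, \beta_k, \sigma_k, S and \lambda are assumed here.
\[
  s_\psi(y \mid x) = \sum_{k=1}^{K} \pi_k \,
    \frac{1}{\sqrt{2\pi}\,\sigma_k}
    \exp\!\left( -\frac{(y - \beta_k^\top x)^2}{2\sigma_k^2} \right),
\]
\[
  \hat{s}(\lambda) \in \arg\min_{s_\psi \in S}
    \left\{ -\frac{1}{n} \sum_{i=1}^{n} \ln s_\psi(y_i \mid x_i)
            + \lambda \sum_{k=1}^{K} \sum_{j=1}^{p} |\beta_{kj}| \right\}.
\]
```

The first term is the empirical Kullback-Leibler contrast (negative log-likelihood of the conditional density), and the second is the ℓ1 penalty on the regression coefficients, whose weight λ is the regularization parameter on which the condition mentioned above is placed.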

From a practical point of view, Caroline Meynet has introduced a procedure to select variables in model-based clustering in a high-dimensional context. To tackle the problem of high dimension, she has proposed first using the Lasso to select different sets of variables, and then estimating the density with a standard EM algorithm, restricting the inference to the linear space of the variables selected by the Lasso. Numerical experiments show that this method can outperform direct estimation by the Lasso.

In collaboration with Professor Abdallah Mkhadri (University of Marrakesh, Morocco), Gilles Celeux supervised the thesis of Mohammed El Anbari, which concerns regularisation methods in linear regression. Together with Abdallah Mkhadri, Mohammed El Anbari proposed a method that simultaneously selects variables and favors a grouping effect, where strongly correlated predictors tend to be in or out of the model together. Numerical experiments showed that their method can be preferred to the Elastic-Net when the number of variables is less than or equal to the sample size, and remains competitive otherwise. Moreover, they have proposed AdaGril, an extension of the adaptive Elastic-Net which incorporates information redundancy among correlated variables for model selection and estimation. Under weak conditions, they have established an oracle property of AdaGril. Numerical experiments show that AdaGril outperforms its competitors in some cases.
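The grouping effect mentioned above can be illustrated with a minimal sketch. It uses the plain Elastic-Net, fitted by coordinate descent, as a stand-in; it is not the authors' method nor AdaGril. The objective, the function names, the toy data and the values of `lam1` and `lam2` are all assumptions chosen for illustration.

```python
def soft_threshold(z, t):
    """Soft-thresholding operator used by Lasso-type coordinate updates."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def elastic_net_cd(X, y, lam1, lam2, n_sweeps=200):
    """Coordinate descent for the (assumed) objective
    (1/(2n)) * ||y - X b||^2 + lam1 * ||b||_1 + (lam2/2) * ||b||_2^2.
    Setting lam2 = 0 gives the Lasso."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            # Correlation of predictor j with the partial residual
            # (the residual computed without predictor j's contribution).
            rho = 0.0
            for i in range(n):
                fit_others = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - fit_others)
            rho /= n
            scale = sum(X[i][j] ** 2 for i in range(n)) / n
            beta[j] = soft_threshold(rho, lam1) / (scale + lam2)
    return beta


# Toy data (hypothetical): columns 0 and 1 are identical, column 2 is
# orthogonal to them and unrelated to the response y = 2 * column 0.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
z = [1.0, -2.0, 0.0, 2.0, -1.0]
X = [[xi, xi, zi] for xi, zi in zip(x, z)]
y = [2.0 * xi for xi in x]

lasso_fit = elastic_net_cd(X, y, lam1=0.5, lam2=0.0)
enet_fit = elastic_net_cd(X, y, lam1=0.5, lam2=1.0)
print("Lasso:      ", lasso_fit)
print("Elastic-Net:", enet_fit)
```

On this toy example the Lasso keeps only the first of the two identical predictors, whereas the ridge term of the Elastic-Net makes the objective strictly convex and forces both copies to receive the same coefficient. This is the sense in which correlated predictors enter or leave the model together.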

In collaboration with Jean-Michel Marin (Université de Montpellier) and Christian P. Robert (CEREMADE, Université Paris Dauphine), Gilles Celeux and Mohammed El Anbari have highlighted, through numerical experiments, the interest of Bayesian regularization methods using hierarchical noninformative priors, compared with standard regularization methods, in a poorly informative context [47].

Clément Levrard worked on deriving fast rates of convergence for vector quantization. Using theoretical analogies between quantization, seen as an unsupervised learning problem, and supervised learning by empirical contrast minimization, he has obtained a logarithmic improvement on the previously known bound. Furthermore, he has been able to define an intelligible "margin type" condition under which fast rates can be obtained.
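To make the empirical contrast concrete, the sketch below computes the empirical distortion of a codebook (the mean squared distance from each sample point to its nearest code point) and reduces it with Lloyd's algorithm, a standard heuristic for this minimization. It is only an illustration of the quantity being minimized, not of the theoretical analysis described above; the function names and the simulated data are assumptions.

```python
import random


def sq_dist(x, c):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(x, c))


def empirical_distortion(points, codebook):
    """Mean squared distance from each point to its nearest code point."""
    return sum(min(sq_dist(x, c) for c in codebook) for x in points) / len(points)


def lloyd(points, k, n_iter=50, seed=0):
    """Lloyd's algorithm: alternate nearest-code-point assignment and
    centroid update. Finds a local minimizer of the empirical distortion."""
    rng = random.Random(seed)
    codebook = [list(p) for p in rng.sample(points, k)]
    for _ in range(n_iter):
        # Assignment step: build the cells of the current codebook.
        cells = [[] for _ in range(k)]
        for x in points:
            j = min(range(k), key=lambda i: sq_dist(x, codebook[i]))
            cells[j].append(x)
        # Update step: replace each code point by the mean of its cell.
        new_codebook = []
        for j, cell in enumerate(cells):
            if cell:
                dim = len(cell[0])
                new_codebook.append(
                    [sum(x[d] for x in cell) / len(cell) for d in range(dim)]
                )
            else:
                new_codebook.append(codebook[j])  # keep an empty cell's code point
        if new_codebook == codebook:
            break
        codebook = new_codebook
    return codebook


# Simulated data (hypothetical): two well-separated Gaussian clouds in the plane.
data_rng = random.Random(1)
points = [(data_rng.gauss(0, 0.3), data_rng.gauss(0, 0.3)) for _ in range(100)]
points += [(data_rng.gauss(5, 0.3), data_rng.gauss(5, 0.3)) for _ in range(100)]

codebook = lloyd(points, k=2)
distortion_k1 = empirical_distortion(points, lloyd(points, k=1))
distortion_k2 = empirical_distortion(points, codebook)
print(distortion_k1, distortion_k2)
```

With two code points the empirical distortion drops sharply compared with a single code point, since each cloud can be represented by its own centroid.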

Since September 2008, Pascal Massart has been the cosupervisor, with Frédéric Chazal (GEOMETRICA), of the thesis of Claire Caillerie (GEOMETRICA). The project aims to explore and develop new research at the intersection of information geometry, computational geometry and statistics.