Section: New Results
Model selection in Regression and Classification
Participants : Gilles Celeux, Serge Cohen, Clément Levrard, Erwan Le Pennec, Pascal Massart, Nelo Molter Magalhaes, Lucie Montuelle.
Unsupervised segmentation is an issue similar to unsupervised classification, with an added spatial aspect. Functional data are acquired at points of a spatial domain, and the goal is to segment this domain into homogeneous regions. The range of applications includes hyperspectral images in conservation science, fMRI data, and all spatialized functional data. Erwan Le Pennec and Lucie Montuelle focus on how to handle the spatial component, from both the theoretical and the practical point of view. In particular, they study the choice of the number of clusters. Furthermore, as functional data require heavy computation, they need to propose numerically efficient algorithms. With Serge Cohen and an X intern, some progress has been made on the use of logistic weights in the hyperspectral setting.
Lucie Montuelle has studied a model of mixture of Gaussian regressions in which the proportions are modeled by logistic weights. Using maximum likelihood estimators, a model selection procedure has been applied, supported by a theoretical guarantee. Numerical experiments have been conducted on regression mixtures with parametric logistic weights, using EM and Newton algorithms. This work has been published in the Electronic Journal of Statistics.
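As a toy illustration (not the authors' code), the following sketches an EM fit of a two-component mixture of Gaussian regressions. For simplicity it uses constant mixture proportions, whereas the studied model replaces them by logistic weights of the covariate, updated for instance by Newton steps in the M-step. The data and all settings are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from a two-component mixture of linear regressions,
# with a logistic gating function as ground truth.
n = 400
x = rng.uniform(-2, 2, n)
z = rng.random(n) < 1.0 / (1.0 + np.exp(-2.0 * x))
y = np.where(z, 2.0 * x + 1.0, -1.5 * x) + 0.3 * rng.normal(size=n)

# EM for a two-component Gaussian regression mixture with constant
# proportions (the logistic-weight model makes pi a logistic function
# of x instead, fitted e.g. by Newton steps).
X = np.column_stack([np.ones(n), x])
beta = rng.normal(size=(2, 2))       # per-component regression coefficients
sigma2 = np.ones(2)                  # per-component noise variances
pi = np.array([0.5, 0.5])            # mixture proportions

for _ in range(100):
    # E-step: responsibilities of each component for each point.
    dens = np.empty((n, 2))
    for k in range(2):
        resid = y - X @ beta[k]
        dens[:, k] = (pi[k] * np.exp(-0.5 * resid**2 / sigma2[k])
                      / np.sqrt(2 * np.pi * sigma2[k]))
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: weighted least squares per component.
    for k in range(2):
        w = resp[:, k]
        Xw = X * w[:, None]
        beta[k] = np.linalg.solve(Xw.T @ X, Xw.T @ y)
        sigma2[k] = (w * (y - X @ beta[k])**2).sum() / w.sum()
    pi = resp.mean(axis=0)
```

On this example, the two estimated slopes recover the two generating regression lines (up to label switching).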
Another subject considered by Erwan Le Pennec and Lucie Montuelle is obtaining oracle inequalities in deviation for model selection aggregation in the fixed design regression framework. Exponential weights are widely used but sub-optimal in this setting. They aggregate linear estimators and penalize Stein's unbiased risk estimate, used in the exponential weights, to derive such inequalities. Furthermore, if the sup-norm of the regression function is known and taken into account in the penalty, a sharp oracle inequality is available. PAC-Bayesian tools and concentration inequalities play a key role in this work. These results may be found in a prepublication on arXiv and in Lucie Montuelle's PhD thesis.
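The flavour of exponential-weights aggregation of linear estimators can be sketched as follows. This is a toy version with invented data: projection estimators on a trigonometric basis, and exponential weights driven by Stein's unbiased risk estimate with a heuristic temperature, not the penalized weights of the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Fixed-design regression y_i = f(t_i) + noise, with known sigma.
n, sigma = 128, 0.5
t = np.linspace(0, 1, n)
f = np.sin(2 * np.pi * t) + 0.5 * np.cos(6 * np.pi * t)
y = f + sigma * rng.normal(size=n)

# Family of linear estimators: projections on nested trigonometric spaces.
cols = [np.ones(n)]
for k in range(1, 16):
    cols.append(np.cos(2 * np.pi * k * t))
    cols.append(np.sin(2 * np.pi * k * t))
B, _ = np.linalg.qr(np.column_stack(cols))   # orthonormal basis of the design

ests, sures = [], []
for D in range(1, B.shape[1] + 1):
    P = B[:, :D]
    fhat = P @ (P.T @ y)                     # projection estimator of dim D
    ests.append(fhat)
    # Stein's unbiased risk estimate for a projection of dimension D.
    sures.append(((y - fhat)**2).sum() + 2 * sigma**2 * D - n * sigma**2)
sures = np.array(sures)

# Exponential weights exp(-SURE/T); T = 4 sigma^2 is a heuristic choice here.
T = 4 * sigma**2
w = np.exp(-(sures - sures.min()) / T)
w /= w.sum()
fagg = (w[:, None] * np.array(ests)).sum(axis=0)
```

The aggregate concentrates its weight on the dimensions whose estimated risk is smallest, and its error stays close to that of the best projection in the family.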
In collaboration with Sylvain Arlot, Matthieu Lerasle and Patricia Reynaud-Bouret (CNRS), Nelo Molter Magalhaes considers the estimator selection problem
with the
Emilie Devijver and Pascal Massart have focused on the Lasso for high-dimensional finite mixture regression models.
An
Pascal Massart and Clément Levrard continue their work on the properties of the
Among selection methods for nonparametric estimators, a recent one is the Goldenshluger-Lepski procedure. This method proposes a data-driven choice of
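To give the flavour of a Goldenshluger-Lepski-type rule, here is a simplified sketch (not the authors' formulation) for bandwidth choice in kernel density estimation: for each bandwidth, a bias-type term is estimated by comparing doubly-smoothed and singly-smoothed estimates, and the selected bandwidth minimizes this term plus a variance-type term. The Gaussian-kernel convolution identity and the constant in the variance term are simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Sample from N(0, 1); risks are measured in L2 on a grid.
n = 500
X = rng.normal(size=n)
grid = np.linspace(-5, 5, 501)
dx = grid[1] - grid[0]

def kde(h):
    """Gaussian kernel density estimate with bandwidth h, on the grid."""
    u = (grid[:, None] - X[None, :]) / h
    return np.exp(-0.5 * u**2).sum(axis=1) / (n * h * np.sqrt(2 * np.pi))

H = np.geomspace(0.05, 1.0, 15)                            # bandwidth family
V = {h: 1.0 / (2 * np.sqrt(np.pi) * n * h) for h in H}     # variance term, constant 1

# Gaussian-kernel identity: re-smoothing the h-estimate with bandwidth h'
# equals a single KDE with bandwidth sqrt(h^2 + h'^2).
est = {h: kde(h) for h in H}
best, h_hat = np.inf, None
for h in H:
    B = 0.0
    for hp in H[H <= h]:
        f_hhp = kde(np.sqrt(h**2 + hp**2))
        dist2 = ((f_hhp - est[hp])**2).sum() * dx
        B = max(B, dist2 - V[hp])          # positive part via the init B = 0
    crit = B + V[h]
    if crit < best:
        best, h_hat = crit, h

f_hat = est[h_hat]
```

The selected estimate is then the KDE at the chosen bandwidth; on this example its integrated squared error against the true standard normal density is small.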
The well-documented and consistent variable selection procedure for model-based cluster analysis and classification that Cathy Maugis (INSA Toulouse) designed during her PhD thesis in select makes use of stepwise algorithms, which are painfully slow in high dimensions. In order to circumvent this drawback, Gilles Celeux, in collaboration with Mohammed Sedki (Université Paris XI) and Cathy Maugis, has proposed to rank the variables using a lasso-like penalization adapted to the Gaussian mixture model context. Using this ranking to select the variables, they avoid the combinatorial explosion of stepwise procedures. After tests on challenging simulated and real data sets, their algorithm has been finalised and shows good performance.
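A minimal numpy sketch of the ranking idea (not the actual algorithm): under an l1 penalty on cluster-mean differences, a variable drops out of the model once the penalty level exceeds its standardized mean separation, so separations computed from a fitted mixture induce a lasso-like ordering of the variables. The data and the spherical-mixture EM below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# 2 informative variables (shifted means across clusters) + 4 noise variables.
n, d = 300, 6
z = rng.random(n) < 0.5
X = rng.normal(size=(n, d))
X[z, 0] += 3.0
X[z, 1] -= 3.0

# Simple EM for a two-component spherical Gaussian mixture.
mu = X[rng.choice(n, 2, replace=False)]    # initialize means at data points
for _ in range(50):
    d2 = ((X[:, None, :] - mu[None, :, :])**2).sum(axis=2)
    resp = np.exp(-0.5 * d2)
    resp /= resp.sum(axis=1, keepdims=True)
    mu = (resp.T @ X) / resp.sum(axis=0)[:, None]

# Lasso-like ranking proxy: a variable's standardized cluster-mean
# separation is the penalty level at which it would be shrunk to zero,
# so sorting by it orders the variables as a penalization path would.
sep = np.abs(mu[0] - mu[1]) / X.std(axis=0)
ranking = np.argsort(-sep)                 # most separating variables first
```

The subsequent selection step can then walk down this ranking instead of exploring all variable subsets, which is what removes the combinatorial cost of stepwise search.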
In collaboration with Jean-Michel Marin (Université de Montpellier) and Olivier Gascuel (LIRMM), Gilles Celeux has continued research aiming to select a short list of models rather than a single model. This short list of models is declared to be compatible with the data using a