Project-Team:SELECT

Inria | Raweb 2016 | Presentation of the Project-Team SELECT | SELECT Web Site




Section: New Results

Statistical learning methodology and theory

Participants : Gilles Celeux, Christine Keribin, Michel Prenat, Kaniav Kamary, Sylvain Arlot, Benjamin Auder, Jean-Michel Poggi, Neska El Haouij, Kevin Bleakley.

Gaussian graphical models are widely used to infer and visualize networks of dependencies between continuous variables. However, inferring the graph is difficult when the sample size is small compared to the number of variables. To reduce the number of parameters to estimate in the model, former PhD students Emilie Devijver (supervisors: Pascal Massart and Jean-Michel Poggi) and Mélina Gallopin (supervisor: Gilles Celeux) proposed a non-asymptotic model selection procedure supported by strong theoretical guarantees based on an oracle inequality and a minimax lower bound. The covariance matrix of the model is approximated by a block-diagonal matrix. The structure of this matrix is detected by thresholding the sample covariance matrix, where the threshold is selected using the slope heuristic. Based on the block-diagonal structure of the covariance matrix, the estimation problem is divided into several independent problems: the network of dependencies between variables is then inferred using the graphical lasso algorithm in each block. The performance of the procedure has been illustrated on simulated data. The procedure has also been applied to a real gene expression dataset with a limited sample size: the dimension reduction allows attention to be objectively focused on interactions among smaller subsets of genes, leading to a more parsimonious and interpretable modular network. This work has been accepted for publication in the Journal of the American Statistical Association.
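The block-detection step described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `covariance_blocks` and the toy matrix are hypothetical, the threshold is passed in by hand rather than selected by the slope heuristic, and the graphical lasso step inside each block is not shown.

```python
def covariance_blocks(cov, threshold):
    """Group variables into blocks: i and j are linked when |cov[i][j]| > threshold.

    Blocks are the connected components of the thresholded graph,
    found here with a small union-find structure.
    """
    p = len(cov)
    parent = list(range(p))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(p):
        for j in range(i + 1, p):
            if abs(cov[i][j]) > threshold:
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    parent[root_j] = root_i

    blocks = {}
    for i in range(p):
        blocks.setdefault(find(i), []).append(i)
    return sorted(blocks.values())


# Toy sample covariance matrix (made-up values) with two visible blocks.
sample_cov = [
    [1.00, 0.80, 0.02, 0.01],
    [0.80, 1.00, 0.03, 0.02],
    [0.02, 0.03, 1.00, 0.60],
    [0.01, 0.02, 0.60, 1.00],
]
print(covariance_blocks(sample_cov, threshold=0.1))  # [[0, 1], [2, 3]]
```

Each returned block could then be handed to a separate sparse precision-matrix estimation, which is what makes the sub-problems independent.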

J-M. Poggi, with A. Bar-Hen, has focused on individual observation diagnosis issues for graphical models. The use of an influence measure is a classical diagnostic method to measure the perturbation induced by single elements. The stability issue is considered here using the jackknife. For a given graphical model, tools to perform diagnosis on observations are provided. In a second step, a filtering of the dataset to obtain a stable network is proposed.
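To make the idea of a jackknife influence measure concrete, here is a minimal sketch. It is not the authors' procedure: a plain Pearson correlation stands in for the graphical model, and the names `correlation` and `jackknife_influence` as well as the data are hypothetical.

```python
from statistics import mean


def correlation(xs, ys):
    """Pearson correlation of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def jackknife_influence(xs, ys):
    """Change in the statistic when each observation is left out in turn."""
    full = correlation(xs, ys)
    return [
        full - correlation(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        for i in range(len(xs))
    ]


# Made-up data: the last observation is an obvious outlier.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.1, 1.9, 3.2, 3.9, 5.1, 0.0]

influence = jackknife_influence(xs, ys)
most_influential = max(range(len(xs)), key=lambda i: abs(influence[i]))
print(most_influential)  # index of the observation that perturbs the estimate most
```

Observations with a large absolute influence are natural candidates for the filtering step mentioned above.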

Latent Block Models (LBM) are a model-based method to cluster simultaneously the $d$ columns and $n$ rows of a data matrix. The Blockcluster package estimates such LBMs. Parameter estimation in LBM is a difficult and multifaceted problem. Although various estimation strategies have been proposed and are now well-understood empirically, theoretical guarantees about their asymptotic behavior are rather rare. Christine Keribin, in collaboration with Mahendra Mariadassou (INRA) and Vincent Brault (Université de Grenoble), has shown that under some mild conditions on the parameter space, and in an asymptotic regime where $\log(d)/n$ and $\log(n)/d$ go to 0 when $n$ and $d$ go to $+\infty$, (1) the maximum likelihood estimate of the complete model (with known labels) is consistent and (2) the log-likelihood ratios are equivalent under the complete and observed (with unknown labels) models. This equivalence allows us to transfer the asymptotic consistency to the maximum likelihood estimate under the observed model. Moreover, the variational estimator is also consistent. These results extend the results of Bickel et al. (2013) on stochastic block models, and detail the case where the parameter exhibits symmetry.

For the same LBM, Valérie Robert and Yann Vasseur have extended the popular Adjusted Rand Index (ARI) to the task of simultaneous clustering of the rows and columns of a given matrix. This new index, called the Coclustering Adjusted Rand Index (CARI), overcomes the label switching phenomenon while remaining useful and competitive with respect to other indices. Indeed, partitions with high numbers of clusters can be considered, and no convention is required when the numbers of clusters in partitions are different. They are now exploring links with other indices.
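For reference, the classical Adjusted Rand Index that CARI builds on can be computed as in the sketch below. This is the standard one-way ARI only, not the CARI extension itself; the function name is hypothetical and the inputs are assumed to contain at least two items.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Classical ARI between two partitions of the same items (n >= 2)."""
    n = len(labels_a)
    contingency = Counter(zip(labels_a, labels_b))
    sum_cells = sum(comb(c, 2) for c in contingency.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both partitions are trivial
    return (sum_cells - expected) / (max_index - expected)


# Swapping the label names does not change the index.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
```

Because the index depends only on which items are grouped together, it is unaffected by label switching, the property that the text says CARI retains in the co-clustering setting.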

Gilles Celeux continued his collaboration with Jean-Patrick Baudry on model-based clustering. This year, they proposed to consider the model selection criterion ICL as a validity index. They show how it can be coupled with a null model of homogeneity focusing on clustering. This null model, which includes the Gaussian distributions, can be difficult to analyze. They find an explicit representation for simple models and show how the parametric bootstrap test can be applied in such situations. In more general situations, they propose a solution for applying this approach involving an “acceptance-rejection” procedure which explores the parameter space to approximate the maximum likelihood estimator inside the null model of homogeneity. The uncovering of this null model highlights the notion of class underlying ICL, and confirms earlier results showing that ICL is consistent for a loss function taking clustering into account.
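As a reminder of what ICL measures, the sketch below uses the common BIC-type approximation, in which ICL equals a BIC-like term minus the entropy of the posterior class memberships. This is an assumption made for illustration: the text does not say which form of ICL is used, sign and scaling conventions vary across references, and the function name and numbers are hypothetical.

```python
from math import log


def icl_bic_approximation(log_likelihood, n_free_params, posteriors):
    """BIC-type ICL approximation in the 'larger is better' convention.

    posteriors[i][k] is the estimated probability that item i belongs to class k.
    """
    n = len(posteriors)
    bic = log_likelihood - 0.5 * n_free_params * log(n)
    entropy = -sum(t * log(t) for row in posteriors for t in row if t > 0)
    return bic - entropy


# Same fit quality, but well-separated classes are penalized less.
crisp = icl_bic_approximation(-10.0, 3, [[1.0, 0.0], [0.0, 1.0]])
fuzzy = icl_bic_approximation(-10.0, 3, [[0.5, 0.5], [0.5, 0.5]])
print(crisp > fuzzy)  # True
```

The entropy term is what ties the criterion to the quality of the partition rather than only to the fit of the mixture density.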

In collaboration with Arthur White and Jason Wyse (Trinity College, Dublin), Gilles Celeux has evaluated, for multivariate Poisson mixture models, the performance of a greedy search method compared to the expectation maximization (EM) algorithm for optimizing the ICL model selection criterion, which can be computed exactly for such models. It appears that EM often gives slightly better results, but the greedy search is computationally more efficient.

The Dutch and French schools of data analysis differ in their approaches to the question: How does one understand and summarize the information contained in a data set? Julie Josse, in collaboration with François Husson (Agro Rennes) and Gilbert Saporta (CNAM, Paris), explored the shared factors and differences between the schools, with a focus on methods dedicated to the analysis of categorical data, which are known either as homogeneity analysis (HOMALS) or multiple correspondence analysis (MCA). MCA is a dimension-reduction method which plays a large role in the analysis of tables with categorical nominal variables such as survey data. Though it is usually motivated and derived using geometric considerations, they proved that it amounts to a single proximal Newton step of a natural bilinear exponential family model for categorical data: the multinomial logit bilinear model. They compared and contrasted the behavior of MCA with that of the model on simulations, and discussed new insights into the properties of both exploratory multivariate methods and their cognate models. The main conclusion is to recommend approximating the multilogit model parameters using MCA. Indeed, estimating the parameters of the model is not a trivial task, whereas MCA has the great advantage of being easily solved by a singular value decomposition, as well as being scalable to large datasets.
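The remark that MCA reduces to a singular value decomposition can be made explicit with the textbook formulation of MCA as a correspondence analysis of the indicator matrix. The notation below is an assumption introduced for illustration and does not come from the report: $Z$ is the $n \times J$ indicator (dummy-coded) matrix of $Q$ categorical variables, so that its entries sum to $nQ$.

```latex
P = \frac{1}{nQ}\, Z, \qquad r = P\,\mathbf{1}, \qquad c = P^{\top}\mathbf{1},
\qquad D_r = \operatorname{diag}(r), \qquad D_c = \operatorname{diag}(c),

S = D_r^{-1/2}\,\bigl(P - r\,c^{\top}\bigr)\,D_c^{-1/2} = U\,\Sigma\,V^{\top},
\qquad F = D_r^{-1/2}\,U\,\Sigma .
```

Here $F$ holds the principal coordinates of the rows, and the squared singular values in $\Sigma$ are the principal inertias; keeping the leading columns gives the low-dimensional representation.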

Julie Josse, with Sobczyk and Bogdan, has discussed the problem of estimating the number of principal components in Principal Components Analysis (PCA). They address this issue by presenting an approximate Bayesian approach based on the Laplace approximation, and introduce a general method for building model selection criteria, called PEnalized SEmi-integrated Likelihood (PESEL). This general framework encompasses a variety of existing approaches based on probabilistic models, such as the Bayesian Information Criterion for Probabilistic PCA (PPCA), and allows for the construction of new criteria, depending on the size of the data set at hand. Specifically, they define PESEL when the number of variables substantially exceeds the number of observations. Numerical simulations show that PESEL-based criteria can be quite robust against deviations from probabilistic model assumptions. Selected PESEL-based criteria for estimation of the number of principal components are implemented in the R package varclust, which is available on GitHub.

Gilles Celeux and Julie Josse have started research on missing data for model-based clustering in collaboration with Christophe Biernacki (Modal, Inria Lille). The aim of this research is to propose appropriate and efficient tools for the packages Mixmod and Mixtcomp.

In collaboration with Jean-Michel Marin (Université de Montpellier) and Christian Robert (Université Paris 9-Dauphine), Gilles Celeux and Kaniav Kamary investigated the ability of Bayesian inference to properly estimate the parameters of Gaussian mixtures in high dimensions. Their study shows how important the choice of prior distributions is. In particular, independent prior distributions perform much better. Moreover, when the dimension $d$ becomes very large (say $d > 40$), Bayesian inference becomes questionable. The results of this study will be gathered in a chapter of a book on mixture models that Gilles Celeux is preparing with Christian Robert and Sylvia Frühwirth-Schnatter.

Sylvain Arlot, in collaboration with Robin Genuer (ISPED), studied the reasons why random forests work so well in practice. Focusing on the problem of quantifying the impact of each ingredient of random forests on their performance, they showed that such a quantification is possible for a simple pure forest, leading to conclusions that could apply more generally. Then, they considered “hold-out” random forests, which are a good midpoint between “toy” pure forests and Breiman's original random forests.

J.-M. Poggi and N. El Haouij (with R. Ghozi, S. Sevestre Ghalila and M. Jaïdane) provide a random forest-based method for the selection of physiological functional variables in order to classify stress levels during a real-world driving experience. The contribution of this study is twofold: on the methodological side, it considers physiological signals as functional variables and offers a procedure for data processing and variable selection. On the applied side, the proposed method provides a “blind” procedure of driver's stress level classification that does not depend on expert-based studies of physiological signals.

J-M. Poggi (with R. Genuer, C. Tuleau-Malot, N. Villa-Vialaneix) has focused on random forests in Big Data classification problems, and has reviewed available proposals for random forests in parallel environments as well as for online random forests. Three variants, involving subsampling, Big Data-bootstrap and MapReduce respectively, are tested on two massive datasets, one simulated and the other real-world.

B. Auder and J-M. Poggi (with M. Bobbia, B. Portier) have tested some methods of sequential aggregation for forecasting next-day PM10 concentrations, in the context of air quality monitoring in Normandy (France). The main originality is that the set of experts contains both statistical models built by means of various methods and groups of predictors, and experts coming from deterministic chemical prediction models. The results show that such a strategy clearly improves on the performance of the best expert, both in terms of prediction errors and in terms of alerts. What is more, for the non-convex weighting strategy, it achieves the “unbiasedness” of observed-forecasted scatterplots, which is extremely difficult to obtain.
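Sequential aggregation of experts can be illustrated with one classical rule, the exponentially weighted average forecaster. The report does not say which aggregation rules were tested, so this is an assumed stand-in: it uses convex weights (not the non-convex strategy mentioned above), a squared loss, a hand-picked learning rate `eta`, and made-up forecasts.

```python
from math import exp


def _weights(cumulative_loss, eta):
    """Convex weights proportional to exp(-eta * cumulative loss)."""
    best = min(cumulative_loss)  # shift by the best loss for numerical stability
    raw = [exp(-eta * (loss - best)) for loss in cumulative_loss]
    total = sum(raw)
    return [w / total for w in raw]


def exponentially_weighted_forecasts(expert_forecasts, observations, eta=0.1):
    """Aggregate expert forecasts day by day, using only past losses.

    expert_forecasts[t][k] is the forecast of expert k for day t.
    Returns the aggregated forecasts and the weights after the last day.
    """
    n_experts = len(expert_forecasts[0])
    cumulative_loss = [0.0] * n_experts
    aggregated = []
    for forecasts, y in zip(expert_forecasts, observations):
        weights = _weights(cumulative_loss, eta)
        aggregated.append(sum(w * f for w, f in zip(weights, forecasts)))
        # The observation is revealed only after the forecast is made.
        for k, f in enumerate(forecasts):
            cumulative_loss[k] += (f - y) ** 2
    return aggregated, _weights(cumulative_loss, eta)


# Made-up example: expert 0 is off by 1, expert 1 is off by 10.
observations = [20.0, 25.0, 30.0]
expert_forecasts = [[y + 1.0, y + 10.0] for y in observations]
aggregated, final_weights = exponentially_weighted_forecasts(expert_forecasts, observations)
print(aggregated[0], final_weights)
```

The first aggregated forecast is a plain average because no losses have been observed yet; afterwards the weight shifts toward the expert with the smaller cumulative error.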

J-M. Poggi (with A. Antoniadis, I. Gijbels, S. Lambert-Lacroix) has considered joint estimation and variable selection for mean and dispersion in proper dispersion models. They used recent results on Bregman divergence to establish theoretical results for the proposed estimators in fairly general settings, and also studied variable selection when there is a large number of covariates, with this number possibly tending to infinity with the sample size. The proposed estimation and selection procedure is investigated via a simulation study, and illustrated via some real data applications.
