<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1 plus MathML 2.0 plus SVG 1.1//EN" "http://www.w3.org/2002/04/xhtml-math-svg/xhtml-math-svg.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta http-equiv="Content-Type" content="application/xhtml+xml; charset=utf-8"/>
    <title>Project-Team:SELECT</title>
    <link rel="stylesheet" href="../static/css/raweb.css" type="text/css"/>
    <meta name="description" content="New Results - Statistical learning methodology and theory"/>
    <meta name="dc.title" content="New Results - Statistical learning methodology and theory"/>
    <meta name="dc.creator" content="Gilles Celeux"/>
    <meta name="dc.creator" content="Christine Keribin"/>
    <meta name="dc.creator" content="Michel Prenat"/>
    <meta name="dc.creator" content="Kaniav Kamary"/>
    <meta name="dc.creator" content="Sylvain Arlot"/>
    <meta name="dc.creator" content="Benjamin Auder"/>
    <meta name="dc.creator" content="Jean-Michel Poggi"/>
    <meta name="dc.creator" content="Neska El Haouij"/>
    <meta name="dc.creator" content="Kevin Bleakley"/>
    <meta name="dc.subject" content=""/>
    <meta name="dc.publisher" content="INRIA"/>
    <meta name="dc.date" content="(SCHEME=ISO8601) 2016-01"/>
    <meta name="dc.type" content="Report"/>
    <meta name="dc.language" content="(SCHEME=ISO639-1) en"/>
    <meta name="projet" content="SELECT"/>
    <script type="text/javascript" src="https://raweb.inria.fr/rapportsactivite/RA2016/static/MathJax/MathJax.js?config=TeX-MML-AM_CHTML">
      <!--MathJax-->
    </script>
  </head>
  <body>
    <div class="tdmdiv">
      <div class="logo">
        <a href="http://www.inria.fr">
          <img style="align:bottom; border:none" src="../static/img/icons/logo_INRIA-coul.jpg" alt="Inria"/>
        </a>
      </div>
      <div class="TdmEntry">
        <div class="tdmentete">
          <a href="uid0.html">Project-Team Select</a>
        </div>
        <span>
          <a href="uid1.html">Members</a>
        </span>
      </div>
      <div class="TdmEntry">Overall Objectives<ul><li><a href="./uid3.html">Model selection in Statistics</a></li></ul></div>
      <div class="TdmEntry">Research Program<ul><li><a href="uid5.html">General presentation</a></li><li><a href="uid6.html">A nonasymptotic view of model selection</a></li><li><a href="uid7.html">Taking into account the modeling purpose in model selection</a></li><li><a href="uid8.html">Bayesian model selection</a></li></ul></div>
      <div class="TdmEntry">Application Domains<ul><li><a href="uid10.html">Introduction</a></li><li><a href="uid11.html">Curve classification</a></li><li><a href="uid12.html">Computer experiments and reliability</a></li><li><a href="uid13.html">Analysis of genomic data</a></li><li><a href="uid14.html">Pharmacovigilance</a></li><li><a href="uid15.html">Spectroscopic imaging analysis of ancient materials</a></li></ul></div>
      <div class="TdmEntry">New Software and Platforms<ul><li><a href="uid17.html">BlockCluster</a></li><li><a href="uid21.html">Mixmod</a></li><li><a href="uid26.html">MASSICCC</a></li></ul></div>
      <div class="TdmEntry">New Results<ul><li><a href="uid31.html">Model selection in Regression and Classification</a></li><li><a href="uid32.html">Estimator selection</a></li><li class="tdmActPage"><a href="uid33.html">Statistical learning methodology and theory</a></li><li><a href="uid34.html">Estimation for conditional densities in high dimension</a></li><li><a href="uid35.html">Reliability</a></li><li><a href="uid36.html">Statistical analysis of genomic data</a></li><li><a href="uid39.html">Model based-clustering for pharmacovigilance data</a></li><li><a href="uid40.html">Statistical rating and ranking of scientific journals</a></li></ul></div>
      <div class="TdmEntry">Bilateral Contracts and Grants with Industry<ul><li><a href="uid42.html">Contract with SNECMA</a></li></ul></div>
      <div class="TdmEntry">Partnerships and Cooperations<ul><li><a href="uid44.html">Regional Initiatives</a></li><li><a href="uid45.html">National Initiatives</a></li><li><a href="uid47.html">International Initiatives</a></li></ul></div>
      <div class="TdmEntry">Dissemination<ul><li><a href="uid49.html">Promoting Scientific Activities</a></li><li><a href="uid75.html">Teaching - Supervision - Juries</a></li></ul></div>
      <div class="TdmEntry">
        <div>Bibliography</div>
      </div>
      <div class="TdmEntry">
        <ul>
          <li>
            <a id="tdmbibentyear" href="bibliography.html">Publications of the year</a>
          </li>
        </ul>
      </div>
    </div>
    <div id="main">
      <div class="mainentete">
        <div id="head_agauche">
          <small><a href="http://www.inria.fr">
	    
	    Inria
	  </a> | <a href="../index.html">
	    
	    Raweb 
	    2016</a> | <a href="http://www.inria.fr/en/teams/select">Presentation of the Project-Team SELECT</a> | <a href="http://www.math.u-psud.fr/select/">SELECT Web Site
	  </a></small>
        </div>
        <div id="head_adroite">
          <table class="qrcode">
            <tr>
              <td>
                <a href="select.xml">
                  <img style="align:bottom; border:none" alt="XML" src="../static/img/icons/xml_motif.png"/>
                </a>
              </td>
              <td>
                <a href="select.pdf">
                  <img style="align:bottom; border:none" alt="PDF" src="IMG/qrcode-select-pdf.png"/>
                </a>
              </td>
              <td>
                <a href="../select/select.epub">
                  <img style="align:bottom; border:none" alt="e-pub" src="IMG/qrcode-select-epub.png"/>
                </a>
              </td>
            </tr>
            <tr>
              <td/>
              <td>PDF
</td>
              <td>e-Pub
</td>
            </tr>
          </table>
        </div>
      </div>
      <!--FIN du corps du module-->
      <br/>
      <div class="bottomNavigation">
        <div class="tail_aucentre">
          <a href="./uid32.html" accesskey="P"><img style="align:bottom; border:none" alt="previous" src="../static/img/icons/previous_motif.jpg"/> Previous | </a>
          <a href="./uid0.html" accesskey="U"><img style="align:bottom; border:none" alt="up" src="../static/img/icons/up_motif.jpg"/>  Home</a>
          <a href="./uid34.html" accesskey="N"> | Next <img style="align:bottom; border:none" alt="next" src="../static/img/icons/next_motif.jpg"/></a>
        </div>
        <br/>
      </div>
      <div id="textepage">
        <!--DEBUT2 du corps du module-->
        <h2>Section: 
      New Results</h2>
        <h3 class="titre3">Statistical learning methodology and theory</h3>
        <p class="participants"><span class="part">Participants</span> :
	Gilles Celeux, Christine Keribin, Michel Prenat, Kaniav Kamary, Sylvain Arlot, Benjamin Auder, Jean-Michel Poggi, Neska El Haouij, Kevin Bleakley.</p>
        <p>Gaussian graphical models are widely used to infer and visualize networks of dependencies between continuous variables.
However, inferring the graph is difficult when the sample size is small compared to the number of variables.
To reduce the number of parameters to estimate in the model,
the former PhD students Emilie Devijver (supervisors: Pascal Massart and Jean-Michel Poggi) and Mélina Gallopin (supervisor: Gilles Celeux)
proposed a non-asymptotic model selection procedure supported by strong theoretical guarantees based on an oracle inequality and a minimax
lower bound. The covariance matrix of the model is approximated by a block-diagonal matrix. The structure of this matrix is detected by
thresholding the sample covariance matrix, where the threshold is selected using the slope heuristic. Based on the block-diagonal structure
of the covariance matrix, the estimation problem is divided into several independent problems: subsequently, the network of dependencies
between variables is inferred using the graphical lasso algorithm in each block. The performance of the procedure has been illustrated
on simulated data. An application to a real gene expression dataset with a limited sample size was also carried out: the dimension reduction
allows attention to be objectively focused on interactions among smaller subsets of genes, leading to a more parsimonious and interpretable
modular network. This work has been accepted for publication in the <i>Journal of the American Statistical Association</i>.</p>
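        <p>The block-detection step described above can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the function name covariance_blocks is hypothetical, the threshold is passed in as a fixed value (whereas the procedure above selects it with the slope heuristic), and the graphical lasso step applied within each block is not shown. Variables are linked when the absolute value of their sample covariance exceeds the threshold, and the blocks are the connected components of the resulting graph.</p>
        <pre>
```python
import numpy as np

def covariance_blocks(X, threshold):
    """Group variables into blocks by thresholding the sample covariance.

    Two variables are linked when the absolute value of their sample
    covariance exceeds `threshold`; blocks are the connected components
    of the resulting graph. Returns one block label per variable.
    """
    S = np.cov(X, rowvar=False)            # sample covariance, shape (p, p)
    adjacency = np.abs(S) > threshold      # thresholded dependence graph
    p = S.shape[0]
    labels = [-1] * p                      # -1 marks an unvisited variable
    n_blocks = 0
    for start in range(p):
        if labels[start] != -1:
            continue
        # Depth-first search over the thresholded graph.
        stack = [start]
        labels[start] = n_blocks
        while stack:
            i = stack.pop()
            for j in np.flatnonzero(adjacency[i]):
                if labels[j] == -1:
                    labels[j] = n_blocks
                    stack.append(int(j))
        n_blocks += 1
    return labels
```
        </pre>
        <p>Each block of variables returned by such a function could then be handed to a separate sparse precision-matrix estimator, which is what makes the estimation problem split into independent sub-problems.</p>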
        <p>J.-M. Poggi, with A. Bar-Hen, has focused on individual observation
diagnosis issues for graphical models. The use of an influence measure
is a classical diagnostic method to measure the perturbation induced
by single elements. Here, the stability issue is addressed using the
jackknife. For a given graphical model, tools to perform diagnosis on
observations are provided. In a second step, a filtering of the dataset
to obtain a stable network is proposed.</p>
        <p>Latent Block Models (LBM) are a model-based method to cluster simultaneously the <span class="math"><math xmlns="http://www.w3.org/1998/Math/MathML"><mi>d</mi></math></span> columns and <span class="math"><math xmlns="http://www.w3.org/1998/Math/MathML"><mi>n</mi></math></span> rows of a data matrix.
The BlockCluster package estimates such LBMs. Parameter estimation in LBM is a difficult and multifaceted problem.
Although various estimation strategies have been proposed and are now well-understood empirically, theoretical guarantees
about their asymptotic behavior are rather rare. Christine Keribin, in collaboration with Mahendra Mariadassou (INRA)
and Vincent Brault (Université de Grenoble), has shown that under some mild conditions on the parameter space, and
in an asymptotic regime where <span class="math"><math xmlns="http://www.w3.org/1998/Math/MathML"><mrow><mo form="prefix">log</mo><mo>(</mo><mi>d</mi><mo>)</mo><mo>/</mo><mi>n</mi></mrow></math></span> and <span class="math"><math xmlns="http://www.w3.org/1998/Math/MathML"><mrow><mo form="prefix">log</mo><mo>(</mo><mi>n</mi><mo>)</mo><mo>/</mo><mi>d</mi></mrow></math></span> go to 0 when <span class="math"><math xmlns="http://www.w3.org/1998/Math/MathML"><mi>n</mi></math></span> and <span class="math"><math xmlns="http://www.w3.org/1998/Math/MathML"><mi>d</mi></math></span> go to <span class="math"><math xmlns="http://www.w3.org/1998/Math/MathML"><mrow><mo>+</mo><mi>∞</mi></mrow></math></span>, (1) the maximum
likelihood estimate of the complete model (with known labels) is consistent and (2) the log-likelihood ratios are equivalent
under the complete and observed (with unknown labels) models. This equivalence allows us to transfer the asymptotic consistency
to the maximum likelihood estimate under the observed model. Moreover, the variational estimator is also consistent. These results
extend the results of Bickel et al. (2013) on stochastic block models, and detail the case where the parameter exhibits symmetry.</p>
        <p>For the same LBM, Valérie Robert and Yann Vasseur have extended the popular Adjusted Rand Index (ARI)
to the task of simultaneous
clustering of the rows and columns of a given matrix. This new index, called the Coclustering
Adjusted Rand Index (CARI), overcomes the label switching phenomenon while remaining
useful and competitive with respect to other indices. Indeed, partitions with high numbers of
clusters can be considered, and no convention is required when the numbers of
clusters in partitions are different. They are now exploring links with other indices.</p>
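        <p>For reference, the classical Adjusted Rand Index that CARI extends can be computed from the contingency table of two partitions. The sketch below implements only this classical index for a single set of items, not CARI itself; the function name is hypothetical. It illustrates why the index is insensitive to label switching: only co-membership counts enter the formula.</p>
        <pre>
```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Classical Adjusted Rand Index between two partitions of the same items."""
    n = len(labels_a)
    if n != len(labels_b) or 2 > n:
        raise ValueError("need two labelings of the same items (at least 2 items)")
    # Contingency table: number of items in cluster u of A and cluster v of B.
    pair_counts = Counter(zip(labels_a, labels_b))
    sum_cells = sum(comb(c, 2) for c in pair_counts.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_b).values())
    total_pairs = comb(n, 2)
    expected = sum_rows * sum_cols / total_pairs   # index expected by chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions have a single cluster
    return (sum_cells - expected) / (max_index - expected)
```
        </pre>
        <p>Two partitions that differ only by a renaming of the clusters, such as [0, 0, 1, 1] and [1, 1, 0, 0], obtain the maximal value of 1.</p>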
        <p>Gilles Celeux continued his collaboration with Jean-Patrick Baudry on model-based clustering.
This year, they proposed to consider the model selection criterion ICL as a validity index. They
show how it can be coupled with a null model of homogeneity focusing on clustering.
This null model, which includes the Gaussian distributions, can be difficult to analyze.
They find an explicit representation for simple models and show how the parametric bootstrap test can be applied in such
situations.
In more general situations, they propose a solution for applying this approach involving an “acceptance-rejection” procedure which explores the
parameter space to approximate the maximum likelihood estimator inside the null model of homogeneity.
The uncovering of this null model highlights the notion of class underlying ICL, and confirms earlier results which show
that ICL is consistent for a loss function taking clustering into account.</p>
        <p>In collaboration with Arthur White and Jason Wyse (Trinity College, Dublin)
Gilles Celeux has evaluated for multivariate Poisson mixture
models the performance of a greedy search method
compared to the expectation maximization (EM) algorithm,
to optimize the ICL model selection criterion, which can be computed exactly
for such models.
It appears that EM often gives slightly better results,
but the greedy search is computationally more efficient.</p>
        <p>The Dutch and French schools of data analysis differ in their approaches to the question: How does one understand and summarize the
information contained in a data set? Julie Josse, in collaboration with François Husson (Agro Rennes) and Gilbert Saporta (CNAM, Paris), explored the shared factors and differences between the schools, with a focus on methods
dedicated to the analysis of categorical data, which are known either as homogeneity analysis (HOMALS) or multiple correspondence analysis
(MCA).
MCA is a dimension-reduction method which plays a large role in the analysis of tables with categorical nominal variables such as
survey data. Though it is usually motivated and derived using geometric considerations,
they proved that it amounts to a single
proximal Newton step of a natural bilinear exponential family model for categorical data: the multinomial logit bilinear model. They
compared and contrasted the behavior of MCA with that of the model on simulations, and discussed new insights into the properties of both
exploratory multivariate methods and their cognate models. The main conclusion is to recommend approximating the
multilogit model parameters using MCA. Indeed, estimating the parameters of the model is not a trivial task, whereas MCA has
the great advantage of being easily solved by a singular value decomposition, as well as being scalable to large datasets.</p>
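        <p>To illustrate how MCA reduces to a singular value decomposition, the following sketch computes standard MCA as a correspondence analysis of the indicator (one-hot) matrix. It is a minimal textbook version under stated assumptions, not the code used in the study: the function name and the integer coding of the categories are hypothetical, and no link to the multinomial logit bilinear model is implemented here.</p>
        <pre>
```python
import numpy as np

def mca_row_coordinates(categories, n_components=2):
    """MCA as correspondence analysis of the indicator (one-hot) matrix.

    categories: integer array of shape (n, Q), one column per categorical variable.
    Returns the principal row coordinates and all principal inertias.
    """
    categories = np.asarray(categories)
    # One-hot encode each variable and stack the indicator blocks side by side.
    blocks = []
    for q in range(categories.shape[1]):
        levels = np.unique(categories[:, q])
        blocks.append((categories[:, [q]] == levels).astype(float))
    Z = np.hstack(blocks)
    P = Z / Z.sum()                        # correspondence matrix
    r = P.sum(axis=1)                      # row masses
    c = P.sum(axis=0)                      # column masses
    # Standardized residuals from the independence model r c^T.
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sing, _ = np.linalg.svd(S, full_matrices=False)
    inertias = sing ** 2
    rows = (U * sing) / np.sqrt(r)[:, None]
    return rows[:, :n_components], inertias
```
        </pre>
        <p>The whole computation is a single SVD of a matrix with one row per individual and one column per category, which is what makes the method easy to solve and scalable.</p>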
        <p>Julie Josse, with Sobczyk and Bogdan, has discussed the problem of estimating the number of principal components in Principal
Components Analysis (PCA). They address this issue by presenting an approximate Bayesian approach based on Laplace approximation,
and introduce a general method for building the model selection criteria, called PEnalized SEmi-integrated Likelihood (PESEL).
This general framework encompasses a variety of existing approaches based on probabilistic models, such as the Bayesian Information Criterion
for the Probabilistic PCA (PPCA), and allows for construction of new criteria, depending on the size of the data set at hand. Specifically,
they define PESEL when the number of variables substantially exceeds the number of observations. Numerical simulations show that PESEL-based criteria can be quite robust against deviations from probabilistic model assumptions. Selected PESEL-based criteria for
estimation of the number of principal components are implemented in the R package varclust, which is available on GitHub.</p>
        <p>Gilles Celeux and Julie Josse have started research on missing data for model-based clustering in collaboration with Christophe
Biernacki (Modal, Inria Lille). The aim of this research is to propose appropriate and efficient tools for the packages Mixmod and Mixtcomp.</p>
        <p>In collaboration with Jean-Michel Marin (Université de Montpellier) and Christian Robert (Université Paris 9-Dauphine),
Gilles Celeux and Kaniav Kamary investigated the ability of Bayesian inference to properly estimate the parameters of Gaussian mixtures in high
dimensions. Their study shows how important the choice of the prior distributions is. In particular, independent prior distributions perform
much better. Moreover, when the dimension <span class="math"><math xmlns="http://www.w3.org/1998/Math/MathML"><mi>d</mi></math></span> becomes very large (say <span class="math"><math xmlns="http://www.w3.org/1998/Math/MathML"><mrow><mi>d</mi><mo>&gt;</mo><mn>40</mn></mrow></math></span>), Bayesian inference becomes questionable.
The results of this study will be gathered in a chapter of a book on mixture models that Gilles Celeux is preparing
with Christian Robert and Sylvia Frühwirth-Schnatter.</p>
        <p>Sylvain Arlot, in collaboration with Robin Genuer (ISPED), studied the reasons why random forests work so well in practice. Focusing on the problem of quantifying the impact of each ingredient of random forests on their performance, they showed that such a quantification is possible for a simple pure forest, leading to conclusions that could apply more generally. Then, they considered “hold-out” random forests, which are a good midpoint between “toy” pure forests and Breiman's original random forests.</p>
        <p>J.-M. Poggi and N. El Haouij (with R. Ghozi, S. Sevestre Ghalila and M. Jaïdane)
provide a random forest-based method for the selection of
physiological functional variables in order to classify stress
levels during a real-world driving experience. The contribution of this
study is twofold: on the methodological side, it considers
physiological signals as functional variables and offers a procedure
for data processing and variable selection. On the applied side, the
proposed method provides a “blind” procedure of driver's stress level
classification that does not depend on expert-based studies of
physiological signals.</p>
        <p>J.-M. Poggi (with R. Genuer, C. Tuleau-Malot, N. Villa-Vialaneix) has focused
on random forests in Big Data classification problems, and has performed a
review of available proposals about random forests in parallel
environments as well as on online random forests. Three variants
involving subsampling, Big Data-bootstrap and MapReduce respectively
are tested on two massive datasets, one simulated and the other real-world.</p>
        <p>B. Auder and J-M. Poggi (with M. Bobbia, B. Portier) have tested some
methods for sequential aggregation for forecasting PM10 concentrations
for the next day, in the context of air quality monitoring in Normandy
(France). The main originality is that the set of experts contains at
the same time statistical models built by means of various methods and
groups of predictors, as well as experts coming from deterministic
chemical models of prediction. The obtained results show that such a
strategy clearly improves on the performance of the best expert both in
terms of prediction errors and in terms of alerts. What is more, it
obtains, for the non-convex weighting strategy, the “unbiasedness” of
observed-forecasted scatterplots, which is extremely difficult to obtain.</p>
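        <p>As an illustration of sequential aggregation of experts, the sketch below implements the exponentially weighted average forecaster with squared loss. This is one standard convex aggregation rule chosen here as an assumption; it is not necessarily among the strategies tested in the study, and in particular it is not the non-convex weighting strategy mentioned above. The function name and the learning rate eta are hypothetical.</p>
        <pre>
```python
import numpy as np

def ewa_forecasts(expert_preds, outcomes, eta):
    """Sequentially aggregate experts with exponentially weighted averages.

    expert_preds: array of shape (T, K), forecast of each of K experts at each step.
    outcomes: array of shape (T,), observed values revealed after each forecast.
    eta: positive learning rate (a tuning parameter chosen by the user).
    """
    expert_preds = np.asarray(expert_preds, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    T, K = expert_preds.shape
    cum_loss = np.zeros(K)                 # cumulative squared loss per expert
    forecasts = np.empty(T)
    for t in range(T):
        # Weights depend only on past losses; shift by the minimum for stability.
        w = np.exp(-eta * (cum_loss - cum_loss.min()))
        w /= w.sum()
        forecasts[t] = w @ expert_preds[t]
        # The outcome is revealed only after the forecast has been issued.
        cum_loss += (expert_preds[t] - outcomes[t]) ** 2
    return forecasts
```
        </pre>
        <p>At the first step all experts receive equal weight; afterwards the weights concentrate on the experts with the smallest cumulative loss, so the aggregated forecast tracks the best expert without knowing it in advance.</p>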
        <p>J.-M. Poggi (with A. Antoniadis, I. Gijbels, S. Lambert-Lacroix) has
considered the joint estimation and variable selection for mean and
dispersion in proper dispersion models. They used recent results on
Bregman divergence for establishing theoretical results for the
proposed estimators in fairly general settings, and also studied variable
selection when there is a large number of covariates, with this number
possibly tending to infinity with the sample size. The proposed
estimation and selection procedure is investigated via a simulation
study, and illustrated via some real data applications.</p>
      </div>
      <!--FIN du corps du module-->
    </div>
  </body>
</html>
