The research domain for the select project is statistics. Statistical methodology has made great progress over the past few decades, and a variety of statistical learning software packages now support many different methods and algorithms. Users therefore face the problem of choosing among them to select the most appropriate method for their data sets and objectives. Model selection is an important but difficult problem, both theoretically and practically. Classical model selection criteria, which use penalized minimum-contrast criteria with fixed penalties, are often based on unrealistic assumptions.

select aims to provide efficient model selection criteria with data-driven penalty terms. In this context, select expects to improve the toolkit of statistical model selection criteria from both theoretical and practical perspectives. Currently, select is focusing its effort on variable selection in statistical learning, hidden-structure models and supervised classification. Its domains of application concern reliability, curve classification, phylogeny analysis and classification in genetics. New developments of select activities are concerned with applications in biostatistics (statistical analysis of fMRI data) and population genetics.

We learned from the applications we treated that some assumptions currently used in asymptotic theory for model selection are often irrelevant in practice. For instance, it is not realistic to assume that the target belongs to the family of models in competition. Moreover, in many situations, it is useful to make the size of the model depend on the sample size, which makes the asymptotic analysis break down. An important aim of select is to propose model selection criteria that take these practical constraints into account.

An important purpose of select is to build and analyze penalized log-likelihood model selection criteria that are efficient when the number of models in competition grows to infinity with the number of observations. Concentration inequalities are a key tool for that purpose and lead to data-driven penalty choice strategies. A major aim of select is to deepen the analysis of data-driven penalties from both the theoretical and the practical side. There is no universal way of calibrating penalties, but there are several general ideas that we want to develop, including heuristics derived from the Gaussian theory, special strategies for variable selection and resampling methods.
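As a toy illustration of what a penalized log-likelihood criterion looks like, the following Python sketch selects the number of bins of a regular histogram density on [0, 1]. The fixed AIC-type penalty pen(D) = D - 1, the simulated sample and all function names are assumptions made for this example only; it does not implement the data-driven penalties discussed above.

```python
import math
import random


def histogram_log_likelihood(data, n_bins):
    """Maximized log-likelihood of a regular histogram with n_bins bins on [0, 1]."""
    n = len(data)
    counts = [0] * n_bins
    for x in data:
        # Clamp the index so that x == 1.0 falls in the last bin.
        counts[min(int(x * n_bins), n_bins - 1)] += 1
    # The histogram density on bin j equals counts[j] * n_bins / n.
    return sum(c * math.log(c * n_bins / n) for c in counts if c > 0)


def select_n_bins(data, max_bins, penalty=lambda d: d - 1):
    """Return the number of bins minimizing -log-likelihood + penalty."""
    def criterion(d):
        return -histogram_log_likelihood(data, d) + penalty(d)
    return min(range(1, max_bins + 1), key=criterion)


random.seed(0)
# Hypothetical sample from a Beta(2, 5) density, which is far from uniform.
sample = [random.betavariate(2, 5) for _ in range(500)]
best = select_n_bins(sample, max_bins=50)
print("selected number of bins:", best)
```

Passing another `penalty` function changes the criterion without touching the rest of the code, which is where a calibrated, data-driven penalty would plug in.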

Choosing a model is not only difficult theoretically. From a practical point of view, it is important to design model selection criteria that accommodate situations in which the data probability distribution P is unknown and which take the model user's purpose into account. Most standard model selection criteria assume that P belongs to one of a set of models, without considering the purpose of the model. By also considering the model user's purpose, we avoid or overcome certain theoretical difficulties and can produce flexible model selection criteria with data-driven penalties. The latter is useful in supervised classification and hidden-structure models.

The Bayesian approach to statistical problems is fundamentally probabilistic. A joint probability distribution is used to describe the relationships among all the unknowns and the data. Inference is then based on the posterior distribution, i.e. the conditional probability distribution of the parameters given the observed data. Exploiting the internal consistency of the probability framework, the posterior distribution extracts the relevant information in the data and provides a complete and coherent summary of post-data uncertainty. Using the posterior to solve specific inference and decision problems is then straightforward, at least in principle.
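The simplest instance of this prior-to-posterior mechanism is the conjugate Beta-Binomial model, sketched below in Python. The uniform Beta(1, 1) prior and the counts (7 successes, 3 failures) are hypothetical values chosen only for illustration.

```python
def beta_binomial_posterior(alpha, beta, successes, failures):
    """Beta posterior parameters for a Bernoulli probability under a Beta(alpha, beta) prior."""
    return alpha + successes, beta + failures


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


def beta_variance(alpha, beta):
    """Variance of a Beta(alpha, beta) distribution."""
    total = alpha + beta
    return alpha * beta / (total ** 2 * (total + 1))


# Hypothetical data: 7 successes and 3 failures, with a uniform Beta(1, 1) prior.
post_a, post_b = beta_binomial_posterior(1.0, 1.0, successes=7, failures=3)
print("posterior parameters:", post_a, post_b)
print("posterior mean:", beta_mean(post_a, post_b))
print("posterior variance:", beta_variance(post_a, post_b))
```

The posterior mean and variance summarize the post-data uncertainty about the success probability; here the posterior variance is smaller than the prior variance, reflecting the information brought by the ten observations.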

A key goal of select is to produce methodological contributions in statistics. For this reason, the select team works with applications that serve as an important source of interesting practical problems and require innovative methodologies to address them. Most of our applications involve contracts with industrial partners, e.g. in reliability, although we also have several more academic collaborations, e.g. in genomics, genetics and neuroimaging.

Classification of complex data such as curves, functions, spectra and time series is an important field. Standard data analysis questions are being revisited to define new strategies that take the functional nature of the data into account. Functional data analysis addresses a variety of applied problems, including longitudinal studies, analysis of fMRI data and spectral calibration.

We are focusing on unsupervised classification. In addition to standard questions such as the choice of the number of clusters, the norm for measuring the distance between two observations, and the vectors for representing clusters, we must also address a major computational problem. The functional nature of the data requires the design of efficient anytime algorithms.

For several years, select has collaborated with the EDF-DER *Maintenance des Risques Industriels* group. An important theme concerns the resolution of inverse problems using simulation tools to analyze uncertainty in highly complex physical systems. A collaboration on an analogous topic is being developed with Dassault Aviation.

The other major theme concerns probabilistic modeling in fatigue analysis in the context of a research collaboration with SAFRAN, a high-technology group (Aerospace propulsion, Aircraft equipment, Defense Security, Communications).

Since 2007, select has participated in a working group with the Neurospin team (CEA-INSERM-INRIA) on Classification, Statistics and fMRI (functional Magnetic Resonance Imaging) analysis. In this framework, two theses have been co-supervised by select and Neurospin researchers (Merlin Keller, 2006-2009, and Vincent Michel, 2007-2010). The aim of this research is to determine which parts of the brain are activated by different types of stimuli. A model selection approach is useful to avoid "false-positive" detections.

For the past few years, select has collaborated with Marie-Laure Martin-Magniette (URGV) on the analysis of genomic data. An important theme of this collaboration is using statistically sound model-based clustering methods to discover groups of co-expressed genes from microarray and high-throughput sequencing data. In particular, identifying biological entities that share similar profiles across several treatment conditions, such as co-expressed genes, may help identify groups of genes that are involved in the same biological processes.

A study was carried out by Jean-Michel Poggi, François-Xavier Jollois (Université Paris-Descartes) and Bruno Portier (INSA de Rouen), in the context of a collaboration between AirNormand, Paris Descartes University and INSA of Rouen. They analyzed and forecasted PM10 pollution in the Rouen area at six different monitoring sites to quantify the effects of variables of different types, mainly meteorological measurements versus other pollutant measurements. Some recent nonparametric statistical methods (random forests, mixtures of linear models and nonlinear additive models) were used and, beyond the application, this study shed light on those methods.

mixmod is being developed in collaboration with Christophe Biernacki, Florent Langrognet (Université de Franche-Comté) and Gérard Govaert (Université de Technologie de Compiègne). The mixmod (mixture modelling) software fits mixture models to a given data set with either a clustering or a discriminant analysis purpose. mixmod uses a large variety of algorithms to estimate mixture parameters, e.g., EM, Classification EM, and Stochastic EM. They can be combined to create different strategies that lead to a sensible maximum of the likelihood (or completed likelihood) function. Moreover, different information criteria for choosing a parsimonious model, e.g. the number of mixture components, are included, some of them favoring a cluster analysis viewpoint and others a discriminant analysis viewpoint. Many Gaussian models for continuous variables and multinomial models for discrete variables are available. Written in C++, mixmod is interfaced with Scilab and Matlab. The software, the statistical documentation and the user guide are available on the Internet at the following address: http://
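To make the kind of computation involved concrete, here is a minimal, self-contained Python sketch of the EM algorithm for a one-dimensional Gaussian mixture, with the number of components compared through BIC. It is not the mixmod code or its interface: the function names, the quantile-based initialization, the variance floor and the simulated data are all assumptions made for this illustration.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of the N(mean, var) distribution at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def log_likelihood(data, weights, means, variances):
    """Log-likelihood of a 1D Gaussian mixture."""
    return sum(
        math.log(sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)))
        for x in data
    )


def em_gaussian_mixture(data, n_components, n_iter=100, min_var=1e-6):
    """Fit a 1D Gaussian mixture by EM; return parameters and the log-likelihood trace."""
    n = len(data)
    ordered = sorted(data)
    # Deterministic start (an arbitrary choice): means at evenly spaced quantiles.
    means = [ordered[int((k + 0.5) * n / n_components)] for k in range(n_components)]
    centre = sum(data) / n
    variances = [sum((x - centre) ** 2 for x in data) / n] * n_components
    weights = [1.0 / n_components] * n_components
    trace = [log_likelihood(data, weights, means, variances)]
    for _ in range(n_iter):
        # E step: posterior probability that each point belongs to each component.
        resp = []
        for x in data:
            joint = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(joint)
            resp.append([j / total for j in joint])
        # M step: weighted updates of proportions, means and variances.
        for k in range(n_components):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var, min_var)  # guard against degenerate components
        trace.append(log_likelihood(data, weights, means, variances))
    return weights, means, variances, trace


def bic(loglik, n_components, n):
    """BIC (smaller is better); a 1D K-component mixture has 3K - 1 free parameters."""
    return -2.0 * loglik + (3 * n_components - 1) * math.log(n)


rng = random.Random(0)
# Hypothetical data: two well-separated Gaussian groups.
data = [rng.gauss(0.0, 1.0) for _ in range(200)] + [rng.gauss(10.0, 1.0) for _ in range(200)]
fits = {k: em_gaussian_mixture(data, k) for k in (1, 2, 3)}
bics = {k: bic(fit[3][-1], k, len(data)) for k, fit in fits.items()}
print("BIC by number of components:", bics)
print("selected number of components:", min(bics, key=bics.get))
```

The Classification EM and Stochastic EM variants mentioned above differ in the E step (hard or randomly drawn assignments instead of posterior probabilities); they are not shown here.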

Since this year, mixmod has had a proper graphical user interface (Version 1), which was presented at the mixmod day in Lyon in December 2010. A version of mixmod in R is forthcoming.

Erwan Le Pennec, with the help of Serge Cohen, has proposed a spatial extension in which the mixture weights can vary spatially.

Erwan Le Pennec continues his work with Serge Cohen (IPANEMA Soleil) on hyperspectral image segmentation based on a spatialized Gaussian Mixture Model. They derive, and implement within MIXMOD, an efficient minimization algorithm combining the EM algorithm, dynamic programming and model selection. They have applied this technique to analyze ancient materials. This scheme is supported by theoretical work on conditional density estimation. In the framework of her PhD, Lucie Montuelle has studied an extension of this model to spatially varying logistic weights.

In collaboration with Marie-Laure Martin-Magniette (URGV and UMR AgroParisTech/INRA MIA 518) and Cathy Maugis (INSA Toulouse), the variable selection procedure for model-based clustering and supervised classification has been extended to deal with high-dimensional data sets, using a backward selection procedure that is more efficient than the previous forward selection procedure in this context. Moreover, they have shown the advantage of the model-based approach over a geometrical approach for selecting variables for clustering. These variable selection procedures are used in particular for genomics applications, which result from a collaboration with researchers of URGV (Evry Genopole).

Caroline Meynet provided an

From a practical point of view, Caroline Meynet has introduced a procedure to select variables in model-based clustering in a high-dimensional context. To tackle the problem of high dimension, she has proposed to first use the Lasso to select different sets of variables and then to estimate the density with a standard EM algorithm, restricting the inference to the linear space spanned by the variables selected by the Lasso. Numerical experiments show that this method can outperform direct estimation by the Lasso.

Gilles Celeux supervised, in collaboration with Professor Abdallah Mkhadri (University of Marrakesh, Morocco), the thesis of Mohammed El Anbari, which concerns regularisation methods in linear regression. With Abdallah Mkhadri, Mohammed El Anbari proposed a method to simultaneously select variables and favor a grouping effect, where strongly correlated predictors tend to be in or out of the model together. Numerical experiments showed that their method can be preferred to the Elastic-Net when the number of variables is less than or equal to the sample size, and remains competitive otherwise. Moreover, they have proposed AdaGril, an extension of the adaptive Elastic Net which incorporates information redundancy among correlated variables for model selection and estimation. Under weak conditions, they have established an oracle property of AdaGril. Numerical experiments show that AdaGril outperforms its competitors in some cases.

In collaboration with Jean-Michel Marin (Université de Montpellier) and Christian P. Robert (CEREMADE, Université Paris Dauphine), Gilles Celeux and Mohammed El Anbari highlight, through numerical experiments, the interest of Bayesian regularization methods using hierarchical noninformative priors, compared with standard regularization methods, in a poorly informative context.

Clément Levrard worked on obtaining fast rates of convergence for vector quantization. Using theoretical analogies between quantization, seen as an unsupervised learning problem, and supervised learning by empirical contrast minimization, he has obtained a logarithmic improvement on the previously obtained bound. Furthermore, he has been able to define an intelligible "margin type" condition under which fast rates can be obtained.

Since September 2008, Pascal Massart has been the co-supervisor, with Frédéric Chazal (GEOMETRICA), of the thesis of Claire Caillerie (GEOMETRICA). The project intends to explore and develop new research at the intersection of information geometry, computational geometry and statistics.

Unsupervised segmentation is an issue similar to unsupervised classification, with an added spatial aspect. Functional data are acquired at points in a spatial domain and the goal is to segment the domain into homogeneous regions. The range of applications includes hyperspectral images in conservation sciences, fMRI data and all spatialized functional data. Erwan Le Pennec and Lucie Montuelle are focusing on how to handle the spatial component, from both the theoretical and the practical points of view, as well as on the choice of the number of clusters. Furthermore, as functional data require heavy computation, they must propose numerically efficient algorithms.

Gilles Celeux, Christine Keribin and the PhD student Vincent Brault continue their work on the Latent Block Model. They have proposed an efficient algorithm coupling a stochastic version of the EM algorithm, which includes a Gibbs sampling step, with the Variational EM algorithm. This SEM-VEM algorithm is insensitive to its initial position. In addition, they obtained a closed-form expression of the Integrated Completed Likelihood for binary tables, which provides a reliable model selection criterion that avoids asymptotic approximation. Moreover, Christine Keribin derived sufficient conditions ensuring the identifiability of the Latent Block Model.

In the computer experiments field, the goal is to approximate an expensive black box function from a limited number of evaluations. The choice of these evaluations i.e. the choice of a design of (computer) experiments is a major issue.

This year Yves Auffray and Pierre Barbillon, in collaboration with Jean-Michel Marin (Université de Montpellier), have considered estimating the probability of rare events in the context of computer experiments. These rare events depend on the output of a physical model with random input variables. Since the model is only known through an expensive black box function, a crude Monte Carlo estimator does not perform well. Two strategies have been developed to cope with this difficulty: a Bayesian estimate and an importance sampling method. Both methods rely on Kriging metamodeling. They are able to achieve sharp upper confidence bounds on the rare event probability. These methods have been applied to a toy example and to a real case study, which consists of finding an upper bound on the probability that the trajectory of an airborne load collides with the aircraft that has released it.
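The difficulty with crude Monte Carlo, and the benefit of importance sampling, can be seen on a deliberately simple Python example. This sketch does not use Kriging and is not the estimator developed in this work: the "model" is a cheap identity function standing in for the black box, the input is standard Gaussian, and the proposal distribution (a Gaussian centred at the threshold) is an assumption made for the illustration.

```python
import math
import random


def expensive_model(x):
    """Stand-in for a costly black-box simulator (here a cheap identity, by assumption)."""
    return x


def crude_monte_carlo(threshold, n, rng):
    """Fraction of N(0, 1) inputs whose output exceeds the threshold."""
    hits = sum(1 for _ in range(n) if expensive_model(rng.gauss(0.0, 1.0)) > threshold)
    return hits / n


def importance_sampling(threshold, n, rng):
    """Estimate P(model(X) > threshold), X ~ N(0, 1), sampling X from N(threshold, 1)."""
    total = 0.0
    for _ in range(n):
        x = rng.gauss(threshold, 1.0)
        if expensive_model(x) > threshold:
            # Likelihood ratio of the N(0, 1) density to the N(threshold, 1) density at x.
            total += math.exp(-threshold * x + 0.5 * threshold ** 2)
    return total / n


rng = random.Random(1)
threshold, n_evaluations = 4.0, 20000
mc_estimate = crude_monte_carlo(threshold, n_evaluations, rng)
is_estimate = importance_sampling(threshold, n_evaluations, rng)
exact = 0.5 * math.erfc(threshold / math.sqrt(2.0))  # known only because the model is trivial
print("crude Monte Carlo:", mc_estimate)
print("importance sampling:", is_estimate)
print("exact value:", exact)
```

With the same budget of model evaluations, the crude estimator is built from very few exceedances (often none), whereas the importance sampling estimator concentrates the evaluations near the rare event and reweights them.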

Following the work of the first year, Shuai Fu, under the direction of Gilles Celeux, has focused on the design of experiments and its validation, which have become the main issues of the thesis. This leads to both theoretical and computational developments. An original DAC criterion has been proposed and leads to a Bayesian DAC-test procedure to measure the quality of a design. To improve the design of experiments, an adaptive kriging procedure well adapted to the specific problem has been proposed. However, the algorithms require too much computation time, which should be reduced in future work.

In the framework of a CIFRE convention with Snecma-SAFRAN, Rémy Fouchereau has started a thesis, supervised by Gilles Celeux, on the modeling of fatigue damage for Inco718. Inco718 is a nickel-based alloy. To determine its minimum lifetime, many stress tests are performed. The alloy lifetimes are reported as a function of the stress. The aim of this work is to analyse the resulting curves. A mixture model with a lognormal component and a component that is a sum of two lognormals is considered. Since the distribution of a sum of two or more lognormal variables has no closed form, inference on this model requires Monte Carlo integration within the EM algorithm. Despite some instability for small sample sizes, this model shows encouraging and easily interpretable results.
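To illustrate why Monte Carlo integration is needed, the density of a sum S = X + Y of two independent lognormal variables can be written as the expectation of f_Y(s - X) over X, which is easy to approximate by simulation. The Python sketch below shows only this building block, not the EM algorithm of the thesis; the parameter values and function names are hypothetical, and a midpoint-rule quadrature is included purely as a sanity check.

```python
import math
import random


def lognormal_pdf(y, mu, sigma):
    """Density at y of exp(N(mu, sigma^2)); zero for y <= 0."""
    if y <= 0.0:
        return 0.0
    z = (math.log(y) - mu) / sigma
    return math.exp(-0.5 * z * z) / (y * sigma * math.sqrt(2.0 * math.pi))


def sum_lognormal_pdf_mc(s, params_x, params_y, n_draws, rng):
    """Monte Carlo estimate of the density of X + Y at s, as the mean of f_Y(s - X)."""
    mu_x, sigma_x = params_x
    mu_y, sigma_y = params_y
    total = 0.0
    for _ in range(n_draws):
        x = rng.lognormvariate(mu_x, sigma_x)
        total += lognormal_pdf(s - x, mu_y, sigma_y)
    return total / n_draws


def sum_lognormal_pdf_quadrature(s, params_x, params_y, n_points=20000):
    """Midpoint-rule evaluation of the same convolution integral over (0, s)."""
    step = s / n_points
    total = 0.0
    for i in range(n_points):
        x = (i + 0.5) * step
        total += lognormal_pdf(x, *params_x) * lognormal_pdf(s - x, *params_y)
    return total * step


rng = random.Random(0)
params = (0.0, 0.5)  # hypothetical (mu, sigma), used for both components
mc_value = sum_lognormal_pdf_mc(2.5, params, params, n_draws=20000, rng=rng)
quad_value = sum_lognormal_pdf_quadrature(2.5, params, params)
print("Monte Carlo estimate:", mc_value)
print("quadrature check:", quad_value)
```

In one dimension a quadrature is a viable alternative; the simulation-based form is shown because it is the one named in the text and extends directly to sums of more than two terms.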

Andrea Rau and Gilles Celeux, in collaboration with Marie-Laure Martin-Magniette (URGV and UMR AgroParisTech/INRA MIA 518) and Cathy Maugis-Rabusseau (IMT/INSA Toulouse), have developed a method to cluster digital gene expression observations from high-throughput sequencing (HTS) data using Poisson mixture models. The proposed model has the advantage of accounting for the particularities of HTS data and providing straightforward procedures for parameter estimation and model selection. A series of simulation experiments was carried out to compare the performance of the proposed model to that of previously proposed clustering methods for similar sequence-based data, and the performance of the proposed approach was examined on two real high-throughput sequencing data sets. The R package `HTSCluster`, used to implement the proposed Poisson mixture model, has been made freely available on CRAN.

In collaboration with Farouk Mhamdi and Meriem Jaidane (ENIT, Tunis, Tunisia), Jean-Michel Poggi proposed a method for trend extraction from seasonal time series through the Empirical Mode Decomposition (EMD). An experimental comparison of trend extraction based on EMD, X11, X12 and the Hodrick-Prescott filter is conducted. First results show the suitability of the blind EMD trend extraction method. Real Tunisian peak load data are also used to illustrate the extraction of the intrinsic trend.

In collaboration with Mina Aminghafari (Amirkabir University, Teheran), Jean-Michel Poggi used wavelets for statistical forecasting of time series. Recent approaches involve wavelet decompositions in order to handle nonstationary time series. They studied and extended an approach proposed by Renaud et al. to estimate the prediction equation by direct regression of the process on the Haar non-decimated wavelet coefficients depending on its past values. The new variants are used first for stationary data and then for stationary data contaminated by a deterministic trend.
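For readers unfamiliar with the Haar non-decimated transform, the following Python sketch computes such coefficients in a causal way, so that the coefficients at time t depend only on values up to time t and can serve as regressors for forecasting. The recursion used (smooth at the next scale as the average of the current smooth at t and at t - 2^j) is the usual "à trous" Haar scheme as we understand it; the boundary rule and the function names are assumptions of this example, and the regression step itself is not shown.

```python
import math


def haar_a_trous(series, n_levels):
    """Causal non-decimated Haar transform.

    Returns (details, smooth): details[j][t] is the wavelet coefficient at
    level j + 1 and time t, and smooth[t] is the coarsest approximation.
    """
    smooth = list(series)
    details = []
    for j in range(n_levels):
        lag = 2 ** j
        # Boundary rule (an assumption): before time `lag`, reuse the value at time 0.
        new_smooth = [0.5 * (smooth[t] + smooth[max(t - lag, 0)]) for t in range(len(smooth))]
        details.append([s - ns for s, ns in zip(smooth, new_smooth)])
        smooth = new_smooth
    return details, smooth


# Hypothetical series: a sinusoid plus a linear trend.
series = [math.sin(0.3 * t) + 0.1 * t for t in range(32)]
details, smooth = haar_a_trous(series, n_levels=3)

# The decomposition is additive: the series is the smooth plus all detail levels.
reconstruction = [smooth[t] + sum(level[t] for level in details) for t in range(len(series))]
print(max(abs(r - x) for r, x in zip(reconstruction, series)))

# Regressors available at the last time point for a one-step-ahead prediction equation.
features = [level[-1] for level in details] + [smooth[-1]]
print(features)
```

Because no decimation is applied, every level has one coefficient per time point, which is what makes a direct regression on coefficients at the current time possible.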

Jean-Michel Poggi was the supervisor (with A. Antoniadis) of the PhD thesis of Jairo Cugliari-Duhalde, which took place under a CIFRE convention with EDF. It is strongly related to the use of wavelets together with curve clustering in order to perform accurate load consumption forecasting. The thesis develops methodological and applied aspects linked to the electrical context, as well as theoretical ones, by introducing exogenous variables in the context of nonparametric time series forecasting.

This research takes place as part of a collaboration with Neurospin on brain functional Magnetic Resonance Imaging (fMRI) data (http://).

select has a contract with EDF regarding modelling uncertainty in deterministic models.

select has a contract with EDF regarding wavelet analysis of the electrical load consumption for the aggregation and disaggregation of curves to improve total signal prediction.

select has a contract with SAFRAN - SNECMA, a high-technology group (Aerospace propulsion, Aircraft equipment, Defense Security, Communications), regarding modelling the reliability of Aircraft Equipment (collaboration with Patrick Pamphile, Université Paris-Sud).

select runs a working group on model selection and statistical analysis of genomics data with the Biometrics group of Institut National Agronomique Paris-Grignon (INAPG).

Pascal Massart is co-organizing a working group at ENS (Ulm) on Statistical Learning. This year the group focused on regularization methods in regression. Most select members are involved in this working group.

select runs a working group on Classification, Statistics and fMRI imaging with Neurospin.

select runs a working group on Unsupervised Classification with the CMAP (École Polytechnique).

Gilles Celeux and Pascal Massart are members of the PASCAL (Pattern Analysis, Statistical Learning and Computational Learning) network.

Gilles Celeux is one of the co-organizers of the Working Group on Model-Based Clustering.

Gilles Celeux is Editor-in-Chief of *Statistics and Computing*. He is Associate Editor of *CSBIGS* and *La Revue Modulad*.

Pascal Massart is Associate Editor of *Annals of Statistics*, *Confluentes Mathematici*, and *Foundations and Trends in Machine Learning*.

Jean-Michel Poggi is Associate Editor of *Journal of Statistical Software*, *Journal de la SFdS* and *CSBIGS*.

Gilles Celeux was an invited speaker at IFCS 2011 in Frankfurt, at the mixture session of JSM2011 in Miami, at StatSeq 2011 in Toulouse, at the statistical seminar of the Economics department of Vienna University and at the Summer Model-Based Clustering working group in Glasgow.

Jean-Michel Poggi was an invited speaker at SIS 2011, the 46th Scientific Meeting of the Italian Statistical Society, in Bologna, at ENBIS-11 in Coimbra and at the Workshop in honour of Anestis Antoniadis at Villard de Lans.

Gilles Celeux is a member of the CSS of INRA.

Gilles Celeux was Chair of the Chikio Hayashi Awards Committee.

Erwan Le Pennec is a member of the Board of the MAS group of the SMAI (the French counterpart of SIAM).

Erwan Le Pennec and Pascal Massart are members of the C.N.U. (section 26).

Pascal Massart is a senior member of the I.U.F.

Pascal Massart is a member of the scientific council of the French Mathematical Society.

Pascal Massart is a member of the scientific council of the Mathematical Department of the Ecole Normale Supérieure de Paris.

Pascal Massart was a member of the scientific committee of the European Meeting of Statisticians in Piraeus.

Jean-Michel Poggi co-chairs the Probability and Statistics seminar of the "laboratoire de Mathématiques d'Orsay", the ECAIS seminar (Extraction de connaissances : approches informatiques et statistiques) of IUT de Paris 5 Descartes, and the "Séminaire Parisien de Statistique".

Jean-Michel Poggi is Chair of the Program Committee of the «Journées de Statistique de la SFdS», Tunis, May 2011.

Jean-Michel Poggi is President of the French statistical society (SFdS).

Jean-Michel Poggi is a member of the Board of the "Environment group" of the French statistical society (SFdS).

Master: Gilles Celeux, Hidden-structure models, ISUP 3rd year (Université Paris 6), 20 hours

Master: Gilles Celeux, Models for classification, M2 Probability and Statistics, Université Paris Sud, 24 hours

Master: Erwan Le Pennec, Wavelet methods, 24h, Mé, Université Paris Diderot, France

Master: Erwan Le Pennec, Spectral analysis, 18h, M1, Ponts Paristech, France

Master: All the other select members teach various courses at different universities, in particular in the M2 “Modélisation stochastique et statistique” of University Paris-Sud.

PhD & HdR :

PhD : Jairo Cugliari Duhalde, Prévision d’un processus à valeurs fonctionnelles. Application à la consommation d’électricité, 22/11/2011 at Paris XI Orsay, J.-M. Poggi and Anestis Antoniadis (Univ. Joseph Fourier, Grenoble)

PhD : Robin Genuer, Forêts aléatoires : aspects théoriques, sélection de variables et applications, 24/11/2010 at Paris XI Orsay, J.-M. Poggi

PhD in progress: Vincent Brault, 2011, Gilles Celeux and Christine Keribin

PhD in progress: Claire Caillerie, 2008, Pascal Massart and Frédéric Chazal

PhD in progress: Rémi Fouchereau, 2011, Gilles Celeux

PhD in progress: Shuai Fu, 2010, Gilles Celeux

PhD in progress: Clément Levrard, 2009, Pascal Massart and Gérard Biau (UPMC)

PhD in progress: Caroline Meynet, 2009, Pascal Massart

PhD in progress: Lucie Montuelle, Sélection de modèles et mélange de gaussiennes en imagerie hyperspectrale, 01/10/2011, Erwan Le Pennec

PhD in progress: Nelo Molter Magalães, 2011, Pascal Massart