Our research domain is statistics. In the last decades, statistical methodology has received a lot of contributions. Many different methods and algorithms are available in current softwares of statistical learning. The user of these methods is facing the problem of choosing a relevant method for its data set and objective. The model selection problem is an important but difficult problem from both theoretical and practical point of views. Classical criteria of models selection, based on often unrealistic assumptions, are penalized minimum contrast criteria with fixed penalties. select is aiming to provide efficient model selection criteria with data driven penalty terms. In this context, select is expecting to improve the toolkit of statistical model selection criteria from both theoretical and practical aspects. Currently, select is focusing its effort on variable selection in regression problems, abrupt changes detection, hidden structure models and supervised classification. Its domain of application concern reliability, curves classification, phylogeny analysis and classification in genetics.
We learned from the applications we treated that some assumptions which are currently used in asymptotic theory for model selection are often irrelevant in practice. For instance, it is not realistic to assume that the target belongs to the family of models in competition. Moreover, in many situations, it is useful to make the size of the model depends on the sample size which make the asymptotic analysis breakdown. An important aim of select is to propose model selection criteria which takes these practical constraints into account.
An important purpose of select is to build and analyze penalized log-likelihood model selection criteria efficient when the number of models in competition grows to infinity with the number of observations. Concentration inequalities are a key tool for that purpose and lead to propose data-driven penalty choices strategies. A major issue of select consists of deepening the analysis of data-driven penalties both from the theoretical and the practical side. There is no universal way of calibrating penalties but there are several different rather general ideas that we want to develop, including heuristics derived from the Gaussian theory, special strategy for variable selection and making use of resampling methods , .
The change-point problem is important in many applications, and has been well-studied for more than forty years. We are focusing on the a posteriori problem which consists of recovering the configuration of change-points using the whole observed series. This problem has many potential applications. It can have a central role in the analysis of seismic signals, ECG or EEG signals for instance. We are developing a procedure is aiming to detect all the change points simultaneously by minimizing a penalized contrast. Different contrast functions can be used according to the problem at hand. But, the difficult problem which remains partly open is to define and analyze proper data driven strategies for calibrating the penalty. Some calibration strategies have already been proposed, But there is still some hard work to be done in order to validate those methods.
Choosing a model is not only a difficult problem from the theoretical point of view. Model selection criteria have
been conceived to answer the difficulty that the data probability distribution P is unknown. But, beyond technical
difficulties which can occur when choosing a model, it can be fruitful to take into account the purpose of the
model user to get reliable and useful models for statistical description or decision tasks. As noticed earlier,
most of standard model selection criteria are assuming that P is belonging to one of the considered models
without considering the modelling purpose. This point of view would be useful not only from the
practical point of view, but also it could help to avoid or overcome theoretical difficulties.
Moreover, taking into account the modelling purpose would produce flexible model selection
criteria with data-driven penalties . This point of view can be expected to be useful
in supervised Classification and hidden structure models. Finally, it is worth to mention that an alternative Bayesian
approach for taking the modelling purpose into account can
be expected to be useful in that setting.
select aims to produce methodological contributions in statistics. For this very reason, the members of select are involved in applications. We are considering that applications are important to provide us interesting practical problems for which there is the need of innovative methodologies. Most of the applications we are involved concern conventions with industrial partners (for instance our activities in reliability), and some of them concern more academic collaborations (as our activity in phylogeny).
An increasing interest is now evident in the field of classification and regression for complex data as curves, functions, spectra, time series and so on. Such questions naturally arise when each observation consists of values of explanatory variables which are not scalar valued but of functional nature. Classical questions widely examined in Data Analysis are now revisited to take into account and to take advantage (if possible) of the functional nature of the data and to define original strategies , . Such questions are now related to a well identified domain called functional data analysis. Various applied problems strongly motivate this interest like longitudinal studies, analysis of fMRI data, spectral calibration, ....
We are focusing on classification problems with a particular emphasis on clustering (unsupervised classification) ones. Of course, in addition to classical questions like the choice of the number of clusters, the choice of the norm or pseudo-norm to measure the distance between two observations, the choice of an observation or a point to reduce a cluster to the most representative observation and so on, a crucial problem naturally arises: due to the functional nature of the data, the computational effort needed is quickly huge and efficient algorithm as well as anytime algorithms are of interest.
An important theme that select considers is aging modelling. This research is done thanks to a convention with EDF-DER Fiabilité des Composants et Structures group. Most of the French nuclear park is approaching forty years which is the warranty age of good running. EDF is interested to examine the possible extension of use of nuclear material components beyond forty years and has planned studies to analyze durability of nuclear components and aging mastership. The collaboration of select with EDF takes place in this framework .
The other theme of research in which select is involved concern changes in a reliability process. It comes from a convention with Altis firm. During the last five years, Altis has drastically changed its production process of chips. Indeed half of the production is nowadays made with brass connexions instead of aluminium connexions. This makes the usual reliability model irrelevant. Some abrupt change of the reliability behavior is suspected. We are working on the selection of a good model fitting data.
Phylogeny is concerned with designing evolutionary trees between species from aligned nucleotide sequences. More
precisely, a nucleotide sequence being an ordered set of sites taking value in a finite
set E (for instance, E = {A, C, G, T}), the problem is to reconstruct the topology of the evolutionary tree
between the species from aligned sequences for the considered species, and to estimate the tree parameters
(branches length) as well as the parameters of the evolutionary model.
The model that we consider is the covarion model. For this model, a site can change of behavior along the evolutionary tree according to two hidden states, active (ON) or nonactive (OFF). We are working to elucidate identifiability conditions and to analyze those conditions from the practical viewpoint. In this research, we are also interested to compare non nested models .
mixmod is developed with Christophe Biernacki, Florent Langrognet (Université de Franche-Comté) and Gérard Govaert (Université de Technologie de Compiègne). mixmod (mixture modelling) software fits mixture models to a given data set with either a clustering or a discriminant analysis purpose. A large variety of algorithms to estimate the mixture parameters are proposed (EM, Classification EM, Stochastic EM) and it is possible to combine them to lead to different strategies in order to get a sensible maximum of the likelihood (or completed likelihood) function. Moreover, different information criteria for choosing a parsimonious model (the number of mixture component, for instance), some of them favoring either a cluster analysis or a discriminant analysis view point, are included. Many Gaussian models for continuous variables and multinomial models for discrete variable are available. can be considered according to different assumptions on the component variance matrix eigenvalue decomposition. Written in C++, mixmod is interfaced with Scilab and Matlab. The software, the statistical documentation and also the user guide are available on the internet at the following address http://www-math.univ-fcomte.fr/mixmod/index.php.
Joint work with Stéphane Boucheron (LRI, Orsay), Olivier Bousquet (Max Planck Institute, Tuebingen) and Gabor Lugosi (Pompeu Fabra, Barcelona). New inequalities for functions of independent random variables have been designed. They prove to be a versatile tool in a wide range of applications. In particular, the Talagrand's exponential inequality for Rademacher chaos of order two has been generalized to any order. Applications for other complex functions of independent random variables, such as suprema of Boolean polynomials which include, as special cases, subgraph counting problems in random graphs have been considered .
Guillaume Bouchard and Gilles Celeux have proposed a new criterion, the so-called Bayesian Entropy Criterion (BEC), to select a classification model taking into account the decisional purpose of a model by minimizing the integrated classification entropy. It provides an interesting alternative to the cross validated error rate which is highly time consuming. The asymptotic behavior of BEC criterion has been studied. Numerical experiments on both simulated and real data sets show that BEC is performing better than the classical BIC criterion to select a model minimizing the classification error rate .
In collaboration with Lucien Birgé (Paris 6), Pascal Massart has analyzed in a precise way what kind of penalties should be used in order to perform model selection via the minimization of a penalized least squares criterion within some general Gaussian framework .
Marie Sauvé and Christine Tuleau are studying a variable selection procedure based on CART in the Gaussian regression framework and in the classification framework. This CART (Classification And Regression Trees) algorithm is a popular algorithm which builds a piecewise constant estimator of a regression function or a classifier from a training sample of observations .
Jean-Michel Poggi and Christine Tuleau are designing classification rules using CART, wavelet-based compression and denoising for measuring the comfort of driving. It a an applied study supported by Renault.
In collaboration with Gilles Blanchard and Régis Vert (Orsay) Pascal Massart and Laurent Zwald have introduced a new kernel algorithm for pattern recognition. They started from a study of the regularization properties of Kernel Principal Component Analysis (KPCA) within the classification framework. KPCA has been previously used as a pre-processing step of support vector machine (SVM) but this method is somewhat redundant from a regularization point of view and they propose a new algorithm called Kernel Projection Machine to avoid this redundancy, based on an analogy with the statistical framework of regression for a Gaussian white noise model. Preliminary experimental results show that this algorithm reaches the same performances as SVM .
Moreover in collaboration of Gilles Blanchard and Olivier Bousquet, Laurent Zwald has studied the properties of the eigenvalues of Gram matrices in a non-asymptotic setting. Using local Rademacher averages, they provide data-dependent tight bounds for their convergence toward eigenvalues of the corresponding kernel operator. They perform these computations in a functional analytic framework which allows to deal implicitly with reproducing kernel Hilbert spaces of infinite dimension. This can have applications to various kernel algorithms. Focusing on KPCA they get sharp excess risk bounds for the reconstruction error .
In collaboration with Bill Triggs (team LEAR, Inria), Guillaume Bouchard has proposed and studied a method providing a compromise between a generative classifier (modelling the joint distribution of the groups and the descriptive variables) and a discriminative classifier (modelling the conditional distribution of the groups knowing the descriptive variables) .
A methodology for model selection based on a penalized contrast has been developed by Marc Lavielle. This methodology is applied to the change-point problem, for estimating the number of change points and their location. An adaptive choice of the penalty function has been proposed for automatically estimating the number of change points. In a Bayesian framework, the posterior distribution of the change-point sequence as a function of the penalized contrast has been defined. Monte Carlo Markov chains (MCMC) procedures are available for sampling this posterior distribution. The parameters of this distribution are estimated with a stochastic version of EM algorithm (SAEM) .
Moreover, in collaboration with researchers of INA, this methodology has been used for analyzing Micro-array-CGH experiments which aim at detecting and mapping chromosomal imbalances, by hybridizing targets of genomic DNA from a test and a reference sample .
On an other hand, a preliminary effort with the training course of Romain Goutaland has been done to transform the DCPC (Detection of Changes using Penalized Contrasts) procedure in a general software for multiple change points detection. Different contrast functions has been considered in the model of software coded in C++ by Romain Goutaland.
In collaboration with Henri Bertholon (CNAM, Paris), Nicolas Bousquet and Gilles Celeux have proposed a simple competing risk distribution as a possible alternative to the Weibull distribution in lifetime analysis. This distribution is the minimum between exponential and Weibull distributions. The motivation was to take account of both accidental and aging failures in lifetime data analysis. The estimation of the parameters of this distribution are considered through maximum likelihood and Bayesian inference. Decision tests to choose between an exponential, Weibull and this competing risk distribution have been proposed , .
In the framework of a convention with EDF, they have investigated the ability of adaptive importance sampling schemes to lead to efficient estimators of posterior reliability distributions from highly censored data. And, they have proposed several strategies for eliciting expert opinions to deal with informative Bayesian inference in a proper way for competing risk models involving Weibull distributions.
In the framework of a convention with Altis and in collaboration of Patrick Pamphile (Orsay), Marc Lavarde and Pascal Massart have adapted and applied the penalized model selection criterion of Birgé-Massart (cf. ) for an accelerated lifetime test problem.
Christine Kéribin has developed The PMCov package which is dedicated to estimate the branch lengths and topological parameters of a covarion model, when the topology is fixed. Attention has been particularly taken in testing the validity of the program. A statistical test using simulations will be soon proposed in order to test a non covarion against a covarion model.
select has a convention with EDF regarding durability of nuclear components and aging mastership.
select has a convention with Altis (Cifre grant) regarding accelerated lifetime tests in the production process of chips.
The thesis of Christine Tuleau is supported by Renault and the thesis of Marie Sauvé is supported by Rhodia.
select is animating a working group on model selection and statistical analysis of genomics data with the biometrics group of Institut Agronomique Nationale (INAPG).
This ACI started in September 2004. Partners of ACI MIST-R are CNRS (section 01), INRIA (select and TAO teams), Paris-Sud University (LRI and mathematical department), University Paul Sabatier of Toulouse (laboratory of statistics et probability) and INRETS (laboratory GRETIA). The coordinator is Jean-Michel Loubes.
MIST-R is concerned with cars traffic prevision. The statistical methods select planned for this purpose are curves classification, mixture analysis of semi parametric distributions, distance tables analysis and variables selection procedures.
This ACI started in September 2003. Partners of ACI DataHighDim are laboratory CLIPS of UJF and laboratory LIS, INPG in Grenoble, select team of INRIA, laboratory DICE, UCL in Louvain la Neuve and laboratory LDG, CEA Bruyères le Châtel. DataHighDim is concerned with exploratory and decisional analysis in high dimensions. This year three meetings of the group has been organized. The first meeting in Grenoble has been entirely devoted to the presentation by Gilles Celeux of probabilistic models for the classification of distances tables and the Bayesian estimation of those models.
Gilles Celeux and Pascal Massart are participants of the PASCAL (Pattern Analysis, Statistical Learning and Computational Learning) network.
Pascal Massart is associated editor of Annales de l'IHP, Journal of the European Mathematical Society, Journal de la SFDS and ESAIM Proceedings.
Gilles Celeux has been plenary invited speaker of IFCS2004 in Chicago. Pascal Massart has been plenary invited speaker of the meeting Mathematical foundations of statistical learning in Barcelona and invited to the 4thECM meeting in Stockholm. He gave a series of five conferences on model selection in Hilversum (The Netherlands).
Guillaume Bouchard received Gold of best Ph. D. Student thesis for his work in .
Pascal Massart is responsible of the M2 ``Modélisation stochastique et statistique'' of Orsay. All the select members are teaching in various courses of different universities.