Section: New Results

Statistical analysis of genomic data

Participants : Vincent Brault, Gilles Celeux, Mélina Gallopin, Christine Keribin, Yann Vasseur.

Mélina Gallopin is a third-year PhD student supervised by Gilles Celeux, in collaboration with Florence Jaffrezic and Andrea Rau (INRA, animal genetics department). Her thesis concerns modelling and model selection in the analysis of RNA-seq data. This year, they proposed a model selection criterion for model-based clustering of annotated gene expression data. This criterion is an ICL-like criterion that takes the annotations into account. They are also working on an objective comparison of discrete modelling and of continuous modelling after a transformation of the RNA-seq data, based on a comparison of the likelihoods (possibly penalized) of the competing models.
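To make the "ICL-like" terminology concrete, the sketch below computes the standard BIC and the usual BIC-type approximation of ICL for a fitted mixture model, on the log-likelihood scale (larger is better). It is only an illustration of the generic criteria: the annotation-aware criterion proposed in the thesis is not reproduced here, and the function names and arguments are hypothetical.

```python
import math


def bic(log_likelihood, n_params, n_obs):
    """BIC on the log-likelihood scale: log L - (nu / 2) * log(n)."""
    return log_likelihood - 0.5 * n_params * math.log(n_obs)


def classification_entropy(posteriors):
    """Entropy of the soft assignment: -sum_i sum_k t_ik * log(t_ik)."""
    total = 0.0
    for row in posteriors:
        for t in row:
            if t > 0.0:  # 0 * log(0) is taken as 0
                total -= t * math.log(t)
    return total


def icl(log_likelihood, n_params, posteriors):
    """BIC-type approximation of ICL: BIC minus the classification entropy.

    `posteriors` holds one row per observation and one column per cluster
    (conditional probabilities of cluster membership).
    """
    n_obs = len(posteriors)
    return bic(log_likelihood, n_params, n_obs) - classification_entropy(posteriors)
```

With a crisp partition (all membership probabilities equal to 0 or 1) the entropy term vanishes and ICL coincides with BIC; overlapping clusters are penalized more heavily by ICL.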

The PhD thesis of Yann Vasseur, supervised by Gilles Celeux and Marie-Laure Martin-Magniette (INRA URGV), concerns the inference of a regulatory network of transcription factors (TFs), which are specific genes, of Arabidopsis thaliana. For this purpose, a transcriptome dataset is available in which the number of TFs and the number of statistical units are roughly equal. The first aim is to reduce the dimension of the network in order to avoid the difficulties of high dimension. Representing this network with a Gaussian graphical model, the following procedure has been defined:

  1. Selection step: choosing the set of regulator TFs (the support) of each TF.

  2. Classification step: deducing groups of co-factors (TFs with similar expression levels) from these supports.

The reduced network would then be built on the co-factor groups. So far, several selection methods based on Gauss-LASSO and resampling procedures have been applied to the dataset. The study of the stability of these methods and of the calibration of their parameters is in progress. The TFs are clustered with the Latent Block Model into a number of co-factor groups selected with the BIC or the exact ICL criterion.
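As an illustration of the selection step, here is a minimal sketch of a Gauss-LASSO fit for one target TF, assuming that Gauss-LASSO means a LASSO regression used to select the support, followed by an ordinary least-squares refit restricted to that support. The function names, the coordinate-descent solver, the fixed penalty `lam` and the absence of any resampling are assumptions made for the example; this is not the code used in the thesis.

```python
def soft_threshold(z, t):
    """Soft-thresholding operator used by the LASSO coordinate update."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, y, lam, n_iter=200):
    """Minimise (1 / 2n) * ||y - X b||^2 + lam * ||b||_1 by coordinate descent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta, with beta = 0 at the start
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the residual that ignores column j.
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    m = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(m):
        piv = max(range(c, m), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, m):
            f = M[r][c] / M[c][c]
            for k in range(c, m + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * m
    for r in range(m - 1, -1, -1):
        x[r] = (M[r][m] - sum(M[r][k] * x[k] for k in range(r + 1, m))) / M[r][r]
    return x


def gauss_lasso(X, y, lam):
    """Return (support, coefficients): LASSO selection, then least-squares refit."""
    beta = lasso_cd(X, y, lam)
    support = [j for j, b in enumerate(beta) if b != 0.0]
    refit = [0.0] * len(beta)
    if support:
        n = len(X)
        # Normal equations restricted to the selected columns.
        A = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in support] for a in support]
        rhs = [sum(X[i][a] * y[i] for i in range(n)) for a in support]
        for j, coef in zip(support, solve(A, rhs)):
            refit[j] = coef
    return support, refit
```

In a network setting, `y` would hold the expression of one TF and the columns of `X` the expression of the other TFs, so that the returned support plays the role of the candidate regulators of that TF. The refit removes the shrinkage bias of the LASSO coefficients on the selected variables.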

In collaboration with Marie-Laure Martin-Magniette, Cathy Maugis and Andrea Rau, Gilles Celeux studied gene expression data obtained from high-throughput sequencing technology. They focus on the question of clustering digital gene expression profiles as a means to discover groups of co-expressed genes. They propose a Poisson mixture model with a rigorous framework for parameter estimation and for the choice of the appropriate number of clusters. They illustrate co-expression analyses using this approach on two real RNA-seq datasets. A set of simulation studies also compares the performance of the proposed model with that of several related approaches developed to cluster RNA-seq or serial analysis of gene expression data. The proposed method is implemented in the open-source R package HTSCluster, available on CRAN.
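The idea of clustering count profiles with a Poisson mixture can be sketched with a small EM algorithm. The example below is a deliberately simplified illustration in Python: each cluster has its own Poisson mean per sample, the samples are treated as independent given the cluster, and the number of clusters is fixed. It does not reproduce the parameterization, the normalization or the interface of HTSCluster, and every name in it is hypothetical.

```python
import math


def poisson_logpmf(y, mu):
    """Log-probability of the count y under a Poisson distribution of mean mu."""
    return y * math.log(mu) - mu - math.lgamma(y + 1)


def em_poisson_mixture(counts, init_labels, n_clusters, n_iter=50):
    """Fit a Poisson mixture by EM, starting from a hard initial partition.

    counts: one row per gene, one column per sample.
    Returns (weights, means, responsibilities, log_likelihood).
    """
    n, d = len(counts), len(counts[0])
    resp = [[1.0 if init_labels[i] == k else 0.0 for k in range(n_clusters)]
            for i in range(n)]
    for _ in range(n_iter):
        # M-step: mixing proportions and per-cluster, per-sample means.
        weights, means = [], []
        for k in range(n_clusters):
            nk = max(sum(resp[i][k] for i in range(n)), 1e-10)
            weights.append(nk / n)
            means.append([
                max(sum(resp[i][k] * counts[i][j] for i in range(n)) / nk, 1e-10)
                for j in range(d)
            ])
        # E-step: posterior membership probabilities, computed in log space.
        loglik = 0.0
        for i in range(n):
            logp = [
                math.log(weights[k])
                + sum(poisson_logpmf(counts[i][j], means[k][j]) for j in range(d))
                for k in range(n_clusters)
            ]
            m = max(logp)
            lse = m + math.log(sum(math.exp(v - m) for v in logp))
            resp[i] = [math.exp(v - lse) for v in logp]
            loglik += lse
    return weights, means, resp, loglik
```

Running this for several values of `n_clusters` and comparing a penalized criterion computed from the returned log-likelihood would be one way to choose the number of clusters, in the spirit of the model selection step described above.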