modal is a team focused on statistical methodology for data analysis (clustering, visualization) and learning (classification, density estimation). In this context, the core of the team's work is to design meaningful generative models for prominent complex data (heterogeneous structured data), which are still largely ignored in the literature. Application domains are numerous (credit scoring, marketing, ...), but modal favours applications related to biology and medicine (see Section ). Team members already have experience in these directions, with complementary skills.

The team's scientific objectives are split into two main methodological directions: generative model design (see Section ) and data visualization through such models (see Section ). In each case, several means of dissemination are considered towards academic and/or industrial communities: publications in international journals (in statistics or biostatistics), workshops to raise or identify emerging topics, and publicly available software implementing the proposed new methodologies.

Since November 2011, the team has been developing the co-clustering module of the MIXMOD software, which makes it possible to fit efficient and parsimonious generative models to huge data sets (see Section ).

The first objective of modal is to design, analyze, estimate and evaluate new generative parametric models for multivariate and/or heterogeneous data. This typically means continuous and categorical data, but also other widespread types such as ordinal, functional and rank data. The designed models must account for potential correlations between variables while being (1) justifiable and realistic, (2) meaningful and parsimoniously parameterized, and (3) of low computational complexity. The main purpose is to identify a few theoretical and general principles for model generation that depend only loosely on the nature of the variables. In this context, we propose two concurrent approaches which should be general enough to deal with correlations between many types of homogeneous or heterogeneous variables:

Design general models by combining two extreme models (fully dependent and fully independent) which are well defined for most variables;

Use kernels as a general way to deal with multivariate and heterogeneous variables.
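As a toy illustration of the first approach, the sketch below (hypothetical, not the team's actual model) mixes a fully independent and a fully dependent distribution for two binary variables through a convex combination; the weight tunes the position between the two extremes.

```python
import numpy as np

def indep_pmf(p1, p2):
    """Joint pmf of two independent Bernoulli variables, as a 2x2 table."""
    return np.outer(np.array([1 - p1, p1]), np.array([1 - p2, p2]))

def dep_pmf(p):
    """Fully dependent extreme: both variables always take the same value."""
    return np.diag([1 - p, p])

def combined_pmf(w, p1, p2, p):
    """Convex combination tuning between independence and full dependence."""
    return w * dep_pmf(p) + (1 - w) * indep_pmf(p1, p2)

pmf = combined_pmf(0.3, 0.4, 0.6, 0.5)
assert np.isclose(pmf.sum(), 1.0)  # still a valid distribution
```

At weight 0 the model reduces to the independent table, at weight 1 to the degenerate dependent one; intermediate weights give a parsimonious intermediate dependence.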

The second objective of modal is to propose meaningful and fairly accurate low-dimensional visualizations of data, typically in two-dimensional (2D) space, less frequently in one-dimensional (1D) or three-dimensional (3D) spaces, by using the generative models designed in the first objective. We also propose to visualize the data and the model simultaneously. All visualizations will depend on the aim at hand (typically clustering, classification or density estimation). The main originality of this objective lies in the use of models for visualization, a strategy from which we expect better control of the subjectivity necessarily induced by any graphical display. In addition, the proposed approach has to be general enough to be independent of the variable nature. Note that the visualization objective is consistent with the dissemination of our methodologies through specific software: displaying data is an important step in the data analysis process.

Potential application areas of statistical modeling for heterogeneous data are extensive, but some particular areas have been identified. For historical reasons, and considering the background of the team members, modal focuses mainly on *biological applications*, where high-throughput technologies open new challenges. In addition, other secondary application areas are considered, such as *industry*, *retail*, *credit scoring* and *astronomy*.

Several contacts and collaborations are already established with partners in these application areas; they are described in Sections and

MIXMOD (MIXture MODelling) is the core software of the modal team for two reasons. First, MIXMOD covers the main topics of modal, since it is devoted to model-based supervised, unsupervised and semi-supervised classification for various data situations. Second, MIXMOD is now widely distributed software: over 250 downloads per month have been recorded for several years. Consequently, MIXMOD will be the main software for diffusing future methodological advances of the modal team.

MIXMOD is written in C++ (more than 10 000 lines), currently interfaced with Scilab and Matlab, and distributed under the GNU General Public License. An interface between MIXMOD and R is being developed by Rémi Lebret and will be available soon (during 2012).

Several other institutions have participated in the MIXMOD development for several years: CNRS, INRIA Saclay-Île-de-France, Université de Franche-Comté and Université Lille 1. The software already benefits from several APP depositions and has also led to international publications:
*Model-Based Cluster and Discriminant Analysis with the*
mixmod
*Software*, Computational Statistics and Data Analysis, Vol. 52, no 2, 587–600, 2006

In addition, an INRIA ADT grant (Parmeet Bhatia) will also support the development of co-clustering models for continuous, binary and discrete data. It is a strategic development for MIXMOD since it offers the ability to structure very large data tables in both rows and columns for different data types. In particular, it opens up wide potential applications in biology, marketing,
*etc.*

Serge Iovleff is the main supervisor of the software engineers recruited for all the previously described tasks. More information about MIXMOD can easily be found on its web page
http://

The AAM program is an R library implementing Auto-Associative models. Thus, with little work, it could be transformed into an R package. As AAM is a statistical model, the R language is well suited for diffusion within the scientific community. It is a prototype for testing AAM models against other kinds of non-linear PCA models.

The first release was a Scilab program written by Serge Iovleff and Stéphane Girard. It was rewritten in January 2009 and the code is now faster and produces enhanced graphics. The 2009 release is the result of joint work between Serge Iovleff and an M1 intern from the ENS.

More information on the web site
http://

Computation of the local FDR: an R package for biostatisticians that estimates the FDR and the local FDR by kernel density estimation. The package also handles truncated data and can take supervision into account. More information on the website
http://
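The principle behind the local FDR can be sketched as follows (a simplified illustration with a plain Gaussian kernel and a theoretical N(0,1) null; the package's actual estimator, including its handling of truncation and supervision, differs):

```python
import numpy as np

def gaussian_kde(samples, grid, bandwidth):
    """Plain-numpy kernel density estimate evaluated on `grid`."""
    diffs = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * diffs**2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

def local_fdr(z, pi0=1.0, bandwidth=0.3):
    """lfdr(z) = pi0 * f0(z) / f(z), clipped to [0, 1]: the posterior
    probability of being a null, given the observed score."""
    f = gaussian_kde(z, z, bandwidth)                 # marginal density
    f0 = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)     # N(0,1) null density
    return np.clip(pi0 * f0 / f, 0.0, 1.0)

rng = np.random.default_rng(0)
z = np.concatenate([rng.normal(0, 1, 900),    # null scores
                    rng.normal(4, 1, 100)])   # true signals
lfdr = local_fdr(z, pi0=0.9)
# Scores far from 0 get a small lfdr: they are unlikely to be nulls.
```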

metaMA is specialised software for microarrays. It is an R package which combines either p-values or modified effect sizes from different studies to find differentially expressed genes. The main competitor of metaMA is geneMeta. Compared with geneMeta, metaMA offers an improvement for small-sample-size datasets, since the corresponding modelling is based on shrinkage approaches.
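For intuition, the two classical combination schemes that such meta-analysis packages build on can be sketched as follows (simplified: equal study weights, no shrinkage of effect sizes):

```python
import math
from statistics import NormalDist

def fisher_combine(pvalues):
    """Fisher's method: -2 * sum(log p) ~ chi-square with 2k df under H0."""
    return -2.0 * sum(math.log(p) for p in pvalues)

def stouffer_combine(pvalues):
    """Inverse-normal (Stouffer) method: average the per-study z-scores."""
    nd = NormalDist()
    k = len(pvalues)
    return sum(nd.inv_cdf(1.0 - p) for p in pvalues) / math.sqrt(k)

# Three studies, each only moderately significant on its own,
# combine into much stronger evidence against H0.
z = stouffer_combine([0.04, 0.03, 0.05])
chi2 = fisher_combine([0.04, 0.03, 0.05])
```

A combined z above the usual 1.645 one-sided threshold shows how consistent moderate signals across studies accumulate.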

Guillemette Marot is the main contributor and the maintainer of this package, and spent around one year full time on it, covering conception, implementation and documentation. Her PhD advisors (Florence Jaffrézic, Claus-Dieter Mayer, Jean-Louis Foulley) helped her with the conception, but she implemented the code alone.

First versions were posted to the CRAN, the official R package repository, in 2009. New versions of this package were released in August 2011 in order to take into account remarks from the main users (biologists and biostatisticians analysing gene expression data). This software is routinely used by biologists from INRA, Jouy-en-Josas (it has been included in a local analysis pipeline), but its diffusion on the CRAN makes it available to a wider community, as attested by the publications citing the software
*Moderated effect size and p-value combinations for microarray meta-analyses*, in "Bioinformatics", 2009, vol. 25, no 20, p. 2692–2699.

More information is available on the website
http://

STK++ is a multi-platform toolkit written in C++ for creating fast and easy-to-use data mining programs. It offers a large set of templated C++ classes suitable for projects ranging from small one-off programs to complete statistical application suites. A C equivalent would be the GSL; however, STK++ is developed in C++ in order to combine speed and reusability.

As the aim of STK++ is to aid developers in new developments, it essentially provides interface classes and various concrete helper classes, such as arrays, numerical methods (QR, SVD), input and output (CSV files), random number generators, etc. For instance, part of the project will be integrated into the co-clustering project (in the MIXMOD software, see Section ) currently developed by Parmeet Bhatia.

The software has been regularly developed by Serge Iovleff for 10 years and is a work in progress. More information is available on the website
http://

SMVar is specialised software for microarrays. This R package implements the structural model for variances in order to detect differentially expressed genes from gene expression data. It performs gene expression differential analysis based on a particular variance modelling. Its main competitor is the Bioconductor R package limma, but limma assumes a common variance between the two groups being compared, while SMVar relaxes this assumption.
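The point of relaxing the common-variance assumption can be illustrated with a Welch-type statistic, which estimates one variance per group instead of pooling them (a deliberate simplification; SMVar's structural model is richer):

```python
import math

def welch_statistic(x, y):
    """Two-sample statistic with separate variance estimates per group:
    (mean_x - mean_y) / sqrt(var_x/n_x + var_y/n_y)."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    return (mx - my) / math.sqrt(vx / nx + vy / ny)

# Groups with very different spreads: a pooled variance would
# misstate the scale of the difference between means.
t = welch_statistic([5.1, 5.3, 4.9, 5.2], [3.0, 7.0, 1.0, 9.0])
```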

Guillemette Marot is the main contributor and the maintainer of this package, and spent around one year full time on it, covering conception, implementation and documentation. Her PhD advisors (Florence Jaffrézic, Claus-Dieter Mayer, Jean-Louis Foulley) helped her with the conception, but she implemented the code alone. She received some help from Anne de la Foye (INRA, Clermont-Ferrand) to correct bugs in the first versions.

First versions were posted to the CRAN, the official R package repository, in 2009. New versions of this package were released in August 2011 in order to take into account remarks from the main users (biologists and biostatisticians analysing gene expression data). This software is routinely used by biologists from INRA, Jouy-en-Josas (it has been included in a local analysis pipeline), but its diffusion on the CRAN makes it available to a wider community, as attested by the publications citing the software

More information on the website
http://

Defining generative models for dealing with possibly correlated categorical variables is at the core of the modal activity. We start by noticing that it is straightforward to build a fully independent distribution

Our idea is to combine both extreme distributions into a model
*(i)* which is an intermediate dependent situation between full independence and full dependence and
*(ii)* which is not degenerate. As a consequence,

In addition, since both

A PhD thesis started in October '11 on this topic, in continuation of the Master's thesis of Matthieu Marbac-Lourdelle .

In many situations, one needs to cluster several datasets, possibly arising from different populations, instead of a single one, into partitions with identical meaning described by similar features. Such situations commonly involve two kinds of standard clustering processes: the samples are traditionally clustered either as if all units arose from the same distribution or, on the contrary, as if the samples came from distinct and unrelated populations. But a third situation should be considered: since the datasets share statistical units of the same nature, described by features with the same meaning, there may exist some link between the samples.

We propose a linear stochastic link between the samples, which can be justified from some simple but realistic assumptions, both in the Gaussian and in the
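A minimal numerical illustration of such a linear stochastic link (hypothetical parameters, and a strong simplification of the actual model): units of the second sample behave like affinely transformed units of the first, so parameter estimates transfer across samples.

```python
import numpy as np

rng = np.random.default_rng(1)
a, b = 2.0, 1.0                                 # assumed link coefficients
x1 = rng.normal(5.0, 1.5, size=5000)            # population 1
x2 = a * x1 + b + rng.normal(0, 0.1, 5000)      # linked population 2

# The link transfers Gaussian parameter estimates from sample 1
# to sample 2 without refitting sample 2 from scratch:
mu2_pred = a * x1.mean() + b                    # predicted mean of pop. 2
sigma2_pred = abs(a) * x1.std()                 # predicted std of pop. 2
```

In practice the link coefficients themselves are estimated jointly with the clustering, which is what makes the shared structure between samples exploitable.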

A book chapter about transfer learning (including clustering, classification and regression) is currently submitted for publication (joint work with Julien Jacques and Alexandre Lourme).

Genome-wide association (GWA) studies have proved the implication of numerous single nucleotide polymorphisms (SNPs) in the etiology of common diseases. Nevertheless, only a small part of the expected heritability of those diseases is explained by the most significantly associated SNPs. Much recent research investigating this missing heritability has considered interactions between genes and/or environmental factors as a plausible and promising explanation. Considering all, or at least a large number (hundreds of thousands), of variants altogether raises a high-dimensionality problem that most regression-based methods cannot handle. To solve this issue, one either reduces the number of variants to be analyzed (shrinkage approaches) or groups them according to a certain similarity. We introduce here a regression model that simultaneously clusters the variants sharing close effect sizes while selecting the most informative clusters. The model parameters are estimated by maximizing the likelihood. The challenges of this research lie in finding efficient algorithms for the clustering part while studying the consistency of our estimators, for which the classical asymptotic theory does not apply , .
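A toy sketch of the grouping idea (with a hypothetical binning rule; the actual model performs clustering and selection inside the likelihood): variants with similar preliminary effect estimates are merged into one grouped covariate, collapsing a p-dimensional regression into a much smaller one.

```python
import numpy as np

def group_by_effect(X, beta_hat, edges):
    """Merge columns of X whose preliminary effect estimates fall in
    the same bin; returns the reduced design matrix."""
    bins = np.digitize(beta_hat, edges)
    groups = sorted(set(bins))
    return np.column_stack([X[:, bins == g].sum(axis=1) for g in groups])

rng = np.random.default_rng(2)
X = rng.integers(0, 3, size=(200, 50)).astype(float)   # SNP dosages 0/1/2
beta = np.where(np.arange(50) < 5, 0.5, 0.0)           # 5 causal SNPs
y = X @ beta + rng.normal(0, 1, 200)

# Marginal (per-variant) effect estimates, then a crude two-bin grouping:
beta_hat = np.array([np.cov(X[:, j], y)[0, 1] / X[:, j].var()
                     for j in range(50)])
Xr = group_by_effect(X, beta_hat, edges=[0.25])        # at most 2 clusters
```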

During the last fifteen years there has been increasing interest in using Bayesian methods in mixture models. However, one of the principal issues with these methods is the non-identifiability of components caused by a symmetric prior (whatever the kind of variables), which makes the Gibbs outputs useless for inference; this problem is known as label switching. We propose to condition the posterior distribution on a particular numbering, not of the parameter as is usually done, but rather of a latent partition, for which the posterior distributions are no longer strictly invariant up to a renumbering of the partition , , . The importance of this asymmetry depends on the choice of the partition space cutting. The challenge we address is to choose a particular cutting which is both justified and easy to compute. The idea is to use some properties of the (unavailable) *completed* posterior distribution.
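For context, the label-switching phenomenon itself can be illustrated with a naive relabeling of Gibbs draws (this is the standard parameter-based fix that the partition-based conditioning above aims to improve on):

```python
import itertools

def relabel(draw, reference):
    """Return the permutation of `draw` minimizing squared distance
    to `reference` (brute force over K! permutations, fine for small K)."""
    k = len(draw)
    best = min(itertools.permutations(range(k)),
               key=lambda p: sum((draw[p[j]] - reference[j]) ** 2
                                 for j in range(k)))
    return [draw[j] for j in best]

# Two Gibbs draws of K=2 component means whose labels have switched:
draws = [[0.1, 4.9], [5.1, -0.2]]
aligned = [relabel(d, draws[0]) for d in draws]
# aligned[1] == [-0.2, 5.1]: component 1 stays the "low-mean" one,
# so averaging over draws now gives meaningful component estimates.
```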

In the case of Gaussian mixtures, the unbounded likelihood is an important theoretical and practical problem. Using the weak information that the latent sample size of each component has to be greater than the space dimension, we derive a simple non-asymptotic stochastic lower bound on the variances. We also prove that maximizing the likelihood under this data-driven constraint leads to consistent estimates. Currently, such results are proved in the univariate case . The challenge is now not only to extend them to the multivariate situation but also to complement these theoretical results with practical strategies for properly avoiding degeneracy in software devoted to such mixture estimation.

This is joint work with Gwënaelle Castellan.
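A minimal univariate sketch of the degeneracy safeguard (here a fixed variance floor stands in for the data-driven stochastic bound derived above): without the floor, a component can collapse onto a single point and send the likelihood to infinity.

```python
import math
import random

def em_gmm(x, iters=50, var_floor=1e-2):
    """Two-component univariate Gaussian mixture EM with a variance floor."""
    n = len(x)
    mu, var, pi = [min(x), max(x)], [1.0, 1.0], [0.5, 0.5]
    for _ in range(iters):
        # E-step: component responsibilities for each point
        resp = []
        for xi in x:
            w = [pi[k] / math.sqrt(2 * math.pi * var[k])
                 * math.exp(-0.5 * (xi - mu[k]) ** 2 / var[k])
                 for k in range(2)]
            s = sum(w) + 1e-300          # guard against underflow
            resp.append([wk / s for wk in w])
        # M-step, clamping each variance at the floor
        for k in range(2):
            nk = sum(r[k] for r in resp)
            pi[k] = nk / n
            mu[k] = sum(r[k] * xi for r, xi in zip(resp, x)) / nk
            var[k] = max(var_floor,
                         sum(r[k] * (xi - mu[k]) ** 2
                             for r, xi in zip(resp, x)) / nk)
    return mu, var, pi

random.seed(0)
x = ([random.gauss(0, 1) for _ in range(100)]
     + [random.gauss(6, 1) for _ in range(100)])
mu, var, pi = em_gmm(x)
assert min(var) >= 1e-2   # no degenerate component
```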

Curve clustering in the presence of inter-individual variability has long been studied, especially using splines to account for functional random effects. However, splines are not appropriate when dealing with high-dimensional data and cannot be used to model irregular curves such as peak-like data. We propose a wavelet-based clustering procedure and apply it to high-dimensional data. We suggest a dimension-reduction step based on wavelet thresholding adapted to multiple curves and, using an appropriate structure for the random effect variance, we ensure that both fixed and random effects lie in the same functional space, even when dealing with irregular functions that belong to Besov spaces. In the wavelet domain, our model reduces to a linear mixed-effects model that can be used for a model-based clustering algorithm and for which we develop an EM algorithm for maximum likelihood estimation. An R package, curvclust, implementing this procedure is under construction and should be posted to the CRAN, the official R package repository, before Dec. 2011. An article has been submitted once to Biometrics and received good reports. This paper should be submitted again to Biometrics once curvclust is on the CRAN.
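The dimension-reduction step can be illustrated with a one-level Haar transform and soft thresholding (the actual procedure uses richer wavelet bases and a threshold adapted to multiple curves):

```python
import numpy as np

def haar_step(signal):
    """One level of the Haar transform: approximation and detail parts."""
    even, odd = signal[0::2], signal[1::2]
    return (even + odd) / np.sqrt(2), (even - odd) / np.sqrt(2)

def soft_threshold(c, t):
    """Shrink coefficients toward zero; small (noisy) details become 0."""
    return np.sign(c) * np.maximum(np.abs(c) - t, 0.0)

rng = np.random.default_rng(3)
grid = np.linspace(0, 1, 64)
curve = np.exp(-((grid - 0.5) ** 2) / 0.005) + rng.normal(0, 0.05, 64)

approx, detail = haar_step(curve)
detail = soft_threshold(detail, t=0.1)
# Many detail coefficients are now exactly zero: each curve is
# summarized by far fewer coefficients before clustering.
```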

The continuing technical improvements and decreasing cost of next-generation sequencing technologies have made RNA sequencing (RNA-seq) a popular choice for gene expression studies in recent years. Because the data collected in such studies differ considerably from those measured using microarray technology, the statistical tools used for analysis must be adapted accordingly. In particular, several methods for the normalisation of RNA-seq data (removal of errors due to the small number of samples, corrections for sequence composition) have been proposed in recent years. With the Statomique Consortium, we have compared seven normalisation methods. First results are given in .
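As an illustration of what such normalisation does, here is one classical scheme, DESeq-style median-of-ratios (not necessarily one of the seven compared): size factors correct for library-depth differences between samples.

```python
import numpy as np

def size_factors(counts):
    """counts: genes x samples matrix of read counts (no zero rows).
    Each sample's factor is the median ratio of its counts to a
    per-gene geometric-mean reference."""
    log_geo_mean = np.log(counts).mean(axis=1)       # per-gene reference
    ratios = np.log(counts) - log_geo_mean[:, None]
    return np.exp(np.median(ratios, axis=0))         # one factor per sample

counts = np.array([[10, 20], [30, 60], [5, 10], [100, 200]], dtype=float)
sf = size_factors(counts)
normalized = counts / sf
# Sample 2 has exactly twice the depth of sample 1, so its size
# factor is twice as large and the normalized counts coincide.
```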

Scan statistics are widely used to detect peaks in tiling array experiments. An extensive analysis of real biological data is being performed with the teams of Florent Sebbane and David Hot (Institut Pasteur, Lille) in a study of the Yersinia pestis bacterium aimed at finding new small RNAs. First results have been presented in . Given a signal composed of intensities ordered along the genome, the statistical problem is to detect peaks, taking into account the irregular design of the chips, which the biologists had chosen a few years ago. A master's student (D. Thuillier) compared different normalisation methods during a 6-month internship and improved the first analysis results presented in . We also propose a local score procedure, which seems promising according to the first biological results obtained. The next step is to work with Alain Célisse to choose a generative model on the normalised data which would make it possible to give appropriate initial values to the local score procedure and to associate p-values with local scores.
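The local score idea can be sketched as a maximal-segment-sum scan over centred intensities (toy scoring on a regular design; the real analysis accounts for the irregular probe layout):

```python
def local_score(intensities, baseline):
    """Kadane-style maximal segment sum of (intensity - baseline):
    returns the best score and the half-open segment achieving it."""
    best, best_start, best_end = 0.0, 0, 0
    cur, cur_start = 0.0, 0
    for i, v in enumerate(intensities):
        cur += v - baseline
        if cur <= 0:                       # segment not worth extending
            cur, cur_start = 0.0, i + 1
        elif cur > best:                   # new best-scoring segment
            best, best_start, best_end = cur, cur_start, i + 1
    return best, (best_start, best_end)

signal = [1, 1, 1, 5, 6, 7, 1, 1]          # a peak over probes 3..5
score, (lo, hi) = local_score(signal, baseline=2)
# → score 12.0 on segment (3, 6)
```

Associating a p-value then amounts to asking how often a score this high arises under a null model of the centred intensities, which is where the generative model mentioned above comes in.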

Two procedures for clustering functional data have been developed.

The second procedure, currently submitted, is a model-based clustering procedure built on an approximation of the density of functional random variables . As before, the EM algorithm is used for parameter estimation and the maximum a posteriori rule provides the clusters. A simulation study and a real-data application illustrate the interest of this methodology.

The aim is to study the consistency of variational and maximum likelihood estimates built from a particular generative model of random graphs in which independence between the edges of the graph is not assumed. These results are established from concentration inequalities. They have great practical interest since they justify *a posteriori* the intensive use of variational methods in this context.

It is a joint work with Jean-Jacques Daudin and a paper has been submitted .

The aim is to study the

This is joint work with Tristan Mary-Huard .

“Data analysis from high throughput technologies: Synergy between statistics and combinatorial optimization.”

With the development of new technologies such as high-throughput genotyping and sequencing, data analysis needs to be improved. Genes Diffusion is specialized in animal studies, for which genomic information can be read on around 800 000 markers, with more and more subjects. The aim of the PhD is to find new methods combining combinatorial optimization and statistical methods in order to characterize the best subjects according to quantitative criteria. A PhD CIFRE grant started in 2010; it is joint work with Clarisse Dhaenens (dolphin).

“Statistical modeling and simulation for card payment at medium distance.”

As part of the "Payment with a hand gesture" project, the Natural Security company uses a medium-distance technology based on biometrics to authenticate the card owner and to allow transactions without any payment card manipulation, while limiting the risk of fraud. Depending on the context of use (frequency of transactions), theoretical expertise is needed to assess the viability of the system in terms of the probability of collision or wrong authentication. This collaboration led to two contracts in 2011, 6 k€ each and about two weeks long each.
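The order of magnitude of such a collision probability can be sketched with a birthday-type bound (all parameters below are hypothetical):

```python
import math

def collision_probability(n_users, p_pairwise):
    """P(at least one of the C(n,2) user pairs is confused), assuming
    independent pairwise collision events with probability p_pairwise."""
    pairs = math.comb(n_users, 2)
    return 1.0 - (1.0 - p_pairwise) ** pairs

# With 20 users simultaneously in range and a one-in-a-million
# pairwise confusion rate, the overall risk stays below 0.02%.
risk = collision_probability(20, 1e-6)
```

The quadratic growth in the number of pairs is exactly why the frequency of transactions and the number of users in range drive the viability assessment.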

“Supervised and semi-supervised classification on large databases mixing qualitative and quantitative variables.”

Arcelor-Mittal faces quality problems in steel production which lead to supervised and semi-supervised classification involving (1) a small number of individuals compared to the number of variables, (2) heterogeneous variables, typically categorical and continuous, and (3) potentially highly correlated variables. A PhD CIFRE grant started in May 2011.

“Incidence of lymphoma in Nord-Pas-de-Calais, Annual Estimates and study of the evolution over the period 2001-2005.”

It is a contract with ASEL (Association Septentrionale pour l'Etude de Lymphomes) and CRESGE (Centre de Recherches Economiques Sociologiques et de Gestion) from Lille. This project of 6 k€ starts on December 1st, 2011 and ends on September 2nd, 2012.

Christophe Biernacki and Julien Jacques:

Institut de Biologie de Lille, laboratory Génomique et Maladies Métaboliques, L. Yengo

Christophe Biernacki:

Industrial studies, Arcelor-Mittal, C. Théry

Julien Jacques:

Genes Diffusion, J. Hamon

Guillemette Marot:

Institut Pasteur Lille, Équipe Etudes Transcriptomiques et Génomiques Appliquées, D. Hot,

Institut Pasteur Lille, Équipe Peste et Yersinia pestis, F. Sebbane

Institut de Biologie de Lille, Unité d'approches fonctionnelle et structurale des cancers, O. Pluquet

Université Lille 2, Plate-forme de génomique fonctionnelle et Structurale, M. Figeac

CHRU Lille, Centre de Biologie Pathologie, Laboratoire d'Hématologie, C. Preudhomme

Cristian Preda:

ASEL (Association Septentrionale pour l'Etude de Lymphomes) and CRESGE (Centre de Recherches Economiques Sociologiques et de Gestion) from Lille

Alain Célisse co-organized a workshop on Random Graphs in Lille on April '11
http://

Guillemette Marot belongs to the StatOmique working group
http://

Partner 1: University of Granada, Department of Statistics (Spain)

Collaboration with Professor Ana Aguilera in the field of Functional Data Analysis. Form of collaboration: joint paper
*Using basis expansion for estimating functional PLS regression: application with chemometric data*, in "Chemometrics and Intelligent Laboratory Systems", Vol. 104, no 2, p. 289–305,
2010

Partner 2: Luxembourg School of Finance (Luxembourg)

Collaboration with Professor Jang Schiltz on time-series prediction using functional data. Form of collaboration: joint paper , mobility research projects.

C. Biernacki belongs to the scientific committee of “Model mixtures and learning” at SFdS'11 (Gammarth, Tunisia) and to the programme committee of "Extraction et gestion des connaissances" at EGC'12 (Bordeaux, France). Since '10, he has been an Associate Editor of the journal “Case Studies in Business, Industry and Government Statistics” (CSBIGS)
http://

C. Biernacki and V. Vandewalle were invited speakers at one conference

C. Preda was an invited speaker in '11 at three conferences , ,

Since '09, C. Biernacki has been the treasurer of the data mining and learning group of the French statistical association (SFdS)
http://

Cristian Preda:

organized a session of applied statistics for the Statistics and Probability Society of Romania (Bucharest, April 2011)

was the scientific supervisor for the statistical methodology developed in the PSIP FP7 European project
http://

gave research talks and lectures on statistics at the University of Granada, the University of Luxembourg and the University of Bucharest

organized several conferences for the Seminar of Statistics and Informatics of the Faculty of Medicine, University Lille 2.

Guillemette Marot organizes, in the context of the PPF bioinfo Lille 1, two scientific meetings:

Fouille de texte pour la biologie, Sept. 2011,
http://

Analyse bioinformatique des données NGS, Dec. 2011,
http://

Christophe Biernacki (head of the M2 Ingénierie Statistique et Numérique)
http://

Master: Mathematical statistics, 60h, project supervision, 10h, M1, U. Lille 1, France

Master: Data analysis, 97.5h, Analysis of variance and experimental design, 22.5h, internship supervision, 20h, M2, U. Lille 2, France

Alain Célisse:

Master: Statistique Fondamentale, 45h, M2, U. Lille 1, France

DUT: Mathématiques pour l'Informatique, 122h, L1, U. Lille 1, France

DUT: Algèbre, 80h, L2, U. Lille 1, France

Serge Iovleff:

DUT: Discrete mathematics, 72h, Modelling, 88h, Algebra & Geometry, 32h, Probability & statistics & analysis, 64h, L1, U. Lille 1, France

Julien Jacques:

Licence: Statistique Inférentielle, 50h, L3, École Polytechnique Universitaire de Lille, U. Lille 1, France

Master: Modélisation Statistique, 30h, M1, École Polytechnique Universitaire de Lille, U. Lille 1, France

Master: Séries Temporelles, 25h, M2, École Polytechnique Universitaire de Lille, U. Lille 1, France

Guillemette Marot:

Licence: Biostatistique, 18h, L1, U. Lille 2, France

Master: Biostatistique, 48h, M1, U. Lille 2, France

Cristian Preda:

Licence: Probabilités, 36h, L3, École Polytechnique Universitaire de Lille, U. Lille 1, France

Master: Statistique Exploratoire, 40h, M1, École Polytechnique Universitaire de Lille, U. Lille 1, France

Master: Functional Data Analysis, 18h, M2, U. Lille 1, France

Master: Functional Data Analysis, 10h, M2, Department of Statistics, University of Granada, Spain

Vincent Vandewalle:

DUT STID: Linear algebra, 93h, Simulation techniques, 31.5h, Descriptive statistics, 36h, Basic mathematics, 12h, Probability, 108h, L1, U. Lille 2, France

DUT STID: Analysis, 20h, L2, U. Lille 2, France

PhD: Alexandre Lourme, Contribution à la Classification par Modèles de Mélange & Classification Simultanée d'Échantillons d'Origines Multiples, U. Lille 1, June '11, Christophe Biernacki supervisor

PhD in progress: Alexandru Amarioarei, Statistics, Scan statistics and applications, started in 2010, Cristian Preda supervisor

PhD in progress: Michael Genin, Statistics, Scan statistics and epidemiology, started in 2010, Cristian Preda and Alain Duhamel (CEREM, U. Lille 2) supervisors

PhD in progress: Julie Hamon, Analysis of data from high-throughput genotyping: cooperation between statistics and combinatorial optimization, started in 2010, Julien Jacques and Clarisse Dhaenens (dolphin INRIA Lille project-team) supervisors

PhD in progress: Loïc Yengo, Simultaneous Variables Clustering and Selection in Regression Models, started in 2010, Christophe Biernacki and Julien Jacques supervisors

PhD in progress: Clément Thery, Classification supervisée ou semi-supervisée des bases de grande dimension, avec variables qualitatives et quantitatives, started in 2011, Christophe Biernacki supervisor

PhD in progress: Matthieu Marbac-Lourdelle, Generative models taking into account the correlation between variables, started in 2011, Christophe Biernacki and Vincent Vandewalle supervisors