The mistis team aims to develop statistical methods for dealing with complex problems or data. Our applications consist mainly of image processing and spatial data problems, with some applications in biology and medicine. Our approach is based on the premise that complexity can be handled by working up from simple local assumptions in a coherent way, defining a structured model; this is the key to modelling, computation, inference and interpretation. The methods we focus on involve mixture models, Markov models and, more generally, hidden structure models identified by stochastic algorithms on the one hand, and semi- and non-parametric methods on the other hand.

Hidden structure models are useful for taking into account heterogeneity in data. They concern many areas of statistical methodology (finite mixture analysis, hidden Markov models, random effect models, etc.). Due to their missing data structure, they induce specific difficulties for both estimating the model parameters and assessing performance. The team focuses on research regarding both aspects. We design specific algorithms for estimating the parameters of hidden structure models, and we propose and study specific criteria for choosing the most relevant hidden structure models in several contexts.

Semi- and non-parametric methods are relevant and useful when no appropriate parametric model exists for the data under study, either because of data complexity or because information is missing. The focus is on functions describing curves, surfaces or, more generally, manifolds, rather than real-valued parameters. This can be interesting in image processing, for instance, where it can be difficult to introduce parametric models that are general enough (e.g. for contours).

mixture of distributions, EM algorithm, missing data, conditional independence, statistical pattern recognition, clustering, unsupervised and partially supervised learning

In a first approach, we consider statistical parametric models.

These models are interesting in that they may point out hidden
variables responsible for most of the observed variability, such
that the observed variables are *conditionally* independent.
Their estimation is often difficult due to the missing data. The
Expectation-Maximization (EM) algorithm is a general and now
standard approach to maximization of the likelihood in missing
data problems. It provides parameter estimates but also values
for the missing data.
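As an illustration of the kind of missing-data computation involved, a minimal EM for a two-component univariate Gaussian mixture can be sketched as follows (an illustrative NumPy sketch on synthetic data, not the team's code):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: two well-separated Gaussian components.
x = np.concatenate([rng.normal(-3.0, 1.0, 200), rng.normal(3.0, 1.0, 200)])

# Initial parameter guesses: means, variances, mixing proportions.
mu = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])
mix = np.array([0.5, 0.5])

def normal_pdf(x, m, v):
    return np.exp(-0.5 * (x - m) ** 2 / v) / np.sqrt(2 * np.pi * v)

for _ in range(50):
    # E-step: posterior probability of each hidden component label.
    # These are the "values for missing data" provided by EM.
    resp = mix * normal_pdf(x[:, None], mu, var)      # shape (n, 2)
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: re-estimate the parameters from the weighted data.
    nk = resp.sum(axis=0)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    mix = nk / len(x)
```

The E-step responsibilities play exactly the role of the conditionally independent hidden labels discussed above.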

Mixture models correspond to the case of independent observations.

graphical models, Markov properties, conditional independence, hidden Markov trees, clustering, statistical learning, missing data, mixture of distributions, EM algorithm, stochastic algorithms, selection and combination of models, statistical pattern recognition, image analysis, hidden Markov field, Bayesian inference

Graphical modelling provides a diagrammatic representation of the logical structure of a joint probability distribution, in the form of a network or graph depicting the local relations among variables. The graph can have directed or undirected links or edges between the nodes, which represent the individual variables. Associated with the graph are various Markov properties that specify how the graph encodes conditional independence assumptions.

It is the conditional independence assumptions that give graphical models their fundamental modular structure, enabling computation of globally interesting quantities from local specifications. In this way graphical models form an essential basis for our methodologies based on structures.

The graphs can be either
directed, e.g. Bayesian Networks, or undirected, e.g. Markov Random Fields.
The specificity of Markovian models is that the dependencies
between the nodes are limited to the nearest neighbor nodes. The
neighborhood definition can vary and be adapted to the problem of
interest. When some of the variables (nodes) are unobserved or missing,
we refer to these models as Hidden Markov Models (HMM).
Hidden Markov chains and hidden Markov fields correspond to the cases where the
hidden variables form a Markov chain or a Markov field, respectively.

Hidden Markov models are very useful for modelling spatial dependencies, but these dependencies and the possible existence of hidden variables are also responsible for a typically large amount of computation. It follows that the statistical analysis may not be straightforward. Typical issues are related to the neighborhood structure to be chosen when not dictated by the context, and to the possible high dimensionality of the observations. This also requires a good understanding of the role of each parameter and methods to tune them depending on the goal in mind. Regarding estimation, the algorithms correspond to an energy minimization problem that is NP-hard and is usually addressed through approximation. We focus on a certain type of methods based on the mean field principle and propose effective algorithms which show good performance in practice and for which we also study theoretical properties. We also propose some tools for model selection. Finally, we investigate ways to extend the standard hidden Markov field model to increase its modelling power.
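The mean field principle mentioned above can be illustrated on a toy hidden Potts model with two classes and known parameters (a simplified sketch; the team's algorithms also estimate the parameters, which is not done here):

```python
import numpy as np

rng = np.random.default_rng(1)
# Noisy binary image: left half is class 0, right half is class 1.
H = W = 16
true = np.zeros((H, W), dtype=int)
true[:, W // 2:] = 1
mu, sigma, beta = np.array([0.0, 1.0]), 0.6, 1.5
y = mu[true] + sigma * rng.normal(size=(H, W))

# q[i, j, k]: approximate posterior that pixel (i, j) is in class k.
q = np.full((H, W, 2), 0.5)
for _ in range(30):
    # Sum of neighbouring beliefs (4-neighbourhood, zero-padded borders).
    nb = np.zeros_like(q)
    nb[1:] += q[:-1]
    nb[:-1] += q[1:]
    nb[:, 1:] += q[:, :-1]
    nb[:, :-1] += q[:, 1:]
    # Mean-field update: Gaussian data term plus beta * neighbour agreement.
    logit = -0.5 * ((y[..., None] - mu) / sigma) ** 2 + beta * nb
    logit -= logit.max(axis=-1, keepdims=True)
    q = np.exp(logit)
    q /= q.sum(axis=-1, keepdims=True)

labels = q.argmax(axis=-1)
accuracy = (labels == true).mean()
```

Each pixel's belief is updated given its neighbours' current beliefs, replacing the intractable joint computation over the field by tractable local ones.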

dimension reduction, extreme value analysis, functional estimation.

We also consider methods which do not assume a parametric model.
The approaches are non-parametric in the sense that they do not
require the assumption of a prior model on the unknown quantities.
This property is important since, for image applications for
instance, it is very difficult to introduce sufficiently general
parametric models because of the wide variety of image contents.
Projection methods are then a way to decompose the unknown
quantity onto a set of basis functions (*e.g.* wavelets). Kernel
methods, which rely on smoothing the data using a set of kernels
(usually probability distributions), are other examples.
Relationships exist between these methods and learning techniques
using Support Vector Machines (SVM), as appears in the context
of *level-set estimation* (see section ). Such
non-parametric methods have become the cornerstone when dealing
with functional data. This is the case, for
instance, when observations are curves. They enable us to model the
data without a discretization step. More generally, these
techniques are of great use for *dimension reduction* purposes
(section ). They enable reduction of the dimension of the
functional or multivariate data without assumptions on the
distribution of the observations. Semi-parametric methods refer to
methods that include both parametric and non-parametric aspects.
Examples include the Sliced Inverse Regression (SIR) method,
which combines non-parametric regression techniques
with parametric dimension reduction aspects. This is also the case
in *extreme value analysis*, which is based
on the modelling of distribution tails (see section ).
It differs from traditional statistics, which focuses on the central
part of distributions, *i.e.* on the most probable events.
Extreme value theory shows that distribution tails can be
modelled by both a functional part and a real parameter, the
extreme value index.
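As a concrete illustration of tail estimation and extrapolation (a textbook sketch, not the team's software), the classical Hill estimator of a positive extreme-value index, combined with Weissman-type quantile extrapolation, can be written as:

```python
import numpy as np

rng = np.random.default_rng(2)
# Pareto sample with extreme-value index gamma = 0.5,
# i.e. survival function P(X > x) = x**(-2) for x > 1.
gamma_true, n = 0.5, 10000
x = rng.uniform(size=n) ** (-gamma_true)    # inverse-cdf sampling

xs = np.sort(x)
k = 500                                     # number of tail observations used

# Hill estimator: mean log-excess over the (k+1)-th largest value.
gamma_hat = np.mean(np.log(xs[-k:])) - np.log(xs[-k - 1])

# Weissman extrapolation of the quantile of order 1 - p, with p far
# below 1/n (the true value here is 1e-4 ** (-0.5) = 100).
p = 1e-4
q_hat = xs[-k - 1] * (k / (n * p)) ** gamma_hat
```

Only the k largest observations are used, and the estimated index drives the extrapolation beyond the sample range.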

Extreme value theory is a branch of statistics dealing with the extreme
deviations from the bulk of probability distributions.
More specifically, it focuses on the limiting distributions for the
minimum or the maximum of a large collection of random observations
from the same arbitrary distribution.
Estimating such extreme quantiles therefore requires dedicated
methods to extrapolate information beyond the largest observed
values of the sample; both the extreme-value index and the tail
behaviour then have to be estimated.

More generally, the problems that we address belong to risk management theory. For instance, in reliability, the distributions of interest belong to a semi-parametric family whose tails decrease exponentially fast. These so-called Weibull-tail distributions are defined by their survival distribution function:
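A standard way of writing this family (our rendering, for reference, since the original display was lost) is:

```latex
\bar{F}(x) \;=\; \exp\!\left\{ -\,x^{1/\theta}\,\ell(x) \right\}, \qquad x > 0,
```

where \(\theta > 0\) is the Weibull-tail coefficient and \(\ell\) is a slowly varying function, i.e. \(\ell(\lambda x)/\ell(x) \to 1\) as \(x \to \infty\) for all \(\lambda > 0\).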

Gaussian, gamma, exponential and Weibull distributions, among others,
are included in this family. An important part of our work consists
in establishing links between models () and ()
in order to propose new estimation methods.
We also consider the case where the observations are recorded together with covariate information. In this case, the extreme-value index and the
extreme quantiles depend on the covariate and are estimated as functions of it.
Level set estimation is a
recurrent problem in statistics, linked to outlier
detection. In biology, one is interested in estimating reference
curves, that is to say curves which bound a prescribed proportion
of the population.

Our work on high dimensional data requires that we face the curse of dimensionality phenomenon. Indeed, the modelling of high dimensional data requires complex models and thus the estimation of a high number of parameters compared to the sample size. In this framework, dimension reduction methods aim at replacing the original variables by a small number of linear combinations, with as small a loss of information as possible. Principal Component Analysis (PCA) is the most widely used method to reduce dimension in data. However, standard linear PCA can be quite inefficient on image data, where even simple image distortions can lead to highly non-linear data. Two directions are investigated. First, non-linear PCAs can be proposed, leading to semi-parametric dimension reduction methods. Another field of investigation is to take the application goal into account in the dimension reduction step. One of our approaches is therefore to develop new Gaussian models of high dimensional data for parametric inference. Such models can then be used in a mixture or Markov framework for classification purposes. Another approach consists in combining dimension reduction, regularization techniques, and regression techniques to improve the Sliced Inverse Regression method.
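To fix ideas on SIR (a generic sketch of the classical algorithm on simulated single-index data, not our regularized variants):

```python
import numpy as np

rng = np.random.default_rng(3)
# Single-index data: y depends on X only through the direction b.
n, p = 2000, 10
X = rng.normal(size=(n, p))
b = np.zeros(p)
b[:2] = 1 / np.sqrt(2)
y = (X @ b) ** 3 + 0.1 * rng.normal(size=n)

def sir(X, y, n_slices=10, n_dirs=1):
    """Classical Sliced Inverse Regression: eigen-directions of the
    covariance of slice-wise means of the whitened predictors."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    vals, U = np.linalg.eigh(Xc.T @ Xc / n)
    W = U @ np.diag(vals ** -0.5) @ U.T       # symmetric whitening matrix
    Z = Xc @ W
    order = np.argsort(y)
    M = np.zeros((p, p))
    for s in np.array_split(order, n_slices): # slice on the response
        m = Z[s].mean(axis=0)
        M += (len(s) / n) * np.outer(m, m)
    _, V = np.linalg.eigh(M)
    dirs = W @ V[:, ::-1][:, :n_dirs]         # back to the original scale
    return dirs / np.linalg.norm(dirs, axis=0)

d = sir(X, y)[:, 0]
alignment = abs(d @ b)    # close to 1 when the true direction is found
```

The dimension is reduced by keeping only the leading eigen-directions, which here recover the single index driving the response.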

As regards applications, several areas of image analysis can be covered using the tools developed in the team. More specifically, in collaboration with team Perception, we address various issues in computer vision involving Bayesian modelling and probabilistic clustering techniques. Other applications in medical imaging are natural. We work more specifically on MRI data, in collaboration with the Grenoble Institute of Neuroscience (GIN) and the LNAO from the NeuroSpin center of CEA Saclay. We also consider other statistical 2D fields coming from other domains such as remote sensing, in collaboration with the Laboratoire de Planétologie de Grenoble. In the context of the ANR MDCO project Vahine, we work on hyperspectral multi-angle images. In the context of the "pôle de compétitivité" project I-VP, we work on images of PC boards.

A second domain of applications concerns biology and medicine. We consider the use of missing data models in epidemiology. We also investigate statistical tools for the analysis of bacterial genomes beyond gene detection. Applications in population genetics and neurosciences are also considered. Finally, in the context of the ANR VMC project Medup, we study the uncertainties in forecasting and climate projection for Mediterranean high-impact weather events.

Reliability and industrial lifetime analysis are applications developed through collaborations with the EDF research department and the LCFR laboratory (Laboratoire de Conduite et Fiabilité des Réacteurs) of CEA / Cadarache. We also consider failure detection in print infrastructure through collaboration with Xerox, Meylan.

**Joint work with:** Radu Horaud and Manuel Iguel.

The ECMPR (Expectation Conditional Maximization for Point Registration) package registers two (2D or 3D) point clouds using an algorithm based on maximum likelihood with hidden variables. The method can register both rigid and articulated shapes. It estimates both the rigid or kinematic transformation between the two shapes and the parameters (covariances) associated with the underlying Gaussian mixture model. It was registered with the APP in 2010 under the GPL license.

**Joint work with:** Michel Dojat.

From brain MR images, neuroradiologists are able to delineate tissues such as grey matter, structures such as the Thalamus, and damaged regions. This delineation is a common task for an expert, but unsupervised segmentation is difficult due to a number of artefacts. The LOCUS software and its recent extension P-LOCUS automatically perform this segmentation for healthy and pathological brains. An image is divided into cubes, on each of which a statistical model is applied. This provides a number of local treatments that are then integrated to ensure consistency at a global level, resulting in low sensitivity to artefacts. The statistical model is based on a Markovian approach that makes it possible to capture the relations between tissues and structures, to integrate a priori anatomical knowledge, and to handle local estimations and spatial correlations.

The LOCUS software has been developed in the context of a collaboration between Mistis, a computer science team (Magma, LIG) and a neuroscience methodological team (the neuroimaging team from the Grenoble Institute of Neuroscience, INSERM). This collaboration resulted, over the period 2006-2008, in the PhD thesis of B. Scherrer (advised by C. Garbay and M. Dojat) and in a number of publications. In particular, B. Scherrer received a "Young Investigator Award" at the 2008 MICCAI conference. Its extension (P-LOCUS) for lesion detection is being developed by S. Doyle with financial support from Gravit, with a view to possible industrial transfer.

The originality of this work comes from the successful combination of the teams' respective strengths, i.e. expertise in distributed computing, in neuroimaging data processing and in statistical methods.

**Joint work with:** Vasil Khalidov, Radu Horaud, Miles Hansard,
Ramya Narasimha, Elise Arnaud.

POPEYE contains software modules and libraries jointly developed by three partners within the POP STREP project: Inria, University of Sheffield, and University of Coimbra. It includes kinematic and dynamic control of the robot head, stereo calibration, camera-microphone calibration, auditory and image processing, stereo matching, binaural localization, audio-visual speaker localization. Currently, this software package is not distributed outside POP.

**Joint work with:** Charles Bouveyron (Université Paris 1).

The High-Dimensional Discriminant Analysis (HDDA) and
High-Dimensional Data Clustering (HDDC) toolboxes contain
efficient supervised and unsupervised classifiers, respectively,
for high-dimensional data.
These classifiers are based on Gaussian models adapted to
high-dimensional data. The
HDDA and HDDC toolboxes are available for Matlab and are included
in the MixMod software.
Recently, an R package has been developed and integrated in The
Comprehensive R Archive Network (CRAN). It can be downloaded at
the following URL:
http://

**Joint work with:** Jean Diebolt (CNRS), Laurent Gardes (Univ.
Strasbourg) and Myriam Garrido (INRA Clermont-Ferrand-Theix).

The Extremes software is a toolbox dedicated to the
modelling of extremal events, offering extreme quantile estimation
procedures and model selection methods. This software results from
a collaboration with EDF R&D. It is also an outcome of the
PhD thesis work of Myriam Garrido.
The software is written in C++
with a Matlab graphical interface. It is available in both
Windows and Linux environments. It can be downloaded at the
following URL:
http://

SpaCEM3

This software, developed by present and past members of the team, is the result of several research developments on the subject. The current version 2.09 of the software is licensed under CeCILL-B.

**Main features.** The approach is based on the EM
algorithm for clustering and on Markov Random Fields (MRF) to
account for dependencies. In addition to standard clustering tools
based on independent Gaussian mixture models, SpaCEM3 provides:

- The unsupervised clustering of dependent objects. Their dependencies are encoded via a graph, not necessarily regular, and the data sets are modelled via Markov random fields and mixture models (e.g. MRF and hidden MRF). Available Markov models include extensions of the Potts model, with the possibility to define more general interaction models.

- The supervised clustering of dependent objects when standard hidden MRF (HMRF) assumptions do not hold (i.e. in the case of non-correlated and non-unimodal noise models). The learning and test steps are based on recently introduced Triplet Markov models.

- Model selection criteria (BIC, ICL and their mean-field approximations) that select the "best" HMRF according to the data.

- The possibility of producing simulated data from:

  - general pairwise MRF with singleton and pair potentials (typically Potts models and extensions),

  - standard HMRF, i.e. with an independent noise model,

  - general Triplet Markov models with interactions up to order 2.

- A specific setting to account for high-dimensional observations.

- An integrated framework to deal with missing observations, under the Missing At Random (MAR) hypothesis, with prior imputation (KNN, mean, etc.), online imputation (as a step in the algorithm), or without imputation.

The software is available at http://

**Joint work with:** Francois, O. (TimB, TIMC) and Chen, C. (former Post-doctoral fellow in Mistis).

The FASTRUCT program is dedicated to the modelling and inference of population structure from genetic data. Bayesian model-based clustering programs have gained popularity in studies of population structure since the publication of the STRUCTURE software. These programs are generally acknowledged as performing well, but their running time may be prohibitive. FASTRUCT is a non-Bayesian implementation of the classical model with no admixture and uncorrelated allele frequencies. This new program relies on the Expectation-Maximization principle and produces assignments rivaling those of other model-based clustering programs. In addition, it can be several-fold faster than Bayesian implementations. The software consists of a command-line engine, which is suitable for batch analysis of data, and an MS Windows graphical interface, which is convenient for exploring data.

It is written for Windows OS and contains a detailed user's guide. It is available at
http://

The functionalities are further described in the related publication:

Molecular Ecology Notes, 2006.

**Joint work with:** Francois, O. (TimB, TIMC) and Chen, C. (former post-doctoral fellow in Mistis).

TESS is a computer program that implements a Bayesian clustering algorithm for spatial population genetics. It is particularly useful for seeking genetic barriers or genetic discontinuities in continuous populations. The method is based on a hierarchical mixture model where the prior distribution on cluster labels is defined as a hidden Markov random field. Given individual geographical locations, the program seeks population structure from multilocus genotypes without assuming predefined populations. TESS takes input data files in a format compatible with existing non-spatial Bayesian algorithms (e.g. STRUCTURE). It returns graphical displays of cluster membership probabilities and geographical cluster assignments through its graphical user interface.

The functionalities and the comparison with three other Bayesian clustering programs are detailed in the following publication:

Molecular Ecology Notes, 2007.

**Joint work with:** Bouveyron, C. (Université Paris 1), Fauvel, M. (ENSAT Toulouse)

In the PhD work of Charles Bouveyron (co-advised by Cordelia Schmid from the Inria LEAR team), we propose new Gaussian models of high dimensional data for classification purposes. We assume that the data live in several groups located in subspaces of lower dimensions. Two different strategies arise:

- the introduction in the model of a dimension reduction constraint for each group;

- the use of parsimonious models obtained by requiring different groups to share the same values of some parameters.

This modelling yields a new supervised classification method called High Dimensional Discriminant Analysis (HDDA). Some versions of this method have been tested on the supervised classification of objects in images. This approach has been adapted to the unsupervised classification framework, and the related method is named High Dimensional Data Clustering (HDDC). The description of the associated R package has also been published. Our recent work consists in adding a kernel to the previous methods to deal with nonlinear data classification.
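The per-group subspace constraint can be sketched as follows (a simplified model keeping the d leading eigenvalues of each class covariance and a common noise variance, on hypothetical simulated data; not the published implementation):

```python
import numpy as np

rng = np.random.default_rng(4)
p, d = 20, 2

def make_class(mu):
    # Each class concentrates near its own random d-dimensional
    # subspace (hypothetical data, for illustration only).
    basis = rng.normal(size=(p, d))
    return lambda n: (mu + rng.normal(size=(n, d)) @ basis.T
                      + 0.1 * rng.normal(size=(n, p)))

def fit_class(X, d):
    # Keep the d leading eigenvalues; replace the remaining ones by
    # their mean b, interpreted as a common noise variance.
    mu = X.mean(axis=0)
    vals, vecs = np.linalg.eigh(np.cov(X.T))
    vals, vecs = vals[::-1], vecs[:, ::-1]
    return mu, vecs[:, :d], vals[:d], vals[d:].mean()

def log_density(x, prm):
    # Gaussian log-density (up to a constant) under the constrained
    # covariance Q diag(a) Q' + b (I - Q Q').
    mu, Q, a, b = prm
    dev = x - mu
    proj = dev @ Q
    maha = (proj ** 2 / a).sum() + (dev @ dev - (proj ** 2).sum()) / b
    return -0.5 * (maha + np.log(a).sum() + (p - len(a)) * np.log(b))

draw = [make_class(np.zeros(p)), make_class(np.full(p, 1.0))]
params = [fit_class(draw[k](300), d) for k in (0, 1)]
test_X = np.vstack([draw[0](50), draw[1](50)])
labels = np.repeat([0, 1], 50)
pred = np.array([max((0, 1), key=lambda k: log_density(x, params[k]))
                 for x in test_X])
accuracy = (pred == labels).mean()
```

Only d + 1 variance parameters per class are estimated instead of a full covariance, which is the point of the dimension reduction constraint.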

Clustering concerns the assignment of each of the observations to one of a finite number of groups.

A useful representation is that of an *infinite mixture of scaled
Gaussians*, or *Gaussian scale mixture*, in which a Gaussian
covariance matrix is divided by a positive latent weight variable
with a given mixing distribution.
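In a standard notation (our rendering, for reference, since the original display was lost), such a representation reads

```latex
p(\mathbf{y}) \;=\; \int_0^{\infty} \mathcal{N}\!\left(\mathbf{y};\, \boldsymbol{\mu},\, \boldsymbol{\Sigma}/w\right) f(w)\, \mathrm{d}w ,
```

where \(w > 0\) is a latent scale weight with mixing density \(f\); a Gamma mixing density, for instance, yields the multivariate Student t distribution.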

For many applications, the distribution of the data may also be
highly asymmetric in addition to being heavy tailed (or affected
by outliers). A natural extension of the Gaussian scale mixture
case is to consider *location and scale Gaussian mixtures*, in
which the latent weight acts on both the mean and the covariance
of the Gaussian.
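A standard way to write such a location and scale mixture (again our rendering, given as an indication) is

```latex
p(\mathbf{y}) \;=\; \int_0^{\infty} \mathcal{N}\!\left(\mathbf{y};\, \boldsymbol{\mu} + w\,\boldsymbol{\gamma},\, w\,\boldsymbol{\Sigma}\right) f(w)\, \mathrm{d}w ,
```

where the latent weight \(w\) now shifts the mean as well as scaling the covariance; the normal inverse Gaussian distribution corresponds to an inverse Gaussian mixing density \(f\).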

Although the above approaches provide great flexibility for
modelling highly asymmetric and heavy-tailed data, they assume
that the tail behaviour is the same in every dimension of the
variable space.

In this work, we show that the location and scale mixture
representation can be further explored, and we propose a framework
that is considerably simpler than those previously proposed, with
distributions exhibiting interesting properties. Using the normal
inverse Gaussian (NIG) distribution as an example, we extend the
standard *location and scale mixture of Gaussians*
representation to allow the tail behaviour to be set or
estimated differently in each dimension of the variable space. The
key elements of the approach are the introduction of
multidimensional weights and a decomposition of the scale matrix.
For clustering problems, parametric mixture models are among the most popular approaches. Above all, Gaussian mixture models are widely used in various fields of study such as data mining, pattern recognition, machine learning, and statistical analysis. The modelling and computational flexibility of the Gaussian mixture model makes it possible to model a rich class of densities and provides a simple mathematical form for cluster models.

Despite the success of Gaussian mixtures, parameter estimation
can be severely affected by outliers. By adding a
degrees-of-freedom (dof) parameter, which acts as a robustness
tuning parameter, a robust improvement in clustering can be
achieved.

This work proposes an approach that combines such robust
clustering with HDDC, through the use of mixtures of multivariate
t distributions.

**Joint work with:** Antoine Deleforge and Radu Horaud from the
Inria Perception team.

We cast dimensionality reduction and regression in a unified
latent variable model. We propose a two-step strategy consisting
of characterizing a non-linear *reversed* output-to-input
regression with a generative piecewise-linear model, followed by
Bayes inversion to obtain an output density given an input. We
describe and analyze the most general case of this model, namely
when only some components of the output variables are observed
while the other components are latent. We provide two EM inference
procedures and their initialization. Using simulated and real
data, we show that the proposed method outperforms several
existing ones.
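The two-step strategy can be summarized as follows (generic notation for a mixture of affine regressions; a sketch, not the paper's exact parameterization). Writing x for the input, y for the output and k for the hidden mixture component, the reversed output-to-input stage and its Bayes inversion read

```latex
p(\mathbf{x}, \mathbf{y}) \;=\; \sum_{k=1}^{K} \pi_k \,
   \mathcal{N}(\mathbf{y};\, \mathbf{c}_k, \boldsymbol{\Gamma}_k)\,
   \mathcal{N}(\mathbf{x};\, \mathbf{A}_k \mathbf{y} + \mathbf{b}_k, \boldsymbol{\Sigma}_k),
\qquad
p(\mathbf{y} \mid \mathbf{x}) \;=\; \sum_{k=1}^{K} \nu_k(\mathbf{x}) \,
   \mathcal{N}(\mathbf{y};\, \mathbf{A}^{*}_k \mathbf{x} + \mathbf{b}^{*}_k, \boldsymbol{\Sigma}^{*}_k),
```

where each component of the joint model is Gaussian, so the inverted conditional is again a Gaussian mixture whose starred parameters follow in closed form from standard Gaussian conditioning.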

**Joint work with:** Antoine Deleforge and Radu Horaud from the
Inria Perception team.

We addressed the problem of sound-source separation and localization in real-world conditions with two microphones. Both tasks are solved within a unified formulation using supervised mapping. While the parameters of the direct mapping are learned during a training stage that uses sources emitting white noise (calibration), the inverse mapping is estimated using a variational EM formulation. The proposed algorithm can deal with natural sound sources such as speech, which are known to yield sparse spectrograms, and is able to locate multiple sources in both azimuth and elevation. Extensive experiments with real data show that the method outperforms the state of the art in both separation and localization.

**Joint work with:** Michel Dojat (Grenoble Institute of
Neuroscience) and Philippe Ciuciu from Neurospin, CEA in Saclay.

In standard within-subject analyses of event-related fMRI data, two steps are usually performed separately: detection of brain activity and estimation of the hemodynamic response. Because these two steps are inherently linked, we adopt the so-called region-based Joint Detection-Estimation (JDE) framework that addresses this joint issue using a multivariate inference for detection and estimation. JDE is built by making use of a regional bilinear generative model of the BOLD response and constraining the parameter estimation by physiological priors using temporal and spatial information in a Markovian model. In contrast to previous works that use Markov Chain Monte Carlo (MCMC) techniques to sample the resulting intractable posterior distribution, we recast the JDE into a missing data framework and derive a Variational Expectation-Maximization (VEM) algorithm for its inference. A variational approximation is used to approximate the Markovian model in the unsupervised spatially adaptive JDE inference, which allows automatic fine-tuning of spatial regularization parameters. It provides a new algorithm that exhibits interesting properties in terms of estimation error and computational cost compared to the previously used MCMC-based approach. Experiments on artificial and real data show that VEM-JDE is robust to model mis-specification and provides computational gains while maintaining good performance in terms of activation detection and hemodynamic shape recovery. Main corresponding paper .

**Joint work with:** Philippe Ciuciu from Team Parietal and
Neurospin, CEA in Saclay.

Identifying brain hemodynamics in event-related
functional MRI (fMRI) data is a crucial issue to disentangle the
vascular response from the neuronal activity in the BOLD signal.
This question is usually addressed by estimating the so-called
Hemodynamic Response Function (HRF). Voxelwise or
region-/parcelwise inference schemes have been proposed to achieve
this goal but so far all known contributions commit to
pre-specified spatial supports for the hemodynamic territories by
defining these supports either as individual voxels or a priori
fixed brain parcels. In this paper, we introduce a Joint
Parcellation-Detection-Estimation (JPDE) procedure that
incorporates an adaptive parcel identification step based upon
local hemodynamic properties. Efficient inference of evoked
activity, HRF shapes and *supports* is then achieved using
variational approximations. Validation on synthetic and real fMRI
data demonstrates the performance of JPDE over standard
detection-estimation schemes and suggests it as a new brain
exploration tool. Corresponding papers , .

**Joint work with:** Michel Dojat (Grenoble Institute of
Neuroscience) and Philippe Ciuciu from Neurospin, CEA in Saclay.

Brain functional exploration investigates the nature of neural processing following cognitive or sensory stimulation. This goal is not fully accounted for in most functional Magnetic Resonance Imaging (fMRI) analyses, which usually assume that all delivered stimuli possibly generate a BOLD response everywhere in the brain, although activation is likely to be induced by only some of them in specific brain regions. Generally, criteria are not available to select the relevant conditions or stimulus types (e.g. visual, auditory, etc.) prior to activation detection, and the inclusion of irrelevant events may degrade the results, particularly when the Hemodynamic Response Function (HRF) is jointly estimated. To face this issue, we propose an efficient variational procedure that automatically selects the conditions according to the brain activity they elicit. This yields improved activation detection and local HRF estimation, which we illustrate on synthetic and real fMRI data. This approach is an alternative to our previous approach based on Markov Chain Monte Carlo (MCMC) inference. Corresponding paper .

**In the context of ARC AINSI project, joint work with:**
Philippe Ciuciu from Neurospin, CEA in Saclay.

In many neuroscience applications, the Arterial Spin Labeling (ASL) fMRI modality arises as a preferable choice to the standard BOLD modality due to its ability to provide a quantitative measure of the Cerebral Blood Flow (CBF). Such a quantification is central but is generally performed without consideration of a specific modeling of the perfusion component in the signal, which is often handled via standard GLM approaches using the canonical BOLD response function as a regressor. In this work, we propose a novel Bayesian hierarchical model of the ASL signal which allows activation detection as well as the extraction of both a perfusion and a hemodynamic component. Validation on synthetic and real data sets from event-related ASL shows the ability of our model to address the source separation and double deconvolution problems inherent to ASL data analysis.

**In the context of ARC AINSI project, joint work with:** Jan
Warnking (Grenoble Institute of Neuroscience) and Philippe Ciuciu
from Neurospin, CEA in Saclay.

The internship of Marc Guillotin was supported by the pôle Cognition de Grenoble.

The goal of this work was to investigate independent component analysis (ICA) techniques to identify the part of the ASL signal due to physiological sources such as respiratory and cardiac components. Once identified, these physiological components should be removed to produce an uncontaminated ASL signal. This preliminary work showed that the physiological effects affected all signal components and were therefore not easy to extract without removing some of the useful signal. More experiments should be carried out on real data from the GIN.

**In the context of ARC AINSI project, joint work with:** Michel
Dojat (Grenoble Institute of Neuroscience), Philippe Ciuciu from
Neurospin, CEA in Saclay, Remi Dubujet, Elise Bannier, Isabelle
Courouge, Christian Barillot, Camille Maudet from EPI Visages in
Rennes.

We assessed and compared the performance of different ASL processing pipelines in order to select one, using specific indexes (contrast-to-noise ratio, partial volume effect, etc.). We proposed to assess the impact of the pipelines based on the quality of the final corrected ASL images, using a common set of subjects for all workflows. We drew on the expertise of the Visages and GIN teams on ASL, starting from existing attempts made in these teams. At the moment, there is a striking lack of such guidelines. The recent ASLtbx toolbox proposes a number of procedures that are based on very standard tools (e.g. SPM) and do not make use of more efficient approaches from the more recent literature. Similarly, in the BIRN project, processing pipelines are mentioned but none are currently available.

**Joint work with:** Lamiae Azizi, David Abrial and Myriam
Garrido from INRA Clermont-Ferrand-Theix.

Current risk mapping models for pooled data focus on
the estimated risk for each geographical unit. A risk
classification, *i.e.* grouping of geographical units with
similar risk, is then necessary to easily draw interpretable maps,
with clearly delimited zones in which protection measures can be
applied. As an illustration, we focus on the Bovine Spongiform
Encephalopathy (BSE) disease that threatened bovine production
in Europe and generated drastic cow culling. This example features
typical animal disease risk analysis issues, with very low risk
values, small numbers of observed cases and population sizes that
increase the difficulty of an automatic classification. We propose
to handle this task in a spatial clustering framework using a
non-standard discrete hidden Markov model prior designed to favor
a smooth risk variation. The model parameters are estimated using an
EM algorithm and a mean field approximation for which we develop a
new initialization strategy appropriate for spatial Poisson
mixtures. Using both simulated and our BSE data, we show that our
strategy performs well in dealing with low population sizes and
accurately determines high risk regions, both in terms of
localization and risk level estimation.
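The non-spatial core of this approach can be illustrated by the EM updates for a simple Poisson mixture over geographical units. This is a deliberately simplified sketch: the actual model replaces the independent class probabilities below with a hidden Markov random field prior handled by a mean field approximation, and uses a richer initialization strategy than the crude one here.

```python
import numpy as np

def em_poisson_mixture(y, n, K, n_iter=200, seed=0):
    """EM for a K-class Poisson mixture: unit i with population size n_i
    has y_i ~ Poisson(n_i * r_k) with probability pi_k, where r_k is the
    risk level of class k."""
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)
    r = np.sort(rng.uniform(0.5, 1.5, K)) * y.sum() / n.sum()
    for _ in range(n_iter):
        # E-step: posterior class probabilities, computed in the log
        # domain for numerical stability (the y! term cancels).
        logp = (np.log(pi) + y[:, None] * np.log(n[:, None] * r)
                - n[:, None] * r)
        logp -= logp.max(axis=1, keepdims=True)
        post = np.exp(logp)
        post /= post.sum(axis=1, keepdims=True)
        # M-step: update class weights and risk levels.
        pi = post.mean(axis=0)
        r = (post * y[:, None]).sum(axis=0) / (post * n[:, None]).sum(axis=0)
    return pi, r, post

# simulated example: 400 units, two risk classes
rng = np.random.default_rng(1)
n = np.full(400, 1000.0)
z = rng.integers(0, 2, 400)
y = rng.poisson(n * np.where(z == 1, 0.01, 0.001)).astype(float)
pi, r, post = em_poisson_mixture(y, n, 2)
```

With the spatial prior, the E-step posteriors are no longer available in closed form, which is precisely where the mean field approximation comes in.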

Main corresponding paper .

This is joint work with Eric Coissac and Pierre Taberlet from LECA (Laboratoire d'Ecologie Alpine) and Alain Viari from EPI Bamboo.

Biodiversity has been acknowledged as a vital resource for
ecosystem health and stability, yet it is facing an unprecedented
global decline.
In order to be effective, conservation actions need to be based on
reliable and fast analyses. Recent advances in DNA sequencing
methods now enable DNA-based identification of multiple species
from only a few, potentially degraded, environmental samples
(metabarcoding.org, ). This offers a new way of assessing
biodiversity and is of particular interest where large-scale
individual-based diversity assessment is difficult, for example in
tropical environments.
Owing to their comparatively low cost and effort, these high-throughput methods are expected
to produce vast amounts of data as they gain in popularity over the coming years.
The specific properties of these data (e.g. bias from sequencing
errors, the notion of species) and their high dimensionality provide
new statistical and computational challenges for biodiversity
assessment. This project aims at extending existing summary
statistics to be used with data from metabarcoding surveys and,
where this is not adequate, at developing new methodology. A special
focus is on the spatial mapping of biodiversity and the
co-occurrence of species. As a first step, we investigate
spatial clustering algorithms based on Markov random fields
(software SpaCEM3, http://

**Joint work with:** Pierre Fernique (Montpellier 2 University
and CIRAD) and Yann Guédon (CIRAD), Inria Virtual Plants.

The quantity and quality of yields in fruit trees are closely related
to processes of growth and branching, which ultimately determine the
regularity of flowering and the position of flowers. Flowering and
fruiting patterns are explained by statistical dependence between
the nature of a parent shoot (*e.g.* flowering or not) and the
number and natures of its child shoots, with potential
effects of covariates. Thus, a better characterization of these patterns and
dependencies is expected to lead to strategies to control the
demographic properties of the shoots (through varietal selection or crop
management policies), and thus to bring substantial improvements in
the quantity and quality of yields.

Since the connections between shoots can be represented by mathematical trees, statistical models based on multitype branching processes and Markov trees appear as a natural tool to model the dependencies of interest. Formally, the properties of a vertex are summed up using the notion of vertex state. In such models, the numbers of children in each state given the parent state are modeled through discrete multivariate distributions, and model selection procedures are necessary to specify parsimonious distributions. We developed an approach based on probabilistic graphical models to identify and exploit properties of conditional independence between the numbers of children in different states, so as to simplify the specification of their joint distribution. The graph building stage was based on exploring the space of possible chain graph models, which required defining a notion of neighbourhood of these graphs. A parametric distribution was associated with each graph, obtained by combining families of univariate and multivariate distributions or regression models, chosen by model selection procedures among different parametric families.
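The kind of model involved can be illustrated by a small simulator for a two-state multitype branching process. The independent-Poisson child distribution below is a deliberately crude stand-in for the graphical-model-based multivariate distributions developed in this work, and the states and mean values are hypothetical.

```python
import numpy as np

# Hypothetical two-state example: state 0 = vegetative, 1 = flowering.
# MEAN_CHILDREN[s][t] is the expected number of children in state t
# for a parent in state s (independent Poisson numbers of children).
MEAN_CHILDREN = {0: np.array([1.2, 0.6]),
                 1: np.array([0.8, 0.2])}

def grow_tree(root_state, n_generations, rng):
    """Simulate the branching process for a fixed number of generations
    and return the total number of shoots counted in each state."""
    counts = np.zeros(2, dtype=int)
    frontier = [root_state]
    for _ in range(n_generations):
        next_frontier = []
        for state in frontier:
            counts[state] += 1
            # draw the numbers of children in each state given the parent
            n_children = rng.poisson(MEAN_CHILDREN[state])
            for child_state, n in enumerate(n_children):
                next_frontier.extend([child_state] * n)
        frontier = next_frontier
    return counts

rng = np.random.default_rng(0)
totals = grow_tree(0, 5, rng)  # shoots per state over 5 generations
```

In the actual work, the independent Poissons are replaced by multivariate families whose dependence structure is read off a selected chain graph.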

This work was carried out during the first year of Pierre Fernique's PhD (Montpellier 2 University and CIRAD). It was applied to modelling dependencies between short or long, vegetative or flowering shoots in apple trees. The results highlighted contrasted patterns related to the parent shoot state, with an interpretation in terms of alternation of flowering (see paragraph ). It was also applied to the analysis of the connections between cyclic growth and flowering of mango trees. This work will be continued during Pierre Fernique's PhD thesis, with extensions to other fruit tree species and other parametric families of discrete multivariate distributions, including covariates and mixed effects.

**Joint work with:** Jean Peyhardi and Yann Guédon (Mixed
Research Unit DAP, Virtual Plants team), Baptiste Guitton, Yan
Holtz and Evelyne Costes (DAP, AFEF team), Catherine Trottier
(Montpellier University)

The aim of this work was to characterize the genetic determinants of alternation of flowering in apple tree progenies. Data were collected at two scales: the whole tree scale (with an annual time step) and a local scale (the annual shoot or AS, the portion of stem grown during a single year). Two replications of each genotype were available.

Indices were proposed to characterize alternation at the tree scale. The difficulty lies in the early detection of alternating genotypes, in a context where alternation is often concealed by a substantial increase in the number of flowers over consecutive years. To correctly separate the increase in the number of flowers due to the aging of young trees from alternation in flowering, our model relied on a parametric hypothesis for the trend (fixed slopes specific to genotypes and random slopes specific to replications), which translated into mixed-effect modelling. Different indices of alternation were then computed on the residuals, and clusters of individuals with contrasted bearing habits were identified.
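A toy version of this detrend-then-measure idea can be sketched as follows, with a plain linear fit standing in for the mixed-effect trend model, and a lag-one autocorrelation standing in for the actual alternation indices.

```python
import numpy as np

def alternation_index(flowers_per_year):
    """Detrend an annual flowering series with a linear fit (standing in
    for the genotype/replication mixed-effect trend) and measure
    alternation as minus the lag-1 autocorrelation of the residuals:
    values close to 1 indicate strong biennial bearing."""
    y = np.asarray(flowers_per_year, dtype=float)
    t = np.arange(len(y))
    slope, intercept = np.polyfit(t, y, 1)
    resid = y - (slope * t + intercept)
    denom = (resid ** 2).sum()
    if denom == 0.0:
        return 0.0
    return float(-(resid[:-1] * resid[1:]).sum() / denom)

# alternating high/low years superimposed on an increasing trend
index = alternation_index([10, 30, 25, 55, 40, 80, 60, 105])
```

Without the detrending step, the upward trend of young trees would mask the year-to-year alternation that the index is meant to capture.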

To model alternation of flowering at the AS scale, a second-order Markov tree model was built. Its transition probabilities were modelled as generalized linear mixed models, incorporating the effects of genotype, year and flowering memory in the Markovian part, with interactions between these components.

Asynchronism of flowering at the AS scale was assessed using an entropy-based criterion. The entropy allowed for a characterization of the respective roles of local alternation and asynchronism in the regularity of flowering at the tree scale.

Moreover, our models highlighted significant correlations between the indices of alternation at the AS and individual scales.

This work was extended during the Master 2 internship of Yan Holtz, supervised by Evelyne Costes and Jean-Baptiste Durand. New progenies were considered, and a methodology based on a lighter measurement protocol was developed and assessed. It consisted of assessing how accurately the indices computed from measurements at the tree scale could be approximated by the same indices computed at the AS scale. The approximations were shown to be sufficiently accurate to provide an operational strategy for apple tree selection.

As a perspective of this work, patterns in the production of child ASs (numbers of flowering and vegetative children) depending on the type of the parent AS will be analyzed using branching processes and different types of Markov trees, in the context of Pierre Fernique's PhD thesis (see paragraph ).

This is joint work with VI-Technology in the context of the IVP project.

Quality and throughput in printed circuit board (PCB) assembly lines constitute a continuous challenge, especially when placing smaller components on boards that are becoming increasingly dense. Automated optical inspection (AOI) technology allows PCB assembly lines to keep operating at a high throughput while visually inspecting production quality in terms of paste deposits, mounted components and solder joints in an automatic and non-contact manner. In an AOI system, high-definition cameras move precisely in both the X- and Y-directions to scan the device under test, lit by special lighting techniques, e.g. light-emitting diode (LED) lighting. The captured images are then analyzed using specific inspection algorithms to identify defects. AOI systems can be placed at several stages of the manufacturing process, such as bare board inspection, solder paste inspection, pre-reflow inspection and post-reflow inspection; they usually need some time to be programmed, via offline learning on verified boards and expert knowledge, before online inspection starts. Vi TECHNOLOGY (VIT) offers a wide range of AOI solutions to increase productivity throughout electronics manufacturing lines while enhancing the quality of products. Post-reflow AOI is performed after the reflow procedure in PCB assembly lines to enable inspection of the major post-reflow defects. This work focuses on certain types of post-reflow defects occurring on leaded components, i.e. lifted lead, no solder, excess of solder, contamination on lead, insufficient solder, bad wetting and dry joint. We aim at developing efficient post-reflow lead defect detection approaches by synergizing image analysis, pattern recognition, machine learning and statistical techniques, to improve the performance of VIT's commercial post-reflow AOI solutions in two respects: 1) reducing both the detection escape rate and the false detection rate; 2) minimizing programming efforts. The exact nature of the work is confidential.

Modern GPUs enable widely affordable personal computers to carry out massively parallel computation tasks. NVIDIA's CUDA technology provides a convenient parallel computing platform. Many state-of-the-art algorithms from different fields have been redesigned on top of CUDA to achieve computational speedups. Differential evolution (DE), a very promising evolutionary algorithm, is highly suitable for parallelization owing to its data-parallel algorithmic structure. However, most existing CUDA-based DE implementations suffer from excessive low-throughput memory accesses and inefficient device utilization. This work presents an improved CUDA-based DE that optimizes memory and device utilization: several logically related kernels are combined into one composite kernel to reduce global memory accesses; kernel execution configuration parameters are determined automatically to maximize device occupancy; and streams are employed to enable concurrent kernel execution and maximize device utilization. Experimental results on several numerical problems demonstrate the superior computational time efficiency of the proposed method over two recent CUDA-based DE implementations and the sequential DE, across varying problem dimensions and population sizes.
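For readers unfamiliar with DE, its data-parallel structure can be seen in a minimal sequential sketch. This is the textbook DE/rand/1/bin scheme in NumPy rather than CUDA, not the optimized kernels described above; the per-individual mutation, crossover and selection steps are exactly the ones a GPU version maps onto threads.

```python
import numpy as np

def de_minimize(f, bounds, pop_size=40, F=0.8, CR=0.9, n_gen=200, seed=0):
    """Minimal DE/rand/1/bin minimizer over box constraints `bounds`,
    a sequence of (lo, hi) pairs, one per dimension."""
    rng = np.random.default_rng(seed)
    lo, hi = np.asarray(bounds, dtype=float).T
    dim = lo.size
    pop = rng.uniform(lo, hi, (pop_size, dim))
    fit = np.apply_along_axis(f, 1, pop)
    for _ in range(n_gen):
        # mutation: v = a + F * (b - c) with random distinct-ish donors
        idx = rng.integers(0, pop_size, (pop_size, 3))
        a, b, c = pop[idx[:, 0]], pop[idx[:, 1]], pop[idx[:, 2]]
        mutant = np.clip(a + F * (b - c), lo, hi)
        # binomial crossover, forcing at least one mutant coordinate
        cross = rng.random((pop_size, dim)) < CR
        cross[np.arange(pop_size), rng.integers(0, dim, pop_size)] = True
        trial = np.where(cross, mutant, pop)
        # greedy selection
        trial_fit = np.apply_along_axis(f, 1, trial)
        better = trial_fit < fit
        pop[better], fit[better] = trial[better], trial_fit[better]
    best = int(fit.argmin())
    return pop[best], fit[best]

# minimize the 5-dimensional sphere function on [-5, 5]^5
x_best, f_best = de_minimize(lambda x: float((x ** 2).sum()),
                             [(-5.0, 5.0)] * 5)
```

Each individual's trial vector depends only on a few random donors, which is what makes the population loop embarrassingly parallel on a GPU.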

This work was nominated for the best paper award (finalist) in the Digital Entertainment Technologies and Arts / Parallel Evolutionary Systems session of the Genetic and Evolutionary Computation Conference (GECCO 2012) .

Max-stable distribution functions are theoretically well-grounded models for multivariate extreme values. However, they suffer from striking limitations in real data analysis, due to the intractability of the likelihood when the number of variables becomes high. Cumulative distribution networks (CDNs) have been introduced recently in the machine learning community and allow the construction of max-stable distribution functions for which the density can be computed. Unfortunately, we show in this work that the dependence structure expected in the data may not be accurately reflected by max-stable CDNs. To overcome this limitation, we propose to augment max-stable CDNs with the more standard Gumbel max-stable distribution function in order to enrich the dependence structure .
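The Gumbel max-stable distribution mentioned here is, in its symmetric logistic form, simple to evaluate. A sketch with unit Fréchet margins (the actual construction combines this family with max-stable CDNs):

```python
import numpy as np

def gumbel_logistic_cdf(z, alpha):
    """Symmetric logistic max-stable distribution of Gumbel with unit
    Frechet margins:
        F(z) = exp(-(sum_i z_i**(-1/alpha))**alpha),  0 < alpha <= 1.
    alpha = 1 gives independence; smaller alpha gives stronger
    dependence between components."""
    z = np.asarray(z, dtype=float)
    return float(np.exp(-np.sum(z ** (-1.0 / alpha)) ** alpha))
```

For instance, at `alpha = 1` the joint CDF factorizes into the product of unit Fréchet margins, while decreasing `alpha` raises the joint CDF at any fixed point, reflecting stronger dependence.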

**Joint work with:** Guillou, A. and Gardes, L. (Univ.
Strasbourg).

We introduced a new model of tail distributions depending on two
parameters

We are also working on the estimation of the second
order parameter

**Joint work with:** L. Gardes, Amblard,
C. (TimB in TIMC laboratory, Univ. Grenoble I) and Daouia, A.
(Univ. Toulouse I and Univ. Catholique de Louvain)

The goal of the PhD thesis of Alexandre Lekina was to contribute to
the development of theoretical and algorithmic models to tackle
conditional extreme value analysis, *i.e.* the situation where
some covariate information

Conditional extremes are studied in climatology where one is interested in how climate change over years might affect extreme temperatures or rainfalls. In this case, the covariate is univariate (time). Bivariate examples include the study of extreme rainfalls as a function of the geographical location. The application part of the study is joint work with the LTHE (Laboratoire d'étude des Transferts en Hydrologie et Environnement) located in Grenoble.

Future work will include the study of multivariate and spatial extreme values. With this aim, research on particular copulas has been initiated with Cécile Amblard, since copulas are the key tool for building multivariate distributions . The PhD theses of Jonathan El-methni and Gildas Mazo should also address this issue.

**Joint work with:** Guillou, A. and Gardes, L. (Univ.
Strasbourg), Stupfler, G. (Univ. Strasbourg)
and Daouia, A. (Univ. Toulouse I and Univ. Catholique de Louvain).

The boundary bounding a set of points is viewed as the largest level set of the points' distribution. This is then an extreme quantile curve estimation problem. We proposed estimators based on projection as well as on kernel regression methods applied to the extreme-values set, for particular sets of points .

In collaboration with A. Daouia, we investigate the application of such methods in econometrics , : a new characterization of the partial boundaries of a free disposal multivariate support is introduced by making use of large quantiles of a simple transformation of the underlying multivariate distribution. Pointwise empirical and smoothed estimators of the full and partial support curves are built as extreme sample and smoothed quantiles. Extreme-value theory then holds automatically for the empirical frontiers, and we show that some fundamental properties of extreme order statistics carry over to Nadaraya's estimates of upper quantile-based frontiers.
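The full and partial frontier estimators can be caricatured in one dimension as follows. This is a simplified sketch showing the free disposal hull and an order-q quantile frontier, without the smoothing studied in the papers.

```python
import numpy as np

def fdh_frontier(x_obs, y_obs, x_grid):
    """Free disposal hull estimate of the full frontier:
    phi(x) = max{ y_i : x_i <= x }."""
    out = []
    for x in x_grid:
        y = y_obs[x_obs <= x]
        out.append(y.max() if y.size else 0.0)
    return np.array(out)

def partial_frontier(x_obs, y_obs, x_grid, q=0.95):
    """Order-q partial frontier: a high quantile of y among points with
    x_i <= x, less sensitive to outlying observations than the FDH."""
    out = []
    for x in x_grid:
        y = y_obs[x_obs <= x]
        out.append(np.quantile(y, q) if y.size else 0.0)
    return np.array(out)
```

As q tends to 1 the partial frontier recovers the FDH estimate, which is the link exploited when extreme quantiles are used to characterize the boundary.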

In the PhD thesis of Gilles Stupfler (co-directed by Armelle Guillou and Stéphane Girard), new estimators of the boundary are introduced. The regression is performed on the whole set of points, the selection of the “highest” points being performed automatically by the introduction of high-order moments , , .

**Joint work with:** Carreau, J. (Hydrosciences Montpellier),
Gardes, L. (univ. Strasbourg) and Molinié, G. from Laboratoire
d'Etude des Transferts en Hydrologie et Environnement (LTHE),
France.

Extreme rainfalls are generally associated with two different precipitation regimes. Extreme cumulated rainfall over 24 hours results from stratiform clouds, for which the relief forcing is of primary importance. Extreme rainfall rates are defined as rainfall rates with a low probability of occurrence, typically with mean return levels higher than the maximum observed level. As an illustration, Figure presents the return levels that can be obtained for the Cévennes-Vivarais region. It is then of primary importance to study the sensitivity of extreme rainfall estimation to the estimation method considered.
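Return levels such as those shown in the figure are typically derived from a generalized extreme value (GEV) fit to annual maxima; the standard return-level formula reads as follows. This is a generic sketch, not the specific estimation methods whose sensitivity is compared in this study.

```python
import numpy as np

def gev_return_level(mu, sigma, xi, T):
    """T-year return level of a GEV(mu, sigma, xi) distribution fitted
    to annual maxima, i.e. the level exceeded on average once every
    T years:
        z_T = mu + (sigma/xi) * ((-log(1 - 1/T))**(-xi) - 1),  xi != 0,
        z_T = mu - sigma * log(-log(1 - 1/T)),                 xi == 0.
    """
    y = -np.log(1.0 - 1.0 / T)
    if xi == 0.0:
        return mu - sigma * np.log(y)  # Gumbel limit
    return mu + sigma / xi * (y ** (-xi) - 1.0)
```

By construction, the GEV cumulative distribution evaluated at `z_T` equals `1 - 1/T`, so the sensitivity of the return level to the fitted shape parameter `xi` is what drives the differences between estimation methods.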

The obtained results are published in .

**Joint work with:** Douté, S. from Laboratoire de
Planétologie de Grenoble, France and Saracco, J (University
Bordeaux).

Visible and near-infrared imaging spectroscopy is one of the key
techniques to detect, map and characterize mineral and volatile
(e.g. water-ice) species present at the surface of planets. Indeed,
the chemical composition, granularity, texture, physical state,
etc. of the materials determine the existence and morphology of the
absorption bands. The resulting spectra therefore contain very
useful information. Current imaging spectrometers provide data
organized as three-dimensional hyperspectral images: two spatial
dimensions and one spectral dimension. Our goal is to estimate the
functional relationship

**Joint work with:** A. Lombardot and S. Joshi (ST Crolles).

With the scaling down of technologies to the nanometer regime, static power dissipation in semiconductor devices is becoming more and more important. Techniques to accurately estimate the static power dissipation of a System on Chip (SoC) are becoming essential. Traditionally, designers use a standard corner-based approach to optimize and check their devices. However, this approach can drastically underestimate or overestimate the impact of process variations and leads to significant errors.

The need for an effective modelling of process variation for static power analysis has led to the introduction of statistical static power analysis. Some publications state that up to 50% of static power can be saved using a statistical approach. However, most statistical approaches are based on Monte Carlo analysis, and such methods are not suited to large devices. It is thus necessary to develop solutions for large devices that integrate into an industrial design flow. Our objective is to model the total consumption of the circuit from the probability distribution of the consumption of each individual gate. Our preliminary results are published in .
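The Monte Carlo baseline mentioned above can be sketched as follows, under the common (here assumed) lognormal model for per-gate leakage; it is precisely this brute-force sampling that does not scale to large devices.

```python
import numpy as np

def mc_total_leakage(log_means, log_sigmas, n_samples=100_000, seed=0):
    """Monte Carlo estimate of the total static power of a circuit:
    each gate's leakage is drawn from a lognormal distribution (a
    common assumption, since leakage depends exponentially on
    threshold-voltage variation) and the chip total is the sum over
    gates.  Returns the mean and the 99th percentile of the total."""
    rng = np.random.default_rng(seed)
    totals = rng.lognormal(log_means, log_sigmas,
                           (n_samples, len(log_means))).sum(axis=1)
    return totals.mean(), np.quantile(totals, 0.99)

# hypothetical chip with three gate populations (log-domain parameters)
mean_leak, p99_leak = mc_total_leakage([0.0, -1.0, -2.0], [0.5, 0.4, 0.3])
```

An analytical model of the distribution of the sum avoids drawing these samples gate by gate, which is the direction pursued for large devices.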

mistis participates in the weekly statistical seminar of Grenoble. F. Forbes is one of the organizers and several lecturers have been invited in this context.

S. Girard has been at the head of the probability and statistics department of the LJK since September 2012.

mistis is a partner in a three-year (2010-12) MINALOGIC
project (I-VP for Intuitive Vision Programming) supported by the
French Government. The project is led by VI Technology
(http://

Florence Forbes is coordinating the 2-year Inria ARC project AINSI
(http://

Title: Humanoids with audiovisual skills in populated spaces

Type: COOPERATION (ICT)

Defi: Cognitive Systems and Robotics

Instrument: Specific Targeted Research Project (STREP)

Duration: February 2010 - January 2013

Coordinator: Inria (France)

Others partners: CTU Prague (Czech Republic), University of Bielefeld (Germany), IDIAP (Switzerland), Aldebaran Robotics (France)

See also: http://

Abstract: Humanoids expected to collaborate with people should be able to interact with them in the most natural way. This involves significant perceptual, communication, and motor processes, operating in a coordinated fashion. Consider a social gathering scenario where a humanoid is expected to possess certain social skills. It should be able to explore a populated space, to localize people and determine their status, to decide to join one or two persons, to synthesize appropriate behavior, and to engage in dialog with them. Humans appear to solve these tasks routinely by integrating the often complementary information provided by multisensory data processing, from low-level 3D object positioning to high-level gesture recognition and dialog handling. Understanding the world from unrestricted s

Minwoo Jake Lee (from Jun 2012 until Aug 2012)

Subject: Clustering or classification of high dimensional data in the presence of outliers

Institution: Colorado State University (United States)

El Hadji DEME (from Mar 2012 until May 2012)

Subject: Bias reduction in extreme-value statistics

Institution: Université Gaston Berger (Senegal)

Seydou-Nourou Sylla (from October 2012 to December 2012)

Subject: Classification for medical data

Institution: Université Gaston Berger (Senegal)

Since September 2009, F. Forbes has been head of the committee in charge of examining post-doctoral candidates at Inria Grenoble Rhône-Alpes ("Comité des Emplois Scientifiques").

Since September 2009, F. Forbes has also been a member of the Inria national committee, "Comité d'animation scientifique", in charge of analyzing and motivating innovative activities in Applied Mathematics.

Florence Forbes is a member of an INRA committee (CSS MBIA) in charge of evaluating INRA researchers once a year.

F. Forbes is part of an INRA (French National Institute for Agricultural Research) Network (MSTGA) on spatial statistics.

F. Forbes and S. Girard were elected as members of the bureau of the “Analyse d'images, quantification, et statistique” group in the Société Française de Statistique (SFdS).

S. Girard is an associate editor of the international journal "Statistics and Computing".

Stéphane Girard

Master : Statistique inférentielle avancée, 45h, M1, Ensimag (Grenoble INP), France.

Florence Forbes

Master : Mixture models and EM algorithm, 12h, M2, UFR IM2A, Université Grenoble I, France.

M.-J. Martinez is a faculty member at Univ. Pierre Mendès France, Grenoble II.

J.-B. Durand is a faculty member at Ensimag, Grenoble INP.

F. Enikeeva is on a half-time ATER position at Ensimag, Grenoble INP.

C. Bakhous and J. El Methni are both moniteurs (teaching assistants) at University Joseph Fourier.

PhD & HdR

PhD in progress : Jonathan El Methni, Modèles en statistique des valeurs extrêmes, since October 2010, Stéphane Girard

PhD in progress : Christine Bakhous, Problèmes de sélection de modèles en IRM fonctionnelle, since November 2010, Florence Forbes and Michel Dojat

PhD in progress : Gildas Mazo, Modèles spatiaux en statistique des valeurs extrêmes, since October 2011, Florence Forbes and Stéphane Girard

PhD in progress : El Hadji Deme, Réduction du biais en statistique des valeurs extrêmes, since October 2009, Stéphane Girard

PhD in progress : Seydou-Nourou Sylla, Modélisation statistique pour l'analyse de causes de décès décrites par autopsie verbale en milieu rural africain, since October 2012, Stéphane Girard

Stéphane Girard was a member of the Strasbourg university committee in charge of examining applications for assistant professor in 2012.

Florence Forbes was also a member of an INRA committee in charge of examining applications for junior researcher positions in 2012 at dept MBIA of INRA.

F. Forbes was involved in the PhD committees of

El Ghali Lazrak from Inria Nancy and INRA Aster Mirecourt, Université de Lorraine in October 2012 (reviewer).

Alexandre Janon from Inria team MOISE and LJK Grenoble, November 2012 (examiner).

Mahdi Bagher from Inria team Maverick and LJK Grenoble, November 2012 (examiner).

F. Forbes was also involved in the HDR committee of Michael Blum, CR CNRS at TimC in Grenoble, December 2012 (examiner).