## Section: New Results

### Markov models

#### Spatial modelling of plant diversity from high-throughput environmental DNA sequence data

Participants : Florence Forbes, Angelika Studeny.

**This is joint work with:** Eric Coissac and Pierre Taberlet from LECA
(Laboratoire d'Ecologie Alpine) and Alain Viari from Inria team Bamboo.

This work [48] considers a statistical modelling approach to investigate spatial cross-correlations between species in an ecosystem. A special feature is the origin of the data: high-throughput environmental DNA sequencing of soil samples. Here we use data collected at the Nouragues CNRS Field Station in French Guiana. We describe bivariate spatial relationships in these data by a separable linear model of coregionalisation and estimate a cross-correlation parameter. Based on this estimate, we visualise plant taxa co-occurrence patterns in the form of "interaction graphs", which can be interpreted in terms of ecological interactions. Limitations of this approach are discussed along with possible alternatives in [48].
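For orientation, a separable bivariate linear model of coregionalisation is commonly written as follows. The notation ($Y_1, Y_2$ for the two spatial processes, $T$ for a $2 \times 2$ covariance matrix, $\rho$ for a spatial correlation function with $\rho(0)=1$) is introduced here for illustration only and is not taken from [48]:

$$\operatorname{Cov}\big(Y_i(s),\, Y_j(s')\big) = T_{ij}\,\rho(s - s'), \qquad i,j \in \{1,2\},$$

$$r_{12} = \frac{T_{12}}{\sqrt{T_{11}\,T_{22}}}.$$

Under this assumption the same spatial correlation $\rho$ is shared by both components, and the cross-correlation between the two taxa reduces to the single parameter $r_{12}$.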

#### Modelling multivariate counts with graphical Markov models

Participants : Jean-Baptiste Durand, Florence Forbes, Marie-José Martinez, Angelika Studeny.

**Joint work with:**
Pierre Fernique (Montpellier 2 University, CIRAD
and Inria Virtual Plants), Yann Guédon
(CIRAD and Inria Virtual Plants) and
Iragaël Joly (INRA-GAEL and Grenoble INP).

Multivariate count data are defined as the numbers of items of different categories obtained by sampling within a population whose individuals are grouped into categories. The analysis of multivariate count data is a recurrent and crucial issue in numerous modelling problems, particularly in the fields of biology and ecology (where the data can represent, for example, counts of children associated with multitype branching processes), sociology and econometrics. Denoting by $K$ the number of categories, multivariate count data analysis relies on modelling the joint distribution of the $K$-dimensional random vector $N=(N_0,\ldots,N_{K-1})$ with discrete components. Our work focused on: (i) identifying categories that appear simultaneously or, on the contrary, are mutually exclusive, which was achieved by identifying conditional independence relationships between the $K$ variables; (ii) building parsimonious parametric models consistent with these relationships; (iii) characterizing and testing the effects of covariates on the distribution of $N$, particularly on the dependencies between its components.

Our context of application was characterised by zero-inflated, often
right-skewed marginal distributions. Thus, Gaussian and Poisson
distributions were not *a priori* appropriate. Moreover, the
multivariate histograms typically had many cells, most of which
were empty. Consequently, nonparametric estimation was not efficient.

To achieve these goals, we proposed an approach based on graphical probabilistic models (Koller & Friedman, 2009 [70]) to represent the conditional independence relationships in $N$, and on parametric distributions to ensure model parsimony [51]. The considered graphs were partially directed, so as to represent both marginal independence relationships and cyclic dependencies involving at least four variables.

Graph search was achieved by a stepwise approach derived from a unification of algorithms previously presented in Koller & Friedman (2009) for DAGs: hill climbing, greedy search, first ascent and simulated annealing. The search algorithm was improved by taking into account our parametric distribution assumptions, which led to caching overlapping graphs at each step. Graph search algorithms for DAGs were adapted to PDAGs by defining new operators specific to PDAGs.
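The role of caching in stepwise graph search can be illustrated with a minimal sketch. It is not the algorithm of [51]: it handles DAGs only (no PDAG operators), uses only edge additions and removals, and scores each node with a BIC-type criterion on multinomial tables rather than the parametric count distributions used in this work. The names `family_score`, `has_cycle` and `hill_climb` are hypothetical.

```python
import itertools
import math
from collections import Counter


def family_score(data, child, parents):
    """BIC-type score of one node given its parents (multinomial tables; an assumption)."""
    n = len(data)
    joint = Counter((tuple(row[p] for p in parents), row[child]) for row in data)
    marg = Counter(tuple(row[p] for p in parents) for row in data)
    loglik = sum(c * math.log(c / marg[pa]) for (pa, _), c in joint.items())
    n_levels = len({row[child] for row in data})
    n_params = len(marg) * (n_levels - 1)
    return loglik - 0.5 * n_params * math.log(n)


def has_cycle(parents):
    """Depth-first search for a directed cycle; `parents` maps node -> set of parents."""
    state = {}

    def visit(v):
        if state.get(v) == 1:      # node is on the current DFS path: cycle found
            return True
        if state.get(v) == 2:      # node already fully explored
            return False
        state[v] = 1
        if any(visit(p) for p in parents[v]):
            return True
        state[v] = 2
        return False

    return any(visit(v) for v in parents)


def hill_climb(data, n_vars):
    """Greedy hill climbing over DAGs; family scores are cached and reused."""
    cache = {}

    def score(child, pa):
        key = (child, frozenset(pa))
        if key not in cache:       # each (node, parent set) pair is scored once
            cache[key] = family_score(data, child, sorted(pa))
        return cache[key]

    parents = {v: set() for v in range(n_vars)}
    while True:
        best_gain, best_move = 1e-9, None
        for a, b in itertools.permutations(range(n_vars), 2):
            new_pa = parents[b] ^ {a}   # toggle edge a -> b (add or remove)
            gain = score(b, new_pa) - score(b, parents[b])
            if gain > best_gain:
                trial = dict(parents)
                trial[b] = new_pa
                if not has_cycle(trial):
                    best_gain, best_move = gain, (b, new_pa)
        if best_move is None:
            return parents, cache
        parents[best_move[0]] = best_move[1]
```

Because the score decomposes over nodes, a move that changes the parents of one node leaves every other cached family score valid, which is where the gain in speed comes from in this simplified setting.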

Comparisons between different algorithms in the literature for directed and undirected graphical models were performed on simulated datasets to: (i) assess the gain in speed induced by caching; (ii) compare the graphs obtained under parametric and nonparametric distributional assumptions; (iii) compare different strategies for graph initialization. Strategies based on several random graphs were compared to those based on a fast estimation of an undirected graph, assumed to be the moral graph.

First results were obtained in modelling individual daily activity programs [50] and interactions between flowering and vegetative growth in plants (see sections below).

#### Statistical characterization of tree structures based on Markov tree models and multitype branching processes, with applications to tree growth modelling

Participant : Jean-Baptiste Durand.

**Joint work with:**
Pierre Fernique (Montpellier 2 University and CIRAD) and Yann Guédon
(CIRAD), Inria Virtual Plants.

The quantity and quality of yields in fruit trees are closely related
to processes of growth and branching, which ultimately determine the
regularity of flowering and the position of flowers. Flowering and
fruiting patterns are explained by statistical dependence between
the nature of a parent shoot (*e.g.* flowering or not) and the
number and nature of its child shoots, with potential
effects of covariates. Thus, better characterization of patterns and
dependences is expected to lead to strategies to control the
demographic properties of the shoots (through varietal selection or crop
management policies), and thus to bring substantial improvements in
the quantity and quality of yields.

Since the connections between shoots can be represented by mathematical trees, statistical models based on multitype branching processes and Markov trees appear as a natural tool to model the dependencies of interest. Formally, the properties of a vertex are summed up using the notion of vertex state. In such models, the numbers of children in each state given the parent state are modelled through discrete multivariate distributions. Model selection procedures are necessary to specify parsimonious distributions. We developed an approach based on probabilistic graphical models (see Section 6.3.2) to identify and exploit properties of conditional independence between numbers of children in different states, so as to simplify the specification of their joint distribution [51], [32].
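As a minimal sketch of the data these models work with, the following tabulates, for a tree with labelled vertices, the empirical distribution of the vector of child counts per state, conditional on the parent state. The function name `children_count_tables`, the input encoding and the toy tree are hypothetical; the parametric families and graphical model selection of [51] are not reproduced here.

```python
from collections import Counter, defaultdict


def children_count_tables(parent_of, state_of, states):
    """Empirical distribution of the child-count vector given the parent state.

    parent_of: dict child vertex -> parent vertex (the root has no entry)
    state_of:  dict vertex -> state label
    states:    ordered list of state labels (fixes the order of the count vector)
    """
    counts = {v: [0] * len(states) for v in state_of}
    for child, parent in parent_of.items():
        counts[parent][states.index(state_of[child])] += 1

    tables = defaultdict(Counter)
    for v, vec in counts.items():
        # Leaves contribute an all-zero vector; in real data terminal shoots
        # may be censored and would need separate handling.
        tables[state_of[v]][tuple(vec)] += 1

    return {
        s: {vec: c / sum(t.values()) for vec, c in t.items()}
        for s, t in tables.items()
    }


# Toy tree: "F" = flowering shoot, "V" = vegetative shoot (illustrative only).
state_of = {0: "V", 1: "F", 2: "V", 3: "V", 4: "F", 5: "V"}
parent_of = {1: 0, 2: 0, 3: 2, 4: 2, 5: 1}
tables = children_count_tables(parent_of, state_of, ["F", "V"])
```

Each entry `tables[s]` is a nonparametric estimate of the discrete multivariate distribution attached to parent state `s`. With many states such tables become sparse, which is the motivation for parsimonious parametric alternatives.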

This work was carried out in the context of the first year of Pierre Fernique's PhD (Montpellier 2 University and CIRAD). It was applied to model dependencies between short or long, vegetative or flowering shoots in apple trees. The results highlighted contrasting patterns related to the parent shoot state, with interpretation in terms of alternation of flowering (see paragraph 6.3.4). It was also applied to the analysis of the connections between cyclic growth and flowering of mango trees [32]. This work will be continued during Pierre Fernique's PhD thesis, with extensions to other fruit tree species and other parametric discrete multivariate families of distributions, including covariates and mixed effects.

#### Statistical characterization of the alternation of flowering in fruit tree species

Participant : Jean-Baptiste Durand.

**Joint work with:**
Jean Peyhardi and Yann Guédon (Mixed Research Unit DAP, Virtual Plants
team), Baptiste Guitton, Yan Holtz and Evelyne Costes (DAP, AFEF
team) and Catherine Trottier (Montpellier University).

A first study was performed to characterize the genetic determinants of the alternation of flowering in apple tree progenies [37], [21]. Data were collected at two scales: at whole-tree scale (with an annual time step) and at a local scale (annual shoot or AS, i.e. the portion of stem grown during a single year). Two replications of each genotype were available.

Indices were proposed to characterize alternation at tree scale. The difficulty lies in the early detection of alternating genotypes, in a context where alternation is often concealed by a substantial increase in the number of flowers over consecutive years. To correctly separate the increase in the number of flowers due to the aging of young trees from alternation in flowering, our model relied on a parametric hypothesis for the trend (fixed slopes specific to genotype and random slopes specific to replications), which translated into mixed-effects modelling. Then, different indices of alternation were computed on the residuals. Clusters of individuals with contrasting bearing habits were identified.
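The principle of removing a trend before measuring alternation can be sketched as follows. This is a simplification under stated assumptions: the trend is fitted per tree by ordinary least squares rather than by the mixed-effects model described above, and the lag-1 autocorrelation of the residuals is used as one possible alternation index (the indices actually used are not specified here). The function names and the numerical series are hypothetical.

```python
def linear_residuals(y):
    """Residuals of an ordinary least-squares line fitted to y against years 0..n-1."""
    n = len(y)
    t_mean = (n - 1) / 2
    y_mean = sum(y) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    slope = sum((t - t_mean) * (v - y_mean) for t, v in enumerate(y)) / sxx
    return [v - (y_mean + slope * (t - t_mean)) for t, v in enumerate(y)]


def lag1_autocorrelation(res):
    """Lag-1 autocorrelation of residuals; strongly negative values suggest alternation."""
    den = sum(r * r for r in res)
    if den == 0:
        return 0.0  # perfectly linear series: no residual signal to assess
    return sum(a * b for a, b in zip(res, res[1:])) / den


# Made-up yearly flower counts for two young trees, both with an increasing trend.
alternating = [5, 5, 25, 25, 45, 45]   # trend plus a high/low oscillation
regular = [1, 4, 9, 16, 25, 36]        # smooth increase, no oscillation

index_alternating = lag1_autocorrelation(linear_residuals(alternating))
index_regular = lag1_autocorrelation(linear_residuals(regular))
```

On the raw counts both series simply increase; only after detrending does the first one show the sign change between consecutive years that characterizes alternation.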

To model alternation of flowering at AS scale, a second-order Markov tree model was built. Its transition probabilities were modelled as generalized linear mixed models, to incorporate the effects of genotypes, year and memory of flowering for the Markovian part, with interactions between these components.

Asynchronism of flowering at AS scale was assessed using an entropy-based criterion. The entropy allowed for a characterisation of the roles of local alternation and asynchronism in regularity of flowering at tree scale.
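One simple way to read an entropy-based measure of asynchronism is sketched below. It is an assumption, not the criterion used in [37], [21]: for each year, the binary entropy of the proportion of flowering ASs within a tree is computed, so that a value of 0 means all ASs behave alike and a value of 1 bit means they are evenly split. The function names and the data are hypothetical.

```python
import math


def binary_entropy(p):
    """Shannon entropy (in bits) of a Bernoulli distribution with parameter p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def yearly_flowering_entropy(flowering):
    """flowering: dict year -> list of 0/1 flags, one per annual shoot (AS)."""
    return {
        year: binary_entropy(sum(flags) / len(flags))
        for year, flags in flowering.items()
    }


# Made-up trees: ASs flowering in synchrony vs. out of phase with each other.
synchronous = {2010: [1, 1, 1, 1], 2011: [0, 0, 0, 0]}
asynchronous = {2010: [1, 0, 1, 0], 2011: [0, 1, 0, 1]}

h_sync = yearly_flowering_entropy(synchronous)
h_async = yearly_flowering_entropy(asynchronous)
```

In this toy case both trees have strictly alternating ASs, but only the synchronous one alternates at tree scale; the asynchronous one flowers regularly as a whole, which is the distinction the entropy is meant to capture.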

Moreover, our models highlighted significant correlations between indices of alternation at AS and individual scales.

This work was extended by the Master 2 internship of Yan Holtz, supervised by Evelyne Costes and Jean-Baptiste Durand. New progenies were considered, and a methodology based on a lighter measurement protocol was developed and assessed. The assessment consisted in evaluating how accurately the indices computed from measurements at tree scale are approximated by the same indices computed at AS scale. The approximations were shown to be sufficiently accurate to provide an operational strategy for apple tree selection.

As a perspective of this work, patterns in the production of children ASs (numbers of flowering and vegetative children) depending on the type of the parent AS must be analyzed using branching processes and different types of Markov trees, in the context of Pierre Fernique's PhD thesis (see paragraph 6.3.3).