BIGS is a joint team of Inria, CNRS and University of Lorraine, within the Institut Élie Cartan of Lorraine (IECL), UMR 7502 CNRS-UL laboratory in mathematics, of which Inria is a strong partner. One member of BIGS, T. Bastogne, comes from the Research Center of Automatic Control of Nancy (CRAN), with which BIGS has strong relations in the domain “Health-Biology-Signal”. Our research is mainly focused on stochastic modeling and statistics but also aims at a better understanding of biological systems. BIGS involves applied mathematicians whose research interests mainly concern probability and statistics. More precisely, our attention is directed at (1) stochastic modeling, (2) estimation and control for stochastic processes, (3) regression and machine learning, and (4) statistical learning and applications in health. The main objective of BIGS is to exploit these skills in applied mathematics to provide a better understanding of issues arising in life sciences, with a special focus on (1) tumor growth and heterogeneity, (2) gene networks, (3) telomere length dynamics, (4) epidemiology and e-health.

We give here the main lines of our research. For clarity, we have structured them into four items. All of these items deal with stochastic modeling and inference, so they are all interconnected.

Our aim is to propose relevant stochastic frameworks for the modeling and understanding of biological systems. Stochastic processes are particularly suitable for this purpose. Among them, Markov processes provide a first framework for modeling populations of cells 86, 66. Piecewise deterministic processes are non-diffusion processes that are also frequently used in the biological context 54, 65, 55. Among Markov models, we have developed strong expertise in processes derived from Brownian motion and stochastic differential equations 80, 64. For instance, knowledge about Brownian or random walk excursions 85, 79 helps to analyse genetic sequences and to develop inference about them. We also have strong expertise in stochastic modeling of complex biological populations using individual-based models. These models can be used either from the point of view of asymptotic stochastic analysis 51, e.g. to study the long-term Darwinian evolution of populations, or from the point of view of numerical analysis of biological phenomena 60, 40. We also develop mathematical tools for the analysis of the long-time behavior of stochastic population processes accounting for possible extinction of (sub)populations 52.

We develop inference methods for the stochastic processes that we use for modeling. Control of stochastic processes is also a way to optimise the administration (dose, frequency) of therapy, such as targeted therapies in cancer. Our team has solid expertise in inference of the jump rate and the kernel of piecewise-deterministic Markov processes (PDMP) 45, 41, 44, 43, but many directions remain to be explored. For instance, previous work assumed complete observation of jumps and mode, which is unrealistic in practice. We also tackle the problem of inference for “hidden PDMP”. For example, in pharmacokinetic modeling, we want to account for the presence of timing noise and for identification from longitudinal data. We have expertise on these subjects 46, and we also use mixed models to estimate tumor growth or heterogeneity 47.
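As a purely illustrative sketch of jump-rate inference under complete observation, the Python snippet below simulates a toy PDMP and estimates its jump rate. All modeling choices are assumptions made for the example and are not taken from the cited works: the jump rate is constant, the flow between jumps is dx/dt = -x, and each jump adds one unit to the state. Under these assumptions the maximum-likelihood estimate of the rate is simply the number of observed jumps divided by the observation time.

```python
import math
import random


def simulate_pdmp(rate, horizon, rng, x0=1.0):
    """Simulate a toy PDMP on [0, horizon] (hypothetical model).

    Between jumps the state follows dx/dt = -x; jumps occur at a constant
    rate and add one unit to the state.
    """
    t, x = 0.0, x0
    jump_times, post_jump_states = [], []
    while True:
        wait = rng.expovariate(rate)      # exponential inter-jump time
        if t + wait > horizon:
            return jump_times, post_jump_states
        t += wait
        x = x * math.exp(-wait)           # deterministic flow until the jump
        x += 1.0                          # jump kernel: add one unit
        jump_times.append(t)
        post_jump_states.append(x)


def estimate_constant_rate(jump_times, horizon):
    """MLE of a constant jump rate when every jump is observed."""
    return len(jump_times) / horizon


rng = random.Random(0)
true_rate, horizon = 2.0, 5000.0
jump_times, post_jump_states = simulate_pdmp(true_rate, horizon, rng)
rate_hat = estimate_constant_rate(jump_times, horizon)
print(f"true rate = {true_rate}, estimated rate = {rate_hat:.3f}")
```

State-dependent jump rates and partially observed ("hidden") PDMPs, which are the cases of actual interest here, require more elaborate estimators than this counting argument.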

We consider the control of stochastic processes within the framework of Markov Decision Processes 77 and their generalization known as multi-player stochastic games, with a particular focus on infinite-horizon problems. In this context, we are interested in the complexity analysis of standard algorithms, as well as the design and analysis of approximate numerical schemes for large problems in the spirit of 49. Regarding complexity, a central topic of research is the analysis of the Policy Iteration algorithm, on which significant progress has been made in recent years 88, 76, 62, 83, but which is still not fully understood. For large problems, we have extensive experience in sensitivity analysis of approximate dynamic programming algorithms for Markov Decision Processes 81, 69, 82, and we are currently investigating whether and how similar ideas may be adapted to multi-player stochastic games.
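For readers unfamiliar with Policy Iteration, here is a minimal sketch on a hypothetical two-state, two-action MDP with a discounted infinite-horizon criterion (the discount factor, transitions and rewards are invented for the example). Policy evaluation is done by fixed-point iteration rather than by solving a linear system, which is a simplification chosen to keep the code dependency-free.

```python
def q_value(s, a, V, P, R, gamma):
    """One-step lookahead value of taking action a in state s."""
    return R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])


def evaluate_policy(policy, P, R, gamma, tol=1e-10):
    """Iterate the Bellman operator of a fixed policy until convergence."""
    V = [0.0] * len(P)
    while True:
        delta = 0.0
        for s in range(len(P)):
            v_new = q_value(s, policy[s], V, P, R, gamma)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V


def policy_iteration(P, R, gamma):
    """Alternate policy evaluation and greedy improvement until stable."""
    policy = [0] * len(P)
    while True:
        V = evaluate_policy(policy, P, R, gamma)
        stable = True
        for s in range(len(P)):
            q = [q_value(s, a, V, P, R, gamma) for a in range(len(P[s]))]
            best = max(range(len(q)), key=q.__getitem__)
            if q[best] > q[policy[s]] + 1e-12:   # strict improvement only
                policy[s] = best
                stable = False
        if stable:
            return policy, V


# Hypothetical MDP: action 0 = "stay", action 1 = "move to the other state".
# P[s][a] is a list of (probability, next_state); R[s][a] is the reward.
P = [[[(1.0, 0)], [(1.0, 1)]],
     [[(1.0, 1)], [(1.0, 0)]]]
R = [[0.0, 0.0],
     [1.0, 0.0]]
policy, V = policy_iteration(P, R, gamma=0.9)
print(policy, [round(v, 3) for v in V])
```

On this toy problem the algorithm stops after two improvement rounds; the open complexity questions mentioned above concern how the number of such rounds grows with the size of the problem.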

Recently, our group has focused its attention on modeling and inference for graph data. A graph data structure consists of a set of nodes, together with a set of pairs of these nodes called edges. This type of data is frequently used in biology because it provides a mathematical representation of many concepts, such as biological networks of relationships in a population or between genes in a cell.

Network inference is the process of making inference about the link between two variables, by taking into account the information about other variables. Reference 87 gives a very good introduction and many references about network inference and mining. Many methods are available to infer and test edges in Gaussian graphical models 87, 71, 59, 61. However, the Gaussian assumption does not hold when dealing with typical “zero-inflated” abundance data, and we want to develop inference in this case.
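To make the Gaussian graphical model setting concrete: in such a model, two variables are linked by an edge exactly when their partial correlation given all other variables is nonzero, and partial correlations can be read off the inverse covariance (precision) matrix. The sketch below uses a hypothetical 3-variable covariance matrix corresponding to a chain 1 - 2 - 3. Variables 1 and 3 are marginally correlated, yet their partial correlation is zero, so the graph has no edge between them.

```python
import math


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination."""
    n = len(matrix)
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def partial_correlations(cov):
    """Partial correlation matrix: -K_ij / sqrt(K_ii K_jj), K = cov^{-1}."""
    K = invert(cov)
    n = len(K)
    return [[1.0 if i == j else -K[i][j] / math.sqrt(K[i][i] * K[j][j])
             for j in range(n)] for i in range(n)]


# Hypothetical covariance matrix of a Gaussian chain 1 - 2 - 3.
cov = [[0.75, 0.50, 0.25],
       [0.50, 1.00, 0.50],
       [0.25, 0.50, 0.75]]
pcor = partial_correlations(cov)
marginal_13 = cov[0][2] / math.sqrt(cov[0][0] * cov[2][2])
print(f"marginal corr(1,3) = {marginal_13:.3f}, partial corr(1,3) = {pcor[0][2]:.3f}")
```

With real data the covariance matrix is estimated, so deciding whether a partial correlation is zero becomes a testing problem; this is where the methods cited above come in, and where zero-inflated data break the Gaussian assumption.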

Concerning gene networks, most studies have been based on population-averaged data: now that technologies enable us to observe mRNA levels in individual cells, a revolution in terms of precision, the network reconstruction problem paradoxically becomes more challenging than ever. Indeed, the traditional way of seeing a gene regulatory network as a deterministic system with some small external noise is being challenged by the probabilistic, bursty nature of gene expression revealed at single-cell level. Our objective is to propose dynamical models and inference methods that fully exploit the particular time structure of single-cell data. We described a promising strategy in which the network inference problem is seen as a calibration procedure for a new PDMP model that is able to acceptably reproduce real single-cell data 63, 78.

Among graphs, trees play a special role because they offer a powerful model for many biological concepts, from RNA to phylogenetic trees in heterogeneous tumors, through plant structures. Our research deals with several aspects of tree data. In particular, we work on statistical inference for this type of data under a given stochastic model. We also work on lossy compression of trees via directed acyclic graphs. These methods enable us to compute distances between tree data faster than from the original structures and with high accuracy.
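The idea of representing a tree by a directed acyclic graph can be illustrated by the classical lossless reduction, in which identical subtrees are stored only once. The sketch below shows only this lossless step, not the lossy scheme mentioned above; the encoding of a tree as nested tuples of children is an assumption made for the example.

```python
def compress_to_dag(tree, table=None):
    """Return (node_id, table), where table maps each distinct subtree
    signature (the tuple of its children's ids) to a unique DAG node id."""
    if table is None:
        table = {}
    child_ids = tuple(compress_to_dag(child, table)[0] for child in tree)
    if child_ids not in table:           # first time this subtree shape is seen
        table[child_ids] = len(table)
    return table[child_ids], table


def count_nodes(tree):
    """Number of nodes of a tree given as nested tuples of children."""
    return 1 + sum(count_nodes(child) for child in tree)


def full_binary_tree(depth):
    """Full binary tree of the given depth; a leaf is the empty tuple."""
    if depth == 0:
        return ()
    return (full_binary_tree(depth - 1), full_binary_tree(depth - 1))


tree = full_binary_tree(3)
root_id, dag = compress_to_dag(tree)
print(f"tree nodes = {count_nodes(tree)}, DAG nodes = {len(dag)}")
```

Children are treated as ordered here; for unordered trees one would sort `child_ids` before looking them up. Highly self-similar trees compress well, which is what makes computations on the DAG cheaper than on the original structure.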

Regression models and machine learning aim at inferring statistical links between a variable of interest and covariates. In biological studies, it is always important to develop adapted learning methods both in the context of standard data and also for data of high dimension (sometimes with few observations) and very massive or online data.

Many methods are available to estimate conditional quantiles and test dependencies 75, 67. Among them, we have developed nonparametric estimation by local analysis via kernel methods 57, 58, and we want to study the properties of this estimator in order to derive a measure of risk based e.g. on confidence bands and tests. We also study other regression models such as survival analysis and spatio-temporal models with covariates. Among the multiple regression models, we want to develop omnibus tests that examine several assumptions together.
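As an illustration of kernel-based conditional quantile estimation, the sketch below weights each observation by a Gaussian kernel centered at the point of interest and inverts the resulting weighted empirical distribution function. This is a generic textbook construction, not necessarily the estimator of 57, 58; the simulated data, the bandwidth and the kernel are assumptions made for the example.

```python
import math
import random


def gaussian_kernel(u):
    """Unnormalized Gaussian kernel (the constant cancels in the ratio)."""
    return math.exp(-0.5 * u * u)


def conditional_quantile(x0, xs, ys, alpha, bandwidth):
    """Kernel estimate of the alpha-quantile of Y given X = x0.

    Each Y_i receives weight K((x0 - X_i) / h); the estimate is the smallest
    Y_i whose cumulative normalized weight reaches alpha.
    """
    weights = [gaussian_kernel((x0 - x) / bandwidth) for x in xs]
    total = sum(weights)
    cumulative = 0.0
    for y, w in sorted(zip(ys, weights)):
        cumulative += w
        if cumulative >= alpha * total:
            return y
    return max(ys)  # guard against rounding when alpha is close to 1


# Hypothetical data: Y = 2 X + Gaussian noise, X uniform on [0, 1].
rng = random.Random(0)
xs = [rng.random() for _ in range(2000)]
ys = [2.0 * x + rng.gauss(0.0, 0.1) for x in xs]
median_at_half = conditional_quantile(0.5, xs, ys, alpha=0.5, bandwidth=0.05)
upper_at_half = conditional_quantile(0.5, xs, ys, alpha=0.95, bandwidth=0.05)
print(f"median = {median_at_half:.3f}, 95% quantile = {upper_at_half:.3f}")
```

In this simulation the true conditional median at x = 0.5 is 1; the bandwidth controls the usual trade-off between bias and variance.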

Concerning the analysis of high dimensional data, our view on the topic relies on the French data analysis school, specifically on Factorial Analysis. In this context, stochastic approximation is an essential tool 68, which allows one to approximate eigenvectors in a stepwise manner 73, 72, 74. We aim at performing accurate classification or clustering by taking advantage of the possibility of updating the information "online" using stochastic approximation algorithms 50. We focus on several incremental procedures for regression and data analysis like linear and logistic regressions and PCA (Principal Component Analysis).
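To give a flavor of stepwise eigenvector approximation, here is a minimal sketch of Oja's rule, a standard stochastic approximation scheme for the leading eigenvector of a covariance matrix. It is used here as a stand-in and is not claimed to be the process studied in the cited references. The data stream is assumed to be centered, and the step-size sequence 1/(n + 100) is an arbitrary choice for the example.

```python
import math
import random


def oja_first_eigenvector(stream, dim, offset=100):
    """Online estimate of the leading eigenvector of the covariance matrix.

    Each observation x updates w by w + a_n * (x . w) * x, followed by
    renormalization; a_n = 1 / (n + offset) is a decreasing step size.
    """
    w = [1.0 / math.sqrt(dim)] * dim
    for n, x in enumerate(stream):
        projection = sum(wi * xi for wi, xi in zip(w, x))
        step = 1.0 / (n + offset)
        w = [wi + step * projection * xi for wi, xi in zip(w, x)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        w = [wi / norm for wi in w]
    return w


# Hypothetical stream: independent centered Gaussian coordinates with
# standard deviations 2, 1 and 0.5, so the leading eigenvector is (1, 0, 0).
rng = random.Random(1)
std_devs = [2.0, 1.0, 0.5]
stream = ([rng.gauss(0.0, s) for s in std_devs] for _ in range(20000))
w = oja_first_eigenvector(stream, dim=3)
print([round(wi, 3) for wi in w])
```

Each observation is used once and then discarded, so memory usage does not grow with the length of the stream, which is the point of online procedures.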

We also focus on the biological context of high-throughput bioassays in which several hundreds or thousands of biological signals are measured for a posteriori analysis. We have to account for the inter-individual variability within the modeling procedure. We aim at developing a new solution based on an ARX (Auto Regressive model with eXternal inputs) model structure using the EM (Expectation-Maximisation) algorithm for the estimation of the model parameters.

We want to propose stochastic processes to model the appearance of mutations and the evolution of their frequencies in tumor samples, through new collaborations with clinicians who measure a particular quantity called circulating tumor DNA (ctDNA). The final purpose is to use ctDNA as an early biomarker of resistance to a targeted therapy: this is the aim of the project funded by ITMO Cancer that we coordinate. In the ongoing work on low-grade gliomas, a local database of 400 patients will soon be available to construct models. We plan to extend it through national and international collaborations (Montpellier CHU, Montreal CRHUM). Our aim is to build a decision-aid tool for personalised medicine.

We already mentioned in Section 3.4 our interest in the modeling and inference of transcriptomic bursting in gene regulatory networks from single-cell data. We are also currently working on the prediction and identification of therapeutic targets for chronic lymphocytic leukemia from gene expression data. Our goal is to propose new models that predict the outcome of gene silencing experiments. Inference will be performed on gene expression data from cells of patients suffering from different forms of chronic lymphocytic leukemia. The goal is to identify therapeutic targets which could be silenced to reduce cell proliferation.

In the context of personalized medicine, we have many ongoing projects with CHU Nancy. They deal with biomarker research, the prognostic value of quantitative variables and events, scoring, and adverse events. We also want to develop our expertise in change-point detection in a project with APHP (Assistance Publique Hôpitaux de Paris) for the detection of adverse events earlier than the clinical signs and symptoms. The clinical relevance of predictive analytics is obvious for high-risk patients, such as those with solid organ transplantation or severe chronic respiratory disease. The main challenge is change-point detection in multivariate and heterogeneous signals (for instance daily measures of electrocardiogram, body temperature, spirometry parameters, sleep duration, etc.). Other collaborations with clinicians concern foetopathology, and we want to use our work on conditional distribution functions to explain fetal and child growth. To that end, we use data from the “Service de fœtopathologie et de placentologie” of the “Maternité Régionale Universitaire” (CHU Nancy).

Telomeres are disposable buffers at the ends of chromosomes which are truncated during cell division, so that, over time, the telomere ends become shorter with each cell division. In this way, they are markers of aging. Through a collaboration with Prof. A. Benetos, geriatrician at CHU Nancy, we recently obtained data on the distribution of the length of telomeres from blood cells 84. We want to work in three connected directions: (1) refine the methodology for the analysis of the available data; (2) propose a dynamical model for the lengths of telomeres and study its mathematical properties (long-term behavior, quasi-stationarity, etc.); and (3) use these properties to develop new statistical methods.

We are continuing our research on quasi-stationary distributions (QSD), that is, distributions of Markov stochastic processes with absorption, which are stationary conditionally on non-absorption. For models of biological populations, absorption usually corresponds to extinction of a (sub-)population. QSDs are fundamental tools to describe the population state before extinction and to quantify the large-time behavior of the probability of extinction.
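In the simplest setting, a discrete-time Markov chain on finitely many non-absorbed states with an irreducible and aperiodic sub-stochastic transition matrix Q, the QSD is the normalized left Perron eigenvector of Q, and the associated eigenvalue is the one-step survival probability under the QSD. The sketch below computes it by normalized power iteration; the matrix Q is a hypothetical example.

```python
def quasi_stationary_distribution(Q, iterations=1000):
    """Normalized power iteration nu <- nu Q / sum(nu Q).

    Q[i][j] is the probability of moving from non-absorbed state i to
    non-absorbed state j; each row may sum to less than 1, the deficit
    being the probability of absorption. Returns (nu, theta) with
    nu Q = theta * nu and sum(nu) = 1.
    """
    n = len(Q)
    nu = [1.0 / n] * n
    theta = 0.0
    for _ in range(iterations):
        new = [sum(nu[i] * Q[i][j] for i in range(n)) for j in range(n)]
        theta = sum(new)                 # survival probability over one step
        nu = [v / theta for v in new]    # condition on non-absorption
    return nu, theta


# Hypothetical chain with two non-absorbed states; from each of them the
# process is absorbed with probability 0.2 at every step.
Q = [[0.5, 0.3],
     [0.2, 0.6]]
nu, theta = quasi_stationary_distribution(Q)
print([round(v, 4) for v in nu], round(theta, 4))
```

Each iteration is exactly one step of the conditional distribution given non-absorption, so the speed at which `nu` stabilizes is a toy analogue of the convergence rates studied in the results below.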

Thanks to the previous general result of the team in 53, together with B. Cloez (INRAE), we proved in 26 the exponential convergence of a chemostat model, whose dynamics are highly degenerate due to a deterministic part, towards a unique quasi-stationary distribution.

We also finalized an important work 7 that provides general criteria for the exponential convergence of conditional distributions of absorbed Markov processes when the convergence is not uniform with respect to the initial distribution. Our results allow us to characterize a large subset of the domain of attraction of the minimal QSD and apply to a large range of stochastic processes, including diffusion processes and perturbed dynamical systems. We completed this work with specific studies of the periodic case in 24 and the case of reducible processes in 25. In this last work, we were in particular able to characterize cases with polynomial speed of convergence to a QSD and to prove the existence of a QSD for general processes in denumerable state spaces, assuming only aperiodicity, the existence of a Lyapunov function and the existence of a point in the state space from which the return time is finite with positive probability.

Motivated by our work with M. Benaïm (Univ. Neuchâtel) on degenerate processes such as hypoelliptic diffusions 48, we studied in 21 the links between Feller properties and quasi-compactness of general semigroups. This work clarifies the links between existing results on QSDs for hypoelliptic diffusions. We also provided in 6 a counterexample to the uniqueness of a quasi-stationary distribution for a diffusion process which satisfies the weak Hörmander condition.

In collaboration with A. Watson (Univ. College London, UK), we studied a general fragmentation-growth equation with unbounded rates. The work is decomposed into two parts: in the first part, it is shown that the equation admits a unique solution on a certain definition space; in the second part, spectral properties of the solution are established 36. The proofs are largely based on 7 and on fine properties of piecewise deterministic Markov processes.

In 33, we obtained a central limit theorem and Berry-Esseen estimates for Markov processes conditioned to never be absorbed (so-called

We continued our study of parameter scalings of individual-based models of biological populations under mutation and selection, taking into account the influence of negligible but non-extinct populations. In a work within the ERC SINGER, in collaboration with S. Méléard (École Polytechnique), S. Mirrahimi (Univ. Montpellier) and V.C. Tran (Univ. Paris Est Marne-la-Vallée) 23, we were able to give an individual-based justification of the Hamilton-Jacobi equation of adaptive dynamics (see e.g. 70), with a specific parameter scaling that is promising for the study of local (in space) extinction of sub-populations. The analysis of models allowing for such an extinction is the next step of this project.

We also worked on general evolutionary models of adaptive dynamics under an assumption of large population and small mutations. This year, we obtained in 22 existence, uniqueness and ergodicity results for a centered version of the Fleming-Viot process of population genetics. This is a key step to recover variants of the canonical equation of adaptive dynamics, which describes the long-time evolution of the dominant phenotype in the population, under less stringent biological assumptions than in previous works. We plan to complete this project next year.

The asexual multi-type Galton-Watson branching processes as well as the single-type bisexual processes have been studied in the literature. In particular, survival conditions for these processes are well known in both cases. However, until now, the multi-type bisexual branching processes have only been studied in very specific situations and no general mathematical description has been established yet.
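As a reminder of what a survival condition looks like in the simplest classical case, consider the single-type asexual Galton-Watson process: its extinction probability is the smallest fixed point in [0, 1] of the offspring generating function, and it is strictly less than 1 only when the mean number of offspring exceeds 1 (degenerate cases aside). The sketch below computes it by iterating the generating function from 0; the offspring distributions are hypothetical.

```python
def extinction_probability(offspring_probs, iterations=10000):
    """Smallest fixed point in [0, 1] of f(s) = sum_k p_k s^k.

    offspring_probs[k] is the probability that an individual has k children.
    Iterating f from 0 yields an increasing sequence converging to the
    extinction probability of the process started from one individual.
    """
    q = 0.0
    for _ in range(iterations):
        q = sum(p * q ** k for k, p in enumerate(offspring_probs))
    return q


# Supercritical example: mean offspring 0.25 * 1 + 0.5 * 2 = 1.25 > 1.
q_supercritical = extinction_probability([0.25, 0.25, 0.5])
# Subcritical example: mean offspring 0.5 < 1, extinction is almost sure.
q_subcritical = extinction_probability([0.5, 0.5])
print(round(q_supercritical, 6), round(q_subcritical, 6))
```

In the bisexual multi-type setting no such simple scalar equation is available, since reproduction depends on the mating function applied to the whole population of each type.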

In 31, we studied general multi-type bisexual branching processes with superadditive mating function. We exhibited a necessary and sufficient condition for almost sure extinction, we proved a law of large numbers for our model and we studied the long-time convergence of the rescaled process.

In collaboration with E. Horton (Inria ASTRAL team) and A. Cox (Univ. of Bath, UK), we proposed a new population size model gathering in the same framework branching processes and processes with Moran type interactions. We studied this process in a setting of large population and long time 27. Its dynamics are related to the evolution of a non-conservative semi-group, whose spectral properties provide information on the long-time behavior of the model.

The aim is to better understand how living cells make decisions (e.g., differentiation of a stem cell into a particular specialized type), seeing decision-making as an emergent property of an underlying complex molecular network. Indeed, it is now proven that cells react probabilistically to their environment: cell types do not correspond to fixed states, but rather to “potential wells” of a certain energy landscape (representing the energy of the possible states of the cell) that we are trying to reconstruct. The achievement of this year is to show that the same mathematical model driven by transcriptional bursting can be used simultaneously as an inference tool, to reconstruct biologically relevant networks, and as a simulation tool, to generate realistic transcriptional profiles emerging from gene interactions 35.

Lung exposure to various types of particles, such as those present in cigarette smoke, can lead to chronic obstructive pulmonary disease (COPD). COPD bronchi are an area of intense immunological activity and tissue remodeling, as evidenced by the extensive immune cell infiltration and changes in tissue structures. This allows persistent contact between resident cells and stimulated immune cells. Our hypothesis is that the contact between cells is a major cause of chronic destructive or fibrotic manifestations. We aim to analyze the potential cell-cell interactions in situ in human tissues, to characterize in vitro the dynamics of the interplay, and to define a computational model with intercellular interactions which fits experimental measurements and explains the macroscopic properties of cell populations. The effects of potential therapeutic drugs modulating local intercellular interactions will be tested by simulations. Two papers have been submitted this year 29, 30.

In a collaboration with A. Lejay (Inria PASTA team) and their PhD student A. Anagnostakis, D. Villemonais proposed a method for approximating general, singular diffusions by discrete time and state space processes 1. One of its main advantages over existing methods is that the main computational cost is incurred upstream and thus represents a fixed cost, independent of the number of simulations performed afterwards.

We considered a toy model of an animal foraging in a one-dimensional space. The position of the animal as time

Many goodness-of-fit tests have been developed to assess the different assumptions of a (possibly heteroscedastic) regression model. Most of them are ‘directional’ in that they detect departures from a given assumption of the model. Other tests are ‘global’ (or ‘omnibus’) in that they assess whether a model fits a dataset on all its assumptions. We focus on the task of choosing the structural part of the regression function because it contains easily interpretable information about the studied relationship. We consider two nonparametric ‘directional’ tests and one nonparametric ‘global’ test, all based on generalizations of the Cramér-von Mises statistic.
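For reference, the classical one-sample Cramér-von Mises statistic, which the tests above generalize to the regression setting, measures the discrepancy between the empirical distribution function of a sample and a hypothesized distribution function F. The sketch below implements only this classical statistic through its usual computational formula; it is not the regression test statistic of the tests considered here.

```python
def cramer_von_mises(sample, cdf):
    """Classical one-sample Cramér-von Mises statistic.

    Uses W^2 = 1/(12 n) + sum_i (F(x_(i)) - (2 i - 1) / (2 n))^2,
    where x_(1) <= ... <= x_(n) is the ordered sample.
    """
    xs = sorted(sample)
    n = len(xs)
    return 1.0 / (12 * n) + sum(
        (cdf(x) - (2 * i - 1) / (2 * n)) ** 2
        for i, x in enumerate(xs, start=1)
    )


def uniform_cdf(x):
    """Distribution function of the uniform law on [0, 1]."""
    return min(max(x, 0.0), 1.0)


# A sample spread evenly over [0, 1] gives the smallest possible value
# 1/(12 n), whereas a sample concentrated near 0 gives a larger value.
w2_even = cramer_von_mises([0.25, 0.75], uniform_cdf)
w2_skewed = cramer_von_mises([0.01, 0.02], uniform_cdf)
print(round(w2_even, 4), round(w2_skewed, 4))
```

Large values of the statistic indicate a poor fit; in practice the null distribution is needed to calibrate the test, which in the regression setting is typically approximated by bootstrap.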

To perform these goodness-of-fit tests, we have developed the R package cvmgof 42, an easy-to-use tool for practitioners, available from the Comprehensive R Archive Network (CRAN). The use of the library is illustrated through a tutorial on real data, and simulation studies are carried out in order to show how the package can be exploited to compare the 3 implemented tests. The practitioner can also easily compare the test procedures with different kernel functions, bootstrap distributions, numbers of bootstrap replicates, or bandwidths. The package was updated last year; this is its third version. An article 2 has been published on this work in 2022.

We are now working on nonparametric tests associated with the functional form of the variance of the regression model. For this, we continue to work on the global test of Ducharme and Ferrigno in order to compare its performance with directional tests associated with the variance of the model. Many simulations are in progress and a part of this work has been presented at the CMStatistics 2022 conference 37. This will also make it possible to propose a more general package-type tool for validating the regression models used in practice.

To complete this work, we plan to assess the other assumptions of a regression model such as the additivity of the random error term. The implementation of these directional tests would enrich the cvmgof package and offer a complete easy-to-use tool for validating regression models. Moreover, the assessment of the overall validity of the model when using several directional tests will be compared with that done when using only a global test. In particular, we will discuss the well-known problem of multiple testing by comparing the results obtained from multiple test procedures with those obtained when using a global test strategy.

Stochastic approximation is an important tool for the analysis of streaming data, introduced by Robbins and Monro in 1951, that can be used for example to estimate online parameters of a regression function 56 or centers of clusters in unsupervised classification 50. Another type of stochastic approximation processes was introduced by Benzécri in 1969 for estimating eigenvectors and eigenvalues of the

On this topic, our works with E. Albuisson (CHRU Nancy) on constrained binary logistic regression with online standardized data 13 and on the construction and update of an online ensemble score involving linear discriminant analysis and logistic regression 12, described in the previous activity report, are now published.

In the article 32, we establish an almost sure convergence theorem of two stochastic approximation processes for estimating eigenvectors of the unknown

Our research in the field of epidemiology focuses on fetal development in the last two trimesters of pregnancy. Reference or standard curves are required in this kind of biomedical problem. Values that lie outside the limits of these reference curves may indicate the presence of a disorder. Data are from the French EDEN mother-child cohort (INSERM), a study investigating the prenatal and early postnatal determinants of child health and development. A total of 2002 pregnant women were recruited before 24 weeks of amenorrhoea in two maternity clinics from middle-sized French cities (Nancy and Poitiers). From May 2003 to September 2006, 1899 newborns were then included. The main outcomes of interest are fetal (via ultrasound) and postnatal growth, adiposity development, respiratory health, atopy, behaviour and bone, cognitive and motor development. We are studying fetal weight and height as a function of gestational age in the third trimester of pregnancy. Some classical empirical and parametric methods such as polynomial regression are first used to construct these curves. For instance, polynomial regression is one of the most common parametric approaches for modeling growth data, especially during the prenatal period. However, these classical methods build upon restrictive assumptions on estimated curves. We therefore propose to work with semi-parametric LMS methods, by modifying the response variable (fetal weight) with, among others, Box–Cox transformations. An article detailing these methodologies applied to the EDEN data should be submitted next year.
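In the usual LMS parametrisation, the distribution of the response at a given age is summarized by three quantities: the Box–Cox power L, the median M and the coefficient of variation S. A measurement y is converted to a z-score by z = ((y/M)^L - 1)/(L S), with the limit log(y/M)/S when L = 0, and reference centiles are obtained by inverting this relation. The sketch below implements these two formulas; the numerical values of L, M and S are hypothetical and do not come from the EDEN data.

```python
import math


def lms_zscore(y, L, M, S):
    """z-score of a measurement y under the LMS model (L, M, S)."""
    if abs(L) < 1e-12:                        # Box-Cox limit as L -> 0
        return math.log(y / M) / S
    return ((y / M) ** L - 1.0) / (L * S)


def lms_centile(z, L, M, S):
    """Measurement corresponding to the z-score z (inverse of lms_zscore)."""
    if abs(L) < 1e-12:
        return M * math.exp(S * z)
    return M * (1.0 + L * S * z) ** (1.0 / L)


# Hypothetical LMS parameters for fetal weight (grams) at one gestational age.
L, M, S = 0.3, 3000.0, 0.12
weight_p95 = lms_centile(1.645, L, M, S)      # approximately the 95th centile
z_back = lms_zscore(weight_p95, L, M, S)
print(f"95th centile = {weight_p95:.0f} g, z-score recovered = {z_back:.3f}")
```

In a full analysis L, M and S are smooth functions of gestational age estimated from the data; the code above only covers the final step of turning fitted parameters into centile curves and z-scores.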

Alternative nonparametric methods such as Nadaraya-Watson kernel estimation, local polynomial estimation, B-splines or cubic splines are also developed to construct these curves. The practical implementation of these methods requires working on smoothing parameters or on the choice of knots for the different types of nonparametric estimation. In particular, an optimal choice of these parameters has been proposed. To fit these curves, we have developed the R package quantCurves 39, an easy-to-use tool for practitioners. In addition, a graphical user interface (GUI) intended for practitioners is being developed to enable intuitive visualization of the results given by the package, and an article is in progress.
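For illustration, the Nadaraya-Watson estimator of a regression function at a point x0 is a weighted average of the responses, with weights given by a kernel evaluated at the scaled distances between x0 and the observed covariates. The sketch below is a generic implementation with a Gaussian kernel on hypothetical data; it does not reproduce the interface of the quantCurves package.

```python
import math


def nadaraya_watson(x0, xs, ys, bandwidth):
    """Nadaraya-Watson estimate of E[Y | X = x0] with a Gaussian kernel."""
    weights = [math.exp(-0.5 * ((x0 - x) / bandwidth) ** 2) for x in xs]
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)


# Hypothetical noiseless data on a regular grid: y = 2 x + 1.
xs = [i / 100 for i in range(101)]
ys = [2.0 * x + 1.0 for x in xs]

fit_center = nadaraya_watson(0.5, xs, ys, bandwidth=0.1)
fit_border = nadaraya_watson(0.0, xs, ys, bandwidth=0.1)
print(f"estimate at 0.5 = {fit_center:.4f}, estimate at 0.0 = {fit_border:.4f}")
```

At the center of the grid the estimate recovers the true value 2, whereas at the left border it overshoots the true value 1 because all neighbors lie on one side. This boundary bias is one reason why local polynomial estimation and a careful choice of the smoothing parameter matter in practice.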

We are working with L. Vallat (CHRU Strasbourg) on the inference of dynamical gene networks from RNAseq and proteome data. The goal is to infer a model of gene expression that predicts gene expression in cells where the expression of specific genes is silenced (e.g. using siRNA), in order to select the silencing experiments which are most likely to reduce cell proliferation. We expect the selected genes to provide new therapeutic targets for the treatment of chronic lymphocytic leukemia. This year, we have developed a package for the statistical analysis of temporal gene expression datasets with several biological conditions (in particular for exploratory analysis and the detection of differentially expressed genes), which will be submitted soon to Bioconductor.

We proposed a new methodology for selecting and ranking covariates associated with a variable of interest in a context of high-dimensional data under dependence but few observations. The methodology successively intertwines the clustering of covariates, decorrelation of covariates using Factor Latent Analysis, selection using aggregation of adapted methods and finally ranking. We first applied our method to transcriptomic data of 37 patients with advanced non-small-cell lung cancer who have received chemotherapy, to select the transcriptomic covariates that explain the survival outcome of the treatment. Secondly, we applied our method to 79 breast tumor samples to define patient profiles for a new metastatic biomarker and associated gene network in order to personalize the treatments. This work is published in 3 and is implemented in the R package ‘ARMADA’.

The start-up EMOSIS develops blood tests relying on flow cytometry in order to improve in vitro diagnosis of vascular thrombosis. This technology leads to multiparametric measurements on tens of thousands of cells collected from each blood sample. Manual methods of analysis classically used in flow cytometry are based on data visualization by means of histograms or scatter plots. Recent progress in the active area of computational methods for dimension reduction suggests many directions of improvement of the classical approaches for the analysis of flow cytometry data. Our first goal is to define and operate such methods. We started to focus on methods of information geometry and topological analysis of data. Once appropriate methods are identified, an important aspect to consider will be the visualization of the results in a way that is easy for clinicians to interpret and ethically permissible. In the longer term, our ambition is to design more accurate prediction tools for diagnosis.

N. Champagnat is a scientific collaborator of the ERC SINGER (AdG 101054787) on Stochastic dynamics of sINgle cells, coordinated by S. Méléard (Ecole Polytechnique). He is involved in the research axes “From stochastic processes to singular Hamilton-Jacobi equations” and “Lineages and time reversed trajectories” of this project.

N. Champagnat serves as associate editor of ESAIM: Probability & Statistics and Stochastic Models.

The members of the team wrote referee reports for Acta Applicandae Mathematicae, Annals of Applied Probability, Annals of Applied Statistics, Annals of Probability, Bulletin of Mathematical Biology, COVID, Discrete and Continuous Dynamical Systems Series B, Electronic Journal of Probability, Journal de l'École Polytechnique, Journal of Mathematical Biology, Stochastics and Partial Differential Equations: Analysis and Computations, Stochastic Models, Stochastic Processes and their Applications, Viruses.

A. Gégout-Petit is vice-president of the European Network for Business and Industrial Statistics (ENBIS).

A. Gégout-Petit was in the hiring committees of a biostatistics ‘Professeur’ position at Sorbonne Univ., and three ‘Chaire de professeur junior’ (CPJ) at CNRS (INSMI, SPJ Monaie), Univ. Lorraine (Biostatistics), and Univ. de Pau et des Pays de l'Adour (Artificial Intelligence).

BIGS faculty members have teaching obligations at Univ. Lorraine and teach at least 192 hours each year. They teach probability and statistics at different levels (Licence, Master, Engineering school). Many of them have pedagogical responsibilities.