BIGS is a joint team of Inria, CNRS and Université de Lorraine, hosted by the Institut Élie Cartan (UMR 7502 CNRS-UL), a mathematics laboratory of which Inria is a strong partner. One member of BIGS, T. Bastogne, comes from the Research Center for Automatic Control of Nancy (CRAN), with which BIGS has strong relations in the domain "Health-Biology-Signal". Our research is mainly focused on stochastic modeling and statistics, with the aim of better understanding biological systems. BIGS involves applied mathematicians whose research interests mainly concern probability and statistics. More precisely, our attention is directed towards (1) stochastic modeling, (2) estimation and control for stochastic processes, (3) algorithms and estimation for graph data and (4) regression and machine learning. The main objective of BIGS is to exploit these skills in applied mathematics to provide a better understanding of issues arising in life sciences, with a special focus on (1) tumor growth, (2) photodynamic therapy, (3) population studies of genomic data and micro-organism genomics, (4) epidemiology and e-health.

We give here the main lines of our research, which belong to the domains of probability and statistics. For clarity, we chose to structure them into four items. Although this choice is not arbitrary, the boundaries between these items are sometimes fuzzy, because each of them deals with modeling and inference and they are all interconnected.

Our aim is to propose relevant stochastic frameworks for the modeling and understanding of biological systems. Stochastic processes are particularly suitable for this purpose. Among them, Markov chains give a first framework for the modeling of populations of cells 73, 50. Piecewise deterministic processes are non-diffusive processes also frequently used in the biological context 40, 49, 42. Among Markov models, we have developed strong expertise on processes derived from Brownian motion and stochastic differential equations 66, 48. For instance, knowledge about Brownian or random walk excursions 72, 64 helps to analyse genetic sequences and to develop inference about them. However, nature provides many examples of systems in which the observed signal has a given Hölder regularity that does not correspond to the one we might expect from a system driven by ordinary Brownian motion.
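
As an illustration, a piecewise deterministic process alternates deterministic flow with random jumps. The following minimal sketch simulates a toy model (hypothetical parameters, not one of the cited models): exponential growth between jumps, with jumps arriving at a constant rate and multiplying the state by a random fraction.

```python
import math
import random

def simulate_pdmp(x0=1.0, growth=0.5, jump_rate=2.0, t_max=10.0, seed=0):
    """Toy PDMP: deterministic exponential growth between jumps;
    jumps occur at rate jump_rate and multiply the state by a
    random fraction in [0, 1)."""
    rng = random.Random(seed)
    t, x = 0.0, x0
    path = [(t, x)]
    while True:
        wait = rng.expovariate(jump_rate)   # exponential inter-jump time
        if t + wait > t_max:
            break
        t += wait
        x *= math.exp(growth * wait)        # follow the deterministic flow
        x *= rng.uniform(0.0, 1.0)          # random downward jump
        path.append((t, x))
    return path
```

The pair (jump rate, post-jump kernel) is exactly what the inference work described below aims to estimate from observed trajectories.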

This situation is commonly handled by noisy equations driven by Gaussian processes such as fractional Brownian motion or fractional fields. The basic aspects of these differential equations are now well understood, mainly thanks to the so-called rough paths tools 56, but also invoking the Russo-Vallois integration techniques 65. The specific issue of Volterra equations driven by fractional Brownian motion, which is central for the problem of subdiffusion within proteins, is addressed in 41. Many generalizations (Gaussian or not) of this model have recently been proposed, for some Gaussian locally self-similar fields, for some non-Gaussian models 53, or for anisotropic models 37.

We develop inference for the stochastic processes that we use for modeling. Control of stochastic processes is also a way to optimize the administration (dose, frequency) of a therapy.

There are many techniques for estimating diffusion processes or the coefficients of fractional or multifractional Brownian motion from a set of observations 52, 33, 39. However, the inference problem for diffusions driven by a fractional Brownian motion is still in its infancy. Our team has solid expertise in the inference of the jump rate and the kernel of Piecewise Deterministic Markov Processes (PDMPs) 31, 30, 29, 32. Still, there are many directions in which to go further. For instance, previous works assumed complete observation of jumps and modes, which is unrealistic in practice. We tackle the problem of inference for "hidden PDMPs". As an example, in pharmacokinetics modeling, we want to account for the presence of timing noise and for identification from longitudinal data. We have expertise on these subjects 34, and we have also used mixed models to estimate tumor growth 35.

We consider the control of stochastic processes within the framework of Markov Decision Processes 63 and their generalization known as multi-player stochastic games, with a particular focus on infinite-horizon problems. In this context, we are interested in the complexity analysis of standard algorithms, as well as the design and analysis of approximate numerical schemes for large problems in the spirit of 36. Regarding complexity, a central topic of research is the analysis of the Policy Iteration algorithm, which has seen significant progress in recent years 75, 62, 47, 69, but is still not fully understood. For large problems, we have a long experience of sensitivity analysis of approximate dynamic programming algorithms for Markov Decision Processes 71, 70, 67, 55, 68, and we are currently investigating whether and how similar ideas may be adapted to multi-player stochastic games.
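
To fix ideas, Policy Iteration alternates exact policy evaluation with greedy policy improvement until the policy is stable. A minimal sketch on a hypothetical two-state, two-action deterministic MDP (an illustrative toy problem, not one studied by the team):

```python
GAMMA = 0.9
# Toy deterministic MDP: state -> action -> (reward, next state)
MDP = {
    0: {'a': (1.0, 0), 'b': (0.0, 1)},
    1: {'a': (2.0, 1), 'b': (0.0, 0)},
}

def evaluate(policy, tol=1e-12):
    """Iterative policy evaluation (Gauss-Seidel sweeps)."""
    V = {s: 0.0 for s in MDP}
    while True:
        delta = 0.0
        for s in MDP:
            r, nxt = MDP[s][policy[s]]
            new = r + GAMMA * V[nxt]
            delta = max(delta, abs(new - V[s]))
            V[s] = new
        if delta < tol:
            return V

def policy_iteration():
    """Alternate evaluation and greedy improvement until stable."""
    policy = {s: 'a' for s in MDP}
    while True:
        V = evaluate(policy)
        improved = {
            s: max(MDP[s], key=lambda a: MDP[s][a][0] + GAMMA * V[MDP[s][a][1]])
            for s in MDP
        }
        if improved == policy:
            return policy, V
        policy = improved
```

On this toy problem the algorithm converges in two improvement steps; the complexity questions mentioned above concern how the number of such steps scales with the size of the problem.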

A graph data structure consists of a set of nodes together with a set of pairs of these nodes called edges. This type of data is frequently used in biology because it provides a mathematical representation of many concepts, such as biological structures and networks of relationships in a population. The group has recently focused attention on modeling and inference for graph data.

Network inference is the process of making inference about the link between two variables while taking into account the information about other variables. 74 gives a very good introduction and many references about network inference and mining. Many methods are available to infer and test edges in Gaussian graphical models 74, 57, 45, 46. However, when dealing with abundance data, we are far from the Gaussian assumption because of zero-inflated data, and we want to develop inference in this setting.

Among graphs, trees play a special role because they offer a good model for many biological concepts, from RNA to phylogenetic trees to plant structures. Our research deals with several aspects of tree data. In particular, we work on statistical inference for this type of data under a given stochastic model. We also work on lossy compression of trees via directed acyclic graphs. These methods enable us to compute distances between tree data faster than from the original structures, and with high accuracy.
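
The classical lossless reduction underlying these methods can be sketched as follows: identical subtrees are detected recursively and stored only once, so that the tree becomes a DAG (a generic sketch of the principle, not the team's implementation, which additionally performs lossy compression and distance computations).

```python
def compress(tree, registry):
    """Map each distinct (ordered) subtree to a single DAG node id.
    `tree` is a nested tuple of children; a leaf is the empty tuple ().
    `registry` maps a subtree signature to its DAG node id."""
    key = tuple(compress(child, registry) for child in tree)
    if key not in registry:
        registry[key] = len(registry)  # allocate a new DAG node
    return registry[key]
```

For a full binary tree of depth d, the tree has 2^(d+1) - 1 nodes but the DAG has only d + 1, which is what makes subsequent computations faster.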

Regression models and machine learning aim at inferring statistical links between a variable of interest and covariates. In biological studies, it is important to develop learning methods adapted both to standard data and to high-dimensional data (sometimes with few observations) as well as massive or online data.

Many methods are available to estimate conditional quantiles and test dependencies 61, 51. Among them, we have developed nonparametric estimation by local analysis via kernel methods 43, 44, and we want to study the properties of this estimator in order to derive measures of risk such as confidence bands and tests. We also study many other regression models, such as survival analysis and spatio-temporal models with covariates. Among the multiple regression models, we want to develop omnibus tests that examine several assumptions together.
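
For illustration, a kernel-based estimate of a conditional quantile can be obtained by inverting a Nadaraya-Watson-type weighted empirical distribution function (a generic textbook sketch with illustrative bandwidth, not the exact estimator of 43, 44):

```python
import math

def gaussian_kernel(u):
    return math.exp(-0.5 * u * u)

def conditional_quantile(x0, xs, ys, tau=0.5, h=0.5):
    """Kernel estimate of the conditional tau-quantile of Y given X = x0:
    weight each observation by K((x0 - x_i)/h), build the weighted
    empirical conditional CDF, and invert it."""
    w = [gaussian_kernel((x0 - x) / h) for x in xs]
    total = sum(w)
    acc = 0.0
    for y, wi in sorted(zip(ys, w)):
        acc += wi
        if acc >= tau * total:
            return y
    return max(ys)
```

The choice of the bandwidth h drives the bias-variance trade-off, which is precisely where confidence bands and tests become useful.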

Concerning the analysis of high-dimensional data, our view on the topic relies on the French data analysis school, specifically on factorial analysis tools. In this context, stochastic approximation is an essential tool 54, which allows one to approximate eigenvectors in a stepwise manner 59, 58, 60. BIGS aims at performing accurate classification or clustering by taking advantage of the possibility of updating the information "online" using stochastic approximation algorithms 38. We focus on several incremental procedures for regression and data analysis, such as linear and logistic regressions and PCA (Principal Component Analysis).

We also focus on the biological context of high-throughput bioassays, in which several hundreds or thousands of biological signals are measured for subsequent analysis. We have to account for the inter-individual variability within the modeling procedure. We aim at developing a new solution based on an ARX (Auto-Regressive model with eXternal inputs) model structure, using the EM (Expectation-Maximisation) algorithm for the estimation of the model parameters.

On this topic, we want to propose branching processes to model the appearance of mutations in tumors, through new collaborations with clinicians. The observed process is the circulating tumor DNA (ctDNA). The final purpose is to use ctDNA as an early biomarker of resistance to an immunotherapy treatment. This is the aim of the ITMO project. Another topic is the identification of dynamic networks of expression. In the ongoing work on low-grade gliomas, a local database of 400 patients will soon be available to construct models. We plan to extend it through national and international collaborations (Montpellier CHU, Montreal CRHUM). Our aim is to build a decision-aid tool for personalised medicine. In the same context, there is a topic of clustering analysis of a brain cartography obtained by sensory stimulation during awake surgery.

Despite the 'G' in the name of BIGS, genetics is not central in the applications of the team. However, we want to contribute to a better understanding of the correlations between genes through their expression data, and of the genetic bases of drug response and disease. We have contributed to methods detecting proteomic and transcriptomic variables linked with the outcome of a treatment.

Much work remains in our ongoing projects on personalized medicine with CHU Nancy. They deal with biomarker research, the prognostic value of quantitative variables and events, scoring, and adverse events. We also want to develop our expertise in rupture detection in a project with APHP (Assistance Publique - Hôpitaux de Paris) for the detection of adverse events earlier than the clinical signs and symptoms. The clinical relevance of predictive analytics is obvious for high-risk patients, such as those with solid organ transplantation or severe chronic respiratory disease. The main challenge is rupture detection in multivariate and heterogeneous signals (for instance, daily measures of electrocardiogram, body temperature, spirometry parameters, sleep duration, etc.). Other collaborations with clinicians concern foetopathology, and we want to use our work on conditional distribution functions to explain fetal and child growth. We have data from the "Service de foetopathologie et de placentologie" of the "Maternité Régionale Universitaire" (CHU Nancy).

Telomeres are disposable buffers at the ends of chromosomes, which are truncated during cell division, so that the telomere ends become shorter over time. In this way, they are markers of aging. Through a collaboration with Pr A. Benetos, geriatrician at CHU Nancy, we recently obtained data on the distribution of the length of telomeres from blood cells. With members of the Inria team TOSCA, we want to work in three connected directions: (1) refine the methodology for the analysis of the available data; (2) propose a dynamical model for the lengths of telomeres and study its mathematical properties (long-term behavior, quasi-stationarity, etc.); and (3) use these properties to develop new statistical methods. A postdoc position is already planned within the Lorraine Université d'Excellence (LUE) project GEENAGE (managed by CHU Nancy).

We followed Inria's recommendations to get involved in the fight against COVID-19. We responded to the WHO's encouragement, relayed by our mathematical colleagues at the national level, to conduct seroprevalence studies in randomly drawn samples of the population. This is the purpose of the COVAL study described in the results section, initiated by Pierre Vallois.

The highlight of the year is the merger between BIGS and the members of the former TOSCA team, specialised in modelling for biological sciences and medicine: Nicolas Champagnat, Coralie Fritsch, Denis Villemonais and their post-doc and PhD students. The other highlights of the year are, unsurprisingly, those of the pandemic: most of our teachers devoted a lot of time to distance learning. Other researchers, especially PhD students, suffered from the lack of contacts and meetings. Part of the team was involved in supervising a seroprevalence study. Thanks to the quality of the collaboration with hospital doctors in this study, we are now involved in modelling the amount of coronavirus in wastewater in order to predict the number of hospital admissions.

We are continuing our research on the modelling of the growth of low-grade diffuse gliomas. We propose an original MRI-based method to quantify glioma brain infiltration, easy to implement and to interpret for neuro-oncologists. The aim is to guide the treatment strategy by giving functional information using only anatomical knowledge and conventional MRI sequences. This work has been the subject of a conference paper 15.

A retrospective survival study with 35 years of follow-up has been carried out 9.

The aim is to better understand how living cells make decisions (e.g., differentiation of a stem cell into a particular specialized type), seeing decision-making as an emergent property of an underlying complex molecular network. Indeed, it is now proven that cells react probabilistically to their environment: cell types do not correspond to fixed states, but rather to “potential wells” of a certain energy landscape (representing the energy of the possible states of the cell) that we are trying to reconstruct. A first paper proposing a reconstruction method has been submitted 24 in the framework of an international collaboration (USA, Switzerland, France). Another paper is about to be submitted, dealing more specifically with the inference of the underlying networks.

In 13, we adapt the optimization concept of momentum to reinforcement learning. Seeing the state-action value functions as analogous to the gradients in optimization, we interpret momentum as an average of consecutive q-functions. We derive Momentum Value Iteration (MoVI), a variation of Value Iteration that incorporates this momentum idea. Our analysis shows that this allows MoVI to average errors over successive iterations. We show that the proposed approach can be readily extended to deep learning. Specifically, we propose a simple improvement on DQN based on MoVI, and evaluate it on Atari games. This work has been published in the AISTATS conference.

Recent Reinforcement Learning (RL) algorithms making use of Kullback-Leibler (KL) regularization as a core component have shown outstanding performance. Yet, so far little is understood theoretically about why KL regularization helps. In 12, we study KL regularization within an approximate value iteration scheme and show that it implicitly averages q-values. Leveraging this insight, we provide a very strong performance bound, the very first to combine two desirable aspects: a linear dependency on the horizon (instead of quadratic) and an error propagation term involving an averaging effect of the estimation errors (instead of an accumulation effect). We also study the more general case of an additional entropy regularizer. The resulting abstract scheme encompasses many existing RL algorithms. Some of our assumptions do not hold with neural networks, so we complement this theoretical analysis with an extensive empirical study. This work has been accepted to the NeurIPS conference and selected for oral presentation (selection rate: 1.1% of all submissions).

Many goodness-of-fit tests have been developed to assess the different assumptions of a (possibly heteroscedastic) regression model. Most of them are 'directional' in that they detect departures from a given assumption of the model. Other tests are 'global' (or 'omnibus') in that they assess whether a model fits a dataset on all its assumptions. We focus on the task of choosing the structural part of the regression function, because it contains easily interpretable information about the studied relationship. We consider two nonparametric 'directional' tests and one nonparametric 'global' test, all based on generalizations of the Cramér–von Mises statistic.
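
For reference, the classical one-sample Cramér–von Mises statistic, of which the regression tests above are generalizations, can be computed in closed form from the order statistics:

```python
def cramer_von_mises(sample, cdf):
    """One-sample Cramér-von Mises statistic
    W^2 = 1/(12n) + sum_i (F(x_(i)) - (2i-1)/(2n))^2,
    where x_(1) <= ... <= x_(n) are the ordered observations
    and F is the hypothesized CDF."""
    xs = sorted(sample)
    n = len(xs)
    return 1.0 / (12 * n) + sum(
        (cdf(x) - (2 * i - 1) / (2 * n)) ** 2 for i, x in enumerate(xs, 1)
    )
```

When the sample sits exactly on the quantile grid of F, the statistic attains its minimum value 1/(12n); the regression versions replace F by an estimated conditional distribution.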

To perform these goodness-of-fit tests, we develop the R package cvmgof.

To complete this work, it would be interesting to assess the other assumptions of a regression model, such as the functional form of the variance or the additivity of the random error term. It should be noted that this can already be done using the Ducharme and Ferrigno test implemented in cvmgof, since it is a global test. However, it would be relevant to compare the results obtained from the Ducharme and Ferrigno test with the ones obtained from other directional tests, especially those developed to assess one of these specific assumptions. The implementation of these directional tests would enrich the cvmgof package and offer a complete, easy-to-use tool for validating regression models. Moreover, the assessment of the overall validity of the model when using several directional tests could be compared with that obtained when using only a global test. In particular, the well-known problem of multiple testing could be discussed by comparing the results obtained from multiple test procedures with those obtained when using a global test strategy. Another perspective of this work would be to develop a similar tool for other statistical models widely used in practice, such as generalized linear models.

We consider the problem of variable selection in regression models. In particular, we are interested in selecting explanatory covariates linked with the response variable, and we want to determine which covariates are relevant, that is, which covariates are involved in the model. In this framework, we deal with L1-penalized regression models. To handle the choice of the penalty parameter when performing variable selection, we develop a new method based on the knockoffs idea. This revisited knockoffs method is general and suitable for a wide range of regressions with various types of response variables. Besides, it also works when the number of observations is smaller than the number of covariates, and it gives an order of importance of the covariates. Finally, we provide many experimental results to corroborate our method and compare it with other variable selection methods. This work is published in 5 and is implemented in the R package ‘kosel’.
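
The selection rule at the heart of the knockoffs idea can be sketched as follows: given importance statistics W_j (positive when the true covariate beats its knockoff copy), the data-dependent threshold of the knockoff+ filter controls the false discovery rate at level q. A minimal sketch with toy statistics (a generic illustration of the standard filter, not the code of the 'kosel' package):

```python
def knockoff_threshold(W, q=0.2):
    """Knockoff+ data-dependent threshold: smallest t among the |W_j|
    such that (1 + #{W_j <= -t}) / max(1, #{W_j >= t}) <= q."""
    for t in sorted({abs(w) for w in W if w != 0}):
        neg = sum(1 for w in W if w <= -t)
        pos = sum(1 for w in W if w >= t)
        if (1 + neg) / max(pos, 1) <= q:
            return t
    return float('inf')  # no threshold achieves the target FDR level

def knockoff_select(W, q=0.2):
    """Indices of covariates whose statistic exceeds the threshold."""
    t = knockoff_threshold(W, q)
    return [j for j, w in enumerate(W) if w >= t]
```

Negative W_j act as an internal estimate of the number of false positives, which is what makes the FDR guarantee possible without p-values.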

The next subsections are dedicated to online data analysis.

Accepted in Journal of Multivariate Analysis in October 2020 8.

We prove the almost sure convergence of processes of Oja type to eigenvectors of the expectation of a random matrix, while relaxing the i.i.d. assumptions on the observed random matrices. As an application of this generalization, we can perform the online PCA of a random vector Z when there is a data stream of i.i.d. observations of Z, even when both the metric M used and the expectation of Z are unknown and estimated online. Moreover, in order to update the stochastic approximation process at each step, we are no longer bound to using only a mini-batch of observations of Z, but we can use all the previous observations up to the current step without storing them. This is useful not only when dealing with streaming data but also with Big Data, as one can process it sequentially as a data stream. In addition, the general framework of this process, unlike other algorithms in the literature, also covers the case of factorial methods related to PCA.
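
A minimal sketch of an Oja-type process, in the simplest special case (Euclidean metric, one observed vector per step, illustrative step sizes; the paper covers far more general settings):

```python
import math
import random

def oja_leading_eigvec(stream, dim, step=1.0):
    """Oja-type stochastic approximation of the leading eigenvector
    of E[X X^T], updated online from a stream of observations."""
    rng = random.Random(0)
    v = [rng.gauss(0, 1) for _ in range(dim)]      # random initial direction
    for n, x in enumerate(stream, 1):
        s = sum(vi * xi for vi, xi in zip(v, x))   # inner product v . x
        gamma = step / n                           # decreasing step size
        v = [vi + gamma * s * xi for vi, xi in zip(v, x)]
        norm = math.sqrt(sum(vi * vi for vi in v))
        v = [vi / norm for vi in v]                # keep unit norm
    return v
```

Each observation is used once and discarded, which is what makes the procedure suitable for data streams.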

Accepted in "Journal of Applied Statistics" in December 2020 7.

Online learning is a method for analyzing very large datasets ("big data") as well as data streams. In this article, we consider the case of constrained binary logistic regression and show the interest of using processes with an online standardization of the data, in particular to avoid numerical explosions or to allow the use of shrinkage methods. We prove the almost sure convergence of such a process and propose using a piecewise constant step-size such that the latter does not decrease too quickly and does not reduce the speed of convergence. We compare twenty-four stochastic approximation processes with raw or online standardized data on five real or simulated datasets. Results show that, unlike processes with raw data, processes with online standardized data can prevent numerical explosions and yield the best results.

Submitted in February 2021 25, 20.

The present aim is to update, upon arrival of new learning data, the parameters of a score constructed with an ensemble method involving linear discriminant analysis and logistic regression in an online setting, without the need to store all of the previously obtained data. Poisson bootstrap and stochastic approximation processes were used with online standardized data to avoid numerical explosions, the convergence of which has been established theoretically. This empirical convergence of online ensemble scores to a reference "batch" score was studied on five different datasets from which data streams were simulated, comparing six different processes to construct the online scores. For each score, 50 replications using a total of

Our work on change-point thresholds for the score-based CUSUM statistic in a sequential context has been published 11. In this paper, we consider the score-based cumulative sum statistic and evaluate the detection performance of several thresholds on simulated data. Three thresholds come from the literature: the Wald constant, the empirical constant, and the conditional empirical instantaneous threshold. Two new thresholds are built by a simulation-based procedure: the first one is instantaneous, the second is a dynamical version of the first. The thresholds' performance, measured by an estimation of the mean time between false alarms (MTBFA) and the average detection delay (ADD), is evaluated on independent and autocorrelated data for several scenarios, according to the detection objective and the real change in the data. The simulations allow us to compare the thresholds' results and show that their performance is robust when a parameter of the pre-change regime is poorly estimated or when the data independence assumption is violated. We also found that the conditional empirical threshold is the best at minimizing the detection delay while maintaining the given false alarm rate. However, on real data, we suggest using the dynamic instantaneous threshold because it is the easiest to build for practical implementation.
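
The recursive form of the CUSUM statistic can be sketched as follows, for the textbook case of a change in the mean of Gaussian observations with a constant (Wald-type) threshold; the thresholds compared in the paper replace this constant by empirical or dynamic alternatives:

```python
def cusum_detect(data, mu0=0.0, mu1=1.0, sigma=1.0, threshold=5.0):
    """CUSUM for a change in mean from mu0 to mu1: accumulate the
    log-likelihood ratio, reset at 0, raise an alarm when the
    statistic crosses the threshold. Returns alarm index or None."""
    g = 0.0
    for k, x in enumerate(data):
        # log-likelihood ratio increment for Gaussian observations
        llr = (mu1 - mu0) / sigma**2 * (x - (mu0 + mu1) / 2.0)
        g = max(0.0, g + llr)
        if g > threshold:
            return k
    return None
```

The threshold trades off the mean time between false alarms against the detection delay, which is exactly the MTBFA/ADD trade-off studied in the paper.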

Our collaboration with APHP could not proceed because of major delays in data collection. To apply our algorithms to real data, we turned to EMG signal data provided by INRS. The study concerns the development of trapezius muscle myalgia in the workplace. We apply change-point detection to characterise the different computer activities carried out during an experimental day.

In epidemiology, we are working with INSERM to study fetal development in the last two trimesters of pregnancy. Reference or standard curves are required in this kind of biomedical problem: values which lie outside the limits of these reference curves may indicate the presence of a disorder. Data are from the French EDEN mother-child cohort (INSERM), a study investigating the prenatal and early postnatal determinants of child health and development. 2002 pregnant women were recruited before 24 weeks of amenorrhoea in two maternity clinics from middle-sized French cities (Nancy and Poitiers). From May 2003 to September 2006, 1899 newborns were then included. The main outcomes of interest are fetal (via ultrasound) and postnatal growth, adiposity development, respiratory health, atopy, behaviour, and bone, cognitive and motor development. We are studying fetal weight, which depends on gestational age, in the second and third trimesters of pregnancy. Some classical empirical and parametric methods, such as polynomial regression, are first used to construct these curves. Polynomial regression is one of the most common parametric approaches for modelling growth data, especially during the prenatal period. However, such methods may require strong assumptions. We therefore propose to work with the semi-parametric LMS method, modifying the response variable (fetal weight) with a Box-Cox transformation. A first article detailing these methodologies applied to the data is being written.

Alternative nonparametric methods, such as Nadaraya-Watson kernel estimation, local polynomial estimation, B-splines or cubic splines, are also developed in this context to construct these curves. The practical implementation of these methods required working on smoothing parameters or the choice of knots for the different types of nonparametric estimation. In particular, an optimal choice of these parameters has been proposed. A first version of an R package has then been developed to provide a tool to construct nonparametric reference curves; it should be submitted to CRAN very soon. In addition, a graphical user interface (GUI) intended for practitioners has been developed to allow intuitive visualization of the results given by the package.

Submitted in December 2020 27.

Heart failure (HF) is a worldwide major cause of mortality and morbidity for which many predictive scores have been defined. Selecting which explanatory variables to include in a given score is a common difficulty, as a balance must be found between statistical fit and practical application. This article presents a methodology for constructing parsimonious event scores, combining a stepwise selection of variables with ensemble scores obtained by aggregation of several scores, using several classifiers, bootstrap samples and various modalities of random selection of variables. The stepwise selection allows constructing a succession of scores, with the practitioner able to choose which score best fits his or her needs. The methods proposed herein can be reproduced on any set of variables as long as the training dataset comprises a sufficient number of cases. Three methods were compared in an application to construct parsimonious short-term scores in chronic HF patients. The working sample consisted of 11,411 patient-visit dyads from the GISSI-HF database, with 5,595 events and 5,816 non-events. Sixty-two candidate explanatory variables were studied. Focusing on the fastest method, four scores were constructed, yielding out-of-bag AUCs ranging from 0.81 (26 variables) to 0.76 (2 variables). These results are slightly better than those obtained by other scores reported in the literature using a similar number of variables.

This is the continuation of the ITMO Cancer project, supervised by Nicolas Champagnat, concerning the modeling of circulating tumor DNA (ctDNA) to detect the appearance of resistance to targeted therapies (personalized medicine). After a phase of investigation of possible scenarios in collaboration with Alexandre Harlé of the Institut de Cancérologie de Lorraine (ICL), a final model was selected. Based on a mathematical analysis, the members of the project then designed a statistical inference algorithm (learning the parameters of the model, including the genealogical tree of mutations for each patient), which is intended to be validated on real data currently being acquired at the Nancy CHRU. The general idea is to exploit a "variational principle" that allows one to explore the very large discrete space of family trees through a "pivot" space of continuous parameters, easy to optimize (and in reasonable number). An article detailing the model and its inference is currently being written.

We propose a new methodology for selecting and ranking covariates associated with a variable of interest in a context of high-dimensional data under dependence but few observations. The methodology successively intertwines the clustering of covariates, decorrelation of covariates using Factor Latent Analysis, selection using aggregation of adapted methods and finally ranking. A simulation study shows the interest of the decorrelation inside the different clusters of covariates. We first apply our method to transcriptomic data of 37 patients with advanced non-small-cell lung cancer who have received chemotherapy, to select the transcriptomic covariates that explain the survival outcome of the treatment. Secondly, we apply our method to 79 breast tumor samples to define patient profiles for a new metastatic biomarker and associated gene network in order to personalize the treatments. This work is published in 2 and is implemented in R package ‘ARMADA’.

Pierre Vallois is the scientific coordinator of the seroprevalence study COVAL Nancy held in Nancy in July 2020 in collaboration with CHRU de Nancy (CIC épidémiologie clinique and Laboratoire de Virologie).

Background. The World Health Organisation recommends monitoring the circulation of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). We aimed to estimate anti–SARS-CoV-2 total immunoglobulin (IgT) antibody seroprevalence and describe symptom profiles and in vitro seroneutralization in Nancy, France, in spring 2020.

Methods. Individuals were randomly sampled from electoral lists and invited, with household members over 5 years old, to be tested for anti–SARS-CoV-2 (IgT, i.e. IgA/IgG/IgM) antibodies by ELISA (Bio-Rad). Serum samples were classified according to 50% seroneutralization activity (NT50) on Vero CCL-81 cells. Age- and sex-adjusted seroprevalence was estimated. Subgroups were compared by chi-square or Fisher exact test and logistic regression.

Results. Among 2006 individuals, 43 were SARS-CoV-2–positive; the raw seroprevalence was 2.1% (95% confidence interval 1.5 to 2.9), with adjusted metropolitan and national standardized seroprevalences of 2.5% (1.8 to 3.3) and 2.3% (1.7 to 3.1). Seroprevalence was highest for 20- to 34-year-old participants (4.7% [2.3 to 8.4]), within rather than outside socially deprived areas (2.5% vs 1%, P=0.02), and with rather than without intra-family infection (p<10^-6). Moreover, 25% (23 to 27) of participants presented at least one COVID-19 symptom associated with SARS-CoV-2 positivity (p<10^-13), with anosmia or ageusia highly discriminant (odds ratio 27.8 [13.9 to 54.5]), associated with dyspnea and fever. Among the SARS-CoV-2 positives, 16.3% (6.8 to 30.7) were asymptomatic. For 31 of these individuals, positive seroneutralization was demonstrated in vitro.

Conclusions. In this population of very low anti-SARS-CoV-2 antibody seroprevalence, a beneficial effect of the lockdown can be assumed, with frequent SARS-CoV-2 seroneutralization among IgT-positive patients.
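
As a consistency check, the raw seroprevalence and its confidence interval can be recomputed with a standard Wilson score interval (the study may have used a different interval method, e.g. Clopper-Pearson, so the bounds match only approximately):

```python
import math

def wilson_ci(k, n, z=1.96):
    """Wilson score confidence interval for a binomial proportion
    with k successes out of n trials, at normal quantile z."""
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half
```

For 43 positives out of 2006, this gives roughly (1.6%, 2.9%), close to the interval reported above.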

- R. Azaïs, A. Gégout-Petit and F. Greciet collaborated with SAFRAN Aircraft Engines (through a 2016-2019 contract). SAFRAN Aircraft Engines designs and produces aircraft engines. For the design of parts, they have to understand the mechanism of crack propagation under different conditions. BIGS models crack propagation with Piecewise Deterministic Markov Processes (PDMPs).

- B. Scherrer collaborates with Google Brain on reinforcement learning in the framework of the PhD thesis of Nino Vieillard.

In Fall 2020, Bruno Scherrer was invited for 4 months at Berkeley to participate in the Simons Institute Programme on the Theory of Reinforcement Learning. Due to the Covid constraints, the semester was eventually held online.

Juhyun Park (Lancaster University) visited Nancy for one week in the framework of her collaboration with A. Gégout-Petit on statistical tests for paired distributions.

With the exception of Bruno Scherrer and Ulysse Herbach, BIGS members have teaching obligations at Université de Lorraine and teach at least 192 hours each year. They teach probability and statistics at different levels (Licence, Master, engineering schools). Many of them have pedagogical responsibilities.

Defended PhD thesis

PhD thesis

Post-doctoral positions

Other