The main objective of Xpop is to develop new sound and rigorous methods for statistical modeling in the field of biology and life sciences.
These modeling methods include statistical methods for estimation, model diagnostics, model building and model selection, as well as methods for numerical models (systems of ordinary and partial differential equations).
Historically, the key area where these methods have been used is population pharmacokinetics.
However, the framework is currently being extended to sophisticated numerical models in the contexts of viral dynamics, glucose-insulin processes, tumor growth, precision medicine, spectrometry, intracellular processes, etc.
Furthermore, an important aim of Xpop is to transfer the methods developed into software packages so that they can be used in everyday practice.
Mathematical models that characterize complex biological phenomena are defined by systems of ordinary differential equations when dealing with dynamical systems that evolve with respect to time, or by partial differential equations when there is a spatial component in the model. It is also sometimes useful to integrate a stochastic aspect into the dynamical system in order to model stochastic intra-individual variability.
In order to use such methods, we must deal with complex numerical difficulties, generally related to solving the systems of differential equations. Furthermore, to be able to check the quality of a model (i.e. its descriptive and predictive performance), we require data. The statistical aspect of the model is thus critical in how it takes into account different sources of variability and uncertainty, especially when data come from several individuals and we are interested in characterizing inter-subject variability. Here, the tools of reference are mixed-effects models.
Confronted with such complex modeling problems, one of the goals of Xpop is to show the importance of combining numerical, statistical and stochastic approaches.
Linear mixed-effects models have long been used in statistics. They are a classical approach, essentially relying on matrix calculations in Gaussian models. Whereas a solid theoretical base has been developed for such models, nonlinear mixed-effects models (NLMEM) have received much less attention in the statistics community, even though they have been applied to many domains of interest. It has thus been the users of these models, such as pharmacometricians, who have adopted them and developed methods, without really seeking to develop a clean theoretical framework or to understand the mathematical properties of the methods. This is why a standard estimation method in NLMEM is to linearize the model, and few people have been interested in understanding the properties of the estimators obtained in this way.
Statisticians and pharmacometricians frequently realize the need to create bridges between these two communities. We are entirely convinced that this requires the development of new standards for population modeling that can be widely accepted by these various communities. These standards include the language used for encoding a model, the approach for representing a model and the methods for using it:
"Interfaces" is the defining characteristic of Xpop:
The interface between statistics, probability and numerical methods.
Mathematical modelling of complex biological phenomena requires combining numerical, stochastic and statistical approaches. The CMAP is therefore the right place for positioning the team at the interface between several mathematical disciplines.
The interface between mathematics and the life sciences. The goal of Xpop is to bring the right answers to the right questions. These answers are mathematical tools (statistics, numerical methods, etc.), whereas the questions come from the life sciences (pharmacology, medicine, biology, etc.).
This is why the point of Xpop is not only to take part in mathematical projects, but also in pluridisciplinary ones.
The interface between mathematics and software development. The development of new methods is the main activity of Xpop. However, new methods are only useful if they end up being implemented in a software tool. On the one hand, a strong partnership with Lixoft (the spin-off company that continues to develop Monolix) allows us to maintain this position. On the other hand, several members of the team are very active in the R community and develop widely used packages.
Mixed-effects models are statistical models with both fixed effects and random effects. They are well-adapted to situations where repeated measurements are made on the same individual/statistical unit.
Consider first a single subject i of the population. The observed data y_i = (y_{i1}, ..., y_{in_i}) are described by a model that depends on a vector of individual parameters ψ_i.
In a population framework, the vector of parameters ψ_i is treated as a random vector.
Then, the probabilistic model is the joint probability distribution p(y_i, ψ_i) = p(y_i | ψ_i) p(ψ_i).
To define a model thus consists in defining precisely these two terms.
In most applications, the observed data y_i are continuous longitudinal data and the model takes the form y_{ij} = f(t_{ij}, ψ_i) + ε_{ij}, 1 ≤ j ≤ n_i.
Here, y_{ij} is the observation made on subject i at time t_{ij}, and ε_{ij} is the residual error.
Function f is known as the structural model and describes the dynamics of the process under study.
The vector of individual parameters ψ_i is usually assumed to depend on fixed population parameters and on a vector of random effects η_i, possibly together with individual covariates.
The joint model of the observations and the individual parameters then depends on a vector of population parameters to be estimated from the data.
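As an illustration, the hierarchical structure of such a model can be sketched by simulating synthetic longitudinal data from a simple one-compartment pharmacokinetic structural model with log-normal individual parameters. This is a minimal Python sketch; the structural model, parameter values and dose are purely illustrative and not taken from a real study.

```python
import math
import random

random.seed(0)

def f(t, ka, V, k):
    # structural model: one-compartment PK with first-order absorption,
    # concentration after a single (illustrative) dose of 100
    return 100.0 * ka / (V * (ka - k)) * (math.exp(-k * t) - math.exp(-ka * t))

# population parameters (fixed effects) and variability (assumed values)
pop = {"ka": 1.0, "V": 10.0, "k": 0.1}
omega = 0.3   # std of the random effects on the log scale
sigma = 0.5   # residual error std

times = [0.5, 1, 2, 4, 8, 12, 24]
data = []
for i in range(5):  # 5 subjects
    # individual parameters: log-normal around the population values
    psi = {p: v * math.exp(random.gauss(0, omega)) for p, v in pop.items()}
    for t in times:
        y = f(t, psi["ka"], psi["V"], psi["k"]) + random.gauss(0, sigma)
        data.append((i, t, y))
```

Each record (i, t_{ij}, y_{ij}) mirrors the two-level model: individual parameters drawn from the population distribution, then observations drawn around the structural prediction.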
Central to modern statistics is the use of probabilistic models. To relate these models to data requires the ability to calculate the probability of the observed data: the likelihood function, which is central to most statistical methods and provides a principled framework to handle uncertainty.
The emergence of computational statistics as a collection of powerful and general methodologies for carrying out likelihood-based inference has made complex models with nonstandard data accessible to likelihood, including hierarchical models, models with intricate latent structure, and missing data.
In particular, the algorithms previously developed by Popix for mixed effects models, and today implemented in several software tools (especially Monolix), are part of these methods:
Computational statistics remains an extremely active area today. Recently, the incentive for further improvements and innovation has come mainly from three broad directions: the high-dimensional challenge; the quest for adaptive procedures that can eliminate the cumbersome process of tuning the settings of the algorithms "by hand"; and the need for flexible theoretical support, arguably required by all recent developments as well as by many of the traditional MCMC algorithms that are widely used in practice.
Working in these three directions is a clear objective for Xpop.
While these Monte Carlo algorithms have turned into standard tools over the past decade, they still face difficulties in handling less regular problems, such as those involved in deriving inference for high-dimensional models. One of the main problems encountered when using MCMC in these challenging settings is that it is difficult to design a Markov chain that efficiently samples the state space of interest.
The Metropolis-adjusted Langevin algorithm (MALA) is a Markov chain Monte Carlo (MCMC) method for obtaining random samples from a probability distribution for which direct sampling is difficult. As the name suggests, MALA uses a combination of two mechanisms to generate the states of a random walk that has the target probability distribution as an invariant measure:
Informally, the Langevin dynamics drives the random walk towards regions of high probability in the manner of a gradient flow, while the Metropolis-Hastings accept/reject mechanism improves the mixing and convergence properties of this random walk.
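The two mechanisms can be sketched on a one-dimensional Gaussian target. This is an illustrative Python sketch only: the target, the step size tau and the chain length are arbitrary choices, not a production sampler.

```python
import math
import random

random.seed(1)

def log_pi(x):
    # target: standard Gaussian (up to an additive constant)
    return -0.5 * x * x

def grad_log_pi(x):
    return -x

def mala(n_iter, tau=0.5, x0=0.0):
    """Metropolis-adjusted Langevin: Langevin proposal + MH correction."""
    x = x0
    samples = []
    for _ in range(n_iter):
        # Langevin proposal: drift along the gradient plus Gaussian noise
        mean_fwd = x + tau * grad_log_pi(x)
        y = mean_fwd + math.sqrt(2 * tau) * random.gauss(0, 1)
        mean_bwd = y + tau * grad_log_pi(y)
        # log densities of the Gaussian proposal kernels q(y|x) and q(x|y)
        log_q_fwd = -((y - mean_fwd) ** 2) / (4 * tau)
        log_q_bwd = -((x - mean_bwd) ** 2) / (4 * tau)
        # Metropolis-Hastings accept/reject step
        log_alpha = log_pi(y) + log_q_bwd - log_pi(x) - log_q_fwd
        if math.log(random.random()) < log_alpha:
            x = y
        samples.append(x)
    return samples

samples = mala(20000)
```

After a burn-in period, the empirical mean and variance of the chain approach those of the target distribution.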
Several extensions of MALA have been proposed recently, including fMALA (fast MALA), AMALA (anisotropic MALA), MMALA (manifold MALA), position-dependent MALA (PMALA), ...
MALA and these extensions have been shown to be very efficient alternatives for sampling from high-dimensional distributions. We therefore need to adapt these methods to general mixed effects models.
The Stochastic Approximation Expectation Maximization (SAEM) algorithm has been shown to be extremely efficient for maximum likelihood estimation in incomplete data models, and particularly in mixed effects models for estimating the population parameters. However, there are several practical situations for which extensions of SAEM are still needed:
High-dimensional models: a complex physiological model may have a large number of parameters (on the order of 100). Several problems then arise:
Large number of covariates: the covariate model aims to explain part of the inter-patient variability of some parameters.
Classical methods for covariate model building are based on comparisons with respect to some criteria, usually derived from the likelihood (AIC, BIC), or some statistical test (Wald test, LRT, etc.). In other words, the modelling procedure requires two steps: first, all possible models are fitted using some estimation procedure (e.g. the SAEM algorithm) and the likelihood of each model is computed using a numerical integration procedure (e.g. Monte Carlo Importance Sampling); then, a model selection procedure chooses the "best" covariate model. Such a strategy is only possible with a reduced number of covariates, i.e., with a "small" number of models to fit and compare.
As an alternative, we are thinking about a Bayesian approach which consists of estimating simultaneously the covariate model and the parameters of the model in a single run. An (informative or uninformative) prior is defined for each model by defining a prior probability for each covariate to be included in the model. In other words, we extend the probabilistic model by introducing binary variables that indicate the presence or absence of each covariate in the model. Then, the model selection procedure consists of estimating and maximizing the conditional distribution of this sequence of binary variables. Furthermore, a probability can be associated to any of the possible covariate models.
This conditional distribution can be estimated using an MCMC procedure combined with the SAEM algorithm for estimating the population parameters of the model. In practice, such an approach can only deal with a limited number of covariates since the dimension of the probability space to explore increases exponentially with the number of covariates. Consequently, we would like to have methods able to find a small number of variables (from a large starting set) that influence certain parameters in populations of individuals. That means that, instead of estimating the conditional distribution of all the covariate models as described above, the algorithm should focus on the most likely ones.
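The idea of exploring the space of covariate models through binary inclusion indicators can be illustrated on a deliberately simple linear regression example. This Python sketch is not the actual mixed-effects procedure: the data are synthetic, the fit uses backfitting on a near-orthogonal design, and the score is a BIC-type criterion standing in for the marginal likelihood.

```python
import math
import random

random.seed(2)

# Synthetic data: only covariates 0 and 2 (out of 5) actually influence y.
n, p = 100, 5
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [1.5 * r[0] - 2.0 * r[2] + random.gauss(0, 1) for r in X]

def bic(active):
    """BIC of a least-squares fit restricted to the active covariates,
    fitted by backfitting (adequate here: covariates are near-orthogonal)."""
    beta = [0.0] * p
    for _ in range(10):                  # backfitting passes
        for j in range(p):
            if j not in active:
                beta[j] = 0.0
                continue
            # univariate OLS of the partial residual on covariate j
            r = [y[i] - sum(beta[k] * X[i][k] for k in range(p) if k != j)
                 for i in range(n)]
            num = sum(X[i][j] * r[i] for i in range(n))
            den = sum(X[i][j] ** 2 for i in range(n))
            beta[j] = num / den
    rss = sum((y[i] - sum(beta[k] * X[i][k] for k in range(p))) ** 2
              for i in range(n))
    return n * math.log(rss / n) + len(active) * math.log(n)

# Metropolis exploration of the 2^p covariate models by flipping indicators
active = set()
score = bic(active)
visits = {}
for _ in range(120):
    j = random.randrange(p)
    cand = active ^ {j}                  # flip one inclusion indicator
    cand_score = bic(cand)
    if math.log(random.random()) < 0.5 * (score - cand_score):
        active, score = cand, cand_score
    key = tuple(sorted(active))
    visits[key] = visits.get(key, 0) + 1

best = max(visits, key=visits.get)       # most visited covariate model
```

The chain concentrates on the most likely covariate models, which is exactly the behavior we want to obtain, at scale, for mixed-effects models.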
Fixed parameters: it is quite frequent that some individual parameters of the model have no random component and are purely fixed effects. Then, the model may not belong to the exponential family anymore and the original version of SAEM cannot be used as it is.
Several extensions exist:
None of these methods works correctly in all situations. Furthermore, the pros and cons of these methods are not at all clear. Developing a robust methodology for such models is therefore necessary.
Convergence toward the global maximum of the likelihood: the convergence of SAEM can strongly depend on the initial guess when the observed likelihood has several local maxima. A simulated annealing version of SAEM was previously developed and implemented in Monolix.
The method works quite well in most situations but there is no theoretical justification and choosing the settings of this algorithm (i.e. how the temperature decreases during the iterations) remains empirical.
A precise analysis of the algorithm could be very useful to better understand why it "works" in practice and how to optimize it.
Convergence diagnostics: the convergence of SAEM has been theoretically demonstrated under very general hypotheses. Such a result is important but of little practical interest when SAEM must be used in a finite amount of time, i.e. with a finite number of iterations.
Qualitative and quantitative criteria should be defined in order to optimize the settings of the algorithm, detect poor convergence of SAEM, and evaluate the quality of the results so as to avoid using them unwisely.
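The basic structure of SAEM (simulation of the latent variables, stochastic approximation of the sufficient statistics, maximization) can be sketched on a deliberately simple incomplete data model. This toy Python example is illustrative only: one observation per subject, a single population parameter to estimate, and none of the extensions discussed above.

```python
import math
import random

random.seed(3)

# model: y_i = psi_i + eps_i with psi_i ~ N(theta, omega^2),
# eps_i ~ N(0, sigma^2); theta is the population parameter to estimate
omega, sigma = 1.0, 0.5
theta_true = 3.0
n = 100
psi_true = [random.gauss(theta_true, omega) for _ in range(n)]
y = [random.gauss(p, sigma) for p in psi_true]

theta = 0.0                  # deliberately poor initial guess
psi = list(y)                # initial latent values
s = 0.0                      # stochastic approximation of sum(psi)
for k in range(200):
    # S-step: one Metropolis move per subject targeting p(psi_i | y_i; theta)
    for i in range(n):
        cand = psi[i] + random.gauss(0, 0.5)
        def logp(ps):
            return (-(y[i] - ps) ** 2 / (2 * sigma ** 2)
                    - (ps - theta) ** 2 / (2 * omega ** 2))
        if math.log(random.random()) < logp(cand) - logp(psi[i]):
            psi[i] = cand
    # SA-step: decreasing step sizes after an exploratory phase
    gamma = 1.0 if k < 100 else 1.0 / (k - 99)
    s = s + gamma * (sum(psi) - s)
    # M-step: theta maximizing the complete likelihood given the statistic
    theta = s / n
```

In this Gaussian toy model the maximum likelihood estimate is the empirical mean of the observations, and the SAEM iterates converge to it.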
Defining an optimal strategy for model building is far from easy because a model is the assembled product of numerous components that need to be evaluated and perhaps improved: the structural model, the residual error model, the covariate model, the covariance model, etc.
How should we proceed to obtain the best possible combination of these components? There is no magic recipe, but an effort will be made to provide qualitative and quantitative criteria to help modellers build their models.
The strategy to take will mainly depend on the time we can dedicate to building the model and the time required for running it. For relatively simple models for which parameter estimation is fast, it is possible to fit many models and compare them. This can also be done if we have powerful computing facilities available (e.g., a cluster) allowing large numbers of simultaneous runs.
However, if we are working on a standard laptop or desktop computer, model building is a sequential process in which a new model is tested at each step. If the model is complex and requires significant computation time (e.g., when involving systems of ODEs), we are constrained to limit the number of models we can test in a reasonable time period. In this context, it also becomes important to carefully choose the tasks to run at each step.
Diagnostic tools are recognized as an essential means of model assessment in the process of model building. Indeed, the modeler needs to confront the model with the experimental data before concluding that the model is able to reproduce the data and before using it for any purpose, such as prediction or simulation.
The objective of a diagnostic tool is twofold: first, we want to check whether the assumptions made in the model are valid or not; then, if some assumptions are rejected, we want guidance on how to improve the model.
As usual in statistics, the fact that this "final" model has not been rejected does not mean that it is the "true" one. All that we can say is that the experimental data do not allow us to reject it. It is merely one of perhaps many models that cannot be rejected.
Model diagnostic tools are for the most part graphical, i.e., visual: we "see" when something is not right between a chosen model and the data it is hypothesized to describe. These diagnostic plots are usually based on the empirical Bayes estimates (EBEs) of the individual parameters and of the random effects: scatterplots of individual parameters versus covariates to detect possible relationships, scatterplots of pairs of random effects to detect possible correlations between random effects, plots of the empirical distribution of the random effects (boxplot, histogram, ...) to check if they are normally distributed, ...
The use of EBEs for diagnostic plots and statistical tests is efficient with rich data, i.e. when a significant amount of information is available in the data for recovering accurately all the individual parameters. On the contrary, tests and plots can be misleading when the estimates of the individual parameters are greatly shrunk.
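The shrinkage phenomenon can be illustrated with a toy computation of the so-called η-shrinkage, 1 - sd(EBE)/ω, on a simple Gaussian model where the EBE is available in closed form. This Python sketch uses arbitrary parameter values and is only meant to show how shrinkage grows as the data get sparser.

```python
import math
import random

random.seed(4)

# y_ij = psi_i + eps_ij with psi_i ~ N(mu, omega^2), eps_ij ~ N(0, sigma^2);
# the EBE of psi_i is the posterior mean, a precision-weighted average of
# the population mean mu and the subject's empirical mean
mu, omega, sigma = 0.0, 1.0, 2.0

def eta_shrinkage(n_obs_per_subject, n_subjects=2000):
    ebes = []
    for _ in range(n_subjects):
        psi = random.gauss(mu, omega)
        ys = [random.gauss(psi, sigma) for _ in range(n_obs_per_subject)]
        ybar = sum(ys) / len(ys)
        w = (len(ys) / sigma ** 2) / (len(ys) / sigma ** 2 + 1 / omega ** 2)
        ebes.append(w * ybar + (1 - w) * mu)   # closed-form EBE
    m = sum(ebes) / len(ebes)
    sd = math.sqrt(sum((e - m) ** 2 for e in ebes) / len(ebes))
    return 1 - sd / omega                      # eta-shrinkage

# shrinkage is high with sparse data and vanishes as data get rich
sparse, rich = eta_shrinkage(1), eta_shrinkage(50)
```

With a single observation per subject the EBEs are strongly pulled toward the population mean, which is precisely why EBE-based plots become misleading in that regime.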
We propose to develop new approaches for diagnosing mixed effects models in a general context and derive formal and unbiased statistical tests for testing separately each feature of the model.
The ability to easily collect and gather a large amount of data from different sources can be seen as an opportunity to better understand many processes. It has already led to breakthroughs in several application areas. However, due to the wide heterogeneity of measurements and objectives, these large databases often exhibit an extraordinarily high number of missing values. Hence, in addition to scientific questions, such data also present important methodological and technical challenges for data analysts.
Missing values occur for a variety of reasons: machines that fail, survey participants who do not answer certain questions, destroyed or lost data, dead animals, damaged plants, etc. Missing values are problematic since most statistical methods cannot be applied directly to incomplete data. Much progress has been made in properly handling missing values. However, many challenges that are crucial for users still need to be addressed.
It is important to stress that missing data models are part of the general incomplete data models addressed by Xpop. Indeed, models with latent variables (i.e. unobserved variables, such as random effects in a mixed effects model), models with censored data (e.g. data below some limit of quantification) and models with a dropout mechanism (e.g. when a subject in a clinical trial fails to continue in the study) can all be seen as missing data models.
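A toy example illustrates why naive treatments of missing values can be misleading: under missingness completely at random (MCAR), imputing every missing entry by the observed mean badly distorts the variance of the variable. This Python sketch uses synthetic data and arbitrary parameter values.

```python
import random

random.seed(5)

# a variable with about 40% of values missing completely at random (MCAR)
full = [random.gauss(10, 2) for _ in range(5000)]
observed = [x for x in full if random.random() > 0.4]

def var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# naive mean imputation: replace each missing value by the observed mean
m_obs = sum(observed) / len(observed)
imputed = observed + [m_obs] * (len(full) - len(observed))

# the imputed sample badly underestimates the variance of the variable,
# which biases any downstream analysis that relies on it
```

Principled approaches (likelihood-based methods, multiple imputation) avoid this kind of distortion by properly propagating the uncertainty due to the missing entries.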
(joint project with the Biochemistry lab of Ecole Polytechnique and Institut Curie)
In cancer, the most dreadful event is the formation of metastases that disseminate tumor cells throughout the organism. Cutaneous melanoma is a cancer where the primary tumor can easily be removed by surgery. However, this cancer has a poor prognosis because melanomas metastasize often and rapidly. Many melanomas arise from excessive exposure to mutagenic UV from the sun or sunbeds. As a consequence, the mutational burden of melanomas is generally high.
RAC1 encodes a small GTPase that induces cell cycle progression and migration of melanoblasts during embryonic development. Patients with the recurrent P29S mutation of RAC1 have 3-fold increased odds of having regional lymph nodes invaded at the time of diagnosis. RAC1 is unlikely to be a good therapeutic target, since a potential inhibitor that would block its catalytic activity would also lock it into the active GTP-bound state. This project thus investigates the possibility of targeting the signaling pathway downstream of RAC1.
Xpop is mainly involved in Task 1 of the project: Identifying deregulations and mutations of the ARP2/3 pathway in melanoma patients.
Association of overexpression or downregulation of each marker with poor prognosis in terms of invasion of regional lymph nodes, metastases and survival, will be examined using classical univariate and multivariate analysis. We will then develop specific statistical models for survival analysis in order to associate prognosis factors to each composition of complexes. Indeed, one has to implement the further constraint that each subunit has to be contributed by one of several paralogous subunits. An original method previously developed by Xpop has already been successfully applied to WAVE complex data in breast cancer.
The developed models will be rendered user-friendly through a dedicated R software package.
This project could represent a significant step forward in precision medicine for cutaneous melanoma.
(joint project with APHP Lariboisière and M3DISIM)
Two hundred million general anaesthesias are performed worldwide every year. Low blood pressure during anaesthesia is common and has been identified as a major factor in morbidity and mortality. Such events require great reactivity so that they can be corrected as quickly as possible, and they impose reliability and responsiveness constraints on monitoring and treatment.
Recently, studies have demonstrated the usefulness of noradrenaline in preventing and treating intraoperative hypotension. The handling of this drug requires great vigilance with regard to the correct dosage. Currently, these drugs are administered manually by the healthcare staff as boluses and/or continuous infusions. This represents a heavy workload and involves a great deal of variability in finding the right dosage for the desired effect on blood pressure.
The objective of this project is to automate the administration of noradrenaline with a closed-loop system that adjusts the treatment in real time based on an instantaneous blood pressure measurement.
(joint project with the InBio and IBIS inria teams and the MSC lab, UMR 7057)
Significant cell-to-cell heterogeneity is ubiquitously observed in isogenic cell populations: cells respond differently to the same stimulation. Accounting for such heterogeneity is essential, for example, to quantitatively understand why some bacteria survive antibiotic treatments, some cancer cells escape drug-induced suicide, stem cells do not differentiate, or some cells are not infected by pathogens.
The origins of the variability of biological processes and phenotypes are multifarious. Indeed, the observed heterogeneity of cell responses to a common stimulus can originate from differences in cell phenotypes (age, cell size, ribosome and transcription factor concentrations, etc.), from spatio-temporal variations of the cell environments, and from the intrinsic randomness of biochemical reactions. From the systems and synthetic biology perspectives, understanding the exact contributions of these different sources of heterogeneity to the variability of cell responses is a central question.
The main ambition of this project is to propose a paradigm change in the quantitative modelling of cellular processes by shifting from mean-cell models to single-cell and population models. The main contribution of Xpop focuses on methodological developments for mixed-effects model identification in the context of growing cell populations.
(joint project with Lixoft)
Pharmacometrics involves the analysis and interpretation of data produced in preclinical and clinical trials. Population pharmacokinetics studies the variability in drug exposure for clinically safe and effective doses by focusing on identification of patient characteristics which significantly affect or are highly correlated with this variability. Disease progress modeling uses mathematical models to describe, explain, investigate and predict the changes in disease status as a function of time. A disease progress model incorporates functions describing natural disease progression and drug action.
The model-based drug development (MBDD) approach establishes quantitative targets for each development step and optimizes the design of each study to meet the target. Optimizing study design requires simulations, which in turn require models. In order to arrive at a meaningful design, mechanisms need to be understood and correctly represented in the mathematical model. Furthermore, the model has to be predictive for future studies. This requirement precludes all purely empirical modeling; instead, models have to be mechanistic.
In particular, physiologically based pharmacokinetic (PBPK) models attempt to mathematically transcribe anatomical, physiological, physical, and chemical descriptions of the phenomena involved in the ADME (Absorption, Distribution, Metabolism, Elimination) processes. A system of ordinary differential equations for the quantity of substance in each compartment involves parameters representing blood flow, pulmonary ventilation rate, organ volume, etc.
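A minimal sketch of such a compartmental ODE system, reduced to a central and a peripheral compartment integrated by an explicit Euler scheme, is given below. The structure and all parameter values are illustrative and far simpler than a real PBPK model.

```python
# minimal two-compartment sketch: a central (blood) and a peripheral (organ)
# compartment exchanged by a blood flow Q, with linear elimination ke from
# the central compartment; all values below are illustrative only
Q, V1, V2, ke = 5.0, 10.0, 50.0, 0.3   # flow (L/h), volumes (L), rate (1/h)
dt, T = 0.01, 24.0

a1, a2 = 100.0, 0.0   # amounts (mg): IV bolus dose in the central compartment
t = 0.0
amounts = []
while t < T:
    c1, c2 = a1 / V1, a2 / V2           # concentrations
    da1 = -Q * c1 + Q * c2 - ke * a1    # exchange + elimination
    da2 = Q * c1 - Q * c2               # exchange only
    a1 += dt * da1                      # explicit Euler step
    a2 += dt * da2
    t += dt
    amounts.append((t, a1, a2))
```

The total amount of substance can only decrease over time, since the only outflow is the elimination term from the central compartment; in practice a stiff ODE solver would replace the Euler scheme.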
The ability to describe variability in pharmacometric models is essential. The nonlinear mixed-effects modeling approach does this by combining the structural model component (the ODE system) with a statistical model describing the distribution of the parameters between subjects and within subjects, as well as quantifying the unexplained or residual variability within subjects.
The objective of Xpop is to develop new methods for models defined by a very large ODE system, a large number of parameters and a large number of covariates.
Contributions of Xpop in this domain are mainly methodological and there is no privileged therapeutic application at this stage.
However, it is expected that these new methods will be implemented in software tools, including Monolix and R packages, for practical use.
(joint project with the Molecular Chemistry Laboratory, LCM, of Ecole Polytechnique)
One of the main recent developments in analytical chemistry is the rapid democratization of high-resolution mass spectrometers. These instruments produce extremely complex mass spectra, which can include several hundred thousand ions when analyzing complex samples. The analysis of complex matrices (biological, agri-food, cosmetic, pharmaceutical, environmental, etc.) is precisely one of the major analytical challenges of this new century. Academic and industrial researchers are particularly interested in quickly and effectively establishing the chemical consequences of an event on a complex matrix. This may include, for example, searching for pesticide degradation products and metabolites in fruits and vegetables, photoproducts of active ingredients in a cosmetic emulsion exposed to UV rays, or chlorination products of biocides in hospital effluents. The main difficulty of this type of analysis lies in the high spatial and temporal variability of the samples, which adds to the experimental uncertainties inherent in any measurement and requires a large number of samples and analyses as well as computerized data processing (up to 16 million per mass spectrum).
A collaboration between Xpop and the Molecular Chemistry Laboratory (LCM) of the Ecole Polytechnique began in 2018. Our objective is to develop new methods for the statistical analysis of mass spectrometry data.
These methods are implemented in the SPIX software.

Marc Lavielle is member of the Scientific Committee of RESPIRE, a French organization for the improvement of air quality.
We have proposed in 11 an extension of the EM algorithm and its stochastic versions for the construction of incomplete data models when the selected model minimizes a penalized likelihood criterion. This optimization problem is particularly challenging in the context of incomplete data, even when the model is relatively simple. However, by completing the data, the E-step of the algorithm allows us to turn this model selection problem into a classical complete-data model selection problem that does not pose any major difficulties. We have then shown that the criterion to be minimized decreases with each iteration of the algorithm. Examples of the use of these algorithms include the identification of regression mixture models and the construction of nonlinear mixed-effects models.
The SAMBA (Stochastic Approximation for Model Building Algorithm) procedure was developed specifically for the construction of mixed-effects models. We have shown in 5 how this algorithm can be used to speed up the process of model building by identifying at each step how best to improve some of the model components. The principle of this algorithm basically consists in learning something about the best model, even when a poor model is used to fit the data. A comparison study of the SAMBA procedure with SCM and COSSAC shows similar performance on several real data examples, but with a much-reduced computing time. This algorithm is now implemented in Monolix and in the R package Rsmlx.
Short-term forecasting of the COVID-19 pandemic is required to facilitate the planning of COVID-19 healthcare demand in hospitals. We have shown in 12 how daily hospital data can be used to track the evolution of the COVID-19 epidemic in France. A piecewise-defined dynamic model allows a very good fit of the available data on hospital admissions, deaths and discharges. The change-points detected correspond to moments when the dynamics of the epidemic changed abruptly. Although the proposed model is relatively simple, it can serve several purposes: it is an analytical tool to better understand what has happened so far by relating observed changes to changes in health policy or the evolution of the virus; it is also a surveillance tool that can warn of a resurgence of epidemic activity; and finally, it is a short-term forecasting tool if conditions remain unchanged. The model, data and fits are implemented in an interactive web application.
In collaboration with Institut Pasteur (and other groups), we have evaluated in 13 the performance of 12 individual models and 19 predictors to anticipate French COVID-19-related healthcare needs from September 7th 2020 to March 6th 2021. We then built an ensemble model by combining the individual forecasts and tested this model from March 7th to July 6th 2021. We found that the inclusion of early predictors (epidemiological, mobility and meteorological predictors) can halve the root mean square error for 14-day-ahead forecasts, with epidemiological and mobility predictors contributing the most to the improvement.
We built in 4 an SIR-type compartmental model with two additional compartments: D (deceased patients) and L (individuals who will die but who will not infect anybody due to social or medical isolation). The model also integrates a time-dependent transmission rate and a periodic weekly component linked to the way in which cases and deaths are reported. The model was implemented in a web application (as of 2 June 2020). It was shown to be able to accurately capture the changes in the dynamics of the pandemic for 20 countries, whatever the type of pandemic spread or containment measures: for instance, the model explains 97% of the variance of US data (daily cases) and predicts the number of deaths at a 2-week horizon with an error of 1%. In an early performance evaluation, our model showed a high level of accuracy between predictions and observed data.
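The structure of such a compartmental model with the additional D and L compartments can be sketched in discrete time. This Python sketch is purely illustrative: constant rates, no weekly component and arbitrary parameter values, so it is not the fitted model of the paper.

```python
# discrete-time sketch of an SIR-type model extended with D (deceased) and
# L (isolated individuals who will die without infecting anyone);
# all parameter values below are purely illustrative
N = 1_000_000
S, I, R, L, D = N - 100, 100, 0, 0, 0
beta, gamma, mu = 0.3, 0.1, 0.002   # transmission, recovery, isolation rates

history = []
for day in range(300):
    new_inf = beta * S * I / N      # new infections
    new_rec = gamma * I             # recoveries
    new_iso = mu * I                # infected who are isolated and will die
    new_dead = 0.1 * L              # isolated individuals eventually die
    S -= new_inf
    I += new_inf - new_rec - new_iso
    R += new_rec
    L += new_iso - new_dead
    D += new_dead
    history.append((day, S, I, R, L, D))
```

Each update only transfers individuals between compartments, so the total population is conserved throughout the simulation; the fitted model additionally lets beta vary piecewise in time.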
We proposed in 2 a novel and practical variance reduction approach for additive functionals of dependent sequences. Our approach combines the use of control variates with the minimization of an empirical variance estimate. We analyzed the finite-sample properties of the proposed method and derived finite-time bounds on the convergence of the excess asymptotic variance to zero. We applied our methodology to stochastic gradient Markov chain Monte Carlo (SG-MCMC) methods for Bayesian inference on large data sets and combined it with existing variance reduction methods for SG-MCMC. We presented empirical results on a number of benchmark examples showing that our variance reduction method achieves significant improvement compared to state-of-the-art methods, at the expense of a moderate increase in computational overhead.
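The control variates idea underlying this work can be illustrated on a toy independent Monte Carlo problem, estimating E[exp(X)] for X ~ N(0,1) using X itself (whose mean is known to be zero) as a control variate. This Python sketch is unrelated to the SG-MCMC setting of the paper and uses the classical plug-in estimate of the optimal coefficient.

```python
import math
import random

random.seed(6)

def estimates(n):
    """Plain Monte Carlo vs control-variate estimate of E[exp(X)], X~N(0,1)."""
    xs = [random.gauss(0, 1) for _ in range(n)]
    fs = [math.exp(x) for x in xs]
    plain = sum(fs) / n
    # near-optimal coefficient b ~ Cov(f, X) / Var(X), estimated in-sample
    b = sum((f - plain) * x for f, x in zip(fs, xs)) / sum(x * x for x in xs)
    cv = sum(f - b * x for f, x in zip(fs, xs)) / n   # E[X] = 0 is known
    return plain, cv

# compare the dispersion of the two estimators across replications
plain_runs, cv_runs = zip(*(estimates(500) for _ in range(200)))

def var(v):
    m = sum(v) / len(v)
    return sum((x - m) ** 2 for x in v) / len(v)
```

Both estimators are centered on exp(1/2), but the control-variate one is visibly less dispersed; the cited work extends this principle to dependent MCMC sequences via empirical variance minimization.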
We have studied in 1 the problem of sampling from a probability distribution on
Incremental Expectation Maximization (EM) algorithms were introduced to adapt EM to the large-scale learning framework by avoiding processing the full data set at each iteration. Nevertheless, these algorithms all assume that the conditional expectations of the sufficient statistics are explicit. We therefore proposed in 8 a novel algorithm named Perturbed Prox-Preconditioned SPIDER (3P-SPIDER), which builds on the Stochastic Path Integral Differential EstimatoR EM (SPIDER-EM) algorithm. The 3P-SPIDER algorithm addresses many intractabilities of the E-step of EM; it also deals with non-smooth regularization and a convex constraint set. Numerical experiments show that 3P-SPIDER outperforms other incremental EM methods.
Studying the convergence in expectation, we have shown in 9 that 3P-SPIDER achieves a near-optimal oracle inequality even when the gradient is estimated by Monte Carlo methods.
Xpop has a contract with ADELIS, a French analytical instrumentation company. The goal of this collaboration is to develop a method to analyze the physiological size profile of circulating DNA to detect different types of pathological abnormalities.
Mixed-Effects Models of Intracellular Processes: Methods, Tools and Applications (MEMIP)
Period: from 2017-01-01 to 2021-06-30
Coordinator: Gregory Batt (InBio Inria team)
Other partners: InBio and IBIS Inria teams, Laboratoire Matière et Systèmes Complexes (UMR 7057; CNRS and Paris Diderot Univ.)
Targeting Rac-dependent actin polymerization in cutaneous melanoma - Institut National du Cancer
Period: from 2018-01-01 to 2021-11-20
Coordinator: Alexis Gautreau (Ecole Polytechnique)
Other partners: Laboratoire de Biochimie (Polytechnique), Institut Curie, INSERM.
Marc Lavielle is coleader of the program Numerical Innovation and Data Science for Healthcare, created to address technological developments and data issues in the medical field.
This sponsorship program with Ecole polytechnique and Sanofi was inaugurated in 2020.
PhD in progress:
Marc Lavielle developed and maintains the platform covidix for COVID-19 data visualization and modelling.
Marc Lavielle developed and maintains the learning platform Statistics in Action. The purpose of this online learning platform is to show how statistics (and biostatistics) may be efficiently used in practice using R. It is specifically geared towards teaching statistical modelling concepts and applications for self-study. Indeed, most of the available teaching material tends to be quite "static", while statistical modelling is very much about "learning by doing".