The objective of the PreMeDICaL team (Precision Medicine by Data Integration and Causal Learning) is to develop the next generation of methods/algorithms to extract knowledge from health data and improve the care of patients. More specifically, the aim is to develop learning tools for personalized treatment effect prediction and for predicting outcome, while integrating different data sources to guide decisions made by clinicians and authorities.
PreMeDICaL has two research axes:
The aim is to push methodological innovation up to the stakeholders (patients, clinicians, regulators, etc.).
Consequently, beyond these methodological developments, innovative responses to the public health challenge posed by respiratory allergies are targeted.
In addition to leveraging machine learning algorithms and appropriate data, it is necessary to combine them with clinical expertise and existing recommendations. The long-term aim is to have both a strong scientific and a strong societal impact, with a substantial improvement in the quality of care for patients and major consequences for the medical profession, by providing much earlier access to innovative solutions and more efficient treatment and care.
With a successful proof of concept in the domain of allergies, built on clear, reproducible pipelines, methodologies and software (in particular clinical decision-support tools), we could thereafter consider other pathologies (such as traumatology and oncology, studied at IDESP).
Hence, a joint team between Inria and Inserm provides a unique opportunity for trans-disciplinary research and collaboration bringing together mathematical, methodological, technological and medical expertise.
The PreMeDICaL team contributes to precision medicine (where the treatment/device is adapted on a patient basis) and to translational medicine which aims at bridging the gap between fundamental research and its practical use.
Randomized controlled trials are considered the gold standard approach for assessing the causal effect (i.e., the treatment effect) of an intervention or a treatment on an outcome of interest.
Indeed, the allocation of the treatment is under control, which implies that there are no confounding factors that could interfere with the treatment (the distribution of covariates for treated and control patients is asymptotically balanced), and simple estimators (such as the difference in mean outcome between the treated and the controls) can be used to consistently estimate the average treatment effect (ATE).
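As an illustration of the difference-in-means estimator mentioned above, here is a minimal sketch on simulated data. The data-generating process, the true effect of 2 and the function names are assumptions made for this example only; they do not come from any study of the team.

```python
import random
import statistics


def simulate_rct(n, seed=0):
    """Simulate a toy RCT in which treatment is assigned by a fair coin flip."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)              # baseline covariate
        a = 1 if rng.random() < 0.5 else 0   # randomized treatment assignment
        # Outcome with a constant treatment effect of 2 (assumed for the example).
        y = 1.0 + 0.5 * x + 2.0 * a + rng.gauss(0.0, 1.0)
        rows.append((x, a, y))
    return rows


def difference_in_means(rows):
    """Mean outcome of the treated minus mean outcome of the controls."""
    treated = [y for _, a, y in rows if a == 1]
    control = [y for _, a, y in rows if a == 0]
    return statistics.fmean(treated) - statistics.fmean(control)


rct = simulate_rct(20000)
ate_hat = difference_in_means(rct)
print(f"difference in means: {ate_hat:.2f} (simulated true ATE: 2.00)")
```

Because the assignment does not depend on the covariate, no adjustment is needed for the estimate to be close to the simulated effect.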
However, RCTs can come with drawbacks. They can be expensive, take a long time to set up, and be compromised by insufficient sample size due to either recruitment difficulties or restrictive inclusion/exclusion criteria. These criteria can lead to a narrowly defined trial sample that differs markedly from the population potentially eligible for the treatment (distributional shift).
Therefore, the findings from RCTs can lack generalizability (or external validity). This has been widely documented in the field of respiratory and allergic diseases; see for instance Pahus et al. (2019), which highlights that the population included in RCTs represents less than 10% of the population that will receive treatments.
In contrast, there is an abundance of observational data, collected without systematically designed interventions. Such data can come from different sources: they can be collected for research purposes (such as disease registries, cohorts, biobanks, epidemiological studies), or they can be routinely collected (through electronic health records, insurance claims, administrative databases, patient apps, etc.). In that sense, observational data can be readily available, can include large samples representative of the target populations, and can be less costly than RCTs.
To leverage observational data for treatment effect estimation in health domains, several laws built on studies by the US Food and Drug Administration (FDA) encourage the use of “real world data” (RWD), defined as data “derived from sources other than randomized clinical trials”, for regulatory decision making. Clinical evidence regarding the usage and potential benefits or risks of a medical product derived from the analysis of RWD is named Real World Evidence (RWE). The European Medicines Agency (EMA) is also a very active regulatory authority working with RWD to facilitate the development of and access to medicines. However, despite the large number of methods available to estimate the causal treatment effect from observational data, such as matching, inverse probability weighting (IPW) or more recent doubly robust methods based on machine learning, there are often concerns about the quality of these “big data” and about the resulting causal claims.
Indeed, relying on observational data remains controversial because of the lack of controlled experimental interventions, which opens the door to confounding biases (lack of internal validity).
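To make the notion of confounding bias concrete, the following sketch compares a naive difference in means with an inverse probability weighting (IPW) estimate on simulated observational data. Everything here is assumed for illustration: a single binary confounder, a true effect of 1, and a propensity score estimated by the treatment frequency within each stratum.

```python
import random
import statistics


def simulate_observational(n, seed=1):
    """Toy observational data: a binary confounder drives treatment and outcome."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = 1 if rng.random() < 0.5 else 0         # confounder (e.g. severity)
        p_treat = 0.8 if x == 1 else 0.2           # severe patients are treated more often
        a = 1 if rng.random() < p_treat else 0
        y = 1.0 * a - 2.0 * x + rng.gauss(0.0, 1.0)  # simulated true ATE = 1
        rows.append((x, a, y))
    return rows


def naive_difference(rows):
    """Difference in means that ignores the confounder."""
    treated = [y for _, a, y in rows if a == 1]
    control = [y for _, a, y in rows if a == 0]
    return statistics.fmean(treated) - statistics.fmean(control)


def ipw_ate(rows):
    """IPW estimate with the propensity score estimated per stratum of x."""
    propensity = {}
    for level in (0, 1):
        assignments = [a for x, a, _ in rows if x == level]
        propensity[level] = sum(assignments) / len(assignments)
    total = 0.0
    for x, a, y in rows:
        e = propensity[x]
        total += a * y / e - (1 - a) * y / (1 - e)
    return total / len(rows)


obs = simulate_observational(20000)
naive_hat = naive_difference(obs)
ipw_hat = ipw_ate(obs)
print(f"naive: {naive_hat:.2f}, IPW: {ipw_hat:.2f} (simulated true ATE: 1.00)")
```

In this simulation the naive contrast is biased because treated patients are more often severe, while the weighted estimate is close to the simulated effect. This only holds because the confounder is observed, which is precisely what cannot be guaranteed with real observational data.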
Observational data and clinical trial data can provide different perspectives when evaluating an intervention or a medical treatment. Combining the information gathered from experimental and observational data is a promising avenue for medical research, because the knowledge acquired from integrative analyses could not be gathered from a single-source analysis alone. Three potential high impact applications of observational and clinical data are:
There is an abundant literature on bridging the findings from an RCT to a target population and combining both sources of information.
Similar problems have been termed transportability and data fusion, and have connections to the covariate shift/domain generalization problem in ML.
Colnet et al. (2020) reviewed the methods to generalize the treatment effect while accounting for the distributional shift (IPSW, g-formula, AIPSW, calibration weighting, etc.) (a), or to improve the estimate of the conditional average treatment effect (CATE, i.e. heterogeneous effect) while correcting for confounding factors not measured in the observational study (c).
However, these methods have many shortcomings and there are still many challenges to address.
We provide below examples of the methodological bottlenecks we aim to overcome.
Integrating heterogeneous data (time series, images, text, numerical or categorical data), potentially from different centers, to establish predictive models involves many obstacles. When a patient is described by several sources, problems of high dimensionality are exacerbated. We will start with the question of the links between the sources before tackling some challenges posed by missing values:
The first application domain of PreMeDICaL is respiratory diseases, and in particular asthma.
For more than 30 years, there has been an increase in a number of chronic non-communicable diseases (NCD), such as asthma, allergies and other respiratory diseases.
Allergies are the fourth most common chronic disease in the world. The World Health Organization (WHO) predicts that by 2050, one in two people in the world will suffer from allergies. In France, the number of people suffering from allergies has doubled in 20 years, particularly among children and young people.
Although the expression of these diseases results from the interaction between the genetic background and the environment, especially through epigenetic mechanisms, their sudden increase is solely due to the environmental changes that occurred in the last decades because of the Western lifestyle, the genetic heritage requiring centuries to change.
A full understanding of the complexity of chronic NCD prompts researchers to analyse large data sets, using appropriate markers and tools (e.g., biological, clinical, behavioural, economic, social, demographic and environmental data, patient experience, patient social networks), in an etiological and evaluative way, in order to determine patients' phenotypic pathways, explain their impacts, causes and influences, prevent these diseases and improve their prognosis.
Integrating these different sources of information, collected by several actors (healthcare professionals, public authorities or patients themselves), thus offers new opportunities to design personalized solutions by adapting treatment to the patient and the organizational context, leading to improved patient care and prevention policies.
With a successful proof of concept in the domain of allergies, built on clear, reproducible pipelines, methodologies and software, we will thereafter consider other pathologies (such as traumatology and oncology, studied at IDESP).
From a methodological point of view, the aim is to improve existing statistical and ML methods, and to develop new ones, for establishing evidence on the efficiency of treatment by data enrichment (data fusion), taking the example of allergen immunotherapy (AIT) in respiratory diseases.
An important outcome of this research is that these methodological works have a concrete impact on the design of future clinical trials and that the new methodology will be supported by regulatory authorities.
Indeed, exploiting both RCTs and observational data serves different purposes, such as predicting the treatment effect on new populations, increasing the generalizability of clinical trials (so that they are more representative of the patient population who may benefit from the treatment) and defining new inclusion criteria (by identifying subgroups who can benefit from treatment). This research is part of the PEPR project "Next methodological challenges in clinical trials in the era of digital health".
From a technological point of view, the aim is to provide software (starting with open access) so that these methods can be applied in practice by study stakeholders, clinicians and the clinical trial community.
From the clinical and patient point of view, the different projects aim at quantifying the clinical benefit of treatment (over time), taking into account all patient characteristics, and at providing useful clinical prognosis tools allowing clinicians to treat every patient optimally. The aim is to give patients better care and early access to innovation. In addition, these works can lead to a better adoption by the medical community of certain (advanced) techniques used to estimate the effects of treatment on patients (by comparing the results obtained in an RCT with the RWE).
From a public-health point of view, the aim is to guide decisions made by investigators, sponsors and authorities. Better trial designs may also have an important impact in terms of cost reduction.
Finally, we aim to have a significant impact in the field of allergy treatments by providing new knowledge that may change guidelines and practice.
Margaux Zaffran received the Séphora Berrebi Women in Advanced Mathematics & Computer Science Scholarship, which aims at encouraging the active involvement of young women in scientific research, especially in mathematics and computer science. The scholarship recognized her PhD work on conformal prediction, and the grant allowed her to undertake a research stay at the Technion in Israel.
The FactoMineR package is dedicated to principal component methods for exploring, summarizing and visualizing data. Dimensionality reduction methods include PCA, correspondence analysis (CA) for count data such as document-word data, multiple correspondence analysis (MCA) for categorical data such as survey data, factorial analysis of mixed data (FAMD) for both types of variables as well as methods for groups of variables, of individuals (multiple factorial analysis, MFA), for hierarchy …
References: https://husson.github.io/MOOC_AnaDo/index.html https://husson.github.io/MOOC.html#PCAcourse
Causal inference taskview: to list and organize all the R packages on causal inference
RCTs are the current gold-standard to empirically measure a causal effect of a given intervention on an outcome.
More recently, however, concerns have been raised about the limited scope of RCTs: stringent eligibility criteria, unrealistic real-world compliance, short timeframe, limited sample size, etc. Such limitations threaten the external validity of RCT studies to other situations or populations [51].
The use of complementary non-randomized data, referred to as observational or real-world data, holds promise as an additional source of evidence, in particular when combined with trials.
Transportability (also known as generalization, recoverability from sampling bias, or data fusion [52, 49]) makes it possible to generalize or transport the trial findings toward a target population of interest, potentially subject to a covariate distributional shift.
RCT and observational data are seldom acquired as part of a homogeneous effort. As a result, they come with different covariates.
Restricting the analysis to the shared covariates raises the risk of omitting an important one, leading to identifiability issues. This problem is reminiscent of unobserved confounding in causal inference with a single observational data set.
In [1], we suggest a sensitivity analysis to handle cases where such covariates (namely treatment effect modifiers that are shifted between the two data sets when studying the risk difference) are missing in one or both data sets.
We also completed proofs of the consistency of generalization estimators that use either weighting (Inverse Propensity of Sampling Weighting, IPSW) or outcome modeling, or that combine the two in doubly robust approaches with Augmented IPSW (AIPSW).
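The outcome-modeling route can be sketched as follows: fit one outcome model per arm on the trial, then average the predicted treated-minus-control contrast over the covariates of the target sample. This is a toy illustration, not the estimators analysed in the cited work; the linear outcome models, the simulated shift in a single effect modifier and all names are assumptions.

```python
import random
import statistics


def simulate(seed=2, n_trial=5000, n_target=5000):
    """Trial with an effect modifier x ~ N(0, 1); target sample with x ~ N(1, 1)."""
    rng = random.Random(seed)
    trial = []
    for _ in range(n_trial):
        x = rng.gauss(0.0, 1.0)
        a = 1 if rng.random() < 0.5 else 0
        y = x + a * (1.0 + x) + rng.gauss(0.0, 1.0)   # simulated CATE(x) = 1 + x
        trial.append((x, a, y))
    target_x = [rng.gauss(1.0, 1.0) for _ in range(n_target)]
    return trial, target_x


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def outcome_model_ate(trial, target_x):
    """Average the fitted treated-minus-control contrast over the target covariates."""
    fits = {}
    for arm in (0, 1):
        xs = [x for x, a, _ in trial if a == arm]
        ys = [y for x, a, y in trial if a == arm]
        fits[arm] = fit_line(xs, ys)
    contrasts = []
    for x in target_x:
        mu1 = fits[1][1] + fits[1][0] * x
        mu0 = fits[0][1] + fits[0][0] * x
        contrasts.append(mu1 - mu0)
    return statistics.fmean(contrasts)


trial, target_x = simulate()
trial_only = (statistics.fmean([y for _, a, y in trial if a == 1])
              - statistics.fmean([y for _, a, y in trial if a == 0]))
transported = outcome_model_ate(trial, target_x)
print(f"trial ATE: {trial_only:.2f}, transported ATE: {transported:.2f}")
```

In this simulation the trial effect is about 1 while the effect in the shifted target population is about 2, and only the transported estimate reflects the latter. The result depends on the outcome models being correctly specified, which is assumed here by construction.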
We further analysed the IPSW estimator, which consists of re-weighting the trial so that it resembles the observational sample, in [42]. In particular, we established its finite-sample bias and variance (the literature mostly focuses on asymptotic results) and an upper bound on the risk of different versions of the estimator: oracle, semi-oracle, etc. This work can lead to practical recommendations in terms of data collection (e.g., doubling the size of the observational data leads to a smaller asymptotic variance than doubling the size of the trial).
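The re-weighting idea can be illustrated with a single categorical covariate, for which the weights reduce to a ratio of empirical frequencies between the target sample and the trial. This is a simplified sketch under assumed simulation settings (binary effect modifier, known assignment probability of 0.5 in the trial, hypothetical function names), not the exact estimators studied in the cited work.

```python
import random
from collections import Counter


def simulate(seed=3, n_trial=20000, n_target=20000):
    """Binary effect modifier: rare in the trial (20%), frequent in the target (70%)."""
    rng = random.Random(seed)
    trial = []
    for _ in range(n_trial):
        x = 1 if rng.random() < 0.2 else 0
        a = 1 if rng.random() < 0.5 else 0
        y = a * (1.0 + 2.0 * x) + rng.gauss(0.0, 1.0)   # simulated CATE(x) = 1 + 2x
        trial.append((x, a, y))
    target_x = [1 if rng.random() < 0.7 else 0 for _ in range(n_target)]
    return trial, target_x


def weighted_ate(trial, weights, e=0.5):
    """Horvitz-Thompson contrast in the trial with one weight per covariate level."""
    total = 0.0
    for x, a, y in trial:
        total += weights[x] * (a * y / e - (1 - a) * y / (1 - e))
    return total / len(trial)


def sampling_weights(trial, target_x):
    """Ratio of target to trial empirical frequencies for each covariate level."""
    trial_counts = Counter(x for x, _, _ in trial)
    target_counts = Counter(target_x)
    return {
        level: (target_counts[level] / len(target_x)) / (trial_counts[level] / len(trial))
        for level in trial_counts
    }


trial, target_x = simulate()
unweighted_hat = weighted_ate(trial, {0: 1.0, 1: 1.0})
ipsw_hat = weighted_ate(trial, sampling_weights(trial, target_x))
print(f"trial ATE: {unweighted_hat:.2f}, re-weighted ATE: {ipsw_hat:.2f}")
```

With these settings the simulated effect is 1.4 in the trial population and 2.4 in the target population. The weights are themselves estimated from both samples, which is why the sizes of the trial and of the observational sample both enter the variance of such an estimator.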
The optimal individualized treatment regime (ITR) learned from a source population may not generalize well, due to covariate shift, to the target population on which we aim to apply the ITR. We propose a transfer learning framework, in which covariate information from the target population is available, for ITR estimation with heterogeneous populations and right-censored survival data, a setting that is common in clinical studies and motivated by our medical application.
We characterize the efficient influence function (EIF) and propose a doubly robust estimator of the targeted value function, which accommodates a broad class of functionals of survival distributions. For a pre-specified class of ITRs, we establish the
Model-based unsupervised learning, like any learning task, stalls as soon as missing data occur. This is even more true when the missing data are informative, i.e. missing not at random (MNAR). In this paper, we propose model-based clustering algorithms designed to handle very general types of missing data, including MNAR data. To do so, we introduce a mixture model for different types of data (continuous, count, categorical and mixed) to jointly model the data distribution and the MNAR mechanism, remaining vigilant to the degrees of freedom of each. Eight different MNAR models, which depend on the class membership and/or on the values of the missing variables themselves, are proposed. For a particular type of MNAR model, for which the missingness depends on the class membership, we show that statistical inference can be carried out on the data matrix concatenated with the missing mask, considering a MAR mechanism instead; this specifically underlines the versatility of the studied MNAR models. Then, we establish sufficient conditions for identifiability of the parameters of both the data distribution and the mechanism. Regardless of the type of data and the mechanism, we propose to perform clustering using EM or stochastic EM algorithms specially developed for the purpose. Finally, we assess the numerical performance of the proposed methods on synthetic data and on the real medical registry TraumaBase® as well.
Most statistical learning and artificial intelligence methodologies provide point predictions, without any indication of the degree of confidence that can be given to these predictions (i.e. without predictive intervals). This lack of uncertainty quantification is a major barrier to the adoption of powerful machine learning methods by society. Probabilistic forecasts, i.e. predicting the entire probability distribution and not only the conditional expectation, could partially tackle this issue, but they are only valid asymptotically, require strong assumptions on the data (e.g. normality) and/or are model-dependent. The emerging field of conformal prediction (CP) [53, 48, 47] is a promising framework for distribution-free uncertainty quantification. It is a general procedure to build predictive intervals for any predictive model (including black-box methods such as deep learning) that are valid (i.e. achieve nominal marginal coverage) in finite samples, without any assumption on the data-generating process other than exchangeability. This is extremely promising for decision support tools in critical applications: healthcare, autonomous driving, etc. An extension of CP (Conformalized Quantile Regression [50]) was used by the Washington Post to predict the U.S. presidential elections (2020).
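A minimal sketch of split conformal prediction, one common variant of CP, may help fix ideas. The simulated data, the deliberately misspecified linear model and the function names are assumptions for illustration; the point is that the interval width is calibrated on held-out residuals rather than derived from the model.

```python
import math
import random
import statistics


def make_data(n, rng):
    """Nonlinear signal with Gaussian noise (assumed toy data)."""
    data = []
    for _ in range(n):
        x = rng.uniform(-2.0, 2.0)
        data.append((x, math.sin(x) + rng.gauss(0.0, 0.3)))
    return data


def fit_line(data):
    """Least-squares line, used here as a deliberately imperfect predictor."""
    xs = [x for x, _ in data]
    ys = [y for _, y in data]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in data)
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx


def split_conformal(train, calib, alpha):
    """Return a predictor and the half-width of its conformal interval."""
    slope, intercept = fit_line(train)

    def predict(x):
        return intercept + slope * x

    # Conformity scores: absolute residuals on the calibration set.
    scores = sorted(abs(y - predict(x)) for x, y in calib)
    rank = math.ceil((len(scores) + 1) * (1 - alpha))
    half_width = scores[rank - 1] if rank <= len(scores) else math.inf
    return predict, half_width


rng = random.Random(5)
train, calib, test = make_data(1000, rng), make_data(1000, rng), make_data(5000, rng)
predict, half_width = split_conformal(train, calib, alpha=0.1)
coverage = statistics.fmean(
    [1.0 if abs(y - predict(x)) <= half_width else 0.0 for x, y in test]
)
print(f"half-width: {half_width:.2f}, empirical coverage: {coverage:.3f}")
```

The empirical coverage on fresh exchangeable data should be close to the nominal 90% even though the linear model is wrong. The guarantee is marginal: it says nothing about coverage for a particular value of x.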
Given the non-exchangeability of time series data, CP cannot be applied as such in this framework. To address this, we study and extend Adaptive Conformal Inference (ACI) [46] in the context of time series with general dependency. ACI is a method designed to handle an online setting with distributional shift. It relies on an adaptive miscoverage rate that is updated according to previous performance and to a hyper-parameter playing the role of a learning rate. First, we study theoretically, using Markov chain theory, the impact of the learning rate on the length of the predictive intervals, in the exchangeable and auto-regressive cases, in order to describe not only the validity but also the efficiency of ACI. However, this analysis is hardly usable in practice, since the optimal learning rate depends on the unknown data distribution. This is why we introduce AgACI, a parameter-free method using online expert aggregation [45]. Finally, we compare ACI, AgACI and other methods slightly adapted to time series in extensive synthetic experiments. These experiments highlight that AgACI achieves good performance in terms of validity and efficiency. To allow for better benchmarking of existing and new methods, we provide implementations in Python of (all) the described methods and a complete pipeline of analysis on GitHub.
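The adaptive update at the heart of ACI can be sketched in a few lines. This is not the team's published implementation: the AR(1) series, the naive forecaster, the window of past scores and the parameter values are assumptions chosen to keep the example short, and the update used is the commonly stated rule alpha_{t+1} = alpha_t + gamma * (alpha - err_t), where err_t equals 1 when the interval misses the observation.

```python
import math
import random


def simulate_ar1(n, phi=0.8, seed=4):
    """Auto-regressive series of order 1 (assumed toy data)."""
    rng = random.Random(seed)
    series, value = [], 0.0
    for _ in range(n):
        value = phi * value + rng.gauss(0.0, 1.0)
        series.append(value)
    return series


def score_quantile(scores, level):
    """Empirical quantile of past scores, with the conventions used by ACI."""
    if level >= 1.0:
        return math.inf      # alpha_t <= 0: infinite interval, always covers
    if level <= 0.0:
        return -math.inf     # alpha_t >= 1: empty interval, never covers
    ordered = sorted(scores)
    return ordered[math.ceil(level * len(ordered)) - 1]


def run_aci(series, alpha=0.1, gamma=0.01, window=200, warmup=50):
    """Online intervals around a naive forecast, with an adaptive miscoverage rate."""
    alpha_t = alpha
    scores = [abs(series[t] - series[t - 1]) for t in range(1, warmup)]
    errors = []
    for t in range(warmup, len(series)):
        prediction = series[t - 1]                  # naive forecaster (placeholder)
        half_width = score_quantile(scores[-window:], 1.0 - alpha_t)
        score = abs(series[t] - prediction)
        err = 0 if score <= half_width else 1
        errors.append(err)
        alpha_t += gamma * (alpha - err)            # ACI update with learning rate gamma
        scores.append(score)
    return sum(errors) / len(errors), alpha_t


miscoverage, final_alpha = run_aci(simulate_ar1(5000))
print(f"long-run miscoverage: {miscoverage:.3f} (target 0.100)")
```

A larger gamma makes the rate react faster to recent errors at the price of more variable interval lengths, which is the trade-off the learning-rate analysis above addresses.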
Uncertainty quantification has not yet been addressed in the presence of missing values. In the finite-sample regime, we show that, for almost all imputations and missing-value mechanisms, the imputed data set is exchangeable. Thus, CP properties still hold and marginal guarantees are met. Nevertheless, we emphasize that the average coverage varies depending on the pattern of missing values: the procedure tends to construct prediction intervals that often under-cover the response conditional on a given missing pattern.
After theoretically studying the case of a linear model, we propose a methodology, missing data augmentation, to achieve approximate conditional guarantees conditional on the
International medical trials with AB science: member of the Data Safety Monitoring Board (DSMB) which will oversee the safety of study AB20001 entitled: “A Randomized, Open-label Phase 2 Clinical Trial to Evaluate the Safety and Efficacy of Masitinib combined with Isoquercetin, and Best Supportive Care in Hospitalized Patients with Moderate and Severe COVID-19.”
HORIZON-HLTH-2021-ENVHLTH-02 SynAir-G (500k€)
Dissemination from the creation of the PreMeDICaL team in June 2022 to December 2022
Annals of Statistics, Journal of Statistical Computation and Simulation, Journal of Statistical Software, WAO Journal, Allergy, Clin Exp Allergy