The SISTM team is located in the Carreire area of the University of Bordeaux.
The overall objective of SISTM is to develop statistical methods for the integrative analysis of health data, especially those related to clinical immunology, in order to answer specific questions raised in the application field.
To reach this objective we are developing statistical methods belonging to two main research areas:
Statistical and mechanistic modeling, especially based on ordinary differential equation systems, fitted to population and sparse data
Statistical learning methods in the context of high-dimensional data
These two approaches are used to address different types of questions. Statistical learning methods are developed and applied to deal with the high-dimensional characteristics of the data. The outcome of this research leads to hypotheses linked to a restricted number of markers. Mechanistic models are then developed and used for modeling the dynamics of a few markers. For example, regularized methods can be used to select relevant genes among the 20,000 measured with microarray technology, whereas differential equations can be used to capture the dynamics of, and the relationships between, several genes followed over time by a q-PCR assay or RNA-seq.
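To make the gene-selection example concrete, here is a minimal sketch of one regularized method, the lasso, fitted by cyclic coordinate descent on toy data with many more variables than samples. The choice of the lasso, the function names (`lasso_cd`, `lambda_max`) and the simulated data are illustrative assumptions, not the team's actual pipeline.

```python
import random


def soft_threshold(z, t):
    """Soft-thresholding operator, the proximal map of t * |.|."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, y, lam, n_sweeps=200):
    """Cyclic coordinate descent for (1/2n)*||y - X b||^2 + lam*||b||_1."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta, with beta = 0 initially
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Covariance of column j with the partial residual (variable j removed)
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_bj - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new_bj
    return beta


def lambda_max(X, y):
    """Smallest penalty for which every coefficient is exactly zero."""
    n, p = len(X), len(X[0])
    return max(abs(sum(X[i][j] * y[i] for i in range(n)) / n) for j in range(p))


# Toy data (hypothetical): 20 samples, 200 candidate "genes", only two matter.
rng = random.Random(0)
n, p = 20, 200
X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
y = [3.0 * row[0] - 2.0 * row[1] + rng.gauss(0.0, 0.1) for row in X]
beta_hat = lasso_cd(X, y, lam=0.5 * lambda_max(X, y))
selected = [j for j, b in enumerate(beta_hat) if b != 0.0]  # indices kept by the lasso
```

The L1 penalty sets most coefficients exactly to zero, which is what turns a regression with 200 candidates into a short list of markers. In practice the penalty level would be chosen by cross-validation rather than fixed at half of `lambda_max`.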
Data are generated in clinical trials or biological experiments. Our main application of interest is the immune response to vaccines or other immune interventions (such as exogenous cytokines), mainly in the context of HIV infection. The methods developed in this context can be applied in other circumstances, but the focus of the team on immunology is important for the relevance of the results and their translation into practice, thanks to a longstanding collaboration with several immunologists and the involvement of the team in the Labex Vaccine Research Institute. Examples of objectives related to this application field are:
To understand how the immune response is generated by immune interventions (vaccines or interleukins)
To predict the immune response to a given immune intervention, in order to design future studies and adapt interventions to individual patients
When studying the dynamics of a given marker, say the HIV concentration in the blood (HIV viral load), one can for instance use descriptive models summarizing the dynamics over time in terms of slopes of the trajectories. These slopes can be compared between treatment groups or according to patients' characteristics. Another way of analyzing these data is to define a mathematical model based on the biological knowledge of what drives HIV dynamics. In this case, it is mainly the availability of target cells (the CD4+ T lymphocytes), the production and death rates of infected cells and the clearance of the viral particles that impact the dynamics. A mathematical model, most often based on ordinary differential equations (ODE), can then be written. Estimating the parameters of this model to fit observed HIV viral load gave a crucial insight into HIV pathogenesis, as it revealed the very short half-life of the virions and infected cells and therefore a very high turnover of the virus, making mutations a very frequent event.
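The mechanisms listed above (target cell availability, production and death of infected cells, clearance of virions) can be sketched as a three-compartment ODE system. The code below is a minimal illustration under stated assumptions: it uses the textbook target-cell-limited form, a hand-written Runge-Kutta integrator, and hypothetical parameter values and names (`lam`, `d`, `beta`, `delta`, `prod`, `c`); it is not the model fitted by the team.

```python
import math


def hiv_rhs(state, p):
    """Right-hand side of a basic target-cell-limited model (illustrative)."""
    T, I, V = state
    dT = p["lam"] - p["d"] * T - p["beta"] * T * V  # supply, death, infection
    dI = p["beta"] * T * V - p["delta"] * I         # new infections, infected-cell death
    dV = p["prod"] * I - p["c"] * V                 # virion production, clearance
    return (dT, dI, dV)


def rk4_step(f, state, p, h):
    """One classical fourth-order Runge-Kutta step of size h."""
    k1 = f(state, p)
    k2 = f(tuple(s + 0.5 * h * k for s, k in zip(state, k1)), p)
    k3 = f(tuple(s + 0.5 * h * k for s, k in zip(state, k2)), p)
    k4 = f(tuple(s + h * k for s, k in zip(state, k3)), p)
    return tuple(
        s + h / 6.0 * (a + 2.0 * b + 2.0 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def simulate(state0, p, t_end, h=0.001):
    """Integrate from t = 0 to t_end; returns the list of visited states."""
    n_steps = int(round(t_end / h))
    traj = [tuple(state0)]
    for _ in range(n_steps):
        traj.append(rk4_step(hiv_rhs, traj[-1], p, h))
    return traj


def half_life(rate):
    """Half-life of a compartment that decays exponentially at the given rate."""
    return math.log(2.0) / rate


# Hypothetical parameter values (per day), chosen only to make the sketch run.
params = {"lam": 1e4, "d": 0.01, "beta": 2.4e-8, "delta": 1.0, "prod": 0.0, "c": 23.0}
# prod = 0 mimics a fully effective block of virion production.
trajectory = simulate((1e6, 1e4, 1e5), params, t_end=0.5)
```

With production blocked, the viral compartment decays exponentially at rate `c`, so the decline of viral load carries direct information about the clearance rate and hence about the virion half-life, `half_life(c)`. This is the kind of reasoning by which fitted ODE parameters translate into statements about turnover.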
Having a good mechanistic model in a biomedical context such as HIV infection opens doors to various applications beyond a good understanding of the data. Global and individual predictions can be excellent because of the external validity of a model based on the main biological mechanisms. Control theory may be used to define optimal interventions or optimal designs to evaluate new interventions. Finally, these models can explicitly capture the complex relationship between several processes that change over time and may therefore challenge other proposed approaches, such as marginal structural models, to deal with causal associations in epidemiology.
Therefore, we postulate that this type of model could be very useful in the context of our research, which deals with complex biological systems. Defining the model requires identifying the parameter values that fit the data. In clinical research this is challenging because data are sparse, and often unbalanced, coming from populations of subjects. A substantial inter-individual variability is always present and needs to be accounted for, as this is the main source of information. Although many approaches have been developed to estimate the parameters of non-linear mixed models, the difficulties associated with the complexity of ODE models and with the sparsity of the data, which lead to identifiability issues, need further research.
With the availability of omics data such as genomics (DNA), transcriptomics (RNA) or proteomics (proteins), but also other types of data, such as those arising from the combination of large observational databases (e.g. in pharmacoepidemiology or environmental epidemiology), high-dimensional data have become increasingly common. The use of molecular biology techniques such as Polymerase Chain Reaction (PCR) allows for amplification of DNA or RNA sequences. Nowadays, microarray and Next Generation Sequencing (NGS) techniques make it possible to explore very large portions of the genome. Furthermore, other assays have also evolved, and traditional measures such as cytometry or imaging have become new sources of big data. Therefore, in the context of HIV research, the dimension of the datasets has grown much more in terms of the number of variables per individual than in terms of the number of included patients, although the latter is also growing thanks to multi-cohort collaborations such as CASCADE or COHERE organized in the EuroCoord network.
The objective is either to select the relevant information or to summarize it for understanding or prediction purposes. When dealing with high-dimensional data, the methodological challenge arises from the fact that datasets typically contain many more variables than observations. Hence, multiple testing is an obvious issue that needs to be taken into account. Furthermore, conventional methods, such as linear models, are inefficient and most of the time even inapplicable. Specific methods have been developed, often derived from the machine learning field, such as regularization methods. The integrative analysis of large datasets is challenging. For instance, one may want to look at the correlation between two large-scale matrices composed of the transcriptome on the one hand and the proteome on the other. The comprehensive analysis of these large datasets, which span several levels from molecular pathways to the clinical response of a population of patients, needs specific approaches and a very close collaboration with the providers of the data, that is, the immunologists, the virologists and the clinicians.
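As an illustration of the multiple testing issue, the sketch below implements the Benjamini-Hochberg step-up procedure, one standard way of controlling the false discovery rate when thousands of markers are tested at once. The text does not say which correction the team uses, so this choice and the example p-values are assumptions made for illustration.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a list of booleans, True where the hypothesis is rejected.

    Step-up rule: find the largest rank k such that p_(k) <= (k / m) * alpha,
    then reject the k hypotheses with the smallest p-values.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by increasing p-value
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * alpha:
            k_max = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= k_max:
            rejected[idx] = True
    return rejected


def bonferroni(pvalues, alpha=0.05):
    """Reject only p-values below alpha / m (family-wise error control)."""
    m = len(pvalues)
    return [p <= alpha / m for p in pvalues]


# Hypothetical p-values for ten markers.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
bh_decisions = benjamini_hochberg(pvals)
bonf_decisions = bonferroni(pvals)
```

On these ten p-values the step-up rule rejects the two smallest, whereas the Bonferroni threshold of 0.005 keeps only the first. The gap between the two widens as the number of tested markers grows, which is why false discovery rate control is common with omics data.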
Biological and clinical research has changed dramatically because of technological advances, which make it possible to measure many more biological quantities than before. Clinical research studies can now include traditional measurements such as clinical status, but also thousands of cell populations, peptides and gene expressions for a given patient. This has facilitated the transfer of knowledge from basic to clinical science (from "bench side to bedside") and vice versa, a process often called "Translational medicine". However, the analysis of these large amounts of data needs specific methods, especially when one wants to have a global understanding of the information inherent to complex systems through an "integrative analysis". These systems, like the immune system, are complex because of the many interactions within and between many levels (inside cells, between cells, in different tissues, in various species). This has led to a new field called "Systems biology", rapidly adapted to specific topics such as "Systems Immunology", "Systems vaccinology" and "Systems medicine". From the statistician's point of view, two main challenges appear: i) to deal with the massive amount of data; ii) to find relevant models capturing observed behaviors.
The management of HIV-infected patients and the control of the epidemic have been revolutionized by the availability of highly active antiretroviral therapies. Patients treated with these combinations of antiretrovirals most often have undetectable viral loads, with an immune reconstitution leading to a survival that is nearly the same as that of uninfected individuals. Hence, it has been demonstrated that an early start of antiretroviral treatment may be good for individual patients as well as for the control of the HIV epidemic (by reducing transmission from infected people). However, the implementation of such a strategy is difficult, especially in developing countries. Some HIV-infected individuals do not tolerate antiretroviral regimens or do not reconstitute their immune system. Therefore, vaccines and other immune interventions are required. Many vaccine candidates as well as other immune interventions (IL7, IL15) are currently being evaluated. The challenges here are multiple: the effects of these interventions on the immune system are not fully understood, and there are no good surrogate markers even though the number of measured markers has increased exponentially. Hence, HIV clinical epidemiology has also entered the era of Big Data because of the very deep evaluation at the individual level, leading to a huge amount of complex data, repeated over time, even in clinical trials that include a small number of subjects.
An R package for the gene set analysis of longitudinal gene expression data sets. Under development, and soon to be available on the CRAN website, this package implements a Time-course Gene Set Analysis method and provides useful plotting functions facilitating the interpretation of the results.
We have written a specific program called NIMROD for estimating the parameters of ODE-based population models. It has been regularly updated. For instance, we have adapted the program for parallel computing, in collaboration with the MCIA (Mésocentre de calcul intensif Aquitain) facility, which makes available a large computer with more than 3000 cores. This program is described in . Although the program is available on the ISPED website
An R package for function optimization. Available on CRAN, this package minimizes a function using the Marquardt-Levenberg algorithm. It is particularly useful when the surface to optimize is not strictly convex or is far from quadratic. A new convergence criterion, the relative distance to maximum (RDM), gives the user more confidence in the stopping point than criteria based only on the stabilization of the algorithm.
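The idea can be sketched in a few lines. This is a minimal illustration, not the package's code: it is restricted to two parameters to stay dependency-free, it damps the Hessian by adding a multiple of the identity, and it assumes that the RDM criterion has the form g^T H^{-1} g / m, with g the gradient, H the Hessian and m the number of parameters. All function names are hypothetical.

```python
def solve_2x2(A, b):
    """Solve a 2x2 linear system A x = b by Cramer's rule."""
    (a11, a12), (a21, a22) = A
    det = a11 * a22 - a12 * a21
    return ((b[0] * a22 - a12 * b[1]) / det, (a11 * b[1] - a21 * b[0]) / det)


def is_pos_def(A):
    """Positive definiteness of a symmetric 2x2 matrix (leading minors)."""
    (a11, a12), (a21, a22) = A
    return a11 > 0.0 and a11 * a22 - a12 * a21 > 0.0


def marquardt_minimize(f, grad, hess, x0, eps_rdm=1e-8, max_iter=500):
    """Damped Newton (Marquardt-type) minimization with an RDM-style stopping rule."""
    x, mu, m = tuple(x0), 1e-3, len(x0)
    for iteration in range(max_iter):
        g, H = grad(x), hess(x)
        if is_pos_def(H):
            newton = solve_2x2(H, g)  # H^{-1} g
            rdm = (g[0] * newton[0] + g[1] * newton[1]) / m  # assumed RDM formula
            if rdm < eps_rdm:
                return x, rdm, iteration
        while True:
            # Inflate the diagonal until the matrix is positive definite
            # and the resulting step actually decreases f.
            Hd = ((H[0][0] + mu, H[0][1]), (H[1][0], H[1][1] + mu))
            if is_pos_def(Hd):
                d = solve_2x2(Hd, g)
                x_new = (x[0] - d[0], x[1] - d[1])
                if f(x_new) < f(x):
                    x, mu = x_new, max(mu / 10.0, 1e-12)
                    break
            mu *= 10.0
            if mu > 1e12:
                raise RuntimeError("damping failed to produce a descent step")
    raise RuntimeError("no convergence within max_iter iterations")


# Rosenbrock function: a classic non-quadratic surface with a curved valley.
def rosen(x):
    return (1.0 - x[0]) ** 2 + 100.0 * (x[1] - x[0] ** 2) ** 2


def rosen_grad(x):
    return (-2.0 * (1.0 - x[0]) - 400.0 * x[0] * (x[1] - x[0] ** 2),
            200.0 * (x[1] - x[0] ** 2))


def rosen_hess(x):
    return ((2.0 - 400.0 * x[1] + 1200.0 * x[0] ** 2, -400.0 * x[0]),
            (-400.0 * x[0], 200.0))


x_opt, rdm_final, n_iter = marquardt_minimize(rosen, rosen_grad, rosen_hess, (-1.2, 1.0))
```

Near an optimum, the quantity g^T H^{-1} g approximates twice the remaining gap in the objective, so a small value indicates closeness to the optimum rather than mere stalling of the iterates. That is the sense in which such a criterion is more informative than stopping when successive parameter values stop changing.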
An R package for Variable Selection Using Random Forests. Available on CRAN, this package performs an automatic (meaning completely data-driven) variable selection procedure. Originally designed to deal with high dimensional data, it can also be applied to standard datasets.
The R2GUESS package is a wrapper around the GUESS (Graphical processing Unit Evolutionary Stochastic Search) program. GUESS is a computationally optimised C++ implementation of a fully Bayesian variable selection approach that can analyse, in a genome-wide context, single and multiple responses in an integrated way. The program uses packages from the GNU Scientific Library (GSL) and offers the possibility to re-route computationally intensive linear algebra operations towards the Graphical Processing Unit (GPU) through the use of the proprietary CULA-dense library.
A work (described below), in collaboration with M. Davis and R. Tibshirani from Stanford University, has been published in the "Proceedings of the National Academy of Sciences".
Females have generally more robust immune responses than males for reasons that are not well-understood. Here we used a systems analysis to investigate these differences by analyzing the neutralizing antibody response to a trivalent inactivated seasonal influenza vaccine (TIV) and a large number of immune system components, including serum cytokines and chemokines, blood cell subset frequencies, genome-wide gene expression, and cellular responses to diverse in vitro stimuli, in 53 females and 34 males of different ages. We found elevated antibody responses to TIV and expression of inflammatory cytokines in the serum of females compared with males regardless of age. This inflammatory profile correlated with the levels of phosphorylated STAT3 proteins in monocytes but not with the serological response to the vaccine. In contrast, using a machine learning approach, we identified a cluster of genes involved in lipid biosynthesis and previously shown to be up-regulated by testosterone that correlated with poor virus-neutralizing activity in men. Moreover, men with elevated serum testosterone levels and associated gene signatures exhibited the lowest antibody responses to TIV. These results demonstrate a strong association between androgens and genes involved in lipid metabolism, suggesting that these could be important drivers of the differences in immune responses between males and females.
In collaboration with S. Arlot, we wrote a research report on some theoretical results about random forests.
Random forests are a very effective and commonly used statistical method, but their full theoretical analysis is still an open problem. As a first step, simplified models such as purely random forests have been introduced, in order to shed light on the good performance of random forests. In this paper, we study the approximation error (the bias) of some purely random forest models in a regression framework, focusing in particular on the influence of the number of trees in the forest. Under some regularity assumptions on the regression function, we show that the bias of an infinite forest decreases at a faster rate (with respect to the size of each tree) than that of a single tree. As a consequence, infinite forests attain a strictly better risk rate (with respect to the sample size) than single trees. Furthermore, our results allow us to derive a minimum number of trees sufficient to reach the same rate as an infinite forest. As a by-product of our analysis, we also show a link between the bias of purely random forests and the bias of some kernel estimators.
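The effect of the number of trees on the approximation error can be illustrated numerically. This is a toy sketch under simplifying assumptions, not the models of the report: the regression function is sin(2πx) on [0, 1) treated as periodic, each "tree" is a regular partition into k cells with a uniformly random shift, and each cell predicts the exact mean of the function over that cell, so only the bias is measured. All names are hypothetical.

```python
import math
import random


def cell_mean_sin(a, h):
    """Exact mean of sin(2*pi*x) over the interval [a, a + h]."""
    return (math.cos(2 * math.pi * a) - math.cos(2 * math.pi * (a + h))) / (2 * math.pi * h)


def tree_value(x, shift, k):
    """Piecewise-constant approximation at x for cells of width 1/k offset by shift."""
    h = 1.0 / k
    a = shift + math.floor((x - shift) / h) * h  # left endpoint of the cell containing x
    return cell_mean_sin(a, h)


def approximation_errors(k, n_trees, n_grid=2000, seed=0):
    """Integrated squared bias of an average single tree and of the forest."""
    rng = random.Random(seed)
    shifts = [rng.uniform(0.0, 1.0 / k) for _ in range(n_trees)]
    tree_err = 0.0
    forest_err = 0.0
    for i in range(n_grid):
        x = (i + 0.5) / n_grid
        s = math.sin(2 * math.pi * x)
        values = [tree_value(x, u, k) for u in shifts]
        tree_err += sum((s - v) ** 2 for v in values) / n_trees  # mean error of one tree
        forest_err += (s - sum(values) / n_trees) ** 2           # error of the averaged trees
    return tree_err / n_grid, forest_err / n_grid


single_tree_bias, forest_bias = approximation_errors(k=8, n_trees=50)
```

Averaging cannot increase the squared error at any point (Jensen's inequality), and in this toy setting averaging over random shifts behaves like smoothing with a triangular kernel, which hints at the link with kernel estimators mentioned in the abstract. Rerunning with larger `k` shows the forest error shrinking faster than the single-tree error.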
Roche Institute, through the Vaccine Research Institute, funding one engineer over 2 years (2012-2014)
Cytheris (now RevImmune), through the ANRS, for the development of IL-7, as this is the only company able to produce exogenous IL-7 usable in humans.
The team has strong links with Bordeaux CHU ("Centre Hospitalier Universitaire").
There are strong collaborations with immunologists involved in the Labex Vaccine Research Institute (VRI) as RT is leading the Biostatistics/Bioinformatics division.
Coordination with Jean Weissenbach of the presidential plan of 100 M€ for “Systems biology” (RT)
Deputy director of the Institut de Recherche en Santé Publique IRESP (RT)
RT participates in the EuroCoord network on HIV cohort collaborations as:
a member of the scientific committee of IWHOD (International Workshop on HIV Observational Databases) since 2013,
a project leader on defining references for the CD4 count response to antiretrovirals.
Following the RHOMEO project (ANR-BBSRC Systems biology 2007 call, 2007-2011) steered by RT, a strong collaboration has been established with Pr Robin Callard (UCL Immunology), who visits the team in Bordeaux for one month each year, Andy Yates (Physicist, Glasgow Univ) and Ben Seddon (NIMR, UCL Immunology).
Also, several other international collaborations have been initiated through the Labex:
Steve Self and Peter Gilbert in Seattle (HVTN HIV vaccine Trial Network),
Marcus Altfeld (Immunologist, Hamburg & Harvard).
This group in collaboration with other teams in Europe is writing a response to the H2020 call PHC 2 – 2015: Understanding diseases: systems medicine.
BL is on sabbatical at the University of Queensland, Australia.
Chloé Pasin is visiting Steve Self at HVTN, Seattle.
Boris Hejblum visited François Caron at Oxford University, United Kingdom.
BMW (Bordeaux Modeling Workshop), a two-day workshop with 30 participants, was organized.
8th French Clinical Epidemiology Conference EPICLIN
RT has been a member of the scientific committee of IWHOD (International Workshop on HIV Observational Databases) since 2013.
Lifetime Data Analysis (DC)
Stat Surveys (DC)
Journal de la Société Française de Statistique (DC)
The members of the team reviewed numerous papers for the following international journals:
AIDS (RT)
Biometrical (BL)
Biometrics (DC)
Annals of Applied Statistics (DC)
Briefings in Bioinformatics (RG)
Health Services and Outcome Methodology (DC)
Information Science (RG)
International Journal of Epidemiology (DC)
Journal of Multivariate Analysis (RG)
Journal of the Royal Statistical Society: Series A (DC)
Machine Learning (RG)
Neurocomputing (RG)
Statistics in Medicine (MA, DC, RT)
Master : MA teaches in the two years of the Master of Public Health at ISPED, Univ. Bordeaux, France. Furthermore, she is head of the first year of the master.
Master : DC teaches occasionally in the Biostatistics specialty of the second year of the Master of Public Health.
Master : RG, teaches in the two years of the Master of Public Health.
BL teaches at the School of Mathematics and Physics (The University of Queensland, Australia).
Master : RT, teaches in the two years of the Master of Public Health, and he is head of the Epidemiology specialty of the second year of the Master of Public Health.
E-learning
MA is head of the first year of the e-learning program of the Master of Public Health, and teaches in it.
RG teaches in the first year of the e-learning program of the Master of Public Health.
RT is head of the Epidemiology specialty of the second year of the e-learning program of the Master of Public Health, and teaches in it.
PhD in progress : Ana Jarne, Modélisation d'interventions sur le système immunitaire pour le traitement et les vaccins contre le VIH, from Nov 2012, co-directed by Daniel Commenges & Rodolphe Thiébaut
PhD in progress : Boris Hejblum, Analyse intégrative de données de grande dimension appliquée à la recherche vaccinale, from Oct 2011, co-directed by Rodolphe Thiébaut & François Caron
PhD in progress : Marie-Quitterie Picat, Analyse des biomarqueurs dans les troubles immunologiques des maladies du système immunitaire, from Nov 2012, directed by Rodolphe Thiébaut
PhD in progress : Perrine Soret, Modélisation de données longitudinales en grande dimension, from Oct 2014, directed by Marta Avalos
Master internship : Damien Chimits, Analyse par groupe de gènes de données longitudinales d'expression génique, from Mar 2014 to Sep 2014, co-directed by Rodolphe Thiébaut & Boris Hejblum
Master internship : Edouard Lhomme, Modélisation de la dynamique de la réponse immunitaire précoce au vaccin VIH, from Mar 2014 to Sep 2014, directed by Rodolphe Thiébaut
Master internship : Mélanie Née, Consommation médicamenteuse et risque d’accident de la route : exploration par simulation de schémas d’études épidémiologiques applicables à partir des données médico-administratives, from Mar 2014 to Sep 2014, co-directed by Marta Avalos & Ludivine Orriols
Master internship : Chloé Pasin, Modelling the immune response to HIV vaccine, from Feb 2014 to Sep 2014, directed by Rodolphe Thiébaut
Members of the team were involved in 6 PhD juries, 2 professorships and 2 HDR.