EN FR
EN FR


Section: New Results

Regression and machine learning

Uniform asymptotic certainty bands for the conditional cumulative distribution function

Participants: S. Ferrigno, A. Muller-Gueudin

External collaborator: M. Maumy-Bertrand (IRMA, Strasbourg)

In this work with Myriam Maumy-Bertrand (IRMA, Strasbourg), we study the conditional cumulative distribution function and a nonparametric estimator associated to this function. The conditional cumulative distribution function has the advantages of completely characterizing the law of the random considered variable, allowing to obtain the regression function, the density function, the moments and the conditional quantile function. As a nonparametric estimator of this function, we focus on local polynomial techniques described in Fan and Gijbels [64] . In particular, we use the local linear estimation of the conditional cumulative distribution function.

The objective of this work is to establish uniform asymptotic certainty bands for the conditional cumulative distribution function. To this aim, we give exact rate of strong uniform consistency for the local linear estimator of this function. We show that limit laws of the logarithm are useful in the construction of uniform asymptotic certainty bands for the conditional distribution function. In particular, we use a single bootstrap to construct sharp uniform asymptotic bands of this estimator.

We illustrate our results with simulations and a study of fetal growth which is based on 694 fetuses (carefully selected by exclusion of multiple pregnancies, malformed, macerated or serious ill fetuses, or those with chromosomal abnormalities) autopsied in fetopathologic units of the "Service de foetopathologie et de placentologie" of the Maternité Régionale Universitaire (CHU Nancy, France) between 1996 and 2013.

We have presented our results in two international conferences with proceedings in Lille in June 2015 ("47èmes Journées de Statistique de la SFdS") [21] and London in December 2015 ("CM Statistics") [36] .

Omnibus tests for regression models.

Participants: R. Azaïs, S. Ferrigno

External collaborator: M-J. Martinez Marcoux (LJK, Grenoble)

The aim of this collaboration with Marie-José Martinez Marcoux (LJK, Grenoble) is to compare, through simulations, several methods to test the validity of a regression model. These tests can be "directional" in that they are designed to detect departures from mainly one given assumption of the model (for example the regression function, the variance or the error) or global (for example the conditional distribution function). The establishment of such statistical tests require the use of nonparametric estimators various functions (regression, variance, cumulative distribution function). The idea would then be able to build a tool ( package R) that allows a user to test the validity of the model it uses through different methods and varying parameters associated with modeling. This work is currently in progress.

Data analysis techniques: a tool for cumulative exposure assessment

Participant: J-M. Monnez

External collaborators : W. Kihal, B. Lalloué, C. Padilla, D, S. Zmirou-Navier

Everyone is subject to environmental exposures from various sources, with negative health impacts (air, water and soil contamination, noise, etc.) or with positive effects (e.g. green space). Studies considering such complex environmental settings in a global manner are rare. We propose to use statistical factor and cluster analyses to create a composite exposure index with a data-driven approach, in view to assess the environmental burden experienced by populations. We illustrate this approach in a large French metropolitan area. The study was carried out in the Great Lyon area (France, 1.2 M inhabitants) at the census Block Group (BG) scale. We used as environmental indicators ambient air NO2 annual concentrations, noise levels and proximity to green spaces, to industrial plants, to polluted sites and to road traffic. They were synthesized using Multiple Factor Analysis (MFA), a data-driven technique without a priori modeling, followed by a Hierarchical Clustering to create BG classes. The first components of the MFA explained, respectively, 30, 14, 11 and 9% of the total variance. Clustering in five classes group: (1) a particular type of large BGs without population; (2) BGs of green residential areas, with less negative exposures than average; (3) BGs of residential areas near midtown; (4) BGs close to industries; and (5) midtown urban BGs, with higher negative exposures than average and less green spaces. Other numbers of classes were tested in order to assess a variety of clustering. We present an approach using statistical factor and cluster analyses techniques, which seem overlooked to assess cumulative exposure in complex environmental settings. Although it cannot be applied directly for risk or health effect assessment, the resulting index can help to identify hot spots of cumulative exposure, to prioritize urban policies or to compare the environmental burden across study areas in an epidemiological framework [9] .

Online Partial Principal Component Analysis of a Data Stream

Participant: J-M. Monnez

External collaborator: R. Bar (EDF, R & D)

Consider a data stream and suppose that each data vector is a realization of a random vector whose expectation varies with time, the law of the centered data vector being stationary. Consider the principal component analysis (PCA) of this centered vector called partial PCA. In this study are defined online estimators of direction vectors of the first principal axes by stochastic approximation processes using a data batch at each step or all the data until the current step. This extends a former result obtained by the second author by using one data vector at each step. This is applied to partial generalized canonical correlation analysis by defining a stochastic approximation process of the metric involved in this case using all the data until the current step. If the expectation of the data vector varies according to a linear model, a stochastic approximation process of the model parameters is used. All these processes can be performed in parallel.

Moreover, several incremental procedures of linear and logistic regression of a data stream were defined and tested and compared on existing batch data files and on simulated data streams.

Prognostic Value of Estimated Plasma Volume in Heart Failure.

Participant: J-M. Monnez

External collaborators: E. Albuisson, B. Pitt, P. Rossignol, F. Zannad (CHU, Nancy)

The purpose of this study was to assess the prognostic value of the estimation of plasma volume or of its variation beyond clinical examination in a post-hoc analysis of EPHESUS (Eplerenone Post-Acute Myocardial Infarction Heart Failure Efficacy and Survival Study).

Assessing congestion after discharge is challenging but of paramount importance to optimize patient management and to prevent hospital readmissions.

The present analysis was performed in a subset of 4,957 patients with available data (within a full dataset of 6,632 patients). The study endpoint was cardiovascular death or hospitalization for heart failure (HF) between months 1 and 3 after post-acute myocardial infarction HF. Estimated plasma volume variation (ΔePVS) between baseline and month 1 was estimated by the Strauss formula, which includes hemoglobin and hematocrit ratios. Other potential predictors, including congestion surrogates, hemodynamic and renal variables, and medical history variables, were tested.

An instantaneous estimation of plasma volume at month 1 was defined and also tested.

Multivariate analysis was performed with stepwise logistic regression. ΔePVS was selected in the model. The corresponding prognostic gain measured by integrated discrimination improvement was significant. Nevertheless, instantaneous estimation of plasma volume at month 1 was found to be a better predictor than ΔePVS. LDA with mixed variables was also performed and confirmed these results.

In HF complicating myocardial infarction, congestion as assessed by the Strauss formula and an instantaneous derived measurement of plasma volume provided a predictive value of early cardiovascular events beyond routine clinical assessment. Prospective trials to assess congestion management guided by this simple tool to monitor plasma volume are warranted [4] .

Death or hospitalization scoring for heart failure patients

Participant: J-M. Monnez

External collaborator: E. Albuisson (CHU Nancy)

The purpose of this study was to define an event - death or hospitalization - score for heart failure patients based on the observation of biological, clinical and medical historical variables. Some of them were transformed or winsorized. Two methods of statistical learning were performed, logistic regression and linear discriminant analysis, with a stepwise selection of variables. Aggregation of classifiers by bagging was used. Finally a score taking values between 0 and 100 was established.

A simultaneous stepwise covariate selection and clustering algorithm to discriminate a response variable with numerous values

Participant: J-M. Monnez

External collaborator: O. Collignon (LIH, Luxembourg)

In supervised learning the number of values of a response variable to predict can be high. Also clustering them in a few clusters can be useful to perform relevant supervised classification analyses. On the other hand selecting relevant covariates is a crucial step to build robust and efficient prediction models, especially when too many covariates are available in regard to the overall sample size. As a first attempt to solve these problems, we had already devised in a previous study an algorithm that simultaneously clusters the levels of a categorical response variable in a limited number of clusters and selects forward the best covariates by alternate minimization of Wilks' Lambda. In this paper we first extend the former version of the algorithm to a more general framework where Wilks's Lambda can be replaced by any model selection criterion. We also turned forward selection into stepwise selection in order to remove covariates while the procedure processes if necessary. Finally an application of our algorithm to real datasets from peanut allergy studies allowed confirming previously published results and suggesting new discoveries.

Statistical Analyses of Cell Impedance Signals in High-Throughput Cell Analysis

Participant: T. Bastogne, L. Batista

External Collaborator: El-Hadi Djermoune (Université de Lorraine, CRAN)

With the advent of high-throughput technologies, life scientists are starting to grapple with massive data sets, encountering challenges with handling, processing and moving information that were once the domain of astronomers and high-energy physicists [91] . We particularly focus the statistical analysis of large batch of time series with applications in the preclinical research in Cancerology. Our original contribution consists in developing new dynamical system identification methods suited to the processing of those type of data. System identification is a data-driven modeling approach more and more used in biology and biomedicine. In this application context, each assay is always repeated to estimate the response variability. The inference of the modeling conclusions to the whole population requires to account for the inter-individual variability within the modeling procedure. One solution consists in using mixed effects models but up to now no similar approach exists in the field of dynamical system identification. Therefore, our objective is to develop a new identification method integrating mixed effects within an ARX (Auto Regressive model with eXternal inputs) model structure. The parameter estimation step relies on the EM (Expectation-Maximisation) algorithm. First simulation results show the relevance of this solution compared with a classical procedure of system identification repeated for each subject. This work and derived was accepted in conference papers [34] [24] [18] .