Section: New Results

Regression and machine learning

Participants: E. Albuisson, R. Azaïs (Inria, Lyon), T. Bastogne, L. Batista, K. Duarte, S. Ferrigno, A. Gégout-Petit, P. Guyot, J.-M. Monnez, N. Sahki, S. Mézières


With the aim of detecting changes in the health state of lung-transplanted patients, we have begun to work on the detection of breaks in multivariate physiological signals. Based on CUSUM statistics, we have used dynamic detection thresholds [27]. A more general talk about statistical learning and connected patients was given at the workshop "Evaluation des objets en santé connectée" [35].
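To illustrate the principle of CUSUM-based detection, here is a minimal Python sketch. It is a univariate, one-sided simplification and not the procedure of [27]: the function name `cusum_alarm`, the parameters `target_mean` and `slack`, and the way the threshold is passed as a function of time are assumptions made for this example only.

```python
def cusum_alarm(signal, target_mean, slack, threshold):
    """One-sided (upward) CUSUM detector.

    `threshold` is a callable t -> h_t, so the detection threshold may
    change over time (hypothetical interface, chosen for illustration).
    Returns the first index at which the statistic exceeds the
    threshold, or None if no alarm is raised.
    """
    s = 0.0
    for t, x in enumerate(signal):
        # Accumulate deviations above target_mean + slack, reset at 0.
        s = max(0.0, s + (x - target_mean - slack))
        if s > threshold(t):
            return t
    return None


# Toy signal: the mean shifts from 0 to 2 at index 20.
signal = [0.0] * 20 + [2.0] * 10
alarm = cusum_alarm(signal, target_mean=0.0, slack=0.5,
                    threshold=lambda t: 4.0)
no_alarm = cusum_alarm([0.0] * 30, target_mean=0.0, slack=0.5,
                       threshold=lambda t: 4.0)
```

In this toy example the threshold is constant; a time-varying rule can be supplied through the same `threshold` argument.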

We consider the analysis of cardiomyocyte (cardiac cell) signals for the cardiotoxicity assessment of new pharmaceutical compounds in preclinical assays. The experimental data are either impedance signals measuring the contractility of cardiomyocytes [39], [4], field potential signals measuring their functionality, or fluorescence signals measuring the activity of some ion channels such as calcium pumps (Ca²⁺). At this preclinical level, our main contribution is the estimation of important characteristics such as the field potential duration [17], or the identification of cardiotoxic events such as early afterdepolarizations.

We have also developed new methods for the analysis of electrocardiograms at the patient level, and more precisely for the estimation of parameters such as the RR and QT intervals in long and noisy signals provided by wearable sensors [24], [30], [23], [25], [29].

We also study the efficacy of a new biomarker in radiotherapy. The objective is to compute a score able to predict the risk of radiosensitivity for patients undergoing radiotherapy [19], [20].

We are also developing a new method to characterize the potential interactions between nanoparticles and the biological compounds of complex media such as blood. This new method aims at predicting risks related to the biodistribution and toxicity of the nanoparticles [16], [36].

In [7], we present a methodology for constructing a short-term event risk score from an ensemble predictor. The ensemble is built using bootstrap samples, two different classification rules (logistic regression and linear discriminant analysis for mixed data, continuous or categorical) and random selections of variables in the construction of the predictors. We establish a property of linear discriminant analysis for mixed data and define an event risk measure by an odds ratio. This methodology is applied to heart failure patients on whom biological, clinical and medical history variables were measured, and the results obtained from our data are detailed.

The study [8] addresses the problem of sequential least squares multidimensional linear regression, particularly in the case of a data stream, using a stochastic approximation process. To avoid the numerical explosion that can occur, and to reduce the computing time so that as many incoming data as possible are taken into account, we propose to use a process with online standardized data instead of raw data, and to use several observations per step or all observations up to the current step. We define three processes with online standardized data and study their almost sure convergence: a classical process with a variable step-size and a varying number of observations per step, an averaged process with a constant step-size and a varying number of observations per step, and a process with a variable or constant step-size using all observations up to the current step. Their convergence is obtained under more general assumptions than the classical ones. These processes are compared to classical processes on 11 datasets, first for a fixed total number of observations used and then for a fixed processing time. The analyses indicate that the third process typically yields the best results.
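The idea of running a stochastic approximation process on online standardized data can be sketched as follows. This is a simplified illustration and not one of the three processes studied in [8]: it uses a single observation per step, a decreasing step-size chosen arbitrarily, and a short burn-in period. The class name, the step-size rule and the burn-in length are all assumptions of this example.

```python
import math
import random


class OnlineStandardizedRegression:
    """Stochastic-gradient least squares on online standardized data."""

    def __init__(self, n_features, burn_in=10):
        self.p = n_features
        self.n = 0
        self.mean = [0.0] * (n_features + 1)  # running means of x_1..x_p, y
        self.m2 = [0.0] * (n_features + 1)    # running sums of squared deviations
        self.theta = [0.0] * n_features       # coefficients, standardized scale
        self.burn_in = burn_in                # assumed warm-up length

    def _sd(self):
        return [math.sqrt(m / (self.n - 1)) for m in self.m2]

    def update(self, x, y):
        row = list(x) + [y]
        self.n += 1
        # Welford update of the running means and variances.
        for j, v in enumerate(row):
            delta = v - self.mean[j]
            self.mean[j] += delta / self.n
            self.m2[j] += delta * (v - self.mean[j])
        if self.n <= self.burn_in:
            return  # only estimate means and variances at first
        sd = self._sd()
        z = [(row[j] - self.mean[j]) / sd[j] for j in range(self.p + 1)]
        zx, zy = z[:-1], z[-1]
        step = 1.0 / (self.n + 20) ** 0.75    # assumed decreasing step-size
        residual = zy - sum(t * v for t, v in zip(self.theta, zx))
        self.theta = [t + step * residual * v for t, v in zip(self.theta, zx)]

    def coefficients(self):
        """Map the standardized coefficients back to the raw scale."""
        sd = self._sd()
        beta = [self.theta[j] * sd[-1] / sd[j] for j in range(self.p)]
        intercept = self.mean[-1] - sum(
            b * m for b, m in zip(beta, self.mean[:-1]))
        return intercept, beta


# Simulated stream: y = 3 + 2*x1 - x2 + noise, with badly scaled inputs.
rng = random.Random(0)
model = OnlineStandardizedRegression(n_features=2)
for _ in range(20000):
    x1, x2 = rng.gauss(10, 5), rng.gauss(-3, 2)
    model.update([x1, x2], 3 + 2 * x1 - x2 + rng.gauss(0, 1))
intercept, beta = model.coefficients()
```

Because each observation is centered and scaled with the current estimates of the means and standard deviations, the update operates on variables of comparable magnitude, which is what keeps the step-size choice from depending on the raw units of the data.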

Many articles have been devoted to the problem of recursively estimating the eigenvectors and eigenvalues, in decreasing order, of the expectation of a random matrix using an i.i.d. sample of it. In [43], we make the following contributions. The convergence of a normed process is proved under more general assumptions: the random matrices are not assumed to be i.i.d., and a new mini-batch of data or all data up to the current step are taken into account at each step without being stored. Three types of processes are studied. These results are applied to online principal component analysis of a data stream, assuming that the data are realizations of a random vector Z whose expectation is unknown and must be estimated online, as must the metric used when it depends on unknown characteristics of Z.
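As an illustration of online principal component analysis with an unknown expectation, the following Python sketch estimates the mean, the first principal axis and its eigenvalue in a single pass over a stream. It uses a normalized Oja-type recursion with one observation per step and the identity metric; it is not one of the processes of [43], and the function name and step-size rule are assumptions of this example.

```python
import math
import random


def online_first_component(stream, dim):
    """Single-pass estimate of the mean, first principal axis and eigenvalue."""
    mean = [0.0] * dim
    w = [1.0] + [0.0] * (dim - 1)   # arbitrary unit starting vector
    eigenvalue = 0.0
    for n, z in enumerate(stream, start=1):
        # Online estimate of the unknown expectation of Z.
        mean = [m + (v - m) / n for m, v in zip(mean, z)]
        zc = [v - m for v, m in zip(z, mean)]
        step = 1.0 / (n + 10)       # assumed step-size rule
        proj = sum(a * b for a, b in zip(zc, w))
        # Stochastic approximation of the largest eigenvalue.
        eigenvalue += step * (proj ** 2 - eigenvalue)
        # Oja-type update followed by normalization (normed process).
        w = [a + step * proj * b for a, b in zip(w, zc)]
        norm = math.sqrt(sum(a * a for a in w))
        w = [a / norm for a in w]
    return mean, w, eigenvalue


# Simulated stream in dimension 2: variance 9 along (1, 1)/sqrt(2)
# and variance 1 along (1, -1)/sqrt(2), around the mean (5, -2).
rng = random.Random(1)
r = math.sqrt(0.5)


def sample():
    g1, g2 = rng.gauss(0, 3), rng.gauss(0, 1)
    return [5 + r * (g1 + g2), -2 + r * (g1 - g2)]


mean, w, eigenvalue = online_first_component(
    (sample() for _ in range(5000)), dim=2)
# Absolute cosine between the estimate and the true first axis.
alignment = abs(w[0] + w[1]) * r
```

No observation is stored: each data point is used once to update the mean, the axis and the eigenvalue, and is then discarded.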

Let Y=m(X)+σ(X)ε be a regression model, where m(·) is the regression function, σ²(·) the variance function and ε the random error term. Methods to assess how well a model fits a set of observations fall under the banner of goodness-of-fit tests. Many tests have been developed to assess the different assumptions of this kind of model. Most of them are “directional” in that they mainly detect departures from one given assumption of the model. Other tests are “global” in that they assess whether a model fits a data set with respect to all its assumptions. We focus on the task of choosing the structural part m(·). It gets the most attention because it contains easily interpretable information about the relationship between X and Y. To validate the form of the regression function, we consider three nonparametric tests based on a generalization of the Cramér-von Mises statistic. The first two are directional tests, while the third is a global test. To perform these goodness-of-fit tests, we have used wild bootstrap methods, and we have also proposed a method to choose the bandwidth parameter used in the nonparametric estimations. We have then developed the cvmgof R package (being submitted), intended as an easy-to-use tool for a wide range of users. The use of the package is illustrated using simulations that compare the three implemented tests [37].
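The combination of a Cramér-von Mises type statistic with a wild bootstrap can be sketched on a simple case. The Python example below tests whether m(·) is linear, using a classical statistic built on cumulated residuals rather than the generalized statistics mentioned above; it involves no bandwidth and does not reproduce the cvmgof package. All function names, the Rademacher multipliers and the number of bootstrap replicates are assumptions of this example.

```python
import random


def fit_line(x, y):
    """Ordinary least squares fit of y = a + b * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((u - mx) ** 2 for u in x)
    slope = sum((u - mx) * (v - my) for u, v in zip(x, y)) / sxx
    return my - slope * mx, slope


def cvm_statistic(x, residuals):
    """Mean over j of R_n(x_j)^2, where R_n(t) is n^(-1/2) times the
    sum of the residuals e_i with x_i <= t."""
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])
    partial, total = 0.0, 0.0
    for i in order:
        partial += residuals[i]
        total += partial ** 2
    return total / n ** 2


def wild_bootstrap_pvalue(x, y, n_boot=200, seed=0):
    """Wild bootstrap p-value for the null hypothesis 'm is linear'."""
    rng = random.Random(seed)
    a, b = fit_line(x, y)
    fitted = [a + b * u for u in x]
    resid = [v - f for v, f in zip(y, fitted)]
    t_obs = cvm_statistic(x, resid)
    exceed = 0
    for _ in range(n_boot):
        # Rademacher multipliers keep the size of each residual but
        # randomize its sign, which mimics the null hypothesis.
        y_star = [f + e * rng.choice((-1.0, 1.0))
                  for f, e in zip(fitted, resid)]
        a_s, b_s = fit_line(x, y_star)
        resid_star = [v - (a_s + b_s * u) for v, u in zip(y_star, x)]
        if cvm_statistic(x, resid_star) >= t_obs:
            exceed += 1
    return t_obs, (1 + exceed) / (n_boot + 1)


# Simulated data whose true regression function is quadratic.
rng = random.Random(42)
x = [rng.uniform(-1, 1) for _ in range(100)]
y_quad = [1 + 2 * u + 3 * u * u + rng.gauss(0, 0.2) for u in x]
t_quad, p_quad = wild_bootstrap_pvalue(x, y_quad)
```

A small p-value indicates that the cumulated residuals of the linear fit are larger than what the bootstrap replicates produce under the null hypothesis, which is the expected behaviour on these quadratic data.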

In epidemiology, we are working with clinicians to study fetal development in the last two trimesters of pregnancy. We have data from the "Service de foetopathologie et de placentologie" of the "Maternité Régionale Universitaire" (CHU Nancy) and from the EDEN cohort (INSERM). We propose to use nonparametric estimation methods to obtain reference curves of fetal and child growth. In addition, we want to develop a test, based on Z-scores, to detect any slope breaks in the fetal development curves (work in progress).