## Section: New Results

### Regression and machine learning

#### Aggregated methods for covariates selection in high-dimensional data under dependence

Participants: A. Gégout-Petit, A. Muller-Gueudin, Y. Shi

External collaborators: B. Bastien (Transgene, Strasbourg)

To select factors linked to the efficacy of a treatment in a high-dimensional setting (about 100,000 covariates), we have developed a new methodology to select and rank covariates associated with a variable of interest when the data are high-dimensional and dependent but the observations are few. The methodology chains, in order, a rough selection, a clustering of the variables, a decorrelation of the variables using latent factor analysis, a selection by aggregation of adapted methods, and finally a ranking through bootstrap replications. A simulation study shows the benefit of decorrelating within the different clusters of covariates. The methodology was applied to select, among genomic and proteomic covariates, those linked to the success of an immunotherapy treatment for lung cancer. A paper on the subject is in preparation.
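The final ranking step can be illustrated with a minimal sketch. It is not the methodology itself: the selector below (`select_top_k`, based on absolute Pearson correlation) is a hypothetical stand-in for the aggregated selection methods, and the clustering and decorrelation steps are omitted. It only shows how bootstrap replications turn a selection rule into a ranking by selection frequency.

```python
import random
from collections import Counter


def abs_correlation(x, y):
    """Absolute Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no linear information
    return abs(sxy) / (sxx * syy) ** 0.5


def select_top_k(columns, y, k):
    """Hypothetical stand-in selector: keep the k covariates most correlated with y."""
    scores = {name: abs_correlation(col, y) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


def bootstrap_ranking(columns, y, k, n_boot=200, seed=0):
    """Rank covariates by how often they are selected across bootstrap samples."""
    rng = random.Random(seed)
    n = len(y)
    counts = Counter()
    for _ in range(n_boot):
        # Resample the observations with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        boot_cols = {name: [col[i] for i in idx] for name, col in columns.items()}
        boot_y = [y[i] for i in idx]
        counts.update(select_top_k(boot_cols, boot_y, k))
    frequencies = [(name, counts[name] / n_boot) for name in columns]
    return sorted(frequencies, key=lambda item: item[1], reverse=True)
```

Here `columns` is assumed to be a dictionary mapping a covariate name to its list of observed values. A covariate whose selection frequency stays close to 1 across replications is selected stably; one selected only occasionally is more likely to be an artifact of a particular sample.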

#### Clustering of the values of a response variable and simultaneous covariate selection using a stepwise algorithm

Participant: J.-M. Monnez

External collaborator: O. Collignon (LIH Luxembourg)

In supervised learning, the number of values of a response variable to predict can be very high. Grouping these values into a few clusters can be useful for performing accurate supervised classification analyses. In addition, selecting relevant covariates is a crucial step in building robust and efficient prediction models. In this paper, we propose an algorithm that simultaneously groups the values of a response variable into a limited number of clusters and selects, stepwise, the covariates that best discriminate this clustering. These objectives are achieved by alternating optimization of a user-defined selection criterion. This process extends a former version of the algorithm to a more general framework. Possible further developments are also discussed in detail [3].

#### Death or hospitalization scoring for heart failure patients

Participants: J.-M. Monnez, K. Duarte

External collaborator: E. Albuisson (CHU, Nancy)

The purpose of this study was to define a short-term event (death or hospitalization) score for heart failure patients based on the observation of biological, clinical and medical history variables. Some of these variables were transformed or winsorized. Two statistical learning methods, logistic regression and linear discriminant analysis, were applied on bootstrap samples together with different variable selection methods. Aggregation of classifiers and out-of-bag validation were used. Finally, a score taking values between 0 and 100 was established and an odds ratio was defined in order to support medical decisions (writing in progress).
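The combination of classifier aggregation and out-of-bag validation can be sketched as follows. This is an illustration under strong simplifying assumptions, not the study's procedure: the base learner is a one-dimensional nearest-centroid rule (`fit_centroid_rule`, a hypothetical stand-in for logistic regression or linear discriminant analysis), there is no variable selection, and no 0 to 100 score is built.

```python
import random


def fit_centroid_rule(xs, ys):
    """Stand-in base learner: 1-D nearest-centroid rule on a binary outcome."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    if not pos or not neg:
        return None  # degenerate bootstrap sample containing a single class
    return sum(neg) / len(neg), sum(pos) / len(pos)


def predict_centroid_rule(rule, x):
    """Predict the class whose centroid is closest to x."""
    c0, c1 = rule
    return 1 if abs(x - c1) < abs(x - c0) else 0


def bagging_oob_error(xs, ys, n_boot=100, seed=0):
    """Out-of-bag misclassification rate of a bagged ensemble."""
    rng = random.Random(seed)
    n = len(xs)
    votes = [[0, 0] for _ in range(n)]  # per observation: votes for class 0 and 1
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        rule = fit_centroid_rule([xs[i] for i in idx], [ys[i] for i in idx])
        if rule is None:
            continue
        in_bag = set(idx)
        # Each classifier only votes on observations it was not trained on.
        for i in range(n):
            if i not in in_bag:
                votes[i][predict_centroid_rule(rule, xs[i])] += 1
    errors = counted = 0
    for (v0, v1), y in zip(votes, ys):
        if v0 + v1 == 0:
            continue  # observation was never out of bag
        counted += 1
        errors += int((1 if v1 > v0 else 0) != y)
    return errors / counted if counted else float("nan")
```

Because every observation is predicted only by classifiers built without it, the out-of-bag error gives a validation estimate without setting aside a separate test sample.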

#### Sequential linear regression with online standardized data

Participants: J.-M. Monnez, K. Duarte

External collaborator: E. Albuisson

We consider the problem of sequential least squares multidimensional linear regression using a stochastic approximation process. The choice of the stepsize can be crucial in this type of process. To avoid the risk of numerical explosion that can be encountered, we define three processes with a variable or a constant stepsize and establish their convergence. Finally, these processes are compared to classical processes on 11 datasets, 6 with a continuous output and 5 with a binary output, first for a fixed total number of observations used and then for a fixed processing time. It appears that the third process, with a very simple choice of stepsize, usually gives the best results (paper to be submitted).
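The general idea of a stochastic approximation process fed with data standardized online can be sketched as below. This is not one of the three processes defined in the paper: the class name, the stepsize schedule `c / (b + n) ** alpha` and its default constants are assumptions chosen for illustration. Running means and variances are updated with each observation, the observation is standardized with the current estimates, and a stochastic gradient step is taken on the standardized coefficients.

```python
import random


class OnlineStandardizedRegression:
    """Sequential least squares on online-standardized data (illustrative sketch)."""

    def __init__(self, n_features, c=1.0, b=10.0, alpha=2 / 3):
        self.p = n_features
        self.n = 0
        self.mean = [0.0] * (n_features + 1)  # last slot tracks the response
        self.m2 = [0.0] * (n_features + 1)    # running sums of squared deviations
        self.theta = [0.0] * n_features       # coefficients on the standardized scale
        self.c, self.b, self.alpha = c, b, alpha

    def _std(self, j):
        return (self.m2[j] / self.n) ** 0.5 if self.n > 0 else 0.0

    def update(self, x, y):
        """Process one observation (x, y)."""
        self.n += 1
        values = list(x) + [y]
        for j, v in enumerate(values):  # Welford update of means and variances
            delta = v - self.mean[j]
            self.mean[j] += delta / self.n
            self.m2[j] += delta * (v - self.mean[j])
        stds = [self._std(j) for j in range(self.p + 1)]
        if min(stds) == 0.0:
            return  # not enough spread yet to standardize
        z = [(values[j] - self.mean[j]) / stds[j] for j in range(self.p)]
        r = (y - self.mean[-1]) / stds[-1]
        step = self.c / (self.b + self.n) ** self.alpha  # assumed decreasing stepsize
        residual = r - sum(t * zj for t, zj in zip(self.theta, z))
        self.theta = [t + step * residual * zj for t, zj in zip(self.theta, z)]

    def coefficients(self):
        """Intercept and slopes on the original scale (call after several updates)."""
        stds = [self._std(j) for j in range(self.p + 1)]
        slopes = [self.theta[j] * stds[-1] / stds[j] for j in range(self.p)]
        intercept = self.mean[-1] - sum(s * m for s, m in zip(slopes, self.mean[:-1]))
        return intercept, slopes
```

Standardizing keeps the regressors on a comparable scale, so a single stepsize schedule is less likely to make the iterates diverge when the raw covariates have very different magnitudes.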

#### Mixed-effects ARX Model Identification of Dynamical Biological Systems

Participants: T. Bastogne, L. Batista

System identification is a data-driven modeling approach increasingly used in biology and biomedicine [26]. In this application context, each assay is always repeated to estimate the response variability. Extending the modeling conclusions to the whole population requires accounting for the inter-individual variability within the modeling procedure. One solution consists in using mixed-effects models, but up to now no similar approach has existed in the field of dynamical system identification. In [23], we propose a new solution based on an ARX (AutoRegressive model with eXternal inputs) structure, using the EM (Expectation-Maximisation) algorithm to estimate the model parameters. Simulations show the relevance of this solution compared with a classical system identification procedure repeated for each subject.
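For context, the classical baseline mentioned above (identification repeated for each subject) can be sketched for a first-order ARX structure, y[t] = a·y[t-1] + b·u[t-1] + e[t]. The model order, the function names and the simple averaging of per-subject estimates are assumptions made for illustration; the mixed-effects EM estimation of [23] is not reproduced here.

```python
import random
import statistics


def simulate_arx(a, b, u, noise_sd, rng):
    """Simulate y[t] = a*y[t-1] + b*u[t-1] + e[t], starting from y[0] = 0."""
    y = [0.0]
    for t in range(1, len(u)):
        y.append(a * y[t - 1] + b * u[t - 1] + rng.gauss(0.0, noise_sd))
    return y


def fit_arx(y, u):
    """Least squares estimate of (a, b) for one subject, via the normal equations."""
    s_yy = s_yu = s_uu = s_y1 = s_u1 = 0.0
    for t in range(1, len(y)):
        yp, up = y[t - 1], u[t - 1]
        s_yy += yp * yp
        s_yu += yp * up
        s_uu += up * up
        s_y1 += yp * y[t]
        s_u1 += up * y[t]
    det = s_yy * s_uu - s_yu ** 2
    if det == 0.0:
        raise ValueError("regressors are collinear; the input is not exciting enough")
    a_hat = (s_y1 * s_uu - s_u1 * s_yu) / det
    b_hat = (s_u1 * s_yy - s_y1 * s_yu) / det
    return a_hat, b_hat


def per_subject_identification(subjects):
    """Fit each subject separately, then average the estimates across subjects."""
    estimates = [fit_arx(y, u) for y, u in subjects]
    a_mean = statistics.mean([a for a, _ in estimates])
    b_mean = statistics.mean([b for _, b in estimates])
    return estimates, (a_mean, b_mean)
```

In this baseline each subject's parameters are estimated in isolation, so short or noisy individual records yield unstable estimates; a mixed-effects formulation instead estimates population parameters and inter-individual variability jointly.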

In [24], we propose a solution that first estimates the Fisher information matrix using Louis' method and then determines the confidence intervals of the parameters of an ARX model structure. We show the relevance of the proposed solution in simulation and on real in vitro data coming from real-time cell impedance measurements.

In parallel, we applied the mixed-effects modeling approach to the analysis of in vivo responses in order to identify prognostic biomarkers of tumor regrowth after photodynamic therapy [11]. This application corroborated the practical relevance of our model-based approach.

#### Uniform asymptotic certainty bands for the conditional cumulative distribution function

Participants: S. Ferrigno, A. Muller-Gueudin, M. Maumy-Bertrand (IRMA, Strasbourg)

In this work, we study the conditional cumulative distribution function and a nonparametric estimator associated with this function. The conditional cumulative distribution function has the advantage of completely characterizing the law of the random variable under consideration, which makes it possible to obtain the regression function, the density function, the moments and the conditional quantile function. As a nonparametric estimator of this function, we focus on the local polynomial techniques described in Fan and Gijbels [ref]. In particular, we use the local linear estimator of the conditional cumulative distribution function.
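A local linear estimator of the conditional cumulative distribution function F(y | x) can be sketched as the intercept of a kernel-weighted least squares fit of the indicators 1{Y_i ≤ y} on (X_i − x). The Epanechnikov kernel, the fixed bandwidth `h` and the function names are assumptions made for this illustration; the kernel and bandwidth conditions studied in this work are not reproduced.

```python
def epanechnikov(u):
    """Epanechnikov kernel, supported on [-1, 1]."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def local_linear_cdf(xs, ys, x, y, h):
    """Local linear estimate of F(y | x) = P(Y <= y | X = x) with bandwidth h."""
    s0 = s1 = s2 = 0.0  # kernel-weighted moments of (X_i - x)
    t0 = t1 = 0.0       # kernel-weighted moments of the indicators
    for xi, yi in zip(xs, ys):
        w = epanechnikov((xi - x) / h)
        if w == 0.0:
            continue
        d = xi - x
        ind = 1.0 if yi <= y else 0.0
        s0 += w
        s1 += w * d
        s2 += w * d * d
        t0 += w * ind
        t1 += w * d * ind
    det = s0 * s2 - s1 * s1
    if det <= 1e-12 * s0 * s2:
        raise ValueError("too few distinct design points within the bandwidth")
    # Intercept of the weighted least squares fit of the indicators on (X_i - x).
    return (s2 * t0 - s1 * t1) / det
```

Unlike a local constant (Nadaraya-Watson) estimate, the local linear fit uses effective weights that can be negative, for example near the boundary of the design, so the returned value is not guaranteed to lie in [0, 1] or to be monotone in `y`.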

The objective of this work is to establish uniform asymptotic certainty bands for the conditional cumulative distribution function. To this end, we give the exact rate of strong uniform consistency for the local linear estimator of this function (writing in progress).

#### Omnibus tests for regression models

Participants: R. Azaïs, S. Ferrigno, M.-J. Martinez Marcoux (LJK, Grenoble)

The aim of this collaboration, which is just beginning, is to compare, through simulations, several methods for testing the validity of a regression model. These tests can be "directional", in that they are designed to detect departures from mainly one given assumption of the model (for example the regression function, the variance or the error), or global (for example the conditional distribution function). Establishing such statistical tests requires nonparametric estimators of various functions (regression, variance, cumulative distribution function). The idea would then be to build a tool (an R package) that allows users to test the validity of the model they use through different methods and by varying the parameters associated with the modeling. This work is currently in progress.