Section: New Results

Regression and machine learning

Participants: A. Gégout-Petit, A. Muller-Gueudin, T. Bastogne, L. Batista, R. Azais, S. Ferrigno, K. Duarte, J.-M. Monnez

We consider the problem of sequential least squares multidimensional linear regression using a stochastic approximation process. The choice of the stepsize can be crucial in this type of process. To avoid the risk of numerical explosion that can arise, we define three processes with a variable or a constant stepsize and establish their convergence. These processes are then compared with classical processes on 11 datasets, 6 with a continuous output and 5 with a binary output, first for a fixed total number of observations used and then for a fixed processing time. The third process, with a very simple choice of stepsize, usually gives the best results [32].
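To make the setting concrete, here is a minimal sketch of a generic stochastic approximation recursion for least squares linear regression, processing one observation at a time. It is not one of the three processes studied in [32]: the function name `sequential_least_squares`, the stepsize form `a / n**alpha` and the simulated data are all assumptions chosen for illustration.

```python
import numpy as np

def sequential_least_squares(X, y, a=0.2, alpha=0.6):
    """Generic stochastic approximation recursion for linear least squares.

    Update: theta <- theta - a_n * x_n * (x_n' theta - y_n),
    with a hypothetical decreasing stepsize a_n = a / n**alpha.
    """
    n_obs, p = X.shape
    theta = np.zeros(p)
    for n in range(n_obs):
        x_n = X[n]
        step = a / (n + 1) ** alpha          # stepsize shrinks with n
        residual = x_n @ theta - y[n]        # prediction error on observation n
        theta = theta - step * residual * x_n
    return theta

# Simulated data (assumption): 3 standardized regressors, small noise.
rng = np.random.default_rng(0)
X = rng.standard_normal((20000, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.standard_normal(20000)

theta_hat = sequential_least_squares(X, y)
print(theta_hat)
```

With a stepsize that is too large relative to the scale of the regressors, the factor multiplying the current error can exceed 1 in absolute value and the iterates may blow up, which is the kind of numerical explosion the paragraph above refers to.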

We also study many other regression models, such as survival analysis and spatio-temporal models with covariates. For multiple regression models, we want to test the validity of their assumptions by means of simulation methods [25]. Tests of this kind are called omnibus tests. An omnibus test is an overall test that examines several assumptions together; the best-known omnibus test is the one for Gaussianity, which examines both skewness and kurtosis.
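As an illustration of an omnibus test for Gaussianity, the sketch below computes the Jarque-Bera statistic, which combines sample skewness and kurtosis into a single quantity. The text does not say which Gaussianity test is meant, so the choice of Jarque-Bera is an assumption, as are the function name and the simulated samples. The p-value uses the asymptotic chi-square distribution with 2 degrees of freedom, whose survival function is exp(-x/2).

```python
import numpy as np

def jarque_bera(x):
    """Jarque-Bera omnibus statistic and its asymptotic p-value."""
    x = np.asarray(x, dtype=float)
    n = x.size
    centered = x - x.mean()
    m2 = np.mean(centered ** 2)
    skew = np.mean(centered ** 3) / m2 ** 1.5   # sample skewness
    kurt = np.mean(centered ** 4) / m2 ** 2     # sample kurtosis (3 for a Gaussian)
    stat = n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)
    p_value = np.exp(-stat / 2.0)               # chi-square(2) survival function
    return stat, p_value

rng = np.random.default_rng(0)
stat_norm, p_norm = jarque_bera(rng.standard_normal(2000))
stat_exp, p_exp = jarque_bera(rng.exponential(size=2000))
print(stat_norm, p_norm)
print(stat_exp, p_exp)
```

For small samples the asymptotic approximation can be poor; a simulation-based alternative would recompute the statistic on many Gaussian samples of the same size and compare the observed value with that empirical distribution.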

To select factors linked to the efficiency of a treatment in a high-dimensional setting (about 100,000 covariates), we have developed a new methodology to select and rank covariates associated with a variable of interest when the data are high-dimensional and dependent but the observations are few. The methodology chains together, in order, a rough selection, a clustering of variables, a decorrelation of variables using latent factor analysis, a selection by aggregation of adapted methods and, finally, a ranking through bootstrap replications. A simulation study shows the benefit of the decorrelation within the different clusters of covariates. The methodology was applied to select, among genomic and proteomic covariates, those linked to the success of an immunotherapy treatment for lung cancer [21], [19], [20].
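The last step of this pipeline, ranking through bootstrap replications, can be sketched as follows. This is only a simplified stand-in: it uses absolute marginal correlation as the selector instead of the aggregated methods of the actual methodology, it omits the clustering and decorrelation steps, and the function name `bootstrap_rank`, its parameters and the simulated data are hypothetical.

```python
import numpy as np

def bootstrap_rank(X, y, top_k=5, n_boot=200, seed=0):
    """Rank covariates by how often they are selected across bootstrap samples.

    Selector (assumption): keep the top_k covariates with the largest
    absolute correlation with y in each bootstrap replication.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)          # resample observations with replacement
        Xc = X[idx] - X[idx].mean(axis=0)
        yc = y[idx] - y[idx].mean()
        denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
        corr = np.abs(Xc.T @ yc) / denom
        counts[np.argsort(corr)[-top_k:]] += 1    # record the selected covariates
    freq = counts / n_boot
    ranking = np.argsort(-freq, kind="stable")    # most frequently selected first
    return ranking, freq

# Simulated data (assumption): 80 observations, 200 covariates, 2 true signals.
rng = np.random.default_rng(1)
X = rng.standard_normal((80, 200))
y = 3.0 * X[:, 0] - 3.0 * X[:, 1] + rng.standard_normal(80)

ranking, freq = bootstrap_rank(X, y)
print(ranking[:5], freq[ranking[:5]])
```

The selection frequency gives a stability score for each covariate, which is what makes a ranking possible when there are far more covariates than observations.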

We also focus on the biological context of high-throughput and high-content bioassays, in which several hundred or several thousand biological signals are measured for a posteriori analysis. In this experimental context, each culture well is a biological system in which the output variable is the cell proliferation, the input variable can be an electrical or a light stimulus signal, and the covariates may be the type of cells, the type of medium or the tested compounds. The aim is to identify a batch of several thousand wells in a single step with the same model structure. Mixed-effects models are widely used in regression, but so far they have rarely been used in the field of dynamical system identification. Our approach aims to develop a new solution based on an ARX (AutoRegressive model with eXternal inputs) model structure, using the EM (Expectation-Maximisation) algorithm to estimate the model parameters [13], [10].
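For readers unfamiliar with the ARX structure, the sketch below simulates a first-order ARX model for a single well and recovers its two coefficients by ordinary least squares. This is not the mixed-effects EM estimation described above: it handles one well only, with no random effects, and the model order, parameter values, square-wave stimulus and function names are all assumptions for illustration.

```python
import numpy as np

def simulate_arx(a, b, u, noise_std, rng):
    """Simulate y[t] = a*y[t-1] + b*u[t-1] + e[t] for one well."""
    y = np.zeros(len(u))
    for t in range(1, len(u)):
        y[t] = a * y[t - 1] + b * u[t - 1] + noise_std * rng.standard_normal()
    return y

def fit_arx(y, u):
    """Least squares estimate of (a, b) from one input/output record."""
    Phi = np.column_stack([y[:-1], u[:-1]])       # regressors: past output, past input
    coef, *_ = np.linalg.lstsq(Phi, y[1:], rcond=None)
    return coef

rng = np.random.default_rng(1)
u = (np.arange(500) % 50 < 25).astype(float)      # square-wave stimulus (assumption)
y = simulate_arx(0.8, 0.5, u, 0.05, rng)

a_hat, b_hat = fit_arx(y, u)
print(a_hat, b_hat)
```

In a mixed-effects formulation, the coefficients of each well would be modelled as a population value plus a well-specific random deviation, so that all wells are identified jointly rather than one regression at a time.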