Section:
New Results
BIG-SIR: a Sliced Inverse Regression approach for massive data
The following result has been obtained by
J. Saracco (Inria CQFD) in collaboration with B. Liquet.
In a massive data setting, we focus on a semiparametric regression model involving a real dependent variable Y and a p-dimensional covariate X. This model includes a dimension reduction of X via an index X'β. The Effective Dimension Reduction (EDR) direction β cannot
be directly estimated by the Sliced Inverse Regression (SIR) method due to the large volume
of the data. To address the two main challenges of analysing massive datasets, namely
storage and computational efficiency, we propose a new SIR estimator of the EDR direction that
follows the “divide and conquer” strategy. The data are divided into subsets, each small
enough to be handled as an ordinary dataset, and an EDR direction is estimated in each subset.
The recombination step is based on the optimisation of a criterion which assesses the proximity
between the EDR directions of the subsets. The per-subset computations are run in parallel
with no communication among them.
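The divide-and-conquer scheme described above can be sketched as follows. This is only an illustrative sketch in Python with NumPy, not the edrGraphicalTools implementation: the function names `sir_direction` and `big_sir` are hypothetical, the blocks are processed in a plain loop rather than in parallel, and the recombination criterion is assumed to be a size-weighted sum of squared cosines between the candidate direction and the per-subset directions, which is maximised by the leading eigenvector of the weighted sum of outer products.

```python
import numpy as np

def sir_direction(X, y, n_slices=10):
    """Basic SIR on one block: returns a unit-norm EDR direction estimate."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    sigma = Xc.T @ Xc / n                      # covariance of X
    order = np.argsort(y)                      # slice on the ordered response
    gamma = np.zeros((p, p))                   # covariance of the slice means
    for idx in np.array_split(order, n_slices):
        m = Xc[idx].mean(axis=0)
        gamma += (len(idx) / n) * np.outer(m, m)
    # Solve gamma b = lambda sigma b through a symmetric reformulation:
    # with sigma = L L^T, take the top eigenvector v of L^{-1} gamma L^{-T}.
    L = np.linalg.cholesky(sigma)
    A = np.linalg.solve(L, gamma)              # L^{-1} gamma
    S = np.linalg.solve(L, A.T)                # L^{-1} gamma L^{-T}
    _, vecs = np.linalg.eigh(S)                # eigenvalues in ascending order
    b = np.linalg.solve(L.T, vecs[:, -1])      # back-transform: b = L^{-T} v
    return b / np.linalg.norm(b)

def big_sir(X, y, n_blocks=8, n_slices=10):
    """Divide: SIR on each block. Recombine: leading eigenvector of the
    size-weighted sum of outer products of the block directions
    (an assumed form of the proximity criterion)."""
    n, p = X.shape
    M = np.zeros((p, p))
    # Blocks are independent, so this loop could be run in parallel.
    for idx in np.array_split(np.arange(n), n_blocks):
        b = sir_direction(X[idx], y[idx], n_slices)
        M += (len(idx) / n) * np.outer(b, b)   # invariant to the sign of b
    _, vecs = np.linalg.eigh(M)
    return vecs[:, -1]

# Simulated single-index data (hypothetical example, not from the study).
rng = np.random.default_rng(0)
n, p = 20000, 5
beta = np.array([1.0, -1.0, 0.0, 0.0, 0.0]) / np.sqrt(2)
X = rng.normal(size=(n, p))
y = (X @ beta) ** 3 + 0.5 * rng.normal(size=n)
b_hat = big_sir(X, y)
cos2 = float((b_hat @ beta) ** 2)              # squared cosine with the true direction
print(cos2)
```

Since an EDR direction is only identifiable up to sign and scale, the quality of the estimate is measured here by the squared cosine between the estimated and true directions; values close to 1 indicate good recovery. Consecutive blocks are used for simplicity, which assumes the rows are in no particular order.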
The consistency of our estimator is established and its asymptotic distribution is given. Extensions
to multiple-index models, q-dimensional response variables and/or SIRα-based methods
are also discussed. A simulation study using our edrGraphicalTools R package shows that
our approach reduces the computation time and overcomes the memory constraint
problem posed by massive datasets. A combination of the foreach and bigmemory R packages
is exploited to make execution efficient in both speed and memory. Finally, results are
visualised using the bin-summarise-smooth approach through the bigvis R package.
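To illustrate the bin-summarise-smooth idea mentioned above, here is a minimal sketch in Python with NumPy. It does not use or reproduce the bigvis API: the function name `bin_summarise_smooth`, the fixed-width binning, the per-bin mean as summary, and the Gaussian kernel smoother are all assumptions chosen for illustration.

```python
import numpy as np

def bin_summarise_smooth(x, y, width, bandwidth):
    """Condense (x, y) into fixed-width bins, summarise each bin,
    then smooth the summaries with a Gaussian kernel."""
    # Bin: assign every observation to a fixed-width bin.
    origin = x.min()
    bins = np.floor((x - origin) / width).astype(int)
    n_bins = bins.max() + 1
    centers = origin + (np.arange(n_bins) + 0.5) * width
    # Summarise: per-bin count and sum of y (sufficient for the mean).
    counts = np.bincount(bins, minlength=n_bins)
    sums = np.bincount(bins, weights=y, minlength=n_bins)
    # Smooth: kernel-weighted mean computed on the bin summaries only,
    # so the cost depends on the number of bins, not on the data size.
    d = (centers[:, None] - centers[None, :]) / bandwidth
    K = np.exp(-0.5 * d ** 2)
    smooth = (K @ sums) / (K @ counts)
    return centers, counts, smooth

# Simulated data (hypothetical example).
rng = np.random.default_rng(1)
n = 100_000
x = rng.uniform(0.0, 10.0, size=n)
y = np.sin(x) + 0.3 * rng.normal(size=n)
centers, counts, smooth = bin_summarise_smooth(x, y, width=0.1, bandwidth=0.1)
```

After the binning step only the per-bin counts and sums are kept, so the smoothed curve can be plotted from a few hundred values instead of the full dataset.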