Section: New Results
Algorithmic Foundations
Keywords: computational geometry, computational topology, Voronoi diagrams
Beyond Two-sample-tests: Localizing Data Discrepancies in High-dimensional Spaces
Participants: Frédéric Cazals, Alix Lhéritier.
Comparing two sets of multivariate samples is a central problem in data analysis. From a statistical standpoint, the simplest way to perform such a comparison is to resort to a non-parametric two-sample test (TST), which checks whether the two sets can be seen as i.i.d. samples of an identical unknown distribution (the null hypothesis). If the null is rejected, one wishes to identify the regions accounting for this difference. In this paper [17], we present a two-stage method providing feedback on this difference, based upon a combination of statistical learning (regression) and computational topology methods.
Consider two populations, each given as a point cloud in
Experiments are reported both on synthetic and real data (satellite
images and handwritten digit images), ranging in dimension from
From a general perspective, the ability to provide feedback downstream of a TST may prove of broad interest in exploratory statistics and data science.
A Sequential Non-parametric Two-Sample Test
Participants: Frédéric Cazals, Alix Lhéritier.
Given samples from two distributions, a non-parametric two-sample test aims at determining whether the two distributions are equal, based on a test statistic. This statistic may be computed on the whole dataset, or on a subset of the dataset by a function trained on its complement. We propose a third tier [19], consisting of functions that exploit a sequential framework to learn the differences while incrementally processing the data. Sequential processing naturally allows optional stopping, which makes our test the first truly sequential non-parametric two-sample test.
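To make the sequential framework above concrete, the following minimal Python sketch shows one standard way a sequential predictor can drive a test with optional stopping. It is an illustration under stated assumptions, not necessarily the construction of [19]: the samples of both populations are assumed to arrive as a single stream of (point, label) pairs, the label proportion under the null (`prior_one`) is assumed known, and the test rejects when the likelihood ratio of the predictor against the null exceeds 1/alpha, which controls the type I error at any stopping time by Ville's inequality. The names `BinnedKTPredictor`, `sequential_two_sample_test` and `labelled_stream` are hypothetical, and the one-dimensional histogram predictor is only a placeholder for an arbitrary sequential predictor.

```python
import math
import random
from collections import defaultdict


class BinnedKTPredictor:
    """Hypothetical sequential predictor for 1-D points: per-bin label
    counts with Krichevsky-Trofimov (add-1/2) smoothing."""

    def __init__(self, bin_width=1.0):
        self.bin_width = bin_width
        # bin index -> [number of label-0 points, number of label-1 points]
        self.counts = defaultdict(lambda: [0, 0])

    def _bin(self, x):
        return math.floor(x / self.bin_width)

    def predict_proba(self, x):
        """Estimated probability that the point x carries label 1."""
        c0, c1 = self.counts[self._bin(x)]
        return (c1 + 0.5) / (c0 + c1 + 1.0)

    def update(self, x, y):
        self.counts[self._bin(x)][y] += 1


def sequential_two_sample_test(stream, predictor, alpha=0.05, prior_one=0.5):
    """Process (x, y) pairs one at a time; return (rejected, n, p_value).

    Assumption: under the null, labels are independent of the points and
    equal 1 with known probability prior_one. The running likelihood ratio
    is then a non-negative martingale starting at 1, so stopping as soon as
    it reaches 1/alpha keeps the type I error below alpha.
    """
    log_ratio = 0.0      # log of (predictor likelihood / null likelihood)
    max_log_ratio = 0.0  # running maximum, gives an anytime-valid p-value
    n = 0
    for x, y in stream:
        n += 1
        p1 = predictor.predict_proba(x)  # predict before seeing the label
        q = p1 if y == 1 else 1.0 - p1
        p0 = prior_one if y == 1 else 1.0 - prior_one
        log_ratio += math.log(q) - math.log(p0)
        max_log_ratio = max(max_log_ratio, log_ratio)
        predictor.update(x, y)           # learn from the revealed label
        if log_ratio >= math.log(1.0 / alpha):
            return True, n, math.exp(-max_log_ratio)
    return False, n, math.exp(-max_log_ratio)


def labelled_stream(n, shift, seed):
    """Hypothetical data source: label y is a fair coin, x ~ N(shift * y, 1)."""
    rng = random.Random(seed)
    for _ in range(n):
        y = rng.randint(0, 1)
        yield rng.gauss(shift * y, 1.0), y


if __name__ == "__main__":
    for shift in (0.0, 3.0):
        result = sequential_two_sample_test(
            labelled_stream(2000, shift, seed=0), BinnedKTPredictor())
        print("shift", shift, "->", result)
```

In this sketch the predictor is queried before each label is revealed and updated afterwards, so the ratio never uses a label to predict itself. Swapping in a stronger sequential predictor only changes how fast the ratio grows when the two distributions differ; the validity argument does not depend on the predictor.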
We show that any sequential predictor can be turned into a sequential
two-sample test for which a valid