## Section: New Results

### Algorithmic Foundations

**Keywords:** computational geometry, computational topology, Voronoi diagrams,
$\alpha$-shapes, Morse theory, graph algorithms, combinatorial optimization, statistical learning.

#### Beyond Two-sample-tests: Localizing Data Discrepancies in High-dimensional Spaces

Participants : Frédéric Cazals, Alix Lhéritier.

Comparing two sets of multivariate samples is a central problem in
data analysis. From a statistical standpoint, the simplest way to
perform such a comparison is to resort to a non-parametric two-sample
test (TST), which checks whether the two sets can be seen as
i.i.d. samples of an identical unknown distribution (the null
hypothesis). If the null is rejected, one wishes to identify regions
accounting for this difference. In this paper
[17], we present a two-stage method providing
*feedback* on this difference, based upon a combination of
statistical learning (regression) and computational topology
methods.

Consider two populations, each given as a point cloud in ${\mathbb{R}}^{d}$.
In the first step, we assign a label to each set and we compute, for each
sample point, a discrepancy measure based on comparing an estimate of the
conditional probability distribution of the label given a position versus the
global unconditional label distribution.
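
A minimal sketch of this first step, under assumptions not stated in the source: the conditional label distribution is estimated from the $k$ nearest neighbors with Laplace smoothing, and the discrepancy is taken to be the Kullback-Leibler divergence (in bits) between that estimate and the global label frequencies. The function name `knn_discrepancy` and the parameter `k` are hypothetical.

```python
import math


def knn_discrepancy(points, labels, k=5):
    """Per-point discrepancy between local and global label distributions.

    points: list of coordinate tuples (the two populations pooled together).
    labels: list of 0/1 labels, one per point (which population it came from).
    Returns one KL divergence (in bits) per point.
    Assumption: kNN estimate + KL divergence; the source does not fix either.
    """
    n = len(points)
    p1 = sum(labels) / n  # global (unconditional) frequency of label 1
    if p1 in (0.0, 1.0):
        raise ValueError("both labels must be present")
    prior = (1.0 - p1, p1)

    scores = []
    for i, x in enumerate(points):
        # Brute-force k nearest neighbors of x, excluding x itself.
        others = (j for j in range(n) if j != i)
        nearest = sorted(others, key=lambda j: math.dist(x, points[j]))[:k]
        ones = sum(labels[j] for j in nearest)
        # Laplace smoothing keeps the estimate strictly inside (0, 1).
        q1 = (ones + 1.0) / (len(nearest) + 2.0)
        cond = (1.0 - q1, q1)
        scores.append(sum(c * math.log2(c / p) for c, p in zip(cond, prior)))
    return scores


# Usage: two pooled point clouds in the plane, labeled 0 and 1.
cloud = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
print(knn_discrepancy(cloud, [0, 0, 0, 1, 1, 1], k=2))
```
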
In the second step, we study the height function defined at each point
by the aforementioned estimated discrepancy. Topological persistence is
used to identify persistent local minima of this height function,
their *basins* defining
regions of points with high discrepancy and in spatial proximity.
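
One way to extract persistent minima and their basins is a union-find sweep over a neighborhood graph, in the spirit of ToMATo-style clustering. This is a sketch under assumptions, not necessarily the procedure used in [17]. It assumes that `height` is oriented so that the regions of interest are minima (for instance a negated discrepancy), that `neighbors` encodes a symmetric graph such as a $k$-nearest-neighbor graph, and that `tau` is a user-chosen persistence threshold. All three names are hypothetical.

```python
def persistent_basins(height, neighbors, tau):
    """Assign each vertex to the basin of a persistent local minimum.

    height: list of floats, one per vertex.
    neighbors: list of lists, neighbors[v] = vertices adjacent to v.
    tau: a minimum whose persistence is below tau is merged into a deeper one.
    Returns a list giving, for each vertex, the index of its basin's minimum.
    """
    parent = list(range(len(height)))

    def find(v):
        # Union-find lookup with path halving; the root is the basin's minimum.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    seen = [False] * len(height)
    # Sweep vertices by increasing height (sublevel-set filtration).
    for v in sorted(range(len(height)), key=lambda i: height[i]):
        seen[v] = True
        roots = {find(u) for u in neighbors[v] if seen[u]}
        if not roots:
            continue  # no lower neighbor: v is a local minimum, new basin
        # Attach v to the adjacent basin with the deepest minimum.
        best = min(roots, key=lambda r: height[r])
        parent[v] = best
        # v connects the other adjacent basins to that one: merge the shallow ones.
        for r in roots - {best}:
            a, b = find(r), find(best)
            if a == b:
                continue
            low, high = (a, b) if height[a] <= height[b] else (b, a)
            if height[v] - height[high] < tau:  # persistence of the shallower minimum
                parent[high] = low
    return [find(v) for v in range(len(height))]


# Usage on a path graph with three local minima (vertices 0, 3 and 5).
h = [0.0, 1.0, 3.0, 0.5, 2.9, 2.8, 3.5]
path = [[1], [0, 2], [1, 3], [2, 4], [3, 5], [4, 6], [5]]
print(persistent_basins(h, path, tau=1.0))
```

With `tau = 0` every local minimum keeps its own basin; increasing `tau` merges shallow minima into deeper neighboring ones.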

Experiments are reported on both synthetic and real data (satellite images and handwritten digit images), ranging in dimension from $d=2$ to $d=784$, illustrating the ability of our method to localize discrepancies.

From a general perspective, the ability to provide feedback downstream of a TST may prove of broad interest in exploratory statistics and data science.

#### A Sequential Non-parametric Two-Sample Test

Participants : Frédéric Cazals, Alix Lhéritier.

Given samples from two distributions, a nonparametric two-sample test aims at determining whether the two distributions are equal, based on a test statistic. This statistic may be computed on the whole dataset, or on a subset of the dataset by a function trained on its complement. We propose a third tier [19], consisting of functions exploiting a sequential framework to learn the differences while incrementally processing the data. Sequential processing naturally allows optional stopping, which makes our test the first truly sequential nonparametric two-sample test.

We show that any sequential predictor can be turned into a sequential two-sample test for which a valid $p$-value can be computed, yielding controlled type I error. We also show that pointwise universal predictors yield consistent tests, which can be built in particular with a nonparametric regressor based on $k$-nearest neighbors. Finally, we show that mixtures and switch distributions can be used to increase power while preserving consistency.
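
The first claim can be illustrated with a likelihood-ratio construction. The sketch below is an illustration under stated assumptions rather than the test of [19]: it supposes that each label is drawn with a known probability `prior_one` under the null, uses a hypothetical Laplace-smoothed $k$-nearest-neighbor predictor `knn_predict`, and reports the running minimum of the inverse likelihood ratio as a $p$-value. Under the null the likelihood ratio is a nonnegative martingale, so Ville's inequality keeps this $p$-value valid under optional stopping.

```python
import math


def knn_predict(history, x, k):
    """Predicted P(label = 1 | x) from the k nearest past points (hypothetical helper)."""
    if not history:
        return 0.5
    nearest = sorted(history, key=lambda h: math.dist(h[0], x))[:k]
    ones = sum(label for _, label in nearest)
    # Laplace smoothing keeps predictions strictly inside (0, 1).
    return (ones + 1.0) / (len(nearest) + 2.0)


def sequential_tst(stream, alpha=0.05, k=5, prior_one=0.5):
    """Sequential two-sample test driven by a sequential predictor.

    stream: iterable of (x, label) pairs, label in {0, 1} naming the population.
    prior_one: assumed known probability of label 1 under the null.
    Returns (rejected, n_used, p_value).
    """
    history = []
    log_ratio = 0.0  # log of predictor likelihood / null likelihood
    p_value = 1.0
    n = 0
    for n, (x, label) in enumerate(stream, start=1):
        q1 = knn_predict(history, x, k)  # predict before the label is used
        q = q1 if label == 1 else 1.0 - q1
        p0 = prior_one if label == 1 else 1.0 - prior_one
        log_ratio += math.log(q) - math.log(p0)
        # Running minimum of min(1, 1 / ratio); the clamp also avoids overflow.
        p_value = min(p_value, math.exp(min(0.0, -log_ratio)))
        if p_value <= alpha:
            return True, n, p_value  # optional stopping: reject as soon as possible
        history.append((x, label))
    return False, n, p_value


# Usage: two well-separated populations on the line, interleaved.
data = [((10.0 * (i % 2) + 0.1 * (i % 5),), i % 2) for i in range(200)]
print(sequential_tst(data, alpha=0.05, k=3))
```

Any other sequential predictor can be substituted for `knn_predict`, as long as each prediction is made before the corresponding label is used.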