Section: New Results

Statistical aspects of topological and geometric data analysis

Estimating the Reach of a Manifold

Participants : Frédéric Chazal, Jisu Kim, Bertrand Michel.

In collaboration with E. Aamari (Univ. Paris-Diderot), A. Rinaldo, L. Wasserman (Carnegie Mellon University).

Various problems in manifold estimation make use of a quantity called the reach, denoted by $\tau_M$, which measures the regularity of the manifold. In [13], we carry out the first investigation into the problem of estimating the reach. First, we study the geometry of the reach through an approximation perspective, and derive new geometric results on the reach for submanifolds without boundary. An estimator $\hat{\tau}$ of $\tau_M$ is proposed in an oracle framework where tangent spaces are known, and bounds assessing its efficiency are derived. In the case of an i.i.d. random point cloud $X_n$, $\hat{\tau}(X_n)$ is shown to achieve uniform expected loss bounds over a $C^3$-like model. Finally, we obtain upper and lower bounds on the minimax rate for estimating the reach.
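To make the oracle setting concrete, here is a minimal Python sketch of a plug-in reach estimate on a planar curve. It assumes the estimate takes the form $\inf_{x \neq y} \|y - x\|^2 / (2\, d(y - x, T_x M))$ over sample points, which mirrors Federer's characterization of the reach. The function names `reach_estimate` and `circle_sample` and the circle example are hypothetical illustrations, not code from [13].

```python
import math

def reach_estimate(points, tangents):
    """Plug-in reach estimate for a planar curve sampled at `points`,
    given the (oracle) unit tangent vector at each sample point."""
    best = math.inf
    for (px, py), (tx, ty) in zip(points, tangents):
        for qx, qy in points:
            dx, dy = qx - px, qy - py
            sq_norm = dx * dx + dy * dy
            # Distance from q - p to the tangent line at p (unit tangent):
            # absolute value of the 2D cross product.
            dist_to_tangent = abs(dx * ty - dy * tx)
            if dist_to_tangent > 0.0:  # skips q == p and flat directions
                best = min(best, sq_norm / (2.0 * dist_to_tangent))
    return best

def circle_sample(radius, n):
    """Regular sample of a circle with its exact unit tangents (toy oracle)."""
    points, tangents = [], []
    for k in range(n):
        a = 2.0 * math.pi * k / n
        points.append((radius * math.cos(a), radius * math.sin(a)))
        tangents.append((-math.sin(a), math.cos(a)))
    return points, tangents

pts, tans = circle_sample(2.0, 50)
print(reach_estimate(pts, tans))
```

On a circle every pair of points yields the radius exactly, so the printed value should agree with the true reach 2.0 up to floating-point error. On a general manifold the estimate can only overshoot the reach for a finite sample, since the infimum is taken over fewer pairs.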

A statistical test of isomorphism between metric-measure spaces using the distance-to-a-measure signature

Participant : Claire Brecheteau.

In [20], we introduce the notion of DTM-signature, a measure on $\mathbb{R}_+$ that can be associated to any metric-measure space. This signature is based on the distance-to-a-measure (DTM) function introduced in 2009 by Chazal, Cohen-Steiner and Mérigot. It leads to a pseudo-metric between metric-measure spaces that is bounded above by the Gromov-Wasserstein distance. This pseudo-metric is used to build a statistical test of isomorphism between two metric-measure spaces, from the observation of two $N$-samples.

The test is based on subsampling methods and comes with theoretical guarantees. It is proven to be of the correct level asymptotically. Also, when the measures are supported on compact subsets of $\mathbb{R}^d$, rates of convergence are derived for the $L_1$-Wasserstein distance between the distribution of the test statistic and its subsampling approximation. These rates depend on some parameter $\rho > 1$. In addition, we prove that the type II error is bounded above by $\exp(-C N^{1/\rho})$, with $C$ proportional to the square of the aforementioned pseudo-metric between the metric-measure spaces. Under some geometrical assumptions, we also derive lower bounds for this pseudo-metric.

An algorithm is proposed for the implementation of this statistical test, and its performance is compared to the performance of other methods through numerical experiments.
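The DTM-signature and the associated pseudo-metric can be sketched on finite samples as follows. This is a minimal illustration, not the algorithm of [20]: it assumes the usual empirical DTM (root mean squared distance to the `k` nearest sample points, the point itself included), takes the signature to be the empirical distribution of DTM values over the sample, and compares two signatures of equal size with the $L_1$-Wasserstein distance. The names `dtm_signature` and `w1_equal_size` are hypothetical.

```python
import math

def dtm_signature(points, k):
    """Sorted empirical DTM values of a point cloud, using the k nearest
    sample points of each point (mass parameter m = k / n)."""
    values = []
    for p in points:
        sq_dists = sorted(
            sum((a - b) ** 2 for a, b in zip(p, q)) for q in points
        )
        values.append(math.sqrt(sum(sq_dists[:k]) / k))
    return sorted(values)

def w1_equal_size(sig_a, sig_b):
    """L1-Wasserstein distance between two uniform empirical measures on
    the real line with the same number of atoms (both lists sorted)."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same size")
    return sum(abs(a - b) for a, b in zip(sig_a, sig_b)) / len(sig_a)

cloud = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 0)]
# Isometric copy: rotation by 90 degrees followed by a translation.
moved = [(-y + 5, x - 3) for x, y in cloud]
# Non-isometric copy: scaling by 2.
scaled = [(2 * x, 2 * y) for x, y in cloud]

print(w1_equal_size(dtm_signature(cloud, 3), dtm_signature(moved, 3)))
print(w1_equal_size(dtm_signature(cloud, 3), dtm_signature(scaled, 3)))
```

Because the signature depends only on pairwise distances, the isometric copy is at distance zero from the original cloud, while the scaled copy is not. The subsampling step that calibrates the test is deliberately left out of this sketch.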

On the choice of weight functions for linear representations of persistence diagrams

Participant : Vincent Divol.

In collaboration with Wolfgang Polonik (UC Davis).

Persistence diagrams are efficient descriptors of the topology of a point cloud. As they do not naturally belong to a Hilbert space, standard statistical methods cannot be directly applied to them. Instead, feature maps (or representations) are commonly used for the analysis. A large class of feature maps, which we call linear, depends on some weight functions, the choice of which is a critical issue. An important criterion for choosing a weight function is to ensure stability of the feature maps with respect to Wasserstein distances on diagrams. In [21], we improve known results on the stability of such maps, and extend them to general weight functions. We also address the choice of the weight function by considering an asymptotic setting: assume that $\mathbb{X}_n$ is an i.i.d. sample from a density on $[0,1]^d$. For the Čech and Rips filtrations, we characterize the weight functions for which the corresponding feature maps converge as $n$ approaches infinity, and by doing so, we prove laws of large numbers for the total persistences of such diagrams. Those two approaches (stability and convergence) lead to the same simple heuristic for tuning weight functions: if the data lies near a $d$-dimensional manifold, then a sensible choice of weight function is the persistence to the power $\alpha$ with $\alpha \geq d$.
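As an illustration of what a linear feature map looks like, the following minimal Python sketch sums, over the points of a diagram, a feature vector multiplied by a weight equal to the persistence raised to the power `alpha`. The names `linear_representation`, `total_persistence` and `bump` are hypothetical, the exponent is left as a free parameter, and the diagram values are made up for the example.

```python
import math

def persistence(point):
    """Lifetime of a diagram point given as (birth, death)."""
    birth, death = point
    return death - birth

def linear_representation(diagram, alpha, phi):
    """Sum over diagram points r of persistence(r)**alpha * phi(r),
    where phi maps a point to a fixed-length list of floats."""
    total = None
    for r in diagram:
        weight = persistence(r) ** alpha
        feature = [weight * v for v in phi(r)]
        if total is None:
            total = feature
        else:
            total = [s + f for s, f in zip(total, feature)]
    return total if total is not None else []

def total_persistence(diagram, alpha):
    """Special case phi = 1: the sum of persistence(r)**alpha."""
    result = linear_representation(diagram, alpha, lambda r: [1.0])
    return result[0] if result else 0.0

def bump(r, centers=((0.0, 1.0), (0.0, 2.0)), bandwidth=0.5):
    """Toy feature: Gaussian bumps around a few fixed centers."""
    return [
        math.exp(-((r[0] - c[0]) ** 2 + (r[1] - c[1]) ** 2) / (2 * bandwidth ** 2))
        for c in centers
    ]

diagram = [(0.0, 1.0), (0.0, 2.0), (0.5, 0.6)]
print(total_persistence(diagram, alpha=2))
print(linear_representation(diagram, alpha=2, phi=bump))
```

Raising `alpha` shrinks the contribution of short-lived points such as `(0.5, 0.6)` relative to long-lived ones, which is the mechanism through which the weight function controls stability and convergence.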

Understanding the Topology and the Geometry of the Persistence Diagram Space via Optimal Partial Transport

Participants : Vincent Divol, Théo Lacombe.

Despite the obvious similarities between the metrics used in topological data analysis and those of optimal transport, an optimal-transport-based formalism for studying persistence diagrams and similar topological descriptors has yet to be developed. In [48], by considering the space of persistence diagrams as a measure space, and by observing that its metrics can be expressed as solutions of optimal partial transport problems, we introduce a generalization of persistence diagrams, namely Radon measures supported on the upper half-plane. Such measures naturally appear in topological data analysis when considering continuous representations of persistence diagrams (e.g. persistence surfaces), but also as limits in laws of large numbers for persistence diagrams or as expectations of probability distributions on the space of persistence diagrams. We study the topological properties of this new space, which also hold for the closed subspace of persistence diagrams. New results include a characterization of convergence with respect to transport metrics, the existence of Fréchet means for any distribution of diagrams, and an exhaustive description of continuous linear representations of persistence diagrams. We also showcase the usefulness of this framework for studying random persistence diagrams by providing several statistical results made meaningful by this new formalism.
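The partial transport viewpoint can be illustrated on two small diagrams: each point is either matched to a point of the other diagram or sent to the diagonal, and the distance is the cheapest such plan. The sketch below is a brute-force illustration under stated assumptions, not the framework of [48]: it uses the sup norm on the plane, an exponent `p`, and enumerates all matchings, so it is only usable for a handful of points. The function name `partial_transport_distance` is hypothetical.

```python
from itertools import permutations

def diag_dist(point):
    """Sup-norm distance from a (birth, death) point to the diagonal."""
    birth, death = point
    return (death - birth) / 2.0

def sup_dist(p, q):
    """Sup-norm distance between two points of the plane."""
    return max(abs(p[0] - q[0]), abs(p[1] - q[1]))

def partial_transport_distance(diag_a, diag_b, p=1):
    """Cheapest partial matching between two small diagrams, where any
    point may be sent to the diagonal instead of being matched."""
    n, m = len(diag_a), len(diag_b)
    size = n + m  # each diagram is padded with diagonal slots

    def cost(i, j):
        if i < n and j < m:          # point of A matched to point of B
            return sup_dist(diag_a[i], diag_b[j]) ** p
        if i < n:                    # point of A sent to the diagonal
            return diag_dist(diag_a[i]) ** p
        if j < m:                    # point of B sent to the diagonal
            return diag_dist(diag_b[j]) ** p
        return 0.0                   # diagonal matched to diagonal

    best = min(
        sum(cost(i, sigma[i]) for i in range(size))
        for sigma in permutations(range(size))
    )
    return best ** (1.0 / p)

print(partial_transport_distance([(0, 4)], [(0, 5)]))      # points are matched
print(partial_transport_distance([(0, 1)], [(10, 11)]))    # both go to the diagonal
```

In the second call, matching the two points would cost 10, whereas sending each to the diagonal costs 0.5, so the optimal plan transports no mass between the diagrams. For diagrams of realistic size, an assignment solver would replace the enumeration of permutations.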