## Section: New Results

### Statistical aspects of topological and geometric data analysis

#### Robust Bregman Clustering

Participants : Claire Brécheteau, Clément Levrard.

In collaboration with Aurélie Fischer (Université Paris-Diderot).

Using a trimming approach, we investigate in [38] a k-means type method based on Bregman divergences for clustering data possibly corrupted with clutter noise. The main interest of Bregman divergences is that the standard Lloyd algorithm adapts to these distortion measures, and that they are well-suited for clustering data sampled according to mixture models from exponential families. We prove that there exists an optimal codebook, and that an empirically optimal codebook converges almost surely to an optimal codebook in the distortion sense. Moreover, we obtain the sub-Gaussian convergence rate $1/\sqrt{n}$ for $k$-means under mild tail assumptions. We also derive a Lloyd-type algorithm with a trimming parameter that can be selected from the data according to some heuristic, and present experimental results.
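As an illustration, here is a minimal sketch (not the authors' code) of a Lloyd-type iteration with trimming, using the squared Euclidean distance as the simplest Bregman divergence; the function name, the deterministic initialization and the parameter names are ours:

```python
import numpy as np

def trimmed_bregman_kmeans(X, k, trim=0.1, n_iter=50):
    """Lloyd-type iterations with trimming.  The squared Euclidean
    distance plays the role of the Bregman divergence; for any Bregman
    divergence the optimal center of a cell is still the plain mean.
    Deterministic initialization from the first k points (a k-means++
    style seeding would be preferable in practice)."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    keep = int(np.ceil((1.0 - trim) * n))
    centers = X[:k].copy()
    for _ in range(n_iter):
        # divergence of every point to every center
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        nearest = d.min(axis=1)
        kept = np.argsort(nearest)[:keep]      # trim the farthest points
        labels = d[kept].argmin(axis=1)
        for j in range(k):
            pts = X[kept][labels == j]
            if len(pts):
                centers[j] = pts.mean(axis=0)  # Bregman centroid = mean
    return centers, kept
```

With a fraction `trim` of the sample discarded at each iteration, a single distant outlier no longer drags a center away from its cluster.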

#### Statistical analysis and parameter selection for Mapper

Participants : Mathieu Carrière, Bertrand Michel, Steve Oudot.

In [15] we study the question of the statistical convergence of the 1-dimensional Mapper to its continuous analogue, the Reeb graph. We show that the Mapper is an optimal estimator of the Reeb graph, which gives, as a byproduct, a method to automatically tune its parameters and compute confidence regions on its topological features, such as its loops and flares. This makes it possible to circumvent the brute-force approach, widely used in visualization, clustering and feature selection with the Mapper, which tests a large grid of parameters and keeps the most stable ones.
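To make the objects concrete, a toy 1-dimensional Mapper can be sketched as follows (our own simplified construction, not the implementation studied in [15]). The filter values `f`, the number of cover intervals, the overlap and the clustering scale `eps` are exactly the parameters whose tuning is at stake:

```python
import numpy as np
from collections import defaultdict
from itertools import combinations

def single_linkage(P, eps):
    """Connected components of the eps-neighborhood graph, a crude
    stand-in for the clustering step of Mapper."""
    parent = list(range(len(P)))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i in range(len(P)):
        for j in range(i + 1, len(P)):
            if np.linalg.norm(P[i] - P[j]) <= eps:
                parent[find(i)] = find(j)
    comps = defaultdict(list)
    for i in range(len(P)):
        comps[find(i)].append(i)
    return [np.array(c) for c in comps.values()]

def mapper_1d(X, f, n_intervals=10, overlap=0.3, eps=0.25):
    """Minimal 1-dimensional Mapper: cover the range of the filter f
    with n_intervals intervals enlarged by `overlap`, cluster each
    preimage, and link two clusters whenever they share a data point."""
    lo, hi = float(f.min()), float(f.max())
    length = (hi - lo) / n_intervals
    nodes, point_to_nodes = [], defaultdict(list)
    for i in range(n_intervals):
        a = lo + i * length - overlap * length
        b = lo + (i + 1) * length + overlap * length
        idx = np.where((f >= a) & (f <= b))[0]
        if len(idx) == 0:
            continue
        for cluster in single_linkage(X[idx], eps):
            nodes.append(idx[cluster])
            for p in idx[cluster]:
                point_to_nodes[p].append(len(nodes) - 1)
    edges = {pair for ids in point_to_nodes.values()
             for pair in combinations(ids, 2)}
    return nodes, edges
```

On a point cloud sampled from a circle with the first coordinate as filter, this produces a graph that is itself a cycle, mirroring the loop of the underlying Reeb graph.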

#### A Fuzzy Clustering Algorithm for the Mode-Seeking Framework

Participants : Thomas Bonis, Steve Oudot.

In [13] we propose a new soft clustering algorithm based on the mode-seeking framework. Given a point cloud in ${\mathbb{R}}^{d}$, we define regions of high density that we call cluster cores, then we implement a random walk on a neighborhood graph built on top of the data points. This random walk is designed in such a way that it is attracted by high-density regions, the intensity of the attraction being controlled by a temperature parameter $\beta >0$. The membership of a point to a given cluster is then the probability for the random walk starting at this point to hit the corresponding cluster core before any other. While many properties of random walks (such as hitting times, commute distances, etc.) are known to eventually encode purely local information when the number of data points grows to infinity, the regularization introduced by the use of cluster cores allows the output of our algorithm to converge to quantities involving the global structure of the underlying density function. Empirically, we show how the choice of $\beta $ influences the behavior of our algorithm: for small values of $\beta $ the result is very close to hard mode-seeking, while for values of $\beta $ close to 1 the result is similar to the output of soft spectral clustering. We also demonstrate the scalability of our approach experimentally.
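A minimal sketch of the hitting-probability computation (our own illustrative code, with the cluster cores assumed to be already identified, and a crude neighbor-counting density estimate) could look like this:

```python
import numpy as np

def soft_membership(X, cores, beta=0.5, radius=1.5):
    """Hitting probabilities of a density-attracted random walk.
    X: (n, d) points; cores: list of index lists (the cluster cores,
    assumed already identified); beta: the temperature parameter.
    From a point, the walk moves to a neighbor j (a point within
    `radius`) with probability proportional to density(j)**beta."""
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    A = (D <= radius) & ~np.eye(n, dtype=bool)
    density = A.sum(axis=1).astype(float)          # crude density estimate
    W = A * density[None, :] ** beta
    P = W / W.sum(axis=1, keepdims=True)
    core_of = -np.ones(n, dtype=int)
    for c, members in enumerate(cores):
        core_of[members] = c
    trans = np.where(core_of < 0)[0]               # non-core (transient) points
    Q = P[np.ix_(trans, trans)]
    R = np.zeros((len(trans), len(cores)))
    for c, members in enumerate(cores):
        R[:, c] = P[np.ix_(trans, members)].sum(axis=1)
    H = np.linalg.solve(np.eye(len(trans)) - Q, R)  # absorption probabilities
    M = np.zeros((n, len(cores)))
    M[trans] = H
    for c, members in enumerate(cores):
        M[members, c] = 1.0
    return M
```

Treating the cores as absorbing states reduces the membership computation to one linear solve, which is what makes the approach scalable.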

#### Large Scale computation of Means and Clusters for Persistence Diagrams using Optimal Transport

Participants : Théo Lacombe, Steve Oudot.

In collaboration with Marco Cuturi (ENSAE).

Persistence diagrams (PDs) are at the core of topological data analysis. They provide succinct descriptors encoding the underlying topology of sophisticated data. PDs are backed up by strong theoretical results regarding their stability and have been used in various learning contexts. However, they do not live in a space naturally endowed with a Hilbert structure, and the natural metrics on that space are not even differentiable, hence ill-suited to optimization processes. As a consequence, basic statistical notions such as the barycenter of a finite sample of PDs are not properly defined. In [30] we provide a theoretically sound and computationally tractable framework to estimate the barycenter of a set of persistence diagrams. This construction is based on the theory of Optimal Transport (OT) and endows the space of PDs with a metric inspired from regularized Wasserstein distances.
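As an illustration of the OT viewpoint, the following sketch (ours, not the paper's implementation) computes an entropically regularized transport cost between two diagrams, each augmented with a virtual diagonal point that can absorb the other diagram's mass:

```python
import numpy as np

def sinkhorn_pd(D1, D2, eps=0.1, n_iter=2000):
    """Entropic OT cost between persistence diagrams D1, D2, given as
    (k, 2) arrays of (birth, death) pairs.  Each diagram is augmented
    with a diagonal point absorbing the other diagram's mass."""
    def diag_dist2(D):                  # squared distance to the diagonal
        return ((D[:, 1] - D[:, 0]) ** 2) / 2.0
    n, m = len(D1), len(D2)
    C = np.zeros((n + 1, m + 1))
    C[:n, :m] = ((D1[:, None, :] - D2[None, :, :]) ** 2).sum(-1)
    C[:n, m] = diag_dist2(D1)           # send a point of D1 to the diagonal
    C[n, :m] = diag_dist2(D2)           # match a point of D2 with the diagonal
    a = np.ones(n + 1); a[n] = m        # the diagonal carries mass m
    b = np.ones(m + 1); b[m] = n
    a, b = a / a.sum(), b / b.sum()
    K = np.exp(-C / eps)
    u = np.ones(n + 1)
    for _ in range(n_iter):             # Sinkhorn fixed-point iterations
        v = b / (K.T @ u)
        u = a / (K @ v)
    T = u[:, None] * K * v[None, :]     # transport plan
    return float((T * C).sum())
```

Because the cost is built from matrix-vector products, it is differentiable in the diagram points and amenable to large-scale barycenter computations, which is the point of the regularized formulation.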

#### The k-PDTM: a coreset for robust geometric inference

Participants : Claire Brécheteau, Clément Levrard.

Analyzing the sub-level sets of the distance to a compact sub-manifold of ${\mathbb{R}}^{d}$ is a common method in TDA to understand its topology. The distance to measure (DTM) was introduced by Chazal, Cohen-Steiner and Mérigot to face the non-robustness of the distance to a compact set to noise and outliers. This function makes possible the inference of the topology of a compact subset of ${\mathbb{R}}^{d}$ from a noisy cloud of $n$ points lying nearby in the Wasserstein sense. In practice, these sub-level sets may be computed using approximations of the DTM such as the q-witnessed distance or other power distances. These approaches eventually lead to computing the homology of unions of $n$ growing balls, which might become intractable whenever $n$ is large. To simultaneously face the two problems of a large number of points and of noise, we introduce in [39] the $k$-power distance to measure ($k$-PDTM). This new approximation of the distance to measure may be thought of as a $k$-coreset based approximation of the DTM. Its sublevel sets consist of unions of $k$ balls, with $k \ll n$, and this distance is also proved robust to noise. We assess the quality of this approximation for $k$ possibly dramatically smaller than $n$; for instance $k = n^{1/3}$ is proved to be optimal for 2-dimensional shapes. We also provide an algorithm to compute this $k$-PDTM.
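A toy version of the two quantities can be written as follows (our own sketch; the actual algorithm of [39] comes with guarantees that this sketch does not claim). The DTM is a root mean squared distance to nearest neighbors; the $k$-PDTM replaces the $n$ balls by $k$ weighted ones, here obtained by a Lloyd-type iteration:

```python
import numpy as np

def dtm(X, Y, q):
    """Empirical distance to measure: value at each y in Y is the
    root mean squared distance to the q nearest sample points in X."""
    D = np.linalg.norm(Y[:, None] - X[None, :], axis=-1)
    D.sort(axis=1)
    return np.sqrt((D[:, :q] ** 2).mean(axis=1))

def k_pdtm(X, Y, q, k, n_iter=20, seed=0):
    """Lloyd-type sketch of a k-PDTM-style approximation: k centers,
    each carrying the local mean m_i and variance v_i of its q nearest
    sample points; the value at y is min_i sqrt(||y - m_i||^2 + v_i)."""
    rng = np.random.default_rng(seed)
    t = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        D = np.linalg.norm(t[:, None] - X[None, :], axis=-1)
        nn = np.argsort(D, axis=1)[:, :q]
        m = X[nn].mean(axis=1)                              # local means
        v = ((X[nn] - m[:, None]) ** 2).sum(-1).mean(1)     # local variances
        cost = ((X[:, None] - m[None, :]) ** 2).sum(-1) + v[None, :]
        labels = cost.argmin(axis=1)
        for i in range(k):
            if (labels == i).any():
                t[i] = X[labels == i].mean(axis=0)
    costY = ((Y[:, None] - m[None, :]) ** 2).sum(-1) + v[None, :]
    return np.sqrt(costY.min(axis=1))
```

Evaluating the $k$-PDTM requires only $k$ comparisons per query point instead of a sort over all $n$ samples, which is the computational gain the coreset viewpoint delivers.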

#### The density of expected persistence diagrams and its kernel based estimation

Participants : Frédéric Chazal, Vincent Divol.

Persistence diagrams play a fundamental role in Topological Data Analysis, where they are used as topological descriptors of filtrations built on top of data. They consist of discrete multisets of points in the plane ${\mathbb{R}}^{2}$ that can equivalently be seen as discrete measures in ${\mathbb{R}}^{2}$. When the data come as a random point cloud, these discrete measures become random measures whose expectation we study in [28]. We first show that for a wide class of filtrations, including the Čech and Rips-Vietoris filtrations, the expected persistence diagram, which is a deterministic measure on ${\mathbb{R}}^{2}$, has a density with respect to the Lebesgue measure. Second, building on this result, we show that the persistence surface recently introduced by Adams et al. can be seen as a kernel estimator of this density. We propose a cross-validation scheme for selecting an optimal bandwidth, which is proven to be a consistent procedure for estimating the density.
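The kernel-estimator viewpoint is easy to illustrate: a persistence surface is a weighted sum of Gaussian bumps centered at the diagram points, evaluated on a grid. The sketch below is ours; `sigma` plays the role of the bandwidth whose selection is discussed above, and the default weight (the persistence of a point) is one common choice:

```python
import numpy as np

def persistence_surface(diagram, grid_x, grid_y, sigma=0.05,
                        weight=lambda b, d: d - b):
    """Weighted sum of Gaussian bumps centered at the diagram points,
    evaluated on a grid of (birth, death) values."""
    gx, gy = np.meshgrid(grid_x, grid_y, indexing="ij")
    S = np.zeros_like(gx, dtype=float)
    for b, d in diagram:
        S += weight(b, d) * np.exp(
            -((gx - b) ** 2 + (gy - d) ** 2) / (2.0 * sigma ** 2))
    return S / (2.0 * np.pi * sigma ** 2 * len(diagram))
```

Seen this way, choosing `sigma` is an ordinary bandwidth-selection problem, which is what makes a cross-validation scheme natural.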

#### On the choice of weight functions for linear representations of persistence diagrams

Participant : Vincent Divol.

In collaboration with Wolfgang Polonik (UC Davis)

Persistence diagrams are efficient descriptors of the topology of a point cloud. As they do not naturally belong to a Hilbert space, standard statistical methods cannot be directly applied to them. Instead, feature maps (or representations) are commonly used for the analysis. A large class of feature maps, which we call linear, depends on some weight functions, the choice of which is a critical issue. An important criterion to choose a weight function is to ensure the stability of the feature maps with respect to Wasserstein distances on diagrams. In [42], we improve known results on the stability of such maps, and extend them to general weight functions. We also address the choice of the weight function by considering an asymptotic setting: assume that ${X}_{n}$ is an i.i.d. sample from a density on ${[0,1]}^{d}$. For the Čech and Rips filtrations, we characterize the weight functions for which the corresponding feature maps converge as $n$ tends to infinity, and by doing so, we prove laws of large numbers for the total persistence of such diagrams. Both approaches lead to the same simple heuristic for tuning weight functions: if the data lie near a $d$-dimensional manifold, then a sensible choice of weight function is the persistence to the power $\alpha $ with $\alpha \ge d$.
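The objects involved are simple to write down. The sketch below (ours) computes the $\alpha$-total persistence, whose law of large numbers is mentioned above, and one example of a linear representation: a Betti-like curve in which each bar is weighted by its persistence to the power $\alpha$:

```python
import numpy as np

def alpha_total_persistence(diagram, alpha):
    """Sum of the persistences (death - birth) to the power alpha."""
    return sum((d - b) ** alpha for b, d in diagram)

def weighted_betti_curve(diagram, ts, alpha=2.0):
    """A linear representation: at each time t, sum the weight
    w(p) = pers(p)**alpha over the bars (b, d) alive at t."""
    ts = np.asarray(ts, dtype=float)
    out = np.zeros_like(ts)
    for b, d in diagram:
        out += ((ts >= b) & (ts < d)) * (d - b) ** alpha
    return out
```

The heuristic above then amounts to taking `alpha` at least the intrinsic dimension of the data, so that the many short bars produced by sampling noise are damped.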

#### Estimating the Reach of a Manifold

Participants : Frédéric Chazal, Bertrand Michel.

In collaboration with E. Aamari (CNRS Paris 7), J. Kim, A. Rinaldo and L. Wasserman (Carnegie Mellon University).

Various problems in manifold estimation make use of a quantity called the reach, denoted by ${\tau}_{M}$, which is a measure of the regularity of the manifold. [32] is the first investigation into the problem of how to estimate the reach. First, we study the geometry of the reach through an approximation perspective. We derive new geometric results on the reach for submanifolds without boundary. An estimator $\widehat{\tau}$ of ${\tau}_{M}$ is proposed in a framework where tangent spaces are known, and bounds assessing its efficiency are derived. In the case of an i.i.d. random point cloud ${\mathbb{X}}_{n}$, $\widehat{\tau}\left({\mathbb{X}}_{n}\right)$ is shown to achieve uniform expected loss bounds over a ${\mathcal{C}}^{3}$-like model. Finally, we obtain upper and lower bounds on the minimax rate for estimating the reach.
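When tangent spaces are known, an estimator of this kind can be written in closed form as an infimum over pairs of points, $\|q-p\|^2 / (2\, d(q-p, T_p))$, where $d(\cdot, T_p)$ is the distance to the tangent space at $p$. A direct sketch (ours, as an illustration of the framework):

```python
import numpy as np

def reach_estimator(X, tangents):
    """tau_hat = min over pairs (p, q) of ||q - p||^2 / (2 d(q - p, T_p)).
    tangents[i]: orthonormal basis of the tangent space at X[i],
    given as a (dim, ambient_dim) array."""
    n, tau = len(X), np.inf
    for i in range(n):
        T = tangents[i]
        for j in range(n):
            if i == j:
                continue
            v = X[j] - X[i]
            perp = v - T.T @ (T @ v)       # component normal to T_p
            dist = np.linalg.norm(perp)
            if dist > 1e-12:               # skip pairs lying in T_p
                tau = min(tau, (v @ v) / (2.0 * dist))
    return tau
```

For points sampled on a circle this quantity equals the radius for every pair, so the estimator recovers the reach exactly in that case.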

#### Robust Topological Inference: Distance To a Measure and Kernel Distance

Participants : Frédéric Chazal, Bertrand Michel.

In collaboration with B. Fasy (Univ. Montana) and F. Lecci, A. Rinaldo and L. Wasserman (Carnegie Mellon University).

Let $P$ be a distribution with support $S$. The salient features of $S$ can be quantified with persistent homology, which summarizes the topological features of the sublevel sets of the distance function (the distance of any point $x$ to $S$). Given a sample from $P$, we can infer the persistent homology using an empirical version of the distance function. However, the empirical distance function is highly non-robust to noise and outliers: even one outlier is deadly. The distance-to-a-measure (DTM), introduced by Chazal et al. (2011), and the kernel distance, introduced by Phillips et al. (2014), are smooth functions that provide useful topological information while being robust to noise and outliers. Chazal et al. (2015) derived concentration bounds for the DTM. Building on these results, in [16] we derive limiting distributions and confidence sets, and we propose a method for choosing tuning parameters.
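The robustness phenomenon is easy to reproduce numerically. The sketch below (ours) compares the plain empirical distance function with an empirical Gaussian kernel distance; the helper names and the Gaussian choice are our own illustration, not the paper's code:

```python
import numpy as np

def kernel_distance(X, Y, sigma=1.0):
    """Empirical kernel distance with a Gaussian kernel:
    d_K(y)^2 = K(y, y) + mean_{x, x'} K(x, x') - 2 mean_x K(y, x)."""
    def K(A, B):
        d2 = ((A[:, None] - B[None, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))
    cross = K(Y, X).mean(axis=1)
    return np.sqrt(np.maximum(1.0 + K(X, X).mean() - 2.0 * cross, 0.0))

def dist_to_set(X, Y):
    """Plain empirical distance function, for comparison."""
    return np.linalg.norm(Y[:, None] - X[None, :], axis=-1).min(axis=1)
```

Adding a single outlier to the sample collapses the plain distance function near the outlier, while the kernel distance, which averages over the whole sample, barely moves.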