DataShape is a research project in Topological Data Analysis (TDA), a recent field whose aim is to uncover, understand and exploit the topological and geometric structure underlying complex and possibly high dimensional data. The overall objective of the DataShape project is to settle the mathematical, statistical and algorithmic foundations of TDA and to disseminate and promote our results in the data science community.

The approach of DataShape relies on the conviction that statistical, topological/geometric and computational approaches must be combined in a common framework in order to face the challenges of TDA. Another conviction of DataShape is that TDA needs to be combined with other data science approaches and tools to lead to successful real-world applications. TDA challenges therefore need to be addressed simultaneously from both the fundamental and the applied sides.

The team members have actively contributed to the emergence of TDA over the last few years. The variety of expertise, ranging from fundamental mathematics to software development, the strong interactions within our team, and numerous well-established international collaborations make our group one of the best placed to achieve these goals.

The expected output of DataShape is twofold. First, we intend to set up and develop the mathematical, statistical and algorithmic foundations of Topological and Geometric Data Analysis. Second, we intend to pursue the development of the GUDHI platform, initiated by the team members and which is becoming a standard tool in TDA, in order to provide an efficient state-of-the-art toolbox for the understanding of the topology and geometry of data. The ultimate goal of DataShape is to develop and promote TDA as a new family of well-founded methods to uncover and exploit the geometry of data. This also includes the clarification of the position and complementarity of TDA with respect to other approaches and tools in data science. Our objective is also to provide practically efficient and flexible tools that can be used independently, complementarily or in combination with other classical data analysis and machine learning approaches.

TDA requires constructing and manipulating appropriate representations of complex and high-dimensional shapes. A major difficulty comes from the fact that the complexity of the data structures and algorithms used to approximate shapes grows rapidly as the dimensionality increases, which makes them intractable in high dimensions. We focus our research on simplicial complexes, which offer a convenient representation of general shapes and generalize graphs and triangulations. Our work includes the study of simplicial complexes with good approximation properties and the design of compact data structures to represent them.
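As a toy illustration (a deliberately naive sketch of ours, not the compact data structures developed in the project), a simplicial complex can be represented as the set of its simplices, each stored together with all of its faces:

```python
# Toy simplicial complex stored as a set of frozensets.  Real implementations
# (e.g. a simplex tree) share vertex prefixes between simplices to stay compact;
# here we only illustrate the closure-under-faces property of a complex.
from itertools import combinations

class SimplicialComplex:
    def __init__(self):
        self.simplices = set()  # each simplex is a frozenset of vertices

    def insert(self, simplex):
        """Insert a simplex together with all of its faces."""
        simplex = tuple(sorted(simplex))
        for dim in range(1, len(simplex) + 1):
            for face in combinations(simplex, dim):
                self.simplices.add(frozenset(face))

    def dimension(self):
        return max(len(s) for s in self.simplices) - 1

    def num_simplices(self):
        return len(self.simplices)

# A triangle with one extra edge hanging off it:
K = SimplicialComplex()
K.insert([0, 1, 2])  # triangle -> 3 vertices, 3 edges, 1 triangle
K.insert([2, 3])     # extra edge -> 1 new vertex, 1 new edge
print(K.num_simplices())  # 9
print(K.dimension())      # 2
```

A simplex tree stores the same information as a trie over sorted vertex lists, which avoids this quadratic-in-the-faces redundancy.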

In low dimensions, effective shape reconstruction techniques exist that can provide precise geometric approximations very efficiently under reasonable sampling conditions. Extending those techniques to higher dimensions, as required in the context of TDA, is problematic, since almost all methods in low dimensions rely on the computation of a subdivision of the ambient space. A direct extension of those methods would immediately lead to algorithms whose complexity depends exponentially on the ambient dimension, which is prohibitive in most applications. A first direction to bypass the curse of dimensionality is to develop algorithms whose complexity depends on the intrinsic dimension of the data (which most of the time is small, although unknown) rather than on the dimension of the ambient space. Another direction is to resort to cruder approximations that only capture the homotopy type or the homology of the sampled shape. The recent theory of persistent homology provides a powerful and robust tool to study the homology of sampled spaces in a stable way.
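The 0-dimensional case gives a minimal, self-contained illustration of the persistence idea (a sketch of ours, not the project's algorithms): as a scale parameter grows, connected components of a sample merge, and each merge records the death of a component born at scale 0. For points on the real line this is a few lines of union-find:

```python
# 0-dimensional persistence of a sample of points on the real line.
# Every point spawns a component at scale 0; when two components meet
# (at a scale equal to an inter-point distance), the younger one dies
# ("elder rule").  One component survives forever.
from itertools import combinations

def zero_dim_persistence(points):
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted((abs(points[i] - points[j]), i, j)
                   for i, j in combinations(range(len(points)), 2))
    diagram = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            diagram.append((0.0, d))     # a component dies at scale d
            parent[ri] = rj
    diagram.append((0.0, float("inf")))  # the last component never dies
    return diagram

# Two clusters on the line: {0, 0.1} and {5, 5.2}.
print(zero_dim_persistence([0.0, 0.1, 5.0, 5.2]))
# two short bars (deaths ~0.1 and ~0.2) for the in-cluster merges,
# one long bar (death 4.9) when the clusters merge, and one infinite bar
```

The long bar signals the presence of two well-separated clusters, which is exactly the kind of stable, multiscale information persistence extracts.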

The wide variety of larger and larger available data sets, often corrupted by noise and outliers, requires considering the statistical properties of their topological and geometric features and proposing new relevant statistical models for their study.

There exist various statistical and machine learning methods intended to uncover the geometric structure of data. Beyond manifold learning and dimensionality reduction approaches, which generally do not make it possible to assert the relevance of the inferred topological and geometric features and are not well suited for the analysis of complex topological structures, set estimation methods intend to estimate, from random samples, a set around which the data are concentrated. In these methods, which include support and manifold estimation, principal curves/manifolds and their various generalizations, to name a few, the estimation problems are usually considered under losses, such as the Hausdorff distance or the symmetric difference, that are not sensitive to the topology of the estimated sets, preventing these tools from directly inferring topological or geometric information.
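For concreteness, the Hausdorff distance mentioned above can be computed between two finite point sets in a few lines (a straightforward sketch; the example point sets are ours):

```python
# Hausdorff distance between finite subsets A, B of the plane: the largest
# distance from a point of one set to its nearest point in the other.
# Such a loss measures geometric proximity but is blind to topology:
# two sets can be Hausdorff-close yet have different homology.
import math

def hausdorff(A, B):
    def directed(X, Y):
        return max(min(math.dist(x, y) for y in Y) for x in X)
    return max(directed(A, B), directed(B, A))

A = [(0.0, 0.0), (1.0, 0.0)]
B = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
print(hausdorff(A, B))  # 2.0: driven by the extra point of B, far from A
```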

Regarding purely topological features, the statistical estimation of the homology or homotopy type of compact subsets of Euclidean spaces has only been considered recently, most of the time under the quite restrictive assumption that the data are randomly sampled from smooth manifolds.

In a more general setting, with the emergence of new geometric inference tools based on the study of distance functions and of algebraic topology tools such as persistent homology, computational topology has recently seen an important development, offering a new set of methods to infer relevant topological and geometric features of data sampled in general metric spaces. The use of these tools remains largely heuristic, and until recently there were only a few preliminary results establishing connections between geometric inference, persistent homology and statistics. However, this direction has attracted a lot of attention over the last three years. In particular, stability properties and new representations of persistent homology information have led to very promising results to which the DataShape members have significantly contributed. These preliminary results open many perspectives and research directions that need to be explored.

Our goal is to build on our first statistical results in TDA to develop the mathematical foundations of Statistical Topological and Geometric Data Analysis. Combined with the other objectives, our ultimate goal is to provide a well-founded and effective statistical toolbox for the understanding of the topology and geometry of data.

This objective is driven by the problems raised by the use of topological and geometric approaches in machine learning. The goal is both to use our techniques to better understand the role of topological and geometric structures in machine learning problems and to apply our TDA tools to develop specialized topological approaches to be used in combination with other machine learning methods.

We develop a high-quality open source software platform called GUDHI, which is becoming a reference in geometric and topological data analysis in high dimensions. The goal is not to provide code tailored to the numerous potential applications but rather to provide the central data structures and algorithms that underlie applications in geometric and topological data analysis.

The development of the GUDHI platform also serves to benchmark and optimize new algorithmic solutions resulting from our theoretical work. Such development necessitates a whole line of research on software architecture and interface design, heuristics and fine-tuning optimization, robustness and arithmetic issues, and visualization. We aim to provide a full programming environment following the same recipes that made the CGAL library, the reference library in computational geometry, a success story.

Some of the algorithms implemented on the platform will also be interfaced to other software platforms, such as the R software for statistical computing, and to languages such as Python, in order to make them usable in combination with other data analysis and machine learning tools. A first attempt in this direction was made with the creation of an R package called TDA, in collaboration with the group of Larry Wasserman at Carnegie Mellon University (Inria Associated team CATS), which already includes some functionalities of the GUDHI library and implements some joint results between our team and the CMU team. A similar interface with the Python language is also considered a priority. To go even further towards helping users, we will provide utilities that perform the most common tasks without requiring any programming at all.

Our work is mostly of a fundamental mathematical and algorithmic nature but finds a variety of applications in data analysis, e.g., in material science, biology, sensor networks, 3D shape analysis and processing, to name a few.

More specifically, DataShape is working on the analysis of trajectories obtained from inertial sensors (PhD theses of Wojtek Riese and Alexandre Guérin with Sysnav, participation in the DGA/ANR challenge MALIN with Sysnav) and, more generally, on the development of new TDA methods for Machine Learning and Artificial Intelligence, applied to (multivariate) time-dependent data from various kinds of sensors in collaboration with Fujitsu, and to high-dimensional point cloud data with Metafora.

DataShape is also working in collaboration with Columbia University in New York, especially with the Rabadan lab, to improve bioinformatics methods and analyses for single-cell genomic data. For instance, much of this work aims to use TDA tools, such as persistent homology and the Mapper algorithm, to characterize, quantify and study the statistical significance of biological phenomena that occur in large-scale single-cell data sets. Such biological phenomena include, among others, the cell cycle, functional differentiation of stem cells, and immune system responses to breast cancer (such as the spatial response at the tissue location and the genomic response through protein expression).

The weekly research seminar of DataShape now takes place in hybrid mode. Travel by team members has been significantly reduced in recent years to limit the environmental footprint of the team.

The Gudhi library is an open source library for Computational Topology and Topological Data Analysis (TDA). It offers state-of-the-art algorithms to construct various types of simplicial complexes, data structures to represent them, and algorithms to compute geometric approximations of shapes and persistent homology.

The GUDHI library offers the following interoperable modules:

- Complexes:
  - Cubical
  - Simplicial: Rips, Witness, Alpha and Čech complexes
  - Cover: Nerve and Graph induced complexes
- Data structures and basic operations:
  - Simplex tree, Skeleton blockers and Toplex map
  - Construction, update, filtration and simplification
- Topological descriptors computation
- Manifold reconstruction
- Topological descriptors tools:
  - Bottleneck and Wasserstein distance
  - Statistical tools
  - Persistence diagram and barcode

Below is a list of changes made since GUDHI 3.7.0 (December 2022):

Perslay: a TensorFlow layer for representations of persistence diagrams.

Cover complex: new classes to compute Mapper, Graph Induced complexes and Nerves with a scikit-learn-like interface.

Persistent cohomology: a new linear-time compute_persistence_of_function_on_line, also available through CubicalPersistence in Python.

Cubical complex: added the possibility to build a lower-star filtration from vertices instead of top-dimensional cubes, and a much faster implementation of the 2D case with input from top-dimensional cells.

Hera: the Hera version of the Wasserstein distance now provides the matching in its interface.

Subsampling: a new choose_n_farthest_points_metric, a faster alternative to choose_n_farthest_points.

SimplexTree: SimplexTree can now be used with Python pickle; a helper for_each_simplex applies a given function object to each simplex; a new option link_nodes_by_label, when set to true, speeds up access to cofaces and stars; a new option stable_simplex_handles, when set to true, keeps simplex handles valid even after insertions or removals.

Čech complex: a function assign_MEB_filtration that assigns to each simplex a filtration value equal to the squared radius of its minimal enclosing ball (MEB), given a simplicial complex and an embedding of its vertices. Applied to a Delaunay triangulation, it computes the Delaunay-Čech filtration.
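For simplices with at most three vertices in the plane, the MEB filtration value can be sketched in a few lines (our illustrative reimplementation, not the GUDHI assign_MEB_filtration code):

```python
# Squared radius of the minimal enclosing ball (MEB) of at most 3 points in the
# plane.  Assigning this value to each simplex of a complex yields a Cech-type
# filtration; applied to a Delaunay triangulation it gives the Delaunay-Cech one.
from itertools import combinations

def sq_dist(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def meb_sq_radius(pts):
    if len(pts) == 1:
        return 0.0
    # Try every diametral ball of a pair of points.
    best = None
    for p, q in combinations(pts, 2):
        c = ((p[0] + q[0]) / 2, (p[1] + q[1]) / 2)
        r2 = sq_dist(p, q) / 4
        if all(sq_dist(c, x) <= r2 + 1e-12 for x in pts):
            best = r2 if best is None else min(best, r2)
    if best is not None:
        return best
    # Otherwise (acute triangle) the MEB is the circumscribed circle.
    (ax, ay), (bx, by), (cx, cy) = pts
    d = 2 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    ux = ((ax * ax + ay * ay) * (by - cy) + (bx * bx + by * by) * (cy - ay)
          + (cx * cx + cy * cy) * (ay - by)) / d
    uy = ((ax * ax + ay * ay) * (cx - bx) + (bx * bx + by * by) * (ax - cx)
          + (cx * cx + cy * cy) * (bx - ax)) / d
    return sq_dist((ux, uy), pts[0])

print(meb_sq_radius([(0, 0), (2, 0)]))          # 1.0: diametral ball of the edge
print(meb_sq_radius([(0, 0), (2, 0), (0, 2)]))  # 2.0: right triangle, ball of the hypotenuse
print(meb_sq_radius([(0, 0), (2, 0), (1, 2)]))  # 1.5625: acute triangle, circumcircle
```

Note how, for a non-acute triangle, the MEB radius is smaller than the circumradius; this is what distinguishes the Čech from the Alpha filtration values.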

Edge collapse: a Python function reduce_graph to simplify a clique filtration (represented as a sparse weighted graph) while preserving its persistent homology.

0-dimensional persistent homology is known, from a computational point of view, as the easy case. Indeed, given a list of edges sorted by filtration value, it essentially reduces to maintaining connected components with a union-find data structure.
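The 0-dimensional computation can be sketched for a function sampled on a line (our illustration of the standard union-find formulation, with zero-length pairs discarded):

```python
# 0-dimensional sublevel-set persistence of a function given by its values
# f[0..n-1] on the vertices of a line graph (edges join consecutive vertices).
# Components are born at local minima and die when merging, following the
# elder rule: the component with the larger birth value dies first.

def persistence_on_line(f):
    n = len(f)
    order = sorted(range(n), key=lambda i: f[i])
    parent, birth = {}, {}

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    pairs = []
    for v in order:                       # sweep vertices by increasing value
        parent[v], birth[v] = v, f[v]
        for u in (v - 1, v + 1):          # look at already-inserted neighbours
            if 0 <= u < n and u in parent:
                ru, rv = find(u), find(v)
                if ru != rv:
                    young, old = (ru, rv) if birth[ru] >= birth[rv] else (rv, ru)
                    if birth[young] < f[v]:        # skip zero-length pairs
                        pairs.append((birth[young], f[v]))
                    parent[young] = old
    pairs.append((min(f), float("inf")))  # the global minimum never dies
    return pairs

print(persistence_on_line([0.0, 2.0, 1.0, 3.0]))  # [(1.0, 2.0), (0.0, inf)]
```

The local minimum of value 1.0 dies when its component merges, at value 2.0, with the older component born at the global minimum 0.0.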

In collaboration with André Lieutier.

In 27 we introduce a pruning of the medial axis called the

Isomanifolds are the generalization of isosurfaces to arbitrary dimension and codimension, i.e., submanifolds of Euclidean space defined as the zero set of a smooth function.

Kleinjohann (Archiv der Mathematik 35(1):574–582, 1980; Mathematische Zeitschrift 176(3), 327–344, 1981) and Bangert (Archiv der Mathematik 38(1):54–57, 1982) extended the reach

In collaboration with Arijit Ghosh and Ramsay Dyer.

In 11, we present criteria for establishing a triangulation of a manifold. Given a manifold

In 14, we investigate the existence of sufficient local conditions under which poset representations decompose as direct sums of indecomposables from a given class. In our work, the indexing poset is the product of two totally ordered sets, corresponding to the setting of 2-parameter persistence in topological data analysis. Our indecomposables of interest belong to the class of so-called interval modules, which by definition are indicator representations of intervals in the poset. While the whole class of interval modules does not admit such a local characterization, we show that the subclass of rectangle modules does admit one, and that it is, in some precise sense, the largest subclass to do so.
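In standard notation (recalled here for convenience; $P$ denotes the indexing poset and $k$ the base field, neither fixed by the text above), the interval module associated with an interval $I \subseteq P$ is the representation

$$
k_I(p) = \begin{cases} k & \text{if } p \in I,\\ 0 & \text{otherwise,} \end{cases}
\qquad
k_I(p \le q) = \begin{cases} \mathrm{id}_k & \text{if } p, q \in I,\\ 0 & \text{otherwise.} \end{cases}
$$

Rectangle modules are the interval modules whose interval is a product $I_1 \times I_2$ of an interval of each axis.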

We introduce in 23 a theoretical and computational framework to use discrete Morse theory as an efficient preprocessing step for computing zigzag persistent homology. From a zigzag filtration of complexes, we construct a smaller zigzag Morse filtration whose complexes are Morse reductions of the original ones.

This paper 24 investigates the properties of the persistence diagrams stemming from almost
surely continuous random processes on

In 58, we present a method to construct signatures of periodic-like data. Based on topological considerations, our construction encodes information about the order and values of local extrema. Its main strength is robustness to reparametrisation of the observed signal, so that it depends only on the form of the periodic function. The signature converges as the observation contains increasingly many periods. We show that it can be estimated from the observation of a single time series using bootstrap techniques.
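A minimal sketch (ours, not the paper's construction) of the kind of invariant involved: extracting the alternating sequence of local extremum values of a sampled signal, which is unchanged when the signal is reparametrised in time:

```python
# Extract the alternating sequence of local extremum values of a sampled signal.
# Sampling the same underlying function at other increasing times (so that the
# extrema are still hit) leaves this sequence unchanged: it depends only on the
# form of the signal, not on its time parametrisation.

def extrema_sequence(x):
    out = [x[0]]
    for v in x[1:]:
        if v == out[-1]:
            continue                  # ignore flat steps
        if len(out) >= 2 and (out[-1] - out[-2]) * (v - out[-1]) > 0:
            out[-1] = v               # still monotone: extend the current run
        else:
            out.append(v)             # direction change: record a new extremum
    return out

signal = [0, 1, 3, 2, 1, 4, 4, 0]
print(extrema_sequence(signal))                          # [0, 3, 1, 4, 0]
# the same ups and downs sampled at a different rate:
print(extrema_sequence([0, 3, 2.5, 2, 1, 2, 4, 2, 0]))   # [0, 3, 1, 4, 0]
```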

In 21, we propose two multiscale comparisons of graphs using heat diffusion, which allow graphs to be compared without node correspondence, or even with different sizes. These multiscale comparisons lead to the definition of Lipschitz-continuous empirical processes indexed by a real parameter. The statistical properties of empirical means of such processes are studied in the general case. Under mild assumptions, we prove a functional Central Limit Theorem, as well as a Gaussian approximation with a rate depending only on the sample size. Once applied to our processes, these results allow us to analyze data sets of pairs of graphs. We design consistent confidence bands around empirical means and consistent two-sample tests, using bootstrap methods. Their performance is evaluated by simulations on synthetic data sets.

In 44, we consider noisy observations of a distribution with unknown support. In the deconvolution model, it has been proved recently that, under very mild assumptions, it is possible to solve the deconvolution problem without knowing the noise distribution and with no sample of the noise. We first give general settings where the theory applies and provide classes of supports that can be recovered in this context. We then exhibit classes of distributions over which we prove adaptive minimax rates (up to a log log factor) for the estimation of the support in Hausdorff distance. Moreover, for the class of distributions with compact support, we provide estimators of the unknown (in general singular) distribution and prove maximum rates in Wasserstein distance. We also prove an almost matching lower bound on the associated minimax risk.

We introduce a novel gradient descent algorithm extending the well-known Gradient Sampling methodology to the class of stratifiably smooth objective functions, which are defined as locally Lipschitz functions that are smooth on some regular pieces (called the strata) of the ambient Euclidean space. For this class of functions, our algorithm achieves a sub-linear convergence rate. We then apply our method to objective functions based on the (extended) persistent homology map computed over lower-star filters, which is a central tool of Topological Data Analysis. For this, we propose an efficient exploration of the corresponding stratification by using the Cayley graph of the permutation group. Finally, we provide benchmark and novel topological optimization problems in order to demonstrate the utility and applicability of our framework.

The Fermat distance has recently been established as a useful tool for machine learning tasks when a natural distance is not directly available to the practitioner, or to improve the results given by Euclidean distances by exploiting the geometric and statistical properties of the dataset. This distance depends on a parameter

Despite their successful application to a variety of tasks, neural networks remain limited, like other machine learning methods, by their sensitivity to shifts in the data: their performance can be severely impacted by differences in distribution between the data on which they were trained and that on which they are deployed. In 54, we propose a new family of representations, called MAGDiff, that we extract from any given neural network classifier and that allows for efficient covariate data shift detection without the need to train a new model dedicated to this task. These representations are computed by comparing the activation graphs of the neural network for samples belonging to the training distribution and to the target distribution, and yield powerful data- and task-adapted statistics for the two-sample tests commonly used for data set shift detection. We demonstrate this empirically by measuring the statistical powers of two-sample Kolmogorov-Smirnov (KS) tests on several different data sets and shift types, and showing that our novel representations induce significant improvements over a state-of-the-art baseline relying on the network output.
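For background (a generic textbook statistic, not the MAGDiff code itself), the two-sample Kolmogorov-Smirnov statistic used for shift detection is simply the largest gap between the two empirical distribution functions:

```python
# Two-sample Kolmogorov-Smirnov statistic: the supremum over t of the absolute
# difference between the empirical CDFs of the two samples.  Large values
# indicate that the two samples are unlikely to come from the same distribution.

def ks_statistic(xs, ys):
    xs, ys = sorted(xs), sorted(ys)
    nx, ny = len(xs), len(ys)
    i = j = 0
    d = 0.0
    while i < nx and j < ny:
        t = min(xs[i], ys[j])
        while i < nx and xs[i] == t:   # advance through ties together
            i += 1
        while j < ny and ys[j] == t:
            j += 1
        d = max(d, abs(i / nx - j / ny))
    return d

print(ks_statistic([1, 2, 3, 4], [1, 2, 3, 4]))       # 0.0: identical samples
print(ks_statistic([0, 1, 2, 3], [10, 11, 12, 13]))   # 1.0: disjoint supports
```

In the setting above, the samples fed to such a test are not raw inputs but the representations extracted from the network, which is what makes the test sensitive to covariate shift.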

Quantification learning deals with the task of estimating the target label distribution under label shift. In 26, we established a unifying framework, distribution feature matching (DFM), that recovers as particular instances various estimators introduced in previous literature. We derived a general performance bound for DFM procedures, improving in several key aspects upon previous bounds derived in particular cases. We then extended this analysis to study robustness of DFM procedures in the misspecified setting under departure from the exact label shift hypothesis, in particular in the case of contamination of the target by an unknown distribution. These theoretical findings were confirmed by a detailed numerical study on simulated and real-world datasets. We also introduced an efficient, scalable and robust version of kernel-based DFM using the Random Fourier Feature principle.

This paper received the "Best student paper" award at the ECML/PKDD conference 2023.
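A toy instance of distribution feature matching (our sketch, with identity feature map, one scalar feature and two classes; the paper's kernel-based estimators are far more general): under label shift, the target mean is a mixture of the source class-conditional means, so the shifted class proportion is recovered in closed form.

```python
# Distribution feature matching, toy version: with feature map phi(x) = x and
# two classes, the target mean satisfies  mu_T = pi * mu_1 + (1 - pi) * mu_0,
# where mu_0, mu_1 are the class-conditional means estimated on labeled source
# data.  Matching means therefore recovers the target class-1 proportion pi.

def mean(xs):
    return sum(xs) / len(xs)

def estimate_class1_proportion(source_class0, source_class1, target):
    mu0, mu1, mu_t = mean(source_class0), mean(source_class1), mean(target)
    pi = (mu_t - mu0) / (mu1 - mu0)
    return min(1.0, max(0.0, pi))    # clip to a valid proportion

# class 0 centred at 0, class 1 centred at 10; the target is 75% class 1
src0 = [-1.0, 0.0, 1.0]
src1 = [9.0, 10.0, 11.0]
target = [0.0, 10.0, 10.0, 10.0]
print(estimate_class1_proportion(src0, src1, target))  # 0.75
```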

We addressed in 25 the multiple testing problem under the assumption that the true/false hypotheses are driven by a Hidden Markov Model (HMM), which is recognized as a fundamental setting to model multiple testing under dependence since the seminal work of Sun and Cai (2009). While previous work has concentrated on deriving specific procedures with a controlled False Discovery Rate (FDR) under this model, following a recent trend in selective inference, we considered the problem of establishing confidence bounds on the false discovery proportion (FDP), for a user-selected set of hypotheses that can depend on the observed data in an arbitrary way. We developed a methodology to construct such confidence bounds first when the HMM model is known, then when its parameters are unknown and estimated, including the data distribution under the null and the alternative, using a nonparametric approach. In the latter case, we proposed a bootstrap-based methodology to take into account the effect of parameter estimation error. We showed that taking advantage of the assumed HMM structure allows for a substantial improvement of confidence bound sharpness over existing agnostic (structure-free) methods, as witnessed both via numerical experiments and real data examples.

Persistent homology (PH) provides topological descriptors for geometric data, such as weighted graphs, which are interpretable, stable to perturbations, and invariant under, e.g., relabeling. Most applications of PH focus on the one-parameter case – where the descriptors summarize the changes in topology of data as it is filtered by a single quantity of interest – and there is now a wide array of methods enabling the use of one-parameter PH descriptors in data science, which rely on the stable vectorization of these descriptors as elements of a Hilbert space. Although the multiparameter PH (MPH) of data that is filtered by several quantities of interest encodes much richer information than its one-parameter counterpart, the scarceness of stability results for MPH descriptors has so far limited the available options for the stable vectorization of MPH. In this paper, we aim to bring together the best of both worlds by showing how the interpretation of signed barcodes – a recent family of MPH descriptors – as signed measures leads to natural extensions of vectorization strategies from one parameter to multiple parameters. The resulting feature vectors are easy to define and to compute, and provably stable. While, as a proof of concept, we focus on simple choices of signed barcodes and vectorizations, we already see notable performance improvements when comparing our feature vectors to state-of-the-art topology-based methods on various types of data.

Topological data analysis (TDA) is an area of data science that focuses on using invariants from algebraic topology to provide multiscale shape descriptors for geometric data sets such as point clouds. One of the most important such descriptors is persistent homology, which encodes the change in shape as a filtration parameter changes; a typical parameter is the feature scale. For many data sets, it is useful to simultaneously vary multiple filtration parameters, for example feature scale and density. While the theoretical properties of single parameter persistent homology are well understood, less is known about the multiparameter case. In particular, a central question is the problem of representing multiparameter persistent homology by elements of a vector space for integration with standard machine learning algorithms. Existing approaches to this problem either ignore most of the multiparameter information to reduce to the one-parameter case or are heuristic and potentially unstable in the face of noise. In this article, we introduce a new general representation framework that leverages recent results on decompositions of multiparameter persistent homology. This framework is rich in information, fast to compute, and encompasses previous approaches. Moreover, we establish theoretical stability guarantees under this framework as well as efficient algorithms for practical computation, making this framework an applicable and versatile tool for analyzing geometric and point cloud data. We validate our stability results and algorithms with numerical experiments that demonstrate statistical convergence, prediction accuracy, and fast running times on several real data sets.

Quantum topology provides various frameworks for defining and computing invariants of manifolds. One such framework of substantial interest in both mathematics and physics is the Turaev-Viro-Barrett-Westbury state sum construction, which uses the data of a spherical fusion category to define topological invariants of triangulated 3-manifolds via tensor network contractions. In this work 47 we consider a restricted class of state sum invariants of 3-manifolds derived from Tambara-Yamagami categories. These categories are particularly simple, being entirely specified by three pieces of data: a finite abelian group, a bicharacter of that group, and a sign

We present three “hard” diagrams of the unknot 15. They require (at least) three extra crossings before they can be simplified to the trivial unknot diagram via Reidemeister moves in S2. Both examples are constructed by applying previously proposed methods. The proof of their hardness uses significant computational resources. We also determine that no small “standard” example of a hard unknot diagram requires more than one extra crossing for Reidemeister moves in S2.

Inspired by the strengths of quadric error metrics initially designed for mesh decimation, we propose a concise mesh reconstruction approach for 3D point clouds 32. Our approach proceeds by clustering the input points enriched with quadric error metrics, where the generator of each cluster is the optimal 3D point for the sum of its quadric error metrics. This approach favors the placement of generators on sharp features and tends to equidistribute the error among clusters. We reconstruct the output surface mesh from the adjacency between clusters and a constrained binary solver. We combine our clustering process with an adaptive refinement driven by the error. Compared to prior art, our method avoids dense reconstruction prior to simplification and immediately produces an optimized mesh.

We present 43 two lower bounds that hold with high probability for random point sets. We first give a new, and elementary, proof that the classical models of random point sets (uniform in a smooth convex body, uniform in a polygon, Gaussian) have a superconstant number of extreme points with high probability. We next prove that any algorithm that determines the orientation of all triples in a planar set of n points (that is, the order type of the point set) from their Cartesian coordinates must read with high probability

In this paper 51 we show that a Möbius-structure

is a continuous operator on the weighted

From here we construct a Sobolev space

The work is inspired by a paper by Astengo, Cowling, and Di Blasio, who construct uniformly bounded representations for simple Lie groups of rank 1. We formulate the problem in the much more general framework of groups acting on Möbius structures; in particular, this covers all hyperbolic groups.

We investigated in 30 the problem of cumulative regret minimization for individual sequence prediction with respect to the best expert in a finite family of size a posteriori the losses of at most

In 42, by interpreting the output of Principal Component Analysis, that is, the covariance matrix, as a sequence of nested subspaces naturally coming with weights according to the level of approximation they provide, we are able to embed all

Collaboration with Sysnav, a French SME with world-leading expertise in navigation and geopositioning in extreme environments, on TDA, geometric approaches and machine learning for the analysis of movements of pedestrians and patients equipped with inertial sensors (CIFRE PhD of Alexandre Guérin).

Research collaboration with Fujitsu on the development of new TDA methods and tools for Machine learning and Artificial Intelligence (started in Dec 2017).

Research collaboration with MetaFora on the development of new TDA-based and statistical methods for the analysis of cytometric data (started in Nov. 2019).

Collaboration with Dassault Systèmes and Inria team Geomerix (Saclay) on the applications of methods from geometric measure theory to the modelling and processing of complex 3D shapes (PhD of Lucas Brifault, started in May 2022).

- Acronym : TopAI

- Type : ANR Chair in AI.

- Title : Topological Data Analysis for Machine Learning and AI

- Coordinator : Frédéric Chazal

- Duration : 4 years from September 2020 to August 2024.

- Other Partners: two industrial partners, the French SME Sysnav and the French start-up MetaFora.

- Abstract:

The TopAI project aims at developing a world-leading research activity on topological and geometric approaches in Machine Learning (ML) and AI, with a double academic and industrial/societal objective. First, building on the strong expertise of the candidate and his team in TDA, TopAI aims at designing new mathematically well-founded topological and geometric methods and tools for Data Analysis and ML, and at making them available to the data science and AI community through state-of-the-art software tools. Second, thanks to already established close collaborations and the strong involvement of French industrial partners, TopAI aims at exploiting its expertise and tools to address a set of challenging problems with high societal and economic impact in personalized medicine and AI-assisted medical diagnosis.

- Acronym : ALGOKNOT.

- Type : ANR Jeune Chercheuse Jeune Chercheur.

- Title : Algorithmic and Combinatorial Aspects of Knot Theory.

- Coordinator : Clément Maria.

- Duration : 2020 – 2025 (5 years).

- Abstract: The project AlgoKnot aims at strengthening our understanding of the computational and combinatorial complexity of the diverse facets of knot theory, as well as designing efficient algorithms and software to study their interconnections.

- See also: Clément Maria and ANR AlgoKnot.

- Acronym: GeMfaceT.

- Type: ANR JCJC – CES 40 – Mathématiques

- Title: A bridge between Geometric Measure and Discrete Surface Theories

- Coordinator: Blanche Buet.

- Duration: 48 months, starting October 2021.

- Abstract: This project is positioned at the interface between geometric measure theory and discrete surface theory. There has recently been a growing interest in non-smooth structures, both from a theoretical point of view, where singularities occur in famous optimization problems such as the Plateau problem or in geometric flows such as the mean curvature flow, and from an applied point of view, where complex high-dimensional data are no longer assumed to lie on a smooth manifold but are more singular, allowing crossings, tree structures and dimension variations. We propose in this project to strengthen and expand the use of geometric measure concepts in the study of discrete surfaces and the modelling of complex data, and also to use these possibly singular discrete surfaces to compute numerical solutions to the aforementioned problems.

Research collaboration between DataShape and IFPEN on TDA applied to various problems arising from energy transition and sustainable mobility.

Research collaboration on anomaly detection for multivariate time series using TDA and ML approaches.

- Type : Paris Region PhD 2021.

- Title : Comparison of cytometric data (Comparaison de données cytométriques).

The Île-de-France region funds a PhD thesis in collaboration with Metafora biosystems, a company specialized in the analysis of cells through their metabolism. The PhD student, Bastien Dussap, is supervised by Gilles Blanchard and Marc Glisse; the goal of the thesis is to compare blood samples using statistical methods.