DataShape is a research project in Topological Data Analysis (TDA), a recent field whose aim is to uncover, understand and exploit the topological and geometric structure underlying complex and possibly high dimensional data. The overall objective of the DataShape project is to settle the mathematical, statistical and algorithmic foundations of TDA and to disseminate and promote our results in the data science community.

The approach of DataShape relies on the conviction that it is necessary to combine statistical, topological/geometric and computational approaches in a common framework, in order to face the challenges of TDA. Another conviction of DataShape is that TDA needs to be combined with other data science approaches and tools to lead to successful real applications. It is necessary for TDA challenges to be simultaneously addressed from the fundamental and applied sides.

The team members have actively contributed to the emergence of TDA during the last few years. The variety of expertise, going from fundamental mathematics to software development, and the strong interactions within our team as well as numerous well established international collaborations make our group one of the best to achieve these goals.

The expected output of DataShape is two-fold. First, we intend to set up and develop the mathematical, statistical and algorithmic foundations of Topological and Geometric Data Analysis. Second, we intend to pursue the development of the GUDHI platform, initiated by the team members and which is becoming a standard tool in TDA, in order to provide an efficient state-of-the-art toolbox for the understanding of the topology and geometry of data. The ultimate goal of DataShape is to develop and promote TDA as a new family of well-founded methods to uncover and exploit the geometry of data. This also includes the clarification of the position and complementarity of TDA with respect to other approaches and tools in data science. Our objective is also to provide practically efficient and flexible tools that could be used independently, complementarily or in combination with other classical data analysis and machine learning approaches.

tda requires to construct and manipulate appropriate representations
of complex and high dimensional shapes. A major difficulty comes from
the fact that the complexity of data structures and algorithms used to
approximate shapes rapidly grows as the dimensionality increases,
which makes them intractable in high dimensions. We focus our research
on simplicial complexes which offer a convenient representation of
general shapes and generalize graphs and triangulations. Our work
includes the study of simplicial complexes with good approximation
properties and the design of compact data structures to represent them.

In low dimensions, effective shape reconstruction techniques exist
that can provide precise geometric approximations very efficiently and
under reasonable sampling conditions. Extending those techniques to
higher dimensions as is required in the context of tda is
problematic since almost all methods in low dimensions rely on the
computation of a subdivision of the ambient space. A direct extension
of those methods would immediately lead to algorithms whose
complexities depend exponentially on the ambient dimension, which is
prohibitive in most applications. A first direction to by-pass the
curse of dimensionality is to develop algorithms whose complexities
depend on the intrinsic dimension of the data (which most of the time
is small although unknown) rather than on the dimension of the ambient
space. Another direction is to resort to cruder approximations that
only captures the homotopy type or the homology of the sampled
shape. The recent theory of persistent homology provides a powerful and robust tool to study the homology of sampled spaces in a stable way.

The wide variety of larger and larger available data - often corrupted by noise and outliers - requires to consider the statistical properties of their topological and geometric features and to propose new relevant statistical models for their study.

There exist various statistical and machine learning methods intending to uncover the geometric structure of data. Beyond manifold learning and dimensionality reduction approaches that generally do not allow to assert the relevance of the inferred topological and geometric features and are not well-suited for the analysis of complex topological structures, set estimation methods intend to estimate, from random samples, a set around which the data is concentrated. In these methods, that include support and manifold estimation, principal curves/manifolds and their various generalizations to name a few, the estimation problems are usually considered under losses, such as Hausdorff distance or symmetric difference, that are not sensitive to the topology of the estimated sets, preventing these tools to directly infer topological or geometric information.

Regarding purely topological features, the statistical estimation of homology or homotopy type of compact subsets of Euclidean spaces, has only been considered recently, most of the time under the quite restrictive assumption that the data are randomly sampled from smooth manifolds.

In a more general setting, with the emergence of new geometric inference tools based on the study of distance functions and algebraic topology tools such as persistent homology, computational topology has recently seen an important development offering a new set of methods to infer relevant topological and geometric features of data sampled in general metric spaces. The use of these tools remains widely heuristic and until recently there were only a few preliminary results establishing connections between geometric inference, persistent homology and statistics. However, this direction has attracted a lot of attention over the last three years. In particular, stability properties and new representations of persistent homology information have led to very promising results to which the DataShape members have significantly contributed. These preliminary results open many perspectives and research directions that need to be explored.

Our goal is to build on our first statistical results in tda to develop the mathematical foundations of Statistical Topological and Geometric Data Analysis. Combined with the other objectives, our ultimate goal is to provide a well-founded and effective statistical toolbox for the understanding of topology and geometry of data.

This objective is driven by the problems raised by the use of
topological and geometric approaches in machine learning. The goal is both to use our techniques
to better understand the role of topological and geometric structures in machine learning
problems and to apply our tda tools to develop specialized topological approaches to be used
in combination with other machine learning methods.

We develop a high quality open source software platform called gudhi which is becoming a reference in geometric and topological data analysis in high dimensions. The goal
is not to provide code tailored to the numerous potential applications but rather to provide the central data structures and algorithms that underlie applications in geometric and topological data analysis.

The development of the gudhi platform also serves to benchmark and optimize new algorithmic solutions resulting from our theoretical work. Such development necessitates a whole line of research on software architecture and
interface design, heuristics and fine-tuning optimization, robustness and arithmetic issues, and visualization.
We aim at providing a full programming environment following the same recipes that made up the success story of the cgal library, the reference library in computational geometry.

Some of the algorithms implemented on the platform will also be interfaced to other software platforms, such as the for statistical computing, and languages such as Python in order to make them usable in combination with other data analysis and machine learning tools. A first attempt in this direction has been done with the creation of an R package called TDA in collaboration with the group of Larry Wasserman at Carnegie Mellon University (Inria Associated team CATS) that already includes some functionalities of the gudhi library and implements some joint results between our team and the CMU team. A similar interface with the Python language is also considered a priority. To go even further towards helping users, we will provide utilities that perform the most common tasks without requiring any programming at all.

Our work is mostly of a fundamental mathematical and algorithmic nature but finds a variety of applications in data analysis, e.g., in material science, biology, sensor networks, 3D shape analysis and processing, to name a few.

More specifically, DataShape is working on the analysis of trajectories obtained from inertial sensors (PhD theses of Wojtek Riese Alexandre Guérin with Sysnav, participation to the DGA/ANR challenge MALIN with Sysnav) and, more generally on the development of new TDA methods for Machine Learning and Artificial Intelligence for (multivariate) time-dependent data from various kinds of sensors in collaboration with Fujitsu, or high dimensional point cloud data with MetaFora.

DataShape is also working in collaboration with the University of Columbia in New-York, especially with the Rabadan lab, in order to improve bioinformatics methods and analyses for single cell genomic data. For instance, there is a lot of work whose aim is to use TDA tools such as persistent homology and the Mapper algorithm to characterize, quantify and study statistical significance of biological phenomena that occur in large scale single cell data sets. Such biological phenomena include, among others: the cell cycle, functional differentiation of stem cells, and immune system responses (such as the spatial response on the tissue location, and the genomic response with protein expression) to breast cancer.

The weekly research seminar of DataShape is now taking place online, and travels for the team members have decreased a lot this year, mainly because of the COVID-19 pandemic.

Previous works on lexicographic optimal chains have shown that they provide meaningful geometric homology representatives while being easier to compute than their

-norm optimal counterparts. This work

presents a novel algorithm to efficiently compute lexicographic optimal chains with a given boundary in a triangulation of 3-space, by leveraging Lefschetz duality and an augmented version of the disjoint-set data structure. Furthermore, by observing that lexicographic minimization is a linear operation, we define a canonical basis of lexicographic optimal chains, called critical basis, and show how to compute it. In applications, the presented algorithms offer new promising ways of efficiently reconstructing open surfaces in difficult acquisition scenarios.

We give

a lower bound of

on the maximal complexity of the Euclidean Voronoi diagram of

non-intersecting lines in

for

.

Density estimation appears as a subroutine in many learning procedures, so it is of crucial interest to have efficient methods to perform it in practical situations. Multidimensional density estimation faces the curse of dimensionality. To tackle this issue, a solution is to add a structural hypothesis through an undirected graphical model on the underlying distribution. We propose

ISDE (Independence Structure Density Estimation) an algorithm designed to estimate a density and an undirected graphical model from a special family of graphs corresponding to an independence structure, where features can be separated into independent groups. It is designed for moderately high dimensional data (up to 15 features) and it can be used in parametric as well as nonparametric situations. Existing methods on nonparametric graph estimation focus on multidimensional dependencies only through pairwise ones. ISDE does not suffer from this restriction and can addresses structures not covered yet by available algorithms. In this paper, we present existing theory about independence structure, explain the construction of our algorithm and prove its effectiveness on simulated data both quantitatively, through measures of density estimation performance under Kullback-Leibler loss and qualitatively, in terms of recovering of independence structures. We also provide information about running time.

In collaboration with Quentin Mérigot (Laboratoire de Mathématiques d'Orsay, Univ. Paris-Saclay).

We derive nearly tight and non-asymptotic convergence bounds for solutions of entropic semi-discrete optimal transport . These bounds quantify the stability of the dual solutions of the regularized problem (sometimes called Sinkhorn potentials) w.r.t. the regularization parameter, for which we ensure a better than Lipschitz dependence. Such facts may be a first step towards a mathematical justification of annealing or ε-scaling heuristics for the numerical resolution of regularized semi-discrete optimal transport. Our results also entail a non-asymptotic and tight expansion of the difference between the entropic and the unregularized costs.

In collaboration with C. Levrard (LPSM, Sorbonne Université), K. Balasubramanian (UC. Davis), W. Polonik (UC. Davis).

Persistence diagrams (PDs) are the most common descriptors used to encode the topology of structured data appearing in challenging learning tasks; think e.g. of graphs, time series or point clouds sampled close to a manifold. Given random objects and the corresponding distribution of PDs, one may want to build a statistical summary-such as a mean-of these random PDs, which is however not a trivial task as the natural geometry of the space of PDs is not linear. In this article , we study two such summaries, the Expected Persistence Diagram (EPD), and its quantization. The EPD is a measure supported on

We provide a short proof that the Wasserstein distance between the empirical measure of a n-sample and the estimated measure is of order

Assume that we observe i.i.d. points lying close to some unknown d-dimensional

In collaboration with C. Levrard (LPSM, Sorbonne Université.

This work addresses the case where data come as point sets, or more generally as discrete measures. Our motivation is twofold: first we intend to approximate with a compactly supported measure the mean of the measure generating process, that coincides with the intensity measure in the point process framework, or with the expected persistence diagram in the framework of persistence-based topological data analysis. To this aim we provide two algorithms that we prove almost minimax optimal. Second we build from the estimator of the mean measure a vectorization map, that sends every measure into a finite-dimensional Euclidean space, and investigate its properties through a clustering-oriented lens. In a nutshell, we show that in a mixture of measure generating processes, our technique yields a representation in

In collaboration with Yuichi Ike (Fujitsu) and Hariprasad Kannan.

Solving optimization tasks based on functions and losses with a topological flavor is a very active, growing field of research in data science and Topological Data Analysis, with applications in non-convex optimization, statistics and machine learning. However, the approaches proposed in the literature are usually anchored to a specific application and/or topological construction, and do not come with theoretical guarantees. To address this issue, we study the differentiability of a general map associated with the most common topological construction, that is, the persistence map . Building on real analytic geometry arguments, we propose a general framework that allows us to define and compute gradients for persistence-based functions in a very simple way. We also provide a simple, explicit and sufficient condition for convergence of stochastic subgradient methods for such functions. This result encompasses all the constructions and applications of topological optimization in the literature. Finally, we provide associated code, that is easy to handle and to mix with other non-topological methods and constraints, as well as some experiments showcasing the versatility of our approach.

In collaboration with Elchanan Solomon (Department of Mathematics, Duke University).

Stable topological invariants are a cornerstone of persistence theory and applied topology, but their discriminative properties are often poorly-understood. In we study a rich homology-based invariant first defined by Dey, Shi, and Wang, which we think of as embedding a metric graph in the barcode space. We prove that this invariant is locally injective on the space of metric graphs and globally injective on a GH-dense subset. Moreover, we show that is globally injective on a full measure subset of metric graphs, in the appropriate sense.

In collaboration with Jacob Leygonie and Ulrike Tillmann (Mathematical Institute of Oxford).

In collaboration with Jacob Leygonie (Mathematical Institute of Oxford).

We introduce a novel gradient descent algorithm extending the well-known Gradient Sampling methodology to the class of stratifiably smooth objective functions, which are defined as locally Lipschitz functions that are smooth on some regular pieces-called strata-of the ambient Euclidean space . For this class of functions, our algorithm achieves a sub-linear convergence rate. We then apply our method to objective functions based on the (extended) persistent homology map computed over lower-star filters, which is a central tool of Topological Data Analysis. For this, we propose an efficient exploration of the corresponding stratification by using the Cayley graph of the permutation group. Finally, we provide benchmark and novel topological optimization problems, in order to demonstrate the utility and applicability of our framework.

In collaboration with Yuhei Umeda (Fujitsu) and Yuichi Ike (Fujitsu).

Although neural networks are capable of reaching astonishing performances on a wide variety of contexts, properly training networks on complicated tasks requires expertise and can be expensive from a computational perspective. In industrial applications, data coming from an open-world setting might widely differ from the benchmark datasets on which a network was trained. Being able to monitor the presence of such variations without retraining the network is of crucial importance. In this article , we develop a method to monitor trained neural networks based on the topological properties of their activation graphs. To each new observation, we assign a Topological Uncertainty, a score that aims to assess the reliability of the predictions by investigating the whole network instead of its final layer only, as typically done by practitioners. Our approach entirely works at a post-training level and does not require any assumption on the network architecture, optimization scheme, nor the use of data augmentation or auxiliary datasets; and can be faithfully applied on a large range of network architectures and data types. We showcase experimentally the potential of Topological Uncertainty in the context of trained network selection, Out-Of-Distribution detection, and shift-detection, both on synthetic and real datasets of images and graphs.

In collaboration with with Yuhei Umeda (Fujitsu), Yuichi Ike (Fujitsu) and Clément Levrard (Univ. Paris Diderot).

Robust topological information commonly comes in the form of a set of persistence diagrams, finite measures that are in nature uneasy to affix to generic machine learning frameworks. We introduce a fast, learnt, unsupervised vectorization method for measures in Euclidean spaces and use it for reflecting underlying changes in topological behaviour in machine learning contexts. The algorithm is simple and efficiently discriminates important space regions where meaningful differences to the mean measure arise. It is proven to be able to separate clusters of persistence diagrams. We showcase the strength and robustness of our approach on a number of applications, from emulous and modern graph collections where the method reaches state-of-the-art performance to a geometric synthetic dynamical orbits problem. The proposed methodology comes with a single high level tuning parameter: the total measure encoding budget. We provide a completely open access software.

In collaboration with El Mehdi Saad (LMO, University Paris-Saclay)

Motivated by the question of frugal machine learning, we investigate in this paper the problem of minimizing the excess generalization error with respect to the best expert prediction in a finite family in the stochastic setting, under limited access to information: we assume that the learner only has access to a limited number of expert advices per training round, as well as for prediction. Under suitable assumptions on the loss (strong convexity and Lipschitz) We design novel algorithms achieving fast rates in this setting, and show that it is necessary to query at least two experts per training round to attain such fast rates.

Collaboration with T.Mary-Huard (AgroParisTech), V. Perduca (MAP5, Univ. Paris), M.-L. Martin-Magniette (AgroParisTech)

Collaboration with: H. Marienwald (U. Potsdam), J.-B. Fermanian (LMO, U. Paris-Saclay)

In the multi-task averaging problem, the goal is the joint estimation of the means of multiple distributions using separate, independent data sets. This is also known as the "many-means estimation problem" which has a long history in statistics. In this paper we are particularly interested in the setting where the ambient space is of large dimension and propose a method exploiting similarities between tasks, without any related information being known in advance. The principle is that each empirical mean is shrunk towards the local average of its neighbors, which are found by multiple testing. We prove theoretically that this approach provides a reduction in mean squared error. This improvement can be significant when the dimension of the input space is large, demonstrating a "blessing of dimensionality" phenomenon. An application of this approach is the estimation of multiple kernel mean embeddings.

Collaboration with: P. Neuvial (CNRS, U. Toulouse), E. Roquain (LPSM, U Sorbonne université)

Collaboration with: J.-B. Fermanian (LMO, U. Paris-Saclay)

Article to appear in the Festschrift in the honor of V. Spokoiny

In collaboration with Ewan Carr and Raquel Iniesta (King's College London) and Bertrand Michel (Ecole Centrale de Nantes).

In collaboration with Michael Bleher (Heidelberg University), Lukas Hahn (Heidelberg University), Juan Patiño-Galindo (Columbia University), Ulrich Bauer (TUM), Raúl Rabadán (Columbia University) and Andreas Ott (Heidelberg University).

The COVID-19 pandemic has lead to a worldwide effort to characterize its evolution through the mapping of mutations in the genome of the coronavirus SARS-CoV-2. As the virus spreads and evolves it acquires new mutations that could have important public health consequences, including higher transmissibility, morbidity, mortality, and immune evasion, among others. Ideally, we would like to quickly identify new mutations that could confer adaptive advantages to the evolving virus by leveraging the large number of SARS-CoV-2 genomes. One way of identifying adaptive mutations is by looking at convergent mutations, mutations in the same genomic position that occur independently. The large number of currently available genomes, more than a million at this moment, however precludes the efficient use of phylogeny-based techniques. In this work , we establish a fast and scalable Topological Data Analysis approach for the early warning and surveillance of emerging adaptive mutations of the coronavirus SARS-CoV-2 in the ongoing COVID-19 pandemic. Our method relies on a novel topological tool for the analysis of viral genome datasets based on persistent homology. It systematically identifies convergent events in viral evolution merely by their topological footprint and thus overcomes limitations of current phylogenetic inference techniques. This allows for an unbiased and rapid analysis of large viral datasets. We introduce a new topological measure for convergent evolution and apply it to the complete GISAID dataset as of February 2021, comprising 303,651 high-quality SARS-CoV-2 isolates taken from patients all over the world since the beginning of the pandemic. A complete list of mutations showing topological signals of convergence is compiled. We find that topologically salient mutations on the receptor-binding domain appear in several variants of concern and are linked with an increase in infectivity and immune escape. Moreover, for many adaptive mutations the topological signal precedes an increase in prevalence. We demonstrate the capability of our method to effectively identify emerging adaptive mutations at an early stage. By localizing topological signals in the dataset, we are able to extract geo-temporal information about the early occurrence of emerging adaptive mutations. The identification of these mutations can help to develop an alert system to monitor mutations of concern and guide experimentalists to focus the study of specific circulating variants.

In collaboration with Bertrand Michel (Ecole Centrale de Nantes).

Topological Data Analysis (TDA)is a recent and fast growing field providing a set of new topological and geometric tools to infer relevant features for possibly complex data. This paper is a brief introduction, through a few selected topics, to basic fundamental and practical aspects of TDA for non experts.

- Acronym : ASPAG.

- Type : ANR blanc.

- Title : Analysis and Probabilistic Simulations of Geometric Algorithms.

- Coordinator : Olivier Devillers (équipe Inria Gamble).

- Duration : 4 years from January 2018 to December 2021.

- Others Partners: Inria Gamble, LPSM, LABRI, Université de Rouen, IECL, Université du Littoral Côte d'Opale, Telecom ParisTech, Université Paris X (Modal'X), LAMA, Université de Poitiers, Université de Bourgogne.

- Abstract:

The analysis and processing of geometric data has become routine in a variety of human activities ranging from computer-aided design in manufacturing to the tracking of animal trajectories in ecology or geographic information systems in GPS navigation devices. Geometric algorithms and probabilistic geometric models are crucial to the treatment of all this geometric data, yet the current available knowledge is in various ways much too limited: many models are far from matching real data, and the analyses are not always relevant in practical contexts. One of the reasons for this state of affairs is that the breadth of expertise required is spread among different scientific communities (computational geometry, analysis of algorithms and stochastic geometry) that historically had very little interaction. The Aspag project brings together experts of these communities to address the problem of geometric data. We will more specifically work on the following three interdependent directions.

(1) Dependent point sets: One of the main issues of most models is the core assumption that the data points are independent and follow the same underlying distribution. Although this may be relevant in some contexts, the independence assumption is too strong for many applications.

(2) Simulation of geometric structures: The phenomena studied in (1) involve intricate random geometric structures subject to new models or constraints. A natural first step would be to build up our understanding and identify plausible conjectures through simulation. Perhaps surprisingly, the tools for an effective simulation of such complex geometric systems still need to be developed.

(3) Understanding geometric algorithms: the analysis of algorithm is an essential step in assessing the strengths and weaknesses of algorithmic principles, and is crucial to guide the choices made when designing a complex data processing pipeline. Any analysis must strike a balance between realism and tractability; the current analyses of many geometric algorithms are notoriously unrealistic. Aside from the purely scientific objectives, one of the main goals of Aspag is to bring the communities closer in the long term. As a consequence, the funding of the project is crucial to ensure that the members of the consortium will be able to interact on a very regular basis, a necessary condition for significant progress on the above challenges.

- See also: https://members.loria.fr/Olivier.Devillers/aspag/

- Acronym : TopAI

- Type : ANR Chair in AI.

- Title : Topological Data Analysis for Machine Learning and AI

- Coordinator : Frédéric Chazal

- Duration : 4 years from September 2020 to August 2024.

- Others Partners: Two industrial partners, the French SME Sysnav and the French start-up MetaFora.

- Abstract:

The TopAI project aims at developing a world-leading research activity on topological and geometric approaches in Machine Learning (ML) and AI with a double academic and industrial/societal objective. First, building on the strong expertise of the candidate and his team in TDA, TopAI aims at designing new mathematically well-founded topological and geometric methods and tools for Data Analysis and ML and to make them available to the data science and AI community through state-of-the-art software tools. Second, thanks to already established close collaborations and the strong involvement of French industrial partners, TopAI aims at exploiting its expertise and tools to address a set of challenging problems with high societal and economic impact in personalized medicine and AI-assisted medical diagnosis.

- Acronym : ALGOKNOT.

- Type : ANR Jeune Chercheuse Jeune Chercheur.

- Title : Algorithmic and Combinatorial Aspects of Knot Theory.

- Coordinator : Clément Maria.

- Duration : 2020 – 2023 (3 years).

- Abstract: The project AlgoKnot aims at strengthening our understanding of the computational and combinatorial complexity of the diverse facets of knot theory, as well as designing efficient algorithms and software to study their interconnections.

- See also: https://www-sop.inria.fr/members/Clement.Maria/

- Acronym: GeMfaceT.

- Type: ANR JCJC -CES 40 – Mathématiques

- Title: A bridge between Geometric Measure and Discrete Surface Theories

- Coordinator: Blanche Buet.

- Duration: 48 months, starting October 2021.

- Abstract: This project positions at the interface between geometric measure and discrete surface theories. There has recently been a growing interest in non-smooth structures, both from theoretical point of view, where singularities occur in famous optimization problems such as Plateau problem or geometric flows such as mean curvature flow, and applied point of view where complex high dimensional data are no longer assumed to lie on a smooth manifold but are more singular and allow crossings, tree-structures and dimension variations. We propose in this project to strengthen and expand the use of geometric measure concepts in discrete surface study and complex data modelling and also, to use those possible singular disrcete surfaces to compute numerical solutions to the aforementioned problems.

Research collaboration between DataShape and the Service Hydrographique et Océanographique de la Marine (SHOM) on bathymetric data analysis using a combination of TDA and deep learning techniques. This collaboration is funded by the AMI IA Améliorer la cartographie du littoral.

Research collaboration between DataShape and IFPEN on TDA applied to various problems issued from energy transition and sustainable mobility.

- Type : Paris Region PhD² - PhD 2021.

- Title : Analyse de données cytométriques.

The Île-de-France region funds two PhD theses in collaboration with Metafora biosystems, a company specialized in the analysis of cells through their metabolism. The first is supervised by Pascal Massart (Inria team Celeste) and Marc Glisse, and its goal is to improve clustering for this particular type of data. The second one is supervised by Gilles Blanchard and Marc Glisse and aims to compare samples instead of analyzing just one sample.