Section: New Results

Algorithmic foundations

Keywords: computational geometry, Computational topology, Voronoi diagrams, α-shapes, Morse theory, graph algorithm, combinatorial optimization, statistical learning.

The Structural Bioinformatics Library: modeling in biomolecular science and beyond

Participants : Frédéric Cazals, Tom Dreyfus.

Software in structural bioinformatics has mainly been application driven. To favor practitioners seeking off-the-shelf applications, but also developers seeking advanced building blocks to develop novel applications, we undertook the design of the Structural Bioinformatics Library (SBL, http://sbl. inria.fr) [20], a generic C++/python cross-platform software library targeting complex problems in structural bioinformatics. Its tenet is based on a modular design offering a rich and versatile framework allowing the development of novel applications requiring well specified complex operations, without compromising robustness and performances.

The SBL involves four software components (1-4 thereafter). For end-users, the SBL provides ready to use, state-of-the-art (1) applications to handle molecular models defined by unions of balls, to deal with molecular flexibility, to model macro-molecular assemblies. These tools can also be combined to tackle integrated analysis problems. For developers, the SBL provides a broad C++ toolbox with modular design, involving (2) core algorithms, (3) biophysical models, and (4) modules, the latter being especially suited to develop novel applications. The SBL comes with a thorough documentation consisting of user and reference manuals, and a bugzilla platform to handle community feedback.

The SBL is available from http://sbl.inria.fr.

Optimal transportation problems with connectivity constraints

Participants : Frédéric Cazals, Dorian Mazauric.

The earth mover distance (EMD) or the Mallows distance are example optimal transportation (OT) problems reducing to linear programs. In this work [21], we study a generalization of these problems when the supply and demand nodes are the vertices of two graphs called the supply and the demand graphs. The novel problems embed connectivity constraints in the transport plans computed, using a Lipschitz-like condition involving distances between certain subgraphs of the supply graph and certain subgraphs of the demand graph. More precisely, we make three contributions.

First, we formally introduce two optimal transportation problems generalizing EMD, namely Minimum-cost under flow, transport size, and connectivity constraints problem (problem EMD-FCC) and Maximum-flow under cost, transport size, and connectivity constraints problem (problem EMD-CCC). We prove that problems EMD-CCC and EMD-FCC are NP-complete, and that EMD-FCC is hard to approximate within any given constant. Second, we develop a greedy heuristic algorithm returning admissible solutions, of time complexity O(n3m2) with n and m the numbers of vertices of the supply and demand graphs, respectively. Third, on the experimental side, we apply our novel OT algorithms for two applications, namely the comparison of clusterings, and the analysis of so-called potential energy landscapes in molecular science. These experiments show that optimizing the transport plan and respecting connectivity constraint can be competing objectives. Implementations of our algorithms are available in the Structural Bioinformatics Library at http://sbl.inria.fr.

Clustering stability revealed by matchings between clusters of clusters

Participants : Frédéric Cazals, Dorian Mazauric, Romain Tetley.

Clustering is a fundamental problem in data science, yet, the variety of clustering methods and their sensitivity to parameters make clustering hard. To analyze the stability of a given clustering algorithm while varying its parameters, and to compare clusters yielded by different algorithms, several comparison schemes based on matchings, information theory and various indices (Rand, Jaccard) have been developed. In this work [22], we go beyond these by providing a novel class of methods computing meta-clusters within each clustering– a meta-cluster is a group of clusters, together with a matching between these. Altogether, these pieces of information help assessing the coherence between two clusterings.

More specifically, let the intersection graph of two clusterings be the edge-weighted bipartite graph in which the nodes represent the clusters, the edges represent the non empty intersection between two clusters, and the weight of an edge is the number of common items. We introduce the so-called (k,D) and D-family-matching problems on intersection graphs, with k the number of meta-clusters and D the upper-bound on the diameter of the graph induced by the clusters of any meta-cluster. First we prove hardness and inapproximability results. Second, we design exact polynomial time dynamic programming algorithms for some classes of graphs (in particular trees). Then, we prove efficient (exact, approximation, and heuristic) algorithms, based on spanning trees, for general graphs. Practically, we present extensive experiments in two directions. First, we illustrate the ability of our algorithms to identify relevant meta-clusters between a given clustering and an edited version of it. Second, we show how our methods can be used to identify notorious instabilities of the k-means algorithm.

Experimental evaluation of a branch and bound algorithm for computing pathwidth

Participant : Dorian Mazauric.

In collaboration with David Coudert and Nicolas Nisse (COATI project-team, Université Côte D'Azur, Inria, I3S / CNRS).

Path-decompositions of graphs are an important ingredient of dynamic programming algorithms for solving efficiently many NP-hard problems. Therefore, computing the pathwidth and associated path-decomposition of graphs has both a theoretical and practical interest. In [16], we design a Branch and Bound algorithm that computes the exact pathwidth of graphs and a corresponding path-decomposition. Our main contribution consists of several non-trivial techniques to reduce the size of the input graph (pre-processing) and to cut the exploration space during the search phase of the algorithm. We evaluate experimentally our algorithm by comparing it to existing algorithms of the literature. It appears from the simulations that our algorithm offers a significant gain with respect to previous work. In particular, it is able to compute the exact pathwidth of any graph with less than 60 nodes in a reasonable running-time (10 minutes on a standard laptop). Moreover, our algorithm achieves good performance when used as a heuristic (i.e., when returning best result found within bounded time-limit). Our algorithm is not restricted to undirected graphs since it actually computes the directed pathwidth which generalizes the notion of pathwidth to digraphs.

Extracting the core structural connectivity network: guaranteeing network connectedness through a graph-theoretical approach

Participant : Dorian Mazauric.

In collaboration with Demian Wassermann, Guillermo Gallardo-Diez and Rachid Deriche (ATHENA project-team, Université Côte d'Azur, Inria).

In this work [19], we present a graph-theoretical algorithm to extract the connected core structural connectivity network of a subject population. Extracting this core common network across subjects is a main problem in current neuroscience. Such network facilitates cognitive and clinical analyses by reducing the number of connections that need to be explored. Furthermore, insights into the human brain structure can be gained by comparing core networks of different populations. We show that our novel algorithm has theoretical and practical advantages. First, contrary to the current approach our algorithm guarantees that the extracted core subnetwork is connected agreeing with current evidence that the core structural network is tightly connected. Second, our algorithm shows enhanced performance when used as feature selection approach for connectivity analysis on populations.

On the complexity of the representation of simplicial complexes by trees

Participant : Dorian Mazauric.

In collaboration with Jean-Daniel Boissonnat (DataShape team, Université Côte d'Azur, Inria).

In this paper [14], we investigate the problem of the representation of simplicial complexes by trees. We introduce and analyze local and global tree representations. We prove that the global tree representation is more efficient in terms of time complexity for searching a given simplex and we show that the local tree representation is more efficient in terms of size of the structure. The simplicial complexes are modeled by hypergraphs. We then prove that the associated combinatorial optimization problems are very difficult to solve and to approximate even if the set of maximal simplices induces a planar graph of maximum degree at most three or a bounded degree hypergraph. However, we prove polynomial time algorithms that compute constant factor approximations and optimal solutions for some classes of instances.

Well balanced designs for data placement

Participant : Dorian Mazauric.

In collaboration with Jean-Claude Bermond (COATI project-team, Université Côte D'Azur, Inria, I3S / CNRS), Alain Jean-Marie (MAESTRO project-team, Université Côte D'Azur, Inria) and Joseph Yu (Department of Mathematics, UFV, Abbotsford, BC, Canada).

The problem we consider in [13] is motivated by data placement, in particular data replication in distributed storage and retrieval systems. We are given a set V of v servers along with b files (data, documents). Each file is replicated on exactly k servers. A placement consists in finding a family of b subsets of V (representing the files) called blocks, each of size k. Each server has some probability to fail and we want to find a placement that minimizes the variance of the number of available files. It was conjectured that there always exists an optimal placement (with variance better than any other placement for any value of the probability of failure). We show that the conjecture is true, if there exists a well-balanced design, that is a family of blocks, each of size k, such that each j-element subset of V, 1jk, belongs to the same or almost the same number of blocks (difference at most one). The existence of well-balanced designs is a difficult problem as it contains, as a subproblem, the existence of Steiner systems. We completely solve the case math formula and give bounds and constructions for math formula and some values of v and b.