Section: New Results
Keywords: Computational geometry, computational topology, optimization, data analysis.
Progress on the biophysical questions discussed in the previous sections requires the methodological developments presented below.
A sequential nonparametric multivariate two-sample test
Participant : F. Cazals.
In collaboration with A. Lhéritier (Amadeus, France).
Given samples from two distributions, a nonparametric two-sample test aims at determining whether the two distributions are equal, based on a test statistic. Classically, this statistic is computed on the whole dataset, or on a subset of the dataset by a function trained on its complement. We consider methods in a third tier, so as to deal with large (possibly infinite) datasets and to automatically determine the most relevant scales to work at, and we make two contributions. First, we develop a generic sequential nonparametric testing framework, in which the sample size need not be fixed in advance. This makes our test a truly sequential nonparametric multivariate two-sample test. Under information-theoretic conditions qualifying the difference between the tested distributions, consistency of the two-sample test is established. Second, we instantiate our framework using nearest neighbor regressors, and show how the power of the resulting two-sample test can be improved using Bayesian mixtures and switch distributions. This combination of techniques yields automatic scale selection, and experiments performed on challenging datasets show that our sequential tests exhibit performance comparable to that of state-of-the-art nonsequential tests.
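The idea of a sequential test driven by a nearest neighbor predictor can be illustrated with a minimal sketch. It is not the framework of the paper: it assumes that the two samples arrive as a single stream of (point, label) pairs, that labels are equally likely under the null hypothesis, and it uses a single fixed neighborhood size instead of Bayesian mixtures or switch distributions. The function names, the add-1/2 smoothing and the stopping threshold 1/alpha are all choices made for this illustration.

```python
import math
import random


def knn_label_probability(history, x, k):
    """Estimate P(label = 1 | x) from the k nearest past points (add-1/2 smoothing)."""
    nearest = sorted(history, key=lambda item: math.dist(item[0], x))[:k]
    ones = sum(label for _, label in nearest)
    return (ones + 0.5) / (len(nearest) + 1.0)


def sequential_two_sample_test(stream, alpha=0.05, k=5):
    """Process (point, label) pairs one at a time and return (rejected, n_seen).

    Assumed null hypothesis: labels are fair coin flips independent of the
    points. The running likelihood ratio of the k-NN predictor against this
    null is then a nonnegative martingale, so stopping when it reaches
    1/alpha keeps the false rejection probability below alpha.
    """
    history = []
    log_ratio = 0.0
    threshold = math.log(1.0 / alpha)
    for n, (x, label) in enumerate(stream, start=1):
        # Predict the label before looking at it, then score the prediction.
        p_one = knn_label_probability(history, x, k)
        p_label = p_one if label == 1 else 1.0 - p_one
        log_ratio += math.log(p_label) - math.log(0.5)
        history.append((x, label))
        if log_ratio >= threshold:
            return True, n
    return False, len(history)


def labelled_stream(n, shift, seed=0):
    """Hypothetical data: two 2D Gaussians whose means differ by `shift` along x."""
    rng = random.Random(seed)
    for _ in range(n):
        label = rng.randint(0, 1)
        yield (rng.gauss(shift * label, 1.0), rng.gauss(0.0, 1.0)), label


if __name__ == "__main__":
    print(sequential_two_sample_test(labelled_stream(500, shift=3.0)))
```

The sample size is not fixed in advance: the loop stops as soon as the accumulated evidence is sufficient, which is the property the sequential framework is built around.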
Comparing two clusterings using matchings between clusters of clusters
Participants : F. Cazals, D. Mazauric, R. Tetley.
In collaboration with R. Watrigant (University Lyon I, Laboratoire de l'Informatique du Parallélisme, France).
Clustering is a fundamental problem in data science, yet the variety of clustering methods and their sensitivity to parameters make clustering hard. To analyze the stability of a given clustering algorithm while varying its parameters, and to compare clusters yielded by different algorithms, several comparison schemes based on matchings, information theory and various indices (Rand, Jaccard) have been developed. We go beyond these by providing a novel class of methods computing meta-clusters within each clustering (a meta-cluster is a group of clusters), together with a matching between these. Let the intersection graph of two clusterings be the edge-weighted bipartite graph in which the nodes represent the clusters, the edges represent the non-empty intersections between two clusters, and the weight of an edge is the number of common items. We introduce the so-called D-family-matching problem on intersection graphs, with D the upper bound on the diameter of the graph induced by the clusters of any meta-cluster. First, we prove NP-completeness and APX-hardness results, and show that simple strategies have an unbounded approximation ratio. Second, we design exact polynomial time dynamic programming algorithms for some classes of graphs (in particular trees). Then, we design efficient spanning-tree based algorithms for general graphs. Our experiments illustrate the role of D as a scale parameter providing information on the relationship between clusters within a clustering and in-between two clusterings. They also show the advantages of our built-in mapping over classical cluster comparison measures such as the variation of information (VI).
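The intersection graph defined above is simple to build. The following sketch assumes that each clustering is given as a list of disjoint sets of items; the function name and the dictionary representation of the weighted edges are choices made for this illustration, and the D-family-matching algorithms themselves are not covered.

```python
from collections import Counter


def intersection_graph(clustering_a, clustering_b):
    """Return {(i, j): weight} for clusters i of A and j of B that share items.

    The weight is the number of common items; pairs of clusters with an
    empty intersection get no edge.
    """
    # Map each item to the index of its cluster in the second clustering.
    owner_b = {item: j for j, cluster in enumerate(clustering_b) for item in cluster}
    weights = Counter()
    for i, cluster in enumerate(clustering_a):
        for item in cluster:
            if item in owner_b:
                weights[(i, owner_b[item])] += 1
    return dict(weights)


if __name__ == "__main__":
    a = [{1, 2, 3}, {4, 5}, {6}]
    b = [{1, 2}, {3, 4, 5}, {6}]
    for edge, weight in sorted(intersection_graph(a, b).items()):
        print(edge, weight)
```

In this toy example, cluster 0 of the first clustering overlaps two clusters of the second one, which is the kind of situation where grouping clusters into meta-clusters is more informative than a one-to-one matching.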
How long does it take for all users in a social network to choose their communities?
Participant : D. Mazauric.
In collaboration with J.-C. Bermond (Inria/I3S project-team Coati), A. Chaintreau (Columbia University in the City of New York), and G. Ducoffe (National Institute for Research and Development in Informatics and Research Institute of the University of Bucharest).
We consider a community formation problem in social networks, where the users are either friends or enemies. The users are partitioned into conflict-free groups (i.e., independent sets in the conflict graph that represents the enmities between users). The dynamics continues as long as there exists a set of at most k users, k being any fixed parameter, that can change their current groups in the partition simultaneously, in such a way that they all strictly increase their utilities (the number of friends, i.e., the cardinality of their respective groups minus one). Previously, the best-known upper bounds on the maximum time of convergence were […] for […] and […] for […], with […] being the independence number of the conflict graph. Our first contribution in this paper consists in reinterpreting the initial problem as the study of a dominance ordering over the vectors of integer partitions. With this approach, we obtain for […] the tight upper bound […] and, when the conflict graph is the empty graph, the exact value of order […]. The time of convergence, for any fixed […], was conjectured to be polynomial. In this paper we disprove this. Specifically, we prove that for any […], the maximum time of convergence is an […].
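The dynamics can be simulated in its simplest form, where a single user moves at a time. The sketch below is only an illustration under assumptions of its own: users start in singleton groups, they are scanned in a fixed order, and enmities are given as a set of unordered pairs. It does not reproduce the worst-case schedules analyzed in the paper.

```python
def run_single_user_dynamics(n_users, enemies=frozenset()):
    """Let one user at a time join a conflict-free group that strictly
    increases its utility (group size minus one), until no such move exists.

    `enemies` is a set of frozenset({u, v}) pairs (the conflict graph edges).
    Returns (number_of_moves, sorted sizes of the non-empty groups).
    """
    group_of = list(range(n_users))           # user -> group index
    members = [{u} for u in range(n_users)]   # group index -> set of users
    steps = 0
    moved = True
    while moved:
        moved = False
        for u in range(n_users):
            current = group_of[u]
            for g, group in enumerate(members):
                if g == current:
                    continue
                conflict = any(frozenset((u, v)) in enemies for v in group)
                # Joining g gives utility len(group); staying gives len(current) - 1.
                if not conflict and len(group) > len(members[current]) - 1:
                    members[current].remove(u)
                    group.add(u)
                    group_of[u] = g
                    steps += 1
                    moved = True
                    break  # one move per user per scan
    sizes = sorted(len(group) for group in members if group)
    return steps, sizes


if __name__ == "__main__":
    print(run_single_user_dynamics(6))
    print(run_single_user_dynamics(3, enemies={frozenset((0, 1))}))
```

Each move strictly increases the sum of the squared group sizes, so this simulation always terminates; the question studied above is how many moves can occur before it does.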
See  for details.
Sequential metric dimension
Participant : D. Mazauric.
In collaboration with J. Bensmail (I3S, Inria/I3S project-team Coati), F. Mc Inerney (Inria/I3S project-team Coati), N. Nisse (Inria, Inria/I3S project-team Coati), and S. Pérennes (CNRS, Inria/I3S project-team Coati).
In the localization game, introduced by Seager in 2013, an invisible and immobile target is hidden at some vertex of a graph G. At every step, one vertex of G can be probed, which reveals the distance between the probed vertex and the secret location of the target. The objective of the game is to minimize the number of steps needed to locate the target, whatever its location.
We address the generalization of this game where k vertices can be probed at every step. Our game also generalizes the notion of the metric dimension of a graph. Precisely, given a graph G and two integers k and ℓ, the localization problem asks whether there exists a strategy to locate a target hidden in G in at most ℓ steps, probing at most k vertices per step. We first show that, in general, this problem is NP-complete for every fixed k (resp., ℓ). We then focus on the class of trees. On the negative side, we prove that the localization problem is NP-complete in trees when k and ℓ are part of the input. On the positive side, we design a (+1)-approximation for the problem in n-node trees, i.e., an algorithm that computes in time […] (independent of […]) a strategy to locate the target in at most one more step than an optimal strategy. This algorithm can be used to solve the localization problem in trees in polynomial time if […] is fixed.
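The effect of one step of the game can be sketched as follows: probing a set of vertices returns their distances to the target, and only the vertices consistent with these answers remain possible locations. The code assumes a connected, unweighted graph given as an adjacency dictionary; the function names are hypothetical, and choosing which vertices to probe (the hard part of the problem) is left to the caller.

```python
from collections import deque


def bfs_distances(adjacency, source):
    """Hop distances from `source` to every vertex of a connected graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def remaining_candidates(adjacency, candidates, probes, target):
    """Candidates whose distances to all probed vertices match the target's."""
    tables = [bfs_distances(adjacency, p) for p in probes]
    answer = tuple(table[target] for table in tables)
    return {v for v in candidates if tuple(table[v] for table in tables) == answer}


if __name__ == "__main__":
    # Path 0 - 1 - 2 - 3 - 4 with the target hidden at vertex 4.
    path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
    step1 = remaining_candidates(path, set(path), probes=[2], target=4)
    step2 = remaining_candidates(path, step1, probes=[0], target=4)
    print(step1, step2)
```

Probing the middle of the path leaves two symmetric candidates, so a second step is needed, whereas probing an endpoint locates the target in one step. Trading the number of probes per step against the number of steps is what the localization problem formalizes.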
We also consider some of these questions in the context where, upon probing the vertices, the relative distances to the target are retrieved. This variant of the problem generalizes the notion of the centroidal dimension of a graph.