## Section: New Results

### Algorithmic foundations

**Keywords:** Computational geometry, computational topology,
optimization, data analysis.

#### Comparing two clusterings using matchings between clusters of clusters

Participants : F. Cazals, D. Mazauric, R. Tetley.

In collaboration with R. Watrigant, University Lyon I.

Clustering is a fundamental problem in data science, yet the variety of clustering methods and their sensitivity to parameters make clustering hard. To analyze the stability of a given clustering algorithm while varying its parameters, and to compare clusters yielded by different algorithms, several comparison schemes based on matchings, information theory and various indices (Rand, Jaccard) have been developed. In this work [15], we go beyond these by providing a novel class of methods computing meta-clusters within each clustering, a meta-cluster being a group of clusters, together with a matching between these.

Let the intersection graph of two clusterings be the edge-weighted bipartite graph in which the nodes represent the clusters, an edge represents a nonempty intersection between two clusters, and the weight of an edge is the number of common items. We introduce the so-called $D$-Family matching problem on intersection graphs, with $D$ the upper bound on the diameter of the graph induced by the clusters of any meta-cluster. First, we prove $NP$-completeness and $APX$-hardness results, and show that simple strategies have an unbounded approximation ratio. Second, we design exact polynomial time dynamic programming algorithms for some classes of graphs (in particular trees). Third, we propose efficient spanning-tree based heuristic algorithms for general graphs.
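The intersection graph defined above can be sketched in a few lines of Python. This is only an illustration of the definition, not the implementation used in [15]; the function name `intersection_graph` and the input format (each clustering given as a list of sets of items) are assumptions made for the example.

```python
from collections import defaultdict


def intersection_graph(clustering_a, clustering_b):
    """Return the weighted edges {(i, j): number of common items}.

    Node i is the i-th cluster of clustering_a and node j the j-th cluster
    of clustering_b. Input format (lists of sets) is a hypothetical choice.
    """
    # Map every item to the index of its cluster in the second clustering.
    owner_b = {item: j for j, cluster in enumerate(clustering_b) for item in cluster}
    weights = defaultdict(int)
    for i, cluster in enumerate(clustering_a):
        for item in cluster:
            j = owner_b.get(item)
            if j is not None:
                # Each shared item adds one to the weight of edge (i, j).
                weights[(i, j)] += 1
    return dict(weights)


a = [{1, 2, 3}, {4, 5}, {6}]
b = [{1, 2}, {3, 4, 5}, {6}]
edges = intersection_graph(a, b)
```

Only pairs of clusters sharing at least one item appear as keys, which matches the requirement that edges correspond to nonempty intersections.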

Our experiments illustrate the role of $D$ as a scale parameter providing information on the relationship between clusters within a clustering and in-between two clusterings. They also show the advantages of our built-in mapping over classical cluster comparison measures such as the variation of information (VI).
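For reference, the variation of information used as a baseline can be computed from the cluster intersections alone, using the standard definition $VI = H(A) + H(B) - 2I(A;B)$. The sketch below is an illustration under stated assumptions: both clusterings are taken to partition the same item set, are given as lists of sets, and the function name `variation_of_information` is hypothetical. The result is in nats.

```python
import math


def variation_of_information(clustering_a, clustering_b):
    """Variation of information (in nats) between two partitions.

    Assumes both clusterings partition the same set of items and are
    given as lists of sets (a hypothetical input format).
    """
    n = sum(len(cluster) for cluster in clustering_a)
    vi = 0.0
    for ca in clustering_a:
        for cb in clustering_b:
            common = len(ca & cb)
            if common:
                p_ab = common / n
                # Equivalent to H(A) + H(B) - 2 I(A; B), written term by term.
                vi -= p_ab * (math.log(common / len(ca)) + math.log(common / len(cb)))
    return vi


same = variation_of_information([{1, 2}, {3, 4}], [{1, 2}, {3, 4}])
split = variation_of_information([{1, 2}, {3, 4}], [{1, 2, 3, 4}])
```

Unlike the meta-cluster approach described above, this yields a single number and no explicit mapping between clusters.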

#### Low-Complexity Nonparametric Bayesian Online Prediction with Universal Guarantees

Participant : F. Cazals.

In collaboration with A. Lhéritier, Amadeus SA.

In this work [18], we propose a novel nonparametric online predictor for discrete labels conditioned on multivariate continuous features. The predictor is based on a feature space discretization induced by a full-fledged k-d tree with randomly picked directions and a recursive Bayesian distribution, which makes it possible to automatically learn the most relevant feature scales characterizing the conditional distribution. We prove its pointwise universality, i.e., it achieves a normalized log loss asymptotically as good as the true conditional entropy of the labels given the features. The time complexity to process the $n$-th sample point is $O(\log n)$ in probability with respect to the distribution generating the data points, whereas other exact nonparametric methods must process all past observations. Experiments on challenging datasets show the computational and statistical efficiency of our algorithm in comparison to standard and state-of-the-art methods.
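To make the performance measure concrete, the sketch below computes the normalized log loss of an online predictor on a label sequence. It is not the k-d tree predictor of [18]: as a stand-in, it uses a simple add-one (Laplace) estimator that ignores the features entirely. The function name `normalized_log_loss` and its parameters are assumptions made for the example; the loss is in nats.

```python
import math


def normalized_log_loss(labels, n_classes):
    """Average negative log-probability assigned to each label online.

    Stand-in predictor: add-one (Laplace) estimate from past labels only.
    This ignores features and is NOT the predictor proposed in [18].
    """
    counts = [0] * n_classes
    total_loss = 0.0
    for seen, y in enumerate(labels):
        # Predict before observing y, using only the `seen` past labels.
        p = (counts[y] + 1) / (seen + n_classes)
        total_loss -= math.log(p)
        counts[y] += 1
    return total_loss / len(labels)


loss = normalized_log_loss([0, 0], n_classes=2)
```

Each prediction is made before the label is revealed, which is what makes the loss an online (sequential) quantity.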

#### How long does it take for all users in a social network to choose their communities?

Participant : D. Mazauric.

In collaboration with J.-C. Bermond (Coati project-team), A. Chaintreau (Columbia University), and G. Ducoffe (National Institute for Research and Development in Informatics, Bucharest).

In this work [14],
we consider a community formation problem in social networks, where the users are either friends or enemies. The users are partitioned into conflict-free groups (*i.e.*, independent sets in the *conflict graph* ${G}^{-}=(V,E)$ that represents the enmities between users).
The dynamics goes on as long as there exists a set of at most $k$ users, $k$ being a fixed parameter, that can change their current groups in the partition *simultaneously*, in such a way that they all strictly increase their utilities (the number of friends in their group, *i.e.*, the cardinality of their respective groups minus one).
Previously, the best-known upper-bounds on the maximum time of convergence were $\mathcal{O}(|V|\,\alpha(G^{-}))$ for $k\le 2$ and $\mathcal{O}(|V|^{3})$ for $k=3$, with $\alpha(G^{-})$ being the independence number of $G^{-}$.
Our first contribution in this paper consists in reinterpreting the initial problem as the study of a dominance ordering over the vectors of integer partitions.
With this approach, we obtain for $k\le 2$ the tight upper-bound $\mathcal{O}(|V|\min\{\alpha(G^{-}),\sqrt{|V|}\})$ and, when $G^{-}$ is the empty graph, the exact value of order $\frac{(2|V|)^{3/2}}{3}$.
The time of convergence, for any fixed $k\ge 4$, was conjectured to be polynomial.
In this paper, we disprove this conjecture.
Specifically, we prove that for any $k\ge 4$, the maximum time of convergence is in $\Omega(|V|^{\Theta(\log |V|)})$.
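
The dynamics described above can be simulated in the simplest case $k=1$, where a single user moves at a time. The sketch below is only an illustration: it starts from singleton groups and always applies the first improving move it finds, whereas the bounds above concern the maximum time of convergence over all possible sequences of moves. The function name `run_dynamics` and the encoding of the conflict graph as a set of `frozenset` pairs are assumptions made for the example.

```python
def run_dynamics(n_users, enemies):
    """Simulate single-user (k = 1) deviations until no user can improve.

    enemies: set of frozenset({u, v}) pairs, the edges of the conflict graph
    (a hypothetical encoding). Returns (final_groups, number_of_moves).
    """
    groups = [{u} for u in range(n_users)]  # start from singleton groups
    moves = 0
    improved = True
    while improved:
        improved = False
        for u in range(n_users):
            src = next(g for g in groups if u in g)
            for dst in groups:
                if dst is src:
                    continue
                conflict_free = all(frozenset((u, v)) not in enemies for v in dst)
                # Utility is group size minus one, so the move strictly pays
                # off exactly when dst is at least as large as src.
                if conflict_free and len(dst) >= len(src):
                    src.remove(u)
                    dst.add(u)
                    moves += 1
                    improved = True
                    break
            # Drop groups that became empty after a move.
            groups = [g for g in groups if g]
    return groups, moves


final_empty, moves_empty = run_dynamics(4, set())
final_conflict, moves_conflict = run_dynamics(3, {frozenset((0, 1))})
```

The loop terminates because every move strictly increases the sum of the squared group sizes. With an empty conflict graph, the only stable partition is the single group containing all users.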