Section: New Results

Data Analysis and Learning

Participant: Konstantin Avrachenkov.

Unsupervised learning

In [6], K. Avrachenkov, together with A. Kondratev, V. Mazalov (Petrozavodsk State Univ., Russia) and D. Rubanov (Amadeus), applied game-theoretic methods to community detection in networks. Traditional methods for detecting community structure are based on selecting dense subgraphs inside the network. Here the authors propose to use methods from cooperative game theory, which highlight not only the link density but also the mechanisms of cluster formation. Specifically, they suggest two approaches: the first is based on the Myerson value, whereas the second is based on hedonic games. Both approaches can detect clusters at various resolutions; however, tuning the resolution parameter is particularly intuitive in the hedonic games approach. Furthermore, the modularity-based approach and its generalizations, as well as the ratio cut and normalized cut methods, can be viewed as particular cases of hedonic games. Finally, for approaches based on potential hedonic games, a very efficient computational scheme using Gibbs sampling is suggested.
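The idea of a potential hedonic game explored by Gibbs sampling can be illustrated with a minimal sketch. It assumes, for illustration only, a pairwise preference in which a node gains 1 - alpha from each neighbour in its cluster and loses alpha for each non-neighbour, so that the potential of a partition is the number of intra-cluster edges minus alpha times the number of intra-cluster node pairs. The function names, the inverse temperature `beta` and the sweep schedule are hypothetical choices, not the scheme published in [6].

```python
import numpy as np

def potential(adj, labels, alpha):
    """Assumed potential: internal edges minus alpha * internal node pairs."""
    total = 0.0
    for c in np.unique(labels):
        members = np.flatnonzero(labels == c)
        size = len(members)
        internal_edges = adj[np.ix_(members, members)].sum() / 2
        total += internal_edges - alpha * size * (size - 1) / 2
    return total

def gibbs_partition(adj, alpha=0.5, beta=5.0, sweeps=50, seed=0):
    """Resample each node's cluster with probability proportional to
    exp(beta * gain), where gain is the node's payoff in that cluster."""
    rng = np.random.default_rng(seed)
    n = adj.shape[0]
    labels = np.arange(n)  # start from singletons; ids 0..n-1 are candidates
    for _ in range(sweeps):
        for i in rng.permutation(n):
            gains = np.zeros(n)
            for c in range(n):
                mask = labels == c
                mask[i] = False  # payoff counts only the *other* members
                gains[c] = adj[i, mask].sum() - alpha * mask.sum()
            # Subtract the maximum before exponentiating for stability.
            probs = np.exp(beta * (gains - gains.max()))
            probs /= probs.sum()
            labels[i] = rng.choice(n, p=probs)
    return labels
```

In this sketch `alpha` plays the role of the resolution parameter: values close to 0 favour a few large clusters, while values close to 1 only keep near-cliques together. A larger `beta` makes the sampler behave more like a greedy best-response dynamic.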

Semi-supervised learning

Graph semi-supervised learning (gSSL) aims to classify data by exploiting two initial inputs: first, the data are structured in a network whose edges convey information on the proximity, in a wide sense, of two data points (e.g. correlation or spatial proximity); second, partial information is available on some nodes, which have previously been labelled. Thus, the classification problem is usually a balance between two terms: one diffusing the information from the labelled points to the unlabelled ones through the network, and another that constrains the solution to be similar, on the labelled nodes, to the given labels. In practice, popular SSL methods such as Standard Laplacian (SL), Normalized Laplacian (NL) or PageRank (PR) exploit operators defined on graphs to spread the labels; from a random walk perspective, a given point is assigned to the class that maximizes the expected number of visits from walkers started at the labelled nodes of that class. Anomalous diffusion can alter the way a graph is “explored” and can therefore alter classification performance. In a nutshell, Lévy flights/walks are a way to create superdiffusive regimes: the customary way to generate them is to allow the walkers to perform non-local jumps whose length is distributed according to a fat-tailed probability density function with diverging second moment. Mathematically speaking, there have been several attempts to transpose the Lévy flight phenomenon to networks and, in the context of gSSL, K. Avrachenkov, in conjunction with S. De Nigris, E. Bautista, P. Abry and P. Gonçalves, opted in [38] for the use of fractional operators. The authors cast those operators in each incarnation of the SSL problem (SL, PR and NL) and investigated the beneficial effect of such a procedure on classification.
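The balance between a diffusion term and a label-fitting term described above can be sketched with a generic Laplacian-regularized classifier in which the Laplacian is raised to a fractional power. This is an illustrative formulation under stated assumptions, not necessarily the one used in [38]: the objective tr(FᵀLᵞF) + mu·‖F − Y‖², the function names and the parameters `gamma` and `mu` are all hypothetical choices. The fractional power is computed from the eigendecomposition of the symmetric combinatorial Laplacian.

```python
import numpy as np

def fractional_laplacian(adj, gamma):
    """Return L**gamma for the combinatorial Laplacian L = D - A."""
    deg = adj.sum(axis=1)
    lap = np.diag(deg) - adj
    eigvals, eigvecs = np.linalg.eigh(lap)
    eigvals = np.clip(eigvals, 0.0, None)  # remove tiny negative round-off
    return (eigvecs * eigvals**gamma) @ eigvecs.T

def classify(adj, labelled, gamma=1.0, mu=1.0):
    """Classify every node from a dict {node index: class label}.

    Minimizes tr(F^T L^gamma F) + mu * ||F - Y||^2, whose solution
    satisfies (L^gamma + mu * I) F = mu * Y.
    """
    n = adj.shape[0]
    classes = sorted(set(labelled.values()))
    Y = np.zeros((n, len(classes)))
    for node, cls in labelled.items():
        Y[node, classes.index(cls)] = 1.0
    L = fractional_laplacian(np.asarray(adj, dtype=float), gamma)
    F = np.linalg.solve(L + mu * np.eye(n), mu * Y)
    # Each node takes the class with the largest score.
    return [classes[k] for k in F.argmax(axis=1)]
```

With `gamma=1.0` the sketch reduces to ordinary Laplacian regularization. For `0 < gamma < 1` the matrix `L**gamma` is generally dense, so labels can propagate between nodes that are not adjacent in the original graph, which mimics the non-local jumps of a Lévy flight.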

In [13], K. Avrachenkov, together with A. Kadavankandy (CentraleSupélec), L. Cottatellucci (EURECOM) and R. Sundaresan (IISc, India), tackle the problem of hidden community detection. The authors consider Belief Propagation (BP) applied to the problem of detecting a hidden Erdős-Rényi (ER) graph embedded in a larger and sparser ER graph, in the presence of side-information. They derive two related BP-based algorithms that perform subgraph detection under two kinds of side-information. The first kind consists of a set of nodes, called cues, known to be from the subgraph. The second kind consists of a set of nodes that are cues with a given probability. Past works showed that BP without side-information fails to detect the subgraph correctly when a so-called effective signal-to-noise ratio (SNR) parameter falls below a threshold. In contrast, in the presence of non-trivial side-information, the authors show that the BP algorithm achieves asymptotically zero error for any value of a suitably defined phase-transition parameter. The results are validated on synthetic datasets and a few real-world networks.

Supervised learning

Graphlets are defined as k-node connected induced subgraph patterns. For instance, for an undirected graph, 3-node graphlets include closed triangles and open triangles. The number of occurrences of each graphlet, called the graphlet count, is a signature that characterizes the local network structure of a given graph. Graphlet counts play a prominent role in network analysis in many fields, most notably bioinformatics and social science. However, computing exact graphlet counts is inherently difficult and computationally expensive, because the number of graphlets grows exponentially as the graph size and/or graphlet size grows. To deal with this difficulty, many sampling methods have been proposed to estimate graphlet counts with bounded error. Nevertheless, these methods require a large number of samples to be statistically reliable, which is still computationally demanding. Intuitively, learning from historic graphs can make estimation more accurate and avoid much repetitive counting, thereby reducing computational cost. Based on this idea, in [29] K. Avrachenkov, together with X. Liu, J. Chen and J. Lui (CUHK, Hong Kong), propose a convolutional neural network (CNN) framework and two preprocessing techniques to estimate graphlet counts. Extensive experiments on two types of random graphs and on real-world biochemistry graphs show that their framework can offer a substantial speedup in estimating the graphlet counts of new graphs with high accuracy.
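To make the notion of graphlet count concrete, the following minimal sketch computes the exact counts of the two 3-node graphlets mentioned above for a small undirected simple graph given as a 0/1 adjacency matrix. It is only an illustration of the quantity being estimated, not the CNN framework of [29]; the function name is hypothetical. It uses two standard identities: the number of closed triangles is trace(A³)/6, and each closed triangle contains three paths of length two, which are subtracted from the total number of such paths to obtain the induced open triangles.

```python
import numpy as np

def three_node_graphlet_counts(adj):
    """Exact counts of closed and open (induced) triangles."""
    adj = np.asarray(adj, dtype=np.int64)
    deg = adj.sum(axis=1)
    # Each triangle is counted 6 times in trace(A^3): 3 start nodes, 2 directions.
    closed = int(np.trace(adj @ adj @ adj)) // 6
    # Paths of length two, counted by their centre node: C(deg, 2) per node.
    paths_of_length_two = int((deg * (deg - 1) // 2).sum())
    # Remove the three paths contained in every closed triangle.
    open_ = paths_of_length_two - 3 * closed
    return {"closed": closed, "open": open_}
```

The dense matrix product costs on the order of n³ operations for n nodes, and exact counting becomes far more expensive for larger graphlets, which is the cost that sampling and learning-based estimators aim to avoid.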