Section: New Results

Data Analysis and Learning

Participants: Konstantin Avrachenkov, Maximilien Dreveton, Giovanni Neglia, Chuan Xu.

Almost exact recovery in label spreading

In the semi-supervised graph clustering setting, an expert provides the cluster membership of a few nodes. This small amount of information makes it possible to achieve high-accuracy clustering with computationally efficient procedures. Our main goal is to provide a theoretical justification of why graph-based semi-supervised learning works so well. Specifically, for the Stochastic Block Model in the moderately sparse regime, K. Avrachenkov and M. Dreveton have proved in [34] that popular semi-supervised clustering methods such as Label Spreading achieve asymptotically almost exact recovery as long as the fraction of labeled nodes does not go to zero and the average degree goes to infinity.
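To make the setting concrete, the following is a minimal sketch of label spreading on a two-block Stochastic Block Model. It assumes the commonly used normalized-adjacency formulation, F = (I - alpha S)^{-1} Y with S = D^{-1/2} A D^{-1/2}, which may differ in detail from the variant analyzed in [34]. The function names (`sample_sbm`, `label_spreading`) and all parameter values (graph size, edge probabilities, `alpha`, number of labeled nodes) are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def sample_sbm(n, p_in, p_out, rng):
    """Sample a symmetric two-block SBM adjacency matrix (n must be even)."""
    labels = np.repeat([0, 1], n // 2)
    same_block = labels[:, None] == labels[None, :]
    probs = np.where(same_block, p_in, p_out)
    # Draw each undirected edge once (strict upper triangle), then symmetrize.
    upper = np.triu(rng.random((n, n)) < probs, k=1)
    adj = (upper | upper.T).astype(float)
    return adj, labels

def label_spreading(adj, labeled_idx, labeled_classes, n_classes, alpha=0.9):
    """Closed-form label spreading: F = (I - alpha * S)^{-1} Y."""
    n = adj.shape[0]
    deg = np.maximum(adj.sum(axis=1), 1.0)  # guard against isolated nodes
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    s = adj * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    y = np.zeros((n, n_classes))
    y[labeled_idx, labeled_classes] = 1.0   # one-hot rows for labeled nodes
    f = np.linalg.solve(np.eye(n) - alpha * s, y)
    return f.argmax(axis=1)

rng = np.random.default_rng(0)
n = 200
adj, truth = sample_sbm(n, p_in=0.3, p_out=0.05, rng=rng)

# Hypothetical expert input: 10 labeled nodes per block (10% of all nodes).
labeled_idx = np.concatenate([np.arange(10), np.arange(n // 2, n // 2 + 10)])
pred = label_spreading(adj, labeled_idx, truth[labeled_idx], n_classes=2)

unlabeled = np.ones(n, dtype=bool)
unlabeled[labeled_idx] = False
accuracy = (pred[unlabeled] == truth[unlabeled]).mean()
print(f"accuracy on unlabeled nodes: {accuracy:.3f}")
```

The labeled set is balanced across blocks on purpose: with this formulation a strong imbalance in the number of labels per class can bias the argmax toward the over-represented class.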

Similarities, kernels and proximity measures on graphs

In [13], K. Avrachenkov together with P. Chebotarev (RAS Trapeznikov Institute of Control Sciences, Russia) and D. Rubanov (Google) have analytically studied proximity and distance properties of various kernels and similarity measures on graphs. This helps to understand the mathematical nature of such measures and can potentially be useful for recommending the adoption of specific similarity measures in data analysis.
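As an illustration of what a graph kernel and an induced distance look like, the sketch below computes the regularized Laplacian kernel K = (I + tL)^{-1} on a small path graph and the Euclidean distance it induces, d(i, j) = sqrt(K_ii + K_jj - 2 K_ij). This kernel is only one example of the family of measures in question; the choice of graph, the value t = 1, and the function names are assumptions made for the example, and the code does not reproduce the specific results of [13].

```python
import numpy as np

def regularized_laplacian_kernel(adj, t=1.0):
    """Regularized Laplacian kernel K = (I + t L)^{-1}, with L = D - A."""
    lap = np.diag(adj.sum(axis=1)) - adj
    return np.linalg.inv(np.eye(adj.shape[0]) + t * lap)

def kernel_distance(kernel):
    """Euclidean distance induced by a positive semidefinite kernel."""
    diag = np.diag(kernel)
    squared = diag[:, None] + diag[None, :] - 2.0 * kernel
    return np.sqrt(np.maximum(squared, 0.0))  # clip tiny negative round-off

# Path graph on 4 nodes: 0 - 1 - 2 - 3
path_adj = np.array([[0, 1, 0, 0],
                     [1, 0, 1, 0],
                     [0, 1, 0, 1],
                     [0, 0, 1, 0]], dtype=float)

K = regularized_laplacian_kernel(path_adj, t=1.0)
D = kernel_distance(K)
print(np.round(K, 3))
print(np.round(D, 3))
```

Because K is symmetric positive definite, D is guaranteed to satisfy the triangle inequality; whether the unsquare-rooted quantity K_ii + K_jj - 2 K_ij is also a distance depends on the kernel, which is the kind of question the analysis addresses.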

The effect of communication topology on learning speed

Many learning problems are formulated as the minimization of a loss function over a training set of examples. Distributed gradient methods running on a cluster are often used for this purpose. In [47], G. Neglia, together with G. Calbi (Univ Côte d'Azur), D. Towsley, and G. Vardoyan (UMass at Amherst, USA), has studied how the variability of task execution times at cluster nodes affects the system throughput. In particular, a simple but accurate model allows them to quantify how the time to solve the minimization problem depends on the network of information exchanges among the nodes. Interestingly, they show that, even when communication overhead may be neglected, the clique is not necessarily the most effective topology, contrary to what is commonly assumed in previous work.
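The throughput effect can be illustrated with a toy simulation. The sketch below is not the model of [47]; it assumes a simple synchronization rule in which each node starts iteration k+1 only once it and all of its neighbors have finished iteration k, with i.i.d. exponential computation times and zero communication delay. The function name `completion_times`, the number of nodes, and the number of iterations are all hypothetical choices.

```python
import numpy as np

def completion_times(waits_for, durations):
    """Completion time of the last iteration at each node.

    waits_for[i, j] is True if node i must wait for node j (diagonal True).
    durations[k, i] is the computation time of node i at iteration k.
    """
    n_iter, n = durations.shape
    t = np.zeros(n)
    for k in range(n_iter):
        # Node i starts once every node it waits for has finished iteration k-1.
        start = np.where(waits_for, t[None, :], -np.inf).max(axis=1)
        t = start + durations[k]
    return t

n, n_iter = 16, 2000
rng = np.random.default_rng(0)
durations = rng.exponential(scale=1.0, size=(n_iter, n))

# Clique: every node waits for every other node.
clique = np.ones((n, n), dtype=bool)

# Ring: every node waits for itself and its two ring neighbors.
ring = np.eye(n, dtype=bool)
idx = np.arange(n)
ring[idx, (idx + 1) % n] = True
ring[idx, (idx - 1) % n] = True

# Throughput in iterations per unit of time, using the same random durations.
throughput_clique = n_iter / completion_times(clique, durations).max()
throughput_ring = n_iter / completion_times(ring, durations).max()
print(f"clique: {throughput_clique:.3f} iterations per unit time")
print(f"ring:   {throughput_ring:.3f} iterations per unit time")
```

Under this rule the ring completes iterations faster than the clique because each node waits on fewer stragglers. The sketch captures only the time per iteration; the number of iterations needed to converge also depends on the topology, and the overall time to solve the problem combines both effects.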

In [48], G. Neglia and C. Xu, together with D. Towsley (UMass at Amherst, USA) and G. Calbi (Univ Côte d'Azur), have investigated why the effect of the communication topology on the number of epochs needed for machine learning training to converge appears experimentally much smaller than what is predicted by theory.