Section: New Results

Information systems

Participants: Eitan Altman, Konstantin Avrachenkov, Nicaise Choungmo Fofack, Majed Haddad, Alain Jean-Marie, Dorian Mazauric, Philippe Nain, Marina Sokol.

Web crawler optimization

A typical web search engine consists of three principal parts: a crawling engine, an indexing engine, and a searching engine. The work [19] by K. Avrachenkov and P. Nain, together with A. Dudin, V. Klimenok, and O. Semenova (all three from Belarusian State University, Belarus), aims to optimize the performance of the crawling engine. The crawling engine finds new web pages and updates existing web pages in the database of the web search engine; it operates several robots that collect information from the Internet. The authors first calculate various performance measures of the system (e.g., the probability that an arbitrary page is lost due to buffer overflow, the probability of starvation of the system, and the average waiting time in the buffer). Intuitively, one would like to avoid system starvation while at the same time minimizing the information loss. The authors formulate the problem as a multi-criteria optimization problem and solve it in the class of threshold policies. They consider a very general web page arrival process, modeled by a Batch Marked Markov Arrival Process, and a very general service time, modeled by a phase-type distribution. The model has been applied to the performance evaluation and optimization of the crawler designed by the Inria Maestro team in the framework of the RIAM Inria-Canon research project (see the Maestro 2006 and 2007 activity reports).
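The tension between page loss and robot starvation can be illustrated with a much simpler model than the one studied in [19]. The sketch below simulates a single-robot M/M/1/K buffer under a hysteretic threshold policy: the robot sleeps when the buffer empties and resumes only once a given number of pages are queued. The rates, buffer size, and policy are illustrative assumptions only; the paper's model (BMMAP arrivals, phase-type service, several robots) is far more general.

```python
import random

def simulate_threshold_queue(lam, mu, K, threshold, T=50000.0, rng=None):
    """Toy M/M/1/K crawler buffer under a hysteretic threshold policy:
    the robot sleeps when the buffer empties and resumes service only
    once `threshold` pages are queued.  Returns (loss probability,
    fraction of time the robot is idle)."""
    rng = rng or random.Random(0)
    t, q, serving = 0.0, 0, False
    arrivals, losses, idle_time = 0, 0, 0.0
    while t < T:
        rate = lam + (mu if serving else 0.0)
        dt = rng.expovariate(rate)          # time to the next event
        if not serving:
            idle_time += dt
        t += dt
        if rng.random() < lam / rate:       # arrival of a new page
            arrivals += 1
            if q == K:
                losses += 1                 # buffer overflow: page lost
            else:
                q += 1
                if q >= threshold:
                    serving = True          # wake the robot
        else:                               # robot finishes a page
            q -= 1
            if q == 0:
                serving = False             # buffer empty: robot sleeps
    return losses / arrivals, idle_time / t

loss, idle = simulate_threshold_queue(lam=1.0, mu=1.2, K=10, threshold=4)
print(f"loss probability ~ {loss:.3f}, idle fraction ~ {idle:.3f}")
```

Sweeping the threshold from 1 to K exhibits the loss/starvation trade-off that the multi-criteria formulation optimizes over: a high threshold batches work but pushes occupancy toward the buffer limit, increasing losses.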

PageRank node centrality

In [48], K. Avrachenkov and M. Sokol, together with D. Nemirovsky (former Maestro team member), E. Smirnova (Inria project-team Axis) and N. Litvak (University of Twente, The Netherlands), study the problem of quick detection of top-k Personalized PageRank (PPR) lists. This problem has a number of important applications, such as finding local cuts in large graphs, similarity distance estimation, and person name disambiguation. The authors make two key observations. Firstly, it is crucial to quickly detect the top-k most important neighbors of a node, while the exact order within the top-k list and the exact PPR values are far less important. Secondly, by allowing a small number of “wrong” elements in the top-k lists, one achieves great computational savings without, in fact, degrading the quality of the results. Based on these ideas, the authors propose Monte Carlo methods for quick detection of top-k PPR lists. They demonstrate the effectiveness of these methods on the Web and Wikipedia graphs, provide a performance evaluation, and supply stopping criteria.
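The Monte Carlo idea can be sketched in a few lines: run many short random walks that restart at the seed node, and rank nodes by how often a walk ends at them. The sketch below is an illustrative end-point estimator on a toy graph, not the authors' exact algorithm or stopping rule; the graph and parameters are invented for the example.

```python
import random
from collections import Counter

def mc_top_k_ppr(adj, seed, k, alpha=0.85, num_walks=10000, rng=None):
    """Estimate the top-k Personalized PageRank list for `seed` by
    simulating random walks that, at each step, continue with
    probability alpha and restart at `seed` otherwise."""
    rng = rng or random.Random(0)
    counts = Counter()
    for _ in range(num_walks):
        node = seed
        # walk until the geometric restart fires (or a dangling node)
        while rng.random() < alpha and adj[node]:
            node = rng.choice(adj[node])
        counts[node] += 1            # record the walk's end point
    # ranking by end-point frequency approximates the PPR ranking
    return [n for n, _ in counts.most_common(k)]

# toy graph: two triangles joined by a bridge (illustrative only)
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
print(mc_top_k_ppr(adj, seed=0, k=3))
```

Note that only the membership of the top-k set needs to stabilize, not the exact frequencies, which is precisely why few walks suffice.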

Analysis of YouTube

E. Altman and M. Haddad, in collaboration with S.-E. Elayoubi (Orange Labs, Issy-les-Moulineaux), R. El-Azouzi, T. Jimenez and Y. Xu (all three from Univ. Avignon/LIA), have been investigating streaming protocols similar to the one used by YouTube. After preparing a survey of the state of the art in [57], they used Ballot theorems in [106] to compute starvation probabilities (the probability that the playout buffer empties before the streaming transfer is complete).

This work is carried out in the framework of the Grant with Orange Labs (see Section 7.3) on “Quality of Service and Quality of Experience”.
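The starvation event can also be estimated by simulation: playback starts once a prefetching threshold is reached, and starvation occurs if the buffer empties before the last packet is played. The sketch below uses Poisson packet arrivals and exponential per-packet playback times with illustrative rates; it is a Monte Carlo stand-in for, not a reproduction of, the exact Ballot-theorem computation of [106].

```python
import random

def starvation_prob(N, prefetch, lam, mu, runs=4000, rng=None):
    """Estimate the probability that the playout buffer empties before
    all N packets are played, when playback starts after `prefetch`
    packets have arrived.  Arrivals: Poisson(lam); per-packet playback
    time: Exp(mu)."""
    rng = rng or random.Random(0)
    starved = 0
    for _ in range(runs):
        t, arrivals = 0.0, []
        for _ in range(N):
            t += rng.expovariate(lam)       # packet inter-arrival times
            arrivals.append(t)
        play_t = arrivals[prefetch - 1]     # playback begins here
        for i in range(N):
            if arrivals[i] > play_t:        # next packet missing: starvation
                starved += 1
                break
            play_t += rng.expovariate(mu)   # time to play packet i
    return starved / runs

lam, mu = 1.0, 1.2                          # playback faster than arrivals
print(starvation_prob(N=50, prefetch=1, lam=lam, mu=mu),
      starvation_prob(N=50, prefetch=20, lam=lam, mu=mu))
```

Increasing the prefetching threshold trades start-up delay against starvation probability, which is the quality-of-experience trade-off at stake in this line of work.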

Peer-to-peer networks

Real-time control of contents download

In the course of the Vooddo project, the question arose of assessing the theoretical limits of prefetching information in real time. Given a network bandwidth and a graph of documents, is it possible to download documents in advance so that the user browsing the documents is never blocked because of missing information? The problem is modeled as a “cops-and-robbers” game and some of its algorithmic properties are derived. This work of A. Jean-Marie and D. Mazauric is joint with F. Fomin (Univ. Bergen) and F. Giroire and N. Nisse (both from Inria project-team Mascotte) [86].

P2P traffic classification

P2P downloads still represent a large portion of today's Internet traffic. More than 100 million users operate BitTorrent, generating more than 30% of the total Internet traffic; according to the Wikipedia article on BitTorrent, this exceeds the traffic generated by Netflix and Hulu combined. Recently, a significant research effort has been devoted to developing tools for the automatic classification of Internet traffic by application. The purpose of the work [47] by K. Avrachenkov and M. Sokol, together with A. Legout (Inria project-team Planete) and P. Gonçalves (Inria project-team Reso), is to provide a framework for the sub-classification of P2P traffic generated by the BitTorrent protocol. The general intuition is that users with similar interests download similar contents. This intuition can be rigorously formalized with the help of a graph-based semi-supervised learning approach. In particular, the authors propose a PageRank-based semi-supervised learning method, which scales well to very large volumes of data.
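As an illustration of graph-based semi-supervised learning with PageRank, the sketch below computes one Personalized PageRank vector per class, seeded at that class's few labelled nodes, and assigns every node to the class with the highest score. The toy graph, labels, and function names are invented for the example; the authors' method operates on graphs built from actual traffic data.

```python
def ppr_scores(adj, seeds, alpha=0.85, iters=100):
    """Personalized PageRank by power iteration, with restart
    distributed uniformly over the `seeds` nodes."""
    n = len(adj)
    v = [0.0] * n
    for s in seeds:
        v[s] = 1.0 / len(seeds)             # restart distribution
    p = v[:]
    for _ in range(iters):
        nxt = [(1 - alpha) * vi for vi in v]
        for u in range(n):
            share = alpha * p[u] / len(adj[u])
            for w in adj[u]:                # spread mass to neighbors
                nxt[w] += share
        p = nxt
    return p

def classify(adj, labels, alpha=0.85):
    """Semi-supervised classification: one PPR vector per class,
    seeded at that class's labelled nodes; argmax over classes."""
    classes = sorted(set(labels.values()))
    score = {c: ppr_scores(adj, [i for i, l in labels.items() if l == c], alpha)
             for c in classes}
    return [max(classes, key=lambda c: score[c][u]) for u in range(len(adj))]

# two loosely connected triangles, one labelled node in each (toy data)
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
print(classify(adj, {0: 0, 5: 1}))  # → [0, 0, 0, 1, 1, 1]
```

Because each class's scores come from a sparse linear iteration, this approach scales to very large graphs, which matches the motivation stated above.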


The success of BitTorrent has fostered the development of variants of its basic components. Some of these variants adopt greedy approaches that aim at exploiting the intrinsic altruism of the original version of BitTorrent in order to maximize the benefit of participating in a torrent. G. Neglia, D. Carra (University of Verona, Italy), P. Michiardi and F. Albanese (both from Institut Eurecom) have studied BitTyrant, a recently proposed strategic client. The research is described in the Maestro 2008 activity report. The results have been extended and supported by PlanetLab experiments in [22].

Content-centric networks

In [100], N. Choungmo Fofack, P. Nain and G. Neglia, together with D. Towsley (University of Massachusetts at Amherst), provide building blocks for the performance evaluation of Content-Centric-like Networks (CCNs). In a CCN, if a cache receives a request for a content it does not store, it forwards the request to a higher-level cache, if any, or to the server. Once located, the document is routed back along the reverse path and a copy is placed in each cache along the way. In this work the authors consider a cache replacement policy based on Time-To-Live (TTL) timers, as in a DNS network. A local TTL is set when the content is first stored at the cache and is renewed every time the cache can satisfy a request for this content (i.e., at each hit); the content is removed when the TTL expires. Under the assumption that requests follow a renewal process and that the TTLs are exponential random variables, the authors determine exact formulas for the performance metrics of interest (average cache occupancy, hit and miss probabilities/rates) for some specific architectures (a linear network and a tree network with one root node and N leaf nodes). For more general topologies and general TTL distributions, an approximate solution is proposed. Numerical results show the approximations to be accurate, with relative errors smaller than 10^-3 for exponentially distributed TTLs and 10^-2 for constant TTLs.

This work is carried out in the framework of the Grant with Orange Labs on “Content-centric networks” (see Section 7.2).
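For a single cache fed by Poisson requests, the exponential-TTL model with renewal at each hit admits a simple closed form: a request is a hit exactly when the Exp(λ) inter-request time is shorter than the Exp(μ) TTL, giving a hit probability of λ/(λ+μ). The simulation below checks this special case; the parameter values are illustrative, and the paper's general renewal-request, networked setting is far richer.

```python
import random

def ttl_cache_hit_prob(lam, mu, num_requests=200000, rng=None):
    """Simulate a single TTL cache with renewal-on-hit: Poisson(lam)
    requests, Exp(mu) TTLs redrawn at every insertion and every hit.
    A request is a hit iff the current TTL has not yet expired."""
    rng = rng or random.Random(1)
    hits = 0
    expiry = 0.0                       # content absent initially
    t = 0.0
    for _ in range(num_requests):
        t += rng.expovariate(lam)      # next request (Poisson process)
        if t < expiry:
            hits += 1                  # content still cached
        # on a hit the TTL is renewed; on a miss the content is
        # fetched and re-inserted -- either way a fresh TTL is drawn
        expiry = t + rng.expovariate(mu)
    return hits / num_requests

lam, mu = 2.0, 1.0
print(ttl_cache_hit_prob(lam, mu), "vs exact", lam / (lam + mu))
```

The memorylessness of both distributions is what makes this case exact; for general renewal requests or constant TTLs, only the approximations developed in [100] are available.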