Section: New Results

Scalable Data Analysis

Massive Graph Management

Participant : Patrick Valduriez.

Traversing massive graphs as efficiently as possible is essential for many scientific applications. Many common operations on graphs, such as computing the distance between two nodes, are based on the Breadth First Search (BFS) traversal. However, because it exhaustively explores all the nodes and edges of the graph, this operation can be very time-consuming. A possible solution is to partition the graph among the nodes of a shared-nothing parallel system. However, partitioning a graph and maintaining the information about the location of vertices may be unrealistic for massive graphs, because it induces much inter-node communication. In [28], we propose ParallelGDB, a new graph database system based on specializing the local cache of each node in the system, providing a better cache hit ratio. ParallelGDB uses random graph partitioning, avoiding complex partitioning methods based on the graph topology, which usually require managing extra data structures. The proposed system provides an efficient environment for distributed graph databases.
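As an illustration, the distance computation mentioned above boils down to a plain BFS. Below is a minimal single-machine sketch (the `bfs_distance` helper is hypothetical and deliberately ignores the partitioning and caching issues that ParallelGDB addresses):

```python
from collections import deque

def bfs_distance(adj, source, target):
    """Hop distance between two vertices via Breadth First Search.

    `adj` maps each vertex to the list of its neighbours.
    Returns -1 if `target` is unreachable from `source`.
    """
    if source == target:
        return 0
    visited = {source}
    queue = deque([(source, 0)])
    while queue:
        vertex, dist = queue.popleft()
        for neighbour in adj.get(vertex, []):
            if neighbour == target:
                return dist + 1
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append((neighbour, dist + 1))
    return -1
```

On a partitioned graph, each expansion of the frontier may cross machine boundaries, which is exactly where the communication cost discussed above appears.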

Top-k Query Processing in Unstructured P2P Systems

Participants : Reza Akbarinia, William Kokou Dedzoe, Patrick Valduriez.

Top-k query processing techniques are useful in unstructured P2P systems to avoid overwhelming users with too many results and to provide them with the best ones. However, existing approaches suffer from long waiting times, because top-k results are returned only when all queried peers have finished processing the query. As a result, the response time is dominated by the slowest queried peer. We therefore revisited the problem of top-k query processing.

In [29], we address the problem of reducing the user waiting time of top-k query processing in unstructured P2P systems with overloaded peers. We propose a new algorithm, called QUAT, in which each peer maintains a semantic description of its local data as well as the semantic descriptions of its neighborhood (i.e. the descriptions of the data owned locally by its direct neighbors and by these neighbors' direct neighbors). These semantic descriptions allow peers to prioritize the queries that can provide high-quality results, and to forward them first to the neighbors that can provide high-quality answers. We validated our solution through a thorough experimental evaluation using a real-world dataset. The results show that QUAT significantly outperforms baseline algorithms by returning the final top-k results to users faster.
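The routing priority can be illustrated with a toy model that represents each semantic description as a plain set of terms (a deliberate simplification of QUAT's actual descriptions; `rank_neighbours` is a hypothetical helper):

```python
def rank_neighbours(query_terms, neighbour_descriptions):
    """Rank neighbours by the overlap between the query terms and the
    semantic description of the data reachable through each neighbour
    (its own data plus its direct neighbours' data), best first.
    Descriptions are modelled here as plain term sets.
    """
    scores = {
        peer: len(query_terms & terms)
        for peer, terms in neighbour_descriptions.items()
    }
    return sorted(scores, key=lambda peer: scores[peer], reverse=True)
```

A peer would forward the query to the first entries of this ranking, so that the neighbours most likely to hold high-quality answers are contacted first.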

Top-k Query Processing Over Sorted Lists

Participants : Reza Akbarinia, Esther Pacitti, Patrick Valduriez.

The problem of answering top-k queries can be modeled as follows. Suppose we have m lists of n data items such that each data item has a local score in each list, and the lists are sorted according to the local scores of their data items. Each data item has an overall score, computed from its local scores in all lists using a given scoring function. The problem is then to find the k data items whose overall scores are the highest. This model is general enough to capture top-k queries in many centralized, distributed and P2P applications. For example, in IR systems one of the main problems is to find the top-k documents whose aggregate rank is the highest w.r.t. some given keywords. To answer this query, the solution is to maintain, for each keyword, a ranked list of documents, and return the k documents whose aggregate rank over all lists is the highest.
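The model can be made concrete with a naive sketch that materializes all local scores before aggregating (a hypothetical `topk` helper using sum as the scoring function; efficient algorithms such as TA or BPA avoid scanning the lists entirely):

```python
def topk(lists, k, score=sum):
    """Naive top-k over m sorted lists.

    Each list contains (item, local_score) pairs sorted by decreasing
    local score. The overall score of an item aggregates its local
    scores in all lists with the given scoring function (sum here).
    """
    local = {}
    for lst in lists:
        for item, s in lst:
            local.setdefault(item, []).append(s)
    overall = {item: score(ss) for item, ss in local.items()}
    return sorted(overall, key=lambda item: overall[item], reverse=True)[:k]
```

This brute-force version reads every entry of every list; the point of the algorithms discussed below is to stop much earlier.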

In [16], we propose an extension of our best position algorithms (BPA), which had been proposed for top-k query processing over the sorted lists model. The BPA algorithms have been shown to be more efficient than the well-known TA algorithm. We propose several techniques, using different data structures, for managing the best positions that are crucial for the efficient execution of top-k algorithms. We also provide a complete discussion of the instance optimality of the TA algorithm (TA was so far considered optimal over any database of sorted lists). We show that the existence of deterministic algorithms such as BPA, which exploit the positions of the data seen so far, invalidates one of the main arguments used to prove the instance optimality of TA. Therefore, in this case, the proof of TA's instance optimality is incorrect and must be revisited.
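For reference, the TA baseline can be sketched compactly, assuming a sum scoring function and random access to every list (this is the classical threshold algorithm, not BPA; BPA additionally tracks the best positions of the seen data and is not reproduced here):

```python
def threshold_algorithm(lists, k):
    """Fagin's Threshold Algorithm (TA) with sum scoring.

    `lists` are m lists of (item, local_score) pairs sorted by
    decreasing local score. The algorithm reads the lists depth by
    depth, computes the overall score of each newly seen item by
    random access, and stops as soon as k items score at least the
    threshold formed by the last local scores seen in each list.
    """
    index = [dict(lst) for lst in lists]  # random access per list
    seen = {}
    for depth in range(max(len(lst) for lst in lists)):
        threshold = 0.0
        for lst, idx in zip(lists, index):
            if depth >= len(lst):
                continue
            item, s = lst[depth]
            threshold += s
            if item not in seen:
                seen[item] = sum(d.get(item, 0.0) for d in index)
        best = sorted(seen.values(), reverse=True)[:k]
        if len(best) == k and best[-1] >= threshold:
            break
    return sorted(seen, key=lambda item: seen[item], reverse=True)[:k]
```

The instance-optimality discussion above concerns precisely this stopping rule: a deterministic algorithm that also remembers at which positions items were seen can sometimes stop earlier.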

Satellite Image Mining

Participant : Florent Masseglia.

Satellite Image Time Series (SITS) provide us with precious information on land cover evolution. By studying SITS, we can both understand the changes of specific areas and discover global phenomena that spread over larger areas. Changes that occur throughout the sensing time can spread over very long periods and may have different start and end times depending on the location, which complicates the mining and analysis of series of images. In [45], we propose a frequent sequential pattern mining method for SITS analysis. Designing such a method called for important improvements to data mining principles. First, the search space in SITS is multi-dimensional (the radiometric levels of different wavelengths correspond to infra-red, red, etc.). Furthermore, the non-evolving regions, which are the vast majority and overwhelm the evolving ones, challenge the discovery of these patterns. Our framework enables the discovery of these patterns despite these constraints and characteristics. We introduce new filters in the mining process that yield important reductions of the search space by avoiding consecutive occurrences of similar values in the sequences. Then, we propose visualization techniques for result analysis (where modified regions are highlighted). Experiments carried out on a particular dataset showed that our method allows extracting repeated, shifted and distorted temporal behaviors. The flexibility of this method makes it possible to capture complex behaviors from multi-source, noisy and irregularly sensed data.
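The filter on consecutive similar values can be illustrated on a single pixel's time series (a simplified sketch; `tol` is a hypothetical radiometric similarity threshold, and the actual method operates on multi-dimensional sequences):

```python
def drop_consecutive_repeats(sequence, tol=0):
    """Remove consecutive near-identical values from a time series,
    keeping only actual evolutions.

    Two consecutive values whose absolute difference is at most `tol`
    are considered similar, and the second one is dropped.
    """
    filtered = []
    for value in sequence:
        if not filtered or abs(value - filtered[-1]) > tol:
            filtered.append(value)
    return filtered
```

Since non-evolving regions dominate the data, such a filter removes the bulk of uninformative repetitions before pattern mining, which is what makes the search space reduction possible.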

Distributed Approximate Similarity Join

Participant : Alexis Joly.

Efficiently constructing the KNN graph of large and high-dimensional feature datasets is crucial for many data-intensive applications involving feature-rich objects, such as image features, text features or sensor features. In this work, we investigate the use of high-dimensional hashing methods for efficiently approximating the full KNN graph of large collections, in particular in distributed environments. We first analyzed and experimented with what seems to be the most intuitive hashing-based approach: constructing several Locality Sensitive Hashing (LSH) tables in parallel and computing the frequency of all emitted collisions. We show that the balancing issues of classical LSH functions strongly affect the performance of this approach. In contrast, we show that using an alternative data-dependent hashing function (RMMH), which we introduced recently [34], can definitely change that conclusion. The main originality of the RMMH hash function family is that it is based on randomly trained classifiers, allowing it to learn random and balanced splits of the data instead of using random splits of the feature space as in LSH. We show that the hash tables constructed through RMMH are much more balanced and that the number of emitted collisions can be strongly reduced without degrading quality. In the end, our hashing-based filtering algorithm for the all-pairs graph is two orders of magnitude faster than the one based on LSH. An efficient distributed implementation of the method was developed within the MapReduce framework (and is the basis of the SimJoin prototype). This work is done in the context of the supervision of a PhD student working at INRIA Imedia (Riadh Mohamed Trad).
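The collision-counting approach described above can be sketched with classical random-hyperplane LSH (a toy in-memory version; the hash family and parameters are illustrative, and RMMH itself is not reproduced here):

```python
import random

def random_hyperplane_hash(dim, n_bits, seed=0):
    """Classical LSH for cosine similarity: each bit is the sign of
    the dot product with a random Gaussian hyperplane."""
    rng = random.Random(seed)
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]
    def h(vector):
        return tuple(sum(p_i * v_i for p_i, v_i in zip(plane, vector)) >= 0
                     for plane in planes)
    return h

def collision_counts(vectors, hash_fns):
    """Candidate pairs for the approximate KNN graph: hash every
    vector into each table and count, for every pair of vectors,
    the number of tables in which they collide. Frequent colliders
    are likely near neighbours."""
    counts = {}
    for h in hash_fns:
        buckets = {}
        for i, vector in enumerate(vectors):
            buckets.setdefault(h(vector), []).append(i)
        for bucket in buckets.values():
            for a in range(len(bucket)):
                for b in range(a + 1, len(bucket)):
                    pair = (bucket[a], bucket[b])
                    counts[pair] = counts.get(pair, 0) + 1
    return counts
```

The balancing problem discussed above shows up here directly: with skewed data, a few huge buckets dominate the quadratic pair enumeration, which is what a balanced data-dependent family such as RMMH avoids.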

Visual objects mining

Participant : Alexis Joly.

State-of-the-art content-based object retrieval systems have demonstrated impressive performance on very large image datasets. These methods, based on fine local descriptions and efficient matching techniques, can accurately detect very small rigid objects with unambiguous semantics such as logos, buildings, manufactured objects, posters, etc. Mining such small objects in large collections is however difficult. Constructing a full local matching graph with a naïve approach would indeed require probing all candidate queries, leading to an intractable algorithmic complexity. In this work, we first introduce an adaptive weighted sampling scheme, starting with some prior distribution and iteratively converging to unvisited regions [35]. We show that the proposed method makes it possible to discover highly interpretable visual words while providing excellent recall and image representativity. We then focused on mining visual objects on top of the discovered visual words. We therefore developed an original shared nearest-neighbors clustering method, working directly on the generated bipartite graph. This work is done in the context of the supervision of two PhD students, one working jointly with INA and INRIA who will join the Zenith team next year (Pierre Letessier), and one working at INRIA Imedia (Amel Hamzaoui).
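The shared nearest-neighbors idea can be illustrated directly on k-NN lists (a simplified sketch; the actual clustering method works on the generated bipartite graph and is richer than this):

```python
def snn_graph(knn, threshold):
    """Build a shared nearest-neighbour graph: connect two items
    whenever their k-NN lists overlap in at least `threshold`
    common neighbours. `knn` maps each item to its neighbour list.
    Returns (item_a, item_b, shared_count) edges.
    """
    items = list(knn)
    edges = []
    for i, a in enumerate(items):
        for b in items[i + 1:]:
            shared = len(set(knn[a]) & set(knn[b]))
            if shared >= threshold:
                edges.append((a, b, shared))
    return edges
```

Clusters of matching images then appear as dense components of this graph, which is more robust to asymmetric neighbourhoods than thresholding raw pairwise similarities.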

Visual-based plant species identification from crowdsourced data

Participant : Alexis Joly.

Inspired by citizen sciences, the main goal of this work is to speed up the collection and integration of raw botanical observation data, while providing potential users with easy and efficient access to this botanical knowledge. We therefore designed and developed an original crowdsourcing web application dedicated to accessing botanical knowledge through automated identification of plant species by visual content. Technically, the first side of the application deals with content-based identification of plant leaves. Whereas state-of-the-art methods addressing this objective are mostly based on leaf segmentation and boundary shape features, we developed a new approach based on local features and large-scale matching. This approach obtained the best results in the ImageCLEF 2011 plant identification benchmark [48]. The second side of the application deals with interactive tagging and allows any user to validate or correct the automatic determinations returned by the system. Overall, this collaborative system makes it possible to enrich the visual botanical knowledge automatically and continuously, and thus to progressively increase the accuracy of the automated identification. A demo of the developed application was presented at the ACM Multimedia conference [33]. This work is done in collaboration with INRIA Imedia and with the botanists of the AMAP UMR team (CIRAD). It is also closely related to a citizen science project around plant identification that we are developing with the support of the TelaBotanica social network.