Section: New Results
Information and Social Networks Mining for Supporting Information Retrieval
Clustering of Relational Data and Social Network Data
Participants : Yves Lechevallier, Amine Louati.
Graph aggregation aims to produce small, understandable summaries that can highlight communities in the network, which greatly facilitates interpretation. The automatic detection of communities in a social network can provide this kind of aggregation.
Social networks provide a global view of the different actors and the interactions between them, which facilitates analysis and information retrieval.
In the enterprise context, a considerable amount of information is stored in relational databases, which can therefore be a rich source from which to extract social networks. The extracted network is generally very large, which makes its analysis and visualization difficult. In  , we propose an approach for extracting a social network from a relational database.
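As an illustration of what extracting a network from relational data can look like, here is a minimal sketch. It is not the approach of the cited work: the schema (tables `employee` and `membership`), the function names and the rule "two people are linked when they share a project" are all assumptions made for the example.

```python
import sqlite3

def extract_network(conn):
    """Return (nodes, edges): people as nodes, shared-project links as edges."""
    nodes = {
        row[0]: {"name": row[1], "dept": row[2]}
        for row in conn.execute("SELECT id, name, dept FROM employee")
    }
    # Self-join on the (hypothetical) membership table: two people are linked
    # when they share at least one project; the weight counts shared projects.
    query = """
        SELECT m1.emp_id, m2.emp_id, COUNT(*)
        FROM membership m1
        JOIN membership m2
          ON m1.project_id = m2.project_id AND m1.emp_id < m2.emp_id
        GROUP BY m1.emp_id, m2.emp_id
    """
    edges = {(a, b): w for a, b, w in conn.execute(query)}
    return nodes, edges

def demo_connection():
    """Build a tiny in-memory database with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, dept TEXT);
        CREATE TABLE membership (emp_id INTEGER, project_id INTEGER);
        INSERT INTO employee VALUES (1, 'Ann', 'R&D'), (2, 'Bob', 'R&D'), (3, 'Cy', 'Sales');
        INSERT INTO membership VALUES (1, 10), (2, 10), (2, 11), (3, 11), (1, 11);
    """)
    return conn
```

Calling `extract_network(demo_connection())` yields three nodes and three weighted edges. In a real database, the choice of which tables become nodes and which joins become edges is the actual modelling decision.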
Given this size, an aggregation step is necessary. In  and  we propose an aggregation step based on the k-SNAP algorithm  , which produces a summary graph by grouping nodes according to attributes and relationships selected by the user.
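The idea of a summary graph can be sketched as follows. This is deliberately simpler than k-SNAP: it only groups nodes that share the same values for the selected attributes and counts the edges between groups, without the relationship-based refinement or the control over the number of groups. The function name `summarize` and the data layout are assumptions made for the example.

```python
from collections import defaultdict

def summarize(nodes, edges, attrs):
    """Group nodes by the selected attributes and count edges between groups.

    nodes: dict node_id -> dict of attribute values
    edges: iterable of (u, v) pairs (undirected)
    attrs: attribute names chosen by the user
    """
    # Each node belongs to the group identified by its attribute values.
    group_of = {n: tuple(a[k] for k in attrs) for n, a in nodes.items()}

    sizes = defaultdict(int)
    for group in group_of.values():
        sizes[group] += 1

    # A super-edge aggregates all original edges between two groups.
    super_edges = defaultdict(int)
    for u, v in edges:
        key = tuple(sorted((group_of[u], group_of[v])))
        super_edges[key] += 1
    return dict(sizes), dict(super_edges)
```

For example, grouping three people by a `dept` attribute collapses the graph into one super-node per department, with a count on each super-edge indicating how many original links it stands for.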
This work is done in collaboration with Marie-Aude Aufaure, head of the Business Intelligence Team, Ecole Centrale Paris, MAS Laboratory.
Networks Solutions for Expert Finding and People Name Disambiguation
Participants : Elena Smirnova, Yi-Ling Kuo, Brigitte Trousse.
The task of finding people who are experts on a given topic has recently attracted close attention. State-of-the-art expert finding algorithms uncover the knowledge areas of candidate experts from the textual content of associated documents. While powerful, these models ignore any social structure that might be available. We therefore developed a Bayesian hierarchical model for expert finding that accounts for both content and social relationships. The model assumes that social links are determined by the similarity in expertise between candidates. The results of experiments on the UvT expert collection have demonstrated the effectiveness of our algorithm  .
E. Smirnova visited Intellius, a people search technology company (Aug 8 - Oct 5, 2011). The goal of this visit was to validate the research on expert finding in social networks on a real dataset and to advance it further. As a real dataset, we took a sample of United States LinkedIn public profiles. We built an organizational network by connecting each LinkedIn user to their colleagues at different workplaces. We also constructed a geographical network from each user's current location in the United States. We used Amazon's Mechanical Turk framework ( http://aws.amazon.com/code/923 ) to collect user-oriented judgements for model evaluation. We found that the user-oriented model is statistically significantly preferred to the baseline model on 72.5% of queries.
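The construction of the two networks can be sketched as below. The profile layout, the field names (`workplaces`, `location`) and the linking rule (any shared value creates an edge) are assumptions for illustration; the source does not describe the actual construction in more detail.

```python
from collections import defaultdict
from itertools import combinations

def link_by_shared_value(profiles, field):
    """Connect every pair of users that share a value of `field`.

    profiles: dict user_id -> dict; profile[field] is a string or a list.
    Returns a set of undirected edges (u, v) with u < v.
    """
    buckets = defaultdict(set)
    for user, profile in profiles.items():
        values = profile[field]
        if isinstance(values, str):
            values = [values]
        for value in values:
            buckets[value].add(user)

    edges = set()
    for members in buckets.values():
        # All users in the same bucket are pairwise connected.
        edges.update(combinations(sorted(members), 2))
    return edges
```

The same helper gives an organizational network with `field="workplaces"` and a geographical one with `field="location"`. On large buckets the number of pairs grows quadratically, so a real pipeline would likely need to cap or sample them.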
Her work on name disambiguation, done in 2010, has been integrated into an article on the quick detection of top-k Personalized PageRank (PPR) lists in  . The effectiveness of the chosen approach, based on Monte Carlo methods for quick detection of top-k PPR lists, has been demonstrated on the Web and Wikipedia graphs.
During her internship, Yi-Ling Kuo worked on person name disambiguation, starting with the analysis of the very large Yahoo! Web graph.
This work was carried out in the context of Smirnova's thesis  , which was defended on December 15 (thesis supervised by B. Trousse (AxIS) and K. Avrachenkov (Maestro)).
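To make the Monte Carlo idea concrete, here is a minimal sketch of one common variant: many short random walks are started at the seed node, each walk stops with probability 1 - `damping` at every step, and the frequency of end points estimates the PPR vector. This is not necessarily the exact estimator used in the cited article. The function name, the parameter defaults and the handling of dead-end nodes (the walk simply stops there) are assumptions.

```python
import random
from collections import Counter

def mc_topk_ppr(graph, seed, k, damping=0.85, walks=10000, rng=None):
    """Estimate the top-k Personalized PageRank list of `seed`.

    graph: dict node -> list of out-neighbours
    Returns a list of (node, estimated PPR score), highest first.
    """
    rng = rng or random.Random(0)  # fixed seed so runs are repeatable
    ends = Counter()
    for _ in range(walks):
        node = seed
        # Continue with probability `damping`; stop otherwise,
        # or when the current node has no out-neighbours (assumption).
        while graph.get(node) and rng.random() < damping:
            node = rng.choice(graph[node])
        ends[node] += 1
    return [(node, count / walks) for node, count in ends.most_common(k)]
```

Because only the ranking of the first few nodes matters, a modest number of walks is often enough to separate the top of the list, which is what makes this kind of method attractive on very large graphs.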
Towards an On-Line Analysis of Tweets Processing
Participant : Nicolas Béchet.
Tweets exchanged over the Internet represent an important source of information, even if their characteristics make them difficult to analyze (a maximum of 140 characters, etc.). In  , we define a data warehouse model to analyze large volumes of tweets by proposing measures relevant in the context of knowledge discovery. The use of data warehouses as a tool for the storage and analysis of textual documents is not new but current measures are not well-suited to the specificities of the manipulated data. We also propose a new way for extracting the context of a concept in a hierarchy. Experiments carried out on real data underline the relevance of our proposal.
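A warehouse-style analysis of tweets can be pictured with the small sketch below, which rolls word counts up along a location hierarchy. The fact layout, the city-to-country hierarchy and the use of raw word counts as the measure are all assumptions for illustration; the measures actually proposed in the cited work are not described here.

```python
from collections import Counter, defaultdict

def roll_up(facts, hierarchy, level):
    """Aggregate word counts per (day, location member at `level`).

    facts: list of dicts with keys "day", "city", "word", "count"
    hierarchy: dict city -> dict mapping each level name to its member
    """
    cube = defaultdict(Counter)
    for fact in facts:
        member = hierarchy[fact["city"]][level]
        cube[(fact["day"], member)][fact["word"]] += fact["count"]
    return cube
```

With facts recorded at city level, `roll_up(facts, hierarchy, "country")` returns one cell per day and country, so the same data can be read at a coarser granularity without being stored twice.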
This work is carried out in collaboration with LIRMM and CEMAGREF.