Section: New Results

Mining for Knowledge Discovery in Information Systems

Mining Data Streams: Clustering and Pattern Extraction

Participant : Chongsheng Zhang.

In Zhang's thesis [19] (supervised by F. Masseglia), which was partially funded by ANR MIDAS (cf. 8.2.1), we study the management and mining of data streams with evolving tuples, where the evolution is caused by model updates or tuple revisions. For instance, in an online auction system where bids on auction items are streamed, some users may bid for more than one item within the user-specified time interval. As a result, the profiles of the users can be updated or revised in such applications. Data streams with evolving tuples bring new challenges as well as research opportunities. This work develops novel and efficient models and methods for managing and mining such streams.

(I) To model data streams with evolving tuples, we propose the Anti-Bouncing Streaming model (ABS) for usage streams. ABS fits data streams with evolving tuples and enables stream processing methods to handle tuple updates or revisions.

(II) To find frequent itemsets in data streams with evolving tuples over pane-based sliding windows, we conduct a theoretical analysis and propose theorems that make it unnecessary to scan past slides to check for itemsets that may become frequent. We also design novel data structures that manage data streams with evolving tuples efficiently and facilitate the mining of frequent itemsets. Moreover, we devise an efficient counting algorithm to verify the frequency of candidate itemsets, and we propose two running frameworks for this problem.

(III) To extract important feature sets from data streams (including those with evolving tuples), we devise, based upon ABS, a streaming feature set selection algorithm, which is the first in the literature. This method relies on information theory to extract informative feature sets. To further accelerate the extraction of the most informative feature set from high-dimensional data, we propose a framework that reduces the huge search space to a rather small subset while still guaranteeing the quality of the discovered feature sets.
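
As an illustration of the setting in (II), the following is a minimal sketch of frequent itemset counting over a pane-based sliding window. It is not the algorithm of the thesis: it ignores tuple revisions, enumerates all small itemsets exhaustively, and simply keeps one counter per pane so that an expired pane can be subtracted without rescanning its transactions. The class name `PaneWindow` and its parameters are hypothetical.

```python
from collections import Counter, deque
from itertools import combinations


class PaneWindow:
    """Sliding window made of a fixed number of panes (hypothetical helper)."""

    def __init__(self, max_panes, max_itemset_size=2):
        self.max_panes = max_panes
        self.max_itemset_size = max_itemset_size
        self.panes = deque()      # one (counts, n_transactions) pair per pane
        self.totals = Counter()   # itemset counts over the whole window
        self.n_transactions = 0

    def _count_pane(self, transactions):
        # Count every itemset of size 1..max_itemset_size in one pane.
        counts = Counter()
        for transaction in transactions:
            items = sorted(set(transaction))
            for size in range(1, self.max_itemset_size + 1):
                counts.update(combinations(items, size))
        return counts

    def add_pane(self, transactions):
        # Add the newest pane, then evict the oldest one if the window is full.
        counts = self._count_pane(transactions)
        self.panes.append((counts, len(transactions)))
        self.totals.update(counts)
        self.n_transactions += len(transactions)
        if len(self.panes) > self.max_panes:
            old_counts, old_n = self.panes.popleft()
            self.totals.subtract(old_counts)
            self.n_transactions -= old_n

    def frequent(self, min_support):
        # Itemsets whose relative support in the window reaches min_support.
        threshold = min_support * self.n_transactions
        return {s: c for s, c in self.totals.items() if c > 0 and c >= threshold}


window = PaneWindow(max_panes=2)
window.add_pane([["a", "b"], ["a", "c"]])
window.add_pane([["a", "b"], ["b"]])
print(window.frequent(0.5))
```

The per-pane counters trade memory for the ability to slide the window in time proportional to the size of one pane; exhaustive enumeration is only reasonable for small itemset sizes.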

In 2011, Chongsheng Zhang mainly worked on a data stream mining method for extracting frequent itemsets. This method has not been published yet and is described in Chapter 5 (page 79) of his thesis [19].

Clustering on Multiple Dissimilarity Matrices

Participants : Yves Lechevallier, Francisco de A.T. de Carvalho, Thierry Despeyroux, Alessandra Silva Anyzewski.

In [23] we introduce hard clustering algorithms that partition objects while simultaneously taking into account their relational descriptions given by multiple dissimilarity matrices [49]. The aim is for the different dissimilarity matrices to play a collaborative role in producing a final consensus partition. These matrices could have been generated using different sets of variables and a fixed dissimilarity function, using a fixed set of variables and different dissimilarity functions, or using different sets of variables and dissimilarity functions.

These methods, which are based on the dynamic hard clustering algorithm for relational data as well as on the dynamic clustering algorithm based on adaptive distances, are designed to provide a partition and a prototype for each cluster, and to learn a relevance weight for each dissimilarity matrix by optimizing an adequacy criterion that measures the fit between clusters and their representatives.

These relevance weights change at each iteration of the algorithm and can either be the same for all clusters or differ from one cluster to another. The usefulness of these partitioning hard clustering algorithms is shown on two real-world time trajectory datasets.
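
The general scheme described above can be sketched as follows, for the case where the weights are the same for all clusters. This is an illustrative sketch under stated assumptions, not the exact algorithm of [23]: each prototype is assumed to be a single object (a medoid chosen among all objects), the weights are assumed to be constrained to have a product equal to 1, and the order of the three steps is a choice made here. The function name and parameters are hypothetical.

```python
import math


def cluster_multi_dissimilarity(matrices, prototypes, n_iter=20):
    """Hard clustering over several dissimilarity matrices (illustrative sketch).

    matrices   : list of p square matrices (lists of lists), one per view
    prototypes : initial prototype (object index) for each cluster
    Returns (labels, prototypes, weights).
    """
    p, n = len(matrices), len(matrices[0])
    prototypes = list(prototypes)
    n_clusters = len(prototypes)
    weights = [1.0] * p
    labels, previous = [0] * n, None

    def combined(i, g):
        # Weighted sum of the p dissimilarities between object i and object g.
        return sum(w * d[i][g] for w, d in zip(weights, matrices))

    for _ in range(n_iter):
        # Allocation: assign each object to the closest prototype.
        labels = [min(range(n_clusters), key=lambda k: combined(i, prototypes[k]))
                  for i in range(n)]
        # Representation: pick the object minimizing the within-cluster sum.
        for k in range(n_clusters):
            members = [i for i in range(n) if labels[i] == k]
            if members:
                prototypes[k] = min(
                    range(n), key=lambda g: sum(combined(i, g) for i in members))
        # Weighting (assumed product-to-one constraint): a matrix with a
        # lower total within-cluster dispersion gets a higher weight.
        totals = [sum(d[i][prototypes[labels[i]]] for i in range(n))
                  for d in matrices]
        if all(t > 0 for t in totals):
            geo_mean = math.exp(sum(math.log(t) for t in totals) / p)
            weights = [geo_mean / t for t in totals]
        state = (labels, list(prototypes))
        if state == previous:  # partition and prototypes are stable
            break
        previous = state
    return labels, prototypes, weights


xs, ys = [0, 1, 10, 11], [0, 3, 0, 3]
d1 = [[abs(a - b) for b in xs] for a in xs]  # view that separates two groups
d2 = [[abs(a - b) for b in ys] for a in ys]  # less informative view
labels, protos, weights = cluster_multi_dissimilarity([d1, d2], [0, 2])
print(labels, protos, weights)
```

In this toy example the first matrix separates the two groups cleanly, so it ends up with the larger relevance weight.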

Clustering of Constrained Symbolic Data

Participants : Marc Csernel, Francisco de A.T. de Carvalho.

In the context of our FACEPE collaboration with Brazil (cf. section ), we have presented a method that clusters symbolic descriptions constrained by presence rules in polynomial rather than combinatorial time. This method makes it possible to deal with "false missing values". Such a method can be applied to various classification problems [26].

Web Page Clustering based on a Community Detection Algorithm

Participants : Yves Lechevallier, Yacine Slimani.

Extracting knowledge from Web users' access data in the Web Usage Mining (WUM) process is a challenging task that continues to gain importance as the size of the Web and its user base increase. This is why methods have been proposed in the literature to understand the behaviour of users on the Web and to improve access to information. In this work [42], we are interested in the analysis of user browsing behaviour. The objective is to understand the navigational practices of users (teachers, students and administrative staff). In the first step, we clean the data by removing irrelevant information and noise. In the second step, the remaining data are arranged in a coherent way in order to identify user sessions. We then define a new knowledge extraction approach [42] that treats the data resulting from the preprocessing phase (first and second steps) as a set of communities. Our approach extends the Modularity measure proposed by Newman and Girvan [97] to the Web Mining context in order to benefit from its classifying capacity for community discovery.
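
For reference, the standard Newman-Girvan modularity of a partition of an undirected, unweighted graph can be computed as in the minimal sketch below. It shows only the base measure, not the extension proposed in [42], and the way a graph would be built from user sessions is not specified here: the example graph is purely illustrative.

```python
from collections import defaultdict


def modularity(edges, community):
    """Newman-Girvan modularity of a partition of an undirected graph.

    edges     : iterable of (u, v) pairs, each undirected edge listed once
    community : dict mapping each node to its community label
    """
    edges = list(edges)
    m = len(edges)
    if m == 0:
        return 0.0
    internal = defaultdict(int)  # edges with both ends in the community
    degree = defaultdict(int)    # sum of node degrees per community
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree[cu] += 1
        degree[cv] += 1
        if cu == cv:
            internal[cu] += 1
    # Q = sum over communities of (fraction of internal edges
    #     minus the expected fraction under random rewiring).
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Illustrative graph: two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
partition = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
print(modularity(edges, partition))
```

A partition that puts every node in the same community has modularity 0, while the two-triangle partition above scores higher because most edges fall inside the communities.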

This work is done in collaboration with the LRIA laboratory – Université Ferhat Abbas, Sétif, Algeria.

Critical Edition of Sanskrit Texts

Participants : Marc Csernel, Nicolas Béchet, Ehab Hassan, Yves Lechevallier.

Progress has been made on the computer-assisted elaboration of critical editions of Sanskrit texts. First, Nicolas Béchet and Marc Csernel worked on the problem of moved texts. After aligning two versions of a text with the techniques developed in [48], we find that some parts of the text appear to have been moved. Until now, we were not able to detect when a passage has been moved in a manuscript.

Now, using a word n-gram technique proposed in [48], we have obtained quite good results on the moved text problem and have been able to optimize the different possible parameters. A paper on the subject has been submitted to the CICLing 2012 conference (http://www.cicling.org/2012/).
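
The general idea of using word n-grams to spot moved passages can be sketched as follows. This is a simplified illustration, not the technique of [48]: it assumes whitespace-separated words, compares raw word positions rather than aligned ones, and ignores the specific preprocessing that Sanskrit texts require. The function names and the `min_shift` parameter are hypothetical.

```python
def word_ngrams(words, n):
    """All contiguous word n-grams of a list of words, in order."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def unique_positions(grams):
    """Map each n-gram that occurs exactly once to its position."""
    positions = {}
    for i, gram in enumerate(grams):
        # A repeated n-gram is ambiguous, so it is marked and dropped.
        positions[gram] = i if gram not in positions else None
    return {g: i for g, i in positions.items() if i is not None}


def moved_ngrams(text_a, text_b, n=3, min_shift=5):
    """N-grams unique in both texts whose positions differ by >= min_shift.

    Returns a list of (ngram, position_in_a, position_in_b), sorted by
    position in text_a.
    """
    pos_a = unique_positions(word_ngrams(text_a.split(), n))
    pos_b = unique_positions(word_ngrams(text_b.split(), n))
    moved = [(g, pos_a[g], pos_b[g]) for g in pos_a
             if g in pos_b and abs(pos_a[g] - pos_b[g]) >= min_shift]
    return sorted(moved, key=lambda item: item[1])


version_a = "a b c d e f g h i j"
version_b = "f g h i j a b c d e"  # the two halves have been swapped
for gram, i, j in moved_ngrams(version_a, version_b):
    print(" ".join(gram), i, j)
```

Runs of consecutive n-grams that share the same shift suggest a moved block; the n-gram length and the shift threshold are the kind of parameters that would need tuning on real manuscripts.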

Following this new treatment of the moved text problem, we need to provide an interactive display of the critical edition. During his internship, Ehab Hassan worked on this subject and obtained good results. These results need to be examined in depth by Sanskritists to see whether they fully meet their needs.