Section: New Results

Mining for Knowledge Discovery in Information Systems

Fuzzy Clustering on Multiple Dissimilarity Matrices

Participants : Yves Lechevallier, Francisco de Carvalho.

During 2013 we introduced fuzzy clustering algorithms [18], [27] that partition objects while simultaneously taking into account their relational descriptions given by multiple dissimilarity matrices. The aim is to make the different dissimilarity matrices collaborate in order to reach a final consensus partition. These matrices can be obtained using different sets of variables and dissimilarity functions. The algorithms furnish a partition and a prototype for each fuzzy cluster, and learn a relevance weight for each dissimilarity matrix by optimizing an adequacy criterion that measures the fit between the fuzzy clusters and their representatives. These relevance weights change at each iteration of the algorithm and can either be the same for all fuzzy clusters or differ from one fuzzy cluster to another.

We also proposed a new algorithm [19] based on a non-linear aggregation criterion, weighted Tchebycheff distances, which is more appropriate than linear combinations (such as weighted averages) for the construction of compromise solutions.
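As a minimal illustration of the two aggregation schemes (the weight-learning step and the full clustering criteria of [18], [19] are omitted), the following sketch contrasts a linear combination of dissimilarity matrices with a weighted Tchebycheff aggregation; the matrices and weights are hypothetical:

```python
import numpy as np

def linear_aggregate(d_list, weights):
    """Weighted average of dissimilarity matrices (linear combination)."""
    return sum(w * d for w, d in zip(weights, d_list))

def tchebycheff_aggregate(d_list, weights):
    """Weighted Tchebycheff aggregation: element-wise max of the weighted
    dissimilarities, which favours balanced compromise solutions."""
    stacked = np.stack([w * d for w, d in zip(weights, d_list)])
    return stacked.max(axis=0)

# Two toy dissimilarity matrices describing the same 3 objects
d1 = np.array([[0., 1., 4.], [1., 0., 2.], [4., 2., 0.]])
d2 = np.array([[0., 3., 1.], [3., 0., 1.], [1., 1., 0.]])
w = [0.5, 0.5]

lin = linear_aggregate([d1, d2], w)       # (d1 + d2) / 2
che = tchebycheff_aggregate([d1, d2], w)  # element-wise max of the halves
```

The Tchebycheff variant is driven by the worst (largest) weighted dissimilarity for each pair of objects, which is why it behaves differently from a weighted average when the matrices disagree.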

Experiments with real-valued data sets from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/), as well as with interval-valued and histogram-valued data sets, show the usefulness of the proposed fuzzy clustering algorithms.

Clustering of Functional Boxplots for Multiple Streaming Time Series

Participant : Yves Lechevallier.

We introduced a micro-clustering strategy for Functional Boxplots [30]. The aim is to summarize a set of streaming time series split into non-overlapping windows. It is a two-step strategy: first, an on-line summarization by means of functional data structures named Functional Boxplot micro-clusters; then, an off-line processing of these functional data structures to produce the final summarization. Our main contribution consists in providing a new definition of micro-cluster based on Functional Boxplots and in defining a proximity measure that allows us to compare and update them. This yields a finer graphical summarization of the streaming time series through five basic functional statistics of the data. The obtained synthesis keeps track of the dynamic evolution of the multiple streams.
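A simplified sketch of the five basic statistics summarizing one window of streams (the actual construction in [30] relies on functional depth and on the micro-cluster update machinery, which are omitted here; the data are hypothetical):

```python
import numpy as np

def functional_boxplot_summary(window):
    """Summarize a window of time series (rows = series, columns = time
    points) by five pointwise statistics, in the spirit of a functional
    boxplot: min, first quartile, median, third quartile, max."""
    return {
        "min":    np.min(window, axis=0),
        "q1":     np.percentile(window, 25, axis=0),
        "median": np.median(window, axis=0),
        "q3":     np.percentile(window, 75, axis=0),
        "max":    np.max(window, axis=0),
    }

# Toy window: 4 streaming series observed at 5 time points
rng = np.random.default_rng(0)
window = rng.normal(size=(4, 5))
summary = functional_boxplot_summary(window)
```

Each of the five statistics is itself a curve over the window, which is what makes a graphical, boxplot-like summary of the streams possible.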

This work is done in collaboration with the laboratory of Political Science "Jean Monnet", Second University of Naples, Caserta, Italy.

Web Page Clustering based on a Community Detection Algorithm

Participant : Yves Lechevallier.

Extracting knowledge from Web users' access data in the Web Usage Mining (WUM) process is a challenging task that continues to gain importance as the size of the Web and its user base increase. Numerous methods have therefore been proposed in the literature to understand the behaviour of Web users and to improve access to information.

During 2013 we pursued our previous work on an extraction approach based on the modularity function. This approach discovers the existing communities by modeling the data obtained in the pre-processing step as a weighted graph. The method discriminates the communities through their subjects of interest and extracts relevant knowledge.
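The quantity being optimized can be illustrated concretely: a minimal computation of the Newman modularity of a partition of a weighted graph (the community-discovery algorithm itself and the WUM pre-processing are omitted; the toy graph is hypothetical):

```python
import numpy as np

def modularity(adj, communities):
    """Newman modularity Q of a partition of a weighted undirected graph.
    adj: symmetric weight matrix; communities: community label per node.
    Q compares the intra-community weight to its expectation under a
    random graph with the same degree sequence."""
    m2 = adj.sum()                # 2m: total weight, both directions
    degrees = adj.sum(axis=1)
    n = len(adj)
    q = 0.0
    for i in range(n):
        for j in range(n):
            if communities[i] == communities[j]:
                q += adj[i][j] - degrees[i] * degrees[j] / m2
    return q / m2

# Two triangles joined by a single edge: the natural 2-community split
adj = np.array([
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
], dtype=float)
labels = [0, 0, 0, 1, 1, 1]
q = modularity(adj, labels)   # positive: better than random grouping
```

Community-detection algorithms search for the partition that maximizes this Q over the weighted graph built from the preprocessed usage data.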

This work is done in collaboration with Yacine Slimani from the LRIA laboratory at the Ferhat Abbas University, Setif, Algeria, and will be submitted to an international journal.

Normalizing Constrained Symbolic Data for Clustering

Participants : Marc Csernel, Francisco de Carvalho.

Clustering is one of the most common operations in data analysis, while constrained clustering is less common. During 2013 we presented a clustering method [31] in the framework of Symbolic Data Analysis (SDA) which allows us to cluster symbolic data. Such data can be constrained by relations between the variables, expressed by rules that encode domain knowledge. These rules, however, can induce a combinatorial increase of the computation time according to the number of rules. The presented algorithm clusters such data in polynomial time: the data are first decomposed according to the rules, and a clustering algorithm based on dissimilarities is then applied.

Dynamic Clustering Method for Mixed Data

Participants : Yves Lechevallier, Marc Csernel, Brigitte Trousse.

For the purposes of the ELLIOT project (cf. Section 8.3.1), a new version of the MND method (Dynamic Clustering Method for Mixed Data) has been elaborated. It iteratively determines a series of partitions, improving the underlying clustering criterion at each step. All the proposed distance functions for p variables are sums of dissimilarities computed on the univariate component descriptors Yj; the most appropriate dissimilarity is chosen according to the type of each variable.

In practice, however, data to be clustered are typically described by variables of different types. An overall dissimilarity measure is then obtained as a linear combination of the dissimilarity measures computed on each kind of variable.

A new release of MND algorithm based on past work [80] has been developed for ELLIOT purposes, providing some default configuration parameters for non experts.

In this version two types of distances are proposed:

  • Quantitative distance: L1 or Euclidean distance when the variables are quantitative or continuous.

  • Boolean distance: Chi-squared (χ²), L1 or Euclidean distance when the variables are categorical or discrete.
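A minimal sketch of the linear-combination principle behind the overall dissimilarity, assuming an L1 distance on the quantitative part and a simple mismatch count on the categorical part (the χ² option and the actual MND weighting scheme are omitted; the `alpha` weight and the data are hypothetical):

```python
def mixed_dissimilarity(x, y, num_idx, cat_idx, alpha=0.5):
    """Overall dissimilarity between two objects described by mixed
    variables: L1 distance on the quantitative components plus a simple
    mismatch count on the categorical ones, linearly combined with a
    hypothetical weight alpha."""
    d_num = sum(abs(x[j] - y[j]) for j in num_idx)   # L1 on quantitative
    d_cat = sum(x[j] != y[j] for j in cat_idx)       # categorical mismatches
    return alpha * d_num + (1 - alpha) * d_cat

# Two toy objects: two quantitative variables, two categorical ones
x = [1.0, 3.0, "red", "small"]
y = [2.0, 1.0, "red", "large"]
d = mixed_dissimilarity(x, y, num_idx=[0, 1], cat_idx=[2, 3])
```

The per-type dissimilarities are computed independently and only combined at the end, which is what allows the same dynamic clustering loop to handle any mix of variable types.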

This algorithm has been applied to cluster answers to questionnaires collected by a diary tool within the ELLIOT Green Services use case (cf. Section 6.5.4).

Applying a K-means Clustering Method for Clustering Districts according to Pollution

Participants : Brigitte Trousse, Yves Lechevallier, Guillaume Pilot, Caroline Tiffon.

Our motivation was to provide citizens with a comparative analysis, at the district level, of pollution data from Azimut stations (ozone O3 and nitrogen dioxide NO2). To achieve this, the Nice Côte d'Azur territory was discretized into small areas, and the IoT data were preprocessed for each district and time period before applying clustering. The temporal and spatial units were clustered into 5 and then into 6 clusters, and the partition into 5 clusters was selected. For this partition, the temporal units in each area were counted and the percentage of each cluster per area was computed; around 30 areas with more than 10 temporal units were found. We then improved this process to classify the districts of the city based on their IoT data (Azimut O3-NO2 data) for each hour and day, in order to provide a new functionality in the second version of MyGreenServices.
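A self-contained sketch of the clustering step, assuming each (district, period) unit is described by hypothetical mean O3 and NO2 features (the real preprocessing of the Azimut station data is omitted):

```python
import numpy as np

def kmeans(points, init_centroids, iters=100):
    """Plain k-means: assign each feature vector to its nearest centroid,
    recompute the centroids, and repeat until they stop moving."""
    centroids = np.asarray(init_centroids, dtype=float).copy()
    k = len(centroids)
    labels = np.zeros(len(points), dtype=int)
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :],
                               axis=2)
        labels = dists.argmin(axis=1)
        new = np.array([points[labels == c].mean(axis=0)
                        if (labels == c).any() else centroids[c]
                        for c in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

# Hypothetical features per (district, period): mean O3 and NO2 levels
rng = np.random.default_rng(1)
low = rng.normal([20.0, 10.0], 2.0, size=(30, 2))    # low-pollution units
high = rng.normal([80.0, 50.0], 2.0, size=(30, 2))   # high-pollution units
points = np.vstack([low, high])

# Seed with one unit from each regime, then cluster into k=2
labels, centroids = kmeans(points, init_centroids=points[[0, -1]])
```

In the study the same procedure was run with 5 and 6 clusters; the toy example uses 2 only to keep the sketch readable.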

This work is partially funded by the ELLIOT project (see Section 8.3.1).

Summarizing Dust Station IoT Data with REGLO, a FocusLab web service

Participants : Yves Lechevallier, Brigitte Trousse, Guillaume Pilot, Xavier Augros.

Within ELLIOT, we applied the GEAR method (REGLO in French) [57], [58], [59] to the evolution of dust data collected from one citizen sensor.

Our motivation was to summarize IoT data in order to provide a pollution context for each user. Such IoT summaries constitute interesting individual contextual data that support the living lab manager in better interpreting user behavior and, ultimately, the user experience.

REGLO summarizes the IoT data with isolated points and line segments.

The goal now is to carry out an analysis of these summaries to automatically determine the characteristics of the curve.

We retained only the segments. For each segment we computed four variables that characterize it:

  • The slope of the segment,

  • The midpoint of the segment (average of this segment),

  • The length of the segment,

  • The duration of the segment (the time interval between the start time and the end time of the segment).

From these four values we can derive an interpretation of the original curve, taking into account only two variables at a time and constructing a 2D representation.
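The four variables above can be sketched as follows, assuming a segment is given by its start and end points; the geometric reading of the length (Euclidean norm of the segment) is an assumption, as the report does not define it formally:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    t0: float  # start time
    v0: float  # value at start
    t1: float  # end time
    v1: float  # value at end

def segment_features(s):
    """Compute the four variables characterizing a summary segment:
    slope, midpoint (average level), length, and duration."""
    duration = s.t1 - s.t0                 # time interval covered
    slope = (s.v1 - s.v0) / duration       # rate of change
    midpoint = (s.v0 + s.v1) / 2           # average of the segment
    length = ((s.t1 - s.t0) ** 2 + (s.v1 - s.v0) ** 2) ** 0.5
    return {"slope": slope, "midpoint": midpoint,
            "length": length, "duration": duration}

feats = segment_features(Segment(t0=0.0, v0=10.0, t1=4.0, v1=13.0))
```

Plotting any two of these variables against each other (e.g. slope versus midpoint) gives the 2D representation mentioned above.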

This work is partially funded by the ELLIOT project (see Section 8.3.1).

Clustering of Solar Irradiance

Participants : Thierry Despeyroux, Francisco de Carvalho, Yves Lechevallier, Thien Phuc Hoang Nguyen.

The development of grid-connected photovoltaic power systems leads to new challenges. Short- or medium-term prediction of the solar irradiance is definitely a way to reduce the required storage capacities and, as a result, to increase the penetration of photovoltaic units on the power grid. We present the first results of an interdisciplinary research project involving researchers in energy, meteorology and data mining, addressing this real-world problem. The objective is to show the advantages and disadvantages of two approaches for classifying curves.

In Reunion Island, solar radiation measurements were collected every minute from December 2008 to March 2012 using calibrated instruments. Prior to prediction modelling, two clustering strategies were applied to analyse this database of 951 days.

During 2013 we continued our research and obtained many results [28] .

Our methodology is based on two clustering approaches.

The first approach combines proven data-mining methods: Principal Component Analysis is used as a pre-processing step for dimension reduction and de-noising, then the Ward hierarchical method and K-means are used to find a partition with a suitable number of classes.
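A compact sketch of this pipeline on hypothetical irradiance curves (the real data and the choice of the number of classes from the dendrogram are omitted; two artificial regimes stand in for clear and cloudy days):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import ward, fcluster

# Toy daily irradiance curves: 60 "days" x 24 "hours", two regimes
rng = np.random.default_rng(0)
t = np.linspace(0, np.pi, 24)
clear = np.sin(t) * 800 + rng.normal(0, 30, size=(30, 24))   # clear days
cloudy = np.sin(t) * 300 + rng.normal(0, 30, size=(30, 24))  # cloudy days
curves = np.vstack([clear, cloudy])

# 1) PCA for dimension reduction and de-noising
scores = PCA(n_components=3).fit_transform(curves)

# 2) Ward hierarchical clustering, cut here at k=2 classes
k = 2
labels_ward = fcluster(ward(scores), t=k, criterion="maxclust")

# 3) K-means consolidation, seeded with the Ward class centroids
centers = np.array([scores[labels_ward == c].mean(axis=0)
                    for c in range(1, k + 1)])
labels = KMeans(n_clusters=k, init=centers, n_init=1).fit_predict(scores)
```

Seeding K-means with the Ward centroids is a common way to combine the two methods: the hierarchy suggests the number and location of the classes, and K-means consolidates the partition.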

The second approach [78], [20] uses a clustering method that operates on a set of dissimilarity matrices. Each cluster is represented by an element, or a subset, of the set of objects to be classified. The five meaningful clusters found by the two clustering approaches are compared.

Understanding Cooking Users' Recipes by Extracting Intrinsic Knowledge

Participants : Damien Leprovost, Thierry Despeyroux, Yves Lechevallier.

On community web sites, users share knowledge, being both authors and readers. We presented a method to build an understanding of the semantics of the community, without using any external knowledge base, by extracting knowledge from the analysed user contributions. We also proposed an evaluation of the trust attributable to that deduced understanding, in order to assess the quality of user content, applied to cooking recipes provided by users on sharing web sites. This work is partially funded by the FIORA project (see Section 8.2.2). Two articles have been accepted in early 2014 [25], [29].

Knowledge Modeling for Multi-View KDD Process

Participant : Brigitte Trousse.

We pursued our supervision (with our colleagues H. Behja and A. Marzark from Morocco) of E.L. Moukhtar Zemmouri's PhD thesis on a Viewpoint Model in the context of a KDD process, a topic we initiated during Behja's PhD thesis [40]. E. Zemmouri defended his thesis at the end of this year [75]. Below is the summary of his PhD thesis.

Knowledge Discovery in Databases (KDD) is a highly complex, iterative and interactive process aimed at extracting previously unknown, potentially useful, and ultimately understandable patterns from data. In practice, a KDD process involves several actors (domain experts, data analysts, KDD experts…), each with a particular viewpoint. We define a multi-view analysis as a KDD process held by several experts who analyze the same data with different viewpoints. We propose to support users of multi-view analysis through a set of semantic models that manage the knowledge involved during such an analysis. Our objective is to enhance both the reusability of the process and the coordination between users. To do so, we first propose a formalization of the Viewpoint notion in KDD and a Knowledge Model, that is, a specification of the information and knowledge structures and functions involved during a multi-view analysis. Our formalization of the viewpoint notion, using OWL ontologies, is based on the CRISP-DM standard through the identification of a set of generic criteria that characterize a viewpoint in KDD. Once instantiated, these criteria define an analyst's viewpoint. This viewpoint guides the execution of the KDD process and keeps trace of the reasoning and major decisions made by the analyst. Then, to formalize the interaction and interdependence between various analyses conducted according to different viewpoints, we propose a set of semantic relations between viewpoints based on goal-driven analysis. We have defined equivalence, inclusion, conflict, and requirement relations. These relations allow us to enhance coordination, knowledge sharing and mutual understanding between the different actors of a multi-view analysis, as well as the reusability, in terms of viewpoints, of successful data mining experiences within an organization. An article selected from the international conference NGNS 2012 [74] will be published in the on-line Journal of Mobile Multimedia, Volume 9, No. 3&4, March 1, 2014.