Section: Software

Data Mining

Classification and Clustering Methods

Participants: Marc Csernel, Yves Lechevallier [co-correspondent], Brigitte Trousse [co-correspondent].

We developed and maintained a collection of clustering and classification software, written in C++ and/or Java:

Supervised methods

  • a Java library (Somlib) that provides efficient implementations of several SOM (Self-Organizing Map) variants [77], [76], [101], [100], [104], especially those that can handle dissimilarity data. Somlib is available with public access on Inria's Gforge server and was developed by AxIS Rocquencourt and Brieuc Conan-Guez from Université de Metz.

  • a functional Multi-Layer Perceptron library, called FNET, that implements supervised classification of functional data in C++ [96], [99], [98], [97] (developed by AxIS Rocquencourt).

Unsupervised methods: partitioning methods

  • Two partitioning clustering methods for dissimilarity tables, resulting from a collaboration between the AxIS Rocquencourt team and Recife University, Brazil: CDis and CCClust [84]. Both are written in C++ and use the “Symbolic Object Language” (SOL) developed for SODAS. There is also one partitioning method for interval data (Div).

  • Two improved standalone versions of SODAS modules, SCluster and DIVCLUS-T [74] (AxIS Rocquencourt).

Unsupervised methods: agglomerative methods

  • a Java implementation of the 2-3 AHC (developed by AxIS Sophia Antipolis). The software is available as a Java applet that runs the hierarchy visualization toolbox HCT (Hierarchical Clustering Toolbox, see [75]).

A Web interface developed in C++ and running on our internal Apache Web server is available for the following methods: SCluster, Div, CDis, CCClust.

Previous versions of the above software have been integrated into the SODAS 2 software [95], which resulted from the European project ASSO (Analysis System of Symbolic Official data, 2001-2004). SODAS 2 supports the analysis of multidimensional complex data (numerical and non-numerical) coming from databases, mainly in statistical offices and administrations, using Symbolic Data Analysis [71]. This software is registered at APP (Agence de la Protection des Programmes). The latest executable version of the SODAS 2 software, with its user manual, can be downloaded at http://www.info.fundp.ac.be/asso/ [78], [85].

As a 2012 result, a release of the MND (Dynamic Clustering Method for Multi-Nominal data) algorithm, based on previous AxIS research (2003), was produced (cf. Section 6.6).

Extracting Sequential Patterns with Low Support

Participant: Brigitte Trousse [correspondent].

Two methods for extracting sequential patterns with low support were developed by D. Tanasa in his thesis (see Chapter 3 in [103] for more details), in collaboration with F. Masseglia and B. Trousse:

  • Cluster & Divide,

  • and Divide & Discover [11].

These methods have been successfully applied to various Web logs since 2005.
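To make the notion of support concrete, the following minimal Python sketch computes the support of a sequential pattern, i.e. the fraction of sequences that contain the pattern as an ordered (not necessarily contiguous) subsequence. It is not an implementation of Cluster & Divide or Divide & Discover, which are described in [103]. The function names `contains` and `support`, the restriction to sequences of single items, and the sample navigation data are all assumptions made for illustration.

```python
def contains(sequence, pattern):
    """Return True if `pattern` occurs in `sequence` as an ordered subsequence.

    Assumption: each element is a single item (e.g. a page name), not an itemset.
    """
    remaining = iter(sequence)
    # `item in remaining` advances the iterator, so order is respected.
    return all(item in remaining for item in pattern)


def support(sequences, pattern):
    """Fraction of sequences that contain `pattern` (0.0 for an empty collection)."""
    if not sequences:
        return 0.0
    return sum(contains(s, pattern) for s in sequences) / len(sequences)


# Hypothetical navigation sequences, invented for this example.
sessions = [
    ["home", "news", "jobs"],
    ["home", "jobs", "apply"],
    ["home", "news"],
    ["search", "jobs", "apply"],
]

print(support(sessions, ["jobs", "apply"]))
```

On a real Web log, a behaviour followed by only a small share of visitors has a support close to zero, so a frequency threshold set low enough to keep it may also retain a very large number of other patterns. This is the kind of situation that low-support extraction methods target.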

Mining Data Streams

Participants: Brigitte Trousse [correspondent], Mohamed Gaieb.

In Marascu's thesis (2009) [91], a collection of software was developed for knowledge discovery and security in data streams. Three clustering methods for mining sequential patterns in data streams were developed in Java:

  • SMDS compares the sequences to each other, with a complexity of O(n²).

  • SCDS is an improvement of SMDS, in which the complexity is reduced from O(n²) to O(n·m), with n the number of navigations and m the number of clusters.

  • ICDS is a modification of SCDS. The principle is to keep the clusters' centroids from one batch to another.
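The difference between the three cost profiles can be sketched as follows. This is an illustrative Python sketch under strong assumptions, not the SMDS, SCDS or ICDS code: the similarity measure (normalized longest common subsequence), the threshold, and the use of the first sequence of a cluster as its centroid are hypothetical simplifications, whereas the report describes centroids as approximate sequential patterns obtained by alignment.

```python
from itertools import combinations


def lcs_length(a, b):
    """Length of the longest common subsequence of two item sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def similarity(a, b):
    """Hypothetical similarity in [0, 1]: LCS length over the longer length."""
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 1.0


def pairwise_pass(sequences):
    """Compare every pair of sequences: n*(n-1)/2 similarity calls, i.e. O(n^2)."""
    return {
        (i, j): similarity(a, b)
        for (i, a), (j, b) in combinations(enumerate(sequences), 2)
    }


def centroid_pass(sequences, centroids=(), threshold=0.5):
    """Compare each sequence only to the current centroids: at most n*m calls.

    Passing the centroids returned for a previous batch mimics the idea of
    keeping centroids from one batch to the next.
    """
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    comparisons = 0
    for seq in sequences:
        best, best_score = None, threshold
        for k, centroid in enumerate(centroids):
            comparisons += 1
            score = similarity(seq, centroid)
            if score >= best_score:
                best, best_score = k, score
        if best is None:
            # Simplification: the first member stands in for the centroid.
            centroids.append(seq)
            clusters.append([seq])
        else:
            clusters[best].append(seq)
    return centroids, clusters, comparisons


batch1 = [["a", "b", "c"], ["a", "b", "d"], ["x", "y"], ["a", "c"], ["x", "y", "z"]]
batch2 = [["a", "b"], ["x", "z"]]

print(len(pairwise_pass(batch1)))                 # number of pairwise comparisons
centroids, clusters, n_cmp = centroid_pass(batch1)
print(n_cmp, len(centroids))                      # comparisons and clusters found
_, clusters2, n_cmp2 = centroid_pass(batch2, centroids)
print(n_cmp2, clusters2)                          # second batch reuses the centroids
```

With five sequences the pairwise pass needs ten comparisons, while the centroid pass needs at most one comparison per sequence and per existing cluster. The gap widens quickly when the number of clusters stays small relative to the number of navigations.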

These methods take batches of data in the "Client-Date-Item" format and produce clusters of sequences together with their centroids, each centroid being an approximate sequential pattern computed with an alignment technique.
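As an illustration of the input side, the sketch below turns a batch of "Client-Date-Item" records into one date-ordered item sequence per client, which is the kind of object a sequence clustering step would then consume. The concrete record layout (tuples with ISO 8601 date strings), the function name `batch_to_sequences` and the sample records are assumptions; the report does not specify the file format.

```python
from collections import defaultdict


def batch_to_sequences(records):
    """Group (client, date, item) records into per-client item sequences.

    Assumption: dates are ISO 8601 strings, so lexicographic order matches
    chronological order. Records with equal dates keep their input order.
    """
    events_by_client = defaultdict(list)
    for client, date, item in records:
        events_by_client[client].append((date, item))
    return {
        client: [item for _, item in sorted(events, key=lambda e: e[0])]
        for client, events in events_by_client.items()
    }


# Hypothetical batch, invented for this example.
batch = [
    ("c1", "2012-03-02", "news"),
    ("c2", "2012-03-01", "home"),
    ("c1", "2012-03-01", "home"),
    ("c2", "2012-03-03", "jobs"),
]

print(batch_to_sequences(batch))
```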

In 2010, the Java code of one of these methods, SCDS, was integrated into the MIDAS demonstrator, and a C++ version was implemented by F. Masseglia for the CRE contract with Orange Labs (with delivery of a licence), together with a visualisation module (in Java).

It has been tested on the following data:

  • Orange mobile portal logs (100 million records, 3 months) in the context of the MIDAS project (Java version) and of the CRE contract with Orange (C++ version)

  • Inria Sophia Antipolis Web logs (4 million records, 1 year, Java version)

  • Vehicle trajectories (Brinkhoff generator) in the context of the MIDAS project (Java version).

In 2011, in the context of the ELLIOT contract (cf. Section 8.3.1.1), SCDS was integrated as a Web service (Java version) in the first version of the FocusLab platform (cf. Section 6.6): a demonstration was given on the San Raffaele Hospital media use case at the first ELLIOT review in Brussels.

In 2012, we applied the SCDS Web service to data from the co-creation step of two use cases, in Logistics (BIBA) and Green Services (ICT Usage Lab). More data are needed to show the relevance of this method; this is planned for 2013 with the experimentation step of Green Services.

The three C++ programs written for the CRE (Orange Labs) have been deposited at APP.