Section: Software

Web Usage Mining

AWLH for Pre-processing Web Logs

Participants : Yves Lechevallier [co-correspondent] , Brigitte Trousse [co-correspondent] .

AWLH (AxIS Web Log Preprocessing and Data Stream extraction) for Web Usage Mining (WUM) is derived from the AxISlogMiner preprocessing software, which implements the multi-site log preprocessing methodology developed by D. Tanasa in his thesis [16]. In the context of the Eiffel project (2008-2009), we isolated and redesigned the core of the AxISlogMiner preprocessing tool, which we called AWLH; it is composed of a set of tools for pre-processing web log files. AWLH can extract and structure log files from several Web servers using different input formats. As usual, the web log files are cleaned before being used by data mining methods, because they contain many noisy entries; for example, robots introduce a lot of noise into the analysis of user behaviour, so it is important to identify robot requests. The data are stored in a database whose model has been improved.
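The cleaning step described above can be illustrated with a minimal sketch. It is not the AWLH implementation: the class name LogCleaningSketch, the regular expression and the keyword list used to flag robots are all assumptions made for illustration, and real robot detection usually combines several criteria.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: parse one CLF/ECLF line and flag likely robot requests.
final class LogCleaningSketch {
    // Seven CLF fields, optionally followed by the two ECLF fields
    // (referrer and user agent).
    private static final Pattern LINE = Pattern.compile(
        "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+)"
        + "(?: \"([^\"]*)\" \"([^\"]*)\")?$");

    // Assumed keyword list; not taken from AWLH.
    private static final String[] ROBOT_HINTS = {"bot", "crawler", "spider", "slurp"};

    static final class Entry {
        final String host;
        final String timestamp;
        final String request;
        final int status;
        final String userAgent; // empty string when the line is plain CLF

        Entry(String host, String timestamp, String request, int status, String userAgent) {
            this.host = host;
            this.timestamp = timestamp;
            this.request = request;
            this.status = status;
            this.userAgent = userAgent;
        }
    }

    // Returns null when the line does not match the expected layout.
    static Entry parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String agent = m.group(9) == null ? "" : m.group(9);
        return new Entry(m.group(1), m.group(4), m.group(5),
                         Integer.parseInt(m.group(6)), agent);
    }

    // Heuristic: a request for robots.txt or a user agent containing a
    // robot keyword marks the entry as a robot request.
    static boolean looksLikeRobot(Entry e) {
        if (e.request.contains("/robots.txt")) {
            return true;
        }
        String agent = e.userAgent.toLowerCase(Locale.ROOT);
        for (String hint : ROBOT_HINTS) {
            if (agent.contains(hint)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String line = "192.0.2.7 - - [10/Oct/2009:13:55:36 +0200] "
            + "\"GET /robots.txt HTTP/1.0\" 200 120";
        Entry e = parse(line);
        System.out.println(e != null && looksLikeRobot(e) ? "robot" : "kept");
    }
}
```

Entries flagged this way would typically be set aside rather than deleted, so that the robot traffic itself remains available for inspection.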

The current version of our Web log processing software (available on INRIA's gforge website with private access) offers:

  • Processing of several log files from several servers;

  • Support of several input formats (CLF, ECLF, IIS, custom, ...);

  • Incremental pre-processing;

  • Java API to help integrate AWLH into external applications.
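To make the multi-server feature concrete, the following sketch shows one plausible step of such processing: tagging each log line with its server of origin and merging all lines into a single chronologically ordered sequence. It does not use the AWLH Java API, whose interface is not described here; the class MultiServerMergeSketch and its methods are hypothetical, and the lines are assumed to carry a CLF-style timestamp between square brackets.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: merge log lines from several servers by request time.
final class MultiServerMergeSketch {
    // CLF timestamp layout, e.g. 10/Oct/2009:13:55:36 +0200
    private static final DateTimeFormatter CLF_TIME =
        DateTimeFormatter.ofPattern("dd/MMM/yyyy:HH:mm:ss Z", Locale.ENGLISH);

    static final class Tagged {
        final String server;
        final OffsetDateTime time;
        final String line;

        Tagged(String server, OffsetDateTime time, String line) {
            this.server = server;
            this.time = time;
            this.line = line;
        }
    }

    // Extracts the timestamp found between the first '[' and the next ']'.
    static OffsetDateTime timestampOf(String line) {
        int open = line.indexOf('[');
        int close = line.indexOf(']', open + 1);
        if (open < 0 || close < 0) {
            throw new IllegalArgumentException("no timestamp in: " + line);
        }
        return OffsetDateTime.parse(line.substring(open + 1, close), CLF_TIME);
    }

    // Tags every line with its server name and sorts all lines by instant,
    // so that servers logging in different time zones are ordered correctly.
    static List<Tagged> merge(Map<String, List<String>> linesByServer) {
        List<Tagged> all = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : linesByServer.entrySet()) {
            for (String line : e.getValue()) {
                all.add(new Tagged(e.getKey(), timestampOf(line), line));
            }
        }
        all.sort(Comparator.comparing((Tagged t) -> t.time.toInstant()));
        return all;
    }
}
```

Sorting on the instant rather than on the local clock value is a deliberate choice in this sketch: two servers in different time zones may log the same moment with different wall-clock times.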

An additional tool has been developed for capturing user actions in real time, based on an open source project called "OpenSymphony ClickStream". An extended version of AWLH, called AWLH-Debate, has been developed for recording and structuring data from annotated documents inside discussion forums.

ATWUEDA for Analysing Evolving Web Usage Data

Participants : Yves Lechevallier [correspondent] , Brigitte Trousse, Mohamed Gaieb.

ATWUEDA for Web Usage Evolving Data Analysis [90] was developed by A. Da Silva in her thesis [89] under the supervision of Y. Lechevallier. The tool is written in Java and uses the JRI library to call, from the Java environment, functions of R, a programming language and software environment for statistical computing (http://www.r-project.org/).

ATWUEDA reads data from a cross table in a MySQL database. It splits the data according to the user's specifications (into logical or temporal windows) and then applies the approach proposed in Da Silva's thesis to detect changes in a dynamic environment. The approach characterizes the changes undergone by the usage groups (e.g. appearance, disappearance, fusion and split) at each timestamp. Graphics are generated for each analyzed window, showing statistics that characterize change points over time.
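The window-splitting step can be sketched as follows. This is an illustration under stated assumptions, not ATWUEDA code: a logical window is taken here to mean a fixed number of consecutive records, a temporal window a fixed time span, and the class WindowSplitSketch and its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical sketch: split a record sequence into logical or temporal windows.
final class WindowSplitSketch {
    // Logical windows: consecutive blocks of at most `size` records,
    // in the order in which the records arrive.
    static <T> List<List<T>> logicalWindows(List<T> records, int size) {
        if (size <= 0) {
            throw new IllegalArgumentException("size must be positive");
        }
        List<List<T>> windows = new ArrayList<>();
        for (int i = 0; i < records.size(); i += size) {
            int end = Math.min(i + size, records.size());
            windows.add(new ArrayList<>(records.subList(i, end)));
        }
        return windows;
    }

    // Temporal windows: timestamps (epoch seconds) grouped by the index
    // floor(timestamp / lengthSeconds); keys are sorted window indices.
    static TreeMap<Long, List<Long>> temporalWindows(List<Long> epochSeconds,
                                                     long lengthSeconds) {
        if (lengthSeconds <= 0) {
            throw new IllegalArgumentException("length must be positive");
        }
        TreeMap<Long, List<Long>> windows = new TreeMap<>();
        for (long t : epochSeconds) {
            long index = Math.floorDiv(t, lengthSeconds);
            windows.computeIfAbsent(index, k -> new ArrayList<>()).add(t);
        }
        return windows;
    }
}
```

With temporal windows, periods without any record produce no entry in the map, whereas logical windows always contain data; which behaviour is preferable depends on how regular the usage stream is.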

Version 2.0 is available at INRIA's gforge website: http://gforge.inria.fr/projects/atwueda/ (public access, documentation September 2009).

This year we demonstrated the efficiency of ATWUEDA [51] by applying it to another real case study, on condition monitoring data streams from an electric power plant provided by EDF (cf. section 6.5.1).

ATWUEDA is used by Telecom ParisTech and EDF [51].