

Section: New Results

Mining of Complex Data

Participants : Quentin Brabant, Miguel Couceiro, Adrien Coulet, Esther Catherine Galbrun, Nyoman Juniarta, Florence Le Ber, Joël Legrand, Pierre Monnin, Tatiana Makhalova, Amedeo Napoli, Justine Reynaud, Chedy Raïssi, Mohsen Sayed, Yannick Toussaint.

Keywords:

formal concept analysis, relational concept analysis, pattern structures, pattern mining, association rule, redescription mining, graph mining, sequence mining, biclustering, skyline, aggregation

FCA and Variations: RCA, Pattern Structures and Biclustering

Advances in data and knowledge engineering have emphasized the need for pattern mining tools that work on complex data. In particular, FCA, which usually applies to binary data tables, can be adapted to more complex data. Accordingly, we contributed to two main extensions of FCA, namely Pattern Structures and Relational Concept Analysis. Pattern Structures (PS [77]) allow building a concept lattice from complex data, e.g. numbers, sequences, trees and graphs. Relational Concept Analysis (RCA) is able to analyze objects described by both binary and relational attributes [90] and can play an important role in text classification and text mining.

Many developments were carried out in pattern mining and FCA to improve data mining algorithms and their applicability, and to solve specific problems such as information retrieval, the discovery of functional dependencies, and biclustering. We also worked on a generic framework based on FCA in which the pattern mining process can be defined at a formal level [3]. We consider several types of patterns and are formalizing the mining of complex patterns represented as sequences, trees and graphs.
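As an illustration of the basic FCA setting that these extensions build on, the sketch below enumerates the formal concepts of a toy binary context by closing attribute subsets. The context and all names are hypothetical; this is a didactic sketch, not the team's implementation, and the brute-force enumeration is only suitable for toy contexts.

```python
from itertools import combinations

# Hypothetical toy binary context: objects described by binary attributes.
CONTEXT = {
    "o1": {"a", "b"},
    "o2": {"a", "c"},
    "o3": {"a", "b", "c"},
}
ATTRIBUTES = {"a", "b", "c"}

def extent(attrs):
    """Objects possessing all attributes of `attrs`."""
    return frozenset(o for o, desc in CONTEXT.items() if attrs <= desc)

def intent(objs):
    """Attributes shared by all objects of `objs`."""
    if not objs:
        return frozenset(ATTRIBUTES)
    shared = set(ATTRIBUTES)
    for o in objs:
        shared &= CONTEXT[o]
    return frozenset(shared)

def concepts():
    """Enumerate all formal concepts (extent, intent) by closing
    every attribute subset and deduplicating the closures."""
    seen, result = set(), []
    for r in range(len(ATTRIBUTES) + 1):
        for subset in combinations(sorted(ATTRIBUTES), r):
            e = extent(frozenset(subset))
            c = (e, intent(e))
            if c not in seen:
                seen.add(c)
                result.append(c)
    return result
```

In a pattern-structure setting, the set-intersection in `intent` would be replaced by a similarity operation on the more complex descriptions (intervals, sequences, graphs).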

We also worked on a significant extension of previous work on the discovery of skyline patterns (or “skypatterns”), based on their theoretical relationships with condensed representations of patterns. We have shown how these relationships facilitate the computation of skypatterns. We thus proposed a flexible and efficient approach to mine skypatterns using a dynamic constraint satisfaction problem (CSP) framework [30].
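A skypattern is a pattern whose vector of measure values is not Pareto-dominated by that of any other pattern. The following minimal sketch shows the dominance test on a hypothetical transactional dataset, with frequency and area as example measures; these measures and the naive filtering are for illustration only and do not reflect the CSP encoding of [30].

```python
# Hypothetical toy transactional dataset.
TRANSACTIONS = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"b", "c"}]

def freq(p):
    """Number of transactions containing pattern p."""
    return sum(1 for t in TRANSACTIONS if p <= t)

def area(p):
    """Frequency of p times its length."""
    return freq(p) * len(p)

def dominates(v, w):
    """v Pareto-dominates w: at least as good on every measure,
    strictly better on at least one (all measures maximized)."""
    return all(x >= y for x, y in zip(v, w)) and any(x > y for x, y in zip(v, w))

def skypatterns(patterns, measures):
    """Keep the patterns whose measure vectors lie on the Pareto front."""
    vecs = {p: tuple(m(p) for m in measures) for p in patterns}
    return [p for p in patterns
            if not any(dominates(vecs[q], vecs[p]) for q in patterns if q != p)]
```

For instance, with measures `freq` and `area`, the pattern {a, b, c} (frequency 1, area 3) is dominated by {a, b} (frequency 2, area 4) and is therefore not a skypattern.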

Text Mining

In the context of the PractikPharma ANR Project, we study cross-corpus training with Tree-LSTM for the extraction of biomedical relationships from texts, and in particular how large annotated corpora developed for alternative tasks may improve performance on biomedical tasks for which few annotated resources are available [55]. We experimented with two deep learning models for extracting relationships from biomedical texts with high performance. The first one combines locally extracted features using a Convolutional Neural Network (CNN) model, while the second exploits the syntactic structure of sentences using a Recursive Neural Network (RNN) architecture. Our experiments show that the latter benefits from a cross-corpus learning strategy to improve the performance of relationship extraction tasks. Indeed, our approach leads to state-of-the-art performance on four biomedical tasks for which few annotated resources are available (less than 400 manually annotated sentences). This may have a particular impact in specialized domains in which training resources are scarce, since such domains would benefit from the training data of other domains for which large annotated corpora do exist.

In the framework of the Hybride ANR project (terminated at the end of 2016), Mohsen Sayed Hassan proposed an original machine learning approach for identifying, in texts about diseases, phenotypes that are not yet represented in existing ontologies [9]. The result of the extraction is used to enrich existing ontologies of the considered domain. We studied three research directions: (1) extracting relationships from texts, i.e., Disease-Phenotype (D-P) relationships; (2) identifying new complex entities standing as phenotypes of a rare disease; and (3) enriching an existing rare disease ontology on the basis of the relationships previously extracted.

A collection of abstracts of scientific articles is represented as a collection of dependency graphs, which are used for discovering relevant pieces of biomedical knowledge. We focused on completing rare disease descriptions by extracting Disease-Phenotype relationships. We developed an automatic approach named SPARE⋆ for extracting Disease-Phenotype relationships from PubMed abstracts, where phenotypes and rare diseases are first annotated by a Named Entity Recognizer. SPARE⋆ is a hybrid approach that combines a graph-pattern-based method, called SPARE, with a machine learning method (an SVM). It benefits both from the good precision of SPARE and from the good recall of the SVM. Finally, we applied pattern structures for classifying rare diseases and enriching an existing ontology about such diseases.
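The graph-pattern side of such approaches can be illustrated by extracting the path between two annotated entities in a sentence's dependency graph, which then serves as a candidate relation pattern. Below is a minimal sketch on a hypothetical, hand-built dependency graph; the actual SPARE system works on full parses with learned graph patterns, so this is only a didactic approximation.

```python
from collections import deque

# Hypothetical dependency graph of the sentence
# "DISEASE is characterized by PHENOTYPE", as an undirected adjacency map.
DEP_GRAPH = {
    "DISEASE": ["characterized"],
    "characterized": ["DISEASE", "by", "is"],
    "by": ["characterized", "PHENOTYPE"],
    "PHENOTYPE": ["by"],
    "is": ["characterized"],
}

def shortest_path(graph, src, dst):
    """BFS shortest path between two annotated entity nodes; the
    resulting lexicalized path is a candidate relation pattern."""
    queue = deque([[src]])
    visited = {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in graph[path[-1]]:
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None
```

Paths mined this way from many annotated sentences can be generalized into patterns whose matches in new abstracts suggest D-P relationship candidates.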

Mining Sequences and Trajectories

Datasets are nowadays available in very complex and heterogeneous forms. Mining such data collections is essential to support many real-world applications, ranging from healthcare to marketing. This year we finished a work on the analysis of “complex” sequential data and its usage in video games for analyzing the “balance” of strategies in those games [14].

Redescription Mining

Among the mining methods developed in the team is redescription mining. Redescription mining aims to find distinct common characterizations of the same objects and, vice versa, to identify sets of objects that admit multiple shared descriptions [89]. It is motivated by the idea that, in scientific investigations, data oftentimes come in different forms. For instance, they might originate from distinct sources or be cast over separate terminologies. To gain insight into the phenomenon of interest, a natural task is to identify the correspondences between these different aspects.

A practical example in biology consists in finding geographical areas that admit two characterizations, one in terms of their climatic profile and one in terms of the occupying species. Discovering such redescriptions can contribute to a better understanding of the influence of climate over species distribution. Besides biology, applications of redescription mining can be envisaged in medicine or sociology, among other fields.
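The quality of a redescription is commonly assessed by the Jaccard similarity of the supports of its two queries, one per view. The following minimal sketch illustrates this on a hypothetical pair of views (climate and species) over the same areas; the data and query forms are invented for illustration.

```python
def support(view, predicate):
    """Objects of one view whose description satisfies a query."""
    return {obj for obj, desc in view.items() if predicate(desc)}

def jaccard(s1, s2):
    """Accuracy of a redescription: Jaccard similarity of the two supports."""
    union = s1 | s2
    return len(s1 & s2) / len(union) if union else 1.0

# Two hypothetical views over the same geographic areas.
CLIMATE = {"area1": {"temp": 25}, "area2": {"temp": 24}, "area3": {"temp": 5}}
SPECIES = {"area1": {"lion"}, "area2": {"lion"}, "area3": {"wolf"}}

# Candidate redescription: "temp > 20" over the climate view
# versus "lion present" over the species view.
warm = support(CLIMATE, lambda d: d["temp"] > 20)
with_lions = support(SPECIES, lambda d: "lion" in d)
```

Here the two queries describe exactly the same areas, so the redescription has Jaccard accuracy 1; in practice, mining algorithms search the query spaces of both views for pairs whose supports overlap strongly.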

In a preceding work [83], we focused on the problem of pattern selection, developing a method for filtering a set of redescriptions to identify a non-redundant, interesting subset to present to the analyst. We also showcased the usability of redescription mining with an application in the political domain [76]. More specifically, we applied redescription mining to the exploratory analysis of the profiles and opinions of candidates to the parliamentary elections in Finland in 2011 and 2015.

We presented an introductory tutorial on redescription mining at SDM in April 2017 to help foster the research on these techniques and widen their use (http://siren.mpi-inf.mpg.de/tutorial_sdm2017/main/).

Mining subgroups as a single-player game

Discovering patterns that strongly distinguish one class label from another is a challenging data-mining task. The unsupervised discovery of such patterns would enable the construction of intelligible classifiers and the elicitation of interesting hypotheses from the data. Subgroup Discovery (SD) is one framework that formally defines this pattern mining task. However, SD still faces two major issues: (i) how to define appropriate quality measures characterizing the uniqueness of a pattern; (ii) how to select an accurate heuristic search technique when exhaustive enumeration of the pattern space is infeasible. The first issue has been tackled by the Exceptional Model Mining (EMM) framework. This general framework aims to find patterns covering tuples that locally induce a model substantially different from the model of the whole dataset. The second issue has been studied in SD and EMM mainly with the use of beam-search strategies and genetic algorithms for discovering a pattern set that is non-redundant, diverse and of high quality.

In [1], we argue that the greedy nature of most of these approaches produces pattern sets that lack diversity. Consequently, we proposed to formally define pattern mining as a single-player game, as in a puzzle, and to solve it with Monte Carlo Tree Search (MCTS), a technique mainly used for artificial intelligence and planning problems. The exploitation/exploration trade-off and the power of random search of MCTS lead to an any-time mining approach, in which a solution is always available and which tends towards an exhaustive search if given enough time and memory. Given a reasonable time and memory budget, MCTS quickly drives the search towards a diverse pattern set of high quality. MCTS does not need any knowledge of the pattern quality measure, and we show to what extent it is agnostic to the pattern language.
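For illustration, one standard subgroup quality measure is Weighted Relative Accuracy (WRAcc), which rewards subgroups that are both large and unusually positive. It is given here only as an example of the kind of measure such a search can optimize, not necessarily the exact measure used in [1].

```python
def wracc(labels, covered):
    """Weighted Relative Accuracy of a subgroup: its relative size times
    the difference between its positive rate and the global positive rate.

    labels  -- list of 0/1 class labels, one per object
    covered -- indices of the objects covered by the subgroup description
    """
    n = len(labels)
    pos_rate = sum(labels) / n
    sub = [labels[i] for i in covered]
    if not sub:
        return 0.0
    return (len(sub) / n) * (sum(sub) / len(sub) - pos_rate)
```

In an MCTS-based search, each node of the game tree is a pattern (obtained by refining its parent), a playout refines it randomly, and a measure such as WRAcc provides the reward that is backpropagated.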

Data Privacy: Online link disclosure strategies for social networks

Online social networks are transforming our culture and world. While online social networks have become an important channel for social interactions, they also raise ethical and privacy issues. A well-known fact is that social networks leak information about users that may be sensitive. However, performing accurate real-world online privacy attacks in a reasonable time frame remains a challenging task. We continued our work on this aspect and addressed the problem of rapidly disclosing many friendship links using only legitimate queries (i.e., queries and tools provided by the targeted social network). The results of this joint work with the Pesto Inria team are published in [31].

Aggregation

Aggregation and consensus theory study processes for merging or fusing several objects, e.g., numerical or qualitative data, preferences or other relational structures, into one or several objects of similar type that best represent them in some way. Such processes are modeled by so-called aggregation or consensus functions [79], [82]. The need to aggregate objects in a meaningful way appeared naturally in classical topics such as mathematics, statistics, physics and computer science, but it has become increasingly prominent in applied areas such as social and decision sciences, artificial intelligence and machine learning, biology and medicine.

We are working on the theoretical basis of a unified theory of consensus and on setting up a general machinery for the choice and use of aggregation functions. This choice depends on properties specified by users or decision makers, on the nature of the objects to aggregate, as well as on computational limitations due to prohibitive algorithmic complexity. This problem demands an exhaustive study of aggregation functions that requires an axiomatic treatment and classification of aggregation procedures as well as a deep understanding of their structural behavior. It also requires a representation formalism for knowledge, in our case decision rules, and methods for discovering them. Typical approaches include rough-set and FCA approaches, which we aim to extend in order to increase the expressivity, applicability and readability of results. Applications of these efforts have already appeared, and further ones are expected, in the context of three multidisciplinary projects, namely the “Fight Heart Failure” project (research project with the Faculty of Medicine in Nancy), the European H2020 “CrossCult” project, and the “ISIPA” (Interpolation, Sugeno Integral, Proportional Analogy) project.

In our recent work, we mainly focused on the utility-based preference model in which preferences are represented as an aggregation of preferences over different attributes, structured or not, both in the numerical and qualitative settings. In the latter case, the Sugeno integral is widely used in multiple criteria decision making and decision under uncertainty for computing global evaluations of items based on local evaluations (utilities). The combination of a Sugeno integral with local utilities is called a Sugeno utility functional (SUF). A noteworthy property of SUFs is that they represent multi-threshold decision rules. However, not all sets of multi-threshold rules can be represented by a single SUF. We showed how to represent any set of multi-threshold rules as a combination of SUFs and studied their potential advantages as a compact representation of large sets of rules, as well as an intermediary step for extracting rules from empirical datasets [38], [59]. Problems related to feature selection and model elicitation were tackled in [15].
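A Sugeno integral computes a global qualitative evaluation from local utilities and a capacity (a monotone set function over the criteria). The following minimal sketch uses the standard formula S(x) = max_i min(x_(i), mu(A_(i))), where values are sorted increasingly and A_(i) is the set of criteria whose value is at least x_(i); the criteria, utilities and capacity below are hypothetical toy values.

```python
def sugeno(utilities, capacity):
    """Sugeno integral of local utilities w.r.t. a capacity.

    utilities -- dict mapping each criterion to a local evaluation in [0, 1]
    capacity  -- dict mapping frozensets of criteria to weights in [0, 1],
                 assumed monotone with capacity of the full set equal to 1
    """
    # Sort criteria by increasing utility value.
    crits = sorted(utilities, key=lambda c: utilities[c])
    best = 0.0
    for idx, c in enumerate(crits):
        # A_(i): criteria whose utility is at least utilities[c].
        tail = frozenset(crits[idx:])
        best = max(best, min(utilities[c], capacity[tail]))
    return best
```

For example, with two criteria p and q, utilities p = 0.4 and q = 0.8, and capacity values mu({p, q}) = 1.0 and mu({q}) = 0.6, the integral is max(min(0.4, 1.0), min(0.8, 0.6)) = 0.6: the global evaluation is capped both by the local utilities and by the importance of the criteria supporting them, which is what makes SUFs equivalent to sets of multi-threshold rules.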