Section: New Results

The Mining of Complex Data

Participants : Quentin Brabant, Miguel Couceiro, Adrien Coulet, Esther Galbrun, Nicolas Jay, Nyoman Juniarta, Florence Le Ber, Joël Legrand, Pierre Monnin, Amedeo Napoli, Justine Reynaud, Chedy Raïssi, Mohsen Sayed, My Thao Tang, Yannick Toussaint.


Keywords: formal concept analysis, relational concept analysis, pattern structures, pattern mining, association rule, redescription mining, graph mining, sequence mining, biclustering, aggregation

Pattern mining and Formal Concept Analysis (FCA) are suitable symbolic methods for KDDK that can be applied to real-sized data. We are improving these methods along several axes: scope of applicability, ease of use, efficiency, and ability to handle evolving situations. Accordingly, the team is extending these symbolic data mining methods to work on complex data (e.g. textual documents, biological, chemical or medical data), involving objects with multi-valued attributes (e.g. domains or intervals), n-ary relations, sequences, trees and graphs.

FCA and Variations: RCA, Pattern Structures and Biclustering

Advances in data and knowledge engineering have emphasized the need for pattern mining tools that work on complex data. In particular, FCA, which usually applies to binary data tables, can be adapted to more complex data. Along these lines, we have contributed to two main extensions of FCA, namely Pattern Structures and Relational Concept Analysis. Pattern Structures (PS) [79] allow building a concept lattice from complex data, e.g. numbers, sequences, trees and graphs. Relational Concept Analysis (RCA) can analyze objects described by both binary and relational attributes [90] and can play an important role in text classification and text mining.
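As a toy illustration of plain FCA, before its PS and RCA extensions, the following sketch enumerates the formal concepts of a small binary context by closing every attribute subset. The context and attribute names are made up for the example.

```python
from itertools import combinations

# Hypothetical binary context: each object mapped to the set of attributes it has.
context = {
    "g1": {"a", "b"},
    "g2": {"a", "c"},
    "g3": {"a", "b", "c"},
}
attributes = {"a", "b", "c"}

def extent(attrs):
    """Objects possessing every attribute in `attrs`."""
    return frozenset(g for g, s in context.items() if attrs <= s)

def intent(objs):
    """Attributes shared by every object in `objs` (all attributes if `objs` is empty)."""
    common = set(attributes)
    for g in objs:
        common &= context[g]
    return frozenset(common)

def concepts():
    """All formal concepts (extent, intent): closing each attribute subset reaches every one."""
    found = set()
    for r in range(len(attributes) + 1):
        for combo in combinations(sorted(attributes), r):
            e = extent(set(combo))
            found.add((e, intent(e)))
    return found

for e, i in sorted(concepts(), key=lambda c: (-len(c[0]), sorted(c[1]))):
    print(sorted(e), "<->", sorted(i))
```

Ordering the resulting concepts by extent inclusion yields the concept lattice; PS replaces the attribute sets by descriptions equipped with a similarity operation.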

Many developments were carried out in pattern mining and FCA to improve data mining algorithms and their applicability, and to solve specific problems such as information retrieval, discovery of functional dependencies and biclustering. We designed new information retrieval methods based on FCA [72], methods for text classification based on heterogeneous pattern structures [71], pattern structures for structured attribute sets [67], and a quasi-polynomial algorithm for mining top patterns w.r.t. measures satisfying special properties in an FCA framework [70]. We also developed a line of work on pattern structures for the discovery of functional dependencies [33], as well as on fuzzy FCA [31]. Finally, we proposed new visualization techniques and tools able to display important and useful information (e.g. stable concepts) from large concept lattices [28].

Text Mining

The thesis work of My Thao Tang [11] proposes a process in which software agents and human agents cooperate in knowledge discovery from different types of textual sources for extending a knowledge base. The challenge is twofold: on the one hand, knowledge discovery methods (software agents) should be able to run in accordance with background knowledge (or expert knowledge) at any step of the KDD process; on the other hand, human agents should be able to correct or extend the current knowledge base. FCA is used for discovering, within textual resources, a “class schema” (or “representation model”), which can be either a set of attribute implications or a concept lattice. However, such a schema does not necessarily fit the point of view of a domain expert, for various reasons, e.g. noise, errors or exceptions in the data. Thus, a bridge filling the possible gap between the representation model based on a concept lattice and the representation model of a domain expert is studied in [44]. The background knowledge is encoded as a set of attribute dependencies or constraints, which is “aligned” with the set of implications associated with the concept lattice. Such an alignment may lead to modifications of the original concept lattice. The method can be generalized to generate lattices satisfying constraints based on attribute dependencies, using so-called “extensional projections”. It also allows experts to keep a trace of the changes between the original lattice and the revised version, and to assess how concepts in practice are related to concepts discovered in the data.
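The alignment step relies on checking whether attribute implications hold in the data: an implication A → B holds in a context exactly when every object having all attributes of A also has all attributes of B. A minimal sketch, over a made-up toy context (the document names and attributes are hypothetical):

```python
# Hypothetical context: each object (e.g. a text unit) with its extracted attributes.
context = {
    "d1": {"disease", "symptom"},
    "d2": {"disease", "symptom", "treatment"},
    "d3": {"treatment"},
}

def holds(premise, conclusion):
    """An implication A -> B holds iff every object having all of A also has all of B."""
    return all(conclusion <= attrs
               for attrs in context.values()
               if premise <= attrs)

print(holds({"disease"}, {"symptom"}))    # True: d1 and d2 both carry "symptom"
print(holds({"treatment"}, {"disease"}))  # False: d3 has "treatment" but not "disease"
```

An expert constraint that fails this test signals a gap between the discovered lattice and the expert's model, which is where the projections of [44] come into play.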

In the framework of the Hybride ANR project, Mohsen Sayed proposes an original machine learning approach for identifying, in the literature, disease phenotypes that are not yet represented within existing ontologies. The process is based on graph patterns extracted from sentences represented as dependency graphs. Phenotypes are usually expressed by complex noun phrases, which traditional gazetteers recognize only partially. The strength of graph patterns is to preserve the bounds between linguistic components and to enable the identification of the complete phenotype formulation. A publication on this work is currently in preparation.

Mining Sequences and Trajectories

Data sets are nowadays available in increasingly complex and heterogeneous forms, and mining such data collections is essential to support many real-world applications, ranging from healthcare to marketing. This year, we completed a research work on the analysis of “complex sequential data” by means of interesting sequential patterns [13]. We approach the problem using FCA and pattern structures, where the subsumption relation ordering the patterns is defined w.r.t. the partial order on sequences.
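The partial order in question can be illustrated with the usual subsequence relation: a pattern is subsumed by a sequence when its items occur in the sequence in the same order, not necessarily contiguously. A small sketch over a made-up database (itemset-valued sequences, as in [13], would refine the comparison per position):

```python
def is_subsequence(pattern, sequence):
    """pattern ⊑ sequence: items of `pattern` appear in `sequence` in order,
    not necessarily contiguously."""
    it = iter(sequence)
    # `item in it` advances the iterator, so order is enforced.
    return all(item in it for item in pattern)

def support(pattern, database):
    """Number of database sequences subsuming the pattern."""
    return sum(is_subsequence(pattern, s) for s in database)

db = [["a", "b", "c"], ["a", "c"], ["b", "c", "a"]]
print(is_subsequence(["a", "c"], ["a", "b", "c", "d"]))  # True
print(is_subsequence(["c", "a"], ["a", "b", "c", "d"]))  # False: order violated
print(support(["a", "c"], db))                           # 2
```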

Redescription Mining

Among the mining methods developed in the team is redescription mining. Redescription mining aims to find distinct common characterizations of the same objects and, vice versa, to identify sets of objects that admit multiple shared descriptions [89]. It is motivated by the fact that, in scientific investigations, data oftentimes come in different forms: they might originate from distinct sources or be cast over separate terminologies. In order to gain insight into the phenomenon of interest, a natural task is to identify the correspondences that exist between these different aspects.

A practical example in biology consists in finding geographical areas that admit two characterizations, one in terms of their climatic profile and one in terms of the species that occupy them. Discovering such redescriptions can contribute to a better understanding of the influence of climate on species distribution. Besides biology, applications of redescription mining can be envisaged in medicine or sociology, among other fields.
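A candidate redescription is commonly scored by the Jaccard similarity between the supports of its two queries, one per view. A minimal sketch on made-up site data (all names are invented for illustration):

```python
# Hypothetical data: geographic sites described in two separate views.
climate = {  # view 1: climatic attributes per site
    "s1": {"warm", "humid"},
    "s2": {"warm", "dry"},
    "s3": {"cold", "humid"},
    "s4": {"warm", "humid"},
}
species = {  # view 2: species observed per site
    "s1": {"oak"},
    "s2": {"pine"},
    "s3": {"spruce"},
    "s4": {"oak", "pine"},
}

def support(view, query):
    """Sites satisfying a conjunctive query (a set of required attributes)."""
    return {site for site, attrs in view.items() if query <= attrs}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 1.0

# Candidate redescription: "warm AND humid" (climate) vs "oak" (species).
left = support(climate, {"warm", "humid"})
right = support(species, {"oak"})
print(left, right, jaccard(left, right))  # identical supports -> similarity 1.0
```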

In recent work [40], we focused on the problem of pattern selection, developing a method for filtering a set of redescriptions in order to identify a non-redundant, interesting subset to present to the analyst. We also showcased the usability of redescription mining in an application to the political domain [50]. More specifically, we applied redescription mining to the exploratory analysis of the profiles and opinions of candidates to the parliamentary elections in Finland in 2011 and 2015.

We presented an introductory tutorial on redescription mining at ECML-PKDD in September 2016 to help foster the research on these techniques and widen their use (http://siren.mpi-inf.mpg.de/tutorial/main/).

E-sports analytics and subgroup discovery based on a single-player game

Discovering patterns that strongly distinguish one class label from another is a challenging data mining task. The unsupervised discovery of such patterns would enable the construction of intelligible classifiers and the elicitation of interesting hypotheses from the data. Subgroup Discovery (SD) [87] is one framework that formally defines this pattern mining task. However, SD still faces two major issues: (i) how to define appropriate quality measures characterizing the singularity of a pattern; (ii) how to select an accurate heuristic search technique when an exhaustive enumeration of the pattern space is unfeasible. The first issue has been tackled by the Exceptional Model Mining (EMM) framework [77], which aims to find patterns covering tuples that locally induce a model substantially different from the model of the whole dataset. The second issue has been studied in SD and EMM mainly through beam-search strategies and genetic algorithms aiming at a pattern set that is non-redundant, diverse and of high quality. In [58], we argue that the greedy nature of most of these approaches produces pattern sets lacking diversity. Consequently, we proposed to formally define pattern mining as a single-player game, as in a puzzle, and to solve it with Monte Carlo Tree Search (MCTS), a recent technique mainly used in artificial intelligence and planning problems. The exploitation/exploration trade-off and the power of random search of MCTS lead to an any-time mining approach, in which a solution is always available and which tends towards an exhaustive search if given enough time and memory. Given a reasonable time and memory budget, MCTS quickly drives the search towards a diverse pattern set of high quality. MCTS does not need any knowledge of the pattern quality measure, and we show to what extent it is agnostic to the pattern language.
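To make the notion of SD quality measure concrete, the sketch below computes Weighted Relative Accuracy (WRAcc), a classical measure from the SD literature: coverage times the lift in target precision over the base rate. The toy data and the particular choice of WRAcc are illustrative; [58]'s contribution is the MCTS search, which treats any such measure as a black box.

```python
def wracc(subgroup, dataset, target):
    """Weighted Relative Accuracy: coverage * (precision within subgroup - base rate).
    `subgroup` and `target` are predicates over rows."""
    n = len(dataset)
    covered = [row for row in dataset if subgroup(row)]
    if not covered:
        return 0.0
    coverage = len(covered) / n
    precision = sum(target(r) for r in covered) / len(covered)
    base_rate = sum(target(r) for r in dataset) / n
    return coverage * (precision - base_rate)

# Made-up dataset: does the pattern "age < 40" distinguish the class "buys"?
data = [
    {"age": 25, "buys": True},
    {"age": 30, "buys": True},
    {"age": 45, "buys": False},
    {"age": 50, "buys": False},
    {"age": 28, "buys": True},
    {"age": 60, "buys": False},
]
quality = wracc(lambda r: r["age"] < 40, data, lambda r: r["buys"])
print(quality)  # 0.5 coverage * (1.0 precision - 0.5 base rate) = 0.25
```

In the single-player-game view, each MCTS playout specializes a pattern step by step and uses such a score as the game's reward.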

Data Privacy: Online link disclosure strategies for social networks

Online social networks are transforming our culture and world. While they have become an important channel for social interactions, they also raise ethical and privacy issues. It is well known that social networks leak information about their users, and this information may be sensitive. However, performing accurate real-world online privacy attacks in a reasonable time frame remains a challenging task. In [57], [26] (work done in cooperation with the Pesto Inria team), we address the problem of rapidly disclosing many friendship links using only legitimate queries (i.e., queries and tools provided by the targeted social network). Our study sheds new light on the intrinsic relation between communities (usually represented as groups) and friendships between individuals. To develop an efficient attack, we analyzed group distributions, densities and visibility parameters from a large sample of a social network. By effectively exploring the target group network, our algorithm performs friendship and mutual-friend attacks following a strategy that minimizes the number of queries. Attacks performed on the profiles of a major social network show that, in the best cases, 5 different friendship links are disclosed on average for each legitimate query.
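The core intuition, that public group co-membership predicts hidden friendship links and can therefore prioritize queries, can be sketched on a mock network. Everything below (names, the ranking heuristic, the query oracle) is a hypothetical illustration, not the actual attack algorithm of [57], [26]:

```python
from collections import Counter
from itertools import combinations

# Mock network: group memberships are public, friendships are hidden and
# must each be probed with one "legitimate query".
groups = {
    "chess":  {"alice", "bob", "carol"},
    "hiking": {"alice", "bob"},
    "music":  {"carol", "dave"},
}
friends = {frozenset({"alice", "bob"}), frozenset({"carol", "dave"})}  # hidden ground truth

def query_friendship(u, v):
    """Stands in for one legitimate API query to the target network."""
    return frozenset({u, v}) in friends

def ranked_pairs(groups):
    """Candidate pairs ordered by number of shared groups, most shared first."""
    count = Counter()
    for members in groups.values():
        for pair in combinations(sorted(members), 2):
            count[frozenset(pair)] += 1
    return [pair for pair, _ in count.most_common()]

queries, disclosed = 0, []
for pair in ranked_pairs(groups):
    queries += 1
    if query_friendship(*pair):
        disclosed.append(pair)
print(f"{len(disclosed)} links disclosed with {queries} queries")
```

Ranking by co-membership makes the real friendship pair the very first query, which is the query-minimization idea in miniature.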


Aggregation

Aggregation theory studies the processes dealing with merging or fusing several objects, e.g., numerical or qualitative data, preferences or other relational structures, into one or several objects of similar type that best represent them in some way. Such processes are modeled by so-called aggregation or consensus functions [82]. The need to aggregate objects in a meaningful way appeared naturally in classical topics such as mathematics, statistics, physics and computer science, but it has become increasingly pressing in applied areas such as social and decision sciences, artificial intelligence and machine learning, biology and medicine.

We are working towards a unified theory of consensus and towards setting up a general machinery for the choice and use of aggregation functions. This choice depends on the properties specified by users or decision makers, on the nature of the objects to aggregate, and on computational limitations due to prohibitive algorithmic complexity. The problem demands an exhaustive study of aggregation functions, requiring an axiomatic treatment and classification of aggregation procedures as well as a deep understanding of their structural behavior. It also requires a representation formalism for knowledge, in our case decision rules, as well as methods for extracting them. Typical approaches include rough-set and FCA approaches, which we aim to extend in order to increase the expressivity, applicability and readability of the results. Direct applications of these efforts are expected in the context of two multidisciplinary projects, namely the “Fight Heart Failure” project and the European H2020 “CrossCult” project.

In our recent work, we mainly focused on the utility-based preference model in which preferences are represented as an aggregation of preferences over different attributes, structured or not, both in the numerical and qualitative settings. In the latter case, we provided axiomatizations of noteworthy classes of lattice-based aggregation functions, which were then used to model preferences and to provide their logical description [14]. In this qualitative setting, we also tackled the problem of computing version spaces (with explicit descriptions of all models compatible with a given dataset) and proved a dichotomy theorem showing that the problem is NP-complete for preferences over at least 4 attributes whereas it is solvable in polynomial time otherwise [61].
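A classical representative of qualitative, lattice-based aggregation is the Sugeno integral over a finite scale, shown below as a minimal sketch. The capacity and scores are made up for illustration, and this is a textbook example rather than the exact classes axiomatized in [14]:

```python
# Sugeno integral on the finite chain {0,...,3}:
# max over thresholds t of min(t, capacity({criteria scoring >= t})).
def sugeno(values, capacity):
    """values: dict criterion -> level; capacity: monotone dict frozenset -> level."""
    result = 0
    for t in sorted(set(values.values()), reverse=True):
        above = frozenset(c for c, v in values.items() if v >= t)
        result = max(result, min(t, capacity.get(above, 0)))
    return result

capacity = {  # made-up monotone set function on criteria {a, b, c}
    frozenset(): 0,
    frozenset("a"): 1, frozenset("b"): 1, frozenset("c"): 0,
    frozenset("ab"): 2, frozenset("ac"): 2, frozenset("bc"): 1,
    frozenset("abc"): 3,
}
print(sugeno({"a": 3, "b": 1, "c": 2}, capacity))  # 2
```

Weighted minima and maxima arise as special cases of the capacity, which is one reason this family is central to qualitative preference modeling.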

Finding consensual structures among different classifications or metrics is again a challenging task, especially for large and multi-source data. Its importance stems from the fact that algorithmic approaches on such datasets are often heuristic and rarely produce the same output. The difficulty in extracting such consensual structures is then to find appropriate and meaningful aggregation rules, whose impossibility is often revealed by Arrow-type impossibility results. This year, we focused on median structures [19], [21], which encompass several relational structures (trees, graphs, lattices) and allow several consensus procedures.
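The simplest median structure is the Boolean lattice, where the consensus of binary classifications is the coordinatewise majority vote; for an odd number of inputs this minimizes the total Hamming distance to them. A minimal sketch with made-up classifications:

```python
# Majority-rule consensus of binary vectors: the coordinatewise median,
# i.e. the median operation of the Boolean lattice (a median structure).
def median_consensus(vectors):
    n = len(vectors)
    return tuple(int(sum(col) * 2 > n) for col in zip(*vectors))

# Three hypothetical binary classifications of four items.
classifications = [
    (1, 0, 1, 1),
    (1, 1, 0, 1),
    (0, 0, 1, 1),
]
print(median_consensus(classifications))  # (1, 0, 1, 1)
```

Medians in trees and other median graphs generalize this coordinatewise construction, which is what makes the structures of [19], [21] amenable to several consensus procedures.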