ORPAILLEUR - 2016 - Annual Activity Report

Project-Team Orpailleur

Section: Partnerships and Cooperations

National Initiatives

ANR

Hybride (2011–2016)

Participants : Adrien Coulet, Amedeo Napoli, Chedy Raïssi, My Thao Tang, Mohsen Sayed, Yannick Toussaint.

The Hybride research project (http://hybride.loria.fr/) aims at combining Natural Language Processing (NLP) and Knowledge Discovery in Databases (KDD) for text mining. A key idea is to design an interactive and convergent process in which NLP methods guide text mining while KDD methods guide the analysis of textual documents. NLP methods are mainly based on text analysis and on the extraction of general and temporal information. KDD methods are based on pattern and sequence mining, formal concept analysis, and graph mining. In this way, NLP methods applied to texts extract “textual information” that KDD methods can use as constraints for focusing the mining of textual data. Conversely, KDD methods extract patterns and sequences that are used for guiding information extraction from texts and text analysis. The experimental and validation parts of the Hybride project are provided by an application to the documentation of rare diseases in the context of Orphanet.

The partners of the Hybride consortium are the GREYC Caen laboratory (pattern mining, NLP, text mining), the MoDyCo Paris laboratory (NLP, linguistics), the INSERM Paris laboratory (Orphanet, ontology design), and the Orpailleur team at Inria Nancy Grand Est (FCA, knowledge representation, pattern mining, text mining). The Hybride project ended on 30 November 2016.

ISTEX (2014–2016)

Participant : Yannick Toussaint.

ISTEX is a so-called “Initiative d'excellence” managed by CNRS and DIST (“Direction de l'Information Scientifique et Technique”). ISTEX aims at giving the research and teaching community online access to scientific publications in all domains (http://www.istex.fr/istex-excellence-initiative-of-scientific-and-technical-information/). ISTEX therefore requires a massive acquisition of documents such as journals, proceedings, corpora, and databases. ISTEX-R is one research project within ISTEX in which the Orpailleur team is involved, together with two other partners, namely the ATILF laboratory and the INIST Institute (both located in Nancy). ISTEX-R aims at developing new tools for querying full-text documentation, analyzing content, and extracting information. A platform is under development to provide robust NLP tools for text processing, as well as methods for text mining and domain conceptualization.

PractiKPharma (2016–2020)

Participants : Adrien Coulet, Joël Legrand, Pierre Monnin, Amedeo Napoli, Malika Smaïl-Tabbone, Yannick Toussaint.

The ANR project PractiKPharma (http://practikpharma.loria.fr/) addresses the validation of domain knowledge in pharmacogenomics. The originality of PractiKPharma is to use “Electronic Health Records” (EHRs) to constitute cohorts of patients, which are then mined to validate extracted pharmacogenomics knowledge units, after these units have been validated w.r.t. literature knowledge. This project involves two other labs, namely LIRMM in Montpellier and CRC Paris.

Termith (2014–2016)

Participant : Yannick Toussaint.

Termith (http://www.atilf.fr/ressources/termith/) is an ANR project involving a series of laboratories, namely ATILF, INIST, Inria Nancy Grand Est, Inria Saclay, LIDILEM, and LINA. It aims at indexing documents belonging to different domains of the Humanities. Thus, the project focuses on extracting candidate terms (information extraction) and on disambiguation.

In the Orpailleur team, we are mainly concerned with information extraction using Formal Concept Analysis techniques, but also with pattern and sequence mining. The objective is to define contexts introducing terms, i.e. to find textual environments allowing a system to decide whether a textual element is actually a candidate term and to identify its corresponding environment. This disambiguation process was described and published at LREC 2016 [35]. The Termith project ended in April 2016.

FUI POQEMON (2014–2016)

Participants : Chedy Raïssi, Mickaël Zehren.

The publication of transaction data, such as market basket data, medical records, and query logs, serves the public benefit. Mining such data allows the derivation of association rules that connect certain items to others with measurable confidence. Still, this type of data analysis poses a privacy threat: an adversary having partial information on a person’s behavior may confidently associate that person with an item deemed to be sensitive. Ideally, an anonymization of such data should lead to an inference-proof version that prevents the association of individuals with sensitive items, while otherwise allowing truthful associations to be derived.

The POQEMON project aims at developing new pattern mining methods and tools for supporting privacy-preserving knowledge discovery for monitoring purposes on mobile phone networks. The main idea is to develop sound approaches that handle the tradeoff between the privacy of data and the power of analysis.

Original approaches to this problem were based on value perturbation, which damages data integrity. More recently, value generalization has been proposed as an alternative. Still, approaches based on it have assumed either that all items are equally sensitive, or that some are sensitive and can be known to an adversary only by association, while others are non-sensitive and can be known directly. In reality, however, there is a distinction between sensitive and non-sensitive items, and an adversary may possess information on any of them. Most critically, no previous method aims at a clear inference-proof privacy guarantee. In this project, we integrated the $\rho$-uncertainty privacy concept, which inherently safeguards against sensitive associations without constraining the nature of an adversary’s knowledge and without falsifying data. The project integrates the $\rho$-uncertainty pattern mining approach with novel data visualization techniques.

The POQEMON research project (https://members.loria.fr/poqemon/) involves the following partners: Altran, DataPublica, GenyMobile, HEC, IP-Label, Next Interactive Media, Orange, Université Paris-Est Créteil, and Inria Nancy Grand Est.
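To make the notions of rule confidence and inference-proof publication more concrete, the following is a minimal Python sketch, not the POQEMON implementation. It assumes that $\rho$-uncertainty can be read as: no rule X -> s, where s is a sensitive item not in X and X is any subset of a published transaction, has confidence above a threshold $\rho$. The toy transactions, the item name `med_x`, the helper names, and the strict comparison at the threshold are all assumptions made for illustration.

```python
from itertools import combinations


def support(transactions, itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def confidence(transactions, antecedent, consequent):
    """conf(A -> C) = supp(A union C) / supp(A); 0.0 if A never occurs."""
    denom = support(transactions, antecedent)
    if denom == 0:
        return 0.0
    return support(transactions, antecedent | consequent) / denom


def violations(transactions, sensitive, rho):
    """Rules X -> s (s sensitive, s not in X) whose confidence exceeds rho.

    Brute force over every non-empty subset X of every transaction,
    so this is only suitable for toy data.
    """
    found = set()
    for t in transactions:
        for size in range(1, len(t) + 1):
            for subset in combinations(sorted(t), size):
                x = frozenset(subset)
                for s in sensitive - x:
                    if confidence(transactions, x, frozenset([s])) > rho:
                        found.add((x, s))
    return found


# Hypothetical toy data: "med_x" plays the role of a sensitive item.
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "milk", "med_x"}),
    frozenset({"bread", "beer"}),
    frozenset({"milk", "beer", "med_x"}),
]
sensitive = frozenset({"med_x"})

for x, s in sorted(violations(transactions, sensitive, 0.7), key=str):
    print(sorted(x), "->", s)
```

On this toy data, an adversary who knows that a person bought both milk and beer can infer `med_x` with full confidence, so the dataset would need further anonymization before it satisfies the threshold of 0.7 under the assumed reading.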

CNRS PEPS and Mastodons projects

Mastodons HyQual (2016–2018)

Participants : Miguel Couceiro, Esther Galbrun, Dhouha Grissa, Amedeo Napoli, Chedy Raïssi, Justine Reynaud.

The HyQual project was proposed and initiated this year in the framework of the Mastodons CNRS Call about data quality in data mining (see http://www.cnrs.fr/mi/spip.php?article819&lang=fr). This project is interested in the mining of nutritional data for discovering predictive biomarkers of diabetes and metabolic syndrome in elderly populations. The data mining methods considered here are hybrid, combining symbolic and numerical methods, and are applied to complex and noisy metabolic data [39]. In the HyQual project, we are mainly interested in the quality of the data at hand and of the patterns that can be discovered. In particular, we check whether we can find possible definitions within the data (actually double implications) and redescriptions (in the form of different descriptions of the same data). In this way, we can study both the definitional power of the data and the incompleteness of the data, leading to two original ways of considering data quality. The project involves researchers from the Orpailleur team, together with researchers from LIRIS Lyon, ICube Strasbourg, and INRA Clermont-Ferrand.
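As an illustration of what a definition (double implication) and a redescription can look like in a binary data table, here is a minimal Python sketch. It is not the HyQual method: it assumes that two attributes define each other when they hold for exactly the same non-empty set of objects, and it uses the Jaccard index of the two object sets as a score for approximate redescriptions, a common choice in redescription mining. The sample identifiers and attribute names are hypothetical.

```python
from itertools import combinations


def extent(table, attributes):
    """Objects (row keys) that have every attribute in `attributes`."""
    return frozenset(obj for obj, attrs in table.items() if attributes <= attrs)


def jaccard(ext_a, ext_b):
    """Jaccard index of two object sets; 0.0 when both are empty."""
    union = ext_a | ext_b
    return len(ext_a & ext_b) / len(union) if union else 0.0


def double_implications(table):
    """Pairs (a, b) of attributes with the same non-empty extent, i.e. a <-> b."""
    attrs = sorted(set().union(*table.values()))
    pairs = []
    for a, b in combinations(attrs, 2):
        ext_a, ext_b = extent(table, {a}), extent(table, {b})
        if ext_a and ext_a == ext_b:
            pairs.append((a, b))
    return pairs


# Hypothetical toy table: samples described by binarized metabolic markers.
table = {
    "s1": {"high_glucose", "high_insulin"},
    "s2": {"high_glucose", "high_insulin", "low_hdl"},
    "s3": {"low_hdl"},
    "s4": set(),
}

print(double_implications(table))
print(jaccard(extent(table, {"high_glucose"}), extent(table, {"low_hdl"})))
```

A pair returned by `double_implications` is an exact definition within this table, while a Jaccard score strictly between 0 and 1 indicates two descriptions that only partly cover the same samples, which may point to noise or incompleteness in the data.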

PEPS Confocal (2015–2016)

Participants : Adrien Coulet, Amedeo Napoli, Chedy Raïssi, Malika Smaïl-Tabbone.

The Confocal Project (see http://www.cnrs.fr/ins2i/spip.php?article1183) is interested in the design of new methods in bioinformatics for analyzing and classifying heterogeneous omics data w.r.t. biological domain knowledge. We are working on the adaptation of FCA and pattern structures for discovering patterns and associations in gene data with the help of domain ontologies. One important objective of the project is to check whether such a line of research could be reused on so-called “discrete models in molecular biology”.

PEPS Prefute (2015–2016)

Participants : Quentin Brabant, Adrien Coulet, Miguel Couceiro, Esther Galbrun, Amedeo Napoli, Chedy Raïssi, Justine Reynaud, Mohsen Sayed, Malika Smaïl-Tabbone, My Thao Tang, Yannick Toussaint.

The PEPS Prefute project is mainly interested in interaction and iteration in the knowledge discovery (KD) process. Usually the KD process is organized around three main steps, which are (i) selection and preparation of the data, (ii) data mining, and (iii) interpretation of (selected) resulting patterns. An analyst, most of the time an expert of the data domain, leads the KD process. Accordingly, the PEPS Prefute project studies the interactions between the analyst and the KD process, i.e. pushing constraints, preferences, and domain knowledge for guiding and improving the KD process. One possible way is to discover initial patterns acting as seeds, and then to search the pattern space further w.r.t. these initial seeds, which may be linked to the preferences of the analyst. In this way, the space of interesting patterns is much more concise and much smaller.

Accordingly, the role of preferences and domain knowledge in interaction with KD, as well as visualization tools, has to be improved to allow working with large and complex datasets (see https://www.greyc.fr/fr/node/2207).
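The seed-based exploration described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the approach developed in the project: patterns are taken to be itemsets, the analyst's preference is modeled as a seed itemset that every returned pattern must contain, and the search is a brute-force enumeration. The function name, the toy transactions, and the support threshold are hypothetical.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support, seed=frozenset()):
    """Itemsets containing `seed` that occur in at least `min_support` transactions.

    Brute-force enumeration, suitable for toy data only.
    """
    # Only transactions containing the seed can support a superset of the seed.
    covered = [t for t in transactions if seed <= t]
    candidates = sorted(set().union(*covered) - seed)
    result = {}
    for size in range(len(candidates) + 1):
        for extra in combinations(candidates, size):
            itemset = seed | frozenset(extra)
            if not itemset:
                continue  # skip the empty itemset when no seed is given
            count = sum(1 for t in covered if itemset <= t)
            if count >= min_support:
                result[itemset] = count
    return result


# Hypothetical toy transactions.
transactions = [
    frozenset({"a", "b", "c"}),
    frozenset({"a", "b"}),
    frozenset({"a", "c"}),
    frozenset({"b", "c"}),
    frozenset({"a", "b", "c", "d"}),
]

unconstrained = frequent_itemsets(transactions, min_support=2)
seeded = frequent_itemsets(transactions, min_support=2, seed=frozenset({"a", "b"}))
print(len(unconstrained), len(seeded))
```

Restricting the search to supersets of the seed shrinks both the set of candidate patterns and the set of results the analyst has to inspect, which is the effect the seed-based strategy aims for.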
