Section: New Results

Knowledge Discovery in Healthcare and Life Sciences

Participants : Miguel Couceiro, Adrien Coulet, Kévin Dalleau, Joël Legrand, Pierre Monnin, Amedeo Napoli, Chedy Raïssi, Mohsen Sayed, Malika Smaïl-Tabbone, Yannick Toussaint.

Life Sciences constitute a challenging domain for KDDK. Biological data are complex from many points of view, e.g. voluminous, high-dimensional and deeply interconnected. Analyzing such data is a crucial issue in healthcare, environment and agronomy. Besides, many bio-ontologies are available and can be used to enhance the knowledge discovery process. Accordingly, the research work of the Orpailleur team in KDDK applied to Life Sciences is concerned with mining biological data, data integration, information retrieval, and the use of bio-ontologies and linked data for resource annotation.

Ontology-based Clustering of Biological Linked Open Data

Increasing amounts of biomedical data provided as Linked Open Data (LOD) offer novel opportunities for knowledge discovery in biomedicine. We proposed an approach for selecting, integrating, and mining LOD with the goal of discovering genes responsible for a disease [88]. We are currently working on the integration of LOD about known phenotypes and genes responsible for diseases, along with relevant bio-ontologies. We are also defining a corpus-based semantic distance. One possible application of this work is to build and compare possible diseaseomes, i.e. global graphs in which all diseases are connected according to their pairwise similarity values.
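The notion of a diseaseome can be illustrated with a minimal sketch. It assumes that each disease is described by a set of associated genes and uses the Jaccard index as a stand-in similarity; the corpus-based semantic distance mentioned above is not reproduced here. The function names, the threshold value and the toy identifiers (disease_A, g1, ...) are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two sets (stand-in for a semantic similarity)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_diseaseome(disease_genes, threshold=0.3):
    """Return an undirected weighted graph as a dict of dicts.

    Two diseases are linked when their similarity reaches the threshold.
    The threshold is an illustrative assumption, not a value from the report.
    """
    graph = {disease: {} for disease in disease_genes}
    for d1, d2 in combinations(sorted(disease_genes), 2):
        sim = jaccard(disease_genes[d1], disease_genes[d2])
        if sim >= threshold:
            graph[d1][d2] = sim
            graph[d2][d1] = sim
    return graph


# Toy data with made-up identifiers.
toy = {
    "disease_A": {"g1", "g2", "g3"},
    "disease_B": {"g2", "g3", "g4"},
    "disease_C": {"g5"},
}
diseaseome = build_diseaseome(toy)
```

Two diseaseomes built from different sources or similarity measures could then be compared, for instance through their edge sets.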

Biological Data Aggregation for Knowledge Discovery

Two multi-disciplinary projects were initiated in 2016, in collaboration with the Capsid Team, with a group of clinicians from the Regional University Hospital (CHU Nancy) and with bio-statisticians from the Maths Lab (IECL). The first project is entitled ITM2P (“Innovations Technologiques, Modélisation et Médecine Personnalisée”) and is part of the CPER 2015–2020 framework. We are involved in the design of the SMEC platform as a support for “Simulation, Modeling and Knowledge Extraction from Bio-Medical Data”.

The second project is an RHU (“Recherche Hospitalo-Universitaire”) project entitled Fight Heart Failure (FHF), where we are in charge of a workpackage about “data aggregation” mechanisms. Accordingly, we are working on the definition of a multidimensional similarity measure for comparing and clustering sets of patients. Each cluster should correspond to a bioprofile, i.e. a subgroup of patients sharing the same form of the disease and thus the same diagnosis and care strategy.

The first results were presented at the International Symposium on Aggregation and Structures (ISAS 2016) [36]. We proposed GABS (“Graph Aggregation Based Similarity”), an approach for aggregating a complex graph into a similarity graph over a subset of its nodes. The initial graph contains two types of nodes, i.e. individuals and attributes, and the pairwise similarity between individuals is derived from the various paths connecting them in the initial graph. This setting allows the integration of domain knowledge (e.g. domain ontologies or norms) in the initial graph. Another advantage of the GABS approach is that the resulting similarity graph can be used as input for various clustering algorithms, both graph-based ones and those working on a similarity or distance matrix.
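The general idea of deriving a similarity between individuals from paths through attribute nodes can be sketched as follows. This is only an illustration under simplifying assumptions and not the GABS algorithm itself: it considers paths of length two only (individual, attribute, individual) and scores each path by the product of its edge weights. The function name, the weighting scheme and the toy patient data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def path_similarity(edges):
    """Aggregate a bipartite individual-attribute graph into pairwise scores.

    edges: iterable of (individual, attribute, weight) triples.
    The similarity of two individuals is the sum, over shared attributes,
    of the product of the two edge weights (an illustrative choice).
    """
    by_attribute = defaultdict(dict)
    for individual, attribute, weight in edges:
        by_attribute[attribute][individual] = weight

    similarity = defaultdict(float)
    for members in by_attribute.values():
        for a, b in combinations(sorted(members), 2):
            similarity[(a, b)] += members[a] * members[b]
    return dict(similarity)


# Toy graph: three patients described by two attributes.
edges = [
    ("p1", "diabetes", 1.0),
    ("p2", "diabetes", 1.0),
    ("p1", "hypertension", 1.0),
    ("p2", "hypertension", 1.0),
    ("p3", "hypertension", 1.0),
]
similarity = path_similarity(edges)
```

The resulting dictionary is a weighted similarity graph over individuals. It can be converted into a matrix for matrix-based clustering or used directly by a graph-based method. Domain knowledge could be injected by adding attribute nodes derived from an ontology, for example more general classes.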

The next question will be to build a prediction model for each bioprofile/subgroup (once validated by the clinicians) for a decision support system. Thus, we are investigating SRL (“Statistical Relational Learning”) methods which combine symbolic and probabilistic methods for improving expressivity (through logical or relational languages) and for dealing with uncertainty.

Suggesting Valid Pharmacogenes by Mining Linked Open Data and Electronic Health Records

A standard task in pharmacogenomics research is identifying genes that may be involved in drug response variability, called “pharmacogenes”. As genomic experiments in this domain tend to generate many false positives, computational approaches based on background knowledge may generate more valuable results. Until now, the latter have only used molecular network databases or biomedical literature. We are studying a new method that takes advantage of various linked data sources to validate uncertain drug-gene relationships, i.e. pharmacogenes [75]. One advantage lies in the standard implementation of linked data, which facilitates the joint use of various sources and makes it easier to consider features of various origins. Accordingly, we selected, formatted, interconnected and published an initial set of linked data sources relevant to pharmacogenomics. We applied numerical classification methods for extracting drug-gene pairs that can become validated pharmacogene candidates.

The ANR project “PractiKPharma”, initiated in 2016, relies on similar ideas and aims at validating state-of-the-art knowledge in pharmacogenomics (http://practikpharma.loria.fr/). The originality of “PractiKPharma” is to use Electronic Health Records (EHRs) to constitute cohorts of patients that can be mined for validating extracted pharmacogenomics knowledge units.

Analysis of biomedical data annotated with ontologies

In the context of the Snowflake Inria Associate team, Gabin Personeni, a PhD student co-supervised by Marie-Dominique Devignes (Capsid EPI) and Adrien Coulet (Orpailleur EPI), spent four months at Stanford University in 2016. After this internship, we developed an approach based on pattern structures to identify frequently associated ADRs (Adverse Drug Reactions) from patient data, either in the form of EHRs or of spontaneous ADR reports [51], [49]. In this case, pattern structures provide an expressive representation of ADRs, taking into account the multiplicity of drugs and phenotypes involved in such reactions. Additionally, pattern structures allow considering the diverse biomedical ontologies used to represent or annotate patient data, enabling a “semantic” comparison of ADRs. To our knowledge, this is the first research work considering such representations to mine rules between frequently associated ADRs. We illustrated the generality of the approach on two distinct patient datasets, each of them linked to distinct biomedical ontologies. The first dataset corresponds to anonymized EHRs, extracted from “STRIDE”, the EHR data warehouse of Stanford Hospital and Clinics. The second dataset is extracted from the U.S. FDA (Food & Drug Administration) Adverse Event Reporting System (FAERS). Several significant association rules have been extracted and analyzed, and may be used as a basis for a recommendation system.
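As a simplified illustration of what a rule between frequently associated ADRs measures, the sketch below computes the support and confidence of a rule over patient records represented as plain sets of reactions. It uses classical association rule measures and does not reproduce the pattern structure formalism or the ontology-based comparison described above. The function names and the toy records are hypothetical.

```python
def support(records, itemset):
    """Fraction of records that contain every reaction in itemset."""
    if not records:
        return 0.0
    itemset = frozenset(itemset)
    return sum(1 for record in records if itemset <= record) / len(records)


def confidence(records, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent."""
    antecedent_support = support(records, antecedent)
    if antecedent_support == 0:
        return 0.0
    both = set(antecedent) | set(consequent)
    return support(records, both) / antecedent_support


# Toy records: one set of observed reactions per patient (made-up data).
records = [
    frozenset({"nausea", "rash"}),
    frozenset({"nausea", "rash", "headache"}),
    frozenset({"nausea"}),
    frozenset({"headache"}),
]
rule_support = support(records, {"nausea", "rash"})
rule_confidence = confidence(records, {"nausea"}, {"rash"})
```

On these toy records the rule nausea -> rash holds in two of the four records, and in two of the three records that contain nausea.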