ORPAILLEUR - 2017 - Annual Activity Report

Section: New Results

Knowledge Discovery in Healthcare and Life Sciences

Participants: Miguel Couceiro, Adrien Coulet, Kévin Dalleau, Nicolas Jay, Joël Legrand, Pierre Monnin, Amedeo Napoli, Chedy Raïssi, Mohsen Sayed, Malika Smaïl-Tabbone, Yannick Toussaint.

Ontology-based Clustering of Biological Linked Open Data

Increasing amounts of biomedical data provided as Linked Open Data (LOD) offer novel opportunities for knowledge discovery in biomedicine. We proposed an approach for selecting, integrating, and mining LOD with the goal of discovering genes responsible for a disease [87]. We are currently integrating LOD about known phenotypes and genes responsible for diseases with relevant bio-ontologies, and we are defining a corpus-based semantic distance. One possible application of this work is to build and compare “diseaseomes”, i.e. global graphs in which all diseases are connected according to their pairwise similarity values.
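As an illustration of the “diseaseome” idea, the following minimal Python sketch builds a graph from pairwise similarity values and compares two such graphs. The disease names, the similarity values, the threshold, and the helper names `build_diseaseome` and `edge_jaccard` are assumptions made for this example only; the report does not specify how the graphs are built or compared.

```python
from itertools import combinations


def build_diseaseome(similarity, threshold):
    """Link every pair of diseases whose similarity reaches the threshold."""
    diseases = sorted({d for pair in similarity for d in pair})
    graph = {d: {} for d in diseases}
    for a, b in combinations(diseases, 2):
        # The similarity is assumed symmetric and stored once per pair.
        s = similarity.get((a, b), similarity.get((b, a), 0.0))
        if s >= threshold:
            graph[a][b] = s
            graph[b][a] = s
    return graph


def edge_jaccard(graph_1, graph_2):
    """Compare two diseaseomes by the Jaccard index of their edge sets."""
    edges_1 = {frozenset((a, b)) for a in graph_1 for b in graph_1[a]}
    edges_2 = {frozenset((a, b)) for a in graph_2 for b in graph_2[a]}
    union = edges_1 | edges_2
    return len(edges_1 & edges_2) / len(union) if union else 1.0


# Toy, made-up similarity values (not taken from the report).
toy_similarity = {
    ("disease_A", "disease_B"): 0.82,
    ("disease_A", "disease_C"): 0.10,
    ("disease_B", "disease_C"): 0.55,
}
loose = build_diseaseome(toy_similarity, threshold=0.5)
strict = build_diseaseome(toy_similarity, threshold=0.8)
overlap = edge_jaccard(loose, strict)
```

Here the two graphs differ only by the similarity threshold; in practice, one would more likely compare graphs obtained from different similarity measures or data sources.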

Biological Data Aggregation for Knowledge Discovery

This research takes place within two multidisciplinary projects initiated in 2016, in collaboration with the Capsid team, a group of clinicians from the Regional University Hospital (CHU Nancy), and biostatisticians from the Maths Lab (IECL). The first project, ITM2P (“Innovations Technologiques, Modélisation et Médecine Personnalisée”, i.e. “Technological Innovations, Modeling and Personalized Medicine”), falls under the CPER 2015–2020 framework. We are involved in the design of the SMEC platform, which supports “Simulation, Modeling and Knowledge Extraction from Bio-Medical Data”.

The second project is an RHU (“Recherche Hospitalo-Universitaire”, i.e. “University Hospital Research”) project entitled Fight Heart Failure (FHF), in which we are in charge of a work package entitled “Network-based analysis and integration”. Accordingly, we are working on the definition of a multidimensional similarity measure for comparing and clustering sets of patients. Each cluster should correspond to a bioprofile, i.e. a subgroup of patients sharing the same form of the disease and thus the same diagnosis and care strategy. The first results were presented at the “International Symposium on Aggregation and Structures (ISAS 2016)” [74], where we proposed an approach for complex graph aggregation that produces a similarity graph between a subset of nodes. In recent work, we explored an alternative way to define and efficiently compute pairwise patient similarity using “Unsupervised Extremely Randomized Trees” [62].
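To give an intuition of how randomized trees can yield a pairwise similarity without labels, here is a simplified Python sketch. It is not the implementation from [62]: it assumes numerical attributes only, draws a random attribute and a random cut point at each node, and takes the similarity of two patients to be the fraction of trees in which they end up in the same leaf. The function names, the parameters `n_trees` and `min_split`, and the toy data are all assumptions.

```python
import random


def _leaves(indices, data, rng, min_split):
    """Recursively split row indices on a random attribute and cut point."""
    if len(indices) < min_split:
        return [indices]
    features = list(range(len(data[0])))
    rng.shuffle(features)
    for f in features:
        values = [data[i][f] for i in indices]
        lo, hi = min(values), max(values)
        if lo < hi:  # a constant attribute cannot separate anything
            cut = rng.uniform(lo, hi)
            left = [i for i in indices if data[i][f] < cut]
            right = [i for i in indices if data[i][f] >= cut]
            if left and right:
                return (_leaves(left, data, rng, min_split)
                        + _leaves(right, data, rng, min_split))
    return [indices]  # no usable split was found: keep the node as a leaf


def tree_similarity(data, n_trees=200, min_split=3, seed=0):
    """Fraction of random trees in which two rows share a leaf."""
    rng = random.Random(seed)
    n = len(data)
    counts = [[0] * n for _ in range(n)]
    for _ in range(n_trees):
        for leaf in _leaves(list(range(n)), data, rng, min_split):
            for i in leaf:
                for j in leaf:
                    counts[i][j] += 1
    return [[c / n_trees for c in row] for row in counts]


# Toy data: two well-separated groups of three "patients" (made-up values).
patients = [
    [1.0, 1.1], [1.2, 0.9], [0.9, 1.0],
    [8.0, 8.2], [8.1, 7.9], [7.9, 8.1],
]
sim = tree_similarity(patients)
```

The resulting matrix is symmetric with ones on the diagonal, and it can be passed to any clustering method that accepts a precomputed similarity. No class label is needed at any point, which is what makes this kind of measure suitable for discovering patient subgroups.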

The next challenge is to build a prediction model for each bioprofile (subgroup), once it has been validated by clinicians, and to integrate this model in a decision support system. We are currently investigating “Statistical Relational Learning” and analogy-based methods to achieve this goal.

Validation of Pharmacogenomics Knowledge

A standard task in pharmacogenomics research is identifying genes that may be involved in drug response variability. Those genes are called “pharmacogenes”. As genomic experiments in this domain tend to generate many false positives, computational approaches based on background knowledge may generate more valuable results. Until now, such approaches have only used molecular network databases or biomedical literature. We developed a new method that takes advantage of various linked data sources to evaluate the validity of uncertain drug-gene relationships, i.e. candidate pharmacogenes [5]. One advantage lies in the standard implementation of linked data, which facilitates the joint use of various sources and makes it easier to consider features of various origins. A second advantage is related to the graph mining approaches that we are using, which consider linked data in their original form, i.e. as graphs. We selected, formatted, interconnected and published an initial set of linked data sources relevant to pharmacogenomics, named PGxLOD (for “PharmacoGenomic Linked Open Data”). We applied and compared distinct numerical classification methods on these data and identified candidate pharmacogenes.

This work is a first attempt at validating state-of-the-art knowledge in pharmacogenomics, which is one objective of the ANR project “PractiKPharma” initiated in 2016 (http://practikpharma.loria.fr/). This year, we improved and enriched PGxLOD in various ways. In particular, we wanted PGxLOD to be able to encompass pharmacogenomic knowledge of various origins, such as scientific literature, specialized databases, or Electronic Health Records (EHRs). To represent the fact that a given knowledge unit may have distinct provenances, we developed a simple ontology named PGxO (“Pharmacogenomic Ontology”), which relies on the standard ontology PROV-O to represent provenance. This makes it possible to compare similar knowledge units that may have distinct origins [45].
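The role of provenance can be illustrated with a small Python sketch that groups knowledge units by content and collects the sources each one comes from. The triples below are a deliberately simplified stand-in and do not follow the actual PGxO schema: every `ex:` identifier and the helper `provenances_by_content` are hypothetical, and only `prov:wasDerivedFrom` is a property defined by PROV-O.

```python
from collections import defaultdict

# (subject, predicate, object) triples; all "ex:" names are hypothetical.
triples = [
    ("ex:unit1", "ex:drug", "ex:drug_X"),
    ("ex:unit1", "ex:gene", "ex:gene_Y"),
    ("ex:unit1", "prov:wasDerivedFrom", "ex:literature_source"),
    ("ex:unit2", "ex:drug", "ex:drug_X"),
    ("ex:unit2", "ex:gene", "ex:gene_Y"),
    ("ex:unit2", "prov:wasDerivedFrom", "ex:ehr_source"),
    ("ex:unit3", "ex:drug", "ex:drug_Z"),
    ("ex:unit3", "ex:gene", "ex:gene_W"),
    ("ex:unit3", "prov:wasDerivedFrom", "ex:database_source"),
]


def provenances_by_content(triples):
    """Map each distinct unit content to the set of sources asserting it."""
    content = defaultdict(set)  # unit -> {(predicate, object), ...}
    sources = defaultdict(set)  # unit -> {provenance source, ...}
    for s, p, o in triples:
        if p == "prov:wasDerivedFrom":
            sources[s].add(o)
        else:
            content[s].add((p, o))
    grouped = defaultdict(set)
    for unit, description in content.items():
        grouped[frozenset(description)] |= sources[unit]
    return grouped


grouped = provenances_by_content(triples)
# Knowledge units supported by more than one kind of source.
corroborated = {k: v for k, v in grouped.items() if len(v) > 1}
```

In this toy example, `ex:unit1` and `ex:unit2` describe the same drug-gene pair and are therefore grouped together with two distinct provenances. Exact matching of content is the simplest possible comparison; comparing units that are merely similar would need a looser matching criterion.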

Analysis of Biomedical Data Annotated with Ontologies

In the context of the Snowflake Inria Associate Team (now Snowball), we developed an approach based on pattern structures to identify frequently associated ADRs (Adverse Drug Reactions) from patient data, in the form of either EHRs or spontaneous ADR reports. In this case, pattern structures provide an expressive representation of ADRs, taking into account the multiplicity of drugs and phenotypes involved in such reactions. Additionally, pattern structures make it possible to consider the diverse biomedical ontologies used to represent or annotate patient data, enabling a “semantic” comparison of ADRs. To date, this is the first research work considering such representations to mine rules between frequently associated ADRs. We illustrated the generality of the approach on two patient datasets, each of them linked to distinct biomedical ontologies. The first dataset corresponds to anonymized EHRs extracted from “STRIDE”, the EHR data warehouse of Stanford Hospital and Clinics. The second dataset is extracted from the U.S. FDA (Food & Drug Administration) “Adverse Event Reporting System” (FAERS). Several significant association rules have been extracted and analyzed, and they may be used as the basis of a recommendation system [29].
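The following Python sketch illustrates, in a much simpler setting than pattern structures, why ontology annotations help when mining association rules: a rule can hold at a general level even when the specific terms differ from one patient to another. It does not implement the approach of [29]. The toy is-a hierarchy, the term names, the records, and the helpers `generalize` and `rule_stats` are all assumptions.

```python
# Toy is-a hierarchy (child -> parent); all term names are hypothetical.
PARENT = {
    "term_a1": "term_A",
    "term_a2": "term_A",
    "term_b1": "term_B",
}


def generalize(terms):
    """Return the given terms together with all their ancestors."""
    closed = set()
    for t in terms:
        while t is not None and t not in closed:
            closed.add(t)
            t = PARENT.get(t)
    return closed


def rule_stats(records, antecedent, consequent):
    """Support and confidence of antecedent -> consequent over the records."""
    closed = [generalize(r) for r in records]
    n_ante = sum(1 for r in closed if antecedent <= r)
    n_both = sum(1 for r in closed if (antecedent | consequent) <= r)
    support = n_both / len(records)
    confidence = n_both / n_ante if n_ante else 0.0
    return support, confidence


# Each record is the set of terms annotating one patient (made-up data).
records = [
    {"term_a1", "term_b1"},
    {"term_a2", "term_b1"},
    {"term_a1"},
    {"term_b1"},
]
specific = rule_stats(records, {"term_a1"}, {"term_b1"})
general = rule_stats(records, {"term_A"}, {"term_B"})
```

On this toy data, the rule stated with the general terms covers more records than the rule stated with the specific terms, because the first two patients are both matched once their annotations are generalized.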
