Section: New Results

KDDK in Life Sciences

Participants: Mehwish Alam, Yasmine Assess, Sid-Ahmed Benabderrahmane, Emmanuel Bresso, Thomas Bourquard, Adrien Coulet, Sébastien Da Silva, Marie-Dominique Devignes, Anisah Ghoorah, Renaud Grisoni, Mehdi Kaytoue, Jean-François Kneib, Florence Le Ber, Bernard Maigret, Jean-François Mari, Lazaros Mavridis, Amedeo Napoli, Violeta Pérez-Nueno, Dave Ritchie, Malika Smaïl-Tabbone, Vishwesh Venkatraman.

One of the major challenges of the post-genomic era is analyzing terabytes of biological data stored in hundreds of heterogeneous databases (DBs). Extracting knowledge units from these large volumes of data would give meaning to the current data production effort in domains such as disease understanding, drug discovery, pharmacogenomics, and systems biology. The research reported here addresses these issues and shows how KDDK is spreading across such domains.

Ontology-based Functional Classification of Genes

Functional classification involves grouping genes according to their molecular functions or the biological processes they participate in. This unsupervised classification task is essential for interpreting gene datasets produced by post-genomic experiments. As the functional annotation of genes is mostly based on the Gene Ontology (GO), many GO-based similarity measures have been described, but few of them have been used for clustering [107]. We have evaluated a functional classification of genes using our previously described IntelliGO semantic similarity measure with the help of reference sets [38]. The IntelliGO measure computes semantic similarity between genes in order to discover the biological functions they share, and takes into account the domain knowledge represented in the Gene Ontology [82]. The reference sets consist of genes taken from human and yeast KEGG (Kyoto Encyclopedia of Genes and Genomes) pathways and Pfam clans. Hierarchical clustering and heatmap visualization were used to illustrate the advantages of IntelliGO over several other measures. Because genes often belong to more than one reference set, the fuzzy C-means clustering algorithm was then applied to the datasets using IntelliGO. The F-score method was used to estimate the quality of clustering and the optimal number of clusters. The results were compared with those obtained from the state-of-the-art DAVID (Database for Annotation Visualization and Integrated Discovery) functional classification method. Overlap analysis makes it possible to study the matching between clusters and reference sets, and led us to propose a set-difference method for discovering missing information [38]. The IntelliGO similarity measure, the clustering tool and the reference sets used for the evaluation are available at http://plateforme-mbi.loria.fr/intelligo .
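The F-score evaluation and the set-difference step can be illustrated with a minimal sketch. It assumes crisp clusters (each gene in one cluster) and scores each reference set against its best-matching cluster; the gene identifiers, the set names and the averaging scheme are illustrative assumptions, not the procedure published in [38].

```python
# Toy illustration: F-score of clusters against reference sets, followed by
# a set difference between each reference set and its best-matching cluster.
# All gene and set names below are hypothetical.

def f_score(cluster, reference):
    """Harmonic mean of precision and recall of `cluster` against `reference`."""
    overlap = len(cluster & reference)
    if overlap == 0:
        return 0.0
    precision = overlap / len(cluster)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)

def best_match(reference, clusters):
    """Name of the cluster with the highest F-score for this reference set."""
    return max(clusters, key=lambda name: f_score(clusters[name], reference))

def global_f_score(clusters, references):
    """Average, over reference sets, of the best F-score reached by any cluster."""
    scores = [f_score(clusters[best_match(ref, clusters)], ref)
              for ref in references.values()]
    return sum(scores) / len(scores)

def set_differences(clusters, references):
    """Genes that separate each reference set from its best-matching cluster."""
    report = {}
    for ref_name, ref in references.items():
        name = best_match(ref, clusters)
        report[ref_name] = {
            "cluster": name,
            "only_in_cluster": clusters[name] - ref,    # candidates for missing membership
            "only_in_reference": ref - clusters[name],  # candidates for missing annotation
        }
    return report

clusters = {"c1": {"g1", "g2", "g3"}, "c2": {"g4", "g5"}}
references = {"pathway_A": {"g1", "g2"}, "pathway_B": {"g4", "g5", "g6"}}

quality = global_f_score(clusters, references)
report = set_differences(clusters, references)
print(round(quality, 3))
print(report["pathway_A"]["only_in_cluster"], report["pathway_B"]["only_in_reference"])
```

Repeating the computation for several cluster counts and keeping the count with the highest global score is one way to read "optimal number of clusters" here.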

Use of Domain Knowledge for Dimension Reduction

Data complexity is a major challenge for knowledge discovery approaches. High dimensionality of datasets can impair the execution of most data mining programs and/or produce numerous and complex patterns that the supervising expert cannot readily interpret. Thus, an important research direction is dimension reduction as part of the data preparation step [93]. Domain knowledge is essential for achieving such dataset modification with minimal loss of information. The Life Sciences constitute a suitable domain for testing knowledge-guided approaches to dimension reduction because of the continuous increase in the number of both complex datasets and bio-ontologies. Most of these bio-ontologies are used for annotating biological objects, which leads to high-dimensional datasets. We propose a new approach for reducing dimensions in a dataset by exploiting semantic relationships between terms of an ontology structured as a rooted directed acyclic graph [40]. Term clustering is performed using the IntelliGO similarity measure, and the term clusters are then used as descriptors for data representation. The technique was applied to a set of drugs associated with their side effects collected from the SIDER database. Terms describing side effects belong to the MedDRA terminology. The hierarchical clustering of about 1,200 MedDRA terms into an optimal collection of 112 term clusters led to a reduced data representation. Two data mining experiments were conducted to illustrate the advantage of using such a reduced data representation.
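The idea of replacing individual terms by term clusters can be sketched as follows. This is a toy illustration only: the similarity values are invented rather than computed with IntelliGO, the average-linkage merge rule and the threshold are assumptions, and the drug names are hypothetical.

```python
# Toy sketch: agglomerative clustering of ontology terms from a pairwise
# similarity table, then re-description of each object by term clusters.
from itertools import combinations

def average_similarity(a, b, sim):
    """Average pairwise similarity between two disjoint term clusters."""
    return sum(sim[frozenset((x, y))] for x in a for y in b) / (len(a) * len(b))

def cluster_terms(terms, sim, threshold):
    """Merge the two most similar clusters until no pair reaches `threshold`."""
    clusters = [frozenset([t]) for t in terms]
    while len(clusters) > 1:
        a, b = max(combinations(clusters, 2),
                   key=lambda pair: average_similarity(pair[0], pair[1], sim))
        if average_similarity(a, b, sim) < threshold:
            break
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return clusters

def reduce_dataset(dataset, clusters):
    """Describe each object by the indices of the clusters its terms fall in."""
    term_to_cluster = {t: i for i, c in enumerate(clusters) for t in c}
    return {obj: {term_to_cluster[t] for t in terms}
            for obj, terms in dataset.items()}

terms = ["nausea", "vomiting", "rash", "pruritus"]
similarity = {  # invented values standing in for a semantic similarity measure
    frozenset(("nausea", "vomiting")): 0.9,
    frozenset(("rash", "pruritus")): 0.8,
    frozenset(("nausea", "rash")): 0.1,
    frozenset(("nausea", "pruritus")): 0.1,
    frozenset(("vomiting", "rash")): 0.1,
    frozenset(("vomiting", "pruritus")): 0.1,
}
drugs = {
    "drug_A": {"nausea", "vomiting"},
    "drug_B": {"vomiting", "rash"},
    "drug_C": {"pruritus"},
}

term_clusters = cluster_terms(terms, similarity, threshold=0.5)
reduced = reduce_dataset(drugs, term_clusters)
print(len(terms), "terms ->", len(term_clusters), "descriptors")
```

In this toy case four term columns collapse into two cluster descriptors, the same kind of compression as the reported reduction from about 1,200 terms to 112 clusters.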

Results obtained within the collaborative Grand Challenge project (see the 2009 and 2010 reports) have been published this year. We have designed the HIV-PDI (Protein-Drug Interactions) resource as a decision-making tool to propose alternative antiretroviral drugs (ARVs) for personalized antiretroviral treatment [22]. HIV-PDI is an integrated database in which sequence mutations of viral proteins can be mapped onto three-dimensional structural interactions between these proteins and ARVs. Thus, critical losses of interactions leading to resistance can be detected and serve as indicators for proposing appropriate ARVs that escape the resistance. As a first step, HIV-PDI was populated with data relating to HIV protease: clinical information on patients, resistance to ARV treatments, HIV protease structures and mutations, ARV drugs and their 3D interactions with HIV protease models. Possible queries include protein, drug and treatment conditions, coupled with dedicated tools for visualization and analysis of 3D protein-drug interactions. Case studies demonstrate the capabilities of the HIV-PDI resource for retrieving information associated with patients and for analyzing structural data relating proteins and ligands [23].
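The mapping of mutations onto protein-drug contacts can be pictured with a deliberately crude sketch. It treats any mutation at a contact residue as a potentially lost interaction, which is only a proxy; the drug names, residue positions and the `max_lost` parameter are hypothetical and do not reflect the HIV-PDI schema or its data.

```python
# Toy sketch: flag drugs whose contact residues overlap a patient's mutated
# positions, and list the drugs whose contacts appear untouched.
# All drug names and residue positions are hypothetical.

def lost_contacts(mutated_positions, contacts):
    """For each drug, the contact positions that are mutated in the patient."""
    mutated = set(mutated_positions)
    return {drug: sorted(set(positions) & mutated)
            for drug, positions in contacts.items()}

def candidate_drugs(mutated_positions, contacts, max_lost=0):
    """Drugs with at most `max_lost` contact positions hit by a mutation."""
    lost = lost_contacts(mutated_positions, contacts)
    return sorted(drug for drug, hits in lost.items() if len(hits) <= max_lost)

contacts = {
    "drug_X": {25, 30, 82},
    "drug_Y": {25, 48, 50},
    "drug_Z": {27, 29},
}
patient_mutations = {30, 82}

affected = lost_contacts(patient_mutations, contacts)
alternatives = candidate_drugs(patient_mutations, contacts)
print(affected)
print(alternatives)
```

A realistic analysis would rely on the 3D interaction data itself rather than on position overlap alone.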

Mining Agronomical Data with Stochastic Models

In the framework of agricultural landscape data mining, we have developed an original approach combining two methods used separately so far: the identification of explicit farmer decision rules through on-farm survey methods, and the identification of stochastic regularities in the landscape through data mining of the mosaic of agricultural parcels, following preceding work [96]. This approach was assessed in a study on the Niort plain (western France) database. In this database, provided by the CEBC (UPR CNRS), the land use of fields covering a 400 km² area was recorded over 12 years. The result is a segmentation of the landscape based on both its spatial and temporal organization, and partly explained by generic farmer decision rules. This consistency between results shows that the two modelling methods interact and may be combined for land-use modelling at the landscape scale and for understanding the driving forces of spatial organization. Based on farm surveys, we were able to retrieve and measure changes in land use and to link some farmer decisions to the spatiotemporal regularities observed in the landscapes.
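As a simplified picture of what a temporal regularity in land use can look like, the sketch below estimates first-order transition probabilities between successive yearly land uses of each parcel. A plain Markov chain is used here as a stand-in for illustration; it is not the stochastic model of the study, and the parcel identifiers and crop sequences are invented.

```python
# Toy sketch: first-order transition probabilities between the land uses
# recorded in successive years on each parcel. Data are invented.
from collections import Counter, defaultdict

def transition_probabilities(sequences):
    """Estimate P(next land use | current land use) from yearly sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for current, following in zip(seq, seq[1:]):
            counts[current][following] += 1
    return {
        state: {nxt: n / sum(successors.values())
                for nxt, n in successors.items()}
        for state, successors in counts.items()
    }

parcels = {
    "p1": ["wheat", "sunflower", "wheat", "rapeseed"],
    "p2": ["wheat", "sunflower", "wheat", "sunflower"],
    "p3": ["meadow", "meadow", "meadow", "meadow"],
}

probabilities = transition_probabilities(parcels.values())
for state, successors in sorted(probabilities.items()):
    print(state, successors)
```

Frequent transitions such as wheat followed by sunflower are the kind of regularity that could then be compared with decision rules reported in farm surveys.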