Section: New Results
Axis 1: New Approaches for Knowledge Discovery in Structural Databases
Biomedical Knowledge Discovery
Our collaboration with clinicians at the CHRU Nancy, in the framework of the RHU FIGHT-HF program and of the Contrat d'Interface, has led to two publications demonstrating the added value of database and knowledge graph exploitation when analyzing observational or prospective cohorts. In a retrospective observational study, we identified and characterized patient subgroups presenting stable or unstable positivity in anti-phospholipid antibody assays [15]. In the European FibroTarget cohort study, we contributed to the characterization of at-risk phenotypic groups using proteomic biomarkers [16].
Another application is carried out in collaboration with the Orpailleur Team as part of the PraktikPharma ANR project. We aim to build explanations for severe drug side effects (such as drug-induced liver injury or severe cutaneous adverse reactions) from a pharmacogenomics RDF graph (PGXlod). Our work on the molecular characterization of unexplained adverse drug reactions using this graph was accepted as a podium abstract at the MedInfo 2019 conference [30].
Stochastic Decision Trees for Similarity Computation
As part of Kévin Dalleau's PhD thesis, we have designed a method to compute similarities on unlabeled data using stochastic decision trees [31], [27]. The main idea of Unsupervised Extremely Randomized Trees (UET) is to split the data randomly and iteratively until a stopping criterion is met. Pairwise similarity values are then computed from the co-occurrence of samples in the leaves of each generated tree. We evaluated the method on synthetic and real-world datasets by comparing the mean similarity between samples with the same label to the mean similarity between samples with distinct labels. These empirical studies show that the method assigns clearly different similarity values to pairs of samples from the same cluster and to pairs from distinct clusters, and indiscernible values when there is no cluster structure. We also assessed properties such as invariance under monotone transformations of variables and robustness to correlated variables and noise. Our experiments show that the algorithm outperforms existing methods in some cases and can reduce the amount of preprocessing needed for many real-world datasets. We extended the approach to the computation of pairwise similarities between graph nodes, with experimental results that are competitive with state-of-the-art methods. We are currently working on combining the two similarity methods (on attribute-value objects and on graph nodes) to handle attributed graphs, in which nodes are described by attributes.
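The principle of leaf co-occurrence similarity can be illustrated with a minimal Python sketch. This is not the published UET implementation: the function names, the uniform choice of feature and threshold, and the stopping rule (a node with fewer than `min_size` samples becomes a leaf) are assumptions made for illustration only.

```python
import random


def random_tree_leaves(data, indices, min_size, rng):
    """Split `indices` recursively at random and return the leaves (lists of indices).

    Assumed stopping rule: a node becomes a leaf when it holds fewer than
    `min_size` samples or when the drawn split cannot separate them.
    """
    if len(indices) < min_size:
        return [indices]
    feature = rng.randrange(len(data[0]))      # feature drawn uniformly
    values = [data[i][feature] for i in indices]
    low, high = min(values), max(values)
    if low == high:                            # constant feature: stop here
        return [indices]
    threshold = rng.uniform(low, high)         # threshold drawn uniformly
    left = [i for i in indices if data[i][feature] <= threshold]
    right = [i for i in indices if data[i][feature] > threshold]
    if not left or not right:
        return [indices]
    return (random_tree_leaves(data, left, min_size, rng)
            + random_tree_leaves(data, right, min_size, rng))


def uet_similarity(data, n_trees=50, min_size=5, seed=0):
    """Return an n x n matrix: fraction of trees in which two samples share a leaf."""
    rng = random.Random(seed)
    n = len(data)
    counts = [[0] * n for _ in range(n)]
    for _ in range(n_trees):
        for leaf in random_tree_leaves(data, list(range(n)), min_size, rng):
            for i in leaf:
                for j in leaf:
                    counts[i][j] += 1
    return [[c / n_trees for c in row] for row in counts]


if __name__ == "__main__":
    # Toy data: two well-separated groups of five points in two dimensions.
    toy_rng = random.Random(42)
    group_a = [[toy_rng.gauss(0, 1), toy_rng.gauss(0, 1)] for _ in range(5)]
    group_b = [[toy_rng.gauss(10, 1), toy_rng.gauss(10, 1)] for _ in range(5)]
    similarity = uet_similarity(group_a + group_b, n_trees=100, min_size=3)
    for row in similarity:
        print(" ".join(f"{value:.2f}" for value in row))
```

Because the splits depend only on the ordering of values along each feature, a sketch of this kind is insensitive to the scale of the variables, which is in line with the reduced preprocessing mentioned above.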
We plan to study the application of this pairwise similarity computation to the quantification of protein structural similarities. Two open problems are how to represent the protein structure and how to handle additional constraints such as invariance under rotation and translation.
Protein Annotation and Machine Learning
We took part in the 3rd international CAFA Challenge ("Critical Assessment of Functional Annotation") through our work on (i) domain functional annotation (Zia Alborzi's PhD thesis) and (ii) label propagation in graphs (Bishnu Sarker's PhD thesis). We are therefore among the contributors to the general report published this year [23].
As part of his PhD work, Bishnu Sarker developed and tested on UniProt/SwissProt a new method for functional annotation of proteins using domain embedding-based sequence classification [25].
Multiple Instance Learning (MIL) is a machine learning strategy that can be applied to sets of sequences describing organisms that display a given property. The purpose here is to classify a new organism with respect to this property, based on its sequences and their similarity to the sequences of already classified organisms. New MIL algorithms have been described and tested in the framework of a collaboration [26], [24]. Another collaborative work has led to the development of a distributed algorithm for large-scale graph clustering [34].
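To make the multiple-instance setting concrete, the sketch below treats each organism as a bag of sequences and labels a new bag by comparing its sequences with those of the labeled bags. It is a naive baseline written for illustration, not one of the algorithms of [26], [24]: the k-mer Jaccard similarity, the scoring rule (mean of the best match per sequence) and all function names are assumptions. Bags are assumed to be non-empty.

```python
def kmers(sequence, k=3):
    """Return the set of length-k substrings of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity between two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def classify_bag(new_bag, labeled_bags, k=3):
    """Assign a label to `new_bag`, a non-empty list of sequences.

    `labeled_bags` is a list of (sequences, label) pairs. For each label,
    every sequence of the new bag is matched to its most similar sequence
    among the bags carrying that label, and the best matches are averaged.
    Returns the label with the highest score together with all scores.
    """
    new_profiles = [kmers(s, k) for s in new_bag]
    scores = {}
    for label in {lab for _, lab in labeled_bags}:
        references = [kmers(s, k)
                      for bag, lab in labeled_bags if lab == label
                      for s in bag]
        best = [max(jaccard(p, r) for r in references) for p in new_profiles]
        scores[label] = sum(best) / len(best)
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    # Toy bags with made-up sequences; labels are placeholders.
    training = [
        (["ACGTACGT", "ACGTTTGA"], "positive"),
        (["GGGCCCGG", "CCCGGGCC"], "negative"),
    ]
    label, scores = classify_bag(["ACGTACGA", "ACGTTTGT"], training)
    print(label, scores)
```

Scoring at the level of the bag rather than the individual sequence is what distinguishes this setting from ordinary sequence classification: only the organism carries a label, not each of its sequences.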