Section: Research Program

Symbolic methods for model space exploration: Ontologies and Formal Concept Analysis

All methods presented in the previous section usually result in pools of candidates that explain the data and knowledge equally well. These candidates can be dynamical systems, compounds, biological sequences, proteins... In any case, the output of our formal methods generally requires a posteriori investigation and filtering. We rely on two classes of symbolic techniques to this end: Semantic Web technologies and Formal Concept Analysis (FCA). Both aim at formalizing and managing knowledge, that is, at making explicit the relations occurring in structured data. These techniques are complementary: the production of relevant concepts in FCA depends heavily on the availability of semantic annotations that use a controlled set of terms and, conversely, building ontologies is a complex process that FCA can make much easier.

Semantic web for life sciences

Life sciences are intrinsically complicated and complex. Until a few years ago, both the scarcity of available information and the limited processing power imposed a double constraint: work had to be performed on fragmented areas (either precise but narrow, or broad but shallow) and had to rely on simplifying hypotheses [52]. The recent joint evolution of data acquisition capabilities in the biomedical field and of the methods and infrastructures supporting data analysis (grids, the Internet...) resulted in an explosion of data production in complementary domains (*omics, phenotypes and traits, pathologies, micro and macro environment...) [52], [56], [47]. This “data deluge” is the life-science version of the more general “big data” phenomenon, with the specificities that the proportion of generated data is much higher and that these data are highly connected [86]. In addition to the breakthroughs in each of these domains, major efforts have been undertaken, notably in Systems Biology, to develop the links between them. The bottleneck that once was data scarcity now lies in the lack of adequate methods supporting data integration, processing and analysis. Each of these steps typically hinges on domain knowledge, which is why it resists automation. This knowledge can be seen as the set of rules stating under what conditions data can be used or combined to infer new data or new links between data.

The knowledge we are focusing on is mostly symbolic, as opposed to other kinds of biomedical knowledge (probabilistic, related to chemical kinetics, 3D models of anatomical entities or 4D models of processes...). It should typically support generalization, association and deduction. There is a long tradition of work aiming at an explicit and formal representation of this knowledge that would support automatic processing.

This line of work resulted in the now widespread acceptance of ontologies [87], [59] to represent biomedical entities, their properties and the relations between these entities. Bard et al. defined ontologies as “formal representations of knowledge in which the essential terms are combined with structuring rules that describe the relationships between the terms” [44]. Ontologies range from fairly simple hierarchies to semantically rich organizations supporting complex reasoning [59]. Ontologies are now a well-established field [59], [54] that evolved from concept representation [83].

The emergence of ontologies in biomedical informatics and bioinformatics happened in parallel with the development of the Semantic Web in the computer science community [81], [83]. The Semantic Web is an extension of the current Web that provides an infrastructure integrating data and ontologies in order to support unified reasoning.
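As an illustration of what reasoning over integrated data and ontologies can mean, here is a minimal sketch in plain Python. It is not a Semantic Web toolkit: the triples, the class names and the identifier `p1` are hypothetical, and only two RDFS-style rules are applied (transitivity of `subClassOf`, and propagation of `type` to superclasses).

```python
# Toy triple store: (subject, predicate, object).
# All identifiers below are hypothetical and chosen for illustration only.
TRIPLES = {
    ("Kinase", "subClassOf", "Enzyme"),   # ontology statement
    ("Enzyme", "subClassOf", "Protein"),  # ontology statement
    ("p1", "type", "Kinase"),             # data statement
}


def infer(triples):
    """Forward-chain two RDFS-style rules until no new triple is produced."""
    closed = set(triples)
    while True:
        new = set()
        for (s, p, o) in closed:
            for (s2, p2, o2) in closed:
                # Only chain through a subClassOf statement starting at o.
                if o != s2 or p2 != "subClassOf":
                    continue
                if p == "subClassOf":
                    # Rule 1: subClassOf is transitive.
                    new.add((s, "subClassOf", o2))
                elif p == "type":
                    # Rule 2: an instance of a class is an instance of its superclasses.
                    new.add((s, "type", o2))
        if new <= closed:
            return closed
        closed |= new


INFERRED = infer(TRIPLES)
```

The point of the sketch is that the data statement and the ontology statements live in the same structure, so a single inference procedure derives that `p1` is also an `Enzyme` and a `Protein` without this being stated explicitly.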

Life sciences are a great application domain for the Semantic Web [57], [76], [46]. Semantic Web technologies have become an integral part of translational medicine and translational bioinformatics [47], [58]. The Linked Data initiative [51], and particularly the Linked Open Data project, promote the integration of data sources in machine-processable formats compatible with the Semantic Web. Figure 5 shows the share of life science resources in the Linked Open Data cloud. In the past few years, Linked Data has proved instrumental in addressing the problem of data integration [68], [72].

Figure 5. Linked Open Data cloud in August 2014. Nodes are resources. Edges are cross-references between resources. Life science resources constitute the purple portion in the lower right corner.
IMG/linkedData_datasets_201408.png

We are working on the integration of Semantic Web resources with our data analysis methods in order to take existing biological knowledge into account.

Formal Concept Analysis

Initially developed in the community of set and order theorists, algebraists and discrete mathematicians, Formal Concept Analysis aims at the development of conceptual structures which can be logically activated for the formation of judgments and conclusions [90]. In its simplest form, one considers a binary relation between a set of objects O and a set of attributes A. The derivation operator ' associates with each subset U of O (resp. V of A) the subset U' of A (resp. V' of O) made of the elements related to all elements of U (resp. V). A formal concept is characterized by an extension (a subset of O, the individuals belonging to the concept) and an intension (a subset of A, the properties applying to all objects in the extension), such that both subsets are stable under the double derivation operator ''. Concepts are related within a lattice structure (Galois connection) by subconcept-superconcept relations, which makes it possible to draw causality relations between attribute subsets.
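The definitions above can be made concrete on a toy example. The following sketch assumes a hypothetical context of three objects and three attributes, and enumerates concepts by brute force (closing every subset of objects), which is exponential and only meant for illustration. The function and variable names are ours.

```python
from itertools import combinations

# Hypothetical formal context: each object is mapped to the attributes it has.
CONTEXT = {
    "o1": {"a", "b"},
    "o2": {"a", "b", "c"},
    "o3": {"c"},
}
ATTRIBUTES = set().union(*CONTEXT.values())


def common_attributes(objects):
    """U': attributes shared by every object of U (all attributes if U is empty)."""
    result = set(ATTRIBUTES)
    for obj in objects:
        result &= CONTEXT[obj]
    return result


def common_objects(attributes):
    """V': objects that have every attribute of V."""
    return {obj for obj, attrs in CONTEXT.items() if attributes <= attrs}


def formal_concepts():
    """Enumerate all (extension, intension) pairs by closing every object subset."""
    concepts = set()
    objects = sorted(CONTEXT)
    for size in range(len(objects) + 1):
        for subset in combinations(objects, size):
            intension = common_attributes(subset)   # U'
            extension = common_objects(intension)   # U''
            concepts.add((frozenset(extension), frozenset(intension)))
    return concepts


CONCEPTS = formal_concepts()
```

On this context the enumeration yields four concepts, for instance the pair ({o1, o2}, {a, b}): both objects have a and b, and no other object has both. Each pair is stable in the sense given above, since deriving the extension gives the intension and deriving the intension gives back the extension.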

It is used in various domains managing structured data, such as knowledge processing, information retrieval or classification [73]. We study the issues raised by its application in bioinformatics. Among others, it has been used to derive phylogenetic relations among groups of organisms [71], a classification task that requires taking many-valued Galois connections into account. In a similar way, we have proposed a classification scheme for the problem of assigning proteins to a set of protein families [61]. One of the most important issues with concept analysis is that current methods remain very sensitive to the presence of uncertainty or incompleteness in data. On the other hand, this apparent defect can be turned into a marker of incompleteness or inconsistency. This has been used, for example, for drug repositioning [62], where the completion of concepts supports the prediction of new relations in a drug-target-disease network and, ultimately, the assignment of drugs to new diseases. We have proposed a methodology to tackle the problem of uncertainty in biological networks where edges are mostly predicted links with a high level of false positives [91]. The general idea is to look for a tradeoff between the simplicity of the conceptual representation and the need to manage exceptions. We are also interested in using ontologies to help this process, and in using concept analysis to help ontology refinement [74], [50], [78].

Networks are widely used in bioinformatics for the integration of multiple sources of data inside a common model, and this leads to very large networks (protein/protein interactions, signaling or regulation networks, metabolic networks...). Common difficult tasks in this context are visualization, search for local structures (graph mining) and network comparison. Network compression is a good solution for treating all these tasks efficiently. It has been used with success in power graphs, which are abstract graphs where nodes are clusters of nodes of the initial graph and edges represent bicliques between two sets of nodes [80]. In fact, concepts are maximal bicliques, and we are interested in developing the power graph idea in the framework of concept analysis.
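The correspondence between concepts and maximal bicliques can be checked on a small bipartite graph. The sketch below is only an illustration of that correspondence under our own assumptions: the graph, the node names and the helper functions are hypothetical, and it does not implement the power graph algorithm of [80].

```python
# Hypothetical bipartite graph given as a set of (left node, right node) edges.
EDGES = {("p1", "x"), ("p1", "y"), ("p2", "x"), ("p2", "y"), ("p3", "y")}
LEFT = {u for u, _ in EDGES}
RIGHT = {v for _, v in EDGES}


def is_biclique(left, right):
    """True if every left node is connected to every right node."""
    return all((u, v) in EDGES for u in left for v in right)


def is_maximal_biclique(left, right):
    """True if (left, right) is a biclique that no single extra node can extend."""
    if not is_biclique(left, right):
        return False
    can_grow_left = any(is_biclique(left | {u}, right) for u in LEFT - left)
    can_grow_right = any(is_biclique(left, right | {v}) for v in RIGHT - right)
    return not (can_grow_left or can_grow_right)


def edges_saved(left, right):
    """Edges saved when the |left| * |right| edges of a biclique become one edge
    between two cluster nodes."""
    return len(left) * len(right) - 1
```

Reading left nodes as objects and right nodes as attributes, the pair ({p1, p2}, {x, y}) is a maximal biclique and is exactly a formal concept of the corresponding context, whereas ({p1}, {x, y}) is a biclique but not a concept, because p2 can still be added. Replacing the four edges of the first pair by a single edge between two clusters saves three edges, which is the compression effect mentioned above.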