Note on the organization of the report. For convenience, applications and scientific results are not presented in separate sections but instead follow the theoretical topics on which they are based.
The French word ``orpailleur'' denotes a person who searches for gold in rivers. In the present case, gold nuggets correspond to knowledge units, which may have two major origins: explicit knowledge given by domain experts, and implicit knowledge that must be extracted from data sources of different natures, e.g. raw data or textual documents. The main objective of the members of the Orpailleur team is to extract knowledge units from different data sources and to design structures for representing the extracted knowledge units. Knowledge-based systems may then be designed and used for problem-solving in a number of application domains such as agronomy, biology, chemistry, medicine, and the Web.
The research work of the Orpailleur team may be considered mainly as kddk, i.e. Knowledge Discovery guided by Domain Knowledge, involving knowledge extraction and knowledge representation. First, the data sources are prepared to be processed, then they are mined, and finally the extracted information units are interpreted to become knowledge units. These units are in turn embedded within a representation formalism to be used within a knowledge-based system. The mining processes are based on the classification operation, e.g. hidden Markov models, lattice-based classification, frequent itemset search, and association rule extraction. The mining process may be guided by a domain ontology, considered as a domain model and used for interpretation and reasoning, especially in the context of the Semantic Web.
The whole transformation process from raw data into knowledge units is based on the underlying idea of classification. Classification is a polymorphic process involved in a number of tasks within the data-to-knowledge transformation, e.g. mining operations, modeling of the domain for designing a domain ontology (or extending the ontology with extracted knowledge units), knowledge representation, and reasoning. Finally, the knowledge extraction process and the associated knowledge base can be used for problem-solving activities within the framework of the Semantic Web, e.g. Web mining, intelligent information retrieval, and content-based document mining.
Knowledge discovery in databases is a process for extracting knowledge units from large databases, units that can be interpreted and reused within knowledge-based systems.
Knowledge discovery in databases (kdd) consists in processing a huge volume of data in order to extract useful and reusable knowledge units from these data. An expert of the data domain, called hereafter the analyst, is in charge of guiding the extraction process on the basis of his objectives and his domain knowledge. The extraction process is based on data mining methods returning information units from the data. The analyst selects and interprets a subset of the units for building ``models'' that may be further interpreted as knowledge units with a certain plausibility.
The kdd process is performed with a kdd system based on four main components: the databases (or the set of data), a domain ontology (and an associated knowledge-based system), data mining modules (either symbolic or numerical), and interfaces for interactions with the system, e.g. editing and visualization. For handling huge volumes of data in a given domain, a kdd system may take advantage of domain knowledge, i.e. an ontology, and of the problem-solving capabilities of a knowledge-based system working in the domain of the data. In turn, closing the loop, the knowledge units extracted by the kdd system may be integrated within the ontology to be reused by the knowledge-based system for future problem-solving operations.
Lattice-based classification, frequent itemset search, and association rule extraction.
Symbolic methods for kdd mainly rely on lattice-based classification, frequent itemset search, and association rule extraction. Lattice-based classification is used for extracting from a database (or a set of raw data) a set of concepts organized within a hierarchy, i.e. a partial ordering. Lattice-based classification relies on the analysis of binary tables relating a set of individuals with a set of properties (or characteristics), where true stands for ``the individual i has the property p''. The lattice may be built according to the so-called Galois correspondence, classifying within a formal concept a set of individuals, i.e. the extension of the concept, sharing a common set of properties, i.e. the intension of the concept. In addition, lattice-based classification is the basic operation underlying so-called formal concept analysis.
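To make the Galois correspondence concrete, the following Python sketch enumerates the formal concepts of a tiny binary context (the individuals, properties, and binary table are invented for illustration): each concept is a pair (extent, intent) closed under the two derivation operators.

```python
from itertools import combinations

# A tiny binary context: individuals x properties (hypothetical data).
context = {
    "i1": {"a", "b"},
    "i2": {"a", "c"},
    "i3": {"a", "b", "c"},
}
properties = {"a", "b", "c"}

def extent(props):
    """Individuals having every property in `props`."""
    return {i for i, ps in context.items() if props <= ps}

def intent(individuals):
    """Properties shared by every individual in `individuals`."""
    shared = set(properties)
    for i in individuals:
        shared &= context[i]
    return shared if individuals else set(properties)

# Enumerate formal concepts: pairs (extent, intent) closed under the
# Galois correspondence, i.e. intent(extent(B)) == B.
concepts = set()
for r in range(len(properties) + 1):
    for props in combinations(sorted(properties), r):
        e = extent(set(props))
        concepts.add((frozenset(e), frozenset(intent(e))))

for e, i in sorted(concepts, key=lambda c: (-len(c[0]), sorted(c[1]))):
    print(sorted(e), sorted(i))
```

Ordering the concepts by extent inclusion yields the concept lattice itself; the sketch only lists the concepts.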
In parallel, the extraction of frequent itemsets consists in extracting from binary tables sets of properties occurring with a support or frequency, i.e. the number of individuals sharing the properties, greater than a given threshold. From the frequent itemsets, it is possible to generate association rules of the form A -> B, relating a subset of properties A with a subset of properties B, which can be interpreted as follows: the individuals including A also include B, with a certain support and a certain confidence. The number of rules that can be extracted is very large, and the sets of extracted rules must be pruned for interpretation (most of the time, the analyst is in charge of interpreting the results of the rule extraction process). Different measures, mainly based on probability theory, have been designed for pruning the sets of extracted rules (i.e. rule mining). The Orpailleur team is also interested in the mining of rare itemsets and rare rules, a research direction that is an originality of the team. Accordingly, the team is currently developing a platform for knowledge extraction with symbolic methods, called Coron, that includes a collection of data filtering methods and symbolic data mining algorithms. The Coron platform is used in a number of kdd applications that are described in the following.
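As a small illustration of supports and confidences (the transactions are invented, and this naive enumeration stands in for the levelwise algorithms actually implemented in Coron):

```python
from itertools import combinations

# Hypothetical transactions: each individual's set of properties.
transactions = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
    {"a", "b", "c"},
]
min_support = 3  # absolute support threshold
items = sorted(set().union(*transactions))

def support(itemset):
    """Number of transactions containing the itemset."""
    return sum(1 for t in transactions if itemset <= t)

# Frequent itemsets: all property sets with support >= min_support.
frequent = {
    frozenset(c): support(set(c))
    for r in range(1, len(items) + 1)
    for c in combinations(items, r)
    if support(set(c)) >= min_support
}

# Association rules A -> B with confidence = support(A ∪ B) / support(A).
rules = []
for itemset, supp in frequent.items():
    for r in range(1, len(itemset)):
        for a in combinations(sorted(itemset), r):
            A = frozenset(a)
            conf = supp / support(set(A))
            rules.append((set(A), set(itemset - A), supp, round(conf, 2)))

for A, B, s, c in rules:
    print(sorted(A), "->", sorted(B), "support:", s, "confidence:", c)
```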
Data integration and knowledge extraction in bioinformatics.
Biological datasets have grown tremendously in size and complexity in the past few years. Genome sequences, biological structures, expression arrays, and proteomics represent terabytes of data stored under variable formats in dispersed heterogeneous databases (db). More than 800 such db had been listed at the beginning of 2006. One of the major challenges in the post-genomic era consists in exploiting the vast amounts of biological data stored in those db. The extraction of knowledge from all these data is an increasingly challenging task which ultimately gives sense to the data production effort with respect to domains such as evolution and disease understanding, biotechnologies, systems biology, pharmacogenomics, etc. The knowledge discovery process in biological databases starts with two important steps: data selection from appropriate db, and data integration. In the biological domain, these tasks are hampered by at least two distinct problems: (i) identifying the relevant db, and (ii) managing the complexity and heterogeneity of biological data for their integration. Previous and present work within the Orpailleur group has been dealing with these first two aspects of the kdd process: selection of biological databases and heterogeneous biological data integration.
Heterogeneous biological data integration: customized warehouses populated by a user-defined workflow of data retrieval from public databases (ACGR project)
Integration of heterogeneous biological data is addressed from a pragmatic and user-oriented point of view. In many concrete situations, data have to be collected from various public sources in order to answer complex queries. In previous work we developed a generic solution for the automated collection of biological data given a user-defined workflow. The Xcollect software described in the preceding reports provides users with structured documents containing the data retrieved to answer their queries. The need to combine various workflows and to store the retrieved data for further exploitation has led us to propose the concept of a customized warehouse populated by user-defined workflows of data retrieval.
Given a specific need, workflows of data retrieval from public databases (mostly web sources) are designed experimentally. A data model is then built to integrate the retrieved data in a database. Populating the database involves converting the format of the retrieved data into an entry format for the database.
In practice, retrieval workflows are Xcollect XML scenarios. The retrieved data are compliant with the Xcollect XML session DTD (Document Type Definition). The data model is implemented as a MySQL relational database. The mapping between the XML elements of a session document corresponding to a given scenario and the tables and attributes of the relational database model is expressed as a simple correspondence table. A generic Python script takes this table as input and produces the XSL transformation file able to transform the XML session document into an SQL command. Execution of the resulting SQL command inserts the retrieved data into the database.
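The correspondence-table idea can be sketched as follows. The actual pipeline goes through an XSL transformation; this Python sketch produces the SQL directly, and all element, table, and column names are hypothetical, not the real Xcollect/ACGR schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical session document (not the actual Xcollect session DTD).
session_xml = """
<session>
  <gene><symbol>CDKL5</symbol><chromosome>X</chromosome></gene>
  <gene><symbol>OCRL</symbol><chromosome>X</chromosome></gene>
</session>
"""

# Correspondence table: XML child tag -> SQL column (invented names).
correspondence = {"symbol": "symbol", "chromosome": "chromosome"}
table = "gene"

def to_sql(xml_text):
    """Translate each <gene> element into an SQL INSERT statement."""
    root = ET.fromstring(xml_text)
    statements = []
    for elem in root.findall("gene"):
        cols, vals = [], []
        for tag, col in correspondence.items():
            child = elem.find(tag)
            if child is not None:
                cols.append(col)
                vals.append(f"'{child.text}'")
        statements.append(
            f"INSERT INTO {table} ({', '.join(cols)}) VALUES ({', '.join(vals)});"
        )
    return statements

for stmt in to_sql(session_xml):
    print(stmt)
```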
This approach has been applied to the retrieval of candidate genes for a rare orphan disease: Aicardi syndrome. The collected data concern human, mouse, and fly genes related to the phenotype observed in diseased patients (for instance retina abnormalities or defects in neuron migration). Data such as chromosome localization, Gene Ontology (GO) annotations, homologous inter-species relationships, and interactions with other genes are retrieved thanks to various Xcollect scenarios and stored in the ACGR (Aicardi Candidate Gene Retrieval) database. For example, users can retrieve from the NCBI GENE database all human genes annotated by a given GO term (for example ``neuron migration'') and store in the ACGR database various pieces of information associated with these genes. Another scenario will retrieve from the same or another database all the genes interacting in some way with these genes, together with associated information such as chromosomal localization. Ultimately, complex queries such as ``What are the genes located on chromosome X that interact with a gene annotated by the GO term 'neuron migration'?'' can be answered and give new insights into potential candidate genes.
The resulting ACGR database can be considered as a customized data warehouse integrating data from various public sources. Tracking of data origin is included in each Xcollect scenario and has been taken into account in the data model. Refreshing is performed by re-executing the Xcollect scenarios and producing a new release of the database. Data analysis and data mining methods can be plugged into the system according to user needs. Present work concerns the prioritization of genes with respect to their probability of being the gene responsible for the disease, since the best predictions must now be tested experimentally. Of course, application to another disease is possible.
Organizing and Querying a metadata repository with Concept Lattices (The BioRegistry project).
The BioRegistry project aims at gathering and organizing knowledge about biological databases in order to facilitate and optimize the selection of relevant databases with respect to a user query. In the ``BioRegistry'' repository, the various metadata attached to biological databases are structured according to a model described in the previous report and, whenever possible, expressed in terms of domain ontologies. Our model is compliant with the dcmi (``Dublin Core Metadata Initiative'') recommendations and uses two main domain ontologies to valuate the metadata fields describing the content of the databases. The subjects field, for instance, contains terms extracted from the biomedical thesaurus MeSH, maintained by the nlm. This thesaurus was chosen because it is widely used to index scientific literature, presents a broad coverage of many biological domains, and is regularly updated to take into account changes in the topics addressed by scientific papers. Concerning the organisms field, the ncbi taxonomy of living organisms has been chosen since this taxonomy is also used to annotate biological sequences. In a previous stage of the work, the inclusion of several databases in the BioRegistry repository was performed manually. To accelerate the process, an automatic procedure was designed to import the dbcat metadata (see previous report). Since the dbcat catalog is no longer maintained, it was then decided to exploit the Nucleic Acids Research (nar) 2006 catalog of molecular biology databases maintained at ncbi. Several text-mining programs were set up to translate the nar information into controlled vocabulary terms. Terms found in the database short descriptions were cross-matched with the list of organism names available from the ncbi taxonomy. The retrieved terms were entered in the organisms subsection of the BioRegistry repository.
In addition, we built a correspondence table between the nar categories and sub-categories and the MeSH terms to be included in the subjects field of the BioRegistry. Alert and survey mechanisms have to be designed to detect any change or new release in existing databases, as well as new databases appearing on the Web.
Formal Concept Analysis (fca) was set up for organizing the BioRegistry and visualizing the sharing of metadata across the db. A formal context representing the relation between bioinformatics data sources and their metadata is provided, and the corresponding concept lattice is built (see previous report). The BR-explorer algorithm addresses the problem of retrieving the relevant data sources for a given query. It starts by building the query concept representing the query, and then inserts the query concept into the concept lattice. The BR-explorer algorithm then fills a list of candidate concepts according to a relevance criterion, first set as follows: a relevant concept is a concept that shares at least one property with the query concept. The BR-explorer algorithm thus explores the ascendants of the query concept in the concept lattice until the top concept is reached. Finally, the BR-explorer algorithm returns the set of relevant data sources ranked according to their distance to the considered query. The distance measure was first set as follows: the number of edges in the lattice between the query concept and the relevant concept. Various refinements of both the relevance criterion and the distance measure are currently under investigation to take advantage of the semantics of queries and metadata. This work should be reusable for resource discovery and composition in the framework of the Semantic Web.
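The relevance criterion and distance-based ranking can be illustrated with a much-simplified stand-in: the source names and metadata below are invented, and the count of missing query properties replaces the actual edge distance in the concept lattice used by BR-explorer.

```python
# Hypothetical data sources described by metadata property sets.
sources = {
    "db1": {"human", "gene", "expression"},
    "db2": {"mouse", "gene"},
    "db3": {"protein", "structure"},
}

def br_rank(query):
    """Keep sources sharing at least one property with the query (the
    relevance criterion), ranked by how far their metadata are from
    covering the query (a crude proxy for lattice edge distance)."""
    relevant = {s for s, props in sources.items() if props & query}
    return sorted(relevant, key=lambda s: len(query - sources[s]))

print(br_rank({"human", "gene"}))
```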
Knowledge extraction in pharmacogenomics.
Another ongoing research work consists in applying the whole kdd process to the pharmacogenomics context, i.e. from data selection and filtering to knowledge extraction guided by domain knowledge. More precisely, the goal is to discover knowledge about interactions between clinical, genetic, and therapeutic data. For example, a given genotype (a set of selected gene versions) may explain adverse clinical reactions, e.g. hyperthermia or toxic reactions, to a given therapeutic treatment. Indeed, more and more pharmaceutical firms are willing to include the exploration of particular genomic variants in their drug clinical trials in order to detect relevant relationships between the three vertices of the pharmacogenomics triangle, i.e. (i) drug (properties and administration), (ii) phenotype (biological and clinical data), and (iii) genotype (genome variations).
We first focused on the genotype vertex by building SNP-Ontology, which formalizes the available knowledge about genomic variations. This allowed us to reconcile the various heterogeneous representations of both private data and data coming from public databases (dbSNP, UCSC, HapMap, etc.). A UML class diagram was first designed as an intuitive description of the relevant knowledge and then transformed into an OWL formal model. A dedicated wrapper was developed for gathering data from several sources. These data (A-box) successfully instantiated the concepts of SNP-Ontology (T-box), and consistency checking was successfully conducted as well. This constituted a first validation of SNP-Ontology.
SNP-Ontology goes far beyond a simple controlled vocabulary or taxonomy, which is more or less the state of most bio-ontologies today. We demonstrate that semantic relationships other than the classical ``is-a'' relationship have to be used for representing knowledge about genomic variations and for enabling reasoning over it. SNP-Ontology differs from the PharmGKB ontology, which is expressed as an XML schema. The PharmGKB ontology has a wide scope and covers all three vertices of the pharmacogenomics triangle. However, concerning the representation of genomic variations, the PharmGKB schema is less open than SNP-Ontology, which is much more focused and complete on that aspect of pharmacogenomics. Indeed, SNP-Ontology makes it possible for all representations of a given variant to co-exist in the SNP knowledge base and be declared equivalent.
In a second phase, we have been working on the construction of a modular and formal representation of domain knowledge in pharmacogenomics. The resulting ontology is called SO-Pharm, for Suggested Ontology for Pharmacogenomics. We adapted some well-known methodologies for ontology construction to the case of pharmacogenomics, based on three steps: (i) specification, including the definition of the ontology domain and scope; (ii) conceptualization, involving the definition of lists of terms and of concepts; and (iii) formalization and implementation, i.e. the translation of the conceptual model into a knowledge representation formalism (OWL in our case).
The domain and scope of SO-Pharm were primarily defined as follows. The domain considered should cover pharmacogenomics clinical trials. The ontology has to precisely represent the groups of individuals involved in trials, their genotype, their treatment, their observed phenotype, and the potential pharmacogenomics relations discovered between these concepts. The scope of SO-Pharm is to guide kdd in pharmacogenomics. Term lists were established either by domain experts or by extraction from existing data or knowledge resources in the domain. Relevant reusable resources were selected at this stage. OBO (Open Biomedical Ontologies) ontologies ( http://obo.sourceforge.net) were preferred, and among them those involved in the OBO-Foundry project. As for SNP-Ontology, a UML class diagram was used for representing the conceptual model of SO-Pharm. Embedding and extension strategies were used to anchor existing ontologies to SO-Pharm concepts. Several highly specialized vocabularies such as the Disease Ontology were embedded, whereas formal ontologies, such as SNP-Ontology, extend definitions of more specific concepts coming from other ontologies. The consistency and the class hierarchy of SO-Pharm, including the reused ontologies, have been validated with Racer 1.9 at each stage of the implementation thanks to standard reasoning mechanisms. As a preliminary validation of the ontology, several examples of published pharmacogenomics knowledge have been expressed with the SO-Pharm concepts. The assertion of individuals and related information (clinical trial, treatment) led us to enrich the SO-Pharm concepts. The ontology construction method has been published.
In summary, the construction of SO-Pharm favors the reuse of concept definitions existing in other ontologies. This reuse mechanism will become more and more important as more and more autonomous ontologies are being produced in the biomedical domain, e.g. for representing phenotypes. SO-Pharm is available (in OWL format) at http://www.loria.fr/~coulet/ontology/sopharm/version1.2/sopharm.owl. We plan to submit SO-Pharm to the OBO portal to gain visibility and facilitate further improvements.
SO-Pharm is a crucial component for a future knowledge-based application dedicated to pharmacogenomics knowledge discovery. A complete validation now has to be conducted, aimed at evaluating how SO-Pharm is able to guide the kdd process. A significant issue will be to develop appropriate wrappers to achieve heterogeneous data integration, as was done for SNP-Ontology. Then, mining methods will have to be articulated with the ontology in order to extract new relevant knowledge units that will enrich the ontology.
Association rule extraction in a biological database.
Relying on the kdd principles, a research work is currently under investigation in the domain of biology, searching for associations between biological parameters involving cardiovascular (cv) risk factors in a given population of individuals. The studies carried out here rely on a real-world individual database, the Stanislas cohort. It is a ten-year study involving supposedly healthy French families, examined every five years. At the beginning of the study, in 1993, 1006 families (composed of two parents and at least two children) were recruited for medical examination at the ``Centre de Médecine Préventive de Vandœuvre-lès-Nancy (France)''. The families were examined again around 1998–1999 and 2003–2004.
The cohort is explored in search of genotypes and intermediate phenotypes of cardiovascular diseases (cvd), which are multifactorial pathologies resulting from gene-gene and gene-environment interactions. There is a need for extracting implicit and potential new risk factors for cvd within an ever-increasing volume of data (mainly due to the development of technologies such as multiplex pcr or microarrays). In the Stanislas cohort, the information covers environmental, clinical, biological, and genetic data. The kdd experiments have given results in accordance with the domain knowledge, as well as other results opening new research insights for further investigations. With regard to the statistical methods usually used in this context, the general idea of the present research work is to mine the cohort for extracting itemsets that may in turn be considered as hypotheses to be validated by statistical tests.
Experiments for extracting potentially valuable information on the metabolic syndrome in the Stanislas cohort have been carried out by extracting frequent itemsets and association rules. A methodology for mining cohorts has been proposed, applicable with Coron and useful for various studies (not restricted to biology). The methodology is based on frequent itemset and association rule extraction, and brings interesting results from a biological viewpoint, as it enables the domain expert to generate new research hypotheses validated by statistical tests or new lab experiments. However, since the metabolic syndrome is a relatively rare condition in the Stanislas cohort, which is composed of supposedly healthy individuals, the mining work within the cohort has been oriented towards the extraction of rare rather than frequent itemsets. In this way, different algorithms have been proposed for extracting rare itemsets and rules. Sandy Maumus is in charge of the mining work on the Stanislas cohort. She defended her PhD thesis on November 15, 2005, and won the PhD Thesis Award of the University Henri Poincaré (Faculty of Pharmacy). The work on rare itemset and rule extraction will be continued, and different experiments are currently under investigation for evaluating the kind of information included in rare itemsets and rules, in comparison with frequent itemsets, in the Stanislas cohort.
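The distinction between frequent and rare itemsets can be sketched as follows (with invented transactions). One common notion is that of a minimal rare itemset: an itemset that is rare while all of its proper subsets are frequent.

```python
from itertools import combinations

# Hypothetical transactions over three properties.
transactions = [
    {"a", "b"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
]
min_support = 2
items = sorted(set().union(*transactions))

def support(itemset):
    """Number of transactions containing the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def minimal_rare_itemsets():
    """Itemsets that are rare (support < min_support) while every
    proper subset is frequent."""
    rare = []
    for r in range(1, len(items) + 1):
        for c in combinations(items, r):
            s = frozenset(c)
            if support(s) < min_support and all(
                support(s - {x}) >= min_support for x in s
            ):
                rare.append(s)
    return rare

print(sorted(sorted(s) for s in minimal_rare_itemsets()))
```

Here {a, c} and {b, c} are minimal rare, while {a, b, c} is rare but not minimal since its subset {a, c} is already rare.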
Extracting knowledge in medico-economical databases.
Chronic diseases imply recurrent hospitalizations. In order to optimize healthcare resources and improve cooperation between hospitals treating chronic patients, it is very important to understand the factors that may determine the so-called pathway of a chronic patient. The patient pathway may be seen as a time-ordered sequence of events affecting the health of the patient. An event describes a set of information related to a hospitalization, such as diagnoses, medical or surgical procedures, hospitalization locations, durations, and costs. In France, the so-called pmsi (for ``Programme de Médicalisation des Systèmes d'Informations'') is the name of the information system collecting this information for a hospital.
At present, we are carrying out a research work on the data collected within the pmsi with the following objectives:
The discovery of elements that may characterize the patient pathway.
The classification of patients with respect to their pathway.
The visualization of the patient pathway.
The first objective relies on the extraction of frequent patterns, sequential and non-sequential, from the pmsi data associated with the Lorraine Region. The database includes information on more than 800,000 hospitalizations per year. The two other objectives consist, based on the extracted patterns, in building and visualizing a patient pathway classification using concept lattices (or Galois lattices). More generally, this research work aims at investigating the relations that may exist between frequent itemsets, sequential itemsets, and knowledge representation and visualization with concept lattices. Various successful experiments have been carried out with data on cancer patients in the Lorraine Region.
Knowledge discovery in chemical reaction databases.
Chemical reactions are the main elements on which synthesis in organic chemistry relies, and this is why chemical reaction databases are of first importance. From a problem-solving point of view, synthesis in organic chemistry must be considered at several levels of abstraction: mainly a strategic level, where general synthesis methods are involved, and a tactical level, where actual chemical reactions are applied. The research work carried out in the present case is aimed at discovering general synthesis methods from chemical reaction databases in order to design generic and reusable synthesis plans.
A first research work based on levelwise frequent itemset search and association rule extraction, and on chemical knowledge, has been carried out and has given substantial and promising results. At present, this first research work is being extended by adapting a graph-mining process for extracting knowledge from chemical reaction databases, this time directly from the molecular structures and reactions themselves (both being represented as graphs in reaction databases). This research work is currently under investigation and should bring substantial results.
For designing a complete knowledge discovery system, we have developed stochastic models based on high-order hidden Markov models. These models are capable of mapping sequences of data into a Markov chain in which the transitions between the states depend on the n previous states, according to the order of the model. The following experiments are based on second-order hidden Markov models (hmm2), i.e. the transitions between the states depend on the two preceding states, for discovering spatial and temporal dependencies in databases. The main advantage of hmm2 is the existence of a non-supervised training algorithm, the em algorithm, that allows the estimation of the parameters of the Markov model from a corpus of observations and an initial model. The resulting Markov model is able to segment each sequence of data into stationary and transient parts.
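The second-order dependency can be sketched as follows. The state sequence below is invented, and real hmm2 training uses the em algorithm on hidden states; this sketch simply estimates second-order transition probabilities from an observed state sequence.

```python
from collections import Counter

# Invented state sequence; in an hmm2 the transition to state s_t
# depends on the two preceding states (s_{t-2}, s_{t-1}).
states = list("AABBABBBAABB")

# Count second-order transitions (s_{t-2}, s_{t-1}) -> s_t.
trans_counts = Counter(zip(states, states[1:], states[2:]))

def p(prev2, prev1, nxt):
    """Estimated P(s_t = nxt | s_{t-2} = prev2, s_{t-1} = prev1)."""
    denom = sum(
        v for (a, b, _), v in trans_counts.items() if (a, b) == (prev2, prev1)
    )
    return trans_counts[(prev2, prev1, nxt)] / denom if denom else 0.0

# Note: a second-order chain over states is equivalent to a first-order
# chain over composite states (s_{t-1}, s_t), a standard reduction.
print(p("A", "A", "B"), p("B", "B", "A"))
```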
We focused our effort on two main points: (1) the elaboration of a process for mining spatial and temporal dependencies in order to extract knowledge units (for knowledge acquisition). This process involves an unsupervised classification of data. (2) The specification of adapted visualization tools giving a synthetic view of the classification results to the experts who have to interpret the classes and/or specify new experiments.
Several applications have been carried out during this last year, and two anr projects in which the Orpailleur team is involved have been selected: the add-copt project for ``Agriculture et Développement Durable'', and the ecoger project (for ``Écologie pour la Gestion des Écosystèmes et de leurs Ressources''). In parallel, the research project called foudanga, within the aci impbio (for ``Informatique, Mathématiques, Physique en Biologie Moléculaire''), is running.
All these research works have taken advantage of the CarottAge system, a generic data-mining system for spatio-temporal data based on hmm2 (the CarottAge system is free software under the gpl license).
Two applications in agronomy.
The anr project called add-copt, for ``Agriculture et Développement Durable'', aims at understanding how agriculture can evolve to better respect the environment. An agriculture more respectful of the environment will modify the organization of the territory at several levels, i.e. spatial, economic, and organizational. In this project, we work in collaboration with agronomists, but also with geographers, since questions arise about the representation of territories; with economists, since bio-agriculture (organic agriculture) must remain economically viable; and with psychologists, who have to formalize how the different actors may share their knowledge for achieving this common objective of a new agriculture. The goal of the add-copt project is to specify an observatory of agricultural practices for supporting the different actors in the transformation process towards this new agriculture: allowing these actors to confront and share their knowledge, to apprehend and analyze the observations made on the territory, and to assess the impacts of the changes in progress.
A second research project, called ecoger, also lies in the context of the mining of environmental data. It groups together various competences such as agronomy, zoology, and data mining. We are currently using the CarottAge system to process temporal and spatial data at the same time, allowing the agronomists to analyze data collected during several years on land use at a set of points over the French territory. Before the ecoger project, the CarottAge system had already been used for understanding the risks faced by the bustard, as the disappearance of meadows clearly impacts its migration. In the ecoger project, the CarottAge system has to be used within a broader framework: environmental risks. The software has to be adapted to take into account the spatial organization of crop successions. The challenge is twofold: whereas the software works with temporal data, it has to integrate spatial dimensions; and whereas it has already been tested on relatively homogeneous data, it has to be able to integrate data at different scales, e.g. satellite images, surveys of farmers, etc.
An application in bioinformatics.
In the framework of the so-called Contrat de Plan État-Région, we are carrying out a long-term data mining project with the laboratory of genetics of the ``Université Henri Poincaré Nancy 1''. The biological material is the soil-dwelling, filamentous bacteria belonging to the genus Streptomyces, the greatest source of antibiotics amongst microorganisms. In particular, the 8.7 million bases of the Streptomyces coelicolor chromosome have been entirely sequenced and annotated. One objective of this research work is to detect genome heterogeneity islands and inter-sequence dependencies, using hidden Markov models without prior knowledge.
Understanding horizontal transfer. Markovian models with ``species-specific homogeneity'' have been constructed and coupled with transform filters. Their behavior delineates regions with different statistical properties, allowing the user to separate ``foreign DNA regions'' from the DNA regions of the studied species itself. In Streptomyces coelicolor, the regions with such a statistical consensus have been detected and correlated with potential events of horizontal transfer.
The detection of promoters. A data mining method based on second-order hidden Markov models (hmm2), able to process whole genome sequences without prior hypotheses, has been applied to the actinomycete genomes of Streptomyces and Mycobacterium species. The stochastic modeling of the genome with hmm2 allows the extraction and classification of short segments (5 to 12 bp) having a significantly different structure and composition, without any prior knowledge. These segments appear to be parts of binding sites for transcription factors.
In order to confirm the applicability of this new method to the detection of transcriptional signals, the models have been applied to an experimentally determined set of co-regulated genes (30 genes) dependent on SigR, a sigma factor of Streptomyces coelicolor involved in the oxidative stress response. A steady homogeneous second-order hidden state chain describes discrete heterogeneity, visualized as peaks in the a posteriori observation of the hidden states.
The duration capabilities of the hmm2 show very good performance in the modeling of short segments such as tfbs (transcription factor binding sites) and rbs (ribosome binding sites). On different genomes of the actinomycete family, the hmm2 reveal dna heterogeneities, which are combined to predict known or potential tfbs or rbs. Based on these models, the present data mining method has proved to be efficient for the detection of dna motifs involved in both transcriptional (i.e. sigma factor binding sites) and translational (rbs) regulation.
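The state-pairing trick behind second-order models can be sketched as follows: an hmm2 with transitions P(s_t | s_{t-1}, s_{t-2}) is equivalent to a first-order model over ordered pairs of states, on which the usual forward recursion applies. The states, symbols, and probabilities below are invented for illustration and have nothing to do with the actual genomic models.

```python
import itertools

# Illustrative sketch (not the team's actual models): a second-order HMM with
# transitions P(s_t | s_{t-1}, s_{t-2}) is rewritten as a first-order HMM
# whose states are ordered pairs (s_{t-1}, s_t). All probabilities are made up.
states = ["H", "L"]                      # e.g. GC-rich vs GC-poor regions
trans2 = {                               # trans2[(prev2, prev1)][next]
    ("H", "H"): {"H": 0.9, "L": 0.1},
    ("H", "L"): {"H": 0.4, "L": 0.6},
    ("L", "H"): {"H": 0.7, "L": 0.3},
    ("L", "L"): {"H": 0.2, "L": 0.8},
}
emit = {"H": {"a": 0.1, "c": 0.4, "g": 0.4, "t": 0.1},
        "L": {"a": 0.4, "c": 0.1, "g": 0.1, "t": 0.4}}

pair_states = list(itertools.product(states, repeat=2))

def pair_trans(src, dst):
    # (p, q) -> (q2, r) is possible only when q == q2
    (p, q), (q2, r) = src, dst
    return trans2[(p, q)][r] if q == q2 else 0.0

def forward_likelihood(obs, init):
    """Forward algorithm on the pair-state model; init gives the probability
    of each initial pair-state (covering the first two symbols)."""
    alpha = {pq: init[pq] * emit[pq[0]][obs[0]] * emit[pq[1]][obs[1]]
             for pq in pair_states}
    for o in obs[2:]:
        alpha = {dst: emit[dst[1]][o] *
                 sum(alpha[src] * pair_trans(src, dst) for src in pair_states)
                 for dst in pair_states}
    return sum(alpha.values())

uniform = {pq: 1.0 / len(pair_states) for pq in pair_states}
print(forward_likelihood("ccgg", uniform))
```

The same pair-state construction underlies posterior decoding, which is how peaks in the a posteriori distribution of hidden states can be observed along a sequence.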
The objective of a text mining process is to extract new and useful knowledge units from a large set of texts. The text mining process relies on the principles of kdd, although it shows some specific characteristics due to the fact that texts are written in natural language. The information in a text is expressed in an informal way, following linguistic rules, which makes the mining process more complex. To avoid information dispersion, a text mining process has to take into account paraphrases, ambiguities, specialized vocabulary, and terminology. This is why the first steps of a text mining process are usually dedicated to linguistic knowledge acquisition: lexicon, terminology, markers of semantic relations, discourse markers, specific syntactic or semantic structures...
To carry out studies on text mining, the Orpailleur team is interested in linguistic resources, working on real-world texts in application domains such as biology and astronomy, using robust information extraction tools. Language is thus considered as a way of accessing information, and not as an object to be studied for its own sake.
This year was mainly dedicated to building ontologies from texts. However, we kept a ``background activity'' on mining the web and started some investigations on information extraction tools.
Building ontologies from texts.
In astronomy, people are now spending much more time studying texts, for acquiring and synthesizing already known information, than making new ``physical'' observations. Moreover, part of the categorization process assigning types (galaxy, star...) to celestial objects is done manually. These considerations lead to the following questions: could we (partly) automate the collection of celestial object properties from texts, and could we structure this information in an ontology providing more exhaustive knowledge and description of celestial objects?
This year, we developed a prototype using Formal Concept Analysis to build ontologies from texts. This prototype relies on the idea that verbs may be used to characterize objects. For example, the sentence ``We observed stars'' enables us to say that stars are ``observable''. We build a binary table (Objects × Verbs) and then the corresponding Galois lattice. Objects are thus structured into classes according to the properties they are associated with in the texts. A transformation function can convert the lattice into a hierarchy where objects of the domain are leaves. This approach has been tested on 72 scientific abstracts, classifying 79 objects and 14 properties, and is currently being evaluated by astronomers. The resulting lattice is composed of 16 formal concepts.
The main limitation of this approach is that properties are Boolean attributes, so n-ary relations cannot be modeled. Our current work is based on Relational Analysis, where concepts are related to other concepts.
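The lattice-building step can be illustrated with a minimal Formal Concept Analysis sketch on an invented Objects × Verbs table (not the astronomy corpus): each formal concept pairs a set of objects with the set of verb-derived properties they share.

```python
from itertools import chain, combinations

# A minimal FCA sketch on an invented Objects x Verbs context (not the
# astronomy corpus): each formal concept pairs a set of objects with the
# verb-derived properties they all share.
context = {
    "star":   {"observable", "classifiable"},
    "galaxy": {"observable", "classifiable", "measurable"},
    "quasar": {"observable", "measurable"},
}

def intent(objs):
    """Properties shared by all given objects (all properties if none)."""
    sets = [context[o] for o in objs]
    return set.intersection(*sets) if sets else {a for s in context.values() for a in s}

def extent(attrs):
    """Objects having all the given properties."""
    return {o for o, a in context.items() if attrs <= a}

def concepts():
    """All formal concepts, obtained by closing every subset of objects;
    brute force, but fine for small contexts like this one."""
    objs = list(context)
    seen, out = set(), []
    for subset in chain.from_iterable(combinations(objs, k) for k in range(len(objs) + 1)):
        i = intent(set(subset))
        e = frozenset(extent(i))
        if e not in seen:
            seen.add(e)
            out.append((set(e), i))
    return out

for e, i in sorted(concepts(), key=lambda c: len(c[0])):
    print(sorted(e), "->", sorted(i))
```

On this toy context the lattice has four concepts; the hierarchy obtained by ordering extents by inclusion is the structure that the transformation function flattens into a class hierarchy with domain objects as leaves.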
Knowledge extraction from Web pages.
This research is concerned with the design of a system for extracting knowledge from Web pages. Knowledge is encoded as ``semantic annotations'' for manipulating documents by their content.
Most current works consider the annotation process as an information extraction task: they rely on patterns which aim at identifying concepts of the ontology in the documents. We propose a new approach which relies on classification. The annotation process then integrates both the syntactic structure of a Web page and semantic constraints coming from the ontology. The ontology is implemented within the Web Ontology Language (owl), for which reasoning mechanisms such as classification and subsumption are available.
More precisely, the semantic annotation of an element in a Web page relies on two main operations: (i) identification of the syntactic structure of a specific element in the Web page using the DOM structure (a representation of the page as trees and subtrees), (ii) identification in the ontology of the most specific concept subsuming the extracted element, which will be used for building the annotation.
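The two-step process above can be sketched as follows, with a toy ontology and an invented page fragment; the concept names, cue words, and matching strategy are all hypothetical simplifications of the actual system.

```python
from html.parser import HTMLParser

# Illustrative sketch of the two-step annotation described above; the tiny
# ontology, cue words, and HTML fragment are all invented for the example.
ONTOLOGY = {"Thing": None, "Person": "Thing", "Researcher": "Person"}
CUES = {"Person": {"dr", "prof"}, "Researcher": {"phd", "researcher"}}

class TextCollector(HTMLParser):
    """Step (i): walk the DOM and collect the text content of <td> elements."""
    def __init__(self):
        super().__init__()
        self.in_td, self.cells = False, []
    def handle_starttag(self, tag, attrs):
        self.in_td = tag == "td"
    def handle_endtag(self, tag):
        self.in_td = False
    def handle_data(self, data):
        if self.in_td and data.strip():
            self.cells.append(data.strip())

def depth(c):
    d = 0
    while ONTOLOGY[c] is not None:
        c, d = ONTOLOGY[c], d + 1
    return d

def annotate(text):
    """Step (ii): the most specific concept whose lexical cues match the text."""
    words = set(text.lower().replace(".", "").replace(",", "").split())
    matches = [c for c, cues in CUES.items() if cues & words]
    return max(matches, key=depth) if matches else "Thing"

parser = TextCollector()
parser.feed("<table><tr><td>Dr. Ada Lovelace, researcher</td></tr></table>")
print([(cell, annotate(cell)) for cell in parser.cells])
```

The real system replaces the keyword matching of step (ii) by owl subsumption reasoning, but the overall flow (structural extraction, then search for the most specific subsuming concept) is the same.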
The global context of the present research work is the study of research themes within the European Research Community. The objective is to use the information provided by research teams on their websites to generate knowledge about the European Research Community, for technological watch, analysis of research themes, or detection of new research directions. Preliminary results have been described in an article presented at the Web Intelligence conference.
Knowledge representation is a process for representing knowledge within a knowledge representation formalism, giving knowledge units a syntax and a semantics. The semantic Web is a framework for building knowledge-based systems that manipulate documents on the Web by their contents, i.e. taking into account the semantics of the elements included in the documents.
A knowledge system relies on a knowledge base and a reasoning module for problem solving and knowledge management in a given domain. Knowledge units are represented within a knowledge representation formalism where they have a syntax and a semantics. Inferences can be drawn from already known knowledge units for deriving new units that are useful for solving the current problem. Moreover, the units extracted from data by data mining procedures have to be represented within a knowledge representation formalism to be taken into account in the framework of a knowledge system.
In the Orpailleur team, two kinds of formalisms are particularly studied, namely description logics (dl) and object-based knowledge representation (obkr) formalisms. Knowledge units are represented within concepts (also called classes), with attributes (properties of concepts, or relations, also called roles in dl), and individuals. The hierarchical organization of concepts relies on a subsumption relation, which is a partial ordering. These formalisms provide inference services such as subsumption, and concept and individual classification. Concept classification is used to insert a concept at the right location in the concept hierarchy (searching for its most specific subsumers and its most general subsumees). Individual classification is used for recognizing the concepts an individual may be an instance of. In both cases, subsumption and classification are the main operations: this is why these systems are here called ``classification-based systems''.
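Concept classification as described above can be sketched with a toy hierarchy in which each concept is described by a set of attributes and subsumption is attribute-set inclusion; the hierarchy and the attributes below are invented for the example.

```python
# A minimal sketch of concept classification: here a concept is described by a
# set of attributes, and C subsumes D when attrs(C) is included in attrs(D).
# The hierarchy and the attributes are invented for the example.
hierarchy = {
    "Thing":   set(),
    "Vehicle": {"moves"},
    "Car":     {"moves", "wheels"},
    "Bicycle": {"moves", "wheels", "pedals"},
}

def subsumes(c, d):
    """C subsumes D when every attribute of C is also an attribute of D."""
    return hierarchy[c] <= hierarchy[d]

def classify(new_attrs):
    """Most specific subsumers and most general subsumees of a new concept,
    i.e. the place where it should be inserted in the hierarchy."""
    subsumers = [c for c, a in hierarchy.items() if a <= new_attrs]
    subsumees = [c for c, a in hierarchy.items() if new_attrs <= a and a != new_attrs]
    most_specific = [c for c in subsumers
                     if not any(hierarchy[c] < hierarchy[d] for d in subsumers)]
    most_general = [c for c in subsumees
                    if not any(hierarchy[d] < hierarchy[c] for d in subsumees)]
    return most_specific, most_general

# Classifying a new concept described by {moves, engine}:
print(classify({"moves", "engine"}))
```

In a dl system the subsumption test is a logical inference rather than set inclusion, but the classification procedure (locating a concept between its most specific subsumers and most general subsumees) has exactly this shape.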
Classification-based reasoning may be extended into case-based reasoning (cbr), which relies on three main operations: retrieval, adaptation, and memorization. A source case (srce, Sol(srce)) lies in a case base, and can be seen as a problem statement srce together with its solution Sol(srce). Given a new target problem, say tgt, retrieval consists in searching for a memorized case whose problem statement srce is similar to the target problem tgt. When such a srce exists, its solution Sol(srce) is adapted to fulfill the constraints attached to tgt. When there is enough interest, the new pair (tgt, Sol(tgt)) can be memorized as a new case for further problem solving. In the context of a concept hierarchy, retrieval and adaptation may both be based on classification. Moreover, a number of studies within the Orpailleur team have been carried out on cbr, especially on ``adaptation-guided retrieval'', which consists in searching for a source case whose solution will be adaptable to the target problem, giving a kind of guarantee regarding the adaptability of the retrieved solution.
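The retrieval-adaptation-memorization loop can be sketched on an invented toy domain (this is not the team's systems): problems are feature dictionaries, similarity counts shared features, and adaptation patches the retrieved solution for the features that differ.

```python
# An illustrative cbr loop on an invented toy domain: problems are feature
# dictionaries, similarity counts shared features, and adaptation patches the
# retrieved solution for the features that differ.
case_base = [
    ({"size": "small", "color": "red"},  "solution-A"),
    ({"size": "large", "color": "blue"}, "solution-B"),
]

def similarity(p1, p2):
    keys = set(p1) | set(p2)
    return sum(p1.get(k) == p2.get(k) for k in keys) / len(keys)

def retrieve(tgt):
    """Retrieval: the memorized case whose problem is most similar to tgt."""
    return max(case_base, key=lambda case: similarity(case[0], tgt))

def adapt(srce, sol, tgt):
    """Adaptation: flag the features of tgt that the source did not cover."""
    diffs = sorted(k for k in tgt if srce.get(k) != tgt.get(k))
    return sol if not diffs else sol + " adapted for " + ", ".join(diffs)

def solve(tgt):
    srce, sol = retrieve(tgt)
    new_sol = adapt(srce, sol, tgt)
    if new_sol != sol:                    # memorize only genuinely new cases
        case_base.append((tgt, new_sol))
    return new_sol

print(solve({"size": "small", "color": "green"}))
```

Adaptation-guided retrieval would change `retrieve` so that the ranking already takes adaptability into account, instead of pure problem similarity.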
In parallel with knowledge representation (and knowledge extraction), knowledge management is oriented toward the management of what could be called the ``cycle'' of knowledge, including acquisition, memorization, retrieval, maintenance, dissemination (or exchange) of knowledge. There is also a need for coupling knowledge with data, with respect to representation and management. This means in particular that, besides knowledge extraction from databases, there are some other needs such as knowledge-based information retrieval, content-based manipulation of documents, and knowledge mining. These new directions of investigation are particularly important in the framework of the semantic Web.
Today people try to take advantage of the Web by searching for information (navigation, exploration), and by querying documents using search engines (information retrieval). Then they try to analyze the obtained results, a task that may be very difficult and tedious. Tomorrow, the Web will be ``semantic'' in the sense that people will search for information with the help of machines, which will be in charge of posing questions, searching for answers, and classifying and interpreting the answers. The Web will become a space for the exchange of information between machines, allowing an ``intelligent access'' to and ``management'' of information. However, a machine may be able to read, understand, and manipulate information on the Web if and only if the knowledge necessary for achieving those tasks is available. This is why ontologies are of main importance with respect to the task of setting up a semantic Web. Thus, there is a need for representation languages for annotating documents, i.e. describing the content of documents and giving a semantics to this content. Knowledge representation languages are good candidates for achieving this task: they have a syntax with an associated semantics, and they can be used for retrieving information, answering queries, and reasoning.
The semantic Web has gained great interest in the research work of the Orpailleur team. Indeed, it constitutes a good platform for experimenting with a number of ideas on knowledge representation, reasoning, knowledge management, and knowledge discovery (especially text mining). Investigations mainly bear on the content-based manipulation of textual documents using annotations, ontologies, and a knowledge representation language. The idea is to build an xml-based ``bridge'' between documents and the knowledge units of the document domain, lying in a domain ontology. The annotations attached to documents, and the queries, are built with the help of the concepts of the domain ontology. Then, the manipulation of annotations, e.g. information retrieval, query answering, reasoning on the content of documents, is left to the reasoning module associated with the knowledge representation formalism.
The objective of the Kasimir research project is decision support and knowledge management for the treatment of cancer. This multidisciplinary research project involves researchers in computer science (Orpailleur), in ergonomics (Laboratoire d'ergonomie du cnam, Paris), experts in oncology (Centre Alexis Vautrin or cav, Vandœuvre-lès-Nancy), Oncolor (a healthcare network in Lorraine involved in oncology), and Hermès (an association for the sharing of resources in informatics and medicine).
For a cancer localization, e.g. the breast, the treatment is based on a protocol similar to a medical guideline. This protocol is built according to evidence-based medicine principles. For most of the cases (about 70%), a straightforward application of the protocol is sufficient, and provides a solution, i.e. a treatment, that can be directly reused.
A case among the 30% remaining cases is ``out of the protocol'', meaning that either the protocol does not provide a treatment for this case, or the proposed solution raises difficulties, e.g. contraindication, treatment impossibility, etc. For such an out-of-the-protocol case, oncologists try to adapt the protocol (actually, they discuss such a case during the so-called ``breast therapeutic decision meetings'', gathering experts of all domains in breast oncology, e.g. chemotherapy, radiotherapy, and surgery). In addition, protocol adaptations are studied from the ergonomics and computer science viewpoints. These adaptations can be used to propose evolutions of the protocol based on a confrontation with actual cases. The idea is then to make suggestions for protocol evolutions based on frequently performed adaptations.
Adaptation knowledge acquisition.
Adaptation in Kasimir, as in many cbr systems, requires knowledge. Adaptation knowledge acquisition (aka) is an ongoing research effort that takes two directions: aka from experts and semi-automatic aka.
aka from experts consists in analyzing adaptations performed by experts. Interviews of experts confronted with decision problems requiring adaptation have been recorded, to be analyzed afterward and modeled within adaptation patterns.
Semi-automatic aka is based on ``mining the protocols''. A protocol can be seen as a set of rules ``situation → decision''. Knowing how the decisions change when the situations change from one rule to another provides a specific adaptation rule. By generalizing these specific rules, general adaptation rules may be obtained. This generalization process has been implemented thanks to a frequent closed itemset extraction module of the Coron platform (see § ). This requires formatting the situations and decisions of the protocol in the itemset mode. A system called CabamakA realizes this case base mining for adaptation knowledge acquisition, and provides pieces of information that can be used for building adaptation rules. This aka is not fully automated: an analyst pilots CabamakA, following the principles of knowledge discovery. More precisely, the analyst uses filters to orient the mining process, and interprets the extracted pieces of information as adaptation rules.
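The intuition of mining a protocol for adaptation rules can be sketched on invented toy rules (not the actual oncology protocol, and a much simpler representation than the itemset encoding used by CabamakA): differences between rule pairs give specific adaptation rules, and recurring differences are candidate general ones.

```python
from collections import Counter

# An illustrative sketch of ``mining the protocol'' on invented toy rules:
# each rule maps a situation (a set of features) to a decision (a set of
# acts); differences between rule pairs yield specific adaptation rules, and
# recurring differences are kept as general candidates.
protocol = [
    (frozenset({"tumor<2cm"}),            frozenset({"surgery"})),
    (frozenset({"tumor<2cm", "elderly"}), frozenset({"surgery", "monitoring"})),
    (frozenset({"tumor>2cm"}),            frozenset({"chemotherapy"})),
    (frozenset({"tumor>2cm", "elderly"}), frozenset({"chemotherapy", "monitoring"})),
]

def variations():
    """Situation/decision differences for every ordered pair of distinct rules."""
    out = []
    for s1, d1 in protocol:
        for s2, d2 in protocol:
            if (s1, d1) != (s2, d2):
                out.append((s2 - s1, s1 - s2, d2 - d1, d1 - d2))
    return out

def frequent_adaptations(min_count=2):
    """Deltas seen several times, e.g. ``when the patient is elderly,
    add monitoring to the treatment''."""
    return {v for v, c in Counter(variations()).items() if c >= min_count}

for s_add, s_rem, d_add, d_rem in frequent_adaptations():
    print(f"situation +{set(s_add)} -{set(s_rem)} => decision +{set(d_add)} -{set(d_rem)}")
```

In the actual approach, the frequency filtering is performed by closed itemset extraction over the encoded rule differences, and the analyst interprets the surviving deltas as adaptation rules.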
aka from experts and semi-automatic aka are not completely satisfactory: the former provides generic adaptation patterns that are intelligible but not directly operational, while the latter provides adaptation rules that can be directly implemented but are difficult to understand (and thus, to validate). Future research will combine the two kinds of aka for producing operational and intelligible adaptation knowledge units.
Knowledge representation for decision support tools.
Two versions of Kasimir are currently used: one based on an ad hoc object-representation formalism (obrf), the other based on semantic Web principles, in a semantic portal (as explained below). A number of knowledge bases corresponding to specific cancers (decision protocols) have been developed. Moreover, the inference engine has been extended to take into account a fuzzy representation of concepts and fuzzy hierarchical classification. The system tries to detect ``borderline cases'' and to propose more than one treatment for them: this has been implemented in the obrf version of Kasimir, and its implementation in the semantic portal is under development. Another study concerns ``multiple viewpoint representation and reasoning'', which may be useful for modeling the reasoning of the breast therapeutic decision committee, i.e. each viewpoint represents a domain of breast oncology. The c-owl formalism for the representation of multiple contextualized ontologies in the semantic Web has been adapted for the purpose of multiple viewpoint representation and decentralized cbr.
A semantic portal for oncology.
The current version of the Kasimir system is embedded within a semantic portal for oncology, i.e. a Web server relying on the principles and technologies of the semantic Web for providing an intelligent access to knowledge and services in oncology.
One of the main issues of the semantic Web is interoperability between applications and knowledge modules (e.g. ontologies). Thus, building a semantic portal implies a standardization of the knowledge and software components of the Kasimir system. For the knowledge bases, standardization relies on a sharable domain model, and leads to the definition of general ontologies in oncology. This kind of ``knowledge base re-engineering'' requires replacing the ad hoc knowledge representation formalism of Kasimir with owl, the knowledge representation formalism of the semantic Web. The representation of protocols is also being re-engineered in order to take better advantage of the expressiveness of the owl formalism.
This work also implies a new software architecture for the Kasimir reasoner and the editing, visualization, and maintenance modules. This architecture must take into account constraints related to the distributed and dynamic environment of the semantic Web. In order to query the protocols represented within owl, an instance editor called EdHibou has been developed. Another interface, called NavHibou, has been developed for navigating the class hierarchies built by a reasoner based on owl. Moreover, since the Kasimir inference engine is based on subsumption, a study on the integration of an extended inference engine taking into account inferences based on cbr, and its integration within the semantic Web, has to be carried out. A cbr service based on an owl representation has been developed for this purpose (see the thesis of Mathieu d'Aquin, defended at the end of 2005).
Going further: knowledge discovery for the semantic Web.
The semantic portal of Kasimir is operational in the sense that, given a decision protocol represented in owl and an adaptation knowledge base, it can be used to apply or adapt the protocol to specific situations. Besides, some ongoing research in the Kasimir project aims at acquiring knowledge, especially adaptation knowledge, as explained above.
The goal of the thesis of Fadi Badra, initiated in October 2005, is to combine these two research issues: how knowledge discovery techniques can be used to feed a semantic portal, and how the knowledge server embedded in this portal can be used to assist knowledge discovery processes.
From a longer-term perspective, the goal is the following: keeping a clear distinction between the notions of data and knowledge, build a distributed system with knowledge bases, heterogeneous databases, inference engines, and knowledge discovery modules, allowing communication with human beings such as experts and end-users.
In this framework, we work on two major themes: the representation of spatial structures in knowledge-based systems, and the design of reasoning models on these structures, e.g. hierarchical classification and cbr. This research work is applied to agronomic questions regarding the recognition and analysis of farmland spatial structures. Besides, we have been involved in the organization of the rte 2006 workshop on spatial and temporal reasoning.
Lattice-based classification of spatial relations.
This work was initiated during the thesis of Ludmila Mangelinck (1995–1998), in collaboration with the inra bia laboratory in Nancy. It has been carried out in the context of the design of a knowledge-based system for agricultural landscape analysis.
In this framework, we have designed a hierarchical representation of topological relations based on a Galois lattice (or concept lattice) structure. A Galois lattice is a multi-faceted tool for designing hierarchies of concepts: it allows the construction of a hierarchical structure both for representing knowledge and for reasoning. In a concept lattice, a concept is defined by an extension, i.e. the set of individuals that are instances of the concept, and by an intension, i.e. the set of properties shared by all these individuals. In our framework, the extension of a concept corresponds to topological relations between regions of an image, and the intension corresponds to properties computed on these image regions (computational operations). Thus, a concept lattice emphasizes the correspondence between qualitative models, e.g. topological relations, and quantitative data, e.g. vector or raster data.
Currently, this work is continuing with a deeper study of Galois lattices for linking qualitative topological relations and computational operations on numerical (raster or vector) data. In particular, we focus on the comparison of lattices built on different sets of relations or computational operations.
CBR on spatial organization graphs.
This work was undertaken in the framework of Jean-Luc Metzger's thesis (2000–2005), in collaboration with inra sad. The objective was to develop a knowledge-based system, called rosa, for comparing and analyzing farm spatial structures. The reasoning in the rosa system follows the principles of case-based reasoning (cbr). In our research work, cbr relies on the agronomic assumption that there exists a strong relation between the spatial and functional organizations of farms, and thus that similar spatial organizations correspond to similar functional organizations. According to this assumption, and given a set of previously studied farm cases, the rosa system has to help agronomists analyze new problems bearing on land use and land management in farms.
The development of the system is achieved and tests have been done. This part of the project has been stopped since J.-L. Metzger left the team.
Besides, the analysis of the knowledge acquisition and modeling processes, undertaken with the help of researchers in socio-psychology and linguistics (codisant, lpi-grc, Université Nancy 2, and gric umr 5612 cnrs, Lyon), is continuing.
The availability and retrieval of information is of main importance in scientific and technical domains, e.g. for research and technological watch purposes. Nowadays, there is a large quantity of data available, which requires adapted tools for exploiting this mass of data. A research work bears on the definition and implementation of a toolbox allowing an ``intelligent'' access to information, by combining information retrieval, hypertext navigation, and data mining. This toolbox can be used for document retrieval on the Web, bibliographical search, or domain analysis.
In this framework, the design of a semantic-based algorithm for comparing and classifying documents is under investigation. The annotations of documents are represented as labeled trees, where nodes and edges are labeled with concepts lying in a domain ontology associated with the topics of the considered documents. A reasoning process based on classification is carried out for comparing the labeled trees representing documents, i.e. the annotations, and thus for comparing the documents themselves. This comparison process allows the computation of a semantic similarity measure between documents, and then the classification of documents according to their content.
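A possible shape for such a tree comparison is sketched below, with an invented mini-ontology and a simple similarity based on the depth of the least common subsumer; the actual algorithm under investigation may differ.

```python
# An illustrative sketch of comparing annotation trees, with an invented
# mini-ontology: concept similarity uses the depth of the least common
# subsumer, and tree similarity combines root similarity with the best
# match of each subtree.
PARENT = {"Thing": None, "Document": "Thing", "Article": "Document",
          "Report": "Document", "Topic": "Thing", "Biology": "Topic",
          "Astronomy": "Topic"}

def ancestors(c):
    out = []
    while c is not None:
        out.append(c)
        c = PARENT[c]
    return out

def concept_sim(c1, c2):
    """Similarity of two concepts from the depth of their common subsumer."""
    a1, a2 = ancestors(c1), ancestors(c2)
    lcs_depth = max(len(ancestors(c)) for c in set(a1) & set(a2))
    return 2 * lcs_depth / (len(a1) + len(a2))

def tree_sim(t1, t2):
    """Trees are (concept, [children]) pairs; children are greedily matched."""
    (c1, kids1), (c2, kids2) = t1, t2
    s = concept_sim(c1, c2)
    if kids1 and kids2:
        best = sum(max(tree_sim(k1, k2) for k2 in kids2) for k1 in kids1) / len(kids1)
        return (s + best) / 2
    return s

doc1 = ("Article", [("Biology", [])])
doc2 = ("Report",  [("Astronomy", [])])
print(tree_sim(doc1, doc2))
```

Documents can then be clustered by applying any standard classification procedure to the resulting pairwise similarity matrix.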
Another important idea underlying the toolbox is that data mining and information retrieval are complementary tasks for accessing and analyzing data. Data mining allows the guiding of information retrieval by taking advantage of the knowledge units extracted from the data; for example, the extraction of a lattice from the data may provide an organization on which the information retrieval process can rely. Conversely, information retrieval allows the guiding of the data mining process by making available information on data, which can be used, for example, for pruning a set of extracted rules, or for providing a focus for a classification process.
From a practical point of view, the toolbox, called ``IntoWeb'', provides a set of tools for implementing the core tasks of the knowledge extraction process (see ). For building a generic knowledge-based information retrieval system, it is necessary to precisely define the kinds of objects to be manipulated, and the manipulation operations. The objects may be, among others, url (references to Web documents), hypertextual documents (Web documents), full-text documents, xml documents, vectors (sets of valuated properties), etc. Operations may be applied to these objects for producing new objects containing characteristic information, e.g. the objective of a research, constraints for guiding a data mining or information retrieval process, etc. These operations may be based on information retrieval or knowledge discovery, e.g. finding all hypertextual documents identified by a set of url, computation of the vector representation of a full-text or xml document, extraction of an annotation tree from a textual document according to an ontology, extraction of a set of association rules from a set of xml documents, classification of Web documents according to an ontology, etc.
An experiment is currently under study in the field of astronomy (mda project, see § ). The focus is on building a prototype ontology in the field of astronomy (more precisely, units of measurement for celestial objects). Work has also been carried out on scientific bibliographic data, for improving retrieval and navigation services on this kind of data. In this approach, knowledge about publications and their domain is stored in an ontology. The ontology is then used for representing concepts and their relationships, and for reasoning on documents. Reasoning may help researchers by providing more efficient (focused) document retrieval and navigation.
One of the goals of data mining is to extract hidden relations among objects and properties in databases. Usually, frequent itemsets are used to find association rules, but the process produces a large number of rules, leading to the associated problem of ``mining the set of extracted rules''. Studies have shown that it can be more interesting to find only a subset of frequent itemsets, namely frequent closed itemsets (fcis) and frequent generators (fgs). In turn, fcis and fgs can be used for finding ``minimal non-redundant'' association rules.
We have developed a collection of data mining programs grouped in the so-called Coron platform. The platform contains a rich set of algorithms well known in the data mining community, such as apriori, apriori-close, close, pascal, eclat, and charm, as well as several original algorithms such as pascal+, zart, carpathia, eclat-z, and charm-mfi. The toolkit is composed of three main parts: (i) Coron-base, (ii) AssRuleX, and (iii) pre- and post-processing modules.
With Coron-base, it is possible to extract different kinds of itemsets, e.g. frequent itemsets, frequent closed itemsets, frequent generators, etc. Each of the algorithms has advantages and disadvantages with respect to the form of the data being mined. Since there is no universally best algorithm for an arbitrary dataset, Coron-base offers users the possibility to choose the algorithm that best suits their dataset and needs.
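What Coron-base computes can be illustrated by a brute-force sketch on made-up transactions (the actual algorithms such as charm are far more efficient): an itemset is closed when no proper superset has the same support.

```python
from itertools import combinations

# A brute-force sketch of frequent closed itemset mining on made-up
# transactions: an itemset is *closed* when no proper superset occurs in
# exactly the same transactions (i.e. has the same support).
transactions = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b"}, {"c"}]
items = sorted(set().union(*transactions))

def support(itemset):
    return sum(itemset <= t for t in transactions)

def frequent_closed(min_sup=2):
    candidates = [frozenset(s) for k in range(1, len(items) + 1)
                  for s in combinations(items, k)]
    freq = {s: support(s) for s in candidates if support(s) >= min_sup}
    return {s: c for s, c in freq.items()
            if not any(s < t and freq[t] == c for t in freq)}

for s, c in sorted(frequent_closed().items(), key=lambda x: sorted(x[0])):
    print(sorted(s), "support =", c)
```

Here {a} and {b} are not closed (their closure is {a, b}, with the same support), so the closed itemsets summarize the seven frequent itemsets in only three, without losing any support information.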
Finding association rules is one of the most important tasks in data mining. The second part of the system, AssRuleX (Association Rule eXtractor), can generate different sets of association rules. This leads to another data mining problem: which rules are the most useful? Besides all possible rules, some useful rule subsets can be extracted, e.g. minimal non-redundant association rules, the generic basis, and the informative basis.
The Coron toolkit supports the whole life-cycle of a data mining task. We have modules for cleaning the input dataset and reducing its size if necessary. The RuleMiner module facilitates the interpretation and filtering of the extracted rules. The association rules can be filtered by (i) attribute, (ii) support, and/or (iii) confidence. It is also possible to color the most important attributes in the list of rules, for finding the most interesting rules from a given viewpoint.
Until now, studies in data mining have mainly concentrated on frequent itemsets and the generation of association rules from them. Recently, we started to investigate the complement of frequent itemsets, namely rare (or non-frequent) itemsets. In the literature, the problem of rare itemset mining and the generation of rare association rules has not yet been studied in detail, though such itemsets also carry important information, just as frequent itemsets do. A particularly relevant field for rare itemsets is medical diagnosis. Coron already contains algorithms designed to extract rare itemsets and rare association rules, e.g. apriori-rare, arima, and btb.
The Coron toolkit is developed entirely in Java, which provides maximal portability. The system is operational and has already been tested within several research projects, e.g. for mining the Stanislas cohort, or in the CabamakA project (which is part of the Kasimir system, see § ). Moreover, the Coron implementation of the titanic algorithm has been integrated into the Galicia platform, developed at the University of Montréal, Canada.
One aspect of data mining is to provide a synthetic representation of data that a domain analyst can interpret. The purpose of the CarottAge system is to build a partition (called the hidden partition) in which the inherent noise of the data is removed as much as possible. Spatio-temporal data are then explored for extracting classes that are homogeneous in both the temporal and spatial dimensions, also giving a clear view of the transitions between the classes.
CarottAge is free software under the gpl license. It takes as input an array of discrete data, where rows represent spatial sites and columns time slots, and builds a partition with the associated a posteriori probability. This probability may be plotted as a function of time, and is a meaningful feature for the analyst searching for stationary and transient behaviors in the data. The software is currently used by inra researchers interested in mining successions of land use, e.g. in order to build models simulating the contamination of cave and surface waters.
In the framework of the project ``Impact des OGM'', initiated by the French ministry of research, we have developed a software tool called GenExp for simulating bidimensional random landscapes and then studying the dissemination of plant transgenes. The GenExp system is based on the CarottAge system and on computational geometry. The simulated landscapes are given as input to programs such as Mapod-Maïs or GeneSys-Colza for studying transgene diffusion. This year, we released a new version of GenExp allowing interaction with R subroutines. This version is about to be released under a gpl license.
The tamis system (``Text Analysis by Mining Interesting ruleS'') is currently under development. This system allows navigation through a large set of association rules, such as those produced by a text mining experiment. The tamis system has a user-friendly interface and can easily be used by non-computer scientists, e.g. analysts or experts in the domain of the analyzed data. The association rules are extracted by a mining algorithm (here using the Coron platform) and encoded in a predefined xml format. The tamis system stores the rules in a database and proposes eight different statistical measures for sorting them, e.g. support, confidence, interest, conviction, dependence... In this way, the analyst may focus on smaller sets of interesting rules satisfying a given set of constraints. These constraints may be expressed by means of operations on the values of the statistical measures and on the content of the left/right-hand side of a rule.
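The statistical measures mentioned above can be computed as follows on invented transactions; the formulas for support, confidence, lift (a common name for interest), and conviction are the standard ones from the association rule literature.

```python
# Standard rule quality measures of the kind a tool like tamis can sort by;
# the transactions and the rule milk => butter are invented for the example.
transactions = [{"milk", "bread"}, {"milk", "bread", "butter"},
                {"bread"}, {"milk", "butter"}, {"bread", "butter"}]
N = len(transactions)

def freq(itemset):
    """Relative frequency of an itemset in the transactions."""
    return sum(itemset <= t for t in transactions) / N

def measures(lhs, rhs):
    sup_l, sup_r, sup_lr = freq(lhs), freq(rhs), freq(lhs | rhs)
    conf = sup_lr / sup_l
    return {
        "support":    sup_lr,
        "confidence": conf,
        "lift":       conf / sup_r,                         # a.k.a. interest
        "conviction": float("inf") if conf == 1 else (1 - sup_r) / (1 - conf),
    }

m = measures({"milk"}, {"butter"})
print({k: round(v, 3) for k, v in m.items()})
```

Filtering then amounts to keeping only the rules whose measure values satisfy the analyst's constraints, e.g. confidence above a threshold and lift above 1.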
Two systems are under development. A first system, called ``IntoBib'', is a generic system designed for the exploitation of bibliographical data. Two kinds of objects are manipulated within the IntoBib system, namely bibliographical references and properties –or points of view– about these references, e.g. authors, keywords... The available operations on these objects are the filtering of references using one or more points of view, the conceptual clustering of similar references with respect to a given point of view, and the extraction of correlations between references. Accordingly, the IntoBib system is based on a toolbox providing a number of modules, among which: hypertext navigation, retrieval of bibliographical references, extraction of correlations between references, search for equivalent references (duplicates), conceptual clustering of similar references, and normalization of fields, e.g. author names, keywords...
The second system, called ``IntoWeb'', extends the IntoBib system. The objective is to provide a more generic environment for intelligent access to information, by combining information retrieval, hypertext navigation, and data mining. The IntoWeb system contains a set of tools implementing the core tasks of a knowledge extraction process, i.e. collecting, filtering, and mining data. Solving a given information retrieval or data mining problem is performed by a well-chosen sequence of the operations available in the system.
The ``DefineCrawler'' system can be seen as an information retrieval ``meta-system'', in the sense that it can be parameterized to satisfy different information retrieval tasks. The DefineCrawler system is based on a classical information retrieval architecture and on search engines available on the Web. A number of parameters have been retained, to be adjusted within an XML file for implementing and controlling different information retrieval behaviors.
Initialization parameters (Start) include the maximum depth of the crawl (Depth), a set of starting points for navigation (URL, possibly referring to the URL of a search engine), the directory where the data collected by the crawler are stored (Directory), the number of parallel processes crawling the Web (NbThread), and a halting condition (Stop) making it possible to specify a maximal crawling time, thus ensuring the termination of the information retrieval process.
Validation parameters (Validation) include a set of conditions (connected by Boolean operators) that must be satisfied by the documents, for eliminating documents without interest with respect to the query, e.g. documents that do not satisfy some criteria, or that are not in a given language...
Evaluation parameters, within which additional conditions can be set in order to evaluate the returned documents. The evaluation and validation conditions can be combined to calculate a score for a returned document, which is then used to rank the returned documents.
Every validation and evaluation condition is defined by an external instruction, allowing the use of various commands or tools, e.g. for checking the presence of an element, counting the occurrences of some elements, or calculating a similarity between documents...
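Such an XML parameter file could look like the following sketch. Only the parameter names mentioned above (Start, Depth, URL, Directory, NbThread, Stop, Validation) come from the description of DefineCrawler; all other element names, attributes, and values are hypothetical and serve only to illustrate the structure.

```xml
<!-- Hypothetical DefineCrawler parameter file.  Element names other than
     Start, Depth, URL, Directory, NbThread, Stop, and Validation are
     invented for this example. -->
<DefineCrawler>
  <Start>
    <Depth>3</Depth>
    <URL>http://www.example.org/search?q=data+mining</URL>
    <Directory>/tmp/crawl</Directory>
    <NbThread>4</NbThread>
    <Stop maxTime="3600"/>
  </Start>
  <Validation>
    <And>
      <Condition command="check-language" arg="en"/>
      <Condition command="contains-term" arg="knowledge discovery"/>
    </And>
  </Validation>
  <Evaluation>
    <Condition command="count-occurrences" arg="ontology"/>
  </Evaluation>
</DefineCrawler>
```

Each `Condition` element would point to an external instruction, in line with the text above; validation conditions reject documents, while evaluation conditions contribute to the ranking score.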
Rosa, for ``Reasoning on Organization of Space in Agriculture'', is a system developed in collaboration with agronomists, whose objective is to record and maintain an agronomic knowledge base on farms, and to solve problems in agronomy based on this knowledge base. Two kinds of knowledge elements are considered: domain knowledge, and knowledge on the spatial organization and functioning of specific farms. The domain knowledge is described by a hierarchy of spatial concepts and relations (spatial occupation and spatial relations). The spatial organization of farms is described by so-called ``space organization graphs'' (SOGs), linking spatial entities through spatial relations. A vertex of a SOG (either a spatial entity or a relation) is labeled and linked to a concept of the domain knowledge hierarchy. The functioning of farms is described within ``explanations'' attached to SOGs. An explanation bears on a particular function of the considered farm organization and functioning. The association of a particular SOG with an explanation constitutes a case, to be used within a case-based reasoning process. The Rosa system is under development, and is implemented within the Racer description logic system.
The objective of the Kasimir system is decision support and knowledge management for the treatment of cancer. A number of tools have been developed within the Kasimir system: mainly modules for the editing of treatment protocols, visualization, and maintenance. The ontology editor Protégé has been customized for editing the Kasimir protocols, and has been connected with the Kasimir inference engine. The use of the Protégé editor simplifies protocol editing and, thanks to the inference engine, enables the detection of errors during editing.
Two visualization modules have been integrated into Protégé, allowing the display of the Kasimir hierarchy of concepts representing the protocol being edited: Palétuvier and HyperTree (HyperTree was initially developed by the ECOO team at LORIA). The combined use of these two visualization modules and of the classical tree widget of Protégé provides several useful features for hierarchy visualization, navigation, and global or focused views.
Finally, a maintenance module has been developed and integrated into Protégé; it compares two versions of a protocol in order to separate changed and unchanged elements. This module can be used in particular during an editing session, to visualize the modifications made since the beginning of the session.
Actually, two versions of Kasimir are currently in use: one version is based on an ad hoc object-based representation formalism, and the other version is developed within the semantic portal introduced earlier. The latter is based on OWL and on some extensions of OWL, and has motivated the development of the two user interfaces, namely EdHibou and NavHibou, presented above. The CabamakA software for case base mining for adaptation knowledge acquisition is also part of the Kasimir system.
``Knowledge Web'' is the name of a European network of excellence initiated in 2004. Three INRIA teams are involved in Knowledge Web, namely Acacia at INRIA Sophia-Antipolis, Exmo at INRIA Rhône-Alpes, and Orpailleur. The current World Wide Web (WWW) is a syntactic Web, where the structure of the content of documents is presented, while the content itself remains inaccessible to computers. The next generation of the Web, the Semantic Web, aims at alleviating this problem and at providing specific solutions targeted at concrete problems. Web resources will become much more readily accessible to both humans and computers, thanks to additional semantic information in a machine-understandable and machine-processable form. The Semantic Web should have a much higher impact on eWork and eCommerce than the current version of the Web. Still, there is a long way to go to turn the Semantic Web from an academic adventure into a technology provided by the software industry. Supporting this transition of ontology technology from academia to industry is the major goal of the ``Knowledge Web'' project. Given the nature of such a transformation, this goal naturally translates into three main objectives:
Industry requires immediate support in taking up this complex new technology. Languages and interfaces need to be standardized to reduce effort and provide scalability to solutions. Methods and use cases need to be provided, both to convince industry and to give guidelines on how to work with this technology.
Important support to industry is provided by developing high-quality education in the areas of the Semantic Web, Web services, and ontologies.
Research on ontologies and the Semantic Web has not yet reached its goals. New areas, such as the combination of the Semantic Web with Web services to realize intelligent Web services, require serious new research efforts.
More briefly, the mission of Knowledge Web is to strengthen the European software industry in one of the most important areas of current computer technology: the Semantic Web enabling eWork and eCommerce. Naturally, this includes education and research efforts to ensure the durability of the impact and the support of industry.
The GenNet research and development project is a European Eureka-labeled project involving two companies, namely the French company KIKA Medical and the Belgian company Phenosystems. Two members of the Orpailleur team supervise a so-called ``thèse CIFRE'' on the integration of clinical and genetic data for mining and pharmacogenomic knowledge extraction. This research work is in progress, and more developments are needed before substantial results can be obtained.
The FouDAnGA proposal, for ``Fouille de données pour l'annotation de génomes d'actinomycètes'', was selected in June 2004 as an ACI IMPBio project in bioinformatics. This project involves two research teams from LORIA (namely Adage and Orpailleur) and the Laboratory of Genetics and Microbiology of the University UHP Nancy 1. For a number of years, these three teams have been collaborating within the PRST ``Intelligence logicielle – Bioinformatique et applications à la génomique'' (see hereafter). Being selected as an ACI IMPBio project has reinforced and structured the initial project, allowing two students to complete their theses.
The scientific motivation of this project is to extract subsequences from DNA with informative and significant value in molecular genetics. In particular, the signals involved in gene regulation are under investigation. The models used correspond to bacteria of the actinomycete group –in particular Streptomyces, a major producer of antibiotics and of metabolites of therapeutic interest– and to Mycobacteria, for example M. tuberculosis, which is responsible for tuberculosis.
A stationary homogeneous second-order hidden Markov chain describes discrete heterogeneities distributed with a strong bias in the intergenic regions. The a posteriori observation of the hidden states pinpoints short DNA loci (5 to 12 bp) corresponding mostly to targets for DNA-binding proteins, including transcriptional regulators. The analysis of the Streptomyces coelicolor genome allows the detection of the exact location of all 30 SigR promoters, as well as of 92 other known or putative relevant regulatory sequences described so far. These DNA motifs represent about 7.8% of the 3000 extracted from a database corresponding to 1.15 Mb of chromosomal DNA.
The ISIBio project, for ``Information Systems Integration in Biology'', is a research project supported since July 2004 by the Ministry of Research in the framework of the ACI IMPBio initiative. In this interdisciplinary project, the focus is on the exploration of the role of metadata and ontologies in the integration of information systems in biology. The ISIBio project reinforces the existing collaborations between people from different disciplines, and stimulates new interactions at both the national and international levels, by organizing an international seminar twice a year.
The second ISIBio seminar took place in Paris (Institut Pasteur) on December 12–13, 2006 (http://bioinfo.loria.fr/projects/isibio/isibio-presentation). The ISIBio group also co-organized the second OGSB workshop ``Ontologies, Grille et Intégration Sémantique pour la Biologie'', held in conjunction with the JOBIM conference in Bordeaux on July 4th, 2006. The third ISIBio seminar was held on November 21st, 2006 in Nancy.
This research project, ``Knowledge Discovery and Ontology Design in Astronomy'', is carried out in collaboration with the CDS in Strasbourg (``Centre de données astronomiques de Strasbourg'') and the IRIT computer science laboratory in Toulouse. Researchers in astronomy use every day an information network made of journal articles available in electronic form, and a number of databases, such as the SIMBAD database, recording bibliographical entries and measurement sets on about three million astronomical objects, and the catalog server VizieR, recording astronomical catalogs and measurement tables published in the astronomical journals. Interested researchers should have access to the content of documents, e.g. journal articles, astronomical object catalogs, or measurement tables. To facilitate this access, researchers in astronomy have at their disposal a base of so-called UCDs, for ``Unified Content Descriptors'', i.e. a hierarchical database extracted and designed at the CDS from the content of astronomical catalogs and tables.
The research work currently carried out in collaboration with the CDS concerns the study and design of an ontology representing the UCDs, as well as astronomical objects, starting from a collection of articles –thus involving text mining– and for extending the UCD base. This ontology will be used for a number of important tasks for researchers in astronomy, such as intelligent information retrieval based on the content of documents, and information manipulation for matching and comparing the content of astronomical documents. This research work can be seen as a contribution to research on the Semantic Web, where the purpose is to attach semantics to astronomical documents, to define an annotation method for astronomical documents, and to design a knowledge-based information retrieval method over heterogeneous astronomical sources.
A methodology for building an OWL ontology of the UCDs is currently under study. The specific task in which this ontology has been used is the retrieval of the UCDs best representing the description of an astronomical object given by a set of properties. An approximate two-step classification process is performed by exploiting the metadata linking the lexical items used in the descriptions of astronomical objects with the concept properties defining the UCDs in the ontology. The recognition of composed UCDs depending on several concepts has to be studied further. The classification of simple and composed UCDs presents similarities with work on disjunctive classification, where concepts are defined by a union of properties: in this case, owning a subset of the properties is sufficient for an object to be classified as an instance of the concept.
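The disjunctive classification idea mentioned above can be sketched in a few lines. This is a minimal illustration, not the actual UCD classifier: the concept and property names are invented, and "owning a subset of the properties" is interpreted here as owning at least one defining property.

```python
# Minimal sketch of disjunctive classification: a concept is defined by a
# union (disjunction) of properties, and an object owning at least one of
# them is classified as an instance.  Names are invented for the example.

concepts = {
    "Star":   {"has_spectral_type", "has_luminosity"},
    "Galaxy": {"has_redshift", "has_morphology"},
}

def classify(object_properties, concepts):
    """Return the concepts whose defining disjunction the object satisfies."""
    props = set(object_properties)
    return sorted(name for name, defining in concepts.items()
                  if props & defining)          # non-empty intersection

print(classify({"has_redshift", "has_luminosity"}, concepts))
# -> ['Galaxy', 'Star']
```

Contrast this with the usual conjunctive reading, where an object must own all the defining properties of a concept to be one of its instances.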
A research work on Adaptation Knowledge Acquisition (AKA) for the Kasimir system is carried out in the framework of the CNRS interdisciplinary project TCAN, for ``Traitement des connaissances, apprentissage et NTIC''. The objective of AKA is to provide knowledge in the form of adaptation meta-rules:
Automated AKA is based on the mining of the protocols. A protocol can be seen as a set of rules of the form situation → decision. Knowing how the decisions change when the situations change from one rule to another may provide a specific adaptation rule. Clustering and generalizing these specific adaptation rules produces general adaptation rules, which then have to be validated by experts.
Supervised AKA is based on the analysis of adaptations performed by experts. Interviews of experts confronted with decision problems requiring adaptation have been recorded, to be afterwards analyzed and modeled as adaptation rules.
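The first step of automated AKA, extracting specific adaptation rules from pairs of protocol rules, can be sketched as follows. This is an illustrative toy, not Kasimir or CabamakA code: the attribute-value representation of rules and all the medical values are invented for the example, and the clustering/generalization and expert-validation steps are omitted.

```python
# Illustrative sketch of automated AKA: a protocol is a set of rules
# (situation -> decision); comparing two rules records how the decision
# changes when the situation changes, yielding a specific adaptation rule.
# The rule representation and the data are invented for this example.

from itertools import combinations

# Each rule: (situation, decision), both as dicts of attribute -> value.
protocol = [
    ({"age": "<70", "tumor_size": "small"}, {"treatment": "surgery"}),
    ({"age": ">=70", "tumor_size": "small"}, {"treatment": "radiotherapy"}),
    ({"age": "<70", "tumor_size": "large"}, {"treatment": "chemotherapy"}),
]

def diff(d1, d2):
    """Attributes whose values differ between two dicts."""
    return {k: (d1.get(k), d2.get(k))
            for k in set(d1) | set(d2) if d1.get(k) != d2.get(k)}

def specific_adaptation_rules(protocol):
    """For each pair of rules, pair the situation change with the decision change."""
    rules = []
    for (s1, d1), (s2, d2) in combinations(protocol, 2):
        rules.append({"situation_change": diff(s1, s2),
                      "decision_change": diff(d1, d2)})
    return rules

for r in specific_adaptation_rules(protocol):
    print(r)
```

Clustering and generalizing these specific rules (e.g. grouping all pairs where only the patient's age changes) would then yield the general adaptation rules submitted to the experts.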
Orpailleur is involved in this TCAN project together with the ``laboratoire d'ergonomie'' of the CNAM in Paris and the Centre Alexis Vautrin in Nancy. Beyond the application framework, this research work will involve progress in AKA methodology and techniques, an original research area in CBR (still in its beginnings, despite its importance for knowledge-intensive approaches in CBR).
Géomatique (CNRS–STIC): ``Modélisation, comparaison et interprétation d'organisations territoriales agricoles'' (led by Florence Le Ber).
Impact des OGM (MENRT): ``Modélisation de la dispersion de transgènes à l'échelle de paysages agricoles'' (led by Florence Le Ber).
Eau, environnement, sociétés – Ressources, Usages, Risques, Gestion (CNRS–SHS): RIBAVAL project ``Conception d'un outil pour la simulation du fonctionnement d'un bassin versant et définition des conditions d'utilisation pour la co-gestion'' (led by Florence Le Ber).
Programme fédérateur ``Agriculture et Développement Durable'': Conception d'Observatoires de Pratiques Territorialisées de la Durabilité de l'Agriculture (COPTDA) (led by Jean-François Mari).
Collaborations: ENGEES Strasbourg; INRA in Nancy-Mirecourt, Paris-Grignon, Dijon, and Toulouse; Laboratoire ESE UPRESA 8079 CNRS/Paris-Sud; Équipe Codisant, LPI–GRC, Université de Nancy 2; GRIC UMR 5612 CNRS Lyon; and ENGREF Clermont-Ferrand.
The acronym PRST-IL stands for ``Programme Régional Scientifique et Technique Intelligence Logicielle'', in which the LORIA laboratory is involved.
The PRST-IL project ILD-ISTC, for ``Ingénierie des langues et du document, information scientifique, technique et culturelle''.
The Orpailleur team is involved in the regional research project ILD-ISTC. In this context, research work is carried out in association with the URI team at INIST-CNRS on the design of an operational text mining platform (mainly for technological watch on scientific texts).
The PRST-IL project BioInfo, for ``Bioinformatique et applications à la génomique'' (http://bioinfo.loria.fr/Bioinfo-Loria/).
The Orpailleur team is involved in three main collaborations with biology laboratories, namely ``Fouille de données pour l'annotation de génomes d'actinomycètes'' (with the Laboratory for Microbial Genetics, LGM UHP–INRA), ``Vers une exploitation sémantique des sources de données biologiques du Web'' (with EA 3446, CRB-INSERM-U724, and EA 4002), and ``Combinaison de méthodes symboliques-numériques de fouilles de données pour l'étude et l'analyse de la cohorte Stanislas'' (with INSERM U525, Équipe 4).
The members of the Orpailleur team are involved, as members or as leaders, in a number of national research groups.
The members of the Orpailleur team are involved in the organization of conferences, as members of conference program committees, as members of editorial boards, and finally in the organization of journal special issues.
The members of the Orpailleur team are involved in teaching at all university levels in Nancy (especially at ``Université Henri Poincaré Nancy 1'' and ``Université Nancy 2''); indeed, most of the members of the Orpailleur team hold university positions.
The members of the Orpailleur team are also involved in student supervision, again at all university levels, from undergraduate to postgraduate students.
Finally, the members of the Orpailleur team are involved in HDR and PhD thesis defenses, as thesis referees or thesis committee members.
A delegation from the Orpailleur team was present at the INRIA booth at Eurobio 2006 (http://www.eurobio2006.com/). The project ``Transfer and services for Genomics and Biomolecular Modelling in Lorraine'' (G-BioModeL) was presented, and contacts were established with five companies present at the booth: Aureus Pharma, Gene-IT, Genomining, Genostar, and Helios BioScience. Two demonstrations, SNP-Converter and VSM-G, illustrated the activities in bioinformatics within the Orpailleur team. A flyer is available on the BioInfo web site: http://bioinfo.loria.fr/Members/pierronl/eurobio-2006/.
L. Szathmary, S. Maumus, P. Pétronin, Y. Toussaint and A. Napoli: Best Paper Award at EGC 2006 for the article ``Vers l'extraction de motifs rares''. EGC 2006, 6èmes Journées d'Extraction et Gestion des Connaissances, January 18–21, Lille, France, pages 499–510, 2006.
Sandy Maumus: PhD Thesis Award of the Université Henri Poincaré (Faculty of Pharmacy), for ``Approche de la complexité du syndrome métabolique et de ses indicateurs de risque par la mise en oeuvre de méthodes numériques et symboliques de fouille de données''. Thèse de l'Université Henri Poincaré Nancy 1, November 2005.