Information available online is increasingly complex, distributed, heterogeneous, replicated, and changing. Web services, such as SOAP services, should also be viewed as information to be exploited. The goal of Gemo is to study fundamental problems raised by modern information and knowledge management systems and to propose novel solutions to these problems.
A large part of our work has been devoted to the ANR project WebContent.
Serge Abiteboul received the EADS Award in Computer Science, which is awarded by the French Academy of Sciences.
A main theme of the team is the integration of information, seen as a general concept, including the discovery of meaningful information sources or services, the understanding of their content or goal, their integration and the monitoring of their evolution over time.
Gemo works on environments that are both powerful and flexible to simplify the development and deployment of applications providing fast access to meaningful data. In particular, content warehouses and mediators offering a wide access to multiple heterogeneous sources provide a good means of achieving these goals.
Gemo is a project born from the merging of the INRIA-Rocquencourt project Verso with members of the IASI group of LRI. It is located in Orsay-Saclay. A particularity of the group is that it addresses data and knowledge management issues by combining techniques from artificial intelligence (such as classification) and databases (such as indexing).
Some prospective work is presented in . The goal is to enable non-experts, such as scientists, to build content-sharing communities in a true database fashion: declaratively. The proposed infrastructure is called a data ring.
Databases do not have specific application fields. As a matter of fact, most human activities lead today to some form of data management. In particular, all applications involving the processing of large amounts of data require the use of databases.
Technologies recently developed within the group focus on novel applications in the context of the Web, telecom, multimedia, enterprise portals, or information systems open to the Web. For instance, in the setting of the EDOS EC project, we are developing software for the P2P management of the data and metadata of the Mandriva Linux distribution.
Some recent software developed in Gemo:
ActiveXML: a language and system based on XML documents containing Web service calls. ActiveXML is now available as open source on the ObjectWeb Forge.
SomeWhere: a P2P infrastructure for semantic mediation.
SomeWhere+: a P2P infrastructure tolerant to inconsistency.
KadoP: a peer-to-peer platform for warehousing of Web resources.
OptimAX: an algebraic cost-based optimizer for ActiveXML.
TaxoMap: a prototype to automate semantic mappings between taxonomies.
XTAB2SML: an automatic ontology-based tool to enrich tables semantically.
WebQueL: a multi-criteria filtering tool for Web documents, developed in the setting of the e.dot project.
ULoad: a tool for creating and storing XML materialized views, and using them to answer XQuery queries.
GUNSAT: a greedy local search algorithm for propositional unsatisfiability testing.
LN2R: a logical and numerical tool for reference reconciliation.
One of the reasons for the success of the relational data model was probably its clean theoretical foundations. Obtaining such a clean foundation for the semistructured data model and XML is still an on-going research task.
With XML documents, data may be extracted, queried, or used in navigation because of its position in a document rather than because of its actual content. It is thus believed that those foundations will be based on tree automata and on Monadic Second-Order (MSO) logic, making use of the tree structure of XML data. In this direction, we studied in some complexity issues related to a sequential family of tree automata, which has the same expressive power as unary Transitive-Closure logic. But of course data values cannot be completely ignored. In we show how to use decidable logics over infinite alphabets in the XML context for deciding XML-schema validation and XPath query inclusion.
By essence, XML is used in an Internet-based environment. On the Web, one may have to process heavy streams of information on the fly, to support the surveillance of rapidly changing data sources. Also, by the nature of the Web, information is imprecise, incomplete, inconsistent, and of uneven quality. The answer to a Web query may include a huge number of results (consider a Google search), and it is typically as important to rank these results as to obtain them.
We have considered streaming XML data with limited memory resources. In this context, we considered in the DTD validation problem: checking whether an XML document conforms to a DTD.
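Streaming DTD validation can be illustrated with a small sketch. The code below is a toy illustration, not our actual algorithm: it encodes each element's content model as a regular expression over child-element names (the DTD and element names are hypothetical) and validates a SAX-like event stream in a single pass, using memory proportional to the document depth only.

```python
import re

# Toy DTD (hypothetical): element name -> regular expression over the
# sequence of child element names, each child name followed by ";".
DTD = {
    "bib":    r"(book;)*",
    "book":   r"title;(author;)+",
    "title":  r"",
    "author": r"",
}

def validate_stream(events, dtd):
    """One-pass validation of a SAX-like event stream against the toy DTD.
    Memory use is proportional to the document depth, not its size."""
    stack = []  # one (tag, children seen so far) frame per open element
    for kind, tag in events:
        if kind == "start":
            if tag not in dtd:
                return False
            if stack:
                stack[-1][1].append(tag)
            stack.append((tag, []))
        else:  # "end"
            open_tag, children = stack.pop()
            if open_tag != tag:
                return False
            # Check the word of child names against the content model.
            word = "".join(c + ";" for c in children)
            if not re.fullmatch(dtd[tag], word):
                return False
    return not stack
```

Determinism of DTD content models is what makes such one-pass processing possible; only the stack of open elements and their child words is retained.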
It is often desirable that a user only has partial access to a database, and that several users see different parts of the database. The subpart seen by a user is called a view. When a user specifies a query, this query has to be rewritten according to the real database and then evaluated. In we study the language necessary for rewriting Conjunctive Queries.
We continued our work on probabilistic semi-structured models. We give complexity results about the probabilistic tree model (based on trees whose nodes are annotated with conjunctions of probabilistic event variables) that was previously introduced. We identify a very large class of queries for which querying and elementary updates are tractable. We also consider other theoretical issues, such as the equivalence of probabilistic trees or the validation of a probabilistic tree against a DTD.
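To illustrate the probabilistic tree model, the following sketch (with a hypothetical document and made-up event probabilities) computes the probability that a given node exists, namely the product of the probabilities of the distinct event variables conjoined along its root path, assuming independent events:

```python
# Hypothetical event probabilities (independent probabilistic variables).
P = {"e1": 0.9, "e2": 0.5, "e3": 0.8}

# A probabilistic tree: (label, events, children); a node (and its subtree)
# is present iff all event variables on its root path are true.
tree = ("bib", set(), [
    ("book", {"e1"}, [
        ("title", set(), []),
        ("author", {"e2"}, []),
    ]),
    ("book", {"e1", "e3"}, []),
])

def node_probability(node, path, inherited=frozenset()):
    """Probability that the node reached by `path` (a sequence of child
    indices) exists: the product over the *union* of the events seen from
    the root down, so that a shared event is not counted twice."""
    label, events, children = node
    events = set(inherited) | set(events)
    if not path:
        p = 1.0
        for e in events:
            p *= P[e]
        return p
    return node_probability(children[path[0]], path[1:], events)
```

For instance, the first book's author node depends on events e1 and e2, so it exists with probability 0.9 × 0.5 = 0.45. Query answering over such trees is more involved (it requires inclusion-exclusion across worlds), which is where the tractability results above come in.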
A new challenge is the study of XML when used in the dynamic environment of the Web. As XML is used as an exchange format for data over the Web, systems using XML, such as Web services, must manipulate highly heterogeneous data formats. In order to reduce the risk of failure, it is therefore important to be able to perform offline static analysis of the programs developed in such systems. Gemo has started studying problems related to verification of systems for XML.
In order to improve Web information retrieval using ontologies, we proposed an extension and an implementation of OWL for Web query enrichment. This work has been done in the setting of the O3 approach designed by Cedric Pruski in Luxembourg during his Master of Science. O3 uses the WordNet linguistic resource in order to optimize, in terms of relevance, the documents returned when searching the Web. Its main idea consists in enriching, following well-defined rules, the query constructed by users, by extracting from WordNet the appropriate vocabulary that best characterizes the search domain. O3 has been formalized using first-order logic and graph theory. This formal framework permitted the rigorous definition of query expansion rules. In parallel, the standardization of OWL has led to the quick and massive development of OWL ontologies across the Web. This is why, to benefit from both O3 and OWL ontologies, we decided to make O3 compatible with OWL. We studied the possibilities offered by OWL that cope with O3, as well as an extension of the language. We implemented the so-called extended OWL through query expansion rules and made an experimental validation using the TARGET tool .
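The query enrichment idea can be sketched as follows. The synonym table is a hypothetical stand-in for WordNet, and the rule shown (add only synonyms tagged with the search domain) is a simplification of O3's expansion rules:

```python
# Toy stand-in for WordNet: term -> list of (synonym, domain) pairs.
# (Hypothetical data; O3 extracts such vocabulary from WordNet itself.)
SYNONYMS = {
    "car":   [("automobile", "transport"), ("railcar", "rail")],
    "plane": [("aircraft", "transport"), ("plane", "geometry")],
}

def expand_query(terms, domain):
    """Enrich a keyword query with synonyms restricted to the search
    domain, mirroring the rule-based expansion idea of O3 (simplified)."""
    expanded = []
    for t in terms:
        expanded.append(t)
        for syn, dom in SYNONYMS.get(t, []):
            # Only vocabulary characterizing the search domain is added.
            if dom == domain and syn not in expanded:
                expanded.append(syn)
    return expanded
```

Restricting expansion to the search domain is what keeps the enriched query relevant: expanding "plane" in a transport search should add "aircraft" but not geometry vocabulary.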
First, we surveyed techniques for ontology evolution . After identifying the different kinds of evolution the Web is confronted with, we detailed the various existing languages and techniques devoted to Web data evolution, paying particular attention to Semantic Web concepts and to how these languages and techniques can be adapted to evolving data in order to improve the quality of Web information systems. Second, we proposed a set of modelling features for ontology evolution . These features have been defined after a rigorous study of the evolution of a particular domain (the domain defined by the WWW series of conference topics) over a ten-year period. The results of this study led directly to the definition of the various kinds of evolution that can appear. They allowed the proposition of modelling features that aim at designing evolving ontologies. Indeed, these features will allow us to understand the evolution of ontologies and will help predict future versions of ontologies. We highlighted the contribution of such ontologies through an example implementing ontology-based query expansion techniques to improve the relevance of documents when searching the Web.
When data sources are numerous and heterogeneous, a data integration system needs automatic tools to annotate and query semi-structured documents. We propose an automatic approach for the semantic annotation of HTML or XML documents . It relies on a model describing the domain of interest. The difficulty lies in the heterogeneous structure of the documents and in the fact that a document contains both structured and unstructured parts. To overcome this problem, we have defined a first set of annotation rules using SWRL. These rules take into account both the semantic relations defined in the model and the heterogeneity of document structures. The resulting annotated documents are represented in RDF according to the semantic RDF Schema model, which is extended from the domain description for the annotation task. Since October 2007, this work has been done in the setting of the SHIRI project, which is supported by the DIGITEO Foundation.
Peer-to-Peer Inference Systems (P2PISs) are made of autonomous peers (i.e., built and managed independently) that can communicate in order to perform an inference task at the P2PIS level (e.g., consequence finding or query answering). For that purpose, communication rules between peers are modeled by mappings that define semantic relationships between their knowledge.
A crucial aspect of this new setting is that peers are equivalent in functionality and no actor has a global view of a P2PIS, i.e., there is no centralized control or hierarchical organization in the system. Each peer only knows the knowledge it manages and its mappings with some other peers. This raises exciting, non-trivial algorithmic issues since, in the literature, reasoning algorithms have been designed under the assumption that the knowledge on which inferences have to be performed is given as an input. New decentralized algorithms have to be designed with the idea that only a subset of the global knowledge is available to a peer as an input (i.e., the peer's knowledge and mappings), while the algorithms must still be sound and complete for the inference task w.r.t. the global knowledge of the P2PIS (i.e., the knowledge and mappings of all the peers). The SomeWhere platform has been developed for experimenting with such distributed reasoning tasks. It is a building block of the MEDIAD project with France Telecom R&D, as well as one of the components being integrated in the platform to be produced by the WebContent project.
Many challenging Artificial Intelligence (AI) tasks like common sense reasoning, diagnosis, or knowledge compilation can be stated in terms of consequence finding. That key inference basically consists in deriving theorems of interest that are intensionally characterized within a logical theory. Such theorems can be those expressed in a fixed language, those resulting from some incoming knowledge in the theory, etc.
Recently, we have designed the first peer-to-peer inference systems for consequence finding, in which each peer manages a clausal theory of propositional logic over its own set of propositional variables. A peer establishes (or suppresses) a mapping by adding to (or removing from) its theory a clause made of some of its variables and some variables from other peers, those peers being notified of the operation. For those systems, we have proposed the Decentralized Consequence finding Algorithm (DeCA), which performs a decentralized resolution procedure in order to compute the clausal implicates (i.e., consequences) of a clause submitted to a peer w.r.t. a P2PIS, including all the proper prime ones (i.e., the strongest consequences).
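The core inference that DeCA decentralizes can be sketched in a centralized form: saturate the submitted clause against the theory by propositional resolution and keep the subsumption-minimal derived clauses. This is only an illustration of consequence finding, not the distributed algorithm itself; clauses are represented as sets of integer literals (negative integers for negated variables):

```python
def resolve(c1, c2):
    """All non-tautological resolvents of two clauses (sets of literals)."""
    out = []
    for lit in c1:
        if -lit in c2:
            r = (c1 - {lit}) | (c2 - {-lit})
            if not any(-l in r for l in r):  # drop tautologies
                out.append(frozenset(r))
    return out

def consequences(clause, theory):
    """Centralized sketch of the inference DeCA performs peer-to-peer:
    saturate the input clause against the theory by resolution, then keep
    only the subsumption-minimal (prime) implicates."""
    derived = {frozenset(clause)}
    frontier = [frozenset(clause)]
    while frontier:
        c = frontier.pop()
        for t in theory:
            for r in resolve(c, frozenset(t)):
                if r not in derived:
                    derived.add(r)
                    frontier.append(r)
    # A clause is prime if no strictly stronger derived clause subsumes it.
    return {c for c in derived if not any(d < c for d in derived)}
```

In the P2P setting, the theory (and hence the resolution steps) is split across peers linked by mapping clauses, and propagation follows the mappings.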
A key point in the design of the above P2PISs is that mappings are undirected, i.e., any peer involved in a mapping can use it to propagate knowledge to the other peers participating in the mapping. Therefore, such systems model autonomous components that communicate through interfaces that are both input and output.
We have recently proposed an alternative design of P2PISs in which mappings are directed. A mapping is stated between two peers, but only one of them can use the mapping to propagate knowledge to the other. From a practical viewpoint, a mapping from a peer to another specifies some knowledge that the former peer has to observe and the knowledge it must notify to the latter peer if the observed knowledge holds. Such new P2PISs are of great interest for applying AI reasoning because they can model many real applications in which autonomous components communicate through interfaces that are either input or output, like distributed functions in Automotive Engineering, distributed control systems for industrial machinery, or processes in Automation. For those systems, we have proposed a new Decentralized Consequence finding Algorithm for directed mappings (DeCAK) that computes the clausal implicates of a clause submitted to a peer w.r.t. a P2PIS, including all the proper prime ones.
In P2PISs retaining the classical semantics for mappings (i.e., the undirected view), the ability of each peer to freely add new mappings with other peers may affect the consistency of the resulting global theory. This cannot be avoided because of the decentralized nature of the architecture. In order to prevent the trivialization of reasoning in such cases, we have designed a method able to detect incrementally all possible minimal causes of inconsistency and to store them in a distributed way in the P2PIS. Furthermore, we have proposed a new distributed consequence finding algorithm (WFDeCA) able to perform well-founded reasoning despite the presence of possible inconsistencies. These algorithms have been implemented and an experimental evaluation is underway. One noticeable feature of WFDeCA is that different consequences, though all well founded, may have different supports that are not necessarily consistent with each other. In such cases, it is up to the user to choose between consequences having incompatible supports. One possible criterion is to prefer the most trustable consequences. We are currently investigating different trust models that have been proposed for P2P file sharing systems and considering their possible adaptation to the task of distributed consequence finding.
The logical theory of consistency-based diagnosis was worked out in the eighties for the centralized case. It starts from a model, assumed to be given, of the behavior of the (component-based) system under consideration (correct behavior and possibly some faulty behaviors, if known in advance) and aims at maintaining consistency between the current hypotheses about the behavioral modes of the components (correct or faulty) and the observations (e.g., sensor measurements). It is stated in a logical framework, where the model SD (for System Description) and the observations OBS are expressed in first-order logic, the mode of each element in COMPS being explicitly represented thanks to the predefined Ab (for Abnormal) predicate (so ¬Ab(c) means that component c is correct and Ab(c) that component c is faulty). Diagnostic reasoning is a typical example of non-monotonic reasoning: initially all components are assumed to be correct, up to the moment this becomes inconsistent with the observations. Then consistency between the model and the observations is restored by changing some component mode assignments from correct to faulty (in general, a principle of parsimony is applied and one is interested only in minimal (for cardinality or for set inclusion) sets of faulty components). Technically, from a logical inference point of view, the computation of the diagnoses (complete component mode assignments consistent with the observations) relies on the computation of (prime) implicates and implicants of SD ∧ OBS in terms of the target language built from the Ab(c), for c in COMPS. This diagnostic activity can be done off-line, from a given set of observations, or on-line in a general monitoring framework where new observations occur along time, and where the real (unknown) mode of each component can itself vary along time (from correct to faulty, but also from faulty to correct in the case of transient faults).
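The characterization of minimal diagnoses can be illustrated by the classical reduction to minimal hitting sets of conflict sets (a conflict being a set of components that cannot all be correct given SD and OBS). The brute-force sketch below is for illustration only; real diagnosis engines derive the conflicts by logical inference and use far more efficient hitting-set computations:

```python
from itertools import combinations

def minimal_diagnoses(components, conflicts):
    """Reiter-style sketch: a diagnosis is a minimal set of components
    whose assumed abnormality (Ab) hits every conflict set."""
    diagnoses = []
    for size in range(len(components) + 1):
        for cand in combinations(sorted(components), size):
            s = set(cand)
            # Keep candidates that hit every conflict and are not
            # supersets of an already-found (hence smaller) diagnosis.
            if all(s & c for c in conflicts) and \
               not any(set(d) <= s for d in diagnoses):
                diagnoses.append(cand)
    return diagnoses
```

For instance, with conflicts {A, B} and {A, C}, the minimal diagnoses are {A} alone or {B, C} together, matching the parsimony principle described above.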
Assuming a centralized system, a centralized model, and a centralized diagnostic algorithm is a severe restriction for several real applications: the system can be "naturally" distributed (telecommunication networks, Web services, etc.); it can be too huge or complex for a unique storable and accessible global model to exist; privacy issues can prevent the existence of such a global model; and the diagnostic algorithm can benefit from performing decentralized local diagnoses and from being implemented in a decentralized way over several control units. This is why decentralized diagnosis has received growing interest in recent years. The work that has been initiated is an attempt to design, implement, and test distributed consistency-based diagnosis algorithms in a logical framework, relying on previous work conducted inside Gemo on P2PISs, in particular concerning consequence finding and the handling of inconsistencies. In this P2P framework, each peer represents a subsystem and its local theory is the propositionalized subsystem description, with the mappings (shared variables) expressing the connections between subsystems. Observation peers (sensors) have a local theory limited to a propositional symbol expressing the measurement's value. The algorithm currently developed for generating minimal diagnoses relies on a distributed computation of (restrictions to the target language of) implicants of the global (unknown as a whole) theory . Several problems will have to be addressed in the future: incrementality w.r.t. increasing asynchronous observations; characterization of all diagnoses in the presence of fault models; on-line monitoring and diagnosis with observations varying asynchronously along time; repair by reconfiguring the system (changing mappings); open world (addition or suppression of peers); etc.
In the setting of the MediaD project, we address the problem of discovering mappings between distributed ontologies in SomeRDFS, a peer data management system (PDMS) derived from SomeWhere. Since the PDMS setting is particular, we proposed techniques that take advantage of SomeRDFS reasoning to help discover mappings between the knowledge of the peers, i.e., their ontologies; these mappings can be mapping shortcuts or new mappings . The aim of the proposed techniques is to discover elements that are relevant candidates for mapping. These elements will then be aligned by applying usual alignment techniques. The implementation of this work is in progress.
We are working on the reference reconciliation problem. It consists in deciding whether different identifiers refer to the same data, i.e., correspond to the same real-world entity (the same hotel, the same person, ...). We have developed a logical and numerical approach named LN2R (L2R + N2R) which is automatic and guided by the semantics of an RDFS+ schema. L2R is logic-based , . In the N2R method, the semantics of the schema is exploited by an informed similarity measure which is used in a numerical computation of the similarity of reference pairs. This numerical computation is expressed as a nonlinear equation system. We have shown on one benchmark dataset that we can obtain better results than a supervised approach. We have also studied the scalability of such approaches , . This work is done in the setting of the Picsel3 project.
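The numerical side can be sketched as a fixpoint iteration on an equation system (nonlinear because of the max): each pair's similarity combines its own label similarity with the similarity of related pairs. This toy version, with hypothetical data and difflib as the base similarity measure, only illustrates the idea behind N2R, not its actual equations:

```python
import difflib

def label_sim(a, b):
    """Base string similarity between two labels."""
    return difflib.SequenceMatcher(None, a, b).ratio()

def reconcile(pairs, related, labels, rounds=20, alpha=0.5):
    """Fixpoint sketch in the spirit of N2R: each pair's similarity mixes
    its own label similarity with the best similarity among related pairs
    (`related[p]` lists pairs whose similarity reinforces p's)."""
    sim = {p: label_sim(labels[p[0]], labels[p[1]]) for p in pairs}
    for _ in range(rounds):
        new = {}
        for p in pairs:
            neigh = max((sim[q] for q in related.get(p, [])), default=0.0)
            new[p] = alpha * label_sim(labels[p[0]], labels[p[1]]) \
                     + (1 - alpha) * neigh
        sim = new
    return sim
```

In LN2R proper, the weights and the propagation structure are derived from the schema semantics (functional properties, disjointness, etc.) rather than fixed by hand.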
This work was initiated in the setting of the e.dot project. We worked on mappings between different taxonomies in order to access several sources from a unique querying system. We explored alignment techniques to generate semantic mappings automatically. The originality of the approach is that it combines terminological, structural, and semantic techniques well suited to the mapping of taxonomies, which are schemas with very poor definitions of concepts, mainly defined with reference to the terminology. A prototype, TaxoMap, finds mappings or suggests indicators to help users find mappings . We continue our work on TaxoMap in the setting of the WebContent project. First, we investigated techniques which rely on an additional source, called background knowledge. We made a comparative analysis of works using background knowledge . We studied the difficulties encountered when using WordNet and showed how the TaxoMap system can avoid these difficulties . Further work has been done on adapting TaxoMap for the Ontology Alignment Evaluation Initiative (OAEI 2007) campaign. We thus participated in the OAEI 2007 campaign , which consists of applying matching systems to ontology pairs and evaluating their results. Moreover, TaxoMap has been tested and evaluated together with OLA, jointly developed by teams at DIRO, University of Montreal, and at INRIA Rhône-Alpes (EXMO group), in the setting of the WebContent project. The following corpora have been chosen: a corpus delivered by EADS in the aeronautics field, the OAEI benchmark test, and AGROVOC-NAL, two very rich thesauri used in the "food" corpus of the OAEI 2007 campaign. These experiments have shown the complementary nature of the two tools and have emphasized two main difficulties: the alignment of very large ontologies and the evaluation of results when no reference mappings are provided.
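A purely terminological matcher, i.e., only one of the ingredients combined in TaxoMap, can be sketched as follows (the concept labels are hypothetical and the threshold is arbitrary):

```python
import difflib

def map_taxonomies(source, target, threshold=0.7):
    """Terminological sketch of taxonomy alignment: propose, for each
    source concept, the most label-similar target concept. TaxoMap also
    exploits structural and semantic techniques; this toy version uses
    labels only."""
    mappings = {}
    for s in source:
        best, score = None, 0.0
        for t in target:
            r = difflib.SequenceMatcher(None, s.lower(), t.lower()).ratio()
            if r > score:
                best, score = t, r
        if score >= threshold:  # only keep sufficiently confident mappings
            mappings[s] = best
    return mappings
```

Concepts whose best score falls below the threshold are left unmapped; in TaxoMap such cases are where indicators are suggested to the user instead.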
The problem of XML query evaluation still poses significant challenges. In particular, the complexity of the XQuery language, standardized by the W3C, makes it very difficult to devise efficient storage and optimization strategies. We have proposed a new language for describing materialized XML views, which can be used to speed up the processing of XML queries. We have devised associated algorithms for rewriting XQuery queries based on this rich view language .
While materialized views can speed up query processing, their practical applicability requires several developments. First, they have to be maintained in the event of updates applied to the underlying documents. The internship of Abhipreet Das (IIT Bombay) focused on proposing algorithms for incrementally propagating updates to the materialized views. Second, view selection may be cumbersome for the user, so automated view selection mechanisms are needed. The internship of Nikhil Pandey (IIT Bombay) led to some work in this area; however, the problem was not fully solved.
The ActiveXML language (AXML in short) allows describing complex distributed data manipulation tasks. Each such task could be executed in many ways producing the same results but with very different performance. We have made important progress in laying out an algebraic formalism for optimizing AXML document evaluation, more precisely on specifying a small set of special Web services dedicated to distributed evaluation and on their usage within the optimizer. The first prototype of an AXML optimizer, OptimAX, has been developed and demonstrated . The optimizer is integrated with a new version of an AXML peer, developed mostly this year by E. Taroza.
Performance evaluation is a natural component in many data-oriented works such as those carried out in Gemo. However, the complexity of the languages we target, such as XQuery, and the complexity of the settings in which our techniques are deployed, such as peer-to-peer systems, make the task of performance evaluation very complex. For instance, in a peer-to-peer XML data management setting, one has to distinguish the impact of the underlying peer network from that of data indexing, of query evaluation algorithms, and finally of the optimizer quality. Benchmarks are essential tools for performance evaluation. We have proposed a benchmark for XML data management in P2P, named P2PTester , designed to ease and systematize the task of performance evaluation. Performance evaluation in the large raises lively discussion; a panel organized at the VLDB conference on this topic received significant attention . Participants agreed on the need for a more thorough procedure both for performing performance evaluations and for ensuring that such evaluations are repeatable.
We started some collaborative work with UCSC and U. Tel Aviv on a framework for Continuous On-Line Tuning (Colt) , a novel self-tuning framework that continuously monitors the incoming queries and adjusts the system configuration in order to maximize query performance. The key idea behind Colt is to gather performance statistics at different levels of detail and to carefully allocate profiling resources to the most promising candidate configurations. Moreover, Colt uses effective heuristics to self-regulate its own performance, lowering its overhead when the system is well tuned and being more aggressive when the workload shifts and it becomes necessary to re-tune the system. We considered the design of the generic Colt system, and its specialization to the important problem of selecting an effective set of indices for a relational query load. We developed an implementation of the proposed framework in the PostgreSQL database system and evaluated its performance experimentally. Our results validate the effectiveness of Colt in self-tuning a relational database, demonstrating its ability to modify the system configuration in response to changes in the query load. Moreover, Colt achieves performance improvements that are comparable to those of more expensive off-line techniques, thus verifying the potential of the on-line approach in the design of self-tuning systems.
We have worked on the optimization of KadoP, a peer-to-peer platform for building and managing warehouses of Web resources. KadoP relies on a Distributed Hash Table implementation (namely, FreePastry) to keep the network of peers connected and to build a shared global resource index, and on the ActiveXML platform to store, query, and maintain the index. Furthermore, KadoP is able to process simple queries over resources distributed in the whole network. A main goal is to be able to index not only extensional XML data but also intensional data, and in particular Web services.
A recent development of the system includes two techniques meant to efficiently handle the long posting lists exchanged during query processing. The first technique relies on a distributed search structure that parallelizes the transfer of long posting lists, while the second makes it possible to reduce the transferred lists at the expense of some precision. These techniques are described in .
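The second technique (trading precision for bandwidth) can be illustrated with a Bloom filter, a standard lossy set summary: transferring the bit array instead of the full posting list may introduce false positives but never false negatives. This illustrates the general idea only, not necessarily the exact structure used in KadoP:

```python
import hashlib

class BloomFilter:
    """Lossy summary of a posting list. Sending the fixed-size bit array
    instead of the full list of document ids saves bandwidth; membership
    tests may report false positives (the precision loss mentioned above)
    but never miss an inserted item."""
    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes = size, hashes
        self.bits = bytearray(size)

    def _positions(self, item):
        # Derive `hashes` independent positions from salted SHA-256.
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item):
        return all(self.bits[p] for p in self._positions(item))
```

A peer can thus filter its local candidates against a remote peer's summarized posting list, sending only probable matches for exact verification.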
We have also participated in the development of a prototype for measuring the performance of P2P queries .
In the context of XML data warehousing, it often happens that different XML representations of the same object appear in the sources. In this context, it becomes necessary to identify common entities in the XML sources and to propose a consolidated version thereof. We have proposed the XClean framework for declaratively specifying data cleaning processes, which are then compiled into XQuery queries . M. Weis has developed a prototype implementing this framework, which has been demonstrated .
This work, which began at the end of 2005, is carried out in the framework of the European project WS-DIAMOND, running until mid-2008. It is well known that self-healing software is one of the challenges for IST research. This project aims to take a step in this direction by developing a framework for self-healing Web services. The goal is to produce:
an operational framework for the self-healing execution of conversationally complex Web services, where monitoring, detection, and diagnosis of anomalous situations, due to functional (in particular semantic) or non-functional errors (e.g., Quality of Service), is carried out and repair/reconfiguration is performed, thus guaranteeing the reliability and availability of Web services;
a methodology and tools for service design that guarantee effective and efficient diagnosability/repairability during execution;
demonstration of these results on real applications.
Our main involvement in this project concerns model-based diagnosis of cooperative Web services, i.e., applying to P2P distributed software systems the techniques developed in Artificial Intelligence and successfully applied to engineered centralized hardware systems. Our two other contributions concern formal models for Web services, since the method rests entirely on the existence of adequate behavioral models against which actual observations are compared, and the study of diagnosability at the design stage, which is the common trend in diagnosis activities across all branches of industry.
During the first two years, the following work has been achieved:
Developing an observation and data log platform for basic Web services.
An extension of the Web service deployment specification (WSDD file) has been defined, allowing the developer to specify, for each operation, which pieces of information to log and the privacy policy governing their accessibility. The standard AXIS deployment platform is enriched with an observation handler generator and an information Web service generator. Each time a basic WS is invoked, its associated information WS is invoked too and records in databases (via an interface with MySQL) all the inputs, outputs, and error messages specified in the WSDD extension, with the given privacy policy. This can be applied to the information WS itself, which is thus self-observed. All these extensions and log capabilities have been implemented in Java. The logged information will be used by the diagnosis algorithm to identify the primary cause(s) of a detected symptom.
Modeling BPEL Web services for diagnosis.
A method has been developed to automatically generate a diagnosis model, in the form of data dependency relations (analogous to dynamic slicing methods in software debugging), for orchestrated complex Web services. BPEL (Business Process Execution Language) basic and structured activities are first modeled with Petri nets, places being used to represent data and transitions to represent activities. For that purpose, control places, in charge of transmitting activation, are added to the data places (in particular, input and output activation places), and read arcs (along which tokens are not propagated) are added to the normal arcs. The operational dependency between transition executions is thus captured. In order to capture data dependency (which is essential for the diagnosis of semantic faults), each transition of the Petri net is enriched with a set of basic data dependency relations expressing that an input is simply forwarded to an output, that an output is created by the operation, or that an output is elaborated from one or several inputs. In order to aggregate such enriched Petri nets, composition rules for these basic relations are defined for different modes (sequential, alternative, hierarchical through data structures). Based on these rules, an algorithm is designed that builds the data dependency model of an orchestrated BPEL service from the analysis of its BPEL code and the models exposed by the private services it invokes. This data dependency model is expressed as a set of propositional Horn clauses that will be used by the diagnosis algorithm.
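The resulting dependency model can be illustrated with a toy sketch: each activity contributes Horn-style rules "output depends on inputs", and diagnosis then asks which data a faulty output transitively depends on. The activity and variable names below are hypothetical, loosely inspired by the Foodshop example:

```python
# Toy data-dependency model: one rule (output, inputs) per activity,
# standing in for the Horn clauses compiled from the enriched Petri net.
# All activity and variable names are hypothetical.
RULES = [
    ("price",    ["order", "catalog"]),    # pricing activity
    ("invoice",  ["order", "price"]),      # invoicing activity
    ("shipment", ["invoice", "address"]),  # shipping activity
]

def dependency_closure(var, rules):
    """All variables the given variable transitively depends on; during
    diagnosis these are the candidate primary causes of a faulty value."""
    deps, frontier = set(), [var]
    while frontier:
        v = frontier.pop()
        for out, ins in rules:
            if out == v:
                for i in ins:
                    if i not in deps:
                        deps.add(i)
                        frontier.append(i)
    return deps
```

A faulty shipment can thus be traced back through the invoice and price to the original order, catalog, and address data, which is exactly the information the diagnoser needs to suspect primary causes.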
The enriched Petri net generator, which takes as input a BPEL code and produces as output its enriched Petri net model in the form of an XML file, and the diagnostic model compiler, which takes as input an enriched Petri net model and produces as output its associated diagnostic knowledge base as a set of causal rules expressed as logical Horn clauses, both in the form of XML files, have been implemented in Java and tested on examples (in particular, the Foodshop service used in the WS-DIAMOND project).
Developing a decentralized diagnostic algorithm
A decentralized on-line diagnostic algorithm for BPEL orchestrated Web services has been designed. It relies on the local diagnostic models of each Web service, built off line as explained above, and on the observations stored on line by the data log platform. Each BPEL service is provided with a local diagnoser that performs local consistency-based diagnosis thanks to the local diagnostic model of the service (initially, a diagnostic session is triggered when a local diagnoser is awakened by an exception raised in its associated Web service). The local output diagnosis is made up of possible local faults, such as input data from users or faulty internal basic Web services (among those invoked by the BPEL service), or of input variables coming from shared variables in another composite Web service. These local diagnosers communicate (in both directions) with a coordinator, in charge of building global diagnoses by merging local ones. The coordinator does not initially have any information about the individual Web services except the variables shared between them, which are obtained off line and are at interface level, thus satisfying privacy requirements. The coordinator tries to prolong each local diagnosis containing a suspected input variable coming from another service by invoking the local diagnoser of that service. In the end, the global diagnoses thus generated are made up of faulty input data from users, faulty internal basic Web services, or faulty interfaces between two Web services (the latter can often be checked for confirmation against logged observations). In fact, the local diagnosers and the coordinator are themselves regarded as Web services communicating via WSDL messages, so the WSDL standard can be used to describe the diagnosis operation offered by a diagnostic Web service.
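The coordination loop can be sketched as follows, in a toy model with invented service and variable names: real diagnosers exchange WSDL messages, plain Java functions stand in for them here, and a suspected shared variable owned by another service is written "service:variable".

```java
import java.util.*;
import java.util.function.Function;

/** Toy coordinator: each local diagnoser maps a suspected variable to a
 *  set of local explanations; explanations of the form "svc:var" point
 *  at a shared variable owned by another service, which the coordinator
 *  follows by invoking that service's diagnoser. */
public class Coordinator {
    private final Map<String, Function<String, Set<String>>> diagnosers = new HashMap<>();

    public void register(String service, Function<String, Set<String>> d) {
        diagnosers.put(service, d);
    }

    /** Prolong local diagnoses across services until every explanation is
     *  an elementary fault (user input, internal service, or interface). */
    public Set<String> globalDiagnosis(String service, String faultyVar) {
        Set<String> global = new TreeSet<>();
        Set<String> visited = new HashSet<>();
        Deque<String[]> todo = new ArrayDeque<>();
        todo.push(new String[]{service, faultyVar});
        while (!todo.isEmpty()) {
            String[] t = todo.pop();
            if (!visited.add(t[0] + ":" + t[1])) continue;
            for (String expl : diagnosers.get(t[0]).apply(t[1])) {
                int sep = expl.indexOf(':');
                if (sep >= 0 && diagnosers.containsKey(expl.substring(0, sep))) {
                    // shared variable: ask the owning service's diagnoser
                    todo.push(new String[]{expl.substring(0, sep), expl.substring(sep + 1)});
                    global.add("interface(" + expl + ")");
                } else {
                    global.add(expl); // elementary fault
                }
            }
        }
        return global;
    }

    public static void main(String[] args) {
        Coordinator c = new Coordinator();
        c.register("shop", v -> v.equals("invoice")
                ? Set.of("billActivity", "warehouse:stock") : Set.of());
        c.register("warehouse", v -> v.equals("stock")
                ? Set.of("userInput") : Set.of());
        System.out.println(c.globalDiagnosis("shop", "invoice"));
    }
}
```

Note how the coordinator only ever sees shared variables at interface level, mirroring the privacy property of the real architecture.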
Up to now, the local diagnosers and the coordinator have been implemented as Java objects, i.e., as basic Web services, interfaced with the data log platform; they are currently being tested on applications such as the Foodshop service.
In 2007, this work has been published in , , . Direct continuations of this work will include implementing the diagnostic coordinator as a BPEL Web service, extending the diagnostic architecture to the case of choreographed Web services, and testing the whole on real examples. Notice that the thesis work just begun by Vincent Armant on distributed diagnosis in a peer-to-peer framework is expected to be tested later with the local diagnostic knowledge bases of Web services produced here, in order to provide a completely distributed monitoring and diagnosis platform for Web services. Another related line of work that has just begun is the study of diagnosability (and recoverability) properties at design stage. The aim is to define formal properties of a discrete-event model, together with a predefined set of non-observable faulty events and a predefined set of observable events, expressing that a given fault will always be detectable or that two given faults will always be discriminable, and then to design algorithms to check these properties off line on the model. These criteria and checking methods will be adapted to the study of Web services diagnosability and recoverability, and a methodology for designing Web services applications that respect these criteria will be developed.
We have worked on the design and implementation of tools for monitoring peer-to-peer systems.
A system named P2PMonitor has been developed for this purpose. It is a P2P system itself, with peers exchanging messages through Web service calls. The system is based on alerters: software modules placed on monitored peers, in charge of the surveillance of particular types of events (e.g., Web service calls, database updates). They produce streams of (Active)XML data. Our system implements an algebra over data streams. A declarative language allows the user to specify the complex events of interest and the way notifications about these events should be created and sent to her. The system is in charge of choosing the best execution plan and of placing the processors on peers. This work has been published in and .
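As a rough, invented illustration of the flavor of such an algebra (the actual P2PMonitor operators work over streams of Active XML data), here are merge and selection operators over simple event streams:

```java
import java.util.*;
import java.util.function.Predicate;
import java.util.stream.*;

/** Toy stream algebra: monitoring queries are composed from operators
 *  over event streams, e.g. merging the streams produced by several
 *  alerters and selecting events by a condition. Names are invented. */
public class StreamAlgebra {
    public record Event(String peer, String type, String payload) {}

    /** Union of several alerter streams. */
    public static Stream<Event> merge(List<Stream<Event>> inputs) {
        return inputs.stream().flatMap(s -> s);
    }

    /** Selection: keep only events matching a condition. */
    public static Stream<Event> filter(Stream<Event> in, Predicate<Event> cond) {
        return in.filter(cond);
    }

    public static void main(String[] args) {
        // two alerters, one watching Web service calls, one database updates
        var out = filter(
            merge(List.of(
                Stream.of(new Event("p1", "wscall", "a")),
                Stream.of(new Event("p2", "update", "b")))),
            e -> e.type().equals("update")).toList();
        System.out.println(out);
    }
}
```

In the real system such operator trees are compiled from the declarative language, and the operators themselves are placed on peers by the optimizer.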
A subject related to monitoring is view maintenance over active documents. Indeed, the monitoring problem can be seen as aggregating streams into an active document and incrementally evaluating a tree-pattern query over this document. We have developed algorithmic, datalog-based foundations for such incremental query processing; this work has been published in .
A paper presenting a demonstration scenario for the monitoring system, integrating view maintenance for active documents as a way of defining complex monitoring tasks, has been published in .
We introduce in a new method, named the Green method, for finding nodes semantically related to a given node in a hyperlinked graph, based on classical Markov chains. It is generic, adjustment-free and easy to implement. We test it on the hyperlink structure of the English version of an on-line encyclopedia, namely Wikipedia, and present an extensive study comparing its performance with that of several other classical methods. The Green method is found to have both the best average results and the best robustness.
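The following toy sketch (graph, damping value and iteration count are all invented for the example) illustrates the underlying idea of ranking nodes by the mass of a Markov-chain walk repeatedly restarted at the seed node; the actual Green method is defined via Green measures of the chain and involves a normalization not shown here.

```java
import java.util.*;

/** Toy seed-rooted random walk on a directed graph given as an edge
 *  list: at each step a damping fraction of the mass follows out-links
 *  uniformly, the rest is re-injected at the seed. Nodes with more
 *  accumulated mass are considered more related to the seed. */
public class GreenSketch {
    public static Map<Integer, Double> score(List<int[]> edges, int n, int seed,
                                             double damping, int iterations) {
        List<List<Integer>> out = new ArrayList<>();
        for (int i = 0; i < n; i++) out.add(new ArrayList<>());
        for (int[] e : edges) out.get(e[0]).add(e[1]);
        double[] p = new double[n];
        p[seed] = 1.0;
        for (int it = 0; it < iterations; it++) {
            double[] q = new double[n];
            q[seed] = 1.0 - damping; // restart mass at the seed
            for (int u = 0; u < n; u++)
                for (int v : out.get(u))
                    q[v] += damping * p[u] / out.get(u).size();
            p = q;
        }
        Map<Integer, Double> res = new HashMap<>();
        for (int i = 0; i < n; i++) res.put(i, p[i]);
        return res;
    }

    public static void main(String[] args) {
        // tiny chain 0 <-> 1 <-> 2, seeded at node 0
        var edges = List.of(new int[]{0, 1}, new int[]{1, 0},
                            new int[]{1, 2}, new int[]{2, 1});
        System.out.println(score(edges, 3, 0, 0.5, 30));
    }
}
```

On this chain, node 1 (a direct neighbor of the seed) accumulates more mass than node 2, which is reachable only through node 1.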
In , we review a number of classical text mining approaches to synonym extraction over different kinds of corpora. We also introduce a graph mining technique, closely inspired by Kleinberg's hubs and authorities, that discovers related words in monolingual dictionaries, and we discuss the deeper relations between classical text mining problems and graph mining.
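A schematic version of the mutually reinforcing hubs-and-authorities iteration, on a tiny invented link matrix (in the dictionary setting, word i would link to word j when j occurs in the definition of i):

```java
import java.util.*;

/** Kleinberg-style iteration: an authority is pointed to by good hubs,
 *  a hub points to good authorities; both score vectors are iterated
 *  and renormalized until they stabilize. */
public class HitsSketch {
    /** Returns {hub, authority} score vectors after the given iterations. */
    public static double[][] hits(boolean[][] link, int iterations) {
        int n = link.length;
        double[] hub = new double[n], auth = new double[n];
        Arrays.fill(hub, 1.0);
        for (int it = 0; it < iterations; it++) {
            for (int j = 0; j < n; j++) {   // authority update
                auth[j] = 0;
                for (int i = 0; i < n; i++) if (link[i][j]) auth[j] += hub[i];
            }
            for (int i = 0; i < n; i++) {   // hub update
                hub[i] = 0;
                for (int j = 0; j < n; j++) if (link[i][j]) hub[i] += auth[j];
            }
            normalize(auth);
            normalize(hub);
        }
        return new double[][]{hub, auth};
    }

    private static void normalize(double[] v) {
        double s = 0;
        for (double x : v) s += x * x;
        s = Math.sqrt(s);
        if (s > 0) for (int i = 0; i < v.length; i++) v[i] /= s;
    }

    public static void main(String[] args) {
        // words 0 and 1 both use word 2 in their definitions; 2 uses 0
        boolean[][] link = {{false, false, true},
                            {false, false, true},
                            {true,  false, false}};
        double[][] ha = hits(link, 10);
        System.out.println(Arrays.toString(ha[1])); // authority scores
    }
}
```

Word 2, used in the definitions of both other words, comes out as the strongest authority, which is the intuition behind using this iteration to surface related words.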
Gemo had technical meetings in 2006 with many industrial partners, in particular France Telecom R&D, Xyleme and Mandrakesoft, as well as with national organizations, in particular the Institut National de la Recherche Agronomique (INRA).
The MediaD project aims at designing a declarative environment, SomeWhere, for building peer-to-peer data management systems based on a simple data model: propositional logic. A peer-to-peer data management system is a valuable alternative to a centralized information integration system such as a mediator when the number of sources to be integrated becomes huge: building a global mediated schema coping with all the sources' peculiarities is hardly possible and inefficient.
The goal of the MediaD project is to deploy very large applications that scale to thousands of peers. It is organized in two tracks. The first is to study query answering, possibly in the presence of inconsistency. The second is to develop techniques for the cooperative statement of mappings that relate the knowledge of the different peers within the peer-to-peer data management system.
This project is the continuation of PICSEL2 on scaling up to the Web the mediator approach that has been implemented in PICSEL1.
The goal is twofold. First, it aims at automating the construction of wrappers, which translate user queries into the query language accepted by each source and return answers from the sources in the language of the mediator; this work is concerned with the mediation of ontologies. Second, we are interested in reference reconciliation, i.e., identifying when different references in a data set correspond to the same real-world entity.
EDOS is a research project funded by the European Commission as a STREP project under the IST activities of the 6th Framework Programme. The project involves universities (Paris 7, Tel Aviv, Geneva), INRIA (Gemo and Cristal teams), research centers (CSP Torino) and private companies (Mandriva, Caixa Magica, Nexedi, Nuxeo, Edge-IT). It is centered on software management, and more particularly on the Mandriva Linux distribution.
In the EDOS project, the Gemo group focuses on improving the process of data distribution of open source software, a challenging issue because of the scale of the distribution (large number and size of files), its dynamicity, the need for replication for better performance, and the autonomy of actors.
The goal is to build a P2P distribution system that improves the classical approach based on hierarchies of mirrors, by providing a better sharing of resources. The system combines the functionalities of content (software) distribution with the idea of exchanging XML data in a P2P environment, in our case metadata about the software modules to be distributed. Metadata includes identifiers (name, version), static properties (size, license, summary, etc.) and dynamic properties of software modules (composition, replica locations, statistics about the distribution process, etc.).
We defined the P2P system architecture, based on three categories of actors: Publishers (which introduce new content into the system), Mirrors (trusted peers) and Clients (end users). Peers are organized in two sub-networks: the indexing network, composed of trusted peers (Publishers and Mirrors), which stores the distributed index over metadata, and the distribution network, composed of all peers, which stores content replicas. The system's software architecture is based on a Java API implementing content distribution functionalities at several abstraction levels: publishing of new content, metadata indexing and querying, subscription to thematic distribution channels and event notification, and download in flash-crowd (one source, many simultaneous requests) and off-peak situations (many sources, content updates).
The project ended successfully in September 2007. The effort in the last period was directed to the consolidation of the system, several optimizations, the integration of security mechanisms, the development of an advanced GUI, and an evaluation on the Grid'5000 platform. The EDOS content distribution system has been published as an open source project on the INRIA Gforge site (
http://
The WebContent project (
http://
WS-DIAMOND (“Web Services - DIAgnosability, MONitoring and Diagnosis”) is a FP6 European project (FET Open STREP) which started on Sept. 1st, 2005 and will last until Feb. 29th, 2008. EU funding for University Paris-Sud is 188 kEuros. The project is coordinated by the University of Turin, and involves the Polytechnic University of Milan, the Vrije University of Amsterdam, the University of Vienna, the University of Klagenfurt, and, from France, the LAAS-CNRS, the University of Rennes 1, and the University of Paris-Sud. Participants from Gemo are Philippe Dague (site leader for U. Paris-Sud), Tarek Melliti (post-doc from Oct. 1st, 2005 to Aug. 31st, 2006, assistant professor at U. of Evry from Sep. 1st, 2006), Yingmin Li (master internship from April 1st, 2006 to Sept. 30th, 2006, Ph.D. student from Oct. 2006), Lina Ye (master internship from March 19th, 2007 to Sept. 18th, 2007, Ph.D. student from the end of Sept. 2007), Laura Brandan Briones (post-doc from May 2007) and Omar Aaouatif (engineer internship from March 5th, 2007 to June 4th, 2007).
In France, close links exist with groups at Orsay (databases, V. Benzaken and N. Bidoit; bio-informatics, C. Froidevaux; machine learning, M. Sebag), with the Cedric Group at CNAM-Paris; some INRIA groups (Atlas, P. Valduriez, DistribCom, A. Benveniste, at INRIA-Bretagne, Exmo, J. Euzenat, at INRIA Rhone-Alpes, Mostrare at INRIA Futurs Lille); the BIA group at INRA (P. Buche, C. Dervin), the GRIMM of the University of Toulouse Le Mirail (O. Haemmerlé), the LIRIS of the University of Lyon 1 (M. Hacid), the LIRMM of the University of Montpellier (M. Chein, M-L. Mugnier), the LI of the University of Tours (G. Venturini), and the UMPA at École normale supérieure de Lyon (Y. Ollivier).
DocFlow is a research project supported by the ANR Masses de données (2007-2009) with the Distribcom team at INRIA-Rennes (Albert Benveniste) and the Méthodes Formelles group at Labri-Bordeaux (Anca Muscholl). The topic is the analysis, monitoring, and optimization of Web documents and services. It builds on Active XML, a formalism for data exchange across peers developed by Gemo. The project aims at achieving a convergence of data and workflow management over the Web through the concept of active peer-to-peer documents.
TraLaLa stands for XML Transformation Languages: logic and applications. It is funded by the ACI (Action Concertée Incitative) Masses de Données; it started in September 2004 and ended during the summer of 2007. The setting is the integration and manipulation of massive data in XML format. We are interested more specifically in programming and querying language aspects: expressivity, typing, optimization. We are also interested in studying how this can be done in a context where documents are compressed, or in a streaming scenario. The home page of the project can be found at:
http://
This ACI started in 2005 and was planned to last three years. It is a collaboration between Benjamin Nguyen (University of Versailles) and François-Xavier Dudouet (CNRS, Laboratoire IRISES). The project has completed this year, but the work carried out has been merged into (and continues through) the WebStand project (see below).
The objective of this ANR, which started in 2006, is to analyze the problems surrounding the use of semi-structured databases in the social sciences. This ANR brings together both computer science and sociology laboratories. Work done in Gemo contributing to WebStand includes XML data cleaning , and work on the automatic selection and maintenance of materialized XML views. The joint work of the consortium has led to a publication in a social sciences conference .
SHIRI is a research project funded by the Ile de France region as a Digiteo project which started on Oct. 1st 2007 and will last until Sept. 30th, 2011. It involves two partners of Digiteo, Supelec and the University of Paris-Sud. The aim of SHIRI is to design an annotation system to improve the relevance of the search on the Web when resources contain both semi-structured and textual data.
In Europe, close links exist with University of Dortmund (T. Schwentick), University of Athens (M. Vazirgiannis), University of Madrid (A. Gomez-Perez), University of Manchester (I. Horrocks), University of Rome (M. Lenzerini).
Particular projects that we conduct are detailed next.
NGWeMiS (Next Generation Web Mining and Searching) is a project led by M. Vazirgiannis (U. Athens). The project lies in the area of knowledge extraction and management over the massive and heterogeneous document collections of the World Wide Web. Its main objective is the design of guidelines and the development of prototypes for next generation web mining and searching techniques based on the P2P paradigm. The innovation lies in the use of the P2P paradigm at the various levels of web content management and searching; the study and development of novel similarity measures among web documents that take into account multiple facets, including structure and semantics; and the clustering of web data and metadata taking into account their P2P organization.
Procope
Gemo has a PHC-Procope project with the database group of Thomas Schwentick at Dortmund University, Germany. The project will end in 2008. Its goal is to work on verification and queries in the presence of data values. It has already produced several joint papers between the two groups.
Polonium
Gemo has a PHC-Polonium project with the group of Slawomir Lasota at Warsaw University, Poland. The project will stop at the end of 2007. Its goal is to work on verification and queries in the presence of data values. It has already produced several joint papers between the two groups.
Van-Gogh
Gemo has a PHC-Van-Gogh project with the group of Maarten Marx at Amsterdam University, The Netherlands. The project will stop at the end of 2007. Its goal is to work on the expressive power and performance of XML query languages.
TARGET
Gemo started a cooperation with the University of Luxembourg in November 2005, which led to a PhD in co-tutelle with Paris-Sud university. The PhD project is TARGET, for opTimal Adaptive infoRmation manaGemEnT over the web. It aims at improving web information retrieval by integrating web data evolution, users' knowledge evolution and search domain evolution. The PhD student is Cedric Pruski.
University of Oxford
Gemo started a collaboration with Georg Gottlob from the University of Oxford on the definition of the Match operator in data exchange. This collaboration led to a three-month stay of Pierre Senellart at the University of Oxford.
Gemo started a cooperation with Gaston Berger University last year: a PhD in co-tutelle with Paris-Sud university started in December 2006. The subject of the thesis is the integration of semi-structured data for information retrieval. The PhD student is Mouhamadou Thiam.
Close links exist with University of Tel-Aviv (T. Milo).
Close links also exist with UC Santa Cruz (N. Polyzotis), Rutgers University (A. Borgida), and Google Research (O. Benjelloun).
Since 2003, Gemo and the data management group at the University of California at San Diego (V. Vianu, A. Deutsch, Y. Papakonstantinou) have formed an associated team funded by INRIA International. This association is expected to last until the end of 2008. Victor Vianu and Ravi Vijay, a PhD student from UCSD, spent 3 months in Gemo this summer. Bogdan Cautis spent 1 week in San Diego. The home page of GemSaD can be found at
http://
This year the following professors visited Gemo:
Tova Milo, professor at the University of Tel-Aviv (in February)
Neoklis Polyzotis, professor at the University of California, Santa Cruz (in September)
Victor Vianu, professor, UC San Diego (July to September)
Yuhong Yan, research officer, NRC Canada, IIT, Fredericton (in December)
The following PhD students came for internships in the group: Ravi Vijay [UCSD, USA; 2 months, PhD internship].
The following PhD theses were defended in 2007:
Andrei Arion, XML Access Modules: Towards Physical Data Independence in XML Databases.
Bogdan Cautis, Signing and Reasoning about Tree Updates.
Antonella Poggi (with Universita' degli Studi di Roma "La Sapienza"), Structured and Semi-structured Data Integration.
Fatiha Saïs, Semantic Data Integration guided by an Ontology.
Mathias Samuelides, Tree walking automata with pebbles.
Pierre Senellart, Understanding the Hidden Web.
Gemo project members have co-chaired scientific events:
S. Abiteboul has co-chaired the International Workshop on Data and Service Integration (SDIS'07), in cooperation with VLDB.
I. Manolescu has co-chaired the 10th International Workshop on the Web and Databases (WebDB 2007), in cooperation with ACM SIGMOD.
Members of the project have participated in program committees:
S. Abiteboul
World Wide Web Conference (WWW07)
International Conference on Very Large Databases (VLDB'07)
International Workshop on Web Information and Data Management (WIDM'07)
World Wide Web Conference (WWW08)
ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS 2008)
Journées Francophones de Bases de Données Avancées 2007
P. Chatalic
Journées Francophones de Programmation par Contraintes (JFPC 2007)
Ph. Dague
20th International Joint Conference on Artificial Intelligence (IJCAI) 2007
18th International Workshop on Principles of Diagnosis (DX) 2007
21st International Workshop on Qualitative Reasoning (QR) 2007
I. Manolescu
33rd Very Large Databases Conference (VLDB 2007)
Conference on Information and Knowledge Management (CIKM 2007)
Web Information and Data Management workshop (WIDM 2007), in cooperation with the CIKM conference
Experimental Evaluation in Databases (ExpDB 2007) workshop, in cooperation with the ACM SIGMOD conference
Journées Francophones de Bases de Données Avancées 2007
F. Goasdoué
16èmes congrès francophone Reconnaissance des Formes et Intelligence Artificielle (RFIA08)
C. Reynaud
Third workshop on Context and Ontology Representation and Reasoning (C&O:RR-2007)
16èmes congrès francophone Reconnaissance des Formes et Intelligence Artificielle, member of the editorial board (RFIA08)
Conférence Extraction et Gestion des Connaissances (EGC07)
17èmes Journées Francophones d'Ingenierie des connaissances (IC07)
1ères Journées Francophones sur les ontologies (JFO2007)
Atelier Modélisation des connaissances (EGC07)
Atelier Ontologies et Gestion de l'Hétérogénéité Sémantique (OGHS'07)
Atelier Ontologies et Textes (TIA07)
M-C. Rousset
International Joint Conference on Artificial Intelligence 2007
International Semantic Web Conference 2007
Atelier Modélisation des connaissances (EGC07)
Atelier Modélisation des connaissances (EGC08)
European Semantic Web Conference 2008
Fatiha Sais
Manifestation des Jeunes Chercheurs en Sciences et Technologies de l'Information et de la Communication (MajecSTIC 2007)
L. Segoufin
ACM Symposium on Principles of Database Systems (PODS'07)
EACSL Conference for Computer Science Logic (CSL'07)
P. Senellart
Text Mining Workshop 2007
L. Simon
International Conference on Theory and Applications of Satisfiability Testing (SAT 2007)
Journées Francophones de Programmation par Contraintes (JFPC 2007)
M. Vazirgiannis
International Conference on User Modeling (UM 2007)
D. Vodislav
Journées Francophones de Bases de Données Avancées 2007
Serge Abiteboul was an invited speaker at the Symposium on Theoretical Aspects of Computer Science, STACS 2007 . He was an invited speaker at the PhD Student Workshop of SIGMOD 2007, where he spoke on “Life in Academia”. He was also invited to the Dagstuhl Seminar on Programming Paradigms for the Web (2007).
Marie-Christine Rousset gave a tutorial at BDA 2007 on “Building scalable semantic peer-to-peer data management systems: the SomeWhere approach”. She gave an invited lecture at the ACAI 2007 Summer School on “Logic-based techniques for information integration”.
Editors
F. Goasdoué
Guest editor of a special issue of Technique et Science Informatiques (TSI) on the Semantic Web, Hermès-Lavoisier.
Member of the reading committee of the book Semantic Web Methodologies for E-Business Applications: Ontologies, Processes and Management Practices, Idea Group Publishing (scheduled for publication in 2008).
I. Manolescu
Guest editor of a special issue of the Elsevier Journal of Information Systems on Performance Evaluation in Database Systems.
C. Reynaud
Journal Electronique d'IA de l'AFIA (JEDAI).
Revue Information - Interaction - Intelligence (RI3).
Revue des Nouvelles Technologies de l'Information, Special issue "Fouille du Web" (RNTI).
M-C. Rousset
Interstices (an on-line popular-science journal on computer science research):
http://
AI Communications (AICOM)
Electronic Transactions on Artificial Intelligence (ETAI) (for the areas: Concept-based Knowledge Representation and Semantic Web).
Revue Information - Interaction - Intelligence (I3)
L. Simon
Member of the Editorial Board of JSAT (the Journal on Satisfiability, Boolean Modeling and Computation)
Guest Editor of a Special Issue of JSAT on SAT 2006 Competitions and Evaluations.