Section: New Results

Semantic Data Integration

Participants: Michel Leclère, Michel Chein, Madalina Croitoru, Rallou Thomopoulos, Léa Guizol.

It often happens that different references (i.e., data descriptions), possibly coming from heterogeneous data sources, concern the same real-world entity. In such cases, it is necessary (i) to detect whether different data descriptions really refer to the same real-world entity and (ii) to fuse them into a single representation. Since the seminal paper [66], this issue has been studied under various names: “record linking”, “entity resolution”, “reference resolution”, “de-duplication”, “object identification”, “data reconciliation”, etc., mostly in databases (cf. the bibliography by William E. Winkler, http://www.hcp.med.harvard.edu/statistics/survey-soft/docs/WinklerReclinkRef.pdf). It has become one of the major challenges in the Web of Data, where the objective is to link data published on the web and to process them as a single distributed database. Most entity resolution methods are based on classification techniques; Fatiha Saïs, Nathalie Pernelle and Marie-Christine Rousset proposed the first logical approach [68]. Many experiments on public data are underway, in France (cf. the DataLift (http://datalift.org/) and ISIDORE (http://www.rechercheisidore.fr/) projects) and internationally (e.g., the VIAF project (Virtual International Authority File, http://www.oclc.org/research/activities/viaf/) led by OCLC (Online Computer Library Center, http://www.oclc.org), which aims to interconnect authority files coming from 18 national organizations).

Two years ago, we began a collaboration with ABES (National Bibliographic Agency for Universities, which takes part in the VIAF project). The aim of this collaboration is to enable the publication of ABES metadata bases on the Web of Data and to provide an identification service dedicated to bibliographic notices. ABES bibliographic bases, and more generally document metadata bases, appear to be a particularly well-suited application domain for the representation and reasoning formalisms developed by the team. This work has an interdisciplinary dimension, as it also requires experts in the Library and Information Science domain. We think that a logical approach can provide a generic solution for entity resolution in document metadata bases, even though it is generally admitted in Library and Information Science that “there is no single paradigmatic author name disambiguation task—each bibliographic database, each digital library, and each collection of publications, has its own unique set of problems and issues” [69].

SUDOC Metadata Formalization

The first step of the collaboration with ABES was to formalize the SUDOC catalogue, which contains the bibliographic notices of all French academic libraries, into a knowledge base using a suitable knowledge representation and reasoning language. This first required analyzing the SUDOC content, as well as document description standards (CRM-CIDOC, FRBR, Dublin Core). We then designed an ontology expressed in the Semantic Web languages RDFS + OWL, compatible with document description standards, as well as translations from any SUDOC set of notices into a set of RDF facts according to this ontology. These translations have been implemented, which makes it possible to export SUDOC bases into Semantic Web formats. Moreover, using the second translation mentioned above (RDFS to CG), we are now able to import SUDOC bases into our tools CoGUI + CoGITaNT.
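The translation of a notice into RDF facts can be illustrated by the following minimal sketch. It is not the ontology or the translation actually designed for SUDOC: the notice fields (`id`, `title`, `authors`), the `http://example.org/sudoc-sketch#` namespace and the `BibliographicNotice` class are hypothetical, and only the Dublin Core element IRIs and `rdf:type` are real vocabulary terms.

```python
from typing import Any, Dict, List, Tuple

Triple = Tuple[str, str, str]

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DC = "http://purl.org/dc/elements/1.1/"   # Dublin Core elements (real vocabulary)
EX = "http://example.org/sudoc-sketch#"   # hypothetical namespace, for illustration only


def iri(value: str) -> str:
    """Wrap an IRI in angle brackets, as in N-Triples."""
    return f"<{value}>"


def literal(value: str) -> str:
    """Serialize a plain string literal, escaping backslashes and double quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def notice_to_triples(notice: Dict[str, Any]) -> List[Triple]:
    """Translate one notice (hypothetical field names) into RDF triples."""
    subject = iri(EX + "notice/" + str(notice["id"]))
    triples: List[Triple] = [
        (subject, iri(RDF_TYPE), iri(EX + "BibliographicNotice"))
    ]
    if "title" in notice:
        triples.append((subject, iri(DC + "title"), literal(str(notice["title"]))))
    for name in notice.get("authors", []):
        triples.append((subject, iri(DC + "creator"), literal(str(name))))
    return triples


def to_ntriples(triples: List[Triple]) -> str:
    """Render triples one per line in N-Triples syntax."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)
```

Calling `to_ntriples(notice_to_triples(notice))` on a dictionary with these fields yields one line per fact, which is the kind of output a Semantic Web export would then hand over to other tools.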

  • Technical report [40] .

Implementation of an Entity Identification Service

In order to perform entity resolution (with entities restricted to "authors" for now), we have defined a set of rules for enriching SUDOC descriptions; then, using the enriched descriptions, authors can be classified according to a proximity criterion. A prototype providing this service has been implemented on top of CoGUI. Experiments are currently being conducted in the context of the SudocAd project, run jointly by ABES and GraphIK. SudocAd aims at enriching the author field of a bibliographic record describing a document with links to SUDOC authorities referring to the authors of that document. A general description of the implemented approach, an analysis of this approach on a representative sample of bibliographic records, and first results on 13,400 bibliographic records extracted from a corpus independent of the SUDOC catalogue are presented in the final report of SudocAd.
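The two-stage principle described above (enrich descriptions with rules, then classify authors by proximity) can be sketched as follows. This is only an illustration under stated assumptions, not the prototype built on CoGUI: the two enrichment rules, the attribute names, the Jaccard-based proximity measure, the thresholds and the labels "near", "far" and "undetermined" are all hypothetical.

```python
import re
from typing import Callable, Dict, List, Set

Description = Dict[str, Set[str]]
Rule = Callable[[Description], None]


def rule_name_key(desc: Description) -> None:
    """Hypothetical rule: derive an order-insensitive key from each name."""
    for name in desc.get("name", set()):
        tokens = re.findall(r"\w+", name.lower())
        desc.setdefault("name_key", set()).add(" ".join(sorted(tokens)))


def rule_decade(desc: Description) -> None:
    """Hypothetical rule: derive publication decades from four-digit years."""
    for year in desc.get("year", set()):
        if year.isdigit() and len(year) == 4:
            desc.setdefault("decade", set()).add(year[:3] + "0")


def enrich(desc: Description, rules: List[Rule]) -> Description:
    """Apply every rule to a copy of the description and return the copy."""
    enriched = {key: set(values) for key, values in desc.items()}
    for rule in rules:
        rule(enriched)
    return enriched


def jaccard(a: Set[str], b: Set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def proximity(a: Description, b: Description, attributes: List[str]) -> float:
    """Assumed criterion: mean Jaccard similarity over the chosen attributes."""
    scores = [jaccard(a.get(k, set()), b.get(k, set())) for k in attributes]
    return sum(scores) / len(scores) if scores else 0.0


def classify(record_author: Description, authorities: Dict[str, Description],
             attributes: List[str], near: float = 0.6,
             far: float = 0.2) -> Dict[str, str]:
    """Label each candidate authority relative to the record's author."""
    labels: Dict[str, str] = {}
    for auth_id, auth in authorities.items():
        score = proximity(record_author, auth, attributes)
        if score >= near:
            labels[auth_id] = "near"
        elif score <= far:
            labels[auth_id] = "far"
        else:
            labels[auth_id] = "undetermined"
    return labels
```

Keeping enrichment separate from comparison means that new rules can be added without touching the proximity criterion, and an intermediate label leaves room for cases where the available evidence is inconclusive.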

Finally, we have defined an extension of our own logical framework (existential rules, constraints, homomorphism-based mechanisms) based on Hector J. Levesque and Gerhard Lakemeyer's Standard Names [62], together with the notion of faithfulness of a knowledge base with respect to the entity resolution problem (intuitively, the fact that the knowledge base is unambiguous). This work is still ongoing.

  • Research Report [38] .