Section: New Results

Semantic Data Integration

Participants: Michel Chein, Madalina Croitoru, Léa Guizol, Michel Leclère.

It often happens that different references (i.e., data descriptions), possibly coming from heterogeneous data sources, concern the same real-world entity. In such cases, it is necessary (i) to detect whether different data descriptions really refer to the same real-world entity and (ii) to fuse them into a unique representation. Since the seminal paper [59], this issue has been studied under various names: “record linkage”, “entity resolution”, “reference resolution”, “de-duplication”, “object identification”, “data reconciliation”, etc., mostly in the database community (cf. the bibliography by William E. Winkler [60]). It has become one of the major challenges in the Web of Data, where the objective is to link data published on the web and to process them as a single distributed database.

We investigate this problem in the specific context of bibliographic databases. Indeed, people working on bibliographic information systems have a long tradition of using norms and have integrated, alongside collections of document notices (e.g., bibliographic records), collections of authority notices that categorize the different named entities used to describe documents (people, organizations, places, ...). In current databases, document notices do not directly use the names of named entities to fill a particular field (author, editor, ...), but rather the unique identifier of the authority notice representing that named entity.

In past years, we began a collaboration with ABES (National Bibliographic Agency for Universities) to develop a method and a prototype for entity resolution between, on one hand, the authors of a new bibliographic record and, on the other hand, the authority references of an authority catalog (namely the Sudoc catalogue of the ABES agency). The prototype providing this service has been implemented on top of Cogui, and experiments have been carried out in the context of the SudocAd project (jointly conducted by ABES and GraphIK).

Our proposed method can be stated as follows: first, enrich authority records with knowledge extracted from the bibliographic records in which the authority is mentioned; then, use logical rules, which conclude on different levels of reconciliation, to compare the authors of a new bibliographic record with the enriched authority records; finally, for each author of the new bibliographic record, order the authority identifiers by level of reconciliation.

  • Work published in [30].
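The three steps above can be sketched as follows. This is a minimal illustration, not the actual SudocAd implementation: the record fields, the rules, and the reconciliation levels are all simplifying assumptions.

```python
# Illustrative sketch of the three-step reconciliation method:
# 1. enrich authority records from linked bibliographic records,
# 2. apply rules concluding on a reconciliation level,
# 3. order authority identifiers by level.
# Field names, rules and levels are assumptions, not the SudocAd rules.

LEVELS = ["certain", "probable", "possible", "unrelated"]  # best level first

def enrich(authority, bib_records):
    """Step 1: gather co-author names and domains from the bibliographic
    records already linked to this authority."""
    linked = [r for r in bib_records if authority["id"] in r["author_ids"]]
    authority["coauthors"] = {name for r in linked for name in r["authors"]}
    authority["domains"] = {d for r in linked for d in r.get("domains", [])}
    return authority

def reconciliation_level(author, record, authority):
    """Step 2: illustrative rules concluding on a reconciliation level."""
    same_name = author == authority["name"]
    shared_coauthors = bool(set(record["authors"]) & authority["coauthors"] - {author})
    shared_domain = bool(set(record.get("domains", [])) & authority["domains"])
    if same_name and shared_coauthors:
        return "certain"
    if same_name and shared_domain:
        return "probable"
    if same_name:
        return "possible"
    return "unrelated"

def rank_candidates(author, record, authorities):
    """Step 3: order candidate authority identifiers by reconciliation level."""
    return sorted(((reconciliation_level(author, record, a), a["id"])
                   for a in authorities),
                  key=lambda pair: LEVELS.index(pair[0]))
```

For a new record sharing a co-author with one candidate authority and only a name with another, the first candidate is ranked “certain” and the second “possible”.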

A problem with this approach is that it relies upon pre-established links between bibliographic records and authority notices. However, our experiments and evaluation have shown that many existing links were erroneous, and thus led to the propagation of new linkage errors. We have therefore begun to work on methods and tools to repair linkage errors in bibliographic databases. This year, this work has been pursued along three different axes:

  1. We have built a formal framework for evaluating the quality of links in a document database. We propose two different “quality” notions, based upon an identification predicate id and a differentiation predicate di between pairs of authority notice identifiers (these predicates can be either given by an expert or computed using rules). We first introduced the notion of a well-founded database, in which id is an equivalence relation and di is its complement. This property can be checked using logical inferences and combinatorial techniques. In the general case, where a database is not necessarily well-founded, we have proposed different distances from the database to a well-founded one. We have also introduced a more complex quality criterion that corresponds to stability under substitution (a fundamental property of logical equality that is not necessarily satisfied by id).

    • A research report should lead to a publication in 2014.

  2. We have developed a methodology for detecting and fixing linkage errors, based upon a method for clustering the authors appearing in bibliographic records. Last year, the general schema of the methodology was defined. It relies on a set of criteria that allow us to cluster “similar” authors together. Each criterion represents a point of view on the author: name, publication time span, publication domain, etc. This year, two aggregation semantics for such criteria have been developed, implemented and evaluated.

    • Work published at AI-SGAI 2013 [34].

  3. We have studied methods for automatically extracting similarity criteria between named entities. This problem is very similar to the automatic discovery of composite key constraints in RDF data sources that conform to a given ontology. We have reviewed the existing methods for discovering such keys and have proposed logical semantics for the different kinds of keys. These semantics allow us to understand and compare the results produced by the different methods, which have been evaluated against the documentary databases provided by our partners ABES and INA.

    • Work described in a research report [48]; two papers are currently under submission.
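The well-foundedness notion of item 1 above can be sketched as a direct check: id must be an equivalence relation and di exactly its complement. The encoding of the predicates as sets of identifier pairs is an assumption for illustration; the actual framework checks this via logical inferences.

```python
# Sketch of the well-foundedness check of item 1: the database is
# well-founded when the identification predicate `id` is an equivalence
# relation on the authority identifiers and the differentiation predicate
# `di` is its complement. Pair-set encoding is an illustrative assumption.
from itertools import product

def is_well_founded(universe, id_pairs, di_pairs):
    # id must be reflexive on the universe of identifiers
    if any((x, x) not in id_pairs for x in universe):
        return False
    # id must be symmetric
    if any((y, x) not in id_pairs for (x, y) in id_pairs):
        return False
    # id must be transitive
    for (x, y) in id_pairs:
        for (y2, z) in id_pairs:
            if y == y2 and (x, z) not in id_pairs:
                return False
    # di must be exactly the complement of id over universe x universe
    return di_pairs == {p for p in product(universe, repeat=2)
                        if p not in id_pairs}
```

Removing one pair from a symmetric id (without moving it to di) immediately breaks well-foundedness, which is the kind of inconsistency the proposed distances quantify.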
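The criteria-based clustering of item 2 above can be illustrated with a small sketch. The two criteria, the unanimity-style aggregation, and the greedy clustering below are simplifying assumptions, not the two aggregation semantics of [34].

```python
# Illustrative sketch of item 2: each criterion is one point of view on an
# author (name, publication time span, ...); authors are clustered together
# when the aggregated criteria judge them similar. Unanimous aggregation
# and greedy clustering are assumptions made for this sketch.

def name_criterion(a, b):
    """Point of view: author name."""
    return a["name"] == b["name"]

def period_criterion(a, b):
    """Point of view: publication time span (similar if spans overlap)."""
    return not (a["last_pub"] < b["first_pub"] or b["last_pub"] < a["first_pub"])

def cluster(authors, criteria):
    """Greedy clustering: an author joins a cluster only when every
    criterion agrees with every member of that cluster."""
    clusters = []
    for a in authors:
        for c in clusters:
            if all(crit(a, member) for crit in criteria for member in c):
                c.append(a)
                break
        else:
            clusters.append([a])
    return clusters
```

Two homonymous authors with disjoint publication periods end up in different clusters, flagging a potential linkage error.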
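For item 3 above, one possible logical semantics for composite keys can be sketched as follows: a set of properties is a key for a class when no two distinct instances share the same combination of values on all those properties. The dictionary encoding of instances and the naive enumeration are illustrative assumptions, not the methods evaluated in [48].

```python
# Sketch of one possible key semantics for item 3: a property set is a key
# when no two distinct instances agree on all its properties. The instance
# encoding and naive discovery are assumptions made for this sketch.
from itertools import combinations

def is_key(instances, props):
    seen = {}
    for inst_id, values in instances.items():
        signature = tuple(values.get(p) for p in props)
        if None in signature:
            continue  # missing property: this semantics ignores the instance
        if signature in seen and seen[signature] != inst_id:
            return False  # two distinct instances share the signature
        seen[signature] = inst_id
    return True

def discover_keys(instances, properties, max_size=2):
    """Naive discovery: return the minimal property sets that are keys."""
    keys = []
    for size in range(1, max_size + 1):
        for combo in combinations(properties, size):
            if any(set(k) <= set(combo) for k in keys):
                continue  # a subset is already a key: combo is not minimal
            if is_key(instances, combo):
                keys.append(combo)
    return keys
```

Differing choices on missing properties and on minimality are precisely where the semantics of the existing discovery methods diverge, which is what makes a uniform logical account useful for comparing them.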