Section: New Results

Semantic Data Integration

Participants : Michel Chein, Madalina Croitoru, Léa Guizol, Michel Leclère, Rallou Thomopoulos.

It often happens that different references (i.e., data descriptions), possibly coming from heterogeneous data sources, concern the same real-world entity. In such cases, it is necessary (i) to detect whether different data descriptions really refer to the same real-world entity and (ii) to fuse them into a unique representation. This issue has been studied under various names: “record linking”, “entity resolution”, “reference resolution”, “de-duplication”, “object identification”, “data reconciliation”, etc., mostly in databases. It has become one of the major challenges in the Web of Data, where the objective is to link data published on the web and to process them as a single distributed database.

We investigate this problem in the specific context of bibliographic databases. Indeed, people working on bibliographic information systems have a long-standing tradition of using norms and have integrated, alongside collections of document notices (e.g., bibliographic records), collections of authority notices that categorize the different named entities used to describe documents (people, organizations, places, ...). In current databases, document notices do not directly use the names of named entities to fill a particular field (author, editor, ...), but rather the unique identifier of the authority notice representing that named entity.

A few years ago, we began a collaboration with ABES (National Bibliographic Agency for Universities) to develop a method and a prototype to perform entity resolution between, on the one hand, the authors of a new bibliographic record and, on the other hand, the authority references of an authority catalog (namely, the Sudoc catalog from the ABES agency). A problem with this approach is that it relies upon pre-established links between bibliographic records and authority notices. Our experimentation and evaluation have shown that many existing links were erroneous, and thus led to the propagation of new linkage errors. We have thus begun to work on methods and tools to repair linkage errors in bibliographic databases. The first step of our approach was to build a knowledge base over an ontology (based on the international standards FRBR and CIDOC-CRM) aiming at representing bibliographic data (an RDFS base) as well as librarian knowledge.

From that, we developed a methodological framework for designing rules that conclude on the coreference or the difference between entities of the bibliographic knowledge base. This framework was implemented in Cogui.

An Original Methodology to Compute Coreference and Difference Links

Our methodology can be briefly summarized as follows. The first step consists in computing “sure” links. In the second step, authority notices are enriched with information that comes from the bibliographic notices to which they are linked by sure links. In the third step, Datalog rules that conclude on coreference or difference are triggered. The results are used to compute new sure links. These steps are iterated until stability, i.e., until no new sure link is discovered. More specifically, the Datalog rules are of the following form: the body of a rule is a conjunction of similarity criteria on attributes, and its head states the coreference or the difference of two individual entities with a specific confidence level (represented as a symbolic value). We are currently instantiating this methodology for the Sudoc catalog, jointly with the ABES librarians, which will allow them to evaluate it.
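The iteration described above can be sketched in a few lines of Python. This is only an illustrative sketch under strong assumptions, not the Cogui implementation: the toy data, the attribute names (`author`, `domains`), the two rules and the policy "a record gets a new sure link when exactly one authority notice is concluded coreferent with high confidence" are all hypothetical, and plain Python functions stand in for the Datalog rules.

```python
# Hypothetical toy data (not from the Sudoc catalog): two homonymous authority
# notices and three bibliographic records described by author name and domains.
authorities = {"A1": {"name": "j. martin"}, "A2": {"name": "j. martin"}}
records = {
    "R1": {"author": "j. martin", "domains": {"logic"}},
    "R2": {"author": "j. martin", "domains": {"logic", "databases"}},
    "R3": {"author": "j. martin", "domains": {"databases"}},
}


def enrich(authorities, records, links):
    """Step 2: add to each authority notice the domains of its sure-linked records."""
    enriched = {a: {"name": info["name"], "domains": set()}
                for a, info in authorities.items()}
    for rec_id, auth_id in links.items():
        enriched[auth_id]["domains"] |= records[rec_id]["domains"]
    return enriched


def apply_rules(record, authority):
    """Step 3: each rule body is a conjunction of similarity criteria; the
    conclusion is a (kind, symbolic confidence) pair, or None if no rule fires."""
    same_name = record["author"] == authority["name"]
    if same_name and record["domains"] & authority["domains"]:
        return ("coreference", "high")
    if not same_name:
        return ("difference", "high")
    return None


def compute_sure_links(authorities, records, initial_links):
    """Iterate enrichment and rule application until no new sure link appears."""
    links = dict(initial_links)  # step 1: the initial sure links are given
    while True:
        enriched = enrich(authorities, records, links)
        new_links = {}
        for rec_id, record in records.items():
            if rec_id in links:
                continue
            corefs = [a for a, auth in enriched.items()
                      if apply_rules(record, auth) == ("coreference", "high")]
            if len(corefs) == 1:  # assumed policy: a single unambiguous candidate
                new_links[rec_id] = corefs[0]
        if not new_links:  # stability reached
            return links
        links.update(new_links)


links = compute_sure_links(authorities, records, {"R1": "A1"})
print(links)
```

In this toy run, R3 can only be linked in the second round, once A1 has been enriched with the domains of R2. This shows why the steps are iterated rather than applied once.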

Partitioning Semantics for Link Discovery in Bibliographic Knowledge Bases

With the aim of evaluating and improving the quality of links in bibliographic knowledge bases, we have developed a decision support system based on partitioning semantics. The novelty of our approach consists in using criteria with symbolic values for partitioning, together with suitable partitioning semantics. We have investigated the limits of those partitioning semantics: how the characteristics of the input (objects and criteria) influence the characteristics of the result, namely its correctness and the execution time. We have also evaluated and compared the above-mentioned semantics on a real qualitative sample. This sample is drawn from the catalogue of French university libraries (SUDOC) maintained by ABES.

  • This work is part of Léa Guizol's PhD thesis [16]. Work published in FUZZ-IEEE 2014 [46].

Key Discovery on the Semantic Web

Many techniques have recently been proposed to automate the linkage of RDF datasets. Predicate selection is the step of the linkage process that consists in selecting the smallest set of relevant predicates needed to enable instance comparison. We call such a set of predicates a key, by analogy with the notion of key in relational databases. We have formally explained the different assumptions behind two existing key semantics (IC), and have experimentally evaluated these key semantics by studying how discovered keys could help dataset interlinking or cleaning.
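To make the notion of key concrete, the following sketch checks whether a set of predicates distinguishes all instances of a small dataset, and enumerates minimal keys by brute force. It is only an illustration: the toy data are hypothetical, and the two ways of deciding that two instances "agree" on a multi-valued predicate (sharing at least one value, or having exactly the same set of values) are assumptions made here to show that the choice of semantics changes the discovered keys. They are not claimed to be the exact definitions studied in the cited work.

```python
from itertools import combinations

# Hypothetical toy dataset: instance -> predicate -> set of values.
data = {
    "p1": {"name": {"Ann"}, "city": {"Paris"}, "phone": {"01", "02"}},
    "p2": {"name": {"Ann"}, "city": {"Lyon"}, "phone": {"02"}},
    "p3": {"name": {"Bob"}, "city": {"Paris"}, "phone": {"03"}},
}


def agree(a, b, predicate, strict):
    """Two instances agree on a predicate if their value sets are equal
    (strict reading) or share at least one value (loose reading)."""
    va, vb = a.get(predicate, set()), b.get(predicate, set())
    if strict:
        return bool(va) and va == vb
    return bool(va & vb)


def is_key(data, predicates, strict=False):
    """A set of predicates is a key if no two distinct instances agree on all of them."""
    for x, y in combinations(data.values(), 2):
        if all(agree(x, y, p, strict) for p in predicates):
            return False
    return True


def minimal_keys(data, predicates, strict=False):
    """Brute-force enumeration by increasing size, skipping supersets of known keys."""
    keys = []
    for size in range(1, len(predicates) + 1):
        for candidate in combinations(predicates, size):
            if any(set(k) <= set(candidate) for k in keys):
                continue
            if is_key(data, candidate, strict):
                keys.append(candidate)
    return keys


loose_keys = minimal_keys(data, ["name", "city", "phone"])
strict_keys = minimal_keys(data, ["name", "city", "phone"], strict=True)
print(loose_keys)   # [('name', 'city'), ('city', 'phone')]
print(strict_keys)  # [('phone',), ('name', 'city')]
```

On this toy dataset, `phone` alone is a key under the strict reading but not under the loose one, because p1 and p2 share one phone value without having the same set of values.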

  • Work published in IC 2014 [50] and ICCS 2014 [29] in collaboration with Manuel Atencia and Jérôme David from LIG, and Nathalie Pernelle, Fatiha Saïs and Danai Symeonidou from LRI. See also the reconciliation-based approach in [23].

Fusion of Linked Data

The problem of data fusion starts from reconciled datasets, whose objects are linked with semantic sameAs relations, as described above. We attempt to merge the often conflicting information of these reconciled objects in order to obtain unified representations that contain only the best-quality information. We are studying an approach to determine the most appropriate value(s). Our method combines different quality criteria based on the value and its data source, and exploits, whenever possible, the ontology semantics, constraints and relations. Moreover, we create a mechanism that provides explanations about the quality of each value, as estimated by our system. To achieve this, we generate annotations used for traceability and explanation purposes.
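As a rough illustration of how quality criteria on a value and on its source could be combined, the sketch below scores the conflicting values of one property and keeps an annotation for each of them. Everything in it is an assumption made for the example: the source names, the reliability scores, the two criteria (best source reliability and share of supporting sources), the weights and the weighted-sum combination are hypothetical and are not the method of the report, which also exploits ontology semantics.

```python
# Hypothetical source reliability scores and criterion weights (illustration only).
SOURCE_RELIABILITY = {"src_a": 0.9, "src_b": 0.6, "src_c": 0.5}
WEIGHTS = {"reliability": 0.5, "support": 0.5}


def fuse(candidates):
    """Pick one value among conflicting (value, source) pairs for a single property.

    Returns the selected value and one annotation per candidate value, so that
    the decision can be traced and explained afterwards.
    """
    sources_by_value = {}
    for value, source in candidates:
        sources_by_value.setdefault(value, set()).add(source)
    n_sources = len({source for _, source in candidates})

    annotations = []
    for value, sources in sources_by_value.items():
        reliability = max(SOURCE_RELIABILITY[s] for s in sources)  # source-based criterion
        support = len(sources) / n_sources                         # value-based criterion
        score = (WEIGHTS["reliability"] * reliability
                 + WEIGHTS["support"] * support)
        annotations.append({
            "value": value,
            "sources": sorted(sources),
            "reliability": reliability,
            "support": support,
            "score": score,
        })
    # Highest score first; the sort is stable, so ties keep input order.
    annotations.sort(key=lambda a: a["score"], reverse=True)
    return annotations[0]["value"], annotations


# Three reconciled descriptions disagree on a (hypothetical) birth year.
best, annotations = fuse([("1975", "src_a"), ("1957", "src_b"), ("1957", "src_c")])
print(best)
for annotation in annotations:
    print(annotation)
```

Here the value given by two moderately reliable sources narrowly outranks the one given by a single more reliable source. With different weights the outcome would change, which is why keeping the per-value annotations matters for explanation.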

  • Work described in the Qualinca deliverable 4.2 research report, and accepted for publication in EGC'2015: "Linked Data Annotation and Fusion driven by Data Quality Evaluation" (authors: Ioanna Giannopoulou, Fatiha Saïs from LRI, and Rallou Thomopoulos).