Section: New Results
Quality and interoperability of large document catalogues
Participants : Michel Chein, Madalina Croitoru, Alain Gutierrez, Michel Leclère, Clément Sipieter.
The work in this research line mainly takes place in the ANR project Qualinca (see Section 8.1), devoted to methods and tools for repairing linkage errors in bibliographical databases. Within this project, we work in particular with our application partner ABES (French Agency for Academic Libraries, http://www.abes.fr/). ABES manages several catalogues and authority bases, in particular the Sudoc, the collective catalogue of French academic libraries. ABES also provides services to libraries and end-users, as well as to other catalogue managers (e.g., OCLC for Worldcat and, in France, Adonis for the Isidore platform).
Evaluating the Quality of a Bibliographic Database
This year, we have focused on the specification, development and testing of an application that evaluates reference quality in a bibliographic database. The goal is to evaluate “same-as” links between contextual references (references to named entities provided in the context of a bibliographic notice) and authority references (references establishing an identifier for a given named entity). Our approach to this problem consists of two successive steps:
- use the linkage API developed in previous years to automatically compute weighted links between contextual references and authority references;
- compare those weighted links with those present in the bibliographic database in order to produce an evaluation of the quality of the existing links.
The evaluation output distinguishes 12 cases grouped into 5 major link categories: valid, almost valid, erroneous, missing and doubtful. For the last 3 categories, we can often provide a correction or completion proposal.
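The comparison step can be illustrated with a minimal sketch. It is not the project's implementation: the function name `categorize`, the three thresholds and the rule attached to each category are assumptions chosen for illustration, and the 12 detailed cases are not modelled. The sketch only shows how a stored link and a set of computed weights could be mapped to one of the five categories, with an optional proposal.

```python
# Illustrative thresholds (assumptions, not values used in the project).
VALID, ALMOST, LOW = 0.9, 0.7, 0.3


def categorize(stored, weights):
    """Return (category, proposal) for one contextual reference.

    stored  -- authority id currently linked in the database, or None
    weights -- computed weight in [0, 1] for each candidate authority id
    """
    best = max(weights, key=weights.get, default=None)
    # A different, strongly weighted candidate may serve as a proposal.
    strong_other = None
    if best is not None and best != stored and weights[best] >= VALID:
        strong_other = best

    if stored is None:
        # No link in the database: a strong candidate suggests a missing link.
        return ("missing", strong_other) if strong_other else ("doubtful", None)

    w = weights.get(stored, 0.0)
    if w >= VALID:
        return ("valid", None)
    if w >= ALMOST:
        return ("almost valid", None)
    if w <= LOW:
        return ("erroneous", strong_other)
    return ("doubtful", strong_other)
```

Under these assumed rules, a stored link whose computed weight is very low while another authority reference scores very high is reported as erroneous, with that other reference as the correction proposal.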
We initially implemented this application as a standalone client written in Java (see Section 5.1). We tested it on a benchmark comprising 550 links that had been evaluated by experts. The application obtained very good results: more than 70% of the links are evaluated correctly, fewer than 1% wrongly, and the rest are links for which the data is insufficient to provide an evaluation.
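The three reported proportions (correct, wrong, no evaluation) can be obtained by comparing the tool's verdicts with the expert verdicts link by link. The following sketch shows one way to do this; the function name `score`, the dictionary layout and the use of `None` for "insufficient data" are assumptions, and the sample data in the test is a toy example, not the benchmark.

```python
from collections import Counter


def score(tool, expert):
    """Compare tool verdicts with expert verdicts.

    tool   -- dict: link id -> verdict, or None when data was insufficient
    expert -- dict: link id -> reference verdict given by the experts
    Returns the proportion of correct, wrong and undecided links.
    """
    if not expert:
        raise ValueError("the benchmark must contain at least one link")
    counts = Counter()
    for link, expected in expert.items():
        got = tool.get(link)
        if got is None:
            counts["undecided"] += 1
        elif got == expected:
            counts["correct"] += 1
        else:
            counts["wrong"] += 1
    total = len(expert)
    return {k: counts[k] / total for k in ("correct", "wrong", "undecided")}
```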
To allow professionals from ABES to use this application, we have developed an interactive web service: the user first asks for the evaluation output on the set of links induced by a subset of contextual references, and can then validate or invalidate each proposed correction/completion. The tool can be rerun after each correction/completion to improve the evaluation with this new data. Our ABES partner is currently developing an enhanced graphical interface for Sudoc users, which will communicate with that web service so that the software can be used in production conditions.
Finally, an evaluation of the time required by our application led to numerous optimizations. So far, we have concluded that most of the time is spent in the library functions that compute similarities between attributes. We are now considering map/reduce techniques to parallelize those computations.
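As a rough illustration of the map/reduce idea, the sketch below scores every (contextual reference, authority reference) pair independently in a map phase, then keeps the best candidate per contextual reference in a reduce phase. Everything here is an assumption made for the example: the function names, the use of a single name attribute, and `difflib.SequenceMatcher` as a stand-in for the similarity functions of the actual library. A thread pool is used only to keep the example self-contained; CPU-bound similarity functions would need a process pool or a distributed framework to gain real speed.

```python
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher
from functools import reduce
from itertools import product


def similarity(pair):
    """Map step: score one (contextual, authority) pair independently."""
    (ctx_id, ctx_name), (auth_id, auth_name) = pair
    ratio = SequenceMatcher(None, ctx_name.lower(), auth_name.lower()).ratio()
    return ctx_id, auth_id, ratio


def keep_best(acc, item):
    """Reduce step: keep the highest-scoring authority per contextual id."""
    ctx_id, auth_id, ratio = item
    if ctx_id not in acc or ratio > acc[ctx_id][1]:
        acc[ctx_id] = (auth_id, ratio)
    return acc


def best_matches(contextual, authorities, workers=4):
    """contextual, authorities -- dicts mapping an id to a name string."""
    pairs = product(contextual.items(), authorities.items())
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scored = pool.map(similarity, pairs)   # map phase
        return reduce(keep_best, scored, {})   # reduce phase
```

Because each pair is scored without reference to any other pair, the map phase can be split across workers freely; only the reduce phase needs to see results grouped by contextual reference.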
Argumentation for Quality Evaluation
In addition, we studied how the owl:sameAs property (expressing that two URIs actually refer to the same thing) is used in practice. Many existing identity links do not reflect genuine identity and may therefore lead to inconsistencies. We formalized explanation dialogues that use argument-based explanation based on inconsistency-tolerant semantics, and showed how they can support a domain expert in discovering inconsistencies due to erroneous sameAs links. We implemented a prototype of the explanation dialogue that communicates with our tool Graal, and provided an example of sameAs invalidation over real data, explaining what was obtained while running dialogues and how such results might benefit domain experts.
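To make concrete what an inconsistency caused by an erroneous sameAs link can look like, here is a small sketch. It is neither the argumentation framework nor the Graal API: it simply groups URIs connected by sameAs links (using a union-find structure) and reports groups in which a property assumed to be functional, such as a birth year, takes several values. The function names and the `ex:` URIs used in the test are hypothetical.

```python
from collections import defaultdict


def find(parent, x):
    """Return the representative of x, compressing the path on the way."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def same_as_conflicts(same_as, facts):
    """Report sameAs groups whose members disagree on a functional property.

    same_as -- iterable of (uri, uri) identity links
    facts   -- dict: uri -> value of a property assumed to be functional
    Returns a list of (sorted member URIs, sorted conflicting values).
    """
    parent = {}
    for a, b in same_as:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        parent[find(parent, a)] = find(parent, b)  # merge the two groups

    members = defaultdict(set)
    values = defaultdict(set)
    for uri in parent:
        root = find(parent, uri)
        members[root].add(uri)
        if uri in facts:
            values[root].add(facts[uri])

    return [(sorted(members[r]), sorted(v)) for r, v in values.items() if len(v) > 1]
```

Each reported group indicates that at least one of its sameAs links is suspect. Deciding which link to invalidate is exactly where an expert, supported by an explanation dialogue, comes in.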