Bilateral Contracts and Grants with Industry

Section: New Results

Entity-fishing: a generic named entity recognition and disambiguation service for digital humanities projects

Participants: Marie Puren, Charles Riondet, Laurent Romary, Luca Foppiano, Tanti Kristanti.

For several years (since the start of the EU Cendari project in 2012 [75]), we have been working to provide a generic named-entity recognition and disambiguation (NERD) module, called entity-fishing [18], as a stable online service. This work demonstrates that sustainable technical services can be delivered as part of the development of research infrastructures for the humanities in Europe. In particular, our results contribute not only to DARIAH, the European digital research infrastructure for the arts and humanities, but also to OPERAS, the European research infrastructure for the development of open scholarly communication in the social sciences and humanities. Deployed as part of the French national infrastructure Huma-Num, the service provides an efficient state-of-the-art implementation coupled with standardised interfaces, allowing easy deployment in a variety of digital humanities contexts. In 2018, we integrated entity-fishing within the H2020 HIRMEOS project, where several open access publishers have used the service on their collections of published monographs as a means to enhance retrieval and access.

To this end, we have set up a common layer of services on top of several existing e-publishing platforms for Open Access monographs. The entity extraction task was deployed over a corpus of monographs provided by the HIRMEOS partners, with the following coverage:

The participating publishers integrated entity-fishing to different degrees. Most of them added features to their user interfaces based on the data generated by entity-fishing, for example search facets for persons and locations that help users narrow down their searches and obtain more precise results.
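As an illustration of how such search facets might be derived, the following minimal sketch groups entity annotations by type and counts their occurrences. The record layout (`rawName` and `type` keys), the type labels, and the helper name `build_facets` are assumptions made for this example; they are not taken from any publisher's actual integration.

```python
from collections import Counter, defaultdict


def build_facets(entities, facet_types=("PERSON", "LOCATION")):
    """Group entity annotations into search facets with occurrence counts.

    `entities` is assumed to be a list of dicts with a surface form
    ("rawName") and a coarse class ("type"); this layout is hypothetical.
    """
    facets = defaultdict(Counter)
    for entity in entities:
        entity_type = entity.get("type")
        if entity_type in facet_types:
            # Count each surface form once per mention
            facets[entity_type][entity["rawName"]] += 1
    # Most frequent values first, as a facet list would usually show them
    return {t: facets[t].most_common() for t in facet_types}


# Hypothetical annotations for one monograph chapter
sample_entities = [
    {"rawName": "Paris", "type": "LOCATION"},
    {"rawName": "Victor Hugo", "type": "PERSON"},
    {"rawName": "Paris", "type": "LOCATION"},
    {"rawName": "1848", "type": "PERIOD"},
]

facets = build_facets(sample_entities)
```

A production integration would more plausibly key facets on the disambiguated entity identifier than on the surface form, so that spelling variants of the same person or place fall into a single facet value.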

entity-fishing is developed in Java and designed for fast processing of text and PDF with relatively limited memory, while offering accuracy relatively close to the state of the art (as compared with other NERD systems). The disambiguation F-score currently ranges between 76.5 and 89.1 on standard datasets (ACE2004, AIDA-CONLL-testb, AQUAINT, MSNBC) (Table 1) [74].

Table 1. Accuracy measures

System                    ACE2004  AIDA-CONLL-testb  AQUAINT  MSNBC
Priors                       83.1              66.1     80.3   71.1
entity-fishing               83.5              76.5     89.1   86.7
Wikifier                     83.4              77.7     86.2   85.1
DoSeR                        90.7              78.4     84.2   91.1
AIDA                         81.5              77.4     53.2   78.2
Spotlight                    71.3              59.3     71.3   51.1
Babelfy                      56.1              59.2     65.2   60.7
WAT                          80.0              84.3     76.8   77.7
(Ganea & Hofmann, 2017)      88.5              92.2     88.5   93.7
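The evaluation protocol behind Table 1 is not detailed here. For readers unfamiliar with the measure, the sketch below shows one common way to compute an F-score for disambiguation, assuming gold and predicted annotations are represented as sets of (mention offsets, entity identifier) pairs. The function name `f_score` and the identifiers `E1` to `E4` are hypothetical.

```python
def f_score(gold, predicted):
    """Return (precision, recall, F1) for two collections of
    (mention_offsets, entity_id) pairs.

    A prediction counts as correct only if both the mention span and
    the linked entity match a gold annotation exactly (an assumption
    of this sketch, not necessarily the protocol used in Table 1).
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations: the third mention is linked to the wrong entity
gold = {((0, 5), "E1"), ((10, 21), "E2"), ((30, 36), "E3")}
predicted = {((0, 5), "E1"), ((10, 21), "E2"), ((30, 36), "E4")}

precision, recall, f1 = f_score(gold, predicted)
```

In this example two of the three predictions are correct, so precision, recall and F1 all equal 2/3; the table reports such values as percentages.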

The objective, however, is to provide a generic service with a steady throughput of 500-1000 words per second, or one PDF page of a scientific article in 1-2 seconds, on a mid-range Linux server (4 CPUs, 3 GB RAM).
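To give a sense of what this throughput target means in practice, the following back-of-the-envelope sketch converts a word count into an expected processing time range. The monograph length of 100,000 words is a hypothetical figure chosen for illustration, and the estimate ignores network and PDF parsing overhead.

```python
def processing_time_range(total_words, min_wps=500, max_wps=1000):
    """Return (best_case_seconds, worst_case_seconds) for a text of
    `total_words` words, given a throughput between `min_wps` and
    `max_wps` words per second (the 500-1000 target stated above).
    """
    return total_words / max_wps, total_words / min_wps


monograph_words = 100_000  # hypothetical monograph length
best, worst = processing_time_range(monograph_words)
print(f"{best:.0f} s to {worst:.0f} s")  # prints "100 s to 200 s"
```

Under these assumptions, a single monograph would be processed in roughly two to three minutes on the server configuration described.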

From the point of view of the technical deployment itself, we have provided all the necessary components of a sustainable service: