ALMANACH - 2018 - Annual Activity Report

Section: New Results

Entity-fishing: a generic named entity recognition and disambiguation for digital humanities projects

Participants: Marie Puren, Charles Riondet, Laurent Romary, Luca Foppiano, Tanti Kristanti.

For several years, starting with the launch of the EU Cendari project in 2012 [75], we have been working to provide a generic named-entity recognition and disambiguation (NERD) module, called entity-fishing [18], as a stable online service. This work demonstrates that sustainable technical services can be delivered as part of the development of research infrastructures for the humanities in Europe. In particular, our results contribute not only to DARIAH, the European digital research infrastructure for the arts and humanities, but also to OPERAS, the European research infrastructure for the development of open scholarly communication in the social sciences and humanities. Deployed as part of the French national infrastructure Huma-Num, the service provides an efficient, state-of-the-art implementation coupled with standardised interfaces, allowing easy deployment in a variety of digital humanities contexts. In 2018, we specifically integrated entity-fishing within the H2020 HIRMEOS project, where several open access publishers have used the service on their collections of published monographs to enhance retrieval and access.

To this end, we have set up a common layer of services on top of several existing e-publishing platforms for Open Access monographs. The entity extraction task was deployed over a corpus of monographs provided by the HIRMEOS partners, with the following coverage:

4000 books in English and French from Open Edition Books
2000 titles in English and German from OAPEN
162 books in English from Ubiquity Press
765 books (606 in German, 159 in English) from the University of Göttingen

entity-fishing has been integrated to different degrees across the platforms. Most of the participating publishers added features to their user interfaces based on the data generated by entity-fishing, for example search facets for persons and locations that help users narrow down their searches and obtain more precise results.
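
As an illustration of how entity annotations can feed search facets, the following is a minimal sketch, not the publishers' actual implementation. The record layout (`book`, `name`, `type`), the type labels and the sample values are all hypothetical and chosen only for the example.

```python
from collections import Counter, defaultdict

# Hypothetical, simplified entity records: one entry per mention found in a book.
# The field names and values are assumptions made for this sketch.
SAMPLE_ENTITIES = [
    {"book": "book-1", "name": "Goethe", "type": "PERSON"},
    {"book": "book-1", "name": "Weimar", "type": "LOCATION"},
    {"book": "book-2", "name": "Goethe", "type": "PERSON"},
    {"book": "book-2", "name": "Paris", "type": "LOCATION"},
    {"book": "book-2", "name": "Paris", "type": "LOCATION"},
]

FACET_TYPES = ("PERSON", "LOCATION")


def build_facets(entities, facet_types=FACET_TYPES):
    """Count, for each facet type, how many distinct books mention each entity."""
    books_by_entity = defaultdict(set)
    for ent in entities:
        if ent["type"] in facet_types:
            books_by_entity[(ent["type"], ent["name"])].add(ent["book"])
    facets = {facet_type: Counter() for facet_type in facet_types}
    for (facet_type, name), books in books_by_entity.items():
        facets[facet_type][name] = len(books)
    return facets


def filter_books(entities, facet_type, name):
    """Return the books that mention the selected facet value."""
    return {e["book"] for e in entities
            if e["type"] == facet_type and e["name"] == name}


facets = build_facets(SAMPLE_ENTITIES)
for facet_type, counts in facets.items():
    print(facet_type, counts.most_common())
```

Counting distinct books rather than raw mentions keeps a facet value from being inflated by a single book that repeats the same name many times.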

entity-fishing is developed in Java and designed for fast processing of text and PDF documents with a relatively limited memory footprint, while offering accuracy close to the state of the art (as compared with other NERD systems). The F-score for disambiguation currently ranges from 76.5 to 89.1 on standard datasets (ACE2004, AIDA-CONLL-testb, AQUAINT, MSNBC) (Table 1) [74].

**Table 1.** Accuracy measures

| System | ACE 2004 | AIDA CONLL-testb | AQUAINT | MSNBC |
|---|---|---|---|---|
| Priors | 83.1 | 66.1 | 80.3 | 71.1 |
| entity-fishing | 83.5 | 76.5 | 89.1 | 86.7 |
| Wikifier | 83.4 | 77.7 | 86.2 | 85.1 |
| DoSeR | 90.7 | 78.4 | 84.2 | 91.1 |
| AIDA | 81.5 | 77.4 | 53.2 | 78.2 |
| Spotlight | 71.3 | 59.3 | 71.3 | 51.1 |
| Babelfy | 56.1 | 59.2 | 65.2 | 60.7 |
| WAT | 80.0 | 84.3 | 76.8 | 77.7 |
| (Ganea & Hofmann, 2017) | 88.5 | 92.2 | 88.5 | 93.7 |
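
To make the figures in Table 1 easier to read, the short sketch below recomputes the score range quoted in the text, the gain of entity-fishing over the Priors baseline, and the best system per dataset. The numbers are copied from the table; the function names are ours and purely illustrative.

```python
DATASETS = ("ACE 2004", "AIDA CONLL-testb", "AQUAINT", "MSNBC")

# Scores copied from Table 1, in the column order given by DATASETS.
SCORES = {
    "Priors": (83.1, 66.1, 80.3, 71.1),
    "entity-fishing": (83.5, 76.5, 89.1, 86.7),
    "Wikifier": (83.4, 77.7, 86.2, 85.1),
    "DoSeR": (90.7, 78.4, 84.2, 91.1),
    "AIDA": (81.5, 77.4, 53.2, 78.2),
    "Spotlight": (71.3, 59.3, 71.3, 51.1),
    "Babelfy": (56.1, 59.2, 65.2, 60.7),
    "WAT": (80.0, 84.3, 76.8, 77.7),
    "(Ganea & Hofmann, 2017)": (88.5, 92.2, 88.5, 93.7),
}


def score_range(system):
    """Return the (lowest, highest) score of a system across the datasets."""
    return min(SCORES[system]), max(SCORES[system])


def gain_over(system, baseline="Priors"):
    """Per-dataset difference between a system and a baseline, in points."""
    return {
        dataset: round(s - b, 1)
        for dataset, s, b in zip(DATASETS, SCORES[system], SCORES[baseline])
    }


def best_system(dataset):
    """Name of the system with the highest score on one dataset."""
    column = DATASETS.index(dataset)
    return max(SCORES, key=lambda system: SCORES[system][column])


print(score_range("entity-fishing"))
print(gain_over("entity-fishing"))
print({dataset: best_system(dataset) for dataset in DATASETS})
```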

The objective, however, is to provide a generic service with a steady throughput of 500-1000 words per second, or one PDF page of a scientific article in 1-2 seconds, on a mid-range Linux server (4 CPUs, 3 GB RAM).
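
To give a feel for what this throughput means at corpus scale, here is a rough back-of-the-envelope sketch. The book counts come from the coverage list above; the average of 80,000 words per book is a purely hypothetical figure chosen for illustration, and the estimate assumes a single server processing the books sequentially.

```python
# Book counts taken from the HIRMEOS coverage list in this section.
BOOKS = {
    "Open Edition Books": 4000,
    "OAPEN": 2000,
    "Ubiquity Press": 162,
    "University of Göttingen": 765,
}

# Hypothetical average book length: an assumption, not a figure from the report.
WORDS_PER_BOOK = 80_000


def time_bounds(n_words, slow=500, fast=1000):
    """Return (best, worst) processing time in seconds for a throughput range
    given in words per second."""
    return n_words / fast, n_words / slow


total_books = sum(BOOKS.values())
total_words = total_books * WORDS_PER_BOOK
best_s, worst_s = time_bounds(total_words)

print(f"{total_books} books, about {total_words:,} words (assumed)")
print(f"between {best_s / 86400:.1f} and {worst_s / 86400:.1f} days on one server")
```

Changing `WORDS_PER_BOOK` or running several instances in parallel scales the estimate linearly, which is why the assumed book length should be replaced with measured values before drawing any conclusion.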

From the point of view of the technical deployment itself, we have provided all the necessary components of a sustainable service:

release and publish entity-fishing as open source software (http://github.com/kermitt2/nerd);
deploy the service in the DARIAH infrastructure through Huma-Num (http://nerd.huma-num.fr/nerd/);
produce evaluation data and metrics for content validation.
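
For readers who want to see what calling such a deployed service might look like, the sketch below builds a disambiguation query and reads entities back from a response. Only the base URL comes from this report. The `/service/disambiguate` path, the JSON field names (`text`, `language`, `entities`, `rawName`, `wikidataId`) and the sample response are assumptions that should be checked against the entity-fishing documentation before use. No network call is made.

```python
import json

BASE_URL = "http://nerd.huma-num.fr/nerd"    # service URL given in this report
DISAMBIGUATE_PATH = "/service/disambiguate"  # assumed endpoint path

endpoint = BASE_URL + DISAMBIGUATE_PATH


def build_query(text, lang="en"):
    """Serialise a text query as JSON (the field names are assumptions)."""
    return json.dumps({"text": text, "language": {"lang": lang}},
                      ensure_ascii=False)


def extract_entities(response_json):
    """Return (surface form, Wikidata identifier) pairs from a JSON response."""
    data = json.loads(response_json)
    return [(e.get("rawName"), e.get("wikidataId"))
            for e in data.get("entities", [])]


# Hypothetical response, written by hand for this sketch; the identifier is a
# placeholder, not a real Wikidata ID.
SAMPLE_RESPONSE = json.dumps({
    "entities": [{"rawName": "Göttingen", "wikidataId": "Q-PLACEHOLDER"}]
})

print(endpoint)
print(build_query("The University of Göttingen publishes monographs."))
print(extract_entities(SAMPLE_RESPONSE))
```

Keeping query construction and response parsing separate from the HTTP call makes both easy to test offline; the actual request could then be sent with any HTTP client.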
