

Section: New Results

Information Extraction with GROBID

Participants: Luca Foppiano, Mohamed Khemakhem, Laurent Romary.

GROBID is an open-source software suite initiated in 2007 by Patrice Lopez to extract metadata automatically from scholarly papers available in PDF. Over the years, it has developed into a rich information extraction environment and has been deployed in many Inria projects as well as in national and international services, among them HAL. It is a central component of our information extraction activities, and in 2017 we were particularly active in the following areas:

  • General contributions to GROBID (https://github.com/kermitt2/grobid):

    • major refactoring and design improvements

    • fixes, tests, documentation and update of the pdf2xml fork for Windows

    • addition and improvement of several models in collaboration with CERN (e.g. for the recognition of arXiv identifiers)

  • Contribution to entity-fishing (https://github.com/kermitt2/nerd):

    • integration into the main open-access platforms: EKT/OMP, OAPEN, OpenEdition, Göttingen University Library Press, Ubiquity Press

    • deployment in the DARIAH infrastructure via Huma-NUM

    • added support for Italian and Spanish

    • various fixes and refactoring

    • creation of a dedicated client for historical documents, combined with a POS tagger, that links the extracted entities to one another and to their structural context [34]

  • Contribution to GROBID-Dictionaries (https://github.com/MedKhem/grobid-dictionaries): the lexical GROBID extension has been implemented and tested on modern and multilingual dictionaries [19]. The architecture has been further developed, and an extension for etymology has been plugged in on top of the existing models. First experiments on etymological samples have been carried out, and more work is required on feature selection. In parallel, the output of the system is being actively aligned with standardisation initiatives such as TEI Lex0 and ISO 24613 (LMF). Usability has also been improved by lightening the annotation process and simplifying the setup of the tool. These measures should enable interested research partners to produce more of the annotated data required for feature engineering. A first user experiment was carried out during a dedicated workshop at the Lexical Masterclass, where the new features were tested.