Section: New Results

From GROBID to GROBID-Dictionaries

Participants : Luca Foppiano, Mohamed Khemakhem, Laurent Romary, Pedro Ortiz Suárez, Alba Marina Malaga Sabogal.

GROBID is an open-source software suite initiated in 2007 by Patrice Lopez with the purpose of automatically extracting metadata from scholarly papers available in PDF. Over the years, it has developed into a rich information extraction environment and has been deployed in many Inria projects as well as in national and international services such as HAL (front-end metadata extraction from uploaded scholarly publications). It is a central component of our information extraction activities, and we were particularly active in 2018 in the following domains:

  • General contributions to GROBID (https://github.com/kermitt2/grobid):

    • Major refactoring and design improvements

    • Fixes, tests, documentation and update of the pdf2xml fork for Windows

    • Addition and improvement of several models in collaboration with CERN (e.g. for the recognition of arXiv identifiers)

    • Further tests on the specific case of bibliographic documents[32]

  • Contribution to GROBID-Dictionaries (https://github.com/MedKhem/grobid-dictionaries): the lexical GROBID extension has been implemented and tested on modern and multilingual dictionaries[23]. In the context of several collaborative activities, GROBID-Dictionaries has been applied to several documentary sources:

    • Early editions of the Petit Larousse Illustré in the context of the Nénufar project[45], [29]

    • Further experiments on etymological dictionaries from the Berlin-Brandenburg Academy of Sciences

    • Experiments on entry-based documents such as manuscript catalogues (with the University of Neuchâtel)[16] and the French address directory Bottin from the end of the 19th century[22]

    These experiments have been accompanied by intensive training and hands-on activities, in particular in the context of the French research network CAHIERS (Huma-Num consortium), the Lexical Data Master Class and a series of workshops organised in South Africa under the auspices of a national linguistic documentation program. Finally, further alignments with the ongoing standardisation activities around TEI Lex0 and ISO 24613 (LMF) have been carried out to ensure that the generated output complies with these standards.

The experience gained in the development and application of GROBID-Dictionaries has been the basis for both the recently accepted ANR BASNUM project, which aims at automatically structuring and enriching the Dictionnaire universel (DU) by Antoine Furetière in its 1701 edition rewritten by Basnage de Beauval, and the doctoral work of Pedro Ortiz.