

Section: New Results

Standardisation of Natural Language data

Participants : Loïc Grobol, Laurent Romary, Stefan Pernes, Jack Bowers, Charles Riondet, Mohamed Khemakhem.

One essential aspect of working with human traces, as they occur in digital humanities at large and in natural language processing in particular, is the ability to re-use any kind of primary content and its further enrichments. Central to such re-use is the development and application of reference standards that reflect the state of the art in the corresponding domains. In this respect, our team pays particular attention to the existing standardisation background when producing language resources or developing NLP components. Furthermore, our leading roles in standardisation within both the Parthenos [41] and EHRI [40] projects, as well as in related initiatives (TEI consortium, ISO committee TC 37, COST action ENeL (European Network in e-Lexicography), DARIAH lexical working group), have allowed us to make progress along the following lines:

  • Contribution to the improvement of the TEI guidelines [15], [20], in particular the definition of an extension for stand-off annotation that builds on [52] (https://github.com/laurentromary/stdfSpec)

  • Editing an ISO standard on the annotation of reference phenomena in discourse (https://www.iso.org/standard/69658.html) that aims to be feature-complete from a linguistic point of view (from simple co-reference to complex bridging anaphora phenomena) and, in its implementation, compliant with the TEI stand-off annotation module [18]

  • Editing the draft for the future project ISO 24613-4, which, on the basis of the proposals made in [67], intends to provide a reference TEI-based serialisation for the LMF model (comprising the core model (ISO 24613-1), machine-readable dictionary (ISO 24613-2) and etymology (ISO 24613-3, cf. below) modules). This work is also the basis for the output format of Grobid-dictionary [19]

  • Editing the draft for the future project ISO 24613-3, which will provide the model for representing etymological information in dictionaries and lexical resources, on the basis of [11]. Preliminary experiments have been carried out in [26], [27] (see also section 7.10)

  • Proposal of a modular specification of the TBX standard (ISO 30042) by means of a TEI ODD specification [24]

  • Participation in a call for contributions on the future evolution of the archival standard EAC-CPF (Encoded Archival Context for Corporate Bodies, Persons, and Families), proposing to use the TEI ODD specification language [47]