Section: New Results

Creation, Extraction and Standardisation of Etymological Information

Participants: Jack Bowers, Mohamed Khemakhem, Laurent Romary, Benoît Sagot.

An important new line of research in 2017 concerned etymological information and resources. This work falls into three main strands:

  • Standards for the representation of etymological information.

  • Extraction of etymological resources from existing datasets. Two main resource types were exploited:

    • Digitised legacy etymological dictionaries, processed with GROBID-dictionaries in collaboration with the Berlin-Brandenburg Academy of Sciences. The output of this process is a TEI-structured dictionary (see module 7.4 for more details).

    • The English Wiktionary, from which structured, formalised etymological information was extracted and published as open source, in the form of a database of lexemes (i.e. language/lemma/meaning triples) and an associated database of etymological relations (input lexeme(s)/output lexeme/type of relation) [26], [27].

  • Etymological research (i.e. producing novel etymological hypotheses), in collaboration with Romain Garnier (Université de Limoges & Institut Universitaire de France) and, to a lesser extent, Laurent Sagart (CRLAO, CNRS) [12], [37]. Although still limited, the contribution of computational models to this research is real: they allowed us to check the validity of the diachronic phonetic evolution model we postulated for a new, hypothetical Indo-European language, which we suggest could have served as a source of borrowings for the ancestors of both Greek and the Italic languages [12].
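
The two databases extracted from the English Wiktionary, described above as lexemes (language/lemma/meaning triples) and etymological relations (input lexeme(s)/output lexeme/type of relation), can be pictured with a minimal Python sketch. The class names, field names, relation-type labels and the `ancestors` helper are assumptions made for illustration only; they are not the schema of the published resource [26], [27], and the sample entries are hand-written rather than taken from it.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Lexeme:
    """A lexeme as a language/lemma/meaning triple (hypothetical field names)."""
    language: str
    lemma: str
    meaning: str


@dataclass(frozen=True)
class EtymologicalRelation:
    """One or more input lexemes, one output lexeme, and a relation type."""
    inputs: Tuple[Lexeme, ...]
    output: Lexeme
    relation_type: str  # illustrative labels such as "inheritance", "borrowing"


def ancestors(lexeme: Lexeme, relations: List[EtymologicalRelation]) -> List[Lexeme]:
    """Collect every lexeme reachable by following relations backwards."""
    found: List[Lexeme] = []
    seen = {lexeme}
    stack = [lexeme]
    while stack:
        current = stack.pop()
        for rel in relations:
            if rel.output != current:
                continue
            for source in rel.inputs:
                if source not in seen:
                    seen.add(source)
                    found.append(source)
                    stack.append(source)
    return found


# Hand-written sample data, for illustration only.
caput = Lexeme("Latin", "caput", "head")
chief_ofr = Lexeme("Old French", "chief", "head, leader")
chief_en = Lexeme("English", "chief", "leader")

relations = [
    EtymologicalRelation((caput,), chief_ofr, "inheritance"),
    EtymologicalRelation((chief_ofr,), chief_en, "borrowing"),
]

for ancestor in ancestors(chief_en, relations):
    print(ancestor.language, ancestor.lemma)
```

Allowing several input lexemes per relation, as in the description above, lets a single relation cover compounds or derivations built from more than one source.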