Section: New Results
Creation, Extraction and Standardisation of Etymological Information
Participants: Jack Bowers, Mohamed Khemakhem, Laurent Romary, Benoît Sagot.
An important new line of research in 2017 concerned etymological information and resources. This work has three main dimensions:
- Standards for the representation of etymological information.
- Extraction of etymological resources from existing datasets. Two main resource types were exploited:
  - Digitised legacy etymological dictionaries, processed with GROBID-dictionaries in collaboration with the Berlin-Brandenburg Academy of Sciences. The output of the process is a TEI-structured dictionary (see module 7.4 for more details).
  - The English Wiktionary, from which structured, formalised etymological information was extracted and published as an open-source resource: a database of lexemes (i.e. language/lemma/meaning triples) and an associated database of etymological relations (input lexeme(s)/output lexeme/type of relation) [26], [27].
- Etymological research (i.e. producing novel etymological hypotheses), in collaboration with Romain Garnier (Université de Limoges & Institut Universitaire de France) and, to a lesser extent, Laurent Sagart (CRLAO, CNRS) [12], [37]. Although still limited, the contribution of computational models to this research is real: they allowed us to check the validity of the diachronic phonetic evolution model we have postulated for a new, hypothetical Indo-European language that we suggest could have served as a source of borrowings for the ancestors of both Greek and the Italic languages [12].
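To make the shape of the Wiktionary-derived resource more concrete, the two databases mentioned above (lexemes as language/lemma/meaning triples, and relations as input lexeme(s)/output lexeme/type of relation) can be pictured with the minimal Python sketch below. It is only an illustration under stated assumptions: the class names, field names, relation labels and the `direct_sources` helper are hypothetical and do not reproduce the schema of the published resource, and the sample entries are a textbook example chosen for the sketch.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Lexeme:
    """A language/lemma/meaning triple (field names are assumptions)."""
    language: str
    lemma: str
    meaning: str


@dataclass(frozen=True)
class EtymologicalRelation:
    """One or more input lexemes, one output lexeme, and a relation type."""
    inputs: Tuple[Lexeme, ...]
    output: Lexeme
    relation_type: str  # hypothetical labels, e.g. "inheritance", "borrowing"


def direct_sources(lexeme: Lexeme,
                   relations: List[EtymologicalRelation]) -> List[Lexeme]:
    """Return the input lexemes of every relation whose output is `lexeme`."""
    return [src for rel in relations if rel.output == lexeme
            for src in rel.inputs]


# Illustrative entries only, not taken from the extracted database.
latin_pater = Lexeme("Latin", "pater", "father")
old_french_pere = Lexeme("Old French", "pere", "father")
french_pere = Lexeme("French", "père", "father")

relations = [
    EtymologicalRelation((latin_pater,), old_french_pere, "inheritance"),
    EtymologicalRelation((old_french_pere,), french_pere, "inheritance"),
]

print(direct_sources(french_pere, relations))
```

Keeping the inputs as a tuple reflects the "input lexeme(s)" wording: a single relation (a compound, for instance) may have several source lexemes, while the output is always a single lexeme.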