

Section: Research Program

Overview and research strands

One of the main challenges in computational linguistics is modelling and coping with language variation. Language varies with respect to domain and genre (newswire, scientific literature, poetry, oral transcripts…), sociolinguistic factors (age, background, education; variation attested for instance on social media) and other dimensions (disabilities, for instance). Language is also in constant evolution at all time scales. Addressing this variability is still an open issue for natural language processing (NLP). Commonly used approaches, which often rely on supervised and semi-supervised machine learning methods, require huge amounts of annotated data. They still struggle with the high level of variability found for instance in user-generated content or in ancient texts.

ALMAnaCH tackles the challenge of language variation along two complementary directions.

Research strand 1

We focus on linguistic representations that are less affected by language variation. This first requires improving the production of semantic representations (semantic parsing). It also involves investigating how linguistic and non-linguistic contextual information can be integrated to improve automatic linguistic analysis, an emerging and promising line of research in the field of natural language processing (NLP). We have to identify, model and take advantage of each type of contextual information available. Addressing these issues enables the development of new lines of research related to conversational content. Applications include chatbot-based systems and improved information and knowledge extraction algorithms. We especially focus our work on challenging datasets such as domain-specific texts and historical documents, in the larger context of the development of digital humanities.

Research strand 2

Language variation must be better understood and modelled in all its forms. In this regard, we put a strong emphasis on three types of language variation and their mutual interaction: sociolinguistic variation in synchrony (including non-canonical spelling and syntax in user-generated content), complexity-based variation in relation to language-related disabilities, and diachronic variation (computational exploration of language change and language history, with a focus on French, from Old French to all forms of Modern French, as well as on Indo-European and Semitic languages in general). In addition, the noise introduced by optical character recognition (OCR) and handwritten text recognition (HTR) systems, especially in the context of historical documents, bears similarities to the noise found in non-canonical user-generated content. This noise constitutes a more transverse kind of variation stemming from the way language is graphically encoded, which we call language-encoding variation. Dealing with diachronic and language-encoding variation, as well as their interaction, is the main motivation behind the creation of a joint project-team between Inria and EPHE.

Research strand 3

These first two research strands rely on the availability of language resources (corpora, lexicons). The development of raw corpora from original sources is a domain of expertise of ALMAnaCH’s EPHE members. The (manual, semi-automatic and automatic) development of lexical resources and annotated corpora is a domain of expertise of ALMAnaCH’s Inria and Paris 4 members. This complementary expertise in language resource development (research strand 3) benefits the whole team and beyond, and both feeds and benefits from the work of the other research strands.