Section: Research Program

Computational Modelling of Linguistic Variation

NLP and DH tools and resources are very often developed for contemporary, edited, non-specialised texts, and are often based on journalistic corpora. However, such corpora are not representative of the variety of existing textual data. As a result, the performance of most NLP systems decreases, sometimes dramatically, when they are faced with non-contemporary, non-edited or specialised texts. Despite the existence of domain-adaptation techniques and of robust tools, for instance for processing social media texts, dealing with linguistic variation remains a crucial challenge for NLP and DH.

Linguistic variation is not a monolithic phenomenon. Firstly, it can result from different types of processes, such as variation over time (diachronic variation) and variation correlated with sociological variables (sociolinguistic variation, especially on social networks). Secondly, it can affect all components of language, from spelling (languages without a normative spelling, spelling errors of all kinds and origins) to morphology/syntax (especially in diachrony, in texts from specialised domains and in social media texts) and semantics/pragmatics (again in diachrony, and also regarding intertextuality, on which see below). Finally, it can be a property of the data to be analysed or a desired feature of the data to be generated (for instance when simplifying texts to increase their accessibility for disabled and/or non-native readers).

Nevertheless, despite this variability in variation, the underlying mechanisms are partly comparable. This motivates our general vision that many generic techniques could be developed and adapted to handle different types of variation. In this regard, three aspects must be kept in mind: spelling variation (human errors, OCR/HTR errors, lack of spelling conventions for some languages, etc.), the lack or scarcity of parallel data aligning “variation-affected” texts with their “standard/edited” counterparts, and the sequential nature of the problem at hand. We therefore explore, for instance, how unsupervised or weakly supervised techniques could be developed to feed dedicated sequence-to-sequence models. Such architectures could help develop “normalisation” tools adapted, for example, to social media texts, to texts written in ancient or dialectal varieties of well-resourced languages (e.g. Old French texts), and to OCR/HTR system outputs.
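
As a minimal illustration of this weakly supervised normalisation scenario, the sketch below retrieves candidate standard forms from a small reference lexicon by character-level similarity, a crude stand-in for what a trained character-level sequence-to-sequence model would learn from (noisy, standard) pairs. The lexicon, threshold and test tokens are invented for the example.

```python
# Weakly supervised normalisation sketch: map noisy tokens to the
# closest entry of a standard lexicon. A real system would replace
# this similarity lookup with a trained character-level
# sequence-to-sequence model; lexicon and threshold are illustrative.
from difflib import SequenceMatcher

LEXICON = {"aujourd'hui", "beaucoup", "quelqu'un", "toujours"}

def similarity(a: str, b: str) -> float:
    """Character-level similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()

def normalise(token: str, threshold: float = 0.75) -> str:
    """Return the closest standard form if it is similar enough,
    otherwise keep the token unchanged."""
    best = max(LEXICON, key=lambda w: similarity(token, w))
    return best if similarity(token, best) >= threshold else token

print(normalise("aujourdhui"))  # -> "aujourd'hui"
print(normalise("bcp"))         # too dissimilar: kept as-is
```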

At the same time, the different types of language variation still require specific models, resources and tools. All these directions of research constitute the core of our second research strand, described in this section.

Theoretical and empirical synchronic linguistics

Although we plan to explore computational models of language variation, it is important to first gain better insights into language in general and into the way humans apprehend it. We do so in at least two directions, associating computational linguistics with formal and descriptive linguistics on the one hand (especially at the morphological level) and with cognitive linguistics on the other hand (especially at the syntactic level).

Recent advances in morphology rely on quantitative and computational approaches and, sometimes, on collaboration with descriptive linguists. In this regard, ALMAnaCH members have taken part in the design of quantitative approaches to defining and measuring morphological complexity and to assessing the internal structure of morphological systems (inflection classes, predictability of inflected forms…). Such studies provide valuable insights into these prominent questions in theoretical morphology. They also improve the linguistic relevance and the development speed of NLP-oriented lexicons, as also demonstrated by ALMAnaCH members. We shall therefore pursue these investigations and orient them towards their use in diachronic models (for which see section 3.3.3).
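
By way of illustration, the toy computation below measures one such quantity, the conditional entropy of one paradigm cell given another, in the spirit of entropy-based approaches to morphological complexity. The verb data and the suffix-based analysis are deliberately simplistic assumptions made for the example.

```python
# Toy illustration of measuring inflectional predictability: how
# uncertain is the 3sg present suffix once the infinitive suffix is
# known? 0 bits would mean fully predictable. Data is invented.
from collections import Counter
from math import log2

# (infinitive suffix, 3sg present suffix) pairs for toy French-like verbs
paradigms = [("er", "e"), ("er", "e"), ("ir", "it"), ("ir", "t"), ("re", "t")]

pairs = Counter(paradigms)
infinitives = Counter(inf for inf, _ in paradigms)

# Conditional entropy H(3sg | infinitive)
h = 0.0
for (inf, sg3), n in pairs.items():
    p_pair = n / len(paradigms)        # joint probability of the pair
    p_cond = n / infinitives[inf]      # probability of 3sg given infinitive
    h -= p_pair * log2(p_cond)

print(f"H(3sg | infinitive) = {h:.3f} bits")  # -> 0.400 bits
```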

Regarding cognitive linguistics, the starting ANR-NSF project “Neuro-Computational Models of Natural Language” (NCM-NL) provides us with an ideal opportunity to go in this direction, by examining potential correlations between brain imaging data acquired from participants listening to a reading of “Le Petit Prince” and computational models applied to the novel. A secondary prospective benefit of the project is information about how the participants' processing evolves over the course of the novel, possibly due to the human use of contextual information.

Sociolinguistic variation

Because language is central in our social interactions, it is legitimate to ask how the rise of digital content and its tight integration into our daily lives through social media has become a factor acting on language. This is all the more relevant as the recent rise of novel digital services opens new areas of expression, which support new linguistic behaviours. In particular, social media platforms such as Twitter provide channels of communication through which speakers/writers use their language in ways that differ from standard written and oral forms. The result is the emergence of new language varieties.

A very similar situation exists with regard to historical texts, especially documentary texts and graffiti, but also literary texts, which do not follow a standardised orthography, morphology or syntax.

However, NLP tools are designed for standard forms of language and exhibit a drastic loss of accuracy when applied to social media varieties or unstandardised historical sources. Defining appropriate tools requires descriptions of these varieties; yet such descriptions themselves need tools in order to be validated. We address this circularity in an interdisciplinary fashion, by working both on linguistic descriptions and on NLP tool development. Sociodemographic variables have recently been shown to have a strong impact on NLP tools. This is why, jointly with researchers involved in the ANR project SoSweet (ENS Lyon and Inria's Dante team), we study, in a first step, how these variables can be factored out by our models and, in a second step, how they can be accurately predicted from sources lacking such sociodemographic metadata.
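
A deliberately naive sketch of the second step (predicting a sociodemographic variable from text) is given below, using a tiny multinomial Naive Bayes classifier over word unigrams. The tweets, labels and age-group variable are all invented; actual experiments rely on much larger annotated corpora and stronger models.

```python
# Hypothetical sketch: predict a coarse sociodemographic label from
# tweet text with multinomial Naive Bayes over word unigrams.
# Training data is invented for the example.
from collections import Counter, defaultdict
from math import log

train = [
    ("mdr trop stylé ce jeu", "under25"),
    ("jsuis deg la prof a raison", "under25"),
    ("belle journée pour jardiner", "over25"),
    ("réunion reportée à lundi prochain", "over25"),
]

class_counts = Counter(label for _, label in train)
word_counts = defaultdict(Counter)
for text, label in train:
    word_counts[label].update(text.split())

vocab_size = len({w for c in word_counts.values() for w in c})

def predict(text: str) -> str:
    scores = {}
    for label in class_counts:
        total = sum(word_counts[label].values())
        # log prior + smoothed log likelihood of each token
        score = log(class_counts[label] / len(train))
        for w in text.split():
            score += log((word_counts[label][w] + 1) / (total + vocab_size))
        scores[label] = score
    return max(scores, key=scores.get)

print(predict("mdr jsuis au jardin"))  # -> "under25"
```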

Diachronic variation

Language change is a type of variation pertaining to the diachronic axis. Yet any instance of language change, whatever its nature (phonetic, syntactic…), results from a particular case of synchronic variation (competing phonetic realisations, competing syntactic constructions…). The articulation of diachronic and synchronic variation is influenced to a large extent by both language-internal factors (e.g. the generalisation of context-specific facts) and language-external factors (social class, register, domain, and other types of variation).

Very few computational models of language change have been developed. Simple deterministic finite-state-based models of phonetic evolution have been used in different contexts. The PIElexicon project [62] uses such models to automatically generate forms attested in (classical) Indo-European languages, but it is based on an idiosyncratic and unacceptable reconstruction of the Proto-Indo-European language. Probabilistic finite-state models have also been used for automatic cognate detection and proto-form reconstruction, for example by [53] and [58]. Such models rely on a good understanding of the phonetic evolution of the languages at hand.
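
The following sketch gives a flavour of what such cognate-detection models compute, in a highly simplified, non-probabilistic form: a weighted edit distance whose substitution costs are reduced for a few known regular sound correspondences. The correspondence set, costs and example pair are illustrative only.

```python
# Correspondence-aware edit distance for cognate scoring: regular
# sound correspondences are cheap substitutions, so true cognates
# score lower than chance resemblances. Costs are illustrative.
CORRESPONDENCES = {("p", "f"), ("t", "th"), ("k", "h")}  # cf. Grimm's law

def sub_cost(a: str, b: str) -> float:
    if a == b:
        return 0.0
    if (a, b) in CORRESPONDENCES or (b, a) in CORRESPONDENCES:
        return 0.3  # regular correspondence: cheap substitution
    return 1.0

def distance(x: list, y: list) -> float:
    """Weighted Levenshtein distance over lists of segments."""
    d = [[0.0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        d[i][0] = float(i)
    for j in range(1, len(y) + 1):
        d[0][j] = float(j)
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + sub_cost(x[i - 1], y[j - 1]))
    return d[-1][-1]

# Latin "pater" vs English "father" (roughly segmented)
print(distance(list("pater"), ["f", "a", "th", "e", "r"]))  # -> 0.6
```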

In ALMAnaCH, we focus on modelling phonetic, morphological and lexical diachronic evolution, with an emphasis on computational etymological research and on the computational modelling of the evolution of morphological systems (morphological grammar and morphological lexicon). These efforts are in direct interaction with sub-strand 3b (development of lexical resources). We go beyond the above-mentioned purely phonetic models of language and lexicon evolution, as they fail to take into account a number of crucial dimensions, among which: (1) spelling, spelling variation and the relationship between spelling and phonetics; (2) synchronic variation (geographical, genre-related, etc.); (3) morphology, especially through intra-paradigmatic and inter-paradigmatic analogical levelling phenomena; (4) lexical creation, including via affixal derivation, back-formation processes and borrowings.

We apply our models to two main tasks. The first task, for example in the context of the ANR project Profiterole, consists in predicting non-attested or non-documented words at a certain date based on attestations of older or newer stages of the same word (e.g., predicting a non-documented Middle French word based on its Vulgar Latin and Old French predecessors and its Modern French successor). Morphological models and lexical diachronic evolution models provide independent ways to perform the same predictions, thus reinforcing our hypotheses or pointing to new challenges.
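
As a toy illustration of this first task, the sketch below applies an ordered list of rewrite rules to an earlier attested form to propose a later form. The rules are invented and grossly oversimplified (the real Profiterole models combine finer-grained phonetic modelling with morphological and lexical evidence), so the output is not the genuine French reflex.

```python
# Rule-based diachronic prediction sketch: ordered (invented,
# oversimplified) sound-change rules applied to an earlier form to
# propose a later, possibly undocumented one.
import re

RULES = [  # (pattern, replacement), applied in chronological order
    (r"^c(?=a)", "ch"),      # palatalisation of initial /k/ before /a/
    (r"b(?=[aeiou])", "v"),  # lenition of /b/ before vowel (toy rule)
    (r"m$", "n"),            # weakening of final /m/ (toy rule)
]

def evolve(form: str) -> str:
    for pattern, repl in RULES:
        form = re.sub(pattern, repl, form)
    return form

# Latin accusative "caballum"; the printed form is a toy prediction,
# not the actual attested outcome ("cheval").
print(evolve("caballum"))  # -> "chavallun"
```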

The second application task is computational etymology and proto-language reconstruction. Our lexical diachronic evolution models are to be paired with semantic resources (wordnets, word embeddings, and other corpus-based statistical information). This makes it possible to formally validate or suggest etymological or cognate relations between lexical entries from different languages of the same language family, provided they are all inherited. Such an approach could also be adapted to include the automatic detection of borrowings from one language to another (e.g. for studying the non-inherited layers of the Ancient Greek lexicon). In the longer term, we intend to investigate the feasibility of the automatic (unsupervised) acquisition of phonetic change models, especially when provided with lexical data for numerous languages from the same language family.
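
One simple way to pair the two kinds of evidence, sketched below under strong assumptions, is to accept a candidate cognate pair only when both its phonetic distance (e.g. the weighted edit distance sketched above) and the cosine similarity of cross-lingually aligned embeddings pass thresholds. The vectors and thresholds are invented for the example.

```python
# Hypothetical combination of phonetic and semantic evidence for
# cognate validation. Embedding vectors and thresholds are invented.
from math import sqrt

def cosine(u, v) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def plausible_cognates(phon_dist, emb_a, emb_b,
                       max_dist=1.0, min_sim=0.5) -> bool:
    """Accept a pair only if it is phonetically close AND semantically
    similar in a shared cross-lingual embedding space."""
    return phon_dist <= max_dist and cosine(emb_a, emb_b) >= min_sim

# toy aligned embeddings for Latin "pater" and English "father",
# with the 0.6 phonetic distance computed earlier
print(plausible_cognates(0.6, [0.9, 0.1, 0.3], [0.8, 0.2, 0.35]))  # True
```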

These lines of research rely on etymological datasets and standards for representing etymological information, for which see Section 3.4.2.

Accessibility-related variation

Language variation does not always constitute an additional complexity in the textual input of NLP tools. It can also characterise their intended output. This is the perspective from which we investigate the issue of text simplification (for a recent survey, see for instance [78]). Text simplification is an important task for improving access to information, for instance for people with disabilities and for non-native speakers learning a given language [63]. To this end, guidelines have been developed to help write documents that are easier to read and understand, such as the FALC (“Facile À Lire et à Comprendre”) guidelines for French. (http://www.unapei.org/IMG/pdf/GuidePathways.pdf)

Fully automated text simplification is not suitable for producing high-quality simplified texts. Besides, the involvement of disabled people in the production of simplified texts plays an important social role. Therefore, following previous work [57], [73], our goal is to develop tools for the computer-aided simplification of textual documents, especially administrative documents. Many of the FALC guidelines can only be expressed linguistically using complex syntactic constraints, and the amount of available “parallel” data (aligned raw and simplified documents) is limited. We therefore investigate hybrid techniques involving rule-based, statistical and neural approaches based on parsing results (for an example of previous parsing-based work, see [51]). Lexical simplification, another aspect of text simplification [60], [64], is also to be investigated. (We have started a collaboration with Facebook's Parisian FAIR laboratory, the UNAPEI (the largest French federation of associations defending and supporting people with intellectual disabilities and their families), and the French Secretariat of State in charge of Disabled Persons.)
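
As an example of the kind of computer-aided (rather than fully automatic) functionality we target, the sketch below generates lexical simplification suggestions from a hand-written substitution dictionary, leaving the final decision to the human editor. The dictionary and sentence are invented; FALC-oriented tools would add syntactic rewriting on top of parsing results.

```python
# Computer-aided lexical simplification sketch: propose simpler
# substitutes for review, rather than rewriting automatically.
# The substitution dictionary is invented for the example.
SIMPLER = {
    "ultérieurement": "plus tard",
    "domicile": "maison",
    "effectuer": "faire",
}

def suggest(sentence: str):
    """Return (complex word, simpler candidate) pairs for human review."""
    suggestions = []
    for word in sentence.split():
        key = word.lower().strip(".,;")
        if key in SIMPLER:
            suggestions.append((word, SIMPLER[key]))
    return suggestions

sentence = "Vous devez effectuer la demande ultérieurement."
for original, simpler in suggest(sentence):
    print(f"{original} -> {simpler}")  # shown to the human editor
```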

Accessibility can also be related to the various presentation forms of a document. This is the context in which we have initiated the OPALINE project, funded by the Programme d'Investissement d'Avenir - Fonds pour la Société Numérique. Our objective is to further develop the GROBID text-extraction suite so as to be able to re-publish existing books or dictionaries, available in PDF, in a format accessible to visually impaired persons.

Intertextual variation

Language variation is not restricted to language-internal dimensions such as the effects of sociolinguistic and diachronic factors. It also involves variation in the way the same content can be expressed. Detecting, analysing and qualifying this type of variation is a challenge with applications in different settings, such as the automatic study of intertextuality in ancient documents (different versions of the same myth, for instance), the automatic comparison of documents dealing with the same facts and citations (e.g. journalistic articles and news wires), the assessment of textual entailment, and the automatic detection of plagiarism. In ALMAnaCH, we put the emphasis on the first two of these examples.

The intertextual comparison of close witnesses of the same text produces valuable data on orthographic, morphological or semantic equivalences and variation (textual criticism). Automatic parallel detection not only provides information about the positive intertextuality between two sources (e.g. the use of Biblical quotations among Church Fathers or Rabbinic authors) but also reveals differences in how they use and transform the same textual material, and thereby the underlying authorial strategies and politics.

In automatic language processing, it is customary to focus on similarities when dealing with distinct documents. Instead, we can focus on modelling what is idiosyncratic to a certain text, given a reference. This makes it possible, for instance, to identify whether an elided passage is relevant or not. Identifying such relevant omissions was one of the goals of the VerDi project (on which see below).
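
A minimal sketch of this omission-oriented view is given below: a reference text is aligned with a derived witness, and the spans that the witness elides are flagged for downstream relevance assessment (by a model or a philologist). The two word sequences are invented; the VerDi work targets real document pairs.

```python
# Omission detection sketch: align two witnesses of a text and flag
# passages present in the reference but absent from the derived
# version. The example texts are invented.
from difflib import SequenceMatcher

reference = "in the beginning the scribe copied the whole prologue".split()
witness = "in the beginning the scribe copied the prologue".split()

matcher = SequenceMatcher(None, reference, witness)
for op, i1, i2, j1, j2 in matcher.get_opcodes():
    if op == "delete":  # in the reference, elided in the witness
        print("omitted:", " ".join(reference[i1:i2]))  # -> "omitted: whole"
```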