The ALMAnaCH project-team 1 brings together specialists in a pluridisciplinary research domain at the interface between computer science, linguistics, statistics, and the humanities, namely natural language processing, computational linguistics, and digital and computational humanities and social sciences.
Computational linguistics is an interdisciplinary field dealing
with the computational modelling of natural language. Research in this
field is driven both by the theoretical goal of understanding human
language and by practical applications in Natural Language
Processing (hereafter NLP) such as linguistic analysis (syntactic
and semantic parsing, for instance), machine translation, information
extraction and retrieval and human-computer dialogue. Computational
linguistics and NLP, which date back at least to the early 1950s, are
among the key sub-fields of Artificial Intelligence.
Digital Humanities and social sciences (hereafter DH) is an
interdisciplinary field that uses computer science as a source of
techniques and technologies, in particular NLP, for exploring research
questions in the social sciences and humanities. Computational Humanities and computational social sciences aim to improve the state of the art both in computer science (e.g. NLP) and in the social sciences and humanities, by involving computer science as a research field in its own right.
The scientific positioning of ALMAnaCH extends that of its Inria predecessor, the project-team ALPAGE, a joint team with Paris-Diderot University dedicated to research in NLP and computational linguistics. ALMAnaCH remains committed to developing state-of-the-art NLP software and resources that can be used by academics and in industry. At the same time, we continue our work on language modelling in order to provide a better understanding of languages, an objective that is reinforced and addressed in the broader context of computational humanities. Finally, we remain dedicated to having an impact on the industrial world and more generally on society, via multiple types of collaboration with companies and other institutions (startup creation, industrial contracts, expertise, etc.).
One of the main challenges in computational linguistics is to
model and to cope with language variation. Language varies with
respect to domain and genre (news wires, scientific literature,
poetry, oral transcripts...), sociolinguistic factors (age,
background, education; variation attested for instance on social
media), geographical factors (dialects) and other dimensions
(disabilities, for instance). But language also constantly evolves at
all time scales. Addressing this variability is still an open issue for
NLP. Commonly used approaches, which often rely on supervised and
semi-supervised machine learning methods, require very large amounts of
annotated data. They still suffer from the high level of variability found, for instance, in user-generated content and non-contemporary texts, as well as in domain-specific documents (e.g. financial or legal texts).
ALMAnaCH tackles the challenge of language variation in two complementary directions, supported by a third, transverse research axis on language resources. These three research axes do not reflect an internal organisation of ALMAnaCH in separate teams. They are meant to structure our scientific agenda, and most members of the project-team are involved in two or all of them.
ALMAnaCH's research axes, themselves structured into sub-axes, are the following:
As described above, ALMAnaCH's scientific programme is organised around three research axes. The first two aim to tackle the challenge of language variation in two complementary directions. They are supported by a third, transverse research axis on language resources. Our four-year objectives are described in much greater detail in the project-team proposal, whose very recent final validation in June 2019 resulted in the upgrade of ALMAnaCH to the “project-team” status in July 2019. They can be summarised as follows:
Our first objective is to stay at a state-of-the-art level in key NLP tasks such as shallow processing, part-of-speech tagging and (syntactic) parsing, which are core expertise domains of ALMAnaCH members. This will also require us to improve the generation of semantic representations (semantic parsing), and to begin to explore tasks such as machine translation, which now relies on neural architectures also used for some of the above-mentioned tasks. Given the generalisation of neural models in NLP, we will also be involved in better understanding how such models work and what they learn, something that is directly related to the investigation of language variation (Research axis 2). We will also work on the integration of both linguistic and non-linguistic contextual information to improve automatic linguistic analysis. This is an emerging and promising line of research in NLP. We will have to identify, model and take advantage of each type of contextual information available. Addressing these issues will enable the development of new lines of research related to conversational content. Applications include improved information and knowledge extraction algorithms. We will especially focus on challenging datasets such as domain-specific texts (e.g. financial, legal) as well as historical documents, in the larger context of the development of digital humanities. We are also currently exploring the even more challenging new direction of cognitively inspired NLP, investigating the possibility of enriching the architecture of state-of-the-art algorithms, such as RNNGs, on the basis of human neuroimaging data.
Language variation must be better understood and modelled in all its forms. In this regard, we will put a strong emphasis on four types of language variation and their mutual interaction: sociolinguistic variation in synchrony (including non-canonical spelling and syntax in user-generated content), complexity-based variation in relation to language-related disabilities, and diachronic variation (computational exploration of language change and language history, with a focus on the evolution from Old French to all forms of Modern French, as well as on Indo-European languages in general). The fourth type is more transverse: the noise introduced by Optical Character Recognition and Handwritten Text Recognition systems, especially in the context of historical documents, bears some similarities to that of non-canonical input in user-generated content (e.g. erroneous characters); it constitutes a kind of variation stemming from the way language is graphically encoded, which we call language-encoding variation. Other types of language variation will also become important research topics for ALMAnaCH in the future. This includes dialectal variation (e.g. work on Arabic varieties, on which we have already started working, producing the first annotated data set on Maghrebi Arabizi, the Arabic variants used on social media by people from North-African countries, written using a non-fixed Latin-script transcription), as well as the study and exploitation of paraphrases in a broader context than the above-mentioned complexity-based variation.
Both research axes above rely on the availability of language resources (corpora, lexicons), which is the focus of our third, transverse research axis.
Language resource development (raw and annotated corpora, lexical resources) is not just a necessary preliminary step for creating both evaluation datasets for NLP systems and training datasets for NLP systems based on machine learning. When dealing with datasets of interest to researchers from the humanities (e.g. large archives), it is also a goal per se and a preliminary step before making such datasets available and exploitable online. It involves a number of scientific challenges, among which are (i) tackling issues related to the digitisation of non-electronic datasets; (ii) tackling issues related to the fact that many DH-related datasets are domain-specific and/or not written in contemporary languages; (iii) the development of semi-automatic and automatic algorithms to speed up the work (e.g. automatic extraction of lexical information, low-resource learning for the development of pre-annotation algorithms, transfer methods to leverage existing tools and/or resources for other languages, etc.); and (iv) the development of formal models to represent linguistic information in the best possible way, which requires expertise at least in NLP and in typological and formal linguistics. Such endeavours are domains of expertise of the ALMAnaCH team, and a large part of our research activities will be dedicated to language resource development. In this regard, we aim to retain our leading role in the representation and management of lexical resources and in treebank development, and also to develop a complete processing line for the transcription, analysis and processing of complex documents of interest to the humanities, in particular archival documents. This research axis 3 will benefit the whole team and beyond, and will benefit from and feed the work of the other research axes.
This first research strand is centred around NLP technologies and some of their applications in Artificial Intelligence (AI). Core NLP tasks such as part-of-speech tagging and syntactic and semantic parsing are improved by integrating new approaches, such as (deep) neural networks, whenever relevant, while preserving and taking advantage of our expertise in symbolic and statistical systems: hybridisation not only couples symbolic and statistical approaches, but neural approaches as well. AI applications are twofold, notwithstanding the impact of language variation (see the next strand): (i) information and knowledge extraction, whatever the type of input text (from financial documents to ancient, historical texts and from Twitter data to Wikipedia), and (ii) chatbots and natural language generation. In many cases, our work on these AI applications is carried out in collaboration with industrial partners. The specificities and issues caused by language variation (a text in Old French, a contemporary financial document and tweets with non-canonical spelling cannot be processed in the same way) are addressed in the next research strand.
Our expertise in NLP is the outcome of more than 10 years spent developing new models of analysis and accurate techniques for the full processing of any kind of language input, since the early days of the Atoll project-team and the rise of linguistically informed data-driven models as put forward within the Alpage project-team.
Traditionally, a full natural language processing (NLP) chain is organised as a pipeline in which each stage of analysis corresponds to a traditional linguistic field (from a structuralist viewpoint), from morphological analysis to purely semantic representations. The problem is that this architecture is vulnerable to error propagation and very domain-sensitive: each of these stages must be compatible with the others at the lexical and structural levels it provides. We arguably built the best performing NLP chain for French 70, 114 and one of the best for robust multilingual parsing, as shown by our results in various shared tasks over the years 110, 107, 115, 76. We therefore pursue our efforts on each of the components we have developed: tokenisers (e.g. SxPipe), part-of-speech taggers (e.g. MElt), constituency parsers and dependency parsers (e.g. FRMG, DyALog-SR), as well as our recent neural semantic graph parsers 107.
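The sensitivity of such a strictly pipelined architecture to error propagation can be sketched as follows; this is a minimal, purely illustrative example in which the component behaviour and interfaces are invented for the purpose of the illustration (they are not the actual APIs of SxPipe, MElt or FRMG).

```python
# A minimal, purely illustrative pipeline; components are placeholders,
# not the actual SxPipe / MElt / FRMG implementations.
from dataclasses import dataclass
from typing import List

@dataclass
class Token:
    form: str
    pos: str = "_"
    head: int = -1   # index of the syntactic governor, -1 for the root

def tokenise(text: str) -> List[Token]:
    # Naive whitespace tokenisation; a real tokeniser also handles clitics,
    # punctuation, URLs, emoticons, etc.
    return [Token(form) for form in text.split()]

def tag(tokens: List[Token]) -> List[Token]:
    # Placeholder tagger; a real tagger relies on a statistical or neural model.
    for tok in tokens:
        tok.pos = "PONCT" if tok.form in ".,!?" else "NC"
    return tokens

def parse(tokens: List[Token]) -> List[Token]:
    # Placeholder parser that attaches every token to the first one.
    # Any error made by the tagger above is propagated here unchanged,
    # which is the structural weakness of the strict pipeline architecture.
    for i, tok in enumerate(tokens):
        tok.head = 0 if i > 0 else -1
    return tokens

analysis = parse(tag(tokenise("Le chat dort .")))
print(analysis)
```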
In particular, we continue to explore the hybridisation of symbolic and statistical approaches, and extend it to neural approaches, as initiated in the context of our participation in the CoNLL 2017 multilingual parsing shared task 2 and in the Extrinsic Parsing Evaluation Shared Task 3.
Fundamentally, we want to build tools that are less sensitive to variation, more easily configurable, and self-adapting. Our short-term goal is to explore techniques such as multi-task learning (cf. already 99) to propose a joint model of tokenisation, normalisation, morphological analysis and syntactic analysis. We also explore adversarial learning, considering the drastic variation we face in parsing user-generated content and processing historical texts, both seen as noisy input that needs to be handled at training and decoding time.
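As a rough illustration of what such a joint model can look like, the following PyTorch sketch shares a single sentence encoder between two token-level prediction heads (for instance POS tags and normalisation labels) whose losses are summed during training; the dimensions, label inventories and loss weighting are purely illustrative and do not describe our actual systems.

```python
import torch
import torch.nn as nn

class JointTagger(nn.Module):
    """Shared BiLSTM encoder with one prediction head per task
    (here: POS tags and token-level normalisation labels).
    All sizes and label inventories are illustrative."""
    def __init__(self, vocab_size, n_pos, n_norm, emb_dim=64, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden, batch_first=True,
                               bidirectional=True)
        self.pos_head = nn.Linear(2 * hidden, n_pos)
        self.norm_head = nn.Linear(2 * hidden, n_norm)

    def forward(self, token_ids):
        states, _ = self.encoder(self.embed(token_ids))
        return self.pos_head(states), self.norm_head(states)

model = JointTagger(vocab_size=10000, n_pos=17, n_norm=5)
loss_fn = nn.CrossEntropyLoss()
tokens = torch.randint(0, 10000, (8, 20))      # dummy batch of token ids
pos_gold = torch.randint(0, 17, (8, 20))       # dummy gold POS tags
norm_gold = torch.randint(0, 5, (8, 20))       # dummy gold normalisation labels
pos_logits, norm_logits = model(tokens)
# The two task losses are simply summed; weighting them is a common refinement.
loss = loss_fn(pos_logits.view(-1, 17), pos_gold.view(-1)) \
     + loss_fn(norm_logits.view(-1, 5), norm_gold.view(-1))
loss.backward()
```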
While those points are fundamental, and therefore necessary, if we want to build the next generation of NLP tools, we need to push the envelope even further by tackling the biggest current challenge in NLP: handling the context within which a speech act takes place.
There is indeed a strong tendency in NLP to assume that each sentence is independent of its sibling sentences as well as of its context of enunciation, with the obvious objective of simplifying models and reducing the complexity of predictions. While this practice is already questionable when processing full-length edited documents, it becomes clearly problematic when dealing with short sentences that are noisy, full of ellipses and external references, as commonly found in User-Generated Content (UGC).
A more expressive and context-aware structural representation of a linguistic production is required to accurately model UGC. Let us consider for instance the case of syntax-based machine translation of social media content, as carried out by the ALMAnaCH-led ANR project Parsiti (PI: DS). A Facebook post may be part of a discussion thread, which may include links to external content. Such information is required for a complete representation of the post's context, and in turn for its accurate machine translation. Even for the presumably simpler task of POS tagging of dialogue sequences, the addition of context-based features (namely information about the speaker and dialogue moves) was beneficial 79. In the case of UGC, working across sentence boundaries was explored, for instance, with limited success, by 69 for document-wise parsing and by 98 for POS tagging.
Taking the context into account requires new inference methods able to share information between sentences, as well as new learning methods capable of finding out which information should be made available, and where. Integrating contextual information at all steps of an NLP pipeline is among the main research questions addressed in this research strand. In the short term, we focus on morphological and syntactic disambiguation within closed-world scenarios, as found in video games and domain-specific UGC. In the long term, we investigate the integration of linguistically motivated semantic information into joint learning models.
From a more general perspective, contexts may take many forms and require imagination to discern them, to obtain useful data sets, and to find ways to exploit them. A context may be a question associated with an answer, a rating associated with a comment (as provided by many web services), a thread of discussions (e-mails, social media, digital assistants, chatbots, on which see below), but also metadata about a situation (such as discussions between gamers in relation to the state of the game) or multiple points of view (pictures and captions, movies and subtitles). Even if the relationship between a language production and its context is imprecise and indirect, it is still a valuable source of information, notwithstanding the need for less supervised machine learning techniques (cf. the use of LSTM neural networks by Google to automatically suggest replies to emails).
The use of local contexts as discussed above is a new and promising approach. However, a more traditional notion of global context or world knowledge remains an open question and still raises difficult issues. Indeed, many aspects of language such as ambiguities and ellipsis can only be handled using world knowledge. Linked Open Data (LODs) such as DBpedia, WordNet, BabelNet, or Framebase provide such knowledge and we plan to exploit them.
However, each specialised domain (economy, law, medicine...) exhibits its own set of concepts with associated terms. This is also true of communities (e.g. on social media), and it is even possible to find communities discussing the same topics (e.g. immigration) with very distinct vocabularies. Global LODs weakly related to language may be too general and not sufficient for a specific language variant. Following and extending previous work in ALPAGE, we put an emphasis on information acquisition from corpora, including error mining techniques in parsed corpora (to detect specific usages of a word that are missing in existing resources), terminology extraction, and word clustering.
Word clustering is of specific importance. It relies on the distributional hypothesis initially formulated by Harris, which states that words occurring in similar contexts tend to be semantically close. The latest developments of these ideas (with word2vec or GloVe) have led to the embedding of words (through vectors) in low-dimensional semantic spaces. In particular, words that are typical of several communities (see above) can be embedded in the same semantic space in order to establish mappings between them. It is also possible in such spaces to study static configurations and vector shifts with respect to variables such as time, using topological theories (such as pretopology), for instance to explore shifts in meaning over time (cf. the ANR project Profiterole concerning ancient French texts) or between communities (cf. the ANR project SoSweet). It is also worth mentioning on-going work (in computational semantics) whose goal is to combine word embeddings in order to embed expressions, sentences, paragraphs or even documents into semantic spaces, e.g. to explore the similarity of documents from various time periods.
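The general idea of relating two independently trained embedding spaces (two communities, or two time periods) can be sketched as follows; in this toy numpy example, random vectors stand in for real embeddings, and the unconstrained least-squares mapping is only one simple variant of the alignment techniques used in practice (an orthogonality constraint, i.e. Procrustes alignment, is a common refinement).

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 50
# Toy stand-ins for embeddings trained on two corpora (two communities or two
# time periods); rows are the vectors of a shared anchor vocabulary.
anchors_space_a = rng.normal(size=(1000, dim))
anchors_space_b = rng.normal(size=(1000, dim))

# Least-squares linear mapping from space A to space B, estimated on anchors.
mapping, *_ = np.linalg.lstsq(anchors_space_a, anchors_space_b, rcond=None)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# A word's vector from corpus A, projected into corpus B's space, can then be
# compared with its vector in corpus B: a low similarity suggests a shift in
# usage between the two corpora.
word_a, word_b = anchors_space_a[42], anchors_space_b[42]
shift = 1.0 - cosine(word_a @ mapping, word_b)
print(f"usage-shift score: {shift:.3f}")
```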
Besides general knowledge about a domain, it is important to detect and keep track of more specific pieces of information when processing a document and maintaining a context, especially about (recurring) Named Entities (persons, organisations, locations...), something that is the focus of future work in collaboration with Patrice Lopez on named entity detection in scientific texts. Through the co-supervision of a PhD funded by the LabEx EFL (see below), we are also involved in pronominal coreference resolution (finding the referents of pronouns). Finally, we plan to continue working on deeper syntactic representations (as initiated with the Deep Sequoia Treebank), thus paving the way towards deeper semantic representations. Such information is instrumental when looking for more precise and complete information about who does what, to whom, when and where in a document. These lines of research are motivated by the need to extract useful contextual information, but it is also worth noting their strong potential for industrial applications.
NLP and DH tools and resources are very often developed for contemporary, edited, non-specialised texts, often based on journalistic corpora. However, such corpora are not representative of the variety of existing textual data. As a result, the performance of most NLP systems decreases, sometimes dramatically, when faced with non-contemporary, non-edited or specialised texts. Despite the existence of domain-adaptation techniques and of robust tools, for instance for social media text processing, dealing with linguistic variation is still a crucial challenge for NLP and DH.
Linguistic variation is not a monolithic phenomenon. Firstly, it can result from different types of processes, such as variation over time (diachronic variation) and variation correlated with sociological variables (sociolinguistic variation, especially on social networks). Secondly, it can affect all components of language, from spelling (languages without a normative spelling, spelling errors of all kinds and origins) to morphology/syntax (especially in diachrony, in texts from specialised domains, in social media texts) and semantics/pragmatics (again in diachrony, for instance). Finally, it can constitute a property of the data to be analysed or a feature of the data to be generated (for instance when trying to simplify texts for increasing their accessibility for disabled and/or non-native readers).
Nevertheless, despite this variability in variation, the underlying mechanisms are partly comparable. This motivates our general vision that many generic techniques could be developed and adapted to handle different types of variation. In this regard, three aspects must be kept in mind: spelling variation (human errors, OCR/HTR errors, lack of spelling conventions for some languages...), the lack or scarcity of parallel data aligning “variation-affected” texts with their “standard/edited” counterparts, and the sequential nature of the problem at hand. We will therefore explore, for instance, how unsupervised or weakly supervised techniques could be developed and used to feed dedicated sequence-to-sequence models. Such architectures could help develop “normalisation” tools adapted, for example, to social media texts, to texts written in ancient/dialectal varieties of well-resourced languages (e.g. Old French texts), and to OCR/HTR system outputs.
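To make the kind of sequence-to-sequence “normalisation” architecture referred to above more concrete, here is a deliberately minimal character-level encoder-decoder skeleton in PyTorch; the architecture, sizes and the toy example are illustrative only and do not correspond to any specific ALMAnaCH system.

```python
import torch
import torch.nn as nn

class CharNormaliser(nn.Module):
    """Character-level encoder-decoder mapping a noisy form to its normalised
    counterpart (e.g. 'bcp' -> 'beaucoup'). Purely illustrative."""
    def __init__(self, n_chars, emb=32, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(n_chars, emb)
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        self.decoder = nn.GRU(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_chars)

    def forward(self, noisy_ids, target_ids):
        # Encode the noisy character sequence into a single state.
        _, state = self.encoder(self.embed(noisy_ids))
        # Teacher-forced decoding of the normalised character sequence.
        dec_states, _ = self.decoder(self.embed(target_ids), state)
        return self.out(dec_states)

model = CharNormaliser(n_chars=100)
noisy = torch.randint(0, 100, (4, 12))     # dummy batch of noisy forms
target = torch.randint(0, 100, (4, 15))    # dummy normalised forms
logits = model(noisy, target)              # shape: (4, 15, 100)
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 100), target.reshape(-1))
loss.backward()
```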
However, the different types of language variation will still require specific models, resources and tools. All these directions of research constitute the core of our second research strand, described in this section.
Permanent members involved: all
We aim to explore computational models to deal with language variation. It is important to get more insights about language in general and about the way humans apprehend it. We will do so in at least two directions, associating computational linguistics with formal and descriptive linguistics on the one hand (especially at the morphological level) and with cognitive linguistics on the other hand (especially at the syntactic level).
Recent advances in morphology rely on quantitative and computational approaches and, sometimes, on collaboration with descriptive linguists—see for instance the special issue of the Morphology journal on “computational methods for descriptive and theoretical morphology”, edited and introduced by 67. In this regard, ALMAnaCH members have taken part in the design of quantitative approaches to defining and measuring morphological complexity and to assess the internal structure of morphological systems (inflection classes, predictability of inflected forms...). Such studies provide valuable insights on these prominent questions in theoretical morphology. They also improve the linguistic relevance and the development speed of NLP-oriented lexicons, as also demonstrated by ALMAnaCH members. We shall therefore pursue these investigations, and orientate them towards their use in diachronic models (see section 3.3.3).
Regarding cognitive linguistics, we have the perfect opportunity, with the starting ANR-NSF project “Neuro-Computational Models of Natural Language” (NCM-NL), to go in this direction, by examining potential correlations between medical imaging of patients listening to a reading of “Le Petit Prince” and computational models applied to the novel. A secondary prospective benefit of the project will be information about how processing (by the patients) evolves over the course of the novel, possibly due to the use of contextual information by humans.
Because language is central to our social interactions, it is legitimate to ask how the rise of digital content, and its tight integration into our daily life, has become a factor acting on language. This is all the more relevant as the recent rise of novel digital services opens new areas of expression, which support new linguistic behaviours. In particular, social media such as Twitter provide channels of communication through which speakers/writers use their language in ways that differ from standard written and oral forms. The result is the emergence of new language varieties.
A very similar situation exists with regard to historical texts, especially documentary texts or graffiti but even literary texts, that do not follow standardised orthography, morphology or syntax.
However, NLP tools are designed for standard forms of language and exhibit a drastic loss of accuracy when applied to social media varieties or non-standardised historical sources. To define appropriate tools, descriptions of these varieties are needed. However, to validate such descriptions, tools are also needed. We address this chicken-and-egg problem in an interdisciplinary fashion, by working both on linguistic descriptions and on the development of NLP tools. Recently, socio-demographic variables have been shown to bear a strong impact on NLP processing tools (see for instance 74 and references therein). This is why, in a first step, jointly with researchers involved in the ANR project SoSweet (ENS Lyon and Inria project-team Dante), we will study how these variables can be factored out by our models and, in a second step, how they can be accurately predicted from sources lacking these kinds of featured descriptions.
Language change is a type of variation pertaining to the diachronic axis. Yet any language change, whatever its nature (phonetic, syntactic...), results from a particular case of synchronic variation (competing phonetic realisations, competing syntactic constructions...). The articulation of diachronic and synchronic variation is influenced to a large extent by both language-internal factors (i.e. generalisation of context-specific facts) and/or external factors (determined by social class, register, domain, and other types of variation).
Very few computational models of language change have been developed. Simple deterministic finite-state-based phonetic evolution models have been used in different contexts. The PIElexicon project 93 uses such models to automatically generate forms attested in (classical) Indo-European languages but is based on an idiosyncratic and unacceptable reconstruction of the Proto-Indo-European language. Probabilistic finite-state models have also been used for automatic cognate detection and proto-form reconstruction, for example by 68 and 75. Such models rely on a good understanding of the phonetic evolution of the languages at hand.
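To make the notion of deterministic phonetic evolution models more concrete, the toy Python sketch below applies a few ordered, highly simplified Latin-to-Spanish rewrite rules; the rule set is purely illustrative and makes no claim of linguistic adequacy or completeness.

```python
import re

# Toy ordered rewrite rules sketching, in a very simplified way, a few
# well-known Latin-to-Spanish sound changes; purely illustrative.
RULES = [
    (r"ct", "ch"),                        # noctem -> noche (simplified)
    (r"([aeiou])t([aeiou])", r"\1d\2"),   # intervocalic t -> d (lenition)
    (r"um$", "o"),                        # final -um -> -o
    (r"am$", "a"),                        # final -am -> -a
    (r"em$", "e"),                        # final -em -> -e
]

def evolve(latin_form: str) -> str:
    """Apply the ordered rules to a (lower-cased) Latin form."""
    form = latin_form.lower()
    for pattern, replacement in RULES:
        form = re.sub(pattern, replacement, form)
    return form

for word in ["noctem", "vitam", "totum"]:
    print(word, "->", evolve(word))   # noche, vida, todo
```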
In ALMAnaCH, our goal is to work on modelling phonetic, morphological and lexical diachronic evolution, with an emphasis on computational etymological research and on the computational modelling of the evolution of morphological systems (morphological grammar and morphological lexicon). These efforts will be in direct interaction with sub-strand 3b (development of lexical resources). We want to go beyond the above-mentioned purely phonetic models of language and lexicon evolution, as they fail to take into account a number of crucial dimensions, among which: (1) spelling, spelling variation and the relationship between spelling and phonetics; (2) synchronic variation (geographical, genre-related, etc.); (3) morphology, especially through intra-paradigmatic and inter-paradigmatic analogical levelling phenomena; and (4) lexical creation, including via affixal derivation, back-formation processes and borrowings.
We apply our models to two main tasks. The first task, as developed for example in the context of the ANR project Profiterole, consists in predicting non-attested or non-documented words at a certain date based on attestations of older or newer stages of the same word (e.g., predicting a non-documented Middle French word based on its Vulgar Latin and Old French predecessors and its Modern French successor). Morphological models and lexical diachronic evolution models will provide independent ways to perform the same predictions, thus reinforcing our hypotheses or pointing to new challenges.
The second application task is computational etymology and proto-language reconstruction. Our lexical diachronic evolution models will be paired with semantic resources (wordnets, word embeddings, and other corpus-based statistical information). This will allow us to formally validate or suggest etymological or cognate relations between lexical entries from different languages of a same language family, provided they are all inherited. Such an approach could also be adapted to include the automatic detection of borrowings from one language to another (e.g. for studying the non-inherited layers in the Ancient Greek lexicon). In the longer term, we will investigate the feasibility of the automatic (unsupervised) acquisition of phonetic change models, especially when provided with lexical data for numerous languages from the same language family.
These lines of research will rely on etymological data sets and standards for representing etymological information (see Section 3.4.2).
Diachronic evolution also applies to syntax, and in the context of the ANR project Profiterole, we are beginning to explore more or less automatic ways of detecting these evolutions and suggest modifications, relying on fine-grained syntactic descriptions (as provided by meta-grammars), unsupervised sentence clustering (generalising previous works on error mining, cf. 9), and constraint relaxation (in meta-grammar classes). The underlying idea is that a new syntactic construction evolves from a more ancient one by small, iterative modifications, for instance by changing word order, adding or deleting functional words, etc.
Language variation does not always pertain to the textual input of NLP tools. It can also be characterised by their intended output. This is the perspective from which we investigate the issue of text simplification (for a recent survey, see for instance 111). Text simplification is an important task for improving access to information, for instance for people with disabilities and for non-native speakers learning a given language 94. To this end, guidelines have been developed to help write documents that are easier to read and understand, such as the FALC (“Facile À Lire et à Comprendre”) guidelines for French. 4
Fully automated text simplification is not suitable for producing high-quality simplified texts. Besides, the involvement of disabled people in the production of simplified texts plays an important social role. Therefore, following previous works 73, 105, our goal will be to develop tools for the computer-aided simplification of textual documents, especially administrative documents. Many of the FALC guidelines can only be linguistically expressed using complex, syntactic constraints, and the amount of available “parallel” data (aligned raw and simplified documents) is limited. We will therefore investigate hybrid techniques involving rule-based, statistical and neural approaches based on parsing results (for an example of previous parsing-based work, see 66). Lexical simplification, another aspect of text simplification 82, 95, will also be pursued. In this regard, we have already started a collaboration with Facebook's AI Research in Paris, the UNAPEI (the largest French federation of associations defending and supporting people with intellectual disabilities and their families), and the French Secretariat of State in charge of Disabled Persons.
Accessibility can also be related to the various presentation forms of a document. This is the context in which we have initiated the OPALINE project, funded by the Programme d'Investissement d'Avenir - Fonds pour la Société Numérique. The objective is for us to further develop the GROBID text-extraction suite 5 in order to be able to re-publish existing books or dictionaries, available in PDF, in a format that is accessible by visually impaired persons.
Language resources (raw and annotated corpora, lexical resources, etc.) are required in order to apply any machine learning technique (statistical, neural, hybrid) to an NLP problem, as well as to evaluate the output of an NLP system.
In data-driven, machine-learning-based approaches, language resources are the place where linguistic information is stored, be it implicitly (as in raw corpora) or explicitly (as in annotated corpora and in most lexical resources). Whenever linguistic information is provided explicitly, it complies with guidelines that formally define which linguistic information should be encoded, and how. Designing linguistically meaningful and computationally exploitable ways to encode linguistic information within language resources constitutes the first main scientific challenge in language resource development. It requires strong expertise in both the linguistic issues underlying the type of resource under development (e.g. syntax when developing a treebank) and the NLP algorithms that will make use of such information.
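As a concrete example of such explicit encoding guidelines, the widely used CoNLL-U format represents a dependency-annotated sentence with one token per line and ten tab-separated columns (ID, form, lemma, universal POS, language-specific POS, morphological features, head, dependency relation, enhanced dependencies, miscellaneous); the short French fragment below is merely illustrative.

```
# text = Le chat dort.
1	Le	le	DET	_	Definite=Def|Gender=Masc|Number=Sing|PronType=Art	2	det	_	_
2	chat	chat	NOUN	_	Gender=Masc|Number=Sing	3	nsubj	_	_
3	dort	dormir	VERB	_	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin	0	root	_	SpaceAfter=No
4	.	.	PUNCT	_	_	3	punct	_	_
```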
The other main challenge regarding language resource development is a consequence of the fact that it is a costly, often tedious task. ALMAnaCH members have a long track record of language resource development, including by hiring, training and supervising dedicated annotators. But manual annotation can be sped up by automatic techniques. ALMAnaCH members have also worked on such techniques, and have published work on approaches such as automatic lexical information extraction, annotation transfer from a language to closely related languages, and, more generally, on the use of pre-annotation tools for treebank development and on the impact of such tools on annotation speed and quality. These techniques are often also relevant for Research strand 1. For example, adapting parsers from one language to another or developing parsers that work on more than one language (e.g. a non-lexicalised parser trained on the concatenation of treebanks from different languages in the same language family) can both improve parsing results on low-resource languages and speed up treebank development for such languages.
Corpus creation and management (including automatic annotation) is often a time-consuming and technically challenging task. In many cases, it also raises scientific issues related, for instance, to linguistic questions (what is the elementary unit in a text?) as well as computer science challenges (for instance when OCR or HTR is involved). It is therefore necessary to design a work-flow that makes it possible to deal with data collections, even if they are initially available as photos, scans, Wikipedia dumps, etc.
These challenges are particularly relevant when dealing with ancient languages or scripts for which fonts, OCR techniques and language models may not exist or may be of inferior quality, as a result, among other factors, of the variety of writing systems and the lack of textual data. We will therefore work on improving print OCR for some of these languages, especially by moving towards joint OCR and language models. Of course, contemporary texts can often be gathered in very large volumes, as we already do within the ANR project SoSweet, resulting in different, specific issues.
ALMAnaCH pays specific attention to the re-usability 6 of all resources produced and maintained within its various projects and research activities. To this end, we will ensure maximum compatibility with available international standards for representing textual sources and their annotations. More precisely, we will take the TEI (Text Encoding Initiative) guidelines as well as the standards produced by ISO committee TC 37/SC 4 as essential points of reference.
From our ongoing projects in the field of Digital Humanities and emerging initiatives in this field, we observe a real need for complete but easy work-flows for exploiting corpora, starting from a set of raw documents and reaching the level where one can browse the main concepts and entities, explore their relationship, extract specific pieces of information, always with the ability to return to (fragments of) the original documents. The pieces of information extracted from the corpora also need to be represented as knowledge databases (for instance as RDF “linked data”), published and linked with other existing databases (for instance for people and locations).
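A minimal sketch of this last step, assuming the rdflib Python library: an entity mention extracted from a document is represented as RDF triples and linked to an external knowledge base entry; the URIs and property names are invented for the example.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, RDFS

# Toy example: representing an entity mention extracted from a document as RDF
# triples and linking it to an external knowledge base (here DBpedia).
EX = Namespace("http://example.org/corpus/")   # hypothetical namespace
g = Graph()
g.bind("ex", EX)

mention = URIRef(EX["doc42#entity-1"])
g.add((mention, RDF.type, EX.PersonMention))
g.add((mention, RDFS.label, Literal("Marie Curie", lang="fr")))
g.add((mention, EX.mentionedIn, URIRef(EX["doc42"])))
# Link to an existing knowledge base entry ("linked data").
g.add((mention, EX.refersTo, URIRef("http://dbpedia.org/resource/Marie_Curie")))

print(g.serialize(format="turtle"))
```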
The process may be seen as progressively enriching the documents with new layers of annotations produced by various NLP modules and possibly validated by users, preferably in a collaborative way. It relies on the use of clearly identified representation formats for the annotations, as advocated within ISO TC 37/SC 4 standards and the TEI guidelines, but also on the existence of well-designed collaborative interfaces for browsing, querying, visualisation, and validation. ALMAnaCH has been or is working on several of the NLP bricks needed for setting up such a work-flow, and has solid expertise in the issues related to standardisation (of documents and annotations). However, putting all these elements into a unified work-flow that is simple to deploy and configure remains to be done. In particular, the work-flow and the interface should perhaps not be dissociated, in the sense that the work-flow should be easily piloted and configured from the interface. An option will be to identify pertinent emerging platforms in DH (such as Transkribus) and to propose collaborations to ensure that NLP modules can be easily integrated.
It should be noted that such work-flows actually have a large potential beyond DH, for instance for exploiting a company's internal documentation or for exploring existing relationships between entities.
ALPAGE, the Inria predecessor of ALMAnaCH, put a strong emphasis on the development of morphological, syntactic and wordnet-like semantic lexical resources for French as well as other languages (see for instance 8, 1). Such resources play a crucial role in all NLP tools, as has been proven, among other tasks, for POS tagging 101, 103, 115 and parsing, and some of our lexical resource development will be targeted towards the improvement of NLP tools. These resources will also play a central role in studying diachrony in the lexicon, for example from Ancient to Contemporary French in the context of the Profiterole project. They will also be one of the primary sources of linguistic information for augmenting language models used in OCR systems for ancient scripts, and will allow us to develop automatic annotation tools (e.g. POS taggers) for low-resource languages (see already 116), especially ancient languages. Finally, semantic lexicons such as wordnets will play a crucial role in assessing lexical similarity and automating etymological research.
Therefore, an important effort towards the development of new morphological lexicons will be initiated, with a focus on ancient languages of interest. Following previous work by ALMAnaCH members, we will try to leverage, whenever possible, existing resources such as electronic dictionaries and OCRised dictionaries, both modern and ancient 100, 78, 102, while using and developing (semi-)automatic lexical information extraction techniques based on existing corpora 104, 106. A new line of research will be to integrate the diachronic axis by linking lexicons that are in diachronic relation with one another thanks to phonetic and morphological change laws (e.g. 12th-century French with 15th-century French and contemporary French). Another novelty will be the integration of etymological information into these lexical resources, which requires the formalisation, standardisation and extraction of etymological information from OCRised dictionaries or other electronic resources, as well as the automatic generation of candidate etymologies. These directions of research are already being investigated in ALMAnaCH 78, 102.
An underlying effort for this research will be to further develop the GROBID-dictionaries software, which provides cascading CRF (Conditional Random Fields) models for the segmentation and analysis of existing print dictionaries. The first results we have obtained have allowed us to set up specific collaborations to improve our performance in the domains of a) recent general-purpose dictionaries such as the Petit Larousse (Nénufar project, funded by the DGLFLF, in collaboration with the University of Montpellier), b) etymological dictionaries (in collaboration with the Berlin-Brandenburg Academy of Sciences) and c) patrimonial dictionaries such as the Dictionnaire Universel de Basnage (an ANR project, including a PhD thesis at ALMAnaCH, has recently started on this topic in collaboration with the University of Grenoble-Alpes and the University Sorbonne Nouvelle in Paris).
In the same way as we signalled the importance of standards for the representation of interoperable corpora and their annotations, we will keep making the best use of the existing standardisation background for the representation of our various lexical resources. There again, the TEI guidelines play a central role, and we have recently participated in the “TEI Lex 0” initiative to provide a reference subset for the “Dictionary” chapter of the guidelines. We are also responsible, as project leaders, for the edition of the new part 4 of the ISO standard 24613 (LMF, Lexical Markup Framework) 97, dedicated to the definition of the TEI serialisation of the LMF model (defined in ISO 24613 part 1 `Core model', part 2 `Machine Readable Dictionaries' and part 3 `Etymology'). We consider that contributing to standards allows us to stabilise our knowledge and transfer our competence.
Along with the creation of lexical resources, ALMAnaCH is also involved in the creation of corpora that are either fully manually annotated (gold standard) or automatically annotated with state-of-the-art processing pipelines (silver standard). Annotations will either be purely morphosyntactic or will cover more complex linguistic levels (constituency and/or dependency syntax, deep syntax, possibly semantics). Former members of the ALPAGE project have renowned experience in these aspects (see for instance 109, 96, 108, 87) and will participate in the creation of valuable resources originating from historical domains and genres.
Under the auspices of the ANR Parsiti project, led by ALMAnaCH (PI: DS), we aim to explore the interaction of extra-linguistic context and speech acts. Exploiting extra-linguistic context highlights the benefits of expanding the scope of current NLP tools beyond unit boundaries. Such information can be of a spatial or temporal nature, for instance, and has been shown to improve Entity Linking over social media streams 72. In our case, we decided to focus on a closed-world scenario in order to study the interaction between context and speech acts. To do so, we are developing a multimodal data set made of live sessions of a first-person shooter video game (Alien vs. Predator), in which we transcribed all human player interactions and facial expressions, aligned with a log of all in-game events linked to the video recording of the game session, as well as with the recording of the human players themselves. The in-game events are ontologically organised and enable the modelling of the extra-linguistic context at different levels of granularity. Recorded over many game sessions, more than 2 hours of speech have already been transcribed and will serve as a basis for the exploratory work needed to prototype our context-enhanced NLP tools. In the next step of this line of work, we will focus on enriching this data set with linguistic annotations, with an emphasis on coreference resolution and predicate-argument structures. The mid-term goal is to use this data set to validate a wide range of approaches for dealing with multimodal data in a closed-world environment.
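Purely for illustration, the following sketch shows one possible way of anchoring a transcribed utterance to surrounding in-game events; the field names and event types are invented for the example and are not the actual schema of the data set.

```python
import json

# Hypothetical illustration of an utterance anchored to extra-linguistic
# context (in-game events); the schema is invented for this example.
sample = {
    "utterance": {
        "speaker": "player_2",
        "start_time": 1043.2,          # seconds from session start
        "text": "derrière toi, recharge !",
    },
    "context_events": [
        {"time": 1042.7, "type": "enemy_spawn", "location": "corridor_B"},
        {"time": 1043.0, "type": "weapon_empty", "agent": "player_1"},
    ],
}
print(json.dumps(sample, ensure_ascii=False, indent=2))
```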
ALMAnaCH's research areas cover Natural Language Processing (nowadays identified as a sub-domain of Artificial Intelligence) and Digital Humanities. Application domains are therefore numerous, as witnessed by ALMAnaCH's multiple academic and industrial collaborations, for which see the relevant sections. Examples of application domains for NLP include:
MElt is a freely available (LGPL) state-of-the-art sequence labeller that is meant to be trained on both an annotated corpus and an external lexicon. It was developed by Pascal Denis and Benoît Sagot within the Alpage team, a joint INRIA and Université Paris-Diderot team in Paris, France. MElt allows for using multiclass Maximum-Entropy Markov models (MEMMs) or multiclass perceptrons (multitrons) as underlying statistical devices. Its output is in the Brown format (one sentence per line, each sentence being a space-separated sequence of annotated words in the word/tag format).
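For instance, a tagged French sentence in this Brown-style output looks as follows (the tags shown here are merely illustrative of the word/tag notation):

```
Le/DET chat/NC dort/V ./PONCT
```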
MElt has been trained on various annotated corpora, using Alexina lexicons as source of lexical information. As a result, models for French, English, Spanish and Italian are included in the MElt package.
MElt also includes a normalisation wrapper aimed at helping to process noisy text, such as user-generated data retrieved from the web. This wrapper is only available for French and English. It was used for parsing web data for both English and French, respectively during the SANCL shared task (Google Web Bank) and for developing the French Social Media Bank (Facebook, Twitter and blog data).
The WOLF (Wordnet Libre du Français, Free French Wordnet) is a free semantic lexical resource (wordnet) for French.
The WOLF has been built from the Princeton WordNet (PWN) and various multilingual resources.
The French QuestionBank is a corpus of around 2,000 questions coming from various domains (the TREC data set, French governmental organisations, NGOs, etc.). It contains different kinds of annotations: morpho-syntactic annotations (POS, lemmas) and surface syntax (phrase-based and dependency structures), including long-distance dependency annotations.
The TREC part is aligned with the English QuestionBank (Judge et al., 2006).
The aim of text simplification (TS) is to make a text easier to read and understand by simplifying its grammar and structure while keeping the underlying meaning and information identical. It is therefore an instance of language variation, based on language complexity. It can benefit numerous audiences, such as people with disabilities, language learners and even the general public, for instance when dealing with intrinsically complex texts such as legal documents.
In 2017 we initiated a collaboration with the Facebook Artificial Intelligence Research (FAIR) lab in Paris and with the UNAPEI, the federation of French associations helping people with mental disabilities and their families. The objective of this collaboration is to develop tools to help the simplification of texts aimed at mentally disabled people. More precisely, the aim is to develop a computer-assisted text simplification platform (as opposed to an automatic TS system). In this context, a CIFRE PhD thesis was initiated in collaboration with the FAIR on the TS task. We first dedicated important efforts to the problem of the evaluation of TS systems, which remains an open challenge. Since the task has common points with machine translation (MT), TS is often evaluated using MT metrics such as BLEU 91. However, such metrics require high quality reference data, which is rarely available for TS. TS has the advantage over MT of being a monolingual task, which allows for direct comparisons to be made between the simplified text and its original version. We compared multiple approaches to reference-less quality estimation of sentence-level TS systems, based on the dataset used for the QATS 2016 shared task 113. We distinguished three different dimensions: grammaticality, meaning preservation and simplicity. We showed that
In 2019, we investigated an important issue inherent to the TS task. Although it is often considered an all-purpose generic task where the same simplification is suitable for all, multiple audiences can benefit from simplified text in different ways. We therefore introduced a discrete parametrisation mechanism providing explicit control for TS systems based on Seq2Seq neural models; users can condition the simplifications returned by a model on parameters such as length and lexical complexity. We also showed that carefully chosen values of these parameters allow out-of-the-box Seq2Seq neural models to outperform their standard counterparts on simplification benchmarks. Our best parametrised model improves over the previous state-of-the-art performance, as shown by our 2020 publication on the topic 86.
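The general idea of this discrete parametrisation can be illustrated as follows: control tokens encoding the requested output properties are simply prepended to the source sentence before it is fed to an otherwise standard Seq2Seq model. The token names and value buckets below are invented for the example and are not the exact ones used in the published system.

```python
def add_control_tokens(source: str, char_ratio: float, lexical_complexity: float) -> str:
    """Prepend discrete control tokens to a source sentence so that a standard
    Seq2Seq model can condition its output on them.
    Token names and bucketing are illustrative."""
    return (f"<LENGTH_{char_ratio:.1f}> "
            f"<LEXCOMP_{lexical_complexity:.1f}> "
            f"{source}")

# At training time, the ratios are computed from each (source, reference) pair;
# at inference time, the user chooses them, e.g. to request a much shorter and
# lexically simpler output:
print(add_control_tokens("La jurisprudence constante de la Cour ...", 0.7, 0.6))
```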
In 2020, we have been involved in the development of a new text simplification corpus, in collaboration with the University of Sheffield. In order to simplify a sentence, human editors perform multiple rewriting transformations: splitting it into several shorter sentences, paraphrasing (i.e. replacing complex words or phrases with simpler synonyms), reordering components and/or deleting information deemed unnecessary. Despite the vast range of possible text alterations, current models for automatic sentence simplification are evaluated on datasets that focus on single transformations, such as paraphrasing or splitting. This makes it impossible to assess the ability of simplification models in more abstractive and realistic settings. This is what motivated the development of ASSET, a new dataset for assessing sentence simplification in English 18. ASSET is a crowdsourced multi-reference corpus in which each simplification was produced by executing several rewriting transformations. Through quantitative and qualitative experiments, we have shown that simplifications in ASSET are better at capturing characteristics of simplicity than other standard evaluation datasets for the task. Furthermore, we motivated the need to develop better methods of automatic evaluation using ASSET, since we showed that currently popular metrics may not be suitable for assessment when multiple simplification transformations are performed.
In 2020 we also extended the scope of TS to multiple languages, using a totally unsupervised method and achieving state-of-the-art results, even compared to previous supervised models 49. We use a large-scale mining pipeline to extract one billion sentences from the web using Common Crawl. These sentences are used to find millions of paraphrases in three languages: English, French and Spanish. This allows us to train our controllable models 86 in multiple languages, which we then adapt for the TS task. These models push the state of the art further in all three languages and benchmarks. Generated simplifications are considered more fluent and simpler than those of previous models. This research enables us to exploit our previous work and apply it to the Cap'FALC project, whose aim is to assist the simplification of French documents by people with intellectual disabilities.
Pretrained language models are now ubiquitous in Natural Language Processing. Despite their success, most available models have either been trained on English data or on the concatenation of data in multiple languages 71, 83. This makes practical use of such models—in all languages except English—very limited. In 2019, one of the most visible achievements of the ALMAnaCH team was the training and release of CamemBERT, a BERT-like 71 (and more specifically a RoBERTa-like) neural language model for French trained on the French section of our large-scale web-based OSCAR corpus 90, together with CamemBERT variants 85. Our goal was to investigate the feasibility of training monolingual Transformer-based language models for other languages, taking French as an example and evaluating our language models on part-of-speech tagging, dependency parsing, named entity recognition and natural language inference tasks. We showed that the use of web-crawled data (such as found in OSCAR) to train such language models is preferable to the use of Wikipedia data, because of the homogeneity of Wikipedia data. More surprisingly, we also showed that a relatively small web crawled dataset (4GB of randomly extracted text from the French section of OSCAR) leads to results that are as good as those obtained using larger datasets (130+GB, i.e. the whole French section of OSCAR) 29, 37. CamemBERT allowed us to reach or improve the state of the art in all four downstream tasks.
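The released model can be used directly through the Hugging Face transformers library; below is a minimal fill-mask sketch with the publicly available camembert-base checkpoint (the exact predictions and scores will vary with the library version).

```python
from transformers import pipeline

# Fill-mask demo with the publicly released camembert-base checkpoint.
fill_mask = pipeline("fill-mask", model="camembert-base")
for prediction in fill_mask("Le camembert est <mask> !")[:3]:
    print(prediction["token_str"], round(prediction["score"], 3))
```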
In parallel, we used OSCAR to train ELMo language models 92, i.e. LSTM-based monolingual contextualised language models, for five mid-resource languages (Catalan, Finnish, Indonesian, Bulgarian and Danish). We showed that, despite the noise in the web-based OSCAR data, embeddings trained on OSCAR perform much better than monolingual embeddings trained on Wikipedia 33. They actually equal or improve the current state of the art in tagging and parsing for all five languages. In particular, they also improve over multilingual Wikipedia-based contextual embeddings (multilingual BERT), which almost always constitute the previous state of the art, thereby showing that the benefit of a larger, more diverse corpus surpasses the cross-lingual benefit of multilingual embedding architectures, a result in line with our findings on CamemBERT.
Two teams involving ALMAnaCH members participated in the CLEF-HIPE 2020 challenge on Named Entity (NE) Processing on old newspapers, using different approaches and competing on different tasks within the challenge.
One team, a collaboration with the LATTICE laboratory (Sorbonne Université), competed under the name SinNer on Named Entity Recognition in French and German texts. The best system we proposed ranked third for these two languages. It uses FastText embeddings and ELMo language models that we trained ourselves (FrELMo and German ELMo). We showed that combining several word representations enhances the quality of the results for all NE types and that segmentation into sentences has an important impact on the results 31.
The other team competed in the NERC-COARSE-LIT and “Entity Linking only” (EL-ONLY) tasks for English and French 28, using two systems jointly developed by Patrice Lopez and ALMAnaCH: 1) DeLFT, a Deep Learning framework for text processing; 2) entity-fishing, a generic NE recognition and disambiguation service. The main goal of this participation was to assess the performance level of these tools, rather than to do everything possible to perform optimally in the competition. Despite this, we ranked second in the NERC coarse English strict (literal sense) task and first in the EL-ONLY task for English.
In 2020 we continued our experiments to investigate whether and under which conditions neural networks can be used to learn sound correspondences between two related languages, i.e. for the prediction of cognates of source language words in a related target language. In particular, we investigated the learnability of sound correspondences between a proto-language and daughter languages for two machine-translation-inspired models, one statistical, the other neural 21. We first carried out our experiments on plausible artificial languages, without noise, in order to study the role of each parameter on the algorithms' respective performance under almost perfect conditions. We then studied real languages, namely Latin, Italian and Spanish, using our EtymDB 2.0 etymological database 22, to see if the performance generalises well. We showed that both model types can learn sound changes despite data sparsity, although the best performing model type depends on multiple parameters such as the size of the training data, the ambiguity, and the prediction direction.
Since the experiments published in 21, we have improved both our way to extract data from EtymDB and our neural systems in multiple ways, leading to improved results that should be published in 2021.
The problem of comparing two bodies of text and searching for words that differ in their usage between them arises often in digital humanities and computational social science. It is becoming increasingly crucial, as it is one of the keys to detecting emerging communities that may use a specific, almost self-identifying language. This is why, as part of the ANR project SoSweet and the PHC Maimonide project (in collaboration with Bar Ilan University for the latter), ALMAnaCH has invested a lot of effort since 2018 into studying language variation within user-generated content (UGC), taking into account two main interrelated dimensions: how language variation is related to socio-demographic and dynamic network variables, and how UGC language evolves over time. Taking advantage of the SoSweet corpus (600 million tweets) and of the Bar Ilan Hebrew Tweets corpus (180M tweets), both collected over the last 5 years, we have been addressing the problem of studying semantic change via the use of dynamic word embeddings (i.e. embeddings evolving over time). We devised a novel attention model, based on Bernoulli word embeddings, which are conditioned on contextual extra-linguistic features such as network, spatial and socio-economic variables, which can be inferred from Twitter user metadata, as well as on topic-based features. We posit that these social features provide an inductive bias that is likely to help our model overcome the narrow time-span regime problem. Our extensive experiments reveal that, as a result of being less biased towards frequency cues, our proposed model is able to capture subtle semantic shifts, and therefore benefits from the inclusion of a reduced set of contextual features. Our model thus fits the data better than current state-of-the-art dynamic word embedding models, and is a promising tool to study diachronic semantic changes over short time periods. These ideas and results are published in 77.
In addition to detecting semantic variation of words over time, and within the same Maimonide project, we explored the question of detecting word usage change by proposing a simple yet effective method that relies neither on complex reformulations of word embeddings nor on a rich feature set. A common approach to this task is to train word embeddings on each corpus, align the two vector spaces, and then look for words whose cosine distance in the aligned space is large. However, these methods often require extensive filtering of the vocabulary to perform well, and they produce unstable, and therefore less reliable, results. We proposed an alternative approach that does not use vector space alignment and instead considers the neighbours of each word. The method is simple, interpretable and stable. We demonstrated its effectiveness in nine different setups, considering different corpus-splitting criteria (age, gender and profession of tweet authors, as well as the time of the tweet) and different languages (English, French and Hebrew). This work was published at ACL 2020, the most prestigious NLP/CL venue 24.
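For illustration, the sketch below captures the neighbour-based idea under simple assumptions (gensim Word2Vec embeddings, placeholder corpora and hyperparameters); it is not the exact experimental setup of 24.

```python
# Sketch of the neighbour-based approach to detecting word usage change:
# train one embedding space per corpus, then rank words by how little their
# nearest-neighbour lists overlap across the two spaces. Hyperparameters,
# tokenisation and the two corpora are placeholders, not those of the paper.
from gensim.models import Word2Vec

def train(sentences):
    # sentences: iterable of token lists (one list per tweet/sentence)
    return Word2Vec(sentences, vector_size=100, window=5, min_count=20, workers=4)

def usage_change_candidates(model_a, model_b, k=100, n=50):
    shared = set(model_a.wv.index_to_key) & set(model_b.wv.index_to_key)
    scores = {}
    for word in shared:
        # Keep only neighbours from the shared vocabulary so lists are comparable.
        nn_a = {w for w, _ in model_a.wv.most_similar(word, topn=k) if w in shared}
        nn_b = {w for w, _ in model_b.wv.most_similar(word, topn=k) if w in shared}
        scores[word] = len(nn_a & nn_b)  # low overlap = candidate usage change
    return sorted(scores, key=scores.get)[:n]  # n words with the least overlap

# model_a = train(tokenised_tweets_split_a)  # placeholder corpora (e.g. by age,
# model_b = train(tokenised_tweets_split_b)  # gender, profession or time)
# print(usage_change_candidates(model_a, model_b))
```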
Pushing the boundaries of noisy user-generated content (UGC) processing, and given the increasing importance of being able to tackle low-resource dialects found on social media, we released last year the first dataset for a non-formalised North-African dialect, Arabizi, which is written in the Latin script and often code-mixed with French. The resulting annotated dataset and its accompanying unlabelled data constitute the first UGC dataset for an Arabic-related language and therefore an important milestone for the community 34. All the parsing issues raised by morphologically rich languages, UGC and noisy content are present in this dataset, which makes it a perfect crash test for current deep learning approaches to NLP. We are currently exploring numerous transfer-learning techniques involving neural contextualised language models to address the challenges posed by this kind of highly variable dialect.
Building NLP systems for such highly variable and low-resource languages is a difficult challenge. The recent success of large-scale multilingual pretrained neural language models (including our CamemBERT language model for French) provides us with new modelling tools to tackle it. We have studied the ability of the multilingual version of BERT to model the unseen dialect cited above (Arabizi). We have shown in different scenarios that multilingual language models are able to transfer to such an unseen dialect, specifically in two extreme cases: across scripts (Arabic to Latin) and from Maltese, a related language written in the Latin script that was unseen during pretraining. The first results have already been published 112, 89 and have led to fruitful collaborations with Yanai Elazar from Bar Ilan University and Antonios Anastasopoulos from George Mason University. These works, which expand from North-African Arabic to many other languages sharing similar properties, focus on understanding the inner workings of large multilingual neural language models when fine-tuned in cross-language scenarios with various degrees of resource availability (from large to extremely scarce). Conducted in 2020, they led to the submission of two papers 50, 88. The second was accepted at EACL 2021 and the first is currently under submission at NAACL-HLT 2021.
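The following sketch illustrates, under hedged assumptions, the kind of fine-tuning setup used in such transfer experiments with the Hugging Face transformers library; the token-level task, the tag set and the code-mixed example are placeholders and do not reproduce the protocol of 50 or 88.

```python
# Sketch of fine-tuning a multilingual pretrained model (here mBERT) for a
# token-level task on a small, noisy dialectal example. Label set, data and
# training details are placeholders; this is not the exact experimental protocol.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["NOUN", "VERB", "ADP", "PUNCT", "X"]  # placeholder tag set
tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=len(labels))

# One hypothetical code-mixed Arabizi/French sentence with word-level tags.
words = ["3lach", "tu", "rigoles", "?"]
tags = ["X", "NOUN", "VERB", "PUNCT"]

enc = tok(words, is_split_into_words=True, return_tensors="pt")

# Propagate each word-level tag to its first sub-word; ignore the rest (-100).
label_ids, previous = [], None
for wid in enc.word_ids():
    if wid is None or wid == previous:
        label_ids.append(-100)  # special tokens and non-initial sub-words
    else:
        label_ids.append(labels.index(tags[wid]))
    previous = wid

out = model(**enc, labels=torch.tensor([label_ids]))
out.loss.backward()  # one illustrative fine-tuning step
```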
In the context of the ANR project PROFITEROLE, we have been studying the diachronic evolution of Old French, in particular at the lexical and syntactic levels, through the development of linguistic resources.
In 2020, we published a new version of OFrLex, resulting from an automatic approach to the enrichment of lexical entries 26. We also updated LEMA, an application used to update lexical entries and to validate automatically enriched ones. To study the lexical diachronic evolution of French, we made available an ELMo-like neural language model trained on the PROFITEROLE corpus, and carried out preliminary experiments with it. Still in relation to diachronic language models, we focused on learning meta word embeddings for diachronic French lexemes, by combining OFrLex and the PROFITEROLE corpus.
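As an illustration only, the sketch below shows one common, generic way of building meta word embeddings (concatenation of two embedding sources followed by a truncated SVD); it is not necessarily the method used in our PROFITEROLE experiments.

```python
# Generic sketch of meta word embeddings: concatenate vectors from two sources
# (e.g. a lexicon-derived space and a corpus-derived space) and reduce the
# result with a truncated SVD. Purely illustrative of the general idea.
import numpy as np

def meta_embeddings(emb_a: dict, emb_b: dict, dim: int = 100) -> dict:
    """emb_a, emb_b: word -> 1-D numpy vector; returns word -> meta vector."""
    shared = sorted(set(emb_a) & set(emb_b))
    concat = np.stack([np.concatenate([emb_a[w], emb_b[w]]) for w in shared])
    concat -= concat.mean(axis=0)                  # centre before the SVD
    u, s, _ = np.linalg.svd(concat, full_matrices=False)
    reduced = u[:, :dim] * s[:dim]                 # truncated projection
    return dict(zip(shared, reduced))
```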
At the syntactic level, Mathilde Regnault pursued the development of MetaMOF, a meta-grammar for Old French derived from FRMG, our large-coverage meta-grammar for modern French. Improvements addressed in particular clitic raising, sentence modifiers and extraposed modifiers, and of course the freer word order found in Old French compared with modern French. The coupling with the OFrLex lexicon has also been strengthened, through an in-depth revision of closed lexical categories (such as determiners and pronouns), but also through the constitution of a temporary list of verbal entries enriched with their valency (to be properly incorporated into OFrLex). Thanks to these evolutions, the latest experiments, run in December, showed a promising coverage of 83% on the PROFITEROLE corpus.
This work on MetaMOF, and in particular the reuse of FRMG, has triggered renewed interest in meta-grammars and, driven by Mathilde Regnault, led to the creation of Demeter, a new working group within the GdR Lift. The Demeter initiative aims to promote the use of meta-grammars among computational linguists and field linguists, through the dissemination of resources and the development of new tools. We organised a first session during the last GdR days (December 10th and 11th, 2020), during which Mathilde Regnault and Éric Villemonte de la Clergerie presented MetaMOF and FRMG, as well as SMG, the underlying meta-grammar formalism.
Still at the syntactic level, Mathilde Regnault was involved in neural parsing experiments for Old French, which led to the submission of a paper to ACL-IJCNLP 2021.
Like many other fields, research in computational linguistics requires scholarly information and in particular research data to be shared widely within communities so that experiments and results can be compared and, when possible, reproduced. We have therefore explored the nature of the underlying scholarly tenets and the ways forward to improve the global landscape in the domain of open science.
With Jennifer Edmond, from Trinity College Dublin and current director general of the EU infrastructure DARIAH, we contributed 41 to the debate on the general evolution of scholarly publishing in relation to the digital turn. Further to a survey carried out within Inria, we also addressed 64 the situation in the domain of research data and contributed to the definition of the corresponding policy within our institution.
We also analysed the importance of standards for the reusability of research data in language processing research 58, drawing on the experience gained through our various involvements in ISO committee TC 37 (Language and terminology). These ideas were put to the test in a large-scale experiment involving the extraction of named entities from an open corpus of digital book publications, in relation to the EU infrastructure OPERAS 14. We also had the opportunity to present a global vision of the Text Encoding Initiative (TEI) guidelines as an open science endeavour during an invited talk at the Academy of Sciences in Vienna 59.
Finally, we addressed issues related to new mechanisms for assessing scholarly research beyond traditional peer review processes as we know them. In particular, we explored the various aspects of post-publication peer review 57 and its implementation in the form of the Episciences journal management platform 63.
For several years, the ALMAnaCH team has played a leading role in defining standards for representing lexical content, whether resulting from the digitisation of legacy dictionaries or from the creation of new lexical resources serving as a basis for computational linguistics processes. Significant progress was made in this respect in 2020. First, the second part of the ISO 24613 portfolio, dedicated to machine-readable dictionaries 80, was published, and two other parts moved forward to the last step of the ISO process: part 3, on the modelling of etymological processes, and part 4 (published in January 2021), which provides a reference subset of the TEI guidelines encompassing the features standardised in parts 1 to 3.
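To give a concrete flavour of the TEI-based encoding of lexical entries targeted by these standards, the following minimal sketch builds an entry with elements commonly used in TEI dictionary encoding (entry, form/orth, gramGrp/pos, sense/def); the French example entry is purely illustrative and not taken from any standardised resource.

```python
# Minimal sketch of a TEI-style dictionary entry built programmatically.
# The French example ("cheval") is purely illustrative.
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
ET.register_namespace("", TEI_NS)

def tei(tag, parent=None, text=None, **attrs):
    """Create a TEI-namespaced element, optionally attached to a parent."""
    qname = f"{{{TEI_NS}}}{tag}"
    el = ET.Element(qname, attrs) if parent is None else ET.SubElement(parent, qname, attrs)
    el.text = text
    return el

entry = tei("entry")
form = tei("form", entry, type="lemma")
tei("orth", form, "cheval")
gram = tei("gramGrp", entry)
tei("pos", gram, "noun")
sense = tei("sense", entry)
tei("def", sense, "large domesticated animal used for riding")

print(ET.tostring(entry, encoding="unicode"))
```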
The bridge between the general ISO standardisation work and the community-based TEI guidelines has been further developed in the context of the DARIAH working group on lexical resources, whose objective is to identify univocal TEI-based constructs for the variety of lexical features encountered in machine-readable dictionaries. The quality of the work carried out by the group (see https://
Building on our long-standing contribution to the development of the GROBID suite, the ALMAnaCH team has pursued its efforts to extend the scope and performance of the GROBID modules within various ongoing collaborations.
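As a simple illustration of how GROBID is typically used, the sketch below queries a locally deployed GROBID service (an assumed local instance) through its REST API to convert a PDF into structured TEI; the specialised models discussed below for dictionaries and medical documents are separate components and are not shown.

```python
# Minimal sketch of querying a running GROBID service over its REST API to
# convert a PDF into structured TEI/XML. The URL points to an assumed local
# instance; specialised models (dictionaries, medical reports) are not shown.
import requests

GROBID_URL = "http://localhost:8070/api/processFulltextDocument"  # assumed local deployment

def pdf_to_tei(pdf_path: str) -> str:
    with open(pdf_path, "rb") as pdf:
        response = requests.post(GROBID_URL, files={"input": pdf}, timeout=120)
    response.raise_for_status()
    return response.text  # TEI/XML produced by GROBID

# tei_xml = pdf_to_tei("article.pdf")  # placeholder file name
```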
Following up on Mohamed Khemakhem's PhD thesis 44, which demonstrated the possibility of parsing lexical entries from legacy dictionary content in a very fine-grained way, we investigated the applicability of the same architecture to catalogue-like textual objects, as demonstrated in 38.
The COVID-19 pandemic in spring 2020 also provided an opportunity to start a project with the Parisian hospital network (APHP). One of the goals of the project (and the action that ended up being the main focus of our attention) was to see how the GROBID suite could be further extended to parse the variety of medical reports and documents associated with a patient, so that doctors can trace the precise relevant information (anamnesis, symptoms, treatments, etc.). The excellent results obtained so far have led us to build a stable collaboration with the APHP aimed at making the GROBID workflow an essential building block of their document processing chain. Other actions initiated in the spring, such as the automatic detection of the date of first symptoms in emergency service records, or the exploitation of parsing technologies to improve information extraction from medical reports, proved less immediately operational.
Following a direct grant from the Ministry of Higher Education and Research (MESRI, DAHN project) and benefiting from a collaborative framework with the French National Archives (Lectaurep project 62), we have explored how our experience in extracting information from legacy documents could become part of a wider understanding of the components of a generic digitisation workflow for documents initially available as images. In particular, in collaboration with colleagues from the ÉPHÉ (École Pratique des Hautes Études), we contributed to the further development of the eScriptorium platform, which aims to provide an annotation, training and recognition environment for handwritten text recognition tasks. We have made contributions in three complementary directions:
Ongoing contracts:
Active collaborations without a contract:
Ongoing discussions that should/could be formalised in the form of a contract in 2021: