2024 Activity Report: Project-Team ALMAnaCH
RNSR: 201722248N - Research center: Inria Paris Centre
- Team name: Automatic Language Modelling and Analysis & Computational Humanities
- Domain: Perception, Cognition and Interaction
- Theme: Language, Speech and Audio
Keywords
Computer Science and Digital Science
- A3.1.1. Modeling, representation
- A3.1.7. Open data
- A3.1.8. Big data (production, storage, transfer)
- A3.1.11. Structured data
- A3.2.2. Knowledge extraction, cleaning
- A3.2.5. Ontologies
- A3.3.2. Data mining
- A3.3.3. Big data analysis
- A3.4. Machine learning and statistics
- A3.4.1. Supervised learning
- A3.4.2. Unsupervised learning
- A3.4.3. Reinforcement learning
- A3.4.4. Optimization and learning
- A3.4.5. Bayesian methods
- A3.4.6. Neural networks
- A3.4.8. Deep learning
- A3.5. Social networks
- A3.5.2. Recommendation systems
- A5. Interaction, multimedia and robotics
- A5.1. Human-Computer Interaction
- A5.1.1. Engineering of interactive systems
- A5.1.2. Evaluation of interactive systems
- A5.1.7. Multimodal interfaces
- A5.1.8. 3D User Interfaces
- A5.1.9. User and perceptual studies
- A5.6. Virtual reality, augmented reality
- A5.6.1. Virtual reality
- A5.6.3. Avatar simulation and embodiment
- A5.7. Audio modeling and processing
- A5.7.3. Speech
- A5.7.4. Analysis
- A5.7.5. Synthesis
- A5.8. Natural language processing
- A9. Artificial intelligence
- A9.1. Knowledge
- A9.2. Machine learning
- A9.3. Signal analysis
- A9.4. Natural language processing
- A9.7. AI algorithmics
- A9.8. Reasoning
- A9.10. Hybrid approaches for AI
Other Research Topics and Application Domains
- B1. Life sciences
- B1.1. Biology
- B1.2. Neuroscience and cognitive science
- B1.2.2. Cognitive science
- B1.2.3. Computational neurosciences
- B2.2.6. Neurodegenerative diseases
- B2.5. Handicap and personal assistances
- B2.5.2. Cognitive disabilities
- B9.5.1. Computer science
- B9.5.6. Data science
- B9.6. Humanities
- B9.6.1. Psychology
- B9.6.2. Juridical science
- B9.6.5. Sociology
- B9.6.6. Archeology, History
- B9.6.8. Linguistics
- B9.6.9. Political sciences
- B9.6.10. Digital humanities
- B9.7. Knowledge dissemination
- B9.7.1. Open access
- B9.7.2. Open data
- B9.8. Reproducibility
- B9.9. Ethics
- B9.10. Privacy
1 Team members, visitors, external collaborators
Research Scientists
- Benoît Sagot [Team leader, INRIA, Senior Researcher, HDR]
- Rachel Bawden [INRIA, Researcher]
- Justine Cassell [INRIA, Senior Researcher]
- Chloé Clavel [INRIA, Senior Researcher, HDR]
- Thibault Clérice [INRIA, Researcher, from Sep 2024, CDI]
- Thibault Clérice [INRIA, Starting Research Position, until Aug 2024]
- Djamé Seddah [INRIA, Researcher, from Aug 2024, HDR]
- Djamé Seddah [INRIA, Associate Professor on secondment, until Jul 2024]
- Éric Villemonte de La Clergerie [INRIA, Researcher]
Faculty Member
- Nicolas Rollet [IMT, Associate Professor Delegation, from Nov 2024]
Post-Doctoral Fellows
- Aina Garí Soler [INRIA, Post-Doctoral Fellow, from Oct 2024]
- Aina Garí Soler [TELECOM PARIS, Post-Doctoral Fellow, until Sep 2024]
- Emer Gilmartin [INRIA, Post-Doctoral Fellow, until Sep 2024]
PhD Students
- Reem Al Najjar [INRIA, from Nov 2024]
- Wissam Antoun [INRIA]
- Alix Chagué [UNIV MONTREAL, from May 2024]
- Alix Chagué [INRIA, until Apr 2024]
- Pierre Chambon [FACEBOOK, CIFRE, from Feb 2024]
- Lucie Chenain [UNIV PARIS EST]
- Cyril Chhun [INSTITUT INPT, until May 2024]
- Floriane Chiffoleau [INRIA, until Oct 2024]
- Nicolas Dahan [INRIA]
- Rasul Jasir Dent [INRIA]
- Paul-Ambroise Duquenne [FACEBOOK, until May 2024]
- Nathan Godey [INRIA]
- Cecilia Graiff [INRIA, from Dec 2024]
- Yanzhu Guo [INRIA, from Oct 2024]
- Francis Kulumba [MINARM]
- Gabrielle Le Bellier [INRIA, from Nov 2024]
- Simon Meoni [ARKHN]
- Biswesh Mohapatra [INRIA]
- Ha Anh Ngo [INRIA]
- Tú Anh Nguyên [META, CIFRE, until Apr 2024]
- Lydia Nishimwe [INRIA]
- Célia Nouri [INRIA, from Nov 2024]
- Oriane Nédey [INRIA, from Oct 2024]
- Ziqian Peng [CNRS]
- Ziqian Peng [CNRS, from Sep 2024]
- Arij Riabi [INRIA]
- Hugo Scheithauer [INRIA]
- Rian Touchent [INRIA]
- Lorraine Vanel [TELECOM PARIS]
- Yi Yu [INRIA, from Dec 2024]
- Armel Zebaze Dongmo [INRIA]
- You Zuo [qatent (Questel), CIFRE]
Technical Staff
- Hassen Aguili [INRIA, Engineer, from Sep 2024]
- Lauriane Aufrant [Inria, Engineer, until Sep 2024, on secondment from MINARM]
- Sarah Benière [INRIA, Engineer]
- Cindy Evellyn De Araujo Silva [INRIA, Engineer, from Jun 2024]
- Sinem Demirkan [INRIA, Engineer]
- Sophie Etling [INRIA, from Sep 2024]
- Cecilia Graiff [INRIA, Engineer, until Nov 2024]
- Juliette Janes [INRIA, Engineer]
- Hasan Onur Keles [INRIA, Engineer, from Jul 2024]
- Marius Le Chapelier [INRIA, Engineer]
- Menel Mahamdi [INRIA, Engineer, until Apr 2024]
- Malik Marmonier [INRIA, Engineer, from May 2024]
- Virginie Mouilleron [INRIA, Engineer]
- Célia Nouri [ECOLE POLY PALAISEAU, from Apr 2024 until Oct 2024]
- Oriane Nédey [INRIA, Engineer, until Sep 2024]
- José Carlos Rosales Núñez [INRIA, Engineer, until Mar 2024]
- Panagiotis Tsolakis [INRIA, Engineer, from Oct 2024]
- Hao Wang [INRIA, until Jun 2024]
Interns and Apprentices
- Reem Al Najjar [INRIA, Intern, from Mar 2024 until Jul 2024]
- Gabrielle Alimi [INRIA, Intern, from Oct 2024 until Oct 2024]
- Diana Bakina [INRIA, Intern, from Mar 2024 until Aug 2024]
- Barokshana Baskaran [INRIA, Intern, from Oct 2024 until Oct 2024]
- Théo Charlot [INRIA, Intern, from Jun 2024 until Aug 2024]
- Mathilde Deletombe [INRIA, Intern, from May 2024 until Jul 2024]
- Anna Celina Desiderio [INRIA, Intern, from Jun 2024 until Aug 2024]
- Gabrielle Le Bellier [INRIA, Intern, from Apr 2024 until Sep 2024]
- Ilyas Lebleu [INRIA, Intern, from Apr 2024 until Jun 2024]
- Ilyas Lebleu [ENS PARIS, Intern, until Mar 2024]
- Mira Lee [INRIA, Intern, from Feb 2024 until Aug 2024]
- Javier Alejandro Lopetegui Gonzalez [INRIA, Intern, from May 2024 until Aug 2024]
- Irène Metz [INRIA, Intern, from Feb 2024 until Jul 2024]
- Zofia Milczarek [INRIA, Intern, from May 2024 until Aug 2024]
- Qinyue Xu [INRIA, Intern, until Jun 2024]
Administrative Assistant
- Christelle Rosello [INRIA]
Visiting Scientists
- Marco Bronzini [University of Trento, from May 2024 until Jun 2024]
- Marine Carpuat [University of Maryland, from Sep 2024, On sabbatical]
- Ilyas Lebleu [ENS Paris, from Mar 2024 until Apr 2024]
2 Overall objectives
The ALMAnaCH project-team (Automatic Language Modelling and Analysis & Computational Humanities) is a pluridisciplinary team whose research activities are centred around natural language processing (hereafter NLP) and digital humanities (hereafter DH), and also include computational linguistics and digital social sciences. Our expertise lies at the crossroads between computer science, machine learning and deep learning, linguistics, and the humanities. ALMAnaCH is the successor at Inria of the project-team ALPAGE (2007-2016), a joint team between Inria and the linguistics department of Université Paris VII (now part of Université Paris Cité), itself a follow-up, for Inria, of the project-team ATOLL.
The evolution of our research field and the arrival of new permanent members have required us to rethink the way we structure our research programme. While preparing ALMAnaCH's 2024 evaluation report, we defined five research axes that bring together our main scientific goals. All five axes are underpinned by the fundamental challenge posed by language variation, in particular through the data-efficiency, robustness and adaptability of our models, but also through the need to develop resources for non-standard and low-resource language varieties. Although contemporary, standard French and English will continue to play a central role in our research, we will therefore attach particular importance to non-contemporary and non-standard French as well as to low-resource languages, with a specific focus on languages of France other than French.
3 Research program
As a field, NLP dates back to at least the early 1950s and is one of the main sub-fields of Artificial Intelligence. Two decades after the first revolution of the NLP field, when machine learning took over from rule-based approaches in most scenarios, NLP underwent its second revolution in the mid-to-late 2010s, when the rise of continuous representations [144] enabled deep learning techniques to become dominant. This deep learning era has itself undergone several transformations, in particular with the arrival of masked language models (MLMs) such as BERT [131], which could be fine-tuned to specific tasks, resulting in an impressive performance boost compared to previous methods. Later, the arrival of large-scale generative models (or large language models, hereafter LLMs) and, a few years later, the rise of conversational models such as ChatGPT revolutionised perceptions of what a natural interaction with a language agent could be.
For the first time, the general public can access a range of natural language tools (question answering, machine translation (MT), automatic summarisation, information extraction, even text understanding to some extent) with ease through the simplest unified interface possible, language. Although major questions regarding LLMs and conversational models quickly emerged, in particular concerning their performance for languages other than English—especially minority languages and dialects—and about the cultural biases they might convey, the fact remains that these conversational language models excel at many tasks that used to simply be research topics, at least for edited texts in English and other high-resource languages. This evolution has blurred the lines between research and engineering, without providing a solution to the challenges related to language diversity, language variation and low-resource scenarios in NLP research.
Since ALMAnaCH's creation, our research programme in NLP and DH has been underpinned by these important scientific challenges; as highlighted in our 2019 team creation proposal, our goal has always been to model and to cope with both language variation1 and language diversity. Beyond socio-linguistic factors such as age, gender, origin and education level, language variation arises due to multiple factors such as domain and genre (news wires, scientific literature, oral transcripts, etc.), space and time (geographical variation and evolution over time, cultural differences), which often result in low-resource scenarios. 30 years after the advent of data-driven approaches, which typically rely on supervised and semi-supervised methods requiring large amounts of annotated data, we still often rely on annotated and domain-specific training data to reach the best performance. This is especially the case in specialised domains and low-resource scenarios characterised by a high level of variation, despite the increasing performance levels reached by LLMs, whose training is mostly based on non-annotated data.
Low-resource scenarios also occur as a result of language diversity, for instance when working on multilingual tasks such as MT. Cross-lingual transfer techniques can help deal with such scenarios, but they cannot fully solve the issue. Language variation and language diversity are not independent challenges and share many characteristics. Dialectal and diachronic diversity form a continuum with language diversity amongst closely related languages.
With these challenges in mind, ALMAnaCH's central objectives are:
- to develop state-of-the-art NLP techniques, tools and resources to be used by academics and industry;
- to apply our domain of expertise to digital and computational humanities as well as to computational linguistics, both as application domains for NLP and as sources of new NLP-related scientific questions;
- to be dedicated to having an impact on the industry and more generally on society, via collaborations with companies and other institutions (startup creation, industrial contracts, expertise, dissemination, etc.).
3.1 Research strands
Research axis 1: Language modelling
Over the last few years, training a state-of-the-art language model (LM) has gone from a research problem accessible to well-resourced academic labs to a huge engineering effort that only a handful of companies worldwide can carry out. We therefore mostly train LMs in order to investigate scientific questions on how they work and how we can improve LMs (new architectures, new units, new loss functions). Having access to all components of these models (dataset, code, parameters) will make it possible to carry out novel experiments on language models. Together with novel approaches such as the study of the formal capabilities of language models, especially large language models (LLMs),2 and the use of tailored augmented languages, we plan to develop novel model architectures that will be more data- and compute-efficient, and that will have access to more contextual information, provided within the left context (“in-context learning”) and externally using tools and agents (see also axis 5).
Another challenge that we will continue to explore, because of its huge importance from a scientific as well as a societal point of view, is the evaluation of LMs. Despite a number of large-scale initiatives, designing accurate evaluation protocols and datasets for LLMs, avoiding issues such as data contamination, is still an open problem. With respect to our research axis 3, part of this work will be dedicated to the development of alignment datasets3 specifically targeting French and its socio-cultural context. Another important aspect is the evaluation of a model's trustworthiness, in particular taking into account the possibility that it was altered, or its training corpus manipulated, including with malevolent intentions (“weaponisation” of LMs). We will develop methods to ensure that our pre-training datasets have not been contaminated with spam, low-quality and adversarial content, possibly LLM-generated (hence the need for LLM-generated content detection models, a task we are already working on).
Research axis 2: Machine translation
Text transformation tasks have become one of our key research directions, our main focus being MT. We will focus on three main scenarios. The first one is MT for scientific documents, which requires the development of document-level models (e.g. to guarantee document consistency and coherence, including with document-specific elements such as headers, captions and tables). It will also require the development of an interpretable document-level evaluation metric adapted to scientific texts, which will evaluate the quality of term translation, abbreviation handling and co-references. We will also tackle the lack of document-level training data by collecting post-edited versions of machine-translated abstracts, but also by developing data augmentation strategies for longer documents. We plan to integrate our MT models in the HAL publishing interface to encourage the submission of abstracts in other languages.
The second scenario is robust MT, i.e. MT for non-standard text. We will focus on two types of language variation, dialectal variation (translation from or to a dialectal continuum or closely related language varieties) and sociolectal variation as found in user-generated content (UGC). We will develop models that are structurally robust to these variation types (and others), and models that can be controlled, for instance to translate a UGC input into a UGC output with similar characteristics or to translate a text into a specific dialectal variety within a dialectal continuum.
The third scenario we will focus on is low-resource MT. We will adapt models to unseen or less represented languages in our training data by exploiting linguistic analysis and alternative resources, such as lexicons and grammars. Additionally, we will explore the connections to robustness by identifying similarities and analogies between texts from non-standard or rare language varieties and those from well-resourced languages. This approach will facilitate adaptation and extend beyond bilingual lexicon induction to include morphology and grammar.
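The bilingual lexicon induction mentioned above can be illustrated with a toy sketch: pairing words across closely related varieties by surface similarity alone. The word lists below are invented examples, and the similarity measure (difflib's `SequenceMatcher`) is a deliberately simple stand-in for the techniques we actually use.

```python
from difflib import SequenceMatcher

def induce_cognates(src_words, tgt_words, threshold=0.7):
    """Greedy bilingual lexicon induction by surface similarity.

    For each source word, keep the most similar target word if the
    similarity ratio exceeds the threshold.
    """
    pairs = []
    for s in src_words:
        best = max(tgt_words, key=lambda t: SequenceMatcher(None, s, t).ratio())
        if SequenceMatcher(None, s, best).ratio() >= threshold:
            pairs.append((s, best))
    return pairs

# Invented Picard-like / French word lists for illustration.
picard = ["catiau", "maison", "garchon"]
french = ["château", "maison", "garçon", "chien"]

pairs = induce_cognates(picard, french)  # e.g. ('garchon', 'garçon')
```

In practice such surface heuristics only work within a dialectal continuum; as noted above, going beyond lexicon induction to morphology and grammar requires richer linguistic analysis.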
Transversally to these three scenarios, we will explore the linguistics of NLP models to determine how linguistic information can enhance performance. This involves identifying useful contextual examples and understanding how abstract linguistic properties from previous examples can be used to improve predictions on new, complex cases, such as through in-context learning and memory networks. Additionally, we aim to move towards using LLMs to generate educational materials for language learning, focusing on creating similar example sentences for languages with limited resources. Our research will also concentrate on generating high-quality and challenging evaluation data. This will involve developing automated methods to detect such data in existing sources, minimising bias towards certain examples, and improving the detection of context-dependent sentences using more advanced techniques than are currently available.
Research axis 3: Conversational agents
LLMs enable conversational agents to engage users on a range of topics for extended periods. However, human conversation includes unique features often missing in LLMs, such as shorter utterances for listener feedback, complex turn-taking, and multiple concurrent goals such as social bonding and language alignment. We aim to better understand and model human interactions, creating agents that react more naturally and enhance human performance in collaboration and interaction with machines.
Our first research focus is user-oriented, viewed through the prism of the acceptability4 of current dialogue technologies, mostly conversational neural LMs. Users of a conversational model might not share the same cultural background, values and beliefs, whereas conversational models mostly encode a single cultural viewpoint, creating a cultural gap.5 Making conversational agents culturally coherent yet adaptable to a specific user, as well as more reliable, will involve working on model architectures (e.g. disentangling information causing biases, using controllable generation mechanisms, hybridising them with knowledge graphs) and integrating the humanities and social sciences at the core of our research.6 Another challenge related to the acceptability of conversational models concerns human-machine interaction itself. We aim to understand and model the lexico-semantic alignment phenomenon between humans and the way they use “repair” mechanisms in dialogue, and to enhance conversational models with such abilities.
Our second research focus is more task-oriented. We will observe conversation among people in a range of ethologically valid contexts, discover and study the essential characteristics of conversation (e.g. sentence length, complex turn-taking, use of hedges), model those characteristics, and examine whether particular characteristics predict success on particular collaborative tasks (e.g. human peer tutoring), which gives us concrete metrics for the impact of these conversational devices. We will then implement them in embodied conversational agents, and carry out experiments to determine whether the presence of these characteristics improves success rates. More recently, we have begun collecting neurobiological evidence on the nature of conversation between people and the impact of these conversational devices on performance. This is carried out through hyperscanning experiments, in which we use fNIRS to simultaneously scan the brain activity of two participants in a conversation and extract moments of inter-brain synchrony, often thought to be evidence of shared mental models, together with the conversational devices they co-occur with.
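As a rough illustration of the synchrony extraction step, inter-brain synchrony can be approximated by windowed correlation between the two participants' signals. The sketch below uses synthetic traces and plain Pearson correlation, a simplified stand-in for the measures typically applied to fNIRS data; all names and parameters are illustrative.

```python
import numpy as np

def windowed_synchrony(sig_a, sig_b, win=50, step=25):
    """Pearson correlation between two signals over sliding windows.

    Returns one correlation value per window; high values are a (crude)
    proxy for moments of inter-brain synchrony.
    """
    scores = []
    for start in range(0, len(sig_a) - win + 1, step):
        a = sig_a[start:start + win]
        b = sig_b[start:start + win]
        scores.append(float(np.corrcoef(a, b)[0, 1]))
    return scores

# Synthetic traces: a shared oscillation plus independent noise per participant.
rng = np.random.default_rng(0)
t = np.linspace(0, 10, 500)
shared = np.sin(2 * np.pi * 0.5 * t)
participant_a = shared + 0.1 * rng.standard_normal(500)
participant_b = shared + 0.1 * rng.standard_normal(500)

scores = windowed_synchrony(participant_a, participant_b)
```

With a strong shared component, most windows yield high correlations; in real data, the interesting signal is precisely when and with which conversational devices such high-synchrony windows co-occur.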
Research axis 4: Corpus development, computational linguistics and computational humanities
To support the development of NLP models, language documentation and research in digital humanities, we will keep working on the development of large-scale freely available corpora. Although web-crawled data is sometimes necessary, we will shift our primary focus from web-based data (e.g. OSCAR) to higher-quality, legally safer data. Firstly, we will build and distribute a large diachronic corpus of French, relying on existing data (e.g. the HAL and Persée collections of scientific documents) and on data that we will digitise thanks to our improved OCR/HTR expertise (see below), as well as on an improved processing pipeline, especially improved language detection. We will focus on several low-resource languages, in particular languages of France other than French such as Occitan, Alsatian, Picard and French-based Creoles, in the context of the COLaF Inria DEFI. We also aim to support the production of open corpora for old and ancient languages to ensure the transition of these languages from closed repositories (such as Thesaurus Linguae Graecae or Library of Latin Texts A/B) to open repositories. For these two sets of low-resource languages we will work on developing NLP tools and resources including morphosyntactic annotators, LMs and MT models.
In digital humanities, our long-term goal remains the creation of a complete pipeline able to go from layout segmentation and OCR/HTR to publication of the structured digitised corpus in a TEI format, with a focus on historical documents. We aim to improve the state of the art in both layout interpretation and text recognition, based on the eScriptorium platform. For text recognition, we will focus on Latin scripts from the Middle Ages to the contemporary period, but we will also work with other scripts, in particular in the context of the Inria “Action Exploratoire” BackInTime, whose focus is on 17th century encrypted manuscripts using custom symbols.
Finally, we will resume our work on computational linguistics, focusing on computational morphology and modelling of language diachrony (including etymology). These research directions will likely require the development of new language resources (e.g. morphological and etymological lexica) or the improvement of existing ones, in relation with our work on corpus development and OCR (e.g. structured lexical information extraction from OCRised dictionaries).
Research axis 5: NLP for specialised domains
We will keep working on specific challenges related to social media text (user-generated content), medical documents, legal texts, administrative documents, patents, employee surveys,7 as well as more niche yet scientifically interesting or societally impactful domains (e.g. oenology). A recurring research direction in this regard is the adaptation of LMs to specialised domains. Such domains pose various challenges, including domain-specific language variation, in particular terminological and stylistic specificities (e.g. legal style in patents), and access to knowledge bases (e.g. medical ontologies). Continual pre-training proved to be successful with CamemBERT-Bio [159, 71], and we will investigate how to improve the performance of such adaptations, for instance using LLM-generated synthetic data and cross-lingual transfer.
When we work on information extraction and text generation systems, their computational efficiency can be a crucial challenge in certain contexts. An example is the medical domain: the sensitivity, criticality and confidentiality of patient data call for small LMs (SLMs) that can be run in relatively constrained local environments to avoid data leaks. Given the complexity of medical data, one approach is to design augmented models that can interact with knowledge databases, combining external calls with extensions of chain-of-thought (CoT) processes to handle complex knowledge-based reasoning. The design of such augmented SLMs is still an active area of research and should involve transfer from LLMs (via distillation, teacher-student methods, etc.), adaptation to domain-specific knowledge bases and instruction tuning via reinforcement learning.
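As a toy illustration of such an augmented SLM workflow, the sketch below interleaves mock external calls to a knowledge base with the assembly of a grounded context that a model would condition on. The knowledge base, term list and helper names are all hypothetical.

```python
def lookup(kb, term):
    """Mock external call to a domain knowledge base (e.g. a medical ontology)."""
    return kb.get(term.lower(), "unknown term")

def grounded_context(question, terms, kb):
    """Resolve each domain term via an external call, then assemble the
    grounded context that a small LM would condition on, alongside the
    question, to produce its answer."""
    facts = [f"{t}: {lookup(kb, t)}" for t in terms]
    return f"question: {question} | facts: " + "; ".join(facts)

# Hypothetical ontology fragment and query.
kb = {"hba1c": "glycated haemoglobin, a marker of long-term glycaemia"}
context = grounded_context("What does an elevated HbA1c indicate?", ["HbA1c"], kb)
```

In a full system, the generation step would itself decide when to issue such calls (the CoT extension mentioned above); here the call schedule is fixed for clarity.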
4 Application domains
4.1 Application domains for ALMAnaCH
ALMAnaCH's research areas cover Natural Language Processing (nowadays identified as a sub-domain of Artificial Intelligence), Digital Humanities and Computational Social Sciences. Application domains are therefore numerous, as witnessed by ALMAnaCH's multiple academic and industrial collaborations, for which see the relevant sections. Examples of application domains for NLP include:
- Information extraction, information retrieval, text mining (e.g. opinion surveys)
- Language modelling
- Text generation, text simplification, automatic summarisation
- Spelling correction (writing aid, post-OCR, normalisation of noisy/non-canonical texts)
- Machine translation
- Chatbots, conversational agents, question answering systems
- Medical applications (analysis of medical documents, early diagnosis, language-based medical monitoring, etc.)
- Applications in the legal domain
- Applications in linguistics (modelling languages and their evolution, sociolinguistic studies, etc.)
- Digital humanities (exploitation of text documents, for instance in historical research)
- Computational social sciences (e.g. computational analysis of political discourse, social media content analysis, radicalisation detection, etc.)
5 Social and environmental responsibility
5.1 Footprint of research activities
Table 1: Power consumption and CO2 emissions of the team's 2024 experiments on the Jean Zay and Adastra supercomputers.
Project ID | GPU type | Node power draw (W) | GPUs per node | GPU hours | Real hours | Power consumption (kWh) | CO2 emissions (kg) |
AD011013900R1 | V100 | 1520 | 4 | 23097 | 5774 | 10531 | 284.3 |
AD011013900R1 | A100 | 3700 | 8 | 8791 | 1098 | 4875 | 131.6 |
GC011015610 | H100 | 2800 | 4 | 22316 | 5579 | 18745 | 506.1 |
AD011015491 | AMD MI250x | 2500 | 4 | 133 | 33 | 99 | 2.6 |
AD011015491 | AMD MI250x | 2500 | 4 | 19143 | 4785 | 14355 | 387.5 |
AD011012254R4 | A100 | 3700 | 8 | 13 | 1 | 4 | 0.1 |
AD011012254R4 | H100 | 2800 | 4 | 4471 | 1117 | 3753 | 101.3 |
AD011012254R3 | A100 | 3700 | 8 | 10888 | 1361 | 6042 | 163.1 |
AD011012254R3 | V100 | 1520 | 4 | 14362 | 3590 | 6548 | 176.7 |
AD011014393R1 | A100 | 3700 | 8 | 5852 | 731 | 3245 | 87.6 |
AD011013674R2 | V100 | 1520 | 4 | 13731 | 3432 | 6259 | 168.9 |
AD011013674R2 | H100 | 2800 | 4 | 8189 | 2047 | 6877 | 185.6 |
AD011013674R2 | A100 | 3700 | 8 | 10257 | 1282 | 5692 | 153.6 |
AD011014232R1 | V100 | 1520 | 4 | 47188 | 11797 | 21517 | 580.9 |
AD011014232R1 | A100 | 3700 | 8 | 21655 | 2706 | 12014 | 324.3 |
AD011014232R1 | H100 | 2800 | 4 | 16497 | 4124 | 13856 | 374.1 |
AD011013908R2 | V100 | 1520 | 4 | 2767 | 691 | 1260 | 34 |
AD011013908R2 | H100 | 2800 | 4 | 5208 | 1302 | 4374 | 118 |
Total | V100 | 1520 | 4 | 101145 | 25286 | 46122 | 1245.3 |
Total | AMD MI250x | 2500 | 4 | 19276 | 4819 | 14457 | 390.3 |
Total | A100 | 3700 | 8 | 57456 | 7182 | 31888 | 861.0 |
Total | H100 | 2800 | 4 | 56681 | 14170 | 47612 | 1285.5 |
Total | | | | | | 140079 | 3782.1 |
Given recent interest in the energy consumption and carbon emissions of machine learning models, and specifically of language models [155, 126], we have decided to report the power consumption and carbon footprint of all our experiments conducted on the Jean Zay8 and Adastra9 supercomputers in 2024. For this report, we follow the approach of [156]. While the ALMAnaCH team uses other computing clusters and infrastructures such as CLEPS10 and NEF,11 these infrastructures are not optimised for large multi-node jobs; we therefore only consider the power consumption and CO2 emissions of our experiments on Jean Zay and Adastra.
Node infrastructure:
We have access to three types of GPU node on Jean Zay:12
- Nodes comprising 4 Nvidia Tesla V100 SXM2 32GB GPUs, 192GB of RAM, and two Intel Cascade Lake 6248 processors. One Nvidia Tesla V100 card is rated at around 300W,13 while the Intel Cascade Lake processor is rated at 150W.14 For the DRAM, we can use the work of [129] to estimate the total power draw of 192GB of RAM at approximately 20W. The total power draw of one Jean Zay node at peak use therefore adds up to around 1520W.
- Nodes comprising 8 Nvidia A100 SXM4 80GB GPUs, 512GB of RAM, and two AMD Milan EPYC 7543 processors. One Nvidia A100 card is rated at around 400W,15 while the AMD Milan processor is rated at 225W.16 Following [129], we estimate the total power draw of 512GB of RAM at approximately 50W. The total power draw of one A100 node at peak use therefore adds up to around 3700W.
- Nodes comprising 4 Nvidia H100 SXM5 80GB GPUs, 512GB of RAM and two Intel Xeon Platinum 8468 processors. One Nvidia H100 card is rated at around 700W,17 while the Intel Xeon Platinum 8468 processor is rated at 300W.18 The total power draw of one H100 node at peak use therefore adds up to around 3450W.
We have recently gained access to the Adastra supercomputer at CINES,19 which has nodes comprising four AMD MI250X GPUs and one AMD EPYC Trento processor. Adastra uses the same hardware as the LUMI cluster,20 for which the total power draw of one node at peak use has been reported as approximately 2500W.21
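The per-node power draws given above are simple sums of the component ratings; the following Python sketch reproduces that arithmetic (the DRAM figures are the estimates discussed above).

```python
def node_power(n_gpus, gpu_w, n_cpus, cpu_w, ram_w):
    """Peak power draw of one node: GPUs + CPUs + estimated DRAM (all in W)."""
    return n_gpus * gpu_w + n_cpus * cpu_w + ram_w

v100_node = node_power(4, 300, 2, 150, 20)  # Jean Zay V100 node: 1520 W
a100_node = node_power(8, 400, 2, 225, 50)  # Jean Zay A100 node: 3700 W
h100_node = node_power(4, 700, 2, 300, 50)  # Jean Zay H100 node: 3450 W
```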
With this information, we use the formula proposed by [156] and compute the total energy required for each setting:

E = (PUE × P × t) / 1000

where E is the energy consumption in kWh, PUE is the power usage effectiveness of the computing facility (1.2 for the facilities considered here), P is the peak power draw of the nodes used (in W) and t is the real compute time (in hours).
We can further estimate the CO2 emissions by multiplying the energy consumption by the carbon intensity of the French electricity grid, which we take to be 27 gCO2eq/kWh.
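As a worked example, assuming a PUE of 1.2 and a French-grid carbon intensity of 27 gCO2eq/kWh (values consistent with the figures reported in Table 1), the first row of the table (V100, 1520 W, 5774 real hours) can be recomputed as follows.

```python
PUE = 1.2               # assumed power usage effectiveness of the facility
CARBON_G_PER_KWH = 27   # assumed carbon intensity of the French grid

def energy_kwh(node_power_w, real_hours, pue=PUE):
    """Facility-level energy consumption of one job, in kWh."""
    return pue * node_power_w * real_hours / 1000

def co2_kg(kwh, intensity=CARBON_G_PER_KWH):
    """CO2 emissions in kg for a given energy consumption."""
    return kwh * intensity / 1000

kwh = energy_kwh(1520, 5774)  # ~10531 kWh (first row of Table 1)
kg = co2_kg(kwh)              # ~284 kg of CO2
```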
All emissions are also reported in Table 1. The total emission estimate for the team is 3,782.1 kg of CO2.
6 Highlights of the year
Note : Readers are advised that the Institute does not endorse the text in the “Highlights of the year” section, which is the sole responsibility of the team leader.
At the end of 2024, Inria's top management enacted a new Contract of Objectives, Means and Performance (COMP), which defines Inria's objectives for the period 2024–2028. It presages major changes for Inria, regarding both its missions and the way it operates. These changes, whose precise nature and impact on the staff are unclear, should become effective as soon as 2025 but have not been the subject of any consultation. The collaboration of Inria's staff is necessary to turn this disruption into a successful change. Yet staff opposition to these policies, which has been expressed in several votes and petitions, has been largely ignored by Inria's top management, sometimes even laughed at, including in front of Inria's “Conseil d'Administration” (Executive Board).
The multiplication of new missions and priorities, particularly those related to the “program agency” or oriented towards defence applications, pushes the research carried out at Inria into the background. The constraints induced by this COMP will restrict the independence of scientists and teams, as well as their freedom to select research topics and collaborators.
Here are some of our concerns with the COMP:
- Restriction of international and industrial collaborations to partners chosen by the institute's management, with no clear indication of how such partners are chosen and how this positively impacts the quality of the research carried out at Inria.
- Individual financial incentives for researchers involved in strategic partnerships, whose topics are steered by the program agency, in contradiction with the academic freedom Inria researchers are entitled to.
- Placement of Inria in a “zone à régime restrictif” (ZRR, a restricted-access zone), disrupting everyday operations, recruitment processes and overall attractiveness.
- Priority given to “dual” research with both military and civilian applications, materialised by tighter links with the Ministry of Defence.
6.1 Awards
- The ILLC best student paper award at the LREC-COLING 2024 conference was awarded to Niyati Bafna (now a PhD student at Johns Hopkins University, formerly a research engineer in the team), Cristina España-Bonet (DFKI), Josef van Genabith (DFKI), Benoît Sagot and Rachel Bawden for their paper entitled “When Your Cousin Has the Right Connections: Unsupervised Bilingual Lexicon Induction for Related Data-Imbalanced Languages”.
- Thibault Clérice won the Open Science PhD Award for his PhD Thesis on the detection of isotopies in Latin texts and the production of open corpora and open tools.
- As part of the 2024 annual awards from Historia magazine, the Back In Time project, led by Cécile Perrot (CARAMBA, Inria Nancy) and Thibault Clérice, received the Innovation Award and the Jury's Prize.
- Benoît Sagot was included by Le Point magazine in their 2024 list of the top inventors in artificial intelligence.
- Benoît Sagot was included in the MAD50 2024 list of emerging leaders in AI by an independent committee comprising members from Maddyness and Banque Richelieu, as well as representatives from Raise, France Fintech, and France Digitale. The list includes "emerging, influential, and promising" individuals who were proposed, evaluated, and ultimately chosen as the rising stars of artificial intelligence.
7 New software, platforms, open data
7.1 New software
7.1.1 HTR-United
-
Keywords:
HTR, OCR
-
Functional Description:
HTR-United is a GitHub organization without any other form of legal personality. It aims at gathering HTR/OCR transcriptions of all periods and styles of writing, mostly but not exclusively in French. It was born from the need for projects to possess potential ground truth to rapidly train models on smaller corpora.
Datasets shared or referenced via HTR-United must, at a minimum, comprise: (i) a set of ALTO XML and/or PAGE XML files containing either segmentation information alone, or the segmentation together with the corresponding transcription; (ii) a set of corresponding images, shared either as a simple permalink to resources hosted elsewhere or as the contact information necessary to request access to them (it must be possible to recompose the link between the XML files and the images without any intermediary process); (iii) documentation of the practices followed for the segmentation and the transcription. In the case of a GitHub repository, this documentation must be summarised in the README.
A corpus can be subdivided into smaller sets if necessary.
-
Release Contributions:
First version.
- URL:
-
Contact:
Alix Chague
7.1.2 VGAMT
-
Name:
Visually Guided and Adapted Machine Translation system
-
Keyword:
Machine translation
-
Functional Description:
A machine translation model that lets the user add an image as input, in addition to the sentence to be translated.
- Publication:
-
Contact:
Matthieu Futeral-Peter
7.1.3 CamemBERTa
-
Name:
a DeBERTa v3-based French language model
-
Keywords:
Language model, French
-
Functional Description:
CamemBERTa was initially evaluated on a set of downstream tasks for French: part-of-speech (POS) tagging, dependency parsing, named entity recognition (NER) and the FLUE benchmark (French Language Understanding Evaluation), including natural language inference (NLI). It improves the state of the art for most tasks compared to previous monolingual and multilingual approaches, which again confirms the effectiveness of large pretrained language models for French. CamemBERTa is particularly effective for high-level tasks.
- URL:
-
Contact:
Wissam Antoun
7.1.4 CamemBERT-bio
-
Keywords:
Language model, Deep learning, NLP, Transformer
-
Functional Description:
CamemBERT-bio is a state-of-the-art French biomedical language model built using continual pretraining from camembert-base. It was trained on a French public biomedical corpus of 413M words containing scientific documents, drug leaflets and clinical cases extracted from theses and articles. It shows significant improvement on multiple biomedical named entity recognition tasks compared to camembert-base.
- URL:
-
Contact:
Rian Touchent
7.1.5 RoCS-MT
-
Name:
Robust Challenge Set for Machine Translation
-
Keywords:
Machine translation, NLP, Evaluation, Robustness, User-generated content
-
Functional Description:
RoCS-MT, a Robust Challenge Set for Machine Translation (MT), is designed to test MT systems’ ability to translate user-generated content (UGC) that displays non-standard characteristics, such as spelling errors, devowelling, acronymisation, etc. RoCS-MT is composed of English comments from Reddit, selected for their non-standard nature, which have been manually normalised and professionally translated into five languages: French, German, Czech, Ukrainian and Russian. The challenge set was included as a test suite at the WMT 2023 conference. This repository therefore also includes automatic translations from the submissions to the general MT task.
- URL:
- Publication:
-
Contact:
Rachel Bawden
7.1.6 3MT_French Dataset
-
Name:
3 Minutes Thesis Corpus
-
Keywords:
Multimodal Corpus, Video annotation
-
Functional Description:
This new resource will be useful to computer science and social science researchers working on public speaking assessment and training. It will help refine the analysis of speaking from a fresh perspective based on social-cognitive theories rarely studied in this context, such as first impressions and theories of primacy and recency.
- URL:
- Publication:
-
Contact:
Chloe Clavel
7.1.7 CATMuS Medieval (Model)
-
Name:
Consistent Approach to Transcribing ManuScripts - Medieval model
-
Keyword:
Handwritten Text Recognition
-
Functional Description:
CATMuS (Consistent Approach to Transcribing ManuScripts) Medieval is a model for automatically transcribing medieval manuscripts written in Latin scripts, in particular in Old and Middle French, Latin, Spanish (and other languages of Spain), and Italian. The model was trained on the largest and most diverse dataset known for medieval manuscripts in Latin scripts, with more than 110,000 lines of training data.
-
Contact:
Thibault Clerice
-
Partners:
University of Toronto, Ecole nationale des chartes, CIHAM UMR 5648, VeDPH - Ca' Foscari, Université de Genève, ENS Lyon
7.1.8 HTRomance
-
Keyword:
Handwritten Text Recognition
-
Functional Description:
The ground truth produced as part of the HTRomance project aims to provide diverse data, from the 12th century to the 19th century, for training handwritten text recognition models. It covers the following languages: Latin, various states of French, Spanish, Occitan and Italian.
- URL:
-
Contact:
Thibault Clerice
-
Partners:
VeDPH - Ca' Foscari, Ecole nationale des chartes, CIHAM UMR 5648, ENS Lyon
7.1.9 OSCAR
-
Name:
Open Super-large Crawled ALMAnaCH coRpus
-
Keywords:
Raw corpus, Multilingual corpus
-
Functional Description:
OSCAR is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
OSCAR is currently shuffled at line level and no metadata is provided. Thus it is mainly intended to be used in the training of unsupervised language models for natural language processing.
Data is distributed by language in both original and deduplicated form. There are currently 166 different languages available.
-
Release Contributions:
Version 21.09 was generated using Ungoliant v1, a new-generation tool that is faster and better documented/tested than goclassy, the tool used for OSCAR 1.0 (aka OSCAR 2019). As per OSCAR Schema v1.1, each document/record now has associated metadata. New languages with respect to the 2019 version: Manx, Rusyn, Scots and West Flemish; their size and quality have yet to be assessed. Removed languages with respect to the 2019 version: Central Bikol and Cantonese. The Cantonese corpus was of very low quality, and the Central Bikol corpus is still available in OSCAR 2019.
- URL:
- Publications:
-
Contact:
Pedro Ortiz Suarez
-
Participants:
Pedro Ortiz Suarez, Benoit Sagot, Julien Abadji
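The goclassy/Ungoliant approach behind OSCAR (line-level language classification of Common Crawl, followed by per-language filtering and deduplication) can be sketched as follows. The keyword-based classifier below is a stand-in for the fastText language identifier used in practice, and all names are illustrative:

```python
from collections import defaultdict

def classify_language(line):
    # Stand-in for a fastText-style language identifier: returns a
    # (language, confidence) pair based on trivial keyword cues.
    french_cues = {"le", "la", "est", "chat"}
    words = set(line.lower().split())
    if words & french_cues:
        return "fr", 0.9
    return "en", 0.9

def build_corpus(lines, min_confidence=0.8):
    """Group lines by predicted language, keeping only confident
    predictions, and also produce a line-deduplicated variant, mirroring
    the 'original' and 'deduplicated' distributions of OSCAR."""
    by_lang = defaultdict(list)
    for line in lines:
        lang, conf = classify_language(line)
        if conf >= min_confidence:
            by_lang[lang].append(line)
    original = dict(by_lang)
    deduplicated = {lang: list(dict.fromkeys(ls)) for lang, ls in by_lang.items()}
    return original, deduplicated

original, deduplicated = build_corpus([
    "the cat sat on the mat",
    "le chat est noir",
    "le chat est noir",
])
```

The real pipeline additionally applies quality filters and, since OSCAR Schema v1.1, attaches per-record metadata.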
7.1.10 Expresso
-
Name:
Expresso - A Benchmark and Analysis of Discrete Expressive Speech Resynthesis
-
Functional Description:
Expresso is a high-quality (48 kHz) expressive speech corpus that includes both expressively read speech (8 styles, in mono wav format) and improvised dialogues (26 styles, in stereo wav format). The dataset includes 4 speakers (2 male, 2 female) and totals 40 hours (11h of read speech, 30h of improvisation). Transcriptions of the read speech are also provided. The Expresso benchmark task is to resynthesise the input audio using a low-bitrate discrete code obtained without supervision from text.
-
Contact:
Tu Nguyen
7.1.11 SONAR
-
Keywords:
Sentence embeddings, Natural language processing, Multimodality, Speech, Text, Machine translation, Zero-shot
-
Functional Description:
SONAR (Sentence-level multimOdal and laNguage-Agnostic Representations) is a multilingual and multimodal fixed-size sentence embedding space, with a full suite of speech and text encoders and decoders. It substantially outperforms existing sentence embeddings such as LASER3 and LaBSE on the xsim and xsim++ multilingual similarity search tasks. Speech segments can be embedded in the same SONAR embedding space using language-specific speech encoders trained in a teacher-student setting on speech transcription data. We also provide a single text decoder, which allows us to perform text-to-text and speech-to-text machine translation, including for zero-shot language and modality combinations.
- URL:
- Publication:
-
Contact:
Paul-Ambroise Duquenne
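The xsim-style multilingual similarity search that SONAR is evaluated on can be illustrated with a short sketch; the toy vectors below merely stand in for real SONAR encoder outputs:

```python
import numpy as np

def xsim_error_rate(src_emb, tgt_emb):
    """xsim-style evaluation sketch: for each source sentence embedding,
    retrieve the nearest target embedding by cosine similarity; the error
    rate is the fraction of sentences whose nearest neighbour is not the
    aligned translation (same index)."""
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    nearest = (src @ tgt.T).argmax(axis=1)
    return float(np.mean(nearest != np.arange(len(src_emb))))

# Toy aligned English/French embeddings (illustrative only):
en = np.array([[1.0, 0.1, 0.0], [0.0, 1.0, 0.1], [0.1, 0.0, 1.0]])
fr = np.array([[0.9, 0.2, 0.0], [0.1, 1.0, 0.0], [0.0, 0.1, 0.9]])
error = xsim_error_rate(en, fr)  # 0.0: every sentence retrieves its translation
```

A fixed-size, language-agnostic space is what makes this retrieval possible across languages and modalities.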
7.1.12 eScriptorium Documentation
-
Name:
Open documentation for eScriptorium
-
Functional Description:
Collaborative and open documentation written using GitHub features and deployed with Read the Docs. It offers an illustrated description of all the features of the eScriptorium application, which otherwise has no complete documentation, as well as tutorials.
- URL:
-
Contact:
Alix Chague
7.1.13 Narabizi Treebank
-
Name:
A multi-layered treebank for the Arabic dialect spoken in North Africa and written in Latin Script
-
Keywords:
Treebank, Named entities, Machine translation, Evaluation, Low resources languages
-
Functional Description:
We introduce the first treebank for a romanized user-generated content variety of Algerian, a North-African Arabic dialect known for its frequent usage of code-switching. Made of 1,300 sentences (after deduplication), fully annotated in morpho-syntax and Universal Dependency syntax, with full translation at both the word and the sentence levels, this treebank is made freely available. It is supplemented with 50k unlabeled sentences collected from Common Crawl and web-crawled data using intensive data-mining techniques. Preliminary experiments demonstrate its usefulness for POS tagging and dependency parsing. We believe that what we present here is useful beyond the low-resource language community. This is the first time that enough unlabeled and annotated data is provided for an emerging user-generated content dialectal language with rich morphology and code-switching, making it a challenging test-bed for the most recent NLP approaches. The current version of the dataset adds two novel annotation layers (named entity recognition and offensive language detection) and a re-annotation of the tokenization, morpho-syntactic and syntactic layers that ensures annotation consistency.
- URL:
-
Contact:
Djame Seddah
7.1.14 CATMuS Medieval (Dataset)
-
Name:
Consistent Approaches to Transcribing ManuScripts - Medieval Dataset
-
Keywords:
Handwritten Text Recognition, HTR, OCR
-
Functional Description:
Developed through collaboration among various institutions and projects, CATMuS provides an inter-compatible handwritten text recognition dataset spanning more than 240 manuscripts and incunabula in 10 different languages, comprising over 170,000 lines of text and 5 million characters, from the 8th century to the 16th.
-
Release Contributions:
- 40 new manuscripts
- Publication:
-
Contact:
Thibault Clerice
-
Partners:
CIHAM UMR 5648, Ecole nationale des chartes, University of Toronto, Antwerp University
7.1.15 Counter dataset
-
Name:
Counter Radicalization Dataset
-
Keywords:
NLP, Multilingual corpus, Data protection, Pseudonymization, Online radicalization
-
Functional Description:
This dataset includes multilingual content from forums, Telegram, social media, and other sources in English, French, and Arabic. It covers various radical ideologies and is pseudonymized to protect privacy while maintaining data utility. It includes annotations for call to action, radicalization level, and named entity recognition.
- URL:
-
Contact:
Djame Seddah
7.1.16 CUBANSPVARIETY
-
Keywords:
Variety identification, NLP
-
Functional Description:
A Cuban Spanish variety identification dataset consisting of 1,762 tweets manually annotated by three native speakers, with labels assigned based on annotator agreement, covering Cuban and non-Cuban varieties as well as common examples.
-
Contact:
Djame Seddah
7.1.17 CamemBERTav2
-
Keywords:
Language model, French
-
Functional Description:
CamemBERTav2 is the second version of the CamemBERTa model, which is based on the DebertaV2 architecture with the Replaced Token Detection (RTD) objective.
The update includes: 1) a much larger pretraining dataset: 275B unique tokens (previously 2B); 2) a newly built WordPiece tokeniser with 32,768 tokens, the addition of the newline and tab characters, support for emojis, and better handling of numbers (numbers are split into two-digit tokens); 3) an extended context window of 1024 tokens.
-
Contact:
Benoit Sagot
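The number-handling change mentioned in the CamemBERTav2 entry above (splitting digit runs into two-digit tokens) can be illustrated with a simple pre-tokenisation sketch. This is only an illustration of the idea, not the model's actual tokeniser code:

```python
import re

def split_numbers(text):
    """Break every run of digits into chunks of at most two digits,
    separated by spaces, before tokenisation (illustrative sketch)."""
    def chunk(match):
        digits = match.group(0)
        # split from the left into two-digit chunks
        return " ".join(digits[i:i + 2] for i in range(0, len(digits), 2))
    return re.sub(r"\d+", chunk, text)

split_numbers("né en 1997")  # → "né en 19 97"
```

Keeping numeric tokens short and regular in this way avoids a long tail of rare full-number tokens in the vocabulary.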
7.1.18 CamemBERTv2
-
Keywords:
Language model, French
-
Functional Description:
CamemBERTv2 is a French language model pretrained on a large, up-to-date corpus of 275B tokens of French text. The model, still based on the BERT architecture, demonstrates improved performance over the previous version across a variety of NLP tasks.
- URL:
-
Contact:
Benoit Sagot
7.1.19 HaSCoSVa
-
Name:
Hate Speech Corpus for Spanish
-
Keywords:
Hate Speech Detection, NLP
-
Functional Description:
A new corpus of tweets related to hate speech towards immigrants written in Spanish. This corpus contains information regarding the language variant. The dataset is subdivided into two subsets according to the language variant: (1) Latin American and (2) European.
- URL:
-
Contact:
Djame Seddah
7.1.20 LADaS
-
Name:
Layout Analysis Dataset with Segmonto
-
Keyword:
Layout Analysis
-
Functional Description:
LADaS is a dataset for training layout analysis on documents from the 16th to the 21st century. It helps reproduce documents in XML TEI.
- Publication:
-
Contact:
Thibault Clerice
7.1.21 CamemBERT-bio-gliner
-
Keywords:
Language model, Deep learning, NLP, Transformer
-
Functional Description:
CamemBERT-bio-gliner is a Named Entity Recognition (NER) model capable of identifying any French biomedical entity type using a BERT-like encoder. It provides a practical alternative to traditional NER models, which are limited to predefined entities, and to Large Language Models (LLMs), which, despite their flexibility, are too costly and large for resource-constrained scenarios. CamemBERT-bio is used as the backbone.
-
Contact:
Rian Touchent
7.1.22 CoMMuTE
-
Name:
Contrastive multilingual and multimodal translation evaluation
-
Keywords:
Machine translation, Evaluation, Image analysis
-
Functional Description:
CoMMuTE is a contrastive evaluation dataset designed to assess the ability of multimodal machine translation models to exploit images in order to disambiguate the sentence to be translated. In other words, given a sentence containing a word that can be translated in several ways, the additional image determines the meaning of the word to be translated. The model must then take the image into account to propose a correct translation. CoMMuTE is available from English into French, German and Czech.
- URL:
-
Contact:
Matthieu Futeral-Peter
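The contrastive protocol behind CoMMuTE can be sketched as follows: a model is correct on an example when it scores the right translation above the contrastive one given the image. The scoring function here is a stub standing in for a real multimodal MT model:

```python
def contrastive_accuracy(examples, score):
    """Fraction of examples for which the model scores the correct
    translation higher than the contrastive one, given the image."""
    correct = sum(
        score(src, image, good) > score(src, image, bad)
        for src, image, good, bad in examples
    )
    return correct / len(examples)

def toy_score(src, image, hypothesis):
    # Stub "model": favours hypotheses containing the image label.
    # A real system would use e.g. a length-normalised log-probability.
    return 1.0 if image in hypothesis else 0.0

examples = [
    ("He sat by the bank.", "rivière",
     "Il s'est assis près de la rivière.",   # correct given a river image
     "Il s'est assis près de la banque."),   # wrong sense of "bank"
]
accuracy = contrastive_accuracy(examples, toy_score)  # 1.0
```

A text-only model that ignores the image cannot systematically beat chance on such ambiguous pairs, which is what makes the evaluation contrastive.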
7.1.23 mOSCAR
-
Keywords:
Raw corpus, Multilingual corpus, Multimodal Corpus, Multimodality, Text-image processing, Web crawling
-
Functional Description:
mOSCAR is the first large-scale multilingual and multimodal web-crawled document corpus. It covers 163 languages, 315M documents, 214B tokens and 1.2B images. We carefully conducted a set of filtering and evaluation steps to make sure mOSCAR is sufficiently safe, diverse and of good quality.
- URL:
- Publication:
-
Contact:
Matthieu Futeral-Peter
-
Participants:
Matthieu Futeral-Peter, Benoit Sagot, Rachel Bawden, Julien Abadji, Armel Zebaze Dongmo, Cordelia Schmid, Remi Lacroix
7.2 Open data
ALMANaCH's general policy is to release all data and software under open-source licences when possible. This currently includes the following list. Please see the 'New Software' sections of this report and previous activity reports or the BIL for more detailed information.
- FRMG (LGPL-3.0): A large-coverage meta-grammar for French
- SYNTAX (CeCILL-C): Lexical and syntactic parser generator
- vera: Automatic analysis of answers to open-ended questions in employee surveys
- SxPipe (CeCILL-C): Shallow language pipeline
- DyALog (GPL-3.0): Environment for building tabular parsers and programs
- Mgwiki: Linguistic Wiki for FRMG
- Alexina: Morphological (and sometimes syntactic) lexicons (including the Lefff)
- Sequoia corpus (LGPL-LR): French corpus with surface and deep syntactic annotations
- dyalog-sr (GPL-3.0): Transition-based parser built on top of DyALog
- FQB (CC-BY-NC-SA): Multi-layered treebank made of questions for French
- FSMB (CC-BY-NC-SA 4.0): French social media bank
- GROBID-Dictionaries: GROBID module for structuring digitised lexical resources and entry-based documents
- entity-fishing: Entity recognition and disambiguation
- MElt (CeCILL-C): Statistical part-of-speech tagger
- WOLF (CeCILL-C): Free Wordnet for French
- GROBID (Apache-2.0): Library for extracting, parsing and re-structuring raw documents
- VerDI project release: Omission detection tool for journalistic content.
- OSCAR (CC-BY): Huge multilingual web-based corpus
- EtymDB (CC BY-SA 4.0): Etymological database extracted from wiktionary
- goclassy (Apache-2.0): Asynchronous concurrent pipeline for classifying Common Crawl
- CamemBERT (MIT): Neural BERT-like language model for French
- DiaBLa (CC BY-SA 4.0): Parallel dataset of English-French bilingual dialogues
- UDLexicons: Multilingual collection of morphological lexicons
- ELMoLex: Neural parsing system developed for ALMAnaCH's submission to the CoNLL-18 multilingual parsing shared task
- FrELMo: ELMo language model for French
- MRELMo: ELMo language models for 5 mid-resource languages (Bulgarian, Catalan, Danish, Finnish, Indonesian)
- DiscEvalMT (CC-BY-SA-4.0): Contrastive test sets for the evaluation of discourse phenomena in English-to-French machine translation
- ACCESS (CC-BY-NC): Controllable Text Simplification Model
- ASSET (CC-BY-NC): Text Simplification Evaluation Dataset
- tseval (CC-BY-NC): Text Simplification Evaluation Library
- EASSE (GPL-3.0): Text Simplification Evaluation Library
- PAGnol (MIT): Neural GPT-based language model for French
- PFSMB (CC-BY-NC-SA-4.0): FR-EN parallel corpus of noisy user-generated content
- KaMI-Lib (MIT): KaMI-lib is an HTR and OCR engine agnostic Python package for evaluating transcription models
- Ungoliant (Apache-2.0): High-performance pipeline that provides tools to build corpus generation pipelines from CommonCrawl.
- HTR-United (CC-BY): HTR-United is an open Github ecosystem designed to share training data for HTR and OCR tasks
- PMUMT (CC-BY-NC-SA): FR-EN Annotated parallel corpus of noisy user-generated content
- WikiCremma (CC-BY): Dataset for HTR training on Contemporary French
- grobid-medical-report: GROBID module for extracting and restructuring medical reports from PDF documents into encoded XML/TEI documents
- DESIR-CodeSprint-TrackA-TextMining: A tool for extracting scholarly documents and visualizing the results on PDF files using GROBID.
- ModFr-norm (CC-BY-SA-4.0): Normalisation of Modern (17th c.) French
- nerdKid: NerdKid is a tool for grouping Wikidata entities into 27 classes (e.g., ANIMAL, LOCATION, MEDIA, PERSON).
- FreEM-corpora: Corpora and NLP tools for Early Modern French (16th-18th c.)
- CCASS-sim: Similarity detection tool for legal texts from the Cour de Cassation
- D'AlemBERT (Apache 2): Neural BERT-like language model for Early Modern French
- D'AlemBERT NER (Apache 2): NER model for Early Modern French
- D'AlemBERT POS (Apache 2): POS tagger for Early Modern French
- CamemBERTa (MIT): A DeBERTa v3-based French language model
- CamemBERT-bio (MIT): Neural BERT-like language model for the French biomedical domain
- MANTa-LM (MIT): A differentiable tokenizer trained end-to-end with the language model.
- VGAMT (Apache-2.0): A multimodal machine translation model
- eScriptorium Documentation (CC-BY): Open and collaborative documentation for eScriptorium
- CoMMuTE (CC-BY-SA-4.0): A contrastive evaluation dataset for multimodal (text-image) machine translation.
- RoCS-MT (CC-BY-NC): Robust Challenge Set for Machine Translation
- feats2notes (LGPL): Generation of notes from structured data
- CharacterBERT-UGC (CC-BY-SA): A CharacterBERT language model for North-African Arabizi and French user-generated content
- 3MT French Dataset: 3 Minutes Thesis Corpus
- HTRomance: Ground-truth for training HTR models
- SONAR: SONAR (Sentence-level multimOdal and laNguage-Agnostic Representations) is a multilingual and multimodal fixed-size sentence embedding space, with a full suite of speech and text encoders and decoders
- SpeechMatrix (CC-BY-SA): Speech parallel corpus mined from VoxPopuli
- T-modules: Approach to cross-modal transfer between speech and text for translation tasks
- Expresso (CC-BY-SA): A Benchmark and Analysis of Discrete Expressive Speech Resynthesis
- Narabizi Treebank (CC-BY-SA): A multi-layered treebank for the Arabic dialect spoken in North Africa and written in Latin Script
- CATMuS Medieval (Dataset) (CC-BY): Large-scale diverse dataset for handwritten text recognition of medieval manuscripts
- mOSCAR (CC-BY): Large-scale multilingual, multimodal (text-image) web-crawled corpus
- Bloom (BigScience RAIL License v1.0): Open large multilingual language model
- CATMuS Medieval (Model) (CC-BY): Handwritten Text Recognition model for medieval manuscripts in Latin scripts
- Counter dataset
- SSK (CC-BY): Collection of research use case scenarios illustrating best practices in Digital Humanities and Heritage research
- OFrLex-modifier (AGPL-3.0): Online user interface to collaboratively modify and check the OFrLex lexicon
8 New results
8.1 Large Corpus Creation: From OSCAR to COLaF
Participants: Benoît Sagot, Thibault Clérice, Rachel Bawden, Juliette Janès, Oriane Nédey, Rasul Jasir Dent, Matthieu Futeral-Peter, Malik Marmonier.
8.1.1 OSCAR and mOSCAR: Collecting Corpora from the Web
Since the introduction of large language models (LLMs) in Natural Language Processing (NLP), large raw corpora have played a crucial role. However, most of these large raw corpora are either available only for English or not available to the general public due to copyright issues. There are some examples of freely available multilingual corpora for the training of deep learning NLP models, such as the Paracrawl corpus 125 and our own large-scale multilingual corpus OSCAR 150, 121, 120. However, they have quality issues, especially for low-resource languages, an issue investigated in a large-scale study we were involved in and whose initial publication in 2021 139 was followed by a publication in the Transactions of the Association for Computational Linguistics 140. The latest OSCAR version is OSCAR 23.01.
In 2024, we worked on the development of a new, multimodal version of OSCAR, mOSCAR (for multimodal OSCAR). This work was carried out as part of Matthieu Futeral-Peter's PhD thesis, a joint effort between the ALMAnaCH and WILLOW teams (Cordelia Schmid, from WILLOW, being one of Matthieu's supervisors). It is also the result of a collaboration with engineers working at GENCI on the Jean Zay supercomputer. The goal of the project, initiated in 2023, was to address the lack of publicly available interleaved text-image data for languages beyond English. We automatically collected a large number of images from the internet associated with snippets of text (some captions, but also other types of text related to the image from the accompanying documents). Unlike existing datasets, which are either English-only, composed of caption-like data, or fully private, mOSCAR covers 163 languages and consists of 303 million documents, 200 billion tokens, and 1.15 billion images. We conducted extensive filtering and evaluation to ensure that the dataset is diverse, safe, and of high quality 105. The dataset is already publicly accessible under the Creative Commons CC BY 4.0 license, promoting further research in multilingual multimodal learning.
8.1.2 COLaF: Collecting high-quality text corpora for French and other languages of France
We finalised the creation of COLaF (“Corpus et Outils pour les Langues de France” — corpora and tools for the languages of France), an Inria “DEFI” (Inria-internal multi-team project) jointly led by Benoît Sagot and Slim Ouni. COLaF is carried out in close interaction with the (informal) OSCAR project and in collaboration with our colleagues from the MULTISPEECH team in the Nancy Inria Centre (Emmanuel Vincent, Slim Ouni) and the former ALMAnaCH member Laurent Romary, as well as with the Inria headquarters' support (in particular Jean-Frédéric Gerbeau). Its goal is to contribute to the development of free corpora and tools for French and French-based Creole, as spoken in France and abroad, as well as of other languages of France, in close collaboration with academic and institutional partners.
In 2024 we worked on multiple research directions in the context of COLaF:
-
We strengthened and initiated interactions with a number of relevant external partners, including:
- language-specific institutions, including the Académie Régionale de la Langue Picarde (Picard), Lo Congrès (Occitan) and, since 2024, The Historic New Orleans Collection (Louisiana French, Louisiana Creole) and the Institut du Monde Réunionnais (Réunion Creole).
- research laboratories such as LILPa (Université de Strasbourg and CNRS; Alsatian and other languages), LISA (Université de Corse and CNRS; Corsican), MoDyCo (Université Paris-Ouest and CNRS; Breton and other languages) and, since 2024, Tulane University (Louisiana).
In May we organised a kick-off meeting attended by over 50 people representing most of these partners and others.
- We finalised our work on text corpus representation based on the TEI guidelines, and published a first version of our COLaF TEI metadata scheme and the associated documentation (Web site, GitHub repository).
- We resumed our work on language identification, at the intersection between the OSCAR and COLaF projects, in the context of Rasul Dent's newly started PhD (funded by COLaF) and with a focus on French-based Creole languages. Our first result, the open corpus Molyé 47, combines stereotypical representations of three kinds of language variation in Europe with early attestations of French-based Creole languages across a period of 400 years. It is intended to facilitate future research on the continuity between contact situations in Europe and Creolophone (former) colonies. New results in this direction will be published in 2025. This work is being carried out in close collaboration with Pedro Ortiz, a former ALMAnaCH PhD student 147 now working for Common Crawl.
- We have been working on collecting Occitan corpora and improving the state of the art in Occitan dialect identification and Occitan part-of-speech tagging, as part of Oriane Nédey's engineer contract and subsequent PhD thesis.
- In the context of the ANR TraLaLaM project, and more precisely as a side project of Malik Marmonier's engineering work, we carried out preliminary work on the identification of resources for Franco-Provençal (also known as Arpitan, not to be confused with Provençal, a variety of Occitan).
- We developed a first version of our database of language resources for French, other languages of France and French-based Creoles, as well as a bibliography on COLaF-related topics.
8.2 Language Modelling
Participants: Benoît Sagot, Djamé Seddah, Éric de La Clergerie, Rachel Bawden, Nathan Godey, Aina Garí Soler, Wissam Antoun, Armel Zebaze, Rian Touchent.
8.2.1 Training language models
Pretrained language models are now ubiquitous in NLP. Despite their success, many early models were either trained on English data or on the concatenation of data in multiple languages 130, 141. One of the most visible achievements of the ALMAnaCH team was the training and release of CamemBERT in 2019, a BERT-like 130 (more specifically, RoBERTa-like) neural language model for French trained on the French section of our large-scale web-based OSCAR corpus 150 (see Section 8.1.1), still downloaded several million times a month. In previous years we also developed CamemBERT variants 143, ELMo models trained on OSCAR corpora for other languages, including French 148, 149, as well as, more recently, CamemBERTa 123, an alternative to CamemBERT based on the DeBERTaV3 architecture 138 that further improved the state of the art for French NLP.
However, models published several years ago, like CamemBERT, face challenges due to temporal concept drift, where outdated training data leads to a decline in performance, especially when encountering new topics and terminology. This issue emphasises the need for updated models that reflect current linguistic trends. In 2024 we introduced two new versions of the CamemBERTa and CamemBERT base models, named CamemBERTav2 and CamemBERTv2, designed to address these challenges 95. CamemBERTav2, like CamemBERTa, is based on the DeBERTaV3 architecture. It makes use of the Replaced Token Detection (RTD) objective for better contextual understanding. CamemBERTv2 is built on RoBERTa, like CamemBERT, which uses the Masked Language Modeling (MLM) objective. We trained both models on a significantly larger and more recent dataset with a longer context length and an updated tokeniser that enhances tokenisation performance for French. We demonstrated the versatility of these models and their effectiveness across a range of use cases. Our results show that these updated models vastly outperform their predecessors, making them valuable tools for modern NLP systems. All our new models, as well as intermediate checkpoints, are made openly available on Hugging Face.
8.2.2 Understanding language models
Another crucial research direction is gaining a better understanding of how language models actually work and of the properties of the vector representations they produce. This is the focus of Nathan Godey's PhD work, as well as one of the research topics of Aina Garí Soler. In this context, we worked in two different directions.
- Language models have long been shown to embed geographical information in their hidden representations. This line of work has recently been revisited in the context of LLMs. In a study published this year 53, we proposed to fill the gap between well-established and recent literature by observing how geographical knowledge evolves when scaling language models. We showed that geographical knowledge is observable even for tiny models and that it scales consistently as we increase the model size. Notably, we observed that larger language models cannot mitigate the geographical bias that is inherent to the training data.
- It has been observed that smaller language models can suffer from saturation, characterised as a drop in performance at some advanced point in training followed by a plateau. In NLP, this takes the form of anisotropy, a singular property of hidden representations that makes them unexpectedly close to each other in terms of angular distance (cosine similarity). Some recent works suggest that anisotropy is a consequence of optimising the cross-entropy loss on long-tailed distributions of tokens. Following our 2023 work on model anisotropy 137, we showed that anisotropy can also be observed empirically in language models with specific objectives that should not suffer directly from the same consequences 52. We also showed that the anisotropy problem extends to Transformers trained on other modalities. Our observations tend to demonstrate that anisotropy might actually be inherent to Transformer-based models. Investigating further, we found that such saturation can be explained by a mismatch between the hidden dimension of smaller models and the high rank of the target contextual probability distribution 106. This mismatch affects the performance of the linear prediction head used in such models through the well-known softmax bottleneck phenomenon. We measured the effect of the softmax bottleneck in various settings and found that models based on fewer than 1000 hidden dimensions tend to adopt degenerate latent representations in late pretraining, which leads to reduced evaluation performance.
- As part of Aina Garí Soler's postdoctoral research, under the supervision of Chloé Clavel and Matthieu Labeau (Telecom-Paris), we investigated how subword-based tokenisation impacts word representations. When deriving contextualised word representations from language models, a decision needs to be made on how to obtain representations for out-of-vocabulary (OOV) words that are segmented into subwords. What is the best way to represent these words with a single vector, and are these representations of worse quality than those of in-vocabulary words? We carried out an intrinsic evaluation of embeddings from different models on semantic similarity tasks involving OOV words, comparing the contextualised representations of words that are segmented into subwords with those of words that have a dedicated embedding in BERT and other models 30. Our findings are relevant for any NLP practitioner working with contextualised word representations, and particularly for applications relying on word similarity. We showed that (i) out of the tested strategies for split-word representation, averaging subword embeddings is the best one, with few exceptions; (ii) the quality of split-word representations is often worse than that of full words, although this depends on the kind of words considered; (iii) similarity values obtained for split-word pairs are generally higher than similarity estimations involving full words; (iv) the best layers to use differ across split types; (v) a higher number of tokens does not necessarily, as intuitively thought, decrease representation quality; and (vi) in the within-word setting, word form has a negative impact on results when words are split. Our results also point to specific aspects to which future research and improvement efforts should be directed. We make our split-sim dataset available to facilitate research on split-word representation.
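The anisotropy discussed above boils down to a simple quantity: the average pairwise cosine similarity between hidden representations. A minimal sketch with synthetic vectors (not actual model states) illustrates the measurement:

```python
import numpy as np

def average_cosine_similarity(hidden_states: np.ndarray) -> float:
    """Average pairwise cosine similarity of a set of hidden representations.
    A value close to 1 for unrelated inputs is the hallmark of anisotropy:
    the representations occupy a narrow cone of the embedding space."""
    # L2-normalise each row; the Gram matrix then holds cosine similarities
    norms = np.linalg.norm(hidden_states, axis=1, keepdims=True)
    unit = hidden_states / np.clip(norms, 1e-12, None)
    sims = unit @ unit.T
    n = sims.shape[0]
    # exclude the diagonal (self-similarity is always 1)
    return float((sims.sum() - n) / (n * (n - 1)))

rng = np.random.default_rng(0)
# Isotropic baseline: centred Gaussian vectors -> average similarity near 0
isotropic = rng.normal(size=(100, 64))
# Anisotropic case: a shared offset pushes all vectors into one cone
anisotropic = isotropic + 5.0
```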
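As a minimal illustration of the best-performing split-word strategy, averaging subword embeddings, consider the following sketch; the vocabulary and vector values are toy examples, not taken from any real model:

```python
import numpy as np

# Toy subword embedding table (hypothetical values; a real setup would
# read these from a model such as BERT)
subword_embeddings = {
    "un":      np.array([1.0, 0.0, 0.0, 0.0]),
    "##break": np.array([0.0, 1.0, 0.0, 0.0]),
    "##able":  np.array([0.0, 0.0, 1.0, 0.0]),
}

def split_word_embedding(subwords):
    """Represent a word segmented into subwords by a single vector:
    the average of its subword embeddings."""
    return np.mean([subword_embeddings[s] for s in subwords], axis=0)

# An OOV word split into three subwords gets one averaged vector
vec = split_word_embedding(["un", "##break", "##able"])
```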
Understanding how LLMs can make use of potentially complex and/or structured prompts is another research direction aimed at finding ways to improve their performance. For complex reasoning tasks that require step-by-step thinking, Chain-of-Thought (CoT) prompting has given impressive results, especially when combined with an approach called self-consistency. Nonetheless, some tasks remain particularly difficult for LLMs to solve. Tree of Thoughts (ToT) 160 and Graph of Thoughts (GoT) 127 emerged as alternatives, dividing complex problems into paths of subproblems. In 2024, we proposed a new type of in-context learning, which we called Tree of Problems (ToP), a simpler version of ToT, which improves performance on complex tasks that can be divided into identical subtasks 72. Our empirical results show that our approach outperforms ToT and GoT, and in addition performs better than CoT on complex reasoning tasks.
8.3 Machine Translation (MT)
Participants: Rachel Bawden, Benoît Sagot, Matthieu Futeral-Peter, Lydia Nishimwe, Nicolas Dahan, Ziqian Peng, Armel Zebaze Dongmo, Malik Marmonier.
8.3.1 Shared Tasks, Benchmarking and Evaluation
As in previous years, we continued our participation in the organisation of the shared tasks at the main conference in MT (WMT). Rachel Bawden was again a member of the organising committees of the general MT task 59, 107 and the biomedical MT task 64, whose purposes are to evaluate models on multiple domains and on the biomedical domain respectively. With the continuing use of LLMs for MT, both tasks studied some of the new challenges linked to the use of LLMs for translation.
In terms of benchmarking, there were a few other contributions beyond the shared tasks. Rachel Bawden participated in an initiative led by Jesujoba Alabi, a previous engineer in the team and now a PhD student at Saarland University, on the creation of a document-level corpus for African languages, AfriDoc, which was then used to evaluate LLM performance at the document level 94. She also continued her contribution to the benchmarking of the BLOOM language model through an informal initiative, “Bloume”, for the evaluation of the LLM specifically for French. Several tasks, including MT, were included in the evaluation 41.
8.3.2 Low-resource MT
Improving low-resource MT by understanding how LLMs can take advantage of analogies is the topic of Armel Zebaze's PhD, supervised by Rachel Bawden and Benoît Sagot. He has been studying the use of different language models for few-shot example selection, particularly to help the translation of low-resource language pairs. The ability of generative LLMs to perform in-context learning has given rise to a large body of research into how best to prompt models for various NLP tasks. For low-resource MT, no systematic studies had been published on how best to select examples, and mixed results had been reported on the usefulness of similarity-based selection over random selection. We published a study covering multiple LLMs and multiple in-context example retrieval strategies, comparing multilingual sentence embeddings 114. We covered several language directions, representing different levels of language resourcedness (English into French, German, Swahili and Wolof). Contrary to previously published results, we found that sentence embedding similarity can improve MT, especially for low-resource language directions, and studied the balance between selection pool diversity and quality. We also highlighted potential problems with the evaluation of LLM-based MT (dealing with copying of the source text, empty translations and translating in the wrong language) and suggested a more appropriate evaluation protocol, adapting the COMET metric to the evaluation of LLMs.
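Similarity-based example selection of the kind studied here can be sketched as follows; the toy 2-dimensional vectors stand in for real multilingual sentence embeddings, which in practice would come from an encoder such as LASER or SONAR:

```python
import numpy as np

def select_examples(query_emb, pool_embs, k=3):
    """Return the indices of the k pool sentences whose embeddings are
    most cosine-similar to the query sentence; the corresponding
    translation pairs are then used as in-context examples."""
    q = query_emb / np.linalg.norm(query_emb)
    p = pool_embs / np.linalg.norm(pool_embs, axis=1, keepdims=True)
    sims = p @ q  # cosine similarity of each pool sentence to the query
    return np.argsort(-sims)[:k].tolist()

# Toy "sentence embeddings" for a selection pool of four sentences
pool = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]])
query = np.array([1.0, 0.0])
top2 = select_examples(query, pool, k=2)  # -> [0, 2]
```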
Capitalising on our work on Tree of Problems (cf. Section 8.2), we developed a new LLM-based translation paradigm, named compositional translation, which relies on the notion of analogy in a different way 113. Compositional translation uses an LLM to decompose a sentence into simpler phrases (or segments), and then to translate each phrase with the help of retrieved demonstrations. Finally, the LLM is prompted to translate the initial sentence with the help of the self-generated phrase-translation pairs, thereby relying on another way of characterising analogy in translation. Our intuition was that this approach should improve translation because these shorter, simpler phrases should be intrinsically easier to translate and easier to match with relevant examples. This proves especially beneficial in low-resource scenarios, and more generally whenever the selection pool is small or out of domain. We showed that compositional translation boosts LLM translation performance on a wide range of popular MT benchmarks, including FLORES 200, NTREX 128 and TICO-19.
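The final prompting step of compositional translation can be sketched as below; the prompt wording, language pair and phrase translations are illustrative, not the exact templates used in the paper:

```python
def build_compositional_prompt(source, phrase_pairs,
                               src_lang="English", tgt_lang="Swahili"):
    """Assemble the final prompt: the self-generated phrase/translation
    pairs act as demonstrations for translating the whole sentence."""
    blocks = [f"Translate from {src_lang} to {tgt_lang}."]
    for src_phrase, tgt_phrase in phrase_pairs:
        blocks.append(f"{src_lang}: {src_phrase}\n{tgt_lang}: {tgt_phrase}")
    # The original sentence comes last, left open for the LLM to complete
    blocks.append(f"{src_lang}: {source}\n{tgt_lang}:")
    return "\n\n".join(blocks)

prompt = build_compositional_prompt(
    "The children are playing in the garden.",
    [("the children", "watoto"), ("in the garden", "bustanini")],
)
```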
Another approach to LLM-based low-resource MT is to explore whether LLMs can benefit from an explicit description of the low-resource language (such as a grammar, a monolingual lexicon, or a bilingual lexicon) in order to facilitate translation to or from that language. In the context of the TraLaLaM project, we investigated the capacity of LLMs to engage in explicit learning of a new, unseen language, i.e. assimilating metalinguistic explanations, to perform language tasks 109. Using constructed languages derived from French by means of cryptographic methods as controlled test environments, we designed experiments to evaluate how well an LLM can learn and apply grammatical rules. Our findings indicate that, contrary to previously published claims 124, LLMs demonstrate a measurable ability for explicit learning, but that this capability diminishes as the complexity of the linguistic phenomena increases. Supervised fine-tuning on chains of thought significantly enhances LLM performance but struggles to generalise to typologically novel or more complex linguistic features.
8.3.3 MT applied to specific domains
One of the major challenges for the development of high quality MT models is adapting them to specific domains (including scientific domains), or even ideally to multiple domains at once. We have made several contributions to MT for specific domains this year.
In the context of the DadaNMT project led by Rachel Bawden, which ended at the end of 2023, we finalised and published several works in 2024 investigating how to improve MT for domain adaptation and cross-domain transfer. Firstly, we studied how incorporating bilingual lexicons could help MT of specific domains by using a technique that is commonly used in domains with a controlled language (i.e. with no ambiguity): inline injection of candidate translations within the source sentence 132. We sought to see whether this technique could be successfully used in the general MT setting (where there is ambiguity and the lexicon is not specific to the language of the domain) to help better translate specific domains. Working with a strong encoder-decoder model (mT5 fine-tuned on parallel data) and using the test case of German-to-English, we showed that the technique does not result in the same improvements as previously reported. However, in our analysis using distractor annotations, we showed that the model can learn to select appropriate translation candidates and ignore irrelevant ones, thereby exhibiting more than a systematic copying behaviour, particularly in a low- or mid-resource setting 35. In a follow-up, we moved to LLMs and how to specialise them to specific domains through in-context example selection and topic models. We tested the relevance of topic models by using them to select informative examples even for out-of-domain inputs, experimenting on 7 diverse domains and 11 language pairs of differing resourcedness and comparing performance to the use of domain labels and keywords. Our results showed that few-shot examples and related keywords consistently improved translation quality, that example diversity must be balanced with source similarity, and that our pipeline was overly restrictive for example selection when a targeted development set was available.
However, this work provided a starting point for later work carried out on analogy for low-resource MT in the context of Armel Zebaze Dongmo's PhD (see Section 8.3.2).
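The inline injection technique mentioned above can be sketched as follows; the tag names and lexicon entries are illustrative, and the actual formatting used in the paper may differ:

```python
def inject_candidates(source_tokens, lexicon):
    """Append candidate translations from a bilingual lexicon directly
    after the corresponding source tokens, delimited by inline tags,
    so the model can choose among (or ignore) the candidates."""
    out = []
    for tok in source_tokens:
        out.append(tok)
        candidates = lexicon.get(tok.lower())
        if candidates:
            out.append("<c> " + " | ".join(candidates) + " </c>")
    return " ".join(out)

# German-to-English example with a toy lexicon entry
lexicon = {"hund": ["dog", "hound"]}
augmented = inject_candidates(["Der", "Hund", "bellt"], lexicon)
# -> "Der Hund <c> dog | hound </c> bellt"
```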
A large part of our work in domain-specific MT has been on the translation of scientific documents in the context of the MaTOS (Machine Translation for Open Science) ANR project, led by François Yvon (CNRS). There is a specific focus on English–French and French–English translation, since the high quality of MT for these directions allows us to work on higher-level challenges such as term translation, consistency and document coherence. Two PhD students are currently co-supervised by Rachel Bawden jointly with François Yvon in the project: Ziqian Peng (recruited by the CNRS) on document-level MT for scientific documents and Nicolas Dahan on the evaluation of MT. As part of the project, we published two surveys, one on each of these topics 110, 102, and trained MT models for the translation of documents in the NLP and Earth and Planetary Science domains 112. One of the major challenges for the translation of scientific documents is the translation of documents in one go (rather than sentence by sentence) to ensure consistency in the translation of terms and to maintain document coherence. MT models tend to struggle to translate longer segments, a challenge we studied in the context of Ziqian Peng's PhD, looking at how we can evaluate the effect of document length on translation performance and what effect it has. Through carefully controlled experiments, we showed that translation performance decreases with the length of the input text and that translation quality depends on the position within the document, i.e. sentences at the beginning are better translated than those at the end 75, 111. Having identified this challenge, we studied an approach to improve the translation of longer segments and of later positions within source texts by manipulating the distribution of positions within documents during training so that all positions up to a maximum length are seen approximately the same number of times (i.e.
exposing the model to later positions as much as earlier ones). Our results showed that the approach only marginally mitigates such problems and that document-level MT does not yet match the performance of sentence-based MT 111. In the context of Nicolas Dahan's PhD, we continued to look at ways of evaluating document-level consistency and coherence when translating. In research inspired by the use of contrastive test examples, consisting of correct and incorrect translations, we automatically generated contrastive examples designed to stress-test different metrics on potential errors of document-level MT systems (e.g. truncating translations, shuffling sentences), document-level effects to which metrics should be robust (splitting sentences, use of synonyms) and other challenges linked to consistency and coherence. Testing on English–French, we evaluated several MT metrics, using multiple segmentation strategies and span lengths. Our experiments showed that some metrics that are highly sensitive to local changes underestimate long-range disruptions, and vice versa, and that certain perturbations (e.g. those related to coherence and repetition) reveal shortcomings in metrics that would appear robust under sentence-level evaluation. Finally, we also published our study on post-editing automatic translations, analysing the performance of the different models and studying the differences between post-edits produced by professional translators and those produced by authors in the NLP community 42, and we extended our study to the Earth and Planetary Sciences domain.
8.3.4 MT Applied to Non-standard Texts
User-generated content (UGC) such as texts found on social media is characterised by various phenomena not typically present in standard edited texts, which present challenges for MT (e.g. spelling mistakes, acronyms, truncations and contractions). It is important to develop MT models that are robust, meaning that they are able to translate these kinds of texts just as well as if the texts had not displayed non-standard variation.
In the context of Lydia Nishimwe's PhD on robust MT, supervised by Rachel Bawden and Benoît Sagot, we have been working on building robust sentence embeddings, using the test case of LASER and, inspired by T-modules (research carried out in the context of Paul-Ambroise Duquenne's PhD), using distillation to reduce the distance between non-standard sentences and their normalised versions. We introduced RoLASER, a robust English encoder trained using a teacher-student approach to reduce the distances between the representations of UGC sentences and the LASER representations of their standard counterparts 67. We showed that, with training only on standard and synthetic UGC-like data, RoLASER significantly improves LASER's robustness to both natural and artificial UGC data, achieving up to 2x and 11x better scores. Evaluation on downstream tasks showed that RoLASER performs comparably to or better than LASER on standard data, while consistently outperforming it on UGC data. Since then, we have extended the approach to SONAR 133, the follow-up to LASER, thereby creating RoSONAR, and we have carried out experiments to see whether RoSONAR can be used to create robust and modular MT architectures.
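The core of the teacher-student objective can be sketched as a distance between embeddings, here mean-squared error; the exact loss and training details of the published approach may differ:

```python
import numpy as np

def distillation_loss(student_ugc_emb, teacher_std_emb):
    """Distance between the student's embedding of a UGC sentence and the
    frozen teacher's embedding of its standardised counterpart; minimising
    it pulls non-standard sentences towards their standard versions in
    embedding space."""
    student_ugc_emb = np.asarray(student_ugc_emb, dtype=float)
    teacher_std_emb = np.asarray(teacher_std_emb, dtype=float)
    return float(np.mean((student_ugc_emb - teacher_std_emb) ** 2))

# e.g. student("c u l8r") vs frozen teacher("see you later")
loss = distillation_loss([0.2, -0.1, 0.5], [0.0, -0.1, 0.5])
```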
8.3.5 Multimodal MT
In the context of Matthieu Futeral-Peter's PhD thesis, co-supervised by Rachel Bawden, Benoît Sagot and Cordelia Schmid (WILLOW project-team), we have continued work on multimodal (image-text) MT. Our focus has remained on exploiting visual context to address ambiguity in translation, which is a fundamental challenge.
In 2024, our research resulted in two major publications. The first paper involves mOSCAR, our new large-scale multilingual and multimodal document corpus (see Section 8.1.1). We demonstrated that training multilingual models on mOSCAR significantly improves few-shot learning capabilities across various multilingual image-text tasks, reinforcing the importance of diverse multimodal data for multimodal LLMs 105.
Our second major contribution in 2024, ZeroMMT 104, introduced a method to train multimodal MT systems without requiring fully supervised multimodal parallel data (i.e., sentence pairs with corresponding images). Instead, we adapted a strong text-only MT model using a joint training objective: visually conditioned masked language modeling and a Kullback-Leibler divergence loss that encourages the MMT model's outputs to remain consistent with the original MT model while integrating visual cues. This approach allows for effective learning from multimodal English data alone, making it applicable to language pairs with no existing multimodal training corpora. ZeroMMT demonstrated disambiguation performance close to state-of-the-art multimodal MT models trained on fully supervised data, as evaluated on standard MMT benchmarks and our contrastive evaluation set CoMMuTE 135. To further validate its generalisation capabilities, we extended CoMMuTE to three new languages: Arabic, Russian, and Chinese. Additionally, we introduced an inference-time trade-off mechanism using classifier-free guidance, allowing for a controlled balance between translation fidelity and disambiguation strength without requiring additional data.
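The Kullback-Leibler term of the ZeroMMT objective can be sketched as below; this is a toy numpy version over explicit probability vectors, whereas the real implementation operates on model logits during training:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two output distributions. In ZeroMMT-style
    training, p would be the multimodal model's output distribution and
    q the frozen text-only MT model's, keeping the two consistent while
    the visual conditioning is learnt."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0)
    q = np.clip(np.asarray(q, dtype=float), eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

same = kl_divergence([0.7, 0.2, 0.1], [0.7, 0.2, 0.1])   # no drift: 0
drift = kl_divergence([0.7, 0.2, 0.1], [0.4, 0.4, 0.2])  # penalised: > 0
```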
8.4 Speech Modelling
Participants: Tú Anh Nguyen, Paul-Ambroise Duquenne, Benoît Sagot.
2024 was the final year of a set of related efforts towards language modelling for speech data, mostly carried out in collaboration with META in the context of Robin Algayres's PhD thesis, defended in 2023, and of Tú Anh Nguyen's and Paul-Ambroise Duquenne's PhD theses, co-supervised on the META side by Emmanuel Dupoux and Holger Schwenk respectively, both defended in 2024.
In the final paper to which we contributed 28, we introduced SpiRit-LM, a foundation multimodal language model that freely mixes text and speech. The model is based on a 7B pretrained text language model that we extended to the speech modality by continuously training it on text and speech units. Speech and text sequences are concatenated as a single stream of tokens, and trained with a word-level interleaving method using a small automatically curated speech-text parallel corpus. SpiRit-LM comes in two versions: a base version that uses speech phonetic units (HuBERT) and an expressive version that models expressivity using pitch and style units in addition to the phonetic units. For both versions, the text is encoded with subword BPE tokens. The resulting model displays both the semantic abilities of text models and the expressive abilities of speech models. Additionally, we demonstrated that SpiRit-LM can learn new tasks in a few-shot fashion across modalities (i.e., ASR, TTS, Speech Classification). Model weights and inference code are freely available.
8.5 Hate Speech and Radicalisation Detection
Participants: Djamé Seddah, Arij Riabi, Wissam Antoun, Virginie Mouilleron, Menel Mahamdi, José Carlos Rosales Núñez.
Under the umbrella of the H2020 CounteR project (2021-2024), we pursued our work on language variation through the prism of multilingual and cross-dialect hate speech. This is especially important since most multilingual language models (whether discriminative, such as mBERT or XLM-R, or autoregressive, such as GPTx or LLaMa) are trained on a variety of data sources that tend to consider dialectal variations as lone instances of a given language, de facto ignoring important language nuances that can lead to misinterpretation of produced speech utterances or, worse, to misclassifications based on user-specific geolect varieties.
8.5.1 Identifying Common Examples in Spanish Dialects in Hate-Speech Contexts
Our previous research on multilingual zero-shot hate speech detection 145 highlighted the cultural gap in cross-lingual settings. For instance, the Spanish word puta `prostitute' is often used as a non-offensive intensifier, but was flagged as offensive by models trained on an English dataset that focuses on misogyny. While such cultural differences across languages are increasingly studied, their impact on dialects of the same language remains unclear. After our work on hate speech contrasting different varieties of Spanish spoken either in South America or in Spain 128, we extended our approach, this time focusing on the Cuban Spanish variant and the identification of cross-dialect common examples.
As mentioned above, variations in languages across geographic regions or cultures are crucial to addressing biases in NLP systems designed for culturally sensitive tasks, such as hate speech detection or dialogue with conversational agents. In languages such as Spanish, where varieties can significantly overlap, many examples can be valid across varieties, which we refer to as common examples. Ignoring these examples may cause misclassifications, reducing model accuracy and fairness. Accounting for these common examples is therefore essential to improving the robustness and representativeness of NLP systems trained on such data. We used training dynamics to automatically detect common examples or errors in existing Spanish datasets. We demonstrated the effectiveness of using predicted label confidence in our Datamaps 157 implementation for the identification of hard-to-classify examples, especially common examples, enhancing model performance in a variety of identification tasks. Besides this original use of Datamaps, an interesting outcome lies in the introduction of a Cuban Spanish Variety Identification dataset with common example annotations, developed to facilitate more accurate detection of Cuban and Caribbean Spanish varieties. To our knowledge, this is the first dataset focused on identifying the Cuban, or any other Caribbean, Spanish variety. After a first preprint in the autumn of 2024, this work was published in the specialised VARDIAL workshop, co-located with COLING 36.
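The training-dynamics statistics underlying a Datamaps-style analysis can be sketched as follows; the per-epoch gold-label probabilities are toy values, and the threshold used to flag hard-to-classify examples is a design choice:

```python
import numpy as np

def datamap_stats(gold_probs_per_epoch):
    """For each training example, 'confidence' is the mean probability the
    model assigns to the gold label across epochs and 'variability' its
    standard deviation; low-confidence examples are the hard-to-classify
    ones inspected for common examples or annotation errors."""
    probs = np.asarray(gold_probs_per_epoch, dtype=float)  # (n_epochs, n_examples)
    return probs.mean(axis=0), probs.std(axis=0)

# Three epochs, two examples: the first is learnt easily, the second is hard
confidence, variability = datamap_stats([[0.90, 0.2],
                                         [0.95, 0.3],
                                         [0.97, 0.1]])
hard = confidence < 0.5  # flag candidates for manual inspection
```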
8.5.2 The CounteR Dataset
The first half of the CounteR project faced an almost fatal challenge, namely the lack of availability of a multilingual radical content dataset covering the project's target ideology spectrum (from white supremacism to Jihadism). This is why we first had to develop our multitask learning architecture on adjacent domains such as hate speech before being able to focus on real radical content suitable for the law enforcement agencies (LEAs) that were part of the project and designed to be our end-users. Such data was made available to us by an EU-external third-party contractor, validated by the European Commission project officer, over an 18-month period during the second half of the project. The final dataset covers 12 languages, some of which were actually provided by the LEAs themselves. ALMAnaCH was responsible for the whole NLP work package (“Data Analytics for Detecting Radical Content”), which spanned the entire duration of the project. Due to the EU-classified nature of our core deliverables (D4.3 “NLP Basic Features” and D4.4 “Transfer Learning for NLP”), we cannot present our results on cross-lingual transfer and out-of-domain scenarios for radical content detection here. As per the consortium agreement, the only data that could be released would have to be fully GDPR-compliant, entailing a perfect pseudo-anonymisation process. Given the extent of the work needed for this task, we therefore focused on a subset of languages comprising French, Arabic and English. The following sections present this work, the analysis we carried out, and the challenges of using such a dataset in classification tasks.
Semantic Pseudonymisation of Radical Content
NLP methods have been used to detect and analyse radicalisation mechanisms such as propaganda, recruitment, networking, data manipulation, and disinformation 158, 122, 136. However, the effectiveness of such detection models depends on the availability and quality of training and evaluation datasets. Protecting user privacy, especially for sensitive tasks, is imperative when sharing such datasets. Finding the right balance between the obligation to build accurate anonymisation methods and the need to maintain a decent level of performance is hard, as relevant information may be present in some identifiers (usernames, URLs, locations, etc.) and their associated socio-demographic or geographic markers. Hence, an overly aggressive anonymisation of a dataset can hinder its usability, especially in a domain where radicalisation clues are often found through these indicators 151.
Ensuring the privacy of individuals is critical, especially in light of regulations such as the General Data Protection Regulation (GDPR). This is why we believed that despite implementing various laws to minimise harm and protect sensitive information, there was, and still is, a need to explore how technological advancements intersect with data protection laws and impact the collection, storage, and use of confidential data 142.
Therefore, to be able to release a dataset following the GDPR principles, we developed a manual pseudonymisation methodology tailored to our radicalisation dataset that (i) ensures performance comparable to that obtained on the original data while maintaining its semantic properties and (ii) protects user privacy. To highlight the importance of establishing a standard framework for privacy and usefulness when processing sensitive NLP data, we shared the complete pseudonymisation process for our datasets, including our guidelines and the challenges we faced. It is a highly sensitive task that requires 100% accuracy; any oversight can render the dataset invalid.
To give more details, the manual annotation process we devised guarantees a high level of precision and enables us to better explore the interaction of our NLP tools and improve user safety. Furthermore, a critical component of our methodology involves identifying the exceptions to which anonymisation does not need to be applied. For example, keeping well-known events and public figures enables us to leverage the knowledge embedded in the language model about specific entities and prevents pseudonymisation from corrupting the relationships and alignment between named entities and other elements within the text, thereby enhancing the effectiveness of our system. Our evaluation results show that models trained on our pseudonymised data maintain similar levels of performance to their original counterparts. An interesting side effect of our pseudo-anonymisation process arises from the necessary identification of all named entities: because we needed to identify them all (names, addresses, aliases, etc.), we had to annotate them manually (using a classic bootstrap process based on state-of-the-art multilingual NER for our 3 languages). Hence, not only is the dataset now pseudo-anonymised, it also contains a multilingual named entity layer covering our target ideologies.
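A much simplified sketch of the exception-aware replacement step is given below; the substitute table, labels and keep-list are illustrative, and the actual guidelines choose semantically plausible substitutes rather than a single fixed replacement per category:

```python
# Hypothetical substitute tables; the real guidelines choose replacements
# that preserve the semantic properties of the original entities.
SUBSTITUTES = {"PER": "John Doe", "LOC": "Springfield", "ORG": "Acme Corp"}
PUBLIC_FIGURES = {"Napoleon"}  # well-known entities kept verbatim

def pseudonymise(tokens, entity_labels):
    """Replace annotated named entities with substitutes, except for
    well-known public figures and events, which are kept so as not to
    corrupt entity-text alignments the language model can exploit."""
    out = []
    for tok, label in zip(tokens, entity_labels):
        if label and tok not in PUBLIC_FIGURES:
            out.append(SUBSTITUTES.get(label, "[REDACTED]"))
        else:
            out.append(tok)
    return out

result = pseudonymise(["Ali", "met", "Napoleon", "in", "Paris"],
                      ["PER", None, "PER", None, "LOC"])
# -> ["John Doe", "met", "Napoleon", "in", "Springfield"]
```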
Our dataset includes English, French, and Arabic content from various sources such as forums, Telegram and other social media platforms. The content covers different radicalisation domains (from white supremacy to jihadism) for each language. Our dataset is made freely available (See its webpage for details).
This work was published at the specialised PrivacyNLP workshop, as part of ACL 2024 68.
Exploring the CounteR Dataset
Once we had properly processed the raw data provided by the contractor, we could begin to explore the challenges of detecting online radical content. As mentioned earlier, the CounteR dataset is a multilingual dataset designed to identify radicalisation across English, French and Arabic texts. Unlike existing datasets, which often focus on a single ideological perspective, it captures a broader range of extremist discourses, including Jihadist, far-right and other radical discourses.
During the anonymisation phase of the dataset, we noticed a number of annotation biases that could conflict with the legal context of the countries where a classifier trained on our dataset was supposed to be tested. Some of those biases were almost uniformly directed towards certain ideologies and, in some cases, towards some specific communities. Given the lack of information regarding the annotation phase conducted by the third-party contractor (socio-demographic details of the annotators, country of origin, political stances, etc.), we decided to (i) reannotate the English and French datasets so we could contrast the original annotations with annotations coming from different persons with different socio-demographic traits and political opinions, (ii) study the impact of human label variation 153 and finally (iii) analyse our classifier results with regard to different socio-demographic variables.
Our study first highlighted the complexities of human annotation in this domain, showing that annotator disagreements and socio-demographic backgrounds influence labelling decisions. By comparing prescriptive annotations (where strict guidelines are enforced) to a descriptive approach that allows for more subjective interpretations, we demonstrated how these variations impact model predictions and fairness. To further investigate the classifier biases, we generated synthetic data using a specific large language model aligned without any safeguards (Vicuña Uncensored, 13B), embedding diverse socio-demographic traits into the generated text. This persona prompting technique enabled us to generate a 1900-document dataset that we used to assess whether models treat different socio-demographic groups fairly. Our findings revealed that ethnicity, nationality, and political affiliation significantly affect classification outcomes, raising concerns about how much bias propagation can affect an NLP system such as our multitask learning classification pipeline.
Performance-wise, our results also indicate that multilingual models perform best when each language has a dedicated classifier rather than a shared one. Additionally, we challenged the definition of the task itself, contrasting classification and regression approaches, finding that classification models achieve higher accuracy but sometimes make severe misclassifications, while regression models better preserve ordinal relationships in radicalisation levels.
Beyond the technical challenges, we had to take into account major ethical concerns in the development and deployment of radicalisation detection models. Automated systems in this domain carry a risk of misuse, particularly in profiling individuals or communities based on flawed predictions. To mitigate this, in our discussion with the consortium and finally in our paper, we repeatedly emphasised the necessity of human supervision in the application of these models.
To conclude, we argue that closer collaboration between NLP researchers and social scientists is crucial for mitigating bias and improving the interpretability of radicalisation detection models. This is one of the reasons ALMAnaCH is getting increasingly involved with social scientists from Sciences Po's Medialab via the Salm exploratory action.
Finally, while this dataset, the first of its kind, represents an important step toward multilingual, context-aware radical content detection, it is important to note that language and behaviours associated with radicalisation continuously evolve, requiring ongoing dataset updates and model adaptations to maintain any hope of effectiveness and fairness.
This work was published at the COLING 2025 conference 69.
8.5.3 Miscellaneous
We adapted the NArabizi Treebank NER annotation scheme so that it could be part of the Universal NER framework. This work led to a publication at the NAACL 2024 conference 60.
8.6 NLP for Specific Application Domains
Participants: Éric Villemonte de La Clergerie, Simon Meoni, Rian Touchent, You Zuo, Benoît Sagot, Lauriane Aufrant, Léo Labat.
8.6.1 Biomedical NLP
In the context of the BPI-funded project ONCOLAB, we explored the application of NLP techniques to the medical domain. The medical domain is a very specialised one, and clinical documents (EHR – Electronic Health Records) are very sensitive data and not readily available, especially for French. During his Master's thesis, Rian Touchent tested several approaches to collect a dataset of French biomedical documents (biomed-fr) and use it to continue the pretraining of our CamemBERT language model, leading to CamemBERT-bio 159, 71, one of the first French language models specialised for the biomedical domain. Continual pre-training ensured a very low environmental impact (compared to a pre-training from scratch) and CamemBERT-bio exhibits state-of-the-art performance on several French biomedical benchmarks.
The lack of biomedical data, and especially of annotated biomedical data, also led Simon Meoni to explore the use of LLMs (in particular InstructGPT) to automatically annotate medical entities in the E3C corpus (without any prior fine-tuning) in the context of his CIFRE PhD (with the company Arkhn), supervised by Éric Villemonte de La Clergerie for ALMAnaCH. In 2024, to overcome the issue of the necessary confidentiality of medical data, which leads to a lack of freely available datasets, he explored the generation of synthetic documents using keywords extracted from real health records that do not contain confidential information 61. Furthermore, we introduced a reward mechanism that iteratively refines the quality of synthetic documents. This involves scoring synthetic candidates against real clinical reports using a semantic textual similarity score and performing an alignment step to align the model with its best-scored utterances.
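The candidate-scoring step can be sketched as follows. The actual system relies on a semantic textual similarity model; a simple token-overlap (Jaccard) score stands in for it here, and all texts are invented examples.

```python
def overlap_score(a, b):
    # Stand-in for a semantic textual similarity model: Jaccard
    # similarity over lowercased tokens, in [0, 1].
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def best_candidate(candidates, reference):
    # Score each synthetic candidate against a real (de-identified)
    # reference report and keep the best-scored one for alignment.
    return max(candidates, key=lambda c: overlap_score(c, reference))

reference = "patient admitted with acute chest pain and dyspnea"
candidates = [
    "patient admitted with acute chest pain and shortness of breath",
    "routine follow-up visit with no complaints",
]
print(best_candidate(candidates, reference))
```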
8.6.2 NLP for Patents
As part of You Zuo's CIFRE PhD, in collaboration with the start-up qatent, acquired by Questel in 2024, supervised by Benoît Sagot and Éric Villemonte de La Clergerie for ALMAnaCH and Kim Gerdes for qatent (see Section 9), we have broadened our research into NLP for patents, with a focus on automatic patent generation and computer-aided patent writing. In particular, we published a comparative evaluation and analysis of language models for patent generation, focusing on two tasks 73: (i) the generation of abstracts from claims and (ii) the generation of claims given previous ones. We developed a benchmark, PatentEval, and a comprehensive error typology for both tasks. We manually annotated the outputs of various models, including models specific to the patent domain as well as general-purpose language models. In addition, we explored and evaluated several metrics to approximate human judgements in patent text evaluation, analysing the extent to which these metrics align with expert assessments. These approaches provide valuable insights into the capabilities and limitations of current language models in the specialised field of patent text generation.
8.7 Named Entity Recognition in the Context of Inria's “Mission Security-Defence”
In 2024, our work on named entity recognition (NER) in collaboration with the Inria “Mission Security-Defence” focused on two main directions. Firstly, we introduced UkraiNER 37, a novel corpus comprising 10,000 French sentences within the geopolitical news domain, annotated using a comprehensive and simplified scheme that encompasses nested, discontinuous, and non-named entities. Baseline experiments on this corpus yielded an F1 score of 82 for comprehensive entity recognition and 87 when focusing on traditional nested NER, providing valuable insights into the challenges faced by state-of-the-art NER models. Secondly, we explored the interplay between entity linking and coreference resolution, demonstrating that incorporating coreference chain information can enhance disambiguation processes 74. Our error analysis and oracle experiments revealed that combining predictions within coreference chains could improve F1 scores by up to 4.3 on coreferent mentions in English. We proposed a voting-based combination strategy with various weighting heuristics, resulting in modest yet interpretable gains.
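The voting-based combination can be illustrated with a minimal sketch, assuming each mention in a coreference chain carries a scored entity prediction; the entity identifiers and confidence values below are invented, and confidence weighting is just one of the heuristics mentioned above.

```python
from collections import defaultdict

def combine_chain(predictions):
    """predictions: one (entity_id, confidence) pair per mention."""
    votes = defaultdict(float)
    for entity, confidence in predictions:
        votes[entity] += confidence  # confidence-weighted voting
    # All mentions in the chain are relabelled with the winning entity.
    return max(votes, key=votes.get)

chain = [("Q212", 0.9), ("Q212", 0.6), ("Q30", 0.8)]  # 3 coreferent mentions
print(combine_chain(chain))
```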
8.8 Information Extraction from Specialised Collections
Participants: Alix Chagué, Floriane Chiffoleau, Hugo Scheithauer, Sarah Bénière, Thibault Clérice, Juliette Janès, Cecilia Graiff.
In the context of DataCatalogue, a project with the Bibliothèque nationale de France (BnF) and the Institut national d'histoire de l'art (INHA), we continued our work on the automatic extraction and structuring of information from specialised collections and on the normalisation and modelling of the structure of such catalogues. In 2024, we focused on the automatic retro-structuring of the layout and content of auction sale catalogues 86.
Our collaboration within the international EHRI project led to an augmented publication workflow for the digital edition of Holocaust testimonies 77. The current iteration of the pipeline facilitates the extraction of text from digitised documents (both manuscripts and typed materials) and its transformation into a standardised XML-TEI representation 43, the goal being the sustainable management of such testimonies. Subsequent post-processing includes NER annotations, which allowed us to repurpose the Holocaust testimony editions to develop multilingual domain-specific NER tools. This facet aligns with our engagement in the NER4Archives initiative and our sustained collaboration with the Archives Nationales.
8.9 Automatic Text Recognition for Historical Documents
Participants: Thibault Clérice, Benoît Sagot, Alix Chagué, Floriane Chiffoleau, Hugo Scheithauer, Juliette Janès, Sarah Bénière, Hassen Aguili.
In 2024, we continued our participation in the development and promotion of eScriptorium and its Kraken HTR engine. Since mid-2024, we have been in talks with the EPHE about merging our instances of eScriptorium to accommodate more users, with an infrastructure allowing for around 1PB of user data. In this context, we used the example of the new, user-generated documentation created for eScriptorium to investigate the benefits and limitations of such contributions to open-source software 78. The new documentation offers a solution to a scattered, hard-to-maintain landscape of documentation on the tool, and favours future collaborations across user groups and languages. We also took part in the training of a Kraken HTR model for Ancient Greek, using the 10th-century Heidelbergensis Palatinus graecus 23 manuscript 54. It is an essential source of the Palatine Anthology, and its clarity and neat script make it well suited to this task.
We have been involved since its beginning in the CATMuS (Consistent Approaches to Transcribing ManuScripts) initiative, which brings together DH researchers from multiple teams and whose goal is to define datasets and guidelines for training large HTR models, and to share such guidelines, datasets and models 83. Our first focus was medieval manuscripts. CATMuS provides an inter-compatible dataset spanning more than 200 manuscripts and incunabula in 10 different languages that use the Latin script, comprising over 160,000 lines of text and 5 million characters, from the 8th century to the 16th. The dataset's consistency in transcription approaches aims to mitigate challenges arising from the diversity in standards for medieval manuscript transcriptions, providing a comprehensive benchmark for evaluating HTR models on historical sources. Based on this corpus, we trained the CATMuS-Medieval model 152, a Kraken HTR model trained on four different languages (in descending order of importance in the dataset: Old and Middle French, Latin, Spanish (and other languages of Spain), and Italian) on strictly graphematic transcriptions (no abbreviations are resolved). It is the result of a collaboration between multiple projects, including CREMMA, Gallic(orpor)a, HTRomance and HTRogène 46.
We took part in a follow-up to this work dedicated to Medieval Dutch: ARletta is a series of open-source models for the automated transcription of historic Dutch-language handwritten sources 26. All models presented were trained on publicly available data, including CATMuS-Medieval data when applicable and a newly digitised large-scale collection of local police reports (1876–1945), for which we used the open-source kraken engine. Additionally, we trained a “supermodel” on the union of other Dutch-language datasets (extending back to the 17th century) which we hope will be useful as a foundational model for future projects. Our results demonstrate performance that is competitive with proprietary software solutions.
The development of CATMuS-Medieval made it necessary to explore a long-standing research question: it has long been observed that HTR model performance on the best manuscripts degrades over time as more data is incorporated into the training set, likely due to over-generalisation. We therefore investigated the impact of incorporating contextual metadata in the training of HTR models to mitigate this effect 58. In our experiments, based on the CATMuS-Medieval corpus, we compared the performance of various model architectures, focusing on Conformer models with and without contextual inputs, as well as Conformer models trained with auxiliary classification tasks. Results indicate that Conformer models using semantic contextual tokens (century, script, language) outperform baseline models, particularly on challenging manuscripts. This demonstrates that metadata can enhance model accuracy and robustness across diverse historical texts.
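The idea of conditioning on contextual metadata can be illustrated schematically: semantic context tokens are prepended to each line's sequence so the model can exploit them. The token format and values below are invented for illustration, not the actual CATMuS encoding.

```python
def with_context(metadata, text):
    # Prepend one pseudo-token per metadata field to the line's text.
    prefix = [f"<{k}:{v}>" for k, v in metadata.items()]
    return " ".join(prefix + [text])

line = with_context(
    {"century": "13", "script": "textualis", "language": "latin"},
    "in principio erat verbum",
)
print(line)
```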
As an extension of the CATMuS model, we developed CATMuS-print, a generic, multilingual, and diachronic model designed to enhance OCR for printed texts 50. Recognising the vast and diverse collections in digital libraries, such as Gallica, which houses over 860,000 books and 20,000 periodicals spanning multiple centuries and languages, we identified the need for an OCR model capable of effectively processing both historical and contemporary prints across various languages. We evaluated multiple architectures to determine their efficacy in terms of Character Error Rate (CER) and inference time. Following previous work in the team, and using a model derived from CATMuS-print, we carried out experiments to better understand the 17th century's pivotal role in establishing the standardised French orthographic norms that largely persist today 51. Such studies face two major challenges: the necessity for quantitative approaches to address micro-level orthographic changes and the scarcity of suitable corpora, as most available texts have been modernised by editors. To overcome these challenges, we proposed the creation of a new corpus and the development of tools for extracting and analysing orthographic variants. By comparing texts extracted via OCR—hence the need for the CATMuS-print-derived model—with versions automatically aligned to contemporary French spelling, we were able to identify and categorise variant zones, facilitating a quantitative study of orthographic changes during the 17th century.
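For reference, the Character Error Rate used to compare these architectures is the Levenshtein edit distance between hypothesis and reference, divided by the reference length. A minimal pure-Python sketch, with invented example strings:

```python
def cer(reference, hypothesis):
    # Levenshtein distance via dynamic programming (two-row variant).
    m, n = len(reference), len(hypothesis)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # substitution
        prev = cur
    return prev[n] / m

print(cer("lettres", "lettres"))  # 0.0
print(cer("lettres", "letres"))   # one deletion out of 7 characters
```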
In the context of the COLaF DEFI and our ongoing work on improving the segmentation of documents, with digital proto-edition in mind as a target application, we continued our work on LADaS, our open-access dataset aimed at semantic layout analysis and document reconstruction. We published a new version of the corpus, LADaS 2.0 101. It builds on the SegmOnto ontology 82, also published in 2024, providing detailed hierarchical and semantic annotations for 7,254 pages from documents dated 1600–2024. It includes both digitised and born-digital materials across diverse document types (magazines, papers from the sciences and humanities, PhD theses, monographs, plays, administrative reports, etc.), sorted into modular subsets, thereby addressing varying layout complexities and historical changes in document structure. In this context, the discussions with the Persée digital academic library, started in late 2023, are still ongoing and have already led to concrete outcomes. For instance, LADaS 2.0 includes data provided by Persée, and they are interested in working with us to use models trained on LADaS at an industrial scale. Collecting more data from Persée could lead to the production of an academic dataset of more than 1 million documents (academic papers, monographs, etc.) in a standardised and clean format.
Finally, since June 2023, we have been involved in a collaboration with the CARAMBA team in Nancy and the Université de Picardie on the automatic HTR-based acquisition of encrypted historical documents from the 16th century. In 2024, this informal collaboration became an Inria “Action Exploratoire” named BackInTime. We are developing a new platform based on eScriptorium to facilitate the transcription of ad hoc, non-standard scripts, which the CARAMBA team then post-processes for decryption.
8.10 Evaluation of LLM-Generated Texts
Participants: Chloé Clavel, Yanzhu Guo, Cyril Chhun.
8.10.1 Story Generation Evaluation
In the context of Cyril Chhun 's thesis, co-supervised by Chloé Clavel and Fabian Suchanek (Telecom-Paris), we examine automatic story evaluation. Storytelling is an integral part of human experience and plays a crucial role in social interactions. Automatic Story Evaluation (ASE) and Generation (ASG) could therefore benefit society in multiple ways, but they are challenging tasks that require high-level human abilities such as creativity, reasoning, and deep understanding. Meanwhile, LLMs now achieve state-of-the-art performance on many NLP tasks. In 2024, we studied whether LLMs can be used as substitutes for human annotators for ASE. We performed an extensive analysis of the correlations between LLM ratings, other automatic measures, and human annotations, and we explored the influence of prompting on the results and the explainability of LLM behaviour 25. Most notably, we found that LLMs outperform current automatic measures for system-level evaluation but still struggle at providing satisfactory explanations for their answers.
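System-level comparison of this kind boils down to correlating two rankings of the generation systems. A hand-rolled Spearman correlation over invented per-system mean ratings sketches the idea (the actual measures and scores in the study differ):

```python
def ranks(xs):
    # Rank positions (1-based); assumes no ties.
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for rank, i in enumerate(order):
        r[i] = rank + 1.0
    return r

def spearman(xs, ys):
    # Spearman rho via the no-ties closed form.
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

human = [3.1, 2.4, 4.0, 3.6]  # mean human rating per ASG system (invented)
llm = [3.0, 2.7, 4.2, 3.9]    # mean LLM-as-judge rating per system (invented)
print(spearman(human, llm))   # 1.0: identical system rankings
```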
8.10.2 Linguistic Diversity
In the context of Yanzhu Guo 's PhD thesis, under the supervision of Chloé Clavel , we examine the linguistic diversity of LLM outputs. While LLM evaluation has gained significant attention in recent years, most research has focused on assessing the task-solving abilities of models. However, this focus often neglects whether machine-generated language matches the human level of diversity, in terms of vocabulary choice, syntactic construction, and expression of meaning, raising questions about whether the fundamentals of language generation have been properly addressed. Our study emphasises the importance of preserving the linguistic richness of human language, given the concerning surge in online content produced or aided by LLMs.
In the first part of our study, we proposed a comprehensive framework for evaluating LLMs from various linguistic diversity perspectives including lexical, syntactic, and semantic dimensions. Using this framework, we benchmarked six state-of-the-art LLMs across three diversity dimensions on five NLG tasks, and conducted an in-depth case study for syntactic diversity. We also examined how various development and deployment strategies for LLMs influence the linguistic diversity of their outputs. Our findings reveal a significant gap in replicating the linguistic richness of human language, particularly in more creative tasks, despite the impressive capabilities of LLMs in generating coherent and contextually appropriate text. Specifically, we observed that factors such as model scale, training data volume, and finetuning techniques critically influence diversity metrics.
In the second part of our study, we investigated the consequences of training LLMs on synthetic data generated by their predecessors. We conducted recursive finetuning experiments across three NLG tasks 55. Our findings reveal a consistent decrease in the diversity of the model outputs through successive iterations, which is especially marked for tasks demanding high levels of creativity. This trend underscores the potential risks of training language models on synthetic text, highlighting the need for careful consideration of the long-term effects of such training approaches on the linguistic capabilities of language models.
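One simple lexical-diversity measure of the kind used in such analyses is distinct-1, the ratio of unique to total tokens. The toy generations below (invented and deliberately exaggerated) illustrate how it can collapse across finetuning rounds:

```python
def distinct_1(texts):
    # Unique tokens over total tokens across a set of generations.
    tokens = [t for text in texts for t in text.lower().split()]
    return len(set(tokens)) / len(tokens)

gen_round_0 = ["the storm scattered the fleet", "a quiet harbour at dawn"]
gen_round_3 = ["the storm hit the fleet", "the storm hit the harbour"]

print(distinct_1(gen_round_0))  # higher lexical diversity
print(distinct_1(gen_round_3))  # diversity collapses over iterations
```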
8.11 Dialogue Modelling
Participants: Ha Anh Ngo, Nicolas Rollet, Aina Garí Soler, Chenwei Wan, Chloé Clavel.
8.11.1 Repair modelling
In the scope of Ha Anh Ngo's PhD thesis, under the supervision of Chloé Clavel, Catherine Pelachaud (Sorbonne Université, ISIR) and Nicolas Rollet (Telecom-Paris, on detachment in ALMAnaCH), we investigated conversational repair mechanisms in human dialogue. Repair in conversation refers to the process by which interlocutors identify, address, and resolve issues such as misunderstandings, mishearings, or ambiguous expressions; this process helps maintain mutual understanding between participants and ensures that the conversation proceeds smoothly.
In the first part of our study, in collaboration with Professor Dirk Heylen from the University of Twente (Netherlands), we have been exploring the multimodal patterns (linguistic and prosodic cues) involved when a speaker initiates and resolves a repair request. Working with a task-oriented dialogue corpus in Dutch, we applied an automatic coreference resolver and analysed patterns of self-repetition and other-repetition to identify linguistic features that distinguish turns where the addressee initiates a repair request from those where they do not 66. Our findings revealed distinct linguistic patterns, including coreference chains, repetition, and grammatical structures, that differentiate repair initiation requests from regular dialogue turns, as well as among the three types of repair initiation. These results established the foundation for modelling the other-initiated repair sequence, which could significantly contribute to developing a computational model for recognising repair initiation and generating appropriate repair responses.
For future work, we are developing a computational model to detect other-initiated repair requests, incorporating multimodal cues (text, audio, and video) 65. The next step is to enable real-time online detection. Additionally, to address potential biases and limitations of the current corpus, we also plan to annotate a new corpus in French.
8.11.2 Word Meaning Negotiation
As part of Aina Garí Soler 's postdoctoral research, under the supervision of Chloé Clavel and Matthieu Labeau (Telecom-Paris), we have been exploring word usage dynamics in dialogue. In collaboration with researchers Jenny Myrendal and Staffan Larsson from Gothenburg University (Sweden), we have developed the NeWMe (Negotiating Word Meaning) corpus, a collection of annotated Word Meaning Negotiation (WMN) sequences. WMNs are parts of conversations where a speaker raises a clarification request or an objection about the usage of a specific word, prompting a metalinguistic discussion aimed at clarifying or refining the word's meaning. The corpus contains various kinds of interactions (written, oral, dyadic, and multi-party) and is annotated with WMN sequences, their components, and related phenomena. The effort, which included creating annotation guidelines and establishing inter-annotator agreement with recruited annotators, led to a submission to the LRE (Language Resources and Evaluation) journal in early 2025. This corpus offers valuable insights into conversational repair, disagreement, and semantic conflicts. The annotations can be used to identify potentially controversial or problematic terms and to investigate the mechanisms of conceptual or lexico-semantic alignment.
WMNs have three key components: the trigger, which is a problematic word usage; the indicator, an utterance signaling a problem with a word's meaning; and the negotiation – one or more conversational turns where participants address the misunderstanding or disagreement. To better understand the mechanisms by which a word can be misunderstood or disputed, we paid particular attention to the trigger component. We conducted an extensive literature review, identifying linguistic factors that contribute to potentially problematic word usages, as well as computational methods and data that facilitate their detection. This review connects works from various disciplines and highlights areas for future research.
Together, these two works will serve as foundations for our future research on lexico-semantic alignment, the detection of problematic word usages, and the expansion of WMN annotations to more languages and conversational contexts.
8.11.3 Socio-emotional Response Generation
In the context of Chenwei Wan's research internship, co-supervised by Chloé Clavel and Matthieu Labeau (Telecom-Paris), we examine socio-emotional response generation in dialogue systems. Recently, with advancements in LLMs, end-to-end dialogue agents without explicit strategy prediction steps have become prevalent. However, implicit strategy planning lacks transparency, and recent studies show that LLMs' inherent preference bias towards certain socio-emotional strategies hinders the delivery of high-quality emotional support. To address this challenge, we proposed to decouple strategy prediction from language generation, and introduced a novel dialogue strategy prediction framework, EmoDynamiX, which models the discourse dynamics between user fine-grained emotions and system strategies using a heterogeneous graph for better performance and transparency 26. Experimental results showed that EmoDynamiX outperforms previous state-of-the-art methods by a significant margin (better proficiency and lower preference bias). Our approach also exhibits better transparency by allowing backtracing of decision making.
8.12 Social Computing
Participants: Célia Nouri, Gabrielle Lebellier, Benoît Sagot, Chloé Clavel.
8.12.1 Hate Speech Detection in Conversations
As part of Célia Nouri's research internship, under the supervision of Chloé Clavel and Jean-Philippe Cointet (Medialab, Sciences Po), we investigated the use of graph neural networks to enhance abusive language detection in social media conversations. Traditional abusive language detection (ALD) methods often classify comments in isolation, neglecting crucial conversational context. While recent efforts have attempted to incorporate conversational context into NLP classifiers, they often employ a limited scope, primarily focusing on the immediately preceding comment and using flattened representations of conversations. To overcome these shortcomings, we proposed a novel framework that models social media conversations as structured graphs, where nodes correspond to individual comments and edges represent reply relationships.
Our approach exploits Graph Neural Networks (GNNs), specifically Graph Attention Networks (GATs), to effectively capture conversational dependencies and enhance classification accuracy. We systematically compared different graph construction strategies to model Reddit conversations and introduced an affordance-based graph-trimming strategy aligned with Reddit's default rendering to enhance computational efficiency while preserving relevant context. Experiments on the Contextual Abuse Dataset (CAD) confirm that this approach yields superior performance in ALD. We also examined the impact of varying context depth by adjusting the number of GAT layers. Our results show that incorporating up to three-hop neighbours provides optimal performance, underscoring the importance of considering extended conversational structures beyond a single preceding comment. Additionally, our graph-based model outperforms both context-agnostic and traditional context-aware baselines, particularly in cases requiring contextual disambiguation. These findings establish GNNs as a robust framework for improving automated moderation of abusive content in online discussions.
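The graph-trimming step can be sketched as a k-hop neighbourhood extraction over the reply graph; the thread structure below is invented, and the actual pipeline feeds such neighbourhoods to a GAT rather than printing them.

```python
from collections import deque

def k_hop(edges, target, k):
    # Undirected k-hop neighbourhood of `target` via breadth-first search.
    adj = {}
    for a, b in edges:  # (parent_comment, reply) pairs
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    seen, frontier = {target}, deque([(target, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return seen

replies = [("post", "c1"), ("c1", "c2"), ("c2", "c3"), ("c3", "c4")]
print(sorted(k_hop(replies, "c3", 3)))  # context kept for classifying c3
```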
Future research directions include extending the methodology to other social media platforms, integrating multimodal cues such as images and videos, and addressing annotation biases in conversational toxicity datasets. The proposed approach represents a significant advancement in context-aware hate speech detection and offers scalable solutions for moderating online discourse.
8.12.2 Fallacy Detection
In the context of Chadi Helwé's thesis, supervised by Chloé Clavel and Fabian Suchanek (Telecom-Paris), we introduced MAFALDA, a benchmark for fallacy classification that merges and unifies previous fallacy datasets 56. A fallacy is an erroneous or invalid way of reasoning. Consider, e.g., the argument “You must either support my presidential candidacy or be against America!”. This argument is a false dilemma fallacy: it wrongly assumes no other alternatives. The dataset integrates four pre-existing datasets into a cohesive whole, achieved through developing a new, comprehensive taxonomy that aligns, refines, and unifies existing classifications of fallacies. We further provided a manual annotation of a part of the dataset together with manual explanations for each annotation. We proposed a new annotation scheme tailored for subjective NLP tasks, and a new evaluation method designed to handle subjectivity. We then evaluated several language models in a zero-shot setting, as well as human performance, on MAFALDA to assess their capability to detect and classify fallacies. While coarse-grained classification (presence of a fallacy) shows good results, classification into fine-grained fallacy categories remains largely out of reach for LLMs in zero-shot settings. We hope that our benchmark will enable researchers to improve the results of this challenging task.
The PhD thesis recently started (November 2024) by Cecilia Graiff , supervised by Benoît Sagot , Chloé Clavel and Emiliano Grossmann, in collaboration with the “Action Exploratoire” SaLM, aims at extending this line of research by investigating argument structures in political discourse in a cross-lingual and cross-cultural perspective.
8.12.3 Disentangling Gender Information
In the context of Gabrielle Le Bellier's research internship, supervised by Benoît Sagot and Chloé Clavel, we investigated bias mitigation in language models. Recent models have proved to be biased towards some demographics and to display stereotypes. For instance, models identify nurses as women far more often than reality warrants. This raises ethical issues about the ideas conveyed by the models and their consequences for society. In this work, we focused on binary gender bias in occupation classification. Using a labelled dataset of biographies written in English (BiasBios), our aim was to build a model that is fair across genders. We used disentanglement to isolate gender information from occupation information and performed classification on top of the disentangled occupation vector. We explored several pre-processing and disentanglement-based in-processing bias mitigation techniques. We evaluated disentanglement and fairness through metrics on embeddings, mutual information, and the True Positive Rate on the downstream classification task. We found that good disentanglement between occupation and gender does not necessarily lead to fairer occupation classification. Moreover, among the techniques explored, a multi-estimator setting seemed to enable different combinations to balance information loss and fairness, and estimators based on contrastive losses showed promising results.
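The True Positive Rate comparison underlying this kind of fairness evaluation can be sketched as follows, with invented (gender, gold occupation, predicted occupation) records; a fair classifier drives the per-occupation gap between groups towards zero.

```python
def tpr(records, group, occupation):
    # True Positive Rate for one group on one occupation: among members
    # of `group` whose gold label is `occupation`, the fraction the
    # classifier predicted correctly.
    relevant = [r for r in records if r[0] == group and r[1] == occupation]
    hits = sum(1 for r in relevant if r[2] == occupation)
    return hits / len(relevant)

data = [
    ("F", "nurse", "nurse"), ("F", "nurse", "nurse"),
    ("M", "nurse", "nurse"), ("M", "nurse", "teacher"),
]
gap = tpr(data, "F", "nurse") - tpr(data, "M", "nurse")
print(gap)
```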
8.13 Multimodal Approaches to Human-agent and Human-human Interaction
Participants: Justine Cassell, Sinem Demirkan, Hasan Onur Keles, Cindy Evelyn De Araujo, Reem Al Najjar, Sophie Etling, Gabrielle Alimi, Barokshana Baskaran, Zofia Milczarek, Biswesh Mohapatra, Marius Le Chapelier, Hao Wang, Emer Gilmartin, Anna Celina Desiderio, Qinhyue Xu, Mira Lee, Irène Metz, Diana Bakina, Mathilde Deletombe, Théo Charlot.
This research direction advances understanding of human-like communication by studying how interpersonal dynamics—like synchrony, rapport, and personality expression—affect collaboration and interaction, both in humans (including children) and between humans and AI-driven conversational agents.
8.13.1 Using Interbrain Synchrony and Rapport Building to Understand Productive Peer Collaboration
As a part of our large-scale approach to studying collaboration, our interdisciplinary team, composed of researchers, research engineers and Master's students in neuroscience, linguistics, education and computer science, designed an experiment that examines the nature of collaboration between pairs of schoolchildren of the same age. We are particularly interested in the relationship between interpersonal rapport, as displayed by children's language and nonverbal behaviour, and the success of collaboration, as demonstrated by the learning gains of pairs of children collaborating on 4 scientific tasks, and by interbrain synchrony (IBS), a similarity in the brain waves of two people while they collaborate. Our team collected multimodal data (verbal, non-verbal and neural) from 22 dyads of adult participants in a longitudinal setting over 3 weeks using fNIRS hyperscanning. This study moves us towards our goal of better understanding collaboration in children: we are the first lab to carry out research on IBS between children in a peer-learning setting using fNIRS (functional near-infrared spectroscopy), and to associate it with, on the one hand, success in collaboration and, on the other, interpersonal rapport.
8.13.2 Conversational Grounding in Dialogue Systems
The team is dedicated to developing more socially capable conversational agents, with a focus on enhancing their ability to engage in natural and effective interactions. In this direction, Marius Le Chapelier is leading the Son of Sara project, which aims to create a new embodied conversational agent powered by an LLM. Throughout 2024, he has been implementing various modules to equip the agent with essential capabilities such as comprehension, reasoning, and response generation. These modules include Automatic Speech Recognition (ASR), Natural Language Generation (NLG), Text-to-Speech (TTS), and a Dialogue Manager, among others. In parallel, to further improve the agent's social capacities, the team is conducting in-depth research on LLMs' conversational abilities and social behaviours. Within this context, Biswesh Mohapatra's PhD thesis, supervised by Justine Cassell and Laurent Romary, focuses on the understanding and evaluation of conversational grounding in dialogue systems, particularly in LLMs and LLM-based agents, and on developing new approaches to enable LLMs to engage in conversational grounding. Mutual understanding plays a critical role in effective communication, and we are working to address the notable gap in both theoretical frameworks and empirical evaluations of grounding capabilities in modern dialogue agents. In 2024, we annotated dialogue corpora using Grounding Acts, Grounding Units and a measure of their degree of grounding, together with a baseline classification model 62. We also conducted large-scale evaluations of various LLMs to assess their performance in various aspects of conversational grounding, identifying correlations between model scale, pre-training data size, and grounding performance 63.
Increasingly, our explorations have relied on multimodal representations, as conversational grounding – where interlocutors monitor and ensure that their conversational partners are following – can take the form of language, but also of prosody and visual phenomena.
8.13.3 Exploring Interpersonality: Multimodal Personality Cues in Embodied Conversational Agents
In 2024, in the context of our collaboration with KETI, we explored the most impactful cues to generate in embodied conversational agents (ECAs) in order to convey particular personality traits to the people interacting with the ECA. These cues include speech style (for example, speech rate and pitch excursions), linguistic content (for example, the number of emotion words), and non-verbal behaviours (for example, mutual eye gaze and head nods). Breaking with previous work on personality in agents, the team developed the idea of “interpersonality”, which captures the fact that the manifestation of personality (that is, how it is displayed multimodally) is affected by the personality of the interlocutor. A final report on interpersonality was subsequently accepted for publication in The Handbook of Personality Psychology, to appear in 2026.
9 Bilateral contracts and grants with industry
Participants: Benoît Sagot, Rachel Bawden, Djamé Seddah, Éric Villemonte de La Clergerie, Lionel Tadonfouet Tadjou, Tu Anh Nguyen, Paul-Ambroise Duquenne, You Zuo, Chloé Clavel, Justine Cassell.
9.1 Bilateral contracts with industry
Verbatim Analysis
Participants: Benoît Sagot.
-
Partner type:
Inria start-up
-
Leader for ALMAnaCH:
Benoît Sagot .
-
Dates:
1 Jul. 2009–31 Jan. 2024
-
Description:
Verbatim Analysis is an Inria start-up co-created in 2009 by Benoît Sagot. It uses some of ALMAnaCH's free NLP software (SxPipe) as well as a data mining solution co-developed by Benoît Sagot, VERA, for processing employee surveys with a focus on answers to open-ended questions. Its activities have been progressively taken over by opensquare (see below), and the company was discontinued at the end of January 2024.
opensquare
Participants: Benoît Sagot.
-
Partner type:
Inria start-up
-
Leader for ALMAnaCH:
Benoît Sagot .
-
Dates:
1 Dec 2016–present
-
Description:
Opensquare was co-created in December 2016 by Benoît Sagot with two senior specialists in human resources (HR) consulting. It is dedicated to designing, carrying out and analysing employee surveys, as well as to HR consulting based on their results. It uses a new employee survey analysis tool, enqi, which is still under development. As this tool is co-owned by opensquare and Inria, both parties have signed a software licence agreement in exchange for a yearly fee paid by opensquare to Inria based on its turnover. Benoît Sagot currently contributes to opensquare under the “Concours scientifique” scheme.
META
Participants: Benoît Sagot, Tú Anh Nguyên, Paul-Ambroise Duquenne, Pierre Chambon.
-
Partner type:
Company
-
Leader for ALMAnaCH:
Benoît Sagot .
-
Dates:
1 Jan 2018–present
-
Funding received:
€331,260
-
Description:
Our collaboration with META AI is centered around the joint supervision of CIFRE PhD theses. A first collaboration (Louis Martin's PhD thesis), co-supervised by Benoît Sagot, Éric de La Clergerie and Antoine Bordes (META), was dedicated to text simplification (“français Facile À Lire et à Comprendre”, FALC), in collaboration with UNAPEI. This collaboration was part of a larger initiative called Cap'FALC involving (at least) these three partners as well as the relevant ministries. Louis defended his PhD in 2022. Two other joint PhD theses started in 2021 and were defended in 2024: Paul-Ambroise Duquenne's PhD, co-supervised by Benoît Sagot and Holger Schwenk (META), was dedicated to sentence embeddings for massively multilingual speech and text processing, and Tú Anh Nguyen's PhD, co-supervised by Benoît Sagot and Emmanuel Dupoux (META), was dedicated to the unsupervised learning of linguistic representations from speech data, with a focus on textless dialogue modelling and speech generation. Finally, a new joint PhD thesis has started: Pierre Chambon's PhD is dedicated to code generation with language models.
In addition, Benoît Sagot received a $50,000 gift grant from META AI in the context of the release of the Llama 3 series of models.
Qatent
Participants: You Zuo, Benoît Sagot, Éric de La Clergerie.
-
Partner type:
Inria start-up
-
Leader for ALMAnaCH:
Benoît Sagot .
-
Dates:
1 Jan 2021–present
-
Description:
Qatent is a startup supported by the Inria Startup Studio and ALMAnaCH that applies NLP to help write better patents faster. Its creation followed the 18-month secondment (“détachement”) at ALMAnaCH of Kim Gerdes, one of the three founders of the company, and benefitted from ALMAnaCH's scientific expertise and the Inria Startup Studio's counselling and financial support. It also led to You Zuo's CIFRE PhD thesis, co-supervised by Benoît Sagot and Kim Gerdes (now at qatent), which continues the collaboration. Note that in 2024, qatent was acquired by Questel, a global leader in intellectual property (IP) management and technology services, without significant impact on You Zuo's PhD.
10 Partnerships and cooperations
10.1 International initiatives
10.1.1 Participation in other International Programs
Interpersonality
Participants: Justine Cassell, Emer Gilmartin.
-
Duration:
1 Jan 2024–present.
-
PI:
Justine Cassell .
-
Coordinator for ALMAnaCH:
Justine Cassell .
-
Partner:
- KETI (South Korea)
-
Funding:
€750
-
Summary:
Computational models of personality have largely ignored the subtleties of how personality is expressed, assuming that individuals (and therefore embodied conversational agents) have one fixed personality (such as introvert or extrovert), despite psychological evidence demonstrating that the expression of one's personality changes as a function of whom one is speaking to, what one is speaking about, the personalities of the other participants in a conversation, one's culture, one's age, and so on. A new model is needed that takes this malleability of personality into account, so that conversational agents can likewise have malleable representations of personality and be maximally effective in working with people.
Informal initiative Universal Dependencies Project
Participants: Djamé Seddah, Benoît Sagot, Arij Riabi.
-
Duration:
1 Jan 2017–present.
-
PIs:
Joakim Nivre and Christopher Manning.
-
Coordinator for ALMAnaCH:
Djamé Seddah .
-
Partners:
- LLF
- SEMMAGRAME
- Uppsala University
- Stanford University
-
Summary:
The Universal Dependencies project is an open community effort with over 300 contributors producing nearly 200 treebanks in over 100 languages. Universal Dependencies is a framework for the consistent annotation of grammar (parts of speech, morphological features, and syntactic dependencies) across different human languages. With a release every six months since 2017, the UD framework is the de facto standard for representing syntactic structure and the basis for most, if not all, supervised neural parsers.
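To illustrate the annotation scheme, the sketch below parses one sentence in the CoNLL-U format used by UD treebanks, where each token line carries an ID, form, lemma, universal POS tag, morphological features, and a head/deprel pair encoding the syntactic dependency. The example sentence and the minimal parser are a simplified sketch, not an official UD tool.

```python
# Minimal sketch of reading one CoNLL-U sentence (the tab-separated format
# used by Universal Dependencies treebanks). Illustrative only.
CONLLU = """\
# text = Elle dort .
1\tElle\telle\tPRON\t_\tGender=Fem|Number=Sing\t2\tnsubj\t_\t_
2\tdort\tdormir\tVERB\t_\tMood=Ind|Tense=Pres\t0\troot\t_\t_
3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_
"""

def parse_conllu(block: str):
    """Parse one CoNLL-U sentence into a list of token dictionaries."""
    tokens = []
    for line in block.splitlines():
        if not line or line.startswith("#"):
            continue  # skip comment and blank lines
        cols = line.split("\t")
        tokens.append({
            "id": int(cols[0]),
            "form": cols[1],
            "lemma": cols[2],
            "upos": cols[3],      # universal POS tag
            "feats": dict(f.split("=") for f in cols[5].split("|")) if cols[5] != "_" else {},
            "head": int(cols[6]),  # 0 marks the syntactic root
            "deprel": cols[7],     # universal dependency relation
        })
    return tokens

tokens = parse_conllu(CONLLU)
root = next(t for t in tokens if t["head"] == 0)
print(root["form"], root["upos"])  # dort VERB
```

In a real setting one would load such sentences from a released UD treebank file rather than an inline string.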
10.2 International research visitors
10.2.1 Visits of international scientists
Other international visits to the team
- Marco Bronzini visited the team from April to June 2024 from the Università degli Studi di Trento, in the context of an agreement between the Collège de France and the University of Trento. Working on the integration of large language models and knowledge graphs, he was officially sponsored by Xavier Leroy and informally by Benoît Sagot, who hosted him in the ALMAnaCH project-team.
- Marine Carpuat, from the University of Maryland, has been visiting the team since September 2024 during her sabbatical year. Her work focuses on human-centered NLP, machine translation and cross-lingual communication.
10.2.2 Visits to international teams
Research stays abroad
Nathan Godey
-
Visited institution:
University of Edinburgh
-
Country:
United Kingdom
-
Dates:
01/04/2024–30/06/2024
-
Context of the visit:
Research visit during his PhD, in collaboration with Edoardo Ponti, on advanced optimisation techniques for large language models (KV-cache compression).
-
Mobility program/type of mobility:
research stay
Thibault Clérice
-
Visited institution:
Federico II, Naples
-
Country:
Italy
-
Dates:
20/09/2024–20/12/2024
-
Mobility program/type of mobility:
Invited Professor
Justine Cassell
-
Visited institution:
University of California at Berkeley
-
Country:
USA
-
Dates:
01/06/2024–01/07/2024
-
Context of the visit:
Co-organiser of the Simons Institute Summer workshop on the interaction between linguistics, psychology, neuroscience and AI in the study of social interaction
-
Visited institution:
Carnegie Mellon University, USA
-
Country:
USA
-
Dates:
01/08/2024–01/09/2024
-
Context of the visit:
Interaction with colleagues in the Language Technologies Institute.
-
Mobility program/type of mobility:
Research stay
10.3 European initiatives
10.3.1 Horizon Europe
ATRIUM
ATRIUM project on cordis.europa.eu
Participants: Thibault Clérice, Alix Chagué.
-
Title:
Advancing FronTier Research In the Arts and hUManities
-
Duration:
1 Jan 2024–31 Dec 2027.
-
Partners:
- Institut National de Recherche en Informatique Et Automatique (Inria), France
- Archeologický ústav AV ČR, Praha v. v. i. (ARUP), Czechia
- Ludwig-Maximilians-Universität München (LMU München), Germany
- Foxcub, France
- Instytut Badań Literackich Polskiej Akademii Nauk (IBL PAN), Poland
- The University of Sheffield (USFD), United Kingdom
- Open Access in the European Area through Scholarly Communication (OPERAS), Belgium
- Stichting Radboud Universiteit, Netherlands
- Centar za digitalne humanističke nauke (Belgrade Center for Digital Humanities), Serbia
- University Of South Wales Prifysgol de Cymru (USW), United Kingdom
- Ariadne Research Infrastructure, Belgium
- Athina-Erevnitiko Kentro Kainotomias Stis Technologies Tis Pliroforias, Ton Epikoinonion Kai Tis Gnosis (Athena - Research And Innovation Center), Greece
- Idryma Technologias Kai Erevnas (Foundation For Research And Technologyhellas), Greece
- Instytut Chemii Bioorganicznej Polskiej Akademii Nauk, Poland
- Digital Research Infrastructure for the Arts and Humanities (DARIAH ERIC), France
- Laboratório Nacional De Engenharia Civil (LNEC), Portugal
- Archeologický ústav AV ČR, Brno v. v. i., Czechia
- University of York, United Kingdom
- Consiglio Nazionale delle Ricerche (CNR), Italy
- PIN Soc. Cons. A R.L. - Servizi Didattici e Scientifici per l'Universita' di Firenze (PIN SCRL), Italy
- Prisma Cultura S.R.L., Italy
- Clarin Eric (Common Language Resources and Technology Infrastructure as a European Research Infrastructure Consortium), Netherlands
- Université de Tours, France
- Univerzita Karlova (CU), Czechia
- Athens University of Economics and Business - Research Center (AUEB-RC), Greece
- Österreichische Akademie der Wissenschaften (OEAW), Austria
- The Cyprus Institute, Cyprus
- Göteborgs universitet (UGOT), Sweden
- Net7 S.R.L, Italy
-
Inria contact:
Thibault Clérice
-
Coordinator:
Laurent Romary
-
Funding:
€7,000,000 total, €300,000 for ALMAnaCH
-
Summary:
Advancing FronTier Research In the Arts and hUManities (ATRIUM) will exploit and strengthen complementarities between leading European infrastructures: DARIAH, ARIADNE, CLARIN and OPERAS in order to provide vastly improved access to a rich portfolio of state-of-the-art services available to researchers across countries, languages, domains and media, building on a shared understanding and interoperability principles established in the SSHOC cluster project and other previous collaborations.
10.3.2 H2020 projects
EHRI-3 - “European Holocaust Research Infrastructure”
Participants: Hugo Scheithauer, Floriane Chiffoleau, Alix Chagué, Lucas Terriel, Sarah Bénière.
EHRI-3 project on cordis.europa.eu
-
Duration:
1 May 2015–31 Aug 2024.
-
PI:
Conny Kristel (NIOD-KNAW, NL).
-
Coordinator for ALMAnaCH:
Laurent Romary .
-
Partners:
- Archives Générales du Royaume et Archives de l'État dans les provinces (Belgium)
- Aristotelio Panepistimio Thessalonikis (Greece)
- Dokumentačné Stredisko Holokaustu Občianske Združenie (Slovakia)
- Fondazione Centro Di Documentazione Ebraica Contemporanea -CDEC - ONLUS (Italy)
- International Tracing Service (Germany)
- Kazerne Dossin Memoriaal, Museum Endocumentatiecentrum Over Holocausten Mensenrechten (Belgium)
- Koninklijke Nederlandse Akademie Van Wetenschappen - KNAW (Netherlands)
- Magyarorszagi Zsido Hitkozsegek Szovetsege Tarsadalmi Szervezet (Hungary)
- Masarykův ústav a Archiv AV ČR, v. v. i. (Czech Republic)
- Memorial de La Shoah (France)
- Stiftung Zur Wissenschaftlichen Erforschung Der Zeitgeschichte - Institut Fur Zeitgeschichte IFZ (Germany)
- Stowarzyszenie Centrum Badan Nad Zaglada Zydow (Poland)
- The United States Holocaust Memorial Museum (United States)
- The Wiener Holocaust Library (UK)
- Vilniaus Gaono žydų istorijos muziejus (Lithuania)
- Wiener Wiesenthal Institut Fur Holocaust-Studien - VWI (Austria)
- Yad Vashem The Holocaust Martyrs And Heroes Remembrance Authority (Israel)
- Židovské muzeum v Praze (Czech Republic)
- Żydowski Instytut Historyczny im. Emanuela Ringelbluma (Poland)
-
Summary:
The European Holocaust Research Infrastructure's (EHRI) mission is to overcome the widespread dispersal of Holocaust sources. EHRI is an advanced community comprising 23 partners from 17 countries across Europe, Israel and the United States. It is an inter-disciplinary community spanning Holocaust research, archival sciences and the digital humanities. In two previous Integrating Activities, EHRI integrated an unprecedented amount of information about dispersed Holocaust sources in an online Portal, developed tools to contextualise, analyse and interpret such sources, and provided new impetus for inter-disciplinary and trans-national research. EHRI's past achievements have been recognised, not least by the European Strategy Forum on Research Infrastructures (ESFRI), which adopted EHRI onto its 2018 Roadmap.
H2020 CounteR
Participants: Djamé Seddah, Arij Riabi, Wissam Antoun, Virginie Mouilleron, Menel Mahamdi, José Carlos Rosales Núñez, Galo Castillo Lopez.
-
Duration:
1 May 2021–30 Apr 2024.
-
PI:
Catalin Truffin.
-
Coordinator for ALMAnaCH:
Djamé Seddah .
-
Partners:
- Assist Software SRL (Romania)
- Insikt Intelligence S.L. (Spain)
- IMAGGA Technologies LTD (Bulgaria)
- Icon Studios LTD (Malta)
- Consorzio Interuniversitario Nazionale per l'Informatica (Italy)
- Eötvös Loránd Tudományegyetem (Hungary)
- Universita Cattolica del Sacro Cuore (Italy)
- Malta Information Technology Law Association (Malta)
- European Institute Foundation (Bulgaria)
- Association Militants des Savoirs (France)
- Eticas Research and Consulting S.L. (Spain)
- Elliniki Etairia Tilepikoinonion kai Tilematikon Efarmogon A.E. (Greece)
- Ministério da Justiça (Portugal)
- Hochschule für den Öffentlichen Dienst in Bayern (Germany)
- Iekslietu Ministrijas Valsts Policija [State Police Of The Ministry Of Interior] (Latvia)
- Serviciul de Protectie si Paza (Romania)
- Glavna Direktsia Natsionalna Politsia (Bulgaria)
- Ministère de l'Intérieur (France)
-
Funding:
€6,994,813 total, €684,000 for ALMAnaCH
-
Summary:
In order to support the fight against radicalisation and thus prevent future terrorist attacks, the CounteR project brings data from diverse sources into an analysis and early-alert platform for data mining and the prediction of critical areas (e.g. communities), aiming to be a frontline community policing tool that looks at the community and its related risk factors rather than targeting and monitoring individuals. The system will combine state-of-the-art NLP technologies with expert knowledge in the psychology of radicalisation processes to provide a complete solution for law enforcement authorities to understand the when, where and why of radicalisation in the community.
10.4 National initiatives
ANR MaTOS
Participants: Rachel Bawden, Éric de La Clergerie, Nicolas Dahan, Ziqian Peng, Panagiotis Tsolakis.
-
Duration:
1 Jan 2023–31 Dec 2026.
-
PI:
François Yvon.
-
Coordinator for ALMAnaCH:
Rachel Bawden .
-
Partners:
- Sorbonne-Université
- Université de Paris
- CNRS
-
Funding:
€782,529 total, €280,520 for ALMAnaCH
-
Summary:
The MaTOS (Machine Translation for Open Science) project aims to develop new methods for the machine translation (MT) of complete scientific documents, as well as automatic metrics to evaluate the quality of these translations. Our main application target is the translation of scientific articles between French and English, where linguistic resources can be exploited to obtain more reliable translations, both for publication purposes and for gisting and text mining. However, efforts to improve MT of complete documents are hampered by the inability of existing automatic metrics to detect weaknesses in the systems and to identify the best ways to remedy them. The MaTOS project aims to address both of these issues.
ANR TraLaLaM
Participants: Rachel Bawden, Benoît Sagot, Malik Marmonier.
-
Duration:
1 Oct 2023–30 Sept 2026.
-
PI:
Josep Crego (Systran by ChapsVision).
-
Coordinator for ALMAnaCH:
Rachel Bawden .
-
Partners:
- Systran by ChapsVision
- CNRS
-
Funding:
€595,348 total, €169,566 for ALMAnaCH
-
Summary:
The aim of TraLaLaM is to explore the use of large language models (LLMs) for machine translation, by asking two main questions: (i) in what scenarios can contextual information be effectively used via prompting? and (ii) for low-resource scenarios (with a focus on dialects and regional languages), can LLMs be effectively trained without any parallel data?
ANR PRME SINNet
Participants: Chloé Clavel.
-
Duration:
1 Mar 2024–1 Oct 2027.
-
PI:
Chloé Clavel .
-
Coordinator for ALMAnaCH:
Chloé Clavel .
-
Funding:
€474,255
-
Summary:
SINNet proposes a paradigm shift to make conversational systems and social robotics more acceptable and trustworthy technologies, even when using deep learning approaches. It focuses on the verbal component of the interaction, targets the agent-user social relationship, and models the behaviours indexing the state of that relationship, thus going beyond the analysis of the user's positive and negative sentiments. This entails developing easy-to-adapt and easy-to-explain neural models able both to analyse the user behaviours that contribute to user-agent co-construction processes (such as those characterising rapport with, trust in, and affiliation with the agent) and to generate agent responses that foster the user-agent social relationship. The SINNet project will establish interdisciplinarity as a core challenge by providing a shared formalism between complex (e.g. psychological or socio-linguistic) theories of social interaction and the underlying formalisms of deep learning and language models.
ANR PRCE REVITALISE
Participants: Chloé Clavel.
-
Duration:
15 Feb 2022–15 Nov 2025.
-
PI:
Magalie Ochs (LIS).
-
Coordinator for ALMAnaCH:
Chloé Clavel .
-
Partners:
- LIS
- Umanis
- ISM
- IMT Atlantique, Telecom-Paris
-
Funding:
€580,124 total, €123,800 for ALMAnaCH
-
Summary:
More than ever, with the increasing use of online video-conferencing solutions in daily professional interactions, public speaking skills are becoming crucial. The aim of this project is to obtain better insights into the best approaches allowing the practice of public speaking skills with technologically mediated tools. To this end, we will investigate different training environments (e.g. without a virtual/real audience) and different training approaches (e.g. modelling-based, feedback-based, simulation-based) to help users acquire, improve, and practise public speaking skills in full autonomy. For this purpose, different research challenges will be tackled to (i) automatically learn, from different corpora, the multimodal cues correlated to the quality of public speaking, (ii) provide pedagogical activities rooted in coaching practice, taking a user-centered approach and (iii) provide a global evaluation of the training session as well as the specific behavioural characteristics to improve.
10.4.1 Competitivity Clusters and Thematic Institutes
3IA PRAIRIE
Participants: Benoît Sagot, Rachel Bawden, Nathan Godey, Lydia Nishimwe, Matthieu Futeral-Peter, Arij Riabi, Wissam Antoun.
-
Duration:
1 Oct 2019–30 Sept 2025.
-
PI:
Isabelle Ryl.
-
Coordinators for ALMAnaCH:
Benoît Sagot , Rachel Bawden and Justine Cassell .
-
Partners:
- Inria
- CNRS
- Institut Pasteur
- PSL
- Université de Paris
- Amazon
- Google DeepMind
- faurecia
- GE Healthcare
- Idemia
- Janssen
- Naver Labs
- Nokia
- Pfizer
- Stellantis
- Valeo
- Vertex
-
Funding:
€20,000,000 total, €592,000 for ALMAnaCH
-
Summary:
The PRAIRIE Institute (PaRis AI Research InstitutE) is one of the four French Institutes of Artificial Intelligence, which were created as part of the national French initiative on AI announced by President Emmanuel Macron on 29 May 2018. PRAIRIE's objective is to become within five years a world leader in AI research and higher education, with an undeniable impact on economy and technology at the French, European and global levels. It brings together academic members (“PRAIRIE chairs”) who excel at research and education in both the core methodological areas and the interdisciplinary aspects of AI, industrial members that are major actors in AI at the global level, and a very strong group of international partners. Benoît Sagot and Justine Cassell hold PRAIRIE chairs; Rachel Bawden holds a junior PRAIRIE chair.
LabEx EFL
Participants: Benoît Sagot, Djamé Seddah, Éric Villemonte de La Clergerie, Virginie Mouilleron.
-
Duration:
1 Oct 2010–30 Sept 2024.
-
PI:
Barbara Hemforth (LLF).
-
Coordinators for ALMAnaCH:
Benoît Sagot , Djamé Seddah and Éric de La Clergerie .
-
Summary:
Empirical foundations of linguistics, including computational linguistics and natural language processing. ALMAnaCH's predecessor team ALPAGE was one of the partner teams of this LabEx, which gathers a dozen teams in and around Paris whose research interests include one or more aspects of linguistics. Several ALMAnaCH members are now “individual members” of the LabEx EFL. Benoît Sagot serves as deputy head (and former head) of one of the scientific strands of the LabEx, namely strand 6, dedicated to language resources. Benoît Sagot and Djamé Seddah are (co-)heads of a number of scientific “operations” within strands 6, 5 (“computational semantic analysis”) and 2 (“experimental grammar”). The main collaborations relate to language resource development (strands 5 and 6), syntactic and semantic parsing (strand 5, especially with LIPN [CNRS and U. Paris 13]) and computational morphology (strands 2 and 6, especially with CRLAO [CNRS and Inalco] and LLF [CNRS and Paris-Diderot]).
GDR LiLT
Participants: Benoît Sagot, Djamé Seddah, Éric Villemonte de La Clergerie.
-
Duration:
1 Jan 2019–present.
-
Summary:
Linguistic issues in language technology.
10.4.2 Other National Initiatives
Convention (MIC, Archives Nationales) NER4archives
Participants: Cecilia Graiff.
-
Duration:
1 Jan 2020–27 Nov 2024.
-
PI:
Laurent Romary .
-
Coordinator for ALMAnaCH:
Laurent Romary .
-
Partners:
- Ministère de la culture
- Archives Nationales
-
Funding:
€60,840
-
Summary:
The project focuses on named entity recognition (NER) and disambiguation on data from the Archives Nationales de France (AN). The NER task is applied to the XML/EAD resources and consists in fine-tuning a spaCy-based transformer model. A spaCy wrapper of the entity-fishing package is applied for entity disambiguation. Moreover, the entities are disambiguated against the authority records made available by the AN, by leveraging RDF graph manipulation, string-matching algorithms, and CrossEncoders. The idea is to merge this approach with a structure-based approach relying on graph neural networks (GNNs), which has been partially implemented.
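The string-matching stage of such a disambiguation pipeline can be sketched as follows. The authority records, identifiers, mention and threshold below are invented for illustration; the actual pipeline also relies on entity-fishing, RDF data from the AN authorities, and CrossEncoder re-ranking rather than on similarity scores alone.

```python
# Hedged sketch of a string-matching stage for linking recognised entity
# mentions to authority records. The records and threshold are illustrative;
# the real pipeline combines this with entity-fishing, RDF graph data and
# CrossEncoder re-ranking.
from difflib import SequenceMatcher

# Hypothetical authority records: identifier -> preferred label
AUTHORITIES = {
    "FRAN_NP_050001": "Colbert, Jean-Baptiste (1619-1683)",
    "FRAN_NP_050002": "Colbert de Torcy, Jean-Baptiste (1665-1746)",
    "FRAN_NP_050003": "Mazarin, Jules (1602-1661)",
}

def link_mention(mention: str, authorities: dict, threshold: float = 0.5):
    """Return (best-matching authority id, score), or (None, score) below the threshold."""
    best_id, best_score = None, 0.0
    for auth_id, label in authorities.items():
        score = SequenceMatcher(None, mention.lower(), label.lower()).ratio()
        if score > best_score:
            best_id, best_score = auth_id, score
    return (best_id, best_score) if best_score >= threshold else (None, best_score)

auth_id, score = link_mention("Colbert, Jean-Baptiste", AUTHORITIES)
```

In practice the threshold and similarity measure would be tuned on annotated data, and candidates surviving this cheap filter would be re-ranked with a CrossEncoder.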
TGIR Huma-Num
Participants: Benoît Sagot, Thibault Clérice.
-
Duration:
1 Jan 2013–present.
-
Summary:
ALMAnaCH is a member of the CORLI consortium on “corpora, languages and interactions” (B. Sagot is a member of the consortium’s board).
Convention (MIC) DataCatalogue
Participants: Hugo Scheithauer, Sarah Bénière.
-
Duration:
12 Aug 2021–31 Oct 2024.
-
PI:
Laurent Romary .
-
Coordinator for ALMAnaCH:
Laurent Romary .
-
Partners:
- Ministère de la culture
- INHA
- Bibliothèque Nationale de France
-
Summary:
The project aims at contributing to the proper transition between the basic digitisation of cultural heritage content and the actual usage of the corresponding content within a “collection as data” perspective. To achieve this, we experiment with new methods for extracting the logical structure of scanned (and OCRed) catalogues and standardise their content for publication towards curators, researchers, and wider audiences.
PIA project (“AMI santé numérique”) OncoLab
Participants: Éric de La Clergerie, Simon Meoni, Rian Touchent.
-
Duration:
1 Mar 2022–1 Mar 2026.
-
PI:
Éric de La Clergerie .
-
Partners:
- Arkhn
- Owkin
- Institut universitaire du cancer de Toulouse Oncopole
- Institut Curie
- Institut Bergonié
- CHU de Toulouse
-
Funding:
€10,639,360 total, €700,720 for ALMAnaCH
-
Summary:
The aim of the project is to make cancer data from health institutions accessible to all stakeholders involved for research and innovation purposes. The data at hand will be standardised and structured, in particular by extracting information from textual documents.
DEFI Inria COLaF
Participants: Benoît Sagot, Thibault Clérice, Rachel Bawden, Juliette Janès, Rasul Dent, Oriane Nédey.
-
Duration:
1 Aug 2023–31 Jul 2027.
-
PIs:
Benoît Sagot and Slim Ouni.
-
Coordinator for ALMAnaCH:
Benoît Sagot .
-
Partner:
- MULTISPEECH (Inria Nancy)
-
Funding:
€1,500,000 total, €750,000 for ALMAnaCH
-
Summary:
The Inria DEFI COLaF (Corpus and Tools for the Languages of France) aims to strengthen the ecosystem of automatic text and speech processing for the languages and speakers of France. To do this, it aims to create open datasets and use them to develop open-source models and tools.
ExcellencES TIERED
Participants: Djamé Seddah, Benoît Sagot.
-
Duration:
1 Jan 2023–31 Dec 2033.
-
PI:
SciencesPo.
-
Coordinator for ALMAnaCH:
Benoît Sagot .
-
Partners:
- SciencesPo
- CNRS
- Ifremer
- INED
- Inserm
- Université Paris Cité
- INALCO
- IDDRI
-
Funding:
€16,000,000 total
-
Summary:
The ambition of the ExcellencES TIERED project is to address the challenges of democratic systems in the face of environmental transformations and the digital transition, by producing outstanding scientific research, disseminating it within society, and training today's and tomorrow's decision-makers.
Biblissima+ Grant HTRogène
Participants: Thibault Clérice, Alix Chagué.
-
Duration:
1 Jan 2024–31 Dec 2025.
-
PIs:
Thibault Clérice and Alix Chagué.
-
Coordinators for ALMAnaCH:
Thibault Clérice and Alix Chagué.
-
Partners:
- PSL
- Ca'Foscari
- CNRS
-
Funding:
€20,000 total
-
Summary:
The project focuses on the production of transcriptions for literary manuscripts and public or private archives in Romance languages from the 11th to the 16th centuries. Its main goal is to produce training data and transcription models that are resilient to changes of language and hand. HTRogène is therefore envisaged as a building block for the infrastructure of Biblissima+ and for the medieval philology of Romance languages: rather than focusing on a particular text or a small selection of texts, the project aims to produce transcription examples that constitute a representative sample, based on specific criteria of language, script, genre and dating.
Inria Action Exploratoire SaLM
Participants: Chloé Clavel, Benoît Sagot.
-
Duration:
1 Jan 2024–31 Dec 2028.
-
PIs:
Djamé Seddah and Jean-Philippe Cointet.
-
Coordinator for ALMAnaCH:
Djamé Seddah .
-
Partner:
- Sciences Po (Medialab)
-
Funding:
€154,600 total, €130,000 for ALMAnaCH
-
Summary:
SaLM is an interdisciplinary project between Inria Paris and Sciences Po that aims to redefine current LLM-based NLP algorithms by incorporating social contexts into their development and evaluation. It emphasises the importance of understanding language as a reflection of cultural and social identities. To explore this sociological dimension of NLP, the project includes two interrelated PhD projects on hate speech detection and cultural bias detection, gathering a mixed team of sociologists and NLP researchers to measure the role of the social dimension and prepare sociologically aware language models.
Inria Action Exploratoire BackInTime
Participants: Thibault Clérice, Hassen Aguili, Benoît Sagot, Rachel Bawden.
-
Duration:
1 Sept 2024–31 Dec 2028.
-
PI:
Cécile Pierrot.
-
Coordinator for ALMAnaCH:
Thibault Clérice .
-
Partner:
- CARAMBA
-
Summary:
BACK IN TIME brings together the expertise of researchers in three fields (artificial intelligence, cryptography and history) to decipher encrypted historical letters, some of which have lain dormant for several centuries. Given the sheer number of pages and the variety of symbols and rules involved, our aim is to develop a software package to assist or even automate the deciphering of documents from ancient, medieval and modern history.
BPI project Code Commons
Participants: Benoît Sagot, Djamé Seddah.
-
Duration:
1 Nov 2024–31 Oct 2026.
-
PI:
Roberto Di Cosmo.
-
Coordinator for ALMAnaCH:
Djamé Seddah .
-
Summary:
CodeCommons is a two-year project building on the foundation of Software Heritage, the world's largest public source code archive. Funded by the French government with academic partners in France and Italy, our mission is to expand and enhance the archive, consolidating critical, qualified information needed to create smaller, higher-quality datasets for the next generation of responsible AI tools. It prioritizes transparency and traceability, empowering model builders and users to respect creators' rights while fostering a more sovereign and sustainable approach to AI development, massive software analysis, and reproducibility in research.
Detecting Dataset Manipulation and Weaponisation of NLP Models (grant)
Participants: Djamé Seddah, Benoît Sagot, Wissam Antoun.
-
Duration:
1 Jan 2023–31 Dec 2026.
-
PIs:
Djamé Seddah and Benoît Sagot .
-
Coordinators for ALMAnaCH:
Djamé Seddah and Benoît Sagot .
-
Partner:
- Ministry of the Interior, France
-
Funding:
€184,000
-
Summary:
Training large language models (LLMs) has become more accessible than ever due to the increased interest in scaling these models to ever-larger sizes, which has been shown not only to improve performance but also to unlock new emergent capabilities. However, the high compute cost required to train LLMs remains exclusive to high-budget private institutions or certain countries, raising questions about bad actors with malicious intents. Furthermore, the Center on Terrorism, Extremism, and Counter-terrorism (CTEC) highlights the emerging threat of industrialised terrorist and extremist propaganda using models like GPT-3. Hence, it is imperative to research methods to 1) detect and defend against threats of LM weaponisation and malicious dataset tampering, 2) eliminate or mitigate the threats present in language models, and 3) improve the robustness of our OSINT and threat-analysis defence systems against adversarial attacks.
CamemBERT2.0 (grant)
Participants: Djamé Seddah, Benoît Sagot, Wissam Antoun.
-
Duration:
1 Oct 2023–1 Jun 2024.
-
PI:
Djamé Seddah .
-
Coordinator for ALMAnaCH:
Djamé Seddah .
-
Partner:
- DINUM
-
Funding:
€60,000
-
Summary:
CamemBERT is by far the most widely used language model for French, with 22 million downloads since its release on HuggingFace in late 2019, and is widely relied upon by French companies using NLP to maximise productivity and efficiency. However, language models such as CamemBERT and CamemBERTa were trained on data that is now obsolete, which reduces their performance when put into production. Given the importance of having up-to-date, state-of-the-art models available to the entire French tech ecosystem, and to address the issues listed above, we plan to provide up-to-date encoder models, which are essential for the development of modern, fast and efficient artificial intelligence systems. We also plan to provide fine-tuned versions of these models that are directly usable for generic use cases such as named entity recognition, natural language inference, and part-of-speech tagging. The plan is first to create an updated corpus of French texts, mainly from our own OSCAR project. The new corpus will then be used to train a new state-of-the-art CamemBERTa model, in addition to a new CamemBERT model to ensure backward compatibility with existing applications.
10.4.3 Regional Initiatives
Domaine de recherche et d'innovation majeurs (DIM) AI4IDF
Participants: Chloé Clavel.
-
Duration:
1 Sept 2021–1 Sept 2026.
-
PI:
Chloé Clavel .
-
Coordinator for ALMAnaCH:
Chloé Clavel .
-
Partners:
- PRAIRIE
- DataIA
- Hi!Paris
- SCAI
-
Summary:
AI4IDF aims to deepen knowledge in AI while keeping the human being at the center of its concerns. The Paris Region must play a major role in this future sector, thanks to its scientific excellence and its incomparable ecosystem.
Domaine de recherche et d'innovation majeurs (DIM) Patrimoines matériels – innovation, expérimentation et résilience
Participants: Alix Chagué, Thibault Clérice.
-
Duration:
1 Jan 2022–31 Dec 2026.
-
Coordinator for ALMAnaCH:
Laurent Romary .
-
Summary:
The DIM Patrimoines matériels – innovation, expérimentation et résilience (PAMIR) aims to bring out new forms of social, environmental and economic development by connecting museums, companies, the Île-de-France ecosystem of creation and crafts, universities and laboratories around questions of fundamental and applied research on heritage collections and issues.
10.5 Public policy support
- Justine Cassell is a member of the Conseil National du Numérique, an independent advisory body in France that provides guidance to the government on digital technology and its societal impacts.
- Benoît Sagot was a member of the steering committee of the AI Action Summit held in Paris in February 2025.
- Benoît Sagot gave a 2-hour speech to provide a basic training on NLP to the “Section de l'intérieur” of the French “Conseil d'État”, the highest administrative court in France, which also serves as a legal adviser to the government.
- Benoît Sagot gave an invited talk on current NLP at the 3rd training session for Directors of Central Government Administration, organised by DINUM (the French Interministerial Directorate for Digital Affairs) at INSP (the elite French school for training senior civil servants).
- Benoît Sagot and Djamé Seddah gave a 2-hour training course at the CNCTR (“Commission nationale de contrôle des techniques de renseignement”), the French independent administrative authority that oversees and ensures the legality of intelligence-gathering techniques used by French intelligence agencies.
11 Dissemination
11.1 Promoting scientific activities
11.1.1 Scientific events: organisation
Member of the organizing committees
- Justine Cassell : Member of the organising committee for the AI, Psychology and Neuroscience Summer Program (a 1-month summer workshop at U. California Berkeley, funded by the Simons Foundation).
- Rachel Bawden : Member of the organising committee for the WMT biomedical shared task and the WMT general shared task.
Inria-internal events
- Justine Cassell : Organiser of the First Symposium on HCI in AI (a 24-person workshop to identify grand challenges for HCI in the era of ubiquitous AI systems, hosted by MBZUAI in Abu Dhabi) and of a 5-day ESSLLI workshop on Conversational Grounding, co-organised with David Traum during the last week of July.
- Rachel Bawden : Organiser of the ALMAnaCH seminar series.
- Wissam Antoun : Organiser of the ALMAnaCH reading group.
11.1.2 Scientific events: selection
Member of the conference program committees
- Aina Garí Soler: Senior area chair for COLING 2025 (Lexical Semantics). Reviewer for NLP4DH & IWCLUL.
- Chloé Clavel : Reviewer for Interspeech, EMNLP 2024 and SiCon workshop @EMNLP (Social Influence in Conversations). Area chair for ICMI and ACII. Senior area chair for COLING 2025 (Dialogue and Conversational Interaction).
- Éric de La Clergerie : Reviewer for COLM 2024, EPIA 2024, EAMT 2024 (emergency review), EMNLP 2024 and ACL Rolling Review (February).
- Emer Gilmartin: Reviewer for LREC-COLING 2024.
- Justine Cassell : Reviewer for YESDS (Young Researchers in Spoken Dialogue Systems), ACL Rolling Reviews and SIGDIAL.
- Lauriane Aufrant: Area chair for LREC-COLING 2024 (“Parsing, Tagging, Chunking, Grammar, Syntax, Morphosyntax, Morphology” track).
- Rachel Bawden : Senior area chair for EAMT 2024 (Co-chair of the “Research: Technical” track). Programme committee member for TALN 2024 (Meta-reviewer). Reviewer for COLM 2024, EvalLLM2024 workshop, ACL SRW 2024 (Student research workshop) and WMT 2024. Area chair for EMNLP 2024 (Industry Track).
- Benoît Sagot : Reviewer for COLM 2024, EAMT 2024 and TALN 2024.
- Djamé Seddah : Senior area chair for ACL 2024 (Syntax and Morphology track) and COLING 2025 (Tutorial chair). Reviewer for NLP+CSS Workshop 2024, WiNLP, ACL Rolling Review (October), NODALIDA 2025, CAWL 2024, JEP-TALN 2024, LAW (Linguistic Annotation Workshop), GenBench 2024 Workshop, EAMT 2024 Technical Track, NAACL 2024 NLP+CSS Workshop (Computational Social Science) and ACL Rolling Review (February).
- Thibault Clérice : Programme committee member for Computational Humanities Research and NLP4DH.
11.1.3 Journal
- Éric de La Clergerie : Reviewer for ACM Computing Surveys and SN Computer Science.
- Rachel Bawden : Reviewer for Revue TAL (Special thematic issue on “Robustesse et limites des modèles de traitement automatique des langues”) and Revue TAL (Special thematic issue on “Explicabilité des modèles de TAL”).
- Djamé Seddah : Reviewer for IEEE Transactions on Affective Computing.
Member of the editorial boards
- Chloé Clavel : Member of the editorial board of IEEE Transactions on Affective Computing.
- Rachel Bawden : Member of the editorial board for Revue TAL (Secretary).
11.1.4 Invited talks
-
Rachel Bawden
:
- Demi-heure de Science, Inria Paris (11 Jan 2024): “Machine Translation and Variation”.
- Conference on Multilingualism, Aix-Marseille University, Aix-en-Provence (11 Jun 2024): “Machine Translation and the Challenge of Cross-Lingual Ambiguity”.
-
Justine Cassell
:
- World Economic Forum, Davos (17 Jan 2024): “Communication at the Speed of AI”. Panel organised by Fast Company & Verizon
- World Economic Forum, Davos (18 Jan 2024): “Future of XR”.
- Shonan Village Center, Japan (5 Mar 2024): “The Future of Education with AI”. Shonan Village workshop, organised by Laurence Devillers
- AI Research Center (AIRC), Japan (8 Mar 2024): “Interlocutor effects: What LLMs can't do and why”.
- EACL, Malta (22 Mar 2024): “Marginalized Dialects and Language Technologies for Education: Can Culture-Sensitive EdTech Help Children Flourish”. EACL workshop “Towards Ethical and Inclusive Conversational AI: Language Attitudes, Linguistic Diversity, and Language Rights” (TEICAI)
- Paris, France (27 Mar 2024): “Teaching and learning with generative AI”. UNESCO Chair Conference on Teaching and Learning with Generative AI
- Cambridge, MA, USA (5 Apr 2024): “Potential impacts of Social Bias and Machine Learning Bias on Precision Medicine”. Harvard-Radcliffe workshop on “Sex and Gender at the Crossroads of 21st Century Precision Medicine”
- Paris, France (22 May 2024): “The impact of on-Device LLMs on usability and security of cell phones”. Salon VivaTech
- Paris, France (2 Jul 2024): “IA Génératives, chatbots, tuteurs et étudiants - What's next ?”. AGIR seminar (Université Paris Dauphine)
- Paris, France (17 Sept 2024): “IA, neuroscience et l'avenir de l'éducation”. AGIR meeting at Université Paris Dauphine
- Paris, France (27 Sept 2024): “Chatbots as interlocutors not tutors”. XYZParis (Station F)
- French Embassy, Seoul, South Korea (22 Oct 2024): “IA and the future of education”. Joint meeting between France, Korea, Taiwan and Japan
- Collège de France, Paris (28 Oct 2024): “Conversations à Venir”. AI and neuroscience
- Conversations à Venir (6 Dec 2024): “Vivre avec l'Intelligence Artificielle: l'éducation”. Entretiens de Royaumont
- Versailles, France (11 Dec 2024): “Giving AI a Voice and a Body”. Joint with Patrick Perez (Kyutai). Jolt Capital's 2024 Annual LPs & CEOs Day
-
Floriane Chiffoleau
:
- Saarland University, Saarbrücken (22 Feb 2024): “Introduction into Publishing Workflows with TEI Publisher”. Master Class “Digital Scholarly Editing” 2024, Saarbrücken
- Charles University, Prague, Czech Republic (28 Mar 2024): “Leveraging EHRI Online Editions for training automated edition tools”. Joint with Hugo Scheithauer. Workshop “Natural Language Processing meets Holocaust Archives”
-
Benoît Sagot
:
- PhD day of the Île-de-France AI4IDF institute (5 Mar 2024): “Image-enhanced machine translation: modelling and evaluation challenges”
- CORIA 2024, La Rochelle, France (3 Apr 2024): “Mieux comprendre les modèles de langue et les textes qu'ils produisent”. Keynote speech
- École polytechnique, Palaiseau, France (9 Apr 2024): “Apprendre les langues aux machines: Introduction au traitement automatique des langues”. “Amphi 0” for computer science: invited scientific talk to encourage new École polytechnique students to enroll in computer science courses
- Maison Irène et Frédéric Joliot Curie, Brussels, Belgium (16 Apr 2024): “Language models and their training data: challenges and perspectives”. ERCIM Visionary Event on Generative AI: introductory keynote
- BnF, Paris, France (26 Apr 2024): “Training data for language models: challenges and perspectives”. Web Archiving Conference (WAC): Closing keynote
- George Mason University, Fairfax, Virginia, USA (3 May 2024): “Why do small language models underperform?”. GMU Computer Science Seminar
- University of Edinburgh, Edinburgh, UK (14 Jun 2024): “Doing more with sentence embeddings”. ILCC seminar
- Paris, France (24 Jun 2024): “Modèles de langue et données d'entraînement : progrès récents et enjeux actuels”. Executive seminar “L'intelligence artificielle à la BnF” (AI at BnF)
- Grenoble, France (29 Aug 2024): “Modèles génératifs pour le traitement automatique des langues”. Journées Scientifiques Inria
- Montreal, Canada (1 Oct 2024): “Why do small language models underperform? Studying Language Model Saturation via the Softmax Bottleneck”. MILA Workshop on NLP in the era of generative AI, cognitive sciences, and societal transformation. Joint work with Nathan Godey and Éric de La Clergerie
- Campus Cyber, Puteaux, France (11 Dec 2024): “Modèles de langue génératifs : contexte, état de l'art et enjeux”. Journées scientifiques du PEPR Cybersécurité
-
Chloé Clavel
:
- Paris, France (23 May 2024): “AI4IDF: recherche en intelligence artificielle centrée sur l'humain”. Salon VivaTech
- Bordeaux, France (24 May 2024): “Transparency in AI”. Symposium on Neuroscience & Artificial Intelligence Mechanisms, Perspectives & Consequences.
- Glasgow, UK (19 Sept 2024): “Multimodal analysis of the socio-emotional layer in interactions — Or how to replicate a subjective perception process?”. Workshop of the IVA conference
- Grenoble, France (15 Oct 2024): “Socio-conversational systems”. 2nd HRI Symposium organized by Naver Labs Europe
-
Djamé Seddah
:
- Teratec 2024 Forum, Paris, France (30 May 2024): “Should we be afraid of the big bad GPT? Demystifying language models and preventing their weaponization”. Invited talk on applications of AI in research and industry
- BnF, Paris, France (15 Oct 2024): “Should we be afraid of the big bad GPT? Demystifying language models and preventing their weaponization”. Forum “30 ans de l'ORAP”
- CNCTR, Paris, France (29 Apr 2024): “Prévenir la weaponisation des LLM”. Commission nationale de contrôle des techniques de renseignement
- ANSSI, Paris, France (26 Sept 2024): “Prévenir la weaponisation des LLM”. Online, from the Campus Santé
- IMS, Stuttgart, Germany (20 Nov 2024): “Demystifying language models and preventing their weaponization”.
- UK Embassy, Inria–UK AI Institute workshop (6 Nov 2024): “Preventing Language Models Weaponisation”. Joint with Arij Riabi and Wissam Antoun .
- Ministère de l'intérieur, Levallois-Perret, France (11 Dec 2024): “Preventing Language Models Weaponisation: Insights from the Counter Dataset”. Joint with Arij Riabi and Wissam Antoun .
-
Thibault Clérice
:
- Paris, France (13 Nov 2024): “CATMuS Medieval: The Importance of Medieval Manuscript Automatic Transcription for Computer Vision”.
-
Alix Chagué
:
- Naples, Italy (16 Apr 2024): “FAIRer transcriptions: HTR-United and the possibility of a commons for training data”. Horizons of digital philology. The Greek Anthology for rethinking [formats], [paradigms] and [collaboration]
- Paris, France (13 Dec 2024): “Transcription of discursive and non-discursive documents: connecting project specificities and generic frameworks”. Workshop: texts in historical maps, diagrams and illustrations
-
Éric de La Clergerie
:
- e-VITA workshop on Knowledge Graphs and LLMs (8 Mar 2024): “Coupling KG and LLM: a few directions”.
11.1.5 Scientific expertise
-
Alix Chagué
:
- Member of the advisory board (Comité de Perfectionnement) of the Master “Technologies Numériques Appliquées à l'Histoire” at the École nationale des chartes (Paris).
-
Benoît Sagot
:
- Member of the scientific advisory board of the CLARIN ERIC.
- Member of the Working Group on AI of the Couperin consortium.
- Member of the Scientific Council of ARTE France.
-
Aina Garí Soler
:
- Reviewer for UTTER FSTP proposals (UTTER project: Unified Transcription and Translation for Extended Reality; FSTP: Financial Support for Third Parties).
-
Chloé Clavel
:
- Reviewer for BPI.
-
Éric de La Clergerie
:
- Expert reviewer for BPI (8 project reviews for the BPI call “Usages IA Générative”).
- Reviewer for the ANR MRSEI call (one project review).
-
Rachel Bawden
- Co-opted member of the executive committee of the EAMT (European Association for Machine Translation).
- Reviewer for a GENCI thematic committee (reviewing projects for the allocation of computational resources).
11.1.6 Research administration
- Laurent Romary :
-
Benoît Sagot
:
- Member of the Bureau du Comité des Projets of the Inria Paris research centre.
- Member of the scientific board of the Société de Linguistique de Paris (Administrateur).
-
Rachel Bawden
:
- Member of the scientific board of the Société de Linguistique de Paris (Administratrice).
11.2 Teaching - Supervision - Juries
11.2.1 Teaching
-
Rachel Bawden
:
- Master's course (M2), Master “Mathématiques, Vision Apprentissage”, ENS Paris-Saclay, France. CM: Speech and Language Processing, coorganised with Benoît Sagot , Chloé Clavel , Djamé Seddah and Guillaume Wisniewski. 3hrs.
-
Justine Cassell
:
- Master's course (M2), Master “Biologie Intégrative et Psychologie”, speciality Neurosciences Parcours N°2 “Neurosciences cognitives et comportementales”, Sorbonne Université, Paris, France. CM: Metacognition dans l'Interaction Virtuelle. 4hrs.
- Inria Paris. École Académique Formation Continue: IA et Education: Quels rôles pour l'IA? 4hrs.
- Bénin Start-up Incubator. Continued training for start-ups creating education applications: Les Sciences de l'Education, coorganised with Mastercard Foundation. 12hrs.
- Senegal Start-up Incubator. Continued training for start-ups creating education applications: Les Sciences de l'Education, coorganised with Mastercard Foundation. 12hrs.
-
Alix Chagué
:
- Bachelor's, Master's and Doctoral course (1st, 2nd and 3rd cycles), Faculté des Arts et des Sciences, Université de Montréal, Canada. CM: HNU-6059: Humanités Numériques: Langages de Programmation. 15hrs.
- Bachelor's course (1st cycle), Faculté des Arts et des Sciences, Université de Montréal, Canada. CM: HNU-2000: Humanités Numériques: Pratique. 45hrs.
- Università degli Studi di Napoli Federico II, Naples, Italy. Workshop on eScriptorium for HTR projects: eScriptorium. 3hrs.
-
Floriane Chiffoleau
:
- Charles University, Prague, Czech Republic. Workshop “Natural Language Processing meets Holocaust Archives”: EHRI editions and TEI Publisher annotation tool. 2hrs.
-
Chloé Clavel
:
- Master's course (M2), X Datascience, Polytechnique. CM: Introduction to NLP, Recent advances in NLP. 6hrs.
- Master's course (M2), Master “Mathématiques, Vision Apprentissage”, ENS Paris-Saclay, France. CM: Speech and Language Processing, coorganised with Benoît Sagot , Chloé Clavel , Djamé Seddah and Guillaume Wisniewski. 6hrs.
- Master's course (M2), Mastère spécialisé en Intelligence Artificielle, Telecom-Paris. CM: Conversational Systems. 3hrs.
- Master's course (M2), Mastère spécialisé en Intelligence Artificielle, Telecom-Paris. CM: Introduction to Natural Language Processing. 3hrs.
- Master's course (M2), Master Data IA, Polytechnique. CM: Introduction to Natural Language Processing. 3.5hrs.
-
Thibault Clérice
:
- Doctorate's course (PhD & Higher), Invited professor course, Federico II, Napoli. CM: Introduction to Computational Humanities. 6hrs.
- Doctorate's course (PhD & Higher), Invited professor course, Federico II, Napoli. CM: Natural Language Processing for Computational Humanities. 6hrs.
- Doctorate's course (PhD & Higher), Invited professor course, Federico II, Napoli. CM: Computer Vision for Computational Humanities. 6hrs.
- Doctorate's course (PhD & Higher), Invited professor course, Federico II, Napoli. TD/TP: Workshop for Handwritten Text Recognition & Layout Analysis. 18hrs.
- ENS de Lyon, Lyon, France. Summer School: TranscriboQuest 2025, coorganised with CNRS (CIHAM). 21hrs.
-
Aina Garí Soler
:
- Master's course (M2), M2 DS - Data Science, École Polytechnique. CM: CSC_5DS25_TP: Natural Language Processing and Sentiment Analysis, coorganised with Matthieu Labeau. 1.5hrs.
- Master's course (M2), Mastère spécialisé - IA : Intelligence Artificielle, Télécom-Paris. Project supervision: IA717 : Natural Language Processing, coorganised with Matthieu Labeau, Maria Boritchev, Changhong Wang. 19.5hrs.
-
Arij Riabi
:
- Master's course (M1), Cycle d'ingénieur, ENSAE. Project supervision: Applied statistics project. 20hrs.
-
Benoît Sagot
:
- Course for the general public, Chaire annuelle “Informatique et sciences numériques”, Collège de France. CM: Apprendre les langues aux machines. 9hrs.
- Master's course (M2), Master “Mathématiques, Vision Apprentissage”, ENS Paris-Saclay, France. CM: Speech and Language Processing, coorganised with Benoît Sagot , Chloé Clavel , Djamé Seddah and Guillaume Wisniewski. 7.5hrs.
- PariSanté Campus. Certificate Data AI Product Owner Société des Ingénieurs de l'Automobile: Natural Language Processing. 14hrs.
- PariSanté Campus. “IA for creative business” training by We Are_ school: Introduction au TAL et à l'IA générative. 5hrs.
- Institut national du service public. 3rd training day on digital technology for Directors of Central Government Administration, organised by DINUM: Introduction au traitement automatique des langues. 1hr.
-
Hugo Scheithauer
:
- Master's course (M2), Master Humanités numériques, Université Rennes 2. TD/TP: L'édition numérique avec XML-TEI. 6hrs.
-
Djamé Seddah
:
- Master's course (M2), Master “Mathématiques, Vision Apprentissage”, ENS Paris-Saclay, France. CM: Speech and Language Processing, coorganised with Benoît Sagot , Chloé Clavel , Djamé Seddah and Guillaume Wisniewski. 1.5hrs.
11.2.2 Supervision
PhD
- Tú Anh Nguyên : “Spoken Language Modeling from Raw Audio” (19 Apr 2021–18 Apr 2024). CIFRE PhD with META AI Paris. Supervised by Benoît Sagot and Emmanuel Dupoux (CIFRE advisor). PhD defended on 9 Apr 2024.
- Paul-Ambroise Duquenne : “Study of vector spaces for sentence representation” (15 May 2021–14 Mar 2024). CIFRE PhD with META AI Paris. Supervised by Benoît Sagot and Holger Schwenk (CIFRE advisor). PhD defended on 14 Mar 2024.
- Lydia Nishimwe : “Robust Neural Machine Translation” (1 Oct 2021–present). Supervised by Benoît Sagot and Rachel Bawden .
- Arij Riabi : “NLP for low-resource, non-standardised language varieties, especially North-African dialectal Arabic written in Latin script” (1 Oct 2021–present). Supervised by Laurent Romary and Djamé Seddah .
- Floriane Chiffoleau : “Training data and creation of models for the text recognition of typewritten or handwritten corpus of archival collection” (15 Oct 2021–14 Oct 2024). Primary affiliation: Université du Mans. Supervised by Anne Baillot and Laurent Romary . PhD defended on 20 Nov 2024.
- Matthieu Futeral-Peter : “Text-image multimodal models” (1 Nov 2021–present). Primary affiliation: WILLOW, Inria. Supervised by Ivan Laptev and Rachel Bawden .
- Alix Chagué : “Methodology for the creation of training data and the application of handwritten text recognition to the Humanities.” (1 Nov 2021–31 Oct 2024). Secondary affiliation: Université de Montréal and CRIHN. Supervised by Laurent Romary , Emmanuel Château-Dutier and Michael Sinatra.
- Nathan Godey : “Neural language modelling” (1 Dec 2021–31 Dec 2024). Supervised by Benoît Sagot and Éric de La Clergerie . PhD defended on 20 Dec 2024.
- Francis Kulumba : “Authorship attribution and disambiguation in scientific publications.” (1 Nov 2022–present). Supervised by Laurent Romary and Guillaume Vimont.
- Alisa Barkar : “Interpretable textual features, public speeches, multimodal systems” (1 Nov 2022–present). Primary affiliation: Télécom Paris. Supervised by Chloé Clavel , Beatrice Biancardi and Mathieu Chollet.
- Rian Touchent : “Information Extraction on French Electronic Health Records” (1 Dec 2022–present). Supervised by Laurent Romary and Éric de La Clergerie .
- Simon Meoni : “Exploration of adaptation methods for neural models in the French clinical domain” (1 Dec 2022–present). CIFRE PhD with Arkhn. Supervised by Laurent Romary and Éric de La Clergerie .
- Wissam Antoun : “Detecting Dataset Manipulation and Weaponisation of NLP Models” (1 Mar 2023–present). Supervised by Benoît Sagot and Djamé Seddah .
- You Zuo : “Patent representation learning for innovation generation and technical trend analysis” (1 Mar 2023–present). CIFRE PhD with qatent. Supervised by Benoît Sagot , Éric de La Clergerie and Kim Gerdes (CIFRE advisor).
- Nicolas Dahan : “Evaluation of the machine translation of scientific documents” (1 Oct 2023–present). Secondary affiliation: CNRS/ISIR. Supervised by François Yvon and Rachel Bawden .
- Ziqian Peng : “Machine translation of scientific documents” (1 Oct 2023–present). Primary affiliation: CNRS/ISIR. Supervised by François Yvon and Rachel Bawden .
- Yanzhu Guo : “Language model evaluation, argument mining, computational social science” (1 Oct 2023–present). Primary affiliation: Ecole Polytechnique. Supervised by Michalis Vazirgiannis and Chloé Clavel .
- Lucie Chenain : “Speech Emotion Recognition for Huntington's Disease risky behaviour” (1 Oct 2023–present). Primary affiliation: Université Paris Cité. Supervised by Anne-Catherine Bachoud Levi and Chloé Clavel .
- Lorraine Vanel : “Conversational AI, Social/emotional Dialogue Generation” (1 Oct 2023–present). Primary affiliation: Télécom Paris. CIFRE PhD with Zaion. Supervised by Chloé Clavel and Alya Yacoubi (CIFRE advisor).
- Biswesh Mohapatra : “Improving chatbot dialogue systems through collaborative grounding” (1 Oct 2023–present). Supervised by Justine Cassell and Laurent Romary .
- Cyril Chhun : “Story Generation and Evaluation” (1 Oct 2023–30 Apr 2024). Primary affiliation: Télécom Paris. Supervised by Chloé Clavel and Fabian Suchanek. PhD defended on 19 Nov 2024.
- Chadi Helwé : “Evaluating and Improving Reasoning Abilities of Large Language Model” (1 Oct 2023–30 Mar 2024). Primary affiliation: Télécom Paris. Supervised by Chloé Clavel and Fabian Suchanek. PhD defended on 5 Jul 2024.
- Hugo Scheithauer : “Acquisition, integration and redistribution of structured data in GLAMs: harmonising practices” (1 Nov 2023–present). Supervised by Laurent Romary .
- Armel Zebaze : “Analogy for Multilingual Natural Language Processing” (1 Nov 2023–present). Supervised by Benoît Sagot and Rachel Bawden .
- Rasul Dent : “Large-scale language identification (numerous languages, massive data, distinction between closely related varieties) with a focus on the languages of France and French-based creoles.” (1 Nov 2023–present). Supervised by Benoît Sagot , Thibault Clérice and Pedro Ortiz.
- Anh Ha Ngo : “Multimodal models, conversation repair and human-agent interaction” (1 Jan 2024–present). Secondary affiliation: Sorbonne Université. Supervised by Chloé Clavel and Catherine Pelachaud.
- Pierre Chambon : “Code generation with language models” (26 Feb 2024–present). Supervised by Benoît Sagot and Gabriel Synnaeve (CIFRE advisor).
- Oriane Nédey : “Machine Translation for low-resource dialectal variants” (1 Oct 2024–present). Supervised by Benoît Sagot , Rachel Bawden and Thibault Clérice .
- Panagiotis Tsolakis : (1 Oct 2024–present).
- Reem Al Najjar : “Investigating Neural Mechanisms of Collaboration Among Peers” (1 Nov 2024–present). Supervised by Justine Cassell .
- Gabrielle Le Bellier : “Controlled generation for biais mitigation and cultural awareness in conversational language models” (1 Nov 2024–present). Supervised by Benoît Sagot and Chloé Clavel .
- Célia Nouri : “Toxicity and Opinions Detection and Analysis on Social Media Conversations” (1 Nov 2024–present). Supervised by Chloé Clavel and Jean-Philippe Cointet (MediaLab).
- Cecilia Graiff : “Multilingual and cross-cultural automatic analysis of argumentation structures in political debates” (1 Dec 2024–present). Supervised by Benoît Sagot and Chloé Clavel .
- Yi Yu : “Automatic analysis of the human ability to collaborate in dyadic and group conversations, with a view to educational applications.” (1 Dec 2024–present). Secondary affiliation: Telecom-Paris. Supervised by Chloé Clavel and Maria Boritchev (Telecom-Paris).
Interns
- Léo Labat: “Extraction of knowledge graphs by combining and adapting tools for a number of NLP tasks, such as named entity recognition, named entity linking, coreference resolution, relation extraction, relation clustering, document-level event extraction and slot filling.” (1 Nov 2023–26 Mar 2024). Supervised by Lauriane Aufrant.
- Ilyas Lebleu: “Exploring the reasoning capabilities of transformers in language models based on the design of families of prompts” (4 Dec 2023–1 Mar 2024). Supervised by Éric de La Clergerie .
- Qinhyue Xu: “Interpersonality” (1 Jan 2024–2 Jul 2024). Supervised by Emer Gilmartin and Justine Cassell .
- Mira Lee: “Analyzing properties of speech as a function of personality” (1 Feb 2024–1 Aug 2024). Supervised by Emer Gilmartin and Justine Cassell .
- Irène Metz: “Collection and analysis of hyperscanning data of pairs of children with respect to rapport” (1 Feb 2024–1 Jul 2024). Supervised by Justine Cassell .
- Diana Bakina: “Feature engineering on audio, video, and text signals to support modelling of personality-related conversational phenomena” (1 Mar 2024–13 Aug 2024). Supervised by Emer Gilmartin.
- Reem Al Najjar: “Hyperscanning of pairs of children” (15 Mar 2024–31 Jul 2024).
- Gabrielle Le Bellier: “Metrics and protocols for measuring bias in language models” (15 Apr 2024–30 Sept 2024). Supervised by Benoît Sagot and Chloé Clavel .
- Célia Nouri: “Hate speech analysis in conversations” (26 Apr 2024–3 Oct 2024). Primary affiliation: Medialab SciencesPo. Supervised by Chloé Clavel and Jean-Philippe Cointet.
- Javier Alejandro Lopetegui González: “LLM alignment for different Spanish dialects” (6 May 2024–31 Jul 2024). Supervised by Djamé Seddah .
- Zofia Milczarek: “Prompting LLMs for Natural Spoken-like (Text) Dialogue generation” (13 May 2024–12 Aug 2024). Supervised by Marius Le Chapelier.
- Mathilde Deletombe: “Difference between French and English linguistic devices (including nonverbal) for signaling rapport” (13 May 2024–1 Aug 2024).
- Anna Celina Desiderio: (1 Jun 2024–31 Aug 2024).
- Théo Charlot: “Representation And Storage of Grounding in LLMs without affecting context input” (3 Jun 2024–31 Aug 2024). Supervised by Justine Cassell and Biswesh Mohapatra.
- Gabrielle Alimi: (1 Oct 2024–31 Oct 2024). Supervised by Justine Cassell .
- Barokshana Baskaran: (15 Oct 2024–31 Oct 2024). Supervised by Justine Cassell .
Engineers
- Virginie Mouilleron: “Correction and annotation of the Alien vs Predator dataset, Prompt Tuning and Data extraction from LLMs.” (1 Dec 2022–present). Supervised by Djamé Seddah .
- Menel Mahamdi: “Data set annotation, synthetic data generation, conversational data sets” (1 Sept 2023–30 Apr 2024). Supervised by Djamé Seddah .
- Juliette Janès: “Recovery, encoding, maintenance, and publication of textual data on French and other languages of France produced within the framework of the DEFI COLaF” (1 Oct 2023–present). Supervised by Benoît Sagot and Thibault Clérice .
- Sarah Bénière: “Automatic analysis of digitized sales catalogs” (1 Oct 2023–present). Supervised by Laurent Romary .
- Samuel Scalbert: “Detection of software in HAL articles using GROBID and Softcite in the context of the GrapOS project.” (1 Oct 2023–present). Supervised by Laurent Romary .
- Lauriane Aufrant: “Extraction of information and automated construction of knowledge graphs for French, with a focus on subjects and tasks related to the field of defense and security.” (1 Nov 2023–30 Sept 2024). Supervised by Benoît Sagot .
- Marius Le Chapelier: “Developing the SARA (Socially Aware Robot Assistant) dialogue system to be able to build social bonds (rapport) with users in order to improve performance.” (1 Nov 2023–present). Supervised by Justine Cassell .
- Oriane Nédey: “Data collection and translation models for a regional language of France.” (1 Dec 2023–30 Sept 2024). Supervised by Rachel Bawden , Thibault Clérice and Benoît Sagot .
- Cecilia Graiff: “Named entity disambiguation for the National Archives of France in the context of the NER4Archives project.” (1 Dec 2023–30 Nov 2024). Supervised by Laurent Romary .
- Sinem Demirkan: “Hyperscanning of pairs of children in order to better understand the neural correlates of rapport” (1 Jan 2024–2 Jan 2025). Supervised by Justine Cassell .
- Hao Wang: “Benchmarking conversational grounding in LLMs.” (1 Jan 2024–21 May 2024). Supervised by Biswesh Mohapatra.
- Malik Marmonier: “Machine Translation with large language models in low-resource scenarios and for unseen languages” (1 May 2024–present). Supervised by Rachel Bawden and Benoît Sagot .
- Cindy Evellyn de Araujo Silva: “Hyperscanning of pairs of children in order to better understand the neural correlates of rapport, and the impact of rapport on learning” (1 Jun 2024–present). Supervised by Justine Cassell .
- Hasan Onur Keles: (1 Jul 2024–present). Supervised by Justine Cassell .
- Hassen Aguili: “Interface and back-end for automatic recognition of standard and non-standard handwriting” (1 Sept 2024–present). Supervised by Thibault Clérice .
- Sophie Etling: (1 Sept 2024–31 Dec 2024). Supervised by Justine Cassell .
- Panagiotis Tsolakis: “Scientific article management infrastructure for translation” (1 Oct 2024–present). Supervised by Laurent Romary and Rachel Bawden .
Postdocs
- Aina Garí Soler : “Word Meaning Representation in Neural Language Models: Lexical Polysemy and Semantic Relationships” (6 Sept 2021–30 Sept 2024). Primary affiliation: Télécom-Paris. Supervised by Chloé Clavel .
- Emer Gilmartin : “Collaboration with researchers in Korea to understand and model the effects of interlocutor personality on dialogue, and the effects of conversational behaviors on personality. This is leading to a new model of ‘interpersonality’, how personality related behaviours of each participant in a conversation affect the conversation as a whole and, vice-versa, how conversational behaviors affect perceptions of personality.” (1 Oct 2022–30 Sept 2024). Supervised by Justine Cassell .
- José Carlos Rosales Núñez : “Radicalisation detection, robust UGC processing, Machine translation” (1 Aug 2023–30 Apr 2024). Supervised by Djamé Seddah .
- Aina Garí Soler : “Automatic analysis of alignment between speakers in conversations” (1 Oct 2024–present). Secondary affiliation: Telecom-Paris. Supervised by Chloé Clavel .
11.2.3 Juries
PhD
- Benoît Sagot
- Member of the PhD committee as reviewer for Daniel Hesslow at Université de Franche-Comté & LightOn on 8 Oct 2024. Title: Limiting factors for the continued scaling of Large Language Models.
- Member of the PhD committee as president for Mathieu Rita at ENS on 24 Sept 2024. Title: Neural Emergent Communication: at the Intersection of Language Evolution and Deep Reinforcement Learning.
- Member of the PhD committee as reviewer for Laurie Burchell at University of Edinburgh on 13 Jun 2024. Title: Improving natural language processing for under-served languages through increased training data diversity.
- Member of the PhD committee as president for Chadi Helwé at Telecom-Paris on 5 Jul 2024. Title: Evaluating and Improving the Reasoning Abilities of Language Models.
- Member of the PhD committee as director for Nathan Godey at Inria Paris on 20 Dec 2024. Title: Improving representations for language modeling.
- Chloé Clavel
- Member of the PhD committee as co-director for Chadi Helwé at Telecom-Paris on 5 Jul 2024. Title: Evaluating and Improving the Reasoning Abilities of Language Models.
- Member of the PhD committee as reviewer for Kim Yee at Faculty of Science, The University of Sydney, Australia on 26 Aug 2024. Title: Automatic Conversational Analysis System.
- Member of the PhD committee as reviewer for Rémi Uro at University Paris Saclay on 11 Oct 2024. Title: Détection et caractérisation des interruptions dans les interactions orales pour la description du comportement des femmes et des hommes dans les contenus audiovisuels.
- Member of the PhD committee for Salomé Do at LATTICE, MediaLab on 17 Oct 2024. Title: Computational Content Analysis: How, When and Why? Measuring strategic frame prevalence in political press.
- Member of the PhD committee as reviewer for Carol Figueroa at Université Aix Marseille, Furhat robotics on 16 Dec 2024. Title: Not the “when” and “where” but the “how”: modeling and generating spoken feedback for conversational agents.
- Member of the PhD committee as co-director for Cyril Chhun at Telecom-Paris on 19 Nov 2024. Title: Meta-Evaluation Methodology and Benchmark for Automatic Story Generation.
- Member of the PhD committee as president for Robin Quillivic at Université Paris Sciences et lettres on 20 Dec 2024. Title: Construction de profils psycholinguistiques pour le Trouble de Stress Post-Traumatique (TSPT) à l'aide du Traitement Automatique des Langues.
- Member of the PhD committee as reviewer for Soëlie Lerch at Université de Toulon on 18 Dec 2024. Title: Suggestion de dessins animés par similarité émotionnelle: approches neuronales multimodales combinant contenu et données physiologiques.
- Djamé Seddah
- Member of the PhD committee as reviewer for Stefan Grünewald at IMS Stuttgart on 29 Nov 2024. Title: Syntactic Dependency Parsing and Beyond: Robust Neural Architectures and Quality-Enhanced Treebanks for Structured Prediction in NLP.
- Éric Villemonte de La Clergerie
- Member of the PhD committee as co-supervisor for Nathan Godey at Inria Paris on 20 Dec 2024. Title: Improving representations for language modeling.
- Member of the PhD committee as examiner for Shrey Mishra at Université Paris Sciences et lettres on 4 Jul 2024. Title: Extraction multimodale de preuves et de théorèmes depuis la littérature scientifique.
Master
- Rachel Bawden
- Member of the Master's committee as tutor for Junior Tonga at MVA on 10 Sept 2024. Title: Automatic Generation of Guidance Question Hints Using Large Language Models in Educational Technology.
- Member of the Master's committee as tutor for Fatima Balde at MVA on 11 Oct 2024. Title: Mixture Of Experts (MoEs) Finetuning.
- Member of the Master's committee as tutor for Kelthoum Kerboua at MVA on 11 Oct 2024. Title: Large Language Models: Detection and Mitigation of Hallucinations.
- Member of the Master's committee as tutor for Steven Zheng at MVA on 4 Sept 2024. Title: Unifying Recommender Systems and Product Search with Generative Retrieval.
- Benoît Sagot
- Member of the Master's committee as co-director for Gabrielle Le Bellier at Inria on 2 Sept 2024. Title: Metrics and protocols for measuring bias in language models.
- Member of the Master's committee as tutor for Gaspard Choné-Ducasse at SquarePoint on 12 Sept 2024. Title: Developing a Large Language Model for a Low-Resource Programming Language.
- Chloé Clavel
- Member of the Master's committee as tutor for Mohammad Ali Jauhar at MVA on 25 Oct 2024. Title: Robust Accented Speech Recognition.
- Member of the Master's committee as tutor for Ben Kabongo Buzangu at MVA on 9 Sept 2024. Title: Systèmes de Recommandation et Analyse des Données Textuelles.
- Member of the Master's committee as tutor for Victor Deng at MVA on 6 Sept 2024. Title: Bias Correction with Pre-trained Audio Embeddings.
- Djamé Seddah
- Member of the Master's committee as tutor for Grégoire Gissot at MVA on 25 Oct 2024. Title: Amélioration des méthodes automatiques de scoring via LLM-as-a-judge.
HdR
- Chloé Clavel
- Member of the HdR committee as reviewer for Lina Maria Rojas Barahona at LORIA, Nancy on 12 Jun 2024. Title: Talking to Machines: do you read me?
- Benoît Sagot
- Member of the HdR committee as director for Djamé Seddah at Sorbonne Université, Paris, France on 19 Sept 2024. Title: From French Statistical Parsing to Low-Resource Language Modeling: a Research Journey.
CSD
- Rachel Bawden
- Member of the CSD committee for Tom Calamai at Inria on 24 Jul 2024. Title: Détection automatique d'argument fallacieux.
- Member of the CSD committee for Estelle Zheng at Université de Lorraine on 14 Jun 2024. Title: Affinage des grands modèles de langage pour la planification et l'action via des APIs.
- Member of the CSD committee for Maxime Poli at ENS on 4 Jun 2024. Title: Pré-entraînement multilingue universel auto-supervisé de modèles de langage parlé.
- Member of the CSD committee for Zineddine Tighidet at BNP Paribas, Sorbonne Université, CNRS on 25 Oct 2024. Title: Etude de l'impact différentiel des choix de modélisation lors du développement d'un modèle de langage bancaire et application de mesures de compatibilité sémantique en tant que garde-fou pour la génération du langage.
- Benoît Sagot
- Member of the CSD committee for Matthieu Dubois at CNRS, Sorbonne Université, MILA, Université Paris-Saclay on 31 Oct 2024. Title: Automatic Detection of Anomalous Texts.
- Éric Villemonte de La Clergerie
- Member of the CSD committee for Cyril Bruneau at Université Paris Nanterre on 19 Jun 2024. Title: Transmettre des valeurs à l'école: développement d'un outillage informatique appliqué aux manuels scolaires d'histoire (1870-2020).
- Member of the CSD committee for Jules Descamps at Université Paris Cité on 26 Aug 2024. Title: An assessment of natural language processing models for enhancing health data collection from medical literature.
- Member of the CSD committee for Clément Dauvilliers at Sorbonne Université on 17 Sept 2024. Title: Apprentissage automatique pour la prédiction d'évènements météorologiques extrêmes.
- Member of the CSD committee for Boubakar Zourkalaini at Sorbonne Université on 24 Sept 2024. Title: Machine learning research for renewable energy forecasting and planning.
Hiring committees
- Rachel Bawden:
- Member of the Commission des emplois scientifiques (CES) hiring committee at Inria (Paris Centre). Delegations, postdocs and PhDs.
- Chloé Clavel:
- Member of the PR hiring committee at Inria (LISN/LIPS/STL and A&O).
- Benoît Sagot:
- Member of the scientific selection committee for the Inria DR2 applications, as a member of Inria's Commission d'Évaluation.
- Member of the attribution committee for Inria's so-called C3 bonuses, as a member of Inria's Commission d'Évaluation.
11.3 Popularization
11.3.1 Productions (articles, videos, podcasts, serious games, ...)
Authored article
- Rachel Bawden and Benoît Sagot for La Recherche (Media article), “Dans les arcanes des modèles de langue”. Print, Apr-Jun 2024.
Article with citation
- Chloé Clavel:
- cited in an article by Le Monde (Media article), “Faut-il s'inquiéter des “hallucinations” des IA comme ChatGPT ou Gemini ?” Online, 17 Jun 2024.
- cited in an article by Les échos (Media article), “Intelligence artificielle : la guerre des données”. Online, 25 Jun 2024.
- cited in an article by So good (Media article), “IA - Bien dans ses Bots”. 6 Dec 2024.
- Benoît Sagot cited in an article by L'Express (Media article), “OpenAI, Google, Apple... Pourquoi les grands noms de l'IA ont besoin des médias”. Online, 22 Jan 2024.
- Rachel Bawden cited in an article by Centre Inria de Paris website (Inria article), “Rachel Bawden améliore les modèles de traduction automatique”. Inria Paris website, 16 Jan 2024.
11.3.2 Participation in Live events
Media interview
- Chloé Clavel interviewed as part of Un jour dans le monde (Interview), “La Tech La Première”. France Inter, 28 Nov 2024.
Education
- Justine Cassell gave a talk to an Executive Education cohort (AI and Society), MBZUAI, Abu Dhabi, 25 May 2024.
Intervention
- Djamé Seddah gave a talk at Formation pour les membres du CNCTR (Executive training), “Prévenir la weaponisation des LLM”. Centre national de contrôle des techniques de renseignement (CNCTR), 29 Apr 2024.
- Éric Villemonte de La Clergerie:
- gave a talk at Journées Youth Talks (Table ronde "Tribunal Youth Talks"), “L'intégration de l'IA dans l'enseignement favorisera-t-elle un meilleur vivre-ensemble, ou perdrons-nous l'essence de l'humanité dans l'éducation ?” Learning Planet Institute, 24 Jan 2024.
- gave a talk at Journée Réseau Canopé - Intelligence artificielle générative & éducation (Table ronde), “Les Intelligences Artificielles génératives, de la consommation à la construction de compétences”. Paris, 26 Jun 2024.
- Benoît Sagot:
- gave a talk at Formation pour la section de l'intérieur du Conseil d'État (Executive training), “Introduction au TAL et à l'IA générative”. Conseil d'État, 11 Mar 2024 (2 hours).
- gave a talk at Event "Intégrité et mentorat scientifique à l'heure de l'intelligence artificielle" (Table ronde “L'intelligence artificielle : quels risques et opportunités pour l'intégrité scientifique ?”), Collège de France, Paris, 19 Mar 2024.
- gave a talk (replay of the Leçon Inaugurale), “Apprendre les langues aux machines”. Inria Paris, 21 Mar 2024.
- gave a talk at PR[AI]RIE AI days, by PRAIRIE and Station F (Round table “Generative AI”), Station F, 4 Apr 2024.
- gave a talk at Formation pour les membres du CNCTR (Executive training), “Introduction au traitement automatique des langues”. Centre national de contrôle des techniques de renseignement (CNCTR), 29 Apr 2024.
12 Scientific production
12.1 Major publications
- 1 inproceedings"I'm here to fight for ground truth": HTR-United, a solution towards a common for HTR training data.Digital Humanities 2023: Collaboration as OpportunityGraz, Austria2023HAL
- 2 articleCREMMA Medii Aevi: Literary manuscript text recognition in Latin.Journal of Open Humanities Data9April 2023, 4HALDOI
- 3 inproceedingsModular Speech-to-Text Translation for Zero-Shot Cross-Modal Transfer.Proceedings of INTERSPEECH 2023INTERSPEECH 2023Dublin, IrelandAugust 2023HALDOI
- 4 articleConstructing a poor man's wordnet in a resource-rich world.Language Resources and Evaluation4932015, 601-635HALDOI
- 5 inproceedingsTackling Ambiguity with Images: Improved Multimodal Machine Translation and Contrastive Evaluation.Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)Toronto, CanadaJuly 2023, 5394–5413HALDOI
- 6 inproceedingsMANTa: Efficient Gradient-Based Tokenization for Robust End-to-End Language Modeling.EMNLP 2022 - The 2022 Conference on Empirical Methods in Natural Language ProcessingAbu Dhabi, United Arab EmiratesDecember 2022HAL
- 7 articleSurvey of Low-Resource Machine Translation.Computational Linguistics4832022, 673--732HAL
- 8 inproceedings What does BERT learn about the structure of language? ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics Florence, Italy July 2019 HAL
- 9 inproceedingsHUMB: Automatic Key Term Extraction from Scientific Articles in GROBID.SemEval 2010 WorkshopACL SigLex eventUppsala, SwedenJuly 2010, 248-251HAL
- 10 inproceedingsMUSS: Multilingual Unsupervised Sentence Simplification by Mining Paraphrases.LREC 2022 - 13th Language Resources and Evaluation ConferenceMarseille, FranceJune 2022HAL
- 11 inproceedingsCamemBERT: a Tasty French Language Model.ACL 2020 - 58th Annual Meeting of the Association for Computational LinguisticsSeattle / Virtual, United StatesJuly 2020HALDOI
- 12 inproceedingsWhen Being Unseen from mBERT is just the Beginning: Handling New Languages With Multilingual Language Models.NAACL-HLT 2021 - 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language TechnologiesMexico City, MexicoJune 2021HAL
- 13 inproceedingsAsynchronous Pipeline for Processing Huge Corpora on Medium to Low Resource Infrastructures.7th Workshop on the Challenges in the Management of Large Corpora (CMLC-7)Cardiff, United KingdomLeibniz-Institut für Deutsche SpracheJuly 2019HALDOI
- 14 inproceedingsBecause Syntax does Matter: Improving Predicate-Argument Structures Parsing Using Syntactic Features.Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language TechnologiesDenver, United StatesJune 2015HAL
- 15 articleTEI and LMF crosswalks.JLCL - Journal for Language Technology and Computational Linguistics3012015HAL
- 16 inproceedingsThe Lefff, a freely available and large-coverage morphological and syntactic lexicon for French.7th international conference on Language Resources and Evaluation (LREC 2010)Valletta, MaltaMay 2010HAL
- 17 inproceedingsError mining in parsing results.The 21st International Conference of the Association for Computational Linguistics (ACL 2006)Sydney, AustraliaJuly 2006, 329-336HAL
- 18 miscBLOOM: A 176B-Parameter Open-Access Multilingual Language Model.November 2023HAL
- 19 inproceedingsThe French Social Media Bank: a Treebank of Noisy User Generated Content.COLING 2012 - 24th International Conference on Computational LinguisticsKay, Martin and Boitet, ChristianMumbai, IndiaDecember 2012, URL: http://hal.inria.fr/hal-00780895
- 20 inproceedingsStatistical Parsing of Morphologically Rich Languages (SPMRL) What, How and Whither.Proceedings of the NAACL HLT 2010 First Workshop on Statistical Parsing of Morphologically-Rich LanguagesLos Angeles, United StatesAssociation for Computational Linguistics2010, 1--12
- 21 articleParsing Morphologically Rich Languages: Introduction to the Special Issue.Computational Linguistics391March 2013, 8HALDOI
- 22 inproceedingsImproving a symbolic parser through partially supervised learning.The 13th International Conference on Parsing Technologies (IWPT)Naria, JapanNovember 2013HAL
12.2 Publications of the year
International journals
- 23 articleBringing together multimodal and multilevel approaches to study the emergence of social bonds between children and improve social AI.Frontiers in Neuroergonomics5May 2024HALDOI
- 24 articleGraph methods to infer spatial disturbances: Application to Huntington's Disease's speech.Cortex176May 2024, 144 - 160HALDOI
- 25 articleDo Language Models Enjoy Their Own Stories? Prompting Large Language Models for Automatic Story Evaluation.Transactions of the Association for Computational Linguistics12September 2024, 1122–1142HALDOIback to text
- 26 articleARletta. Open-Source Handwritten Text Recognition Models for Historic Dutch.Journal of Open Humanities Data1043July 2024, 1--7HALDOIback to text
- 27 article‘I know what it is’. An interactional study of sex discovery in prenatal ultrasound examinations.Discourse Studies265April 2024, 643-668HALDOI
- 28 articleSpiRit-LM: Interleaved Spoken and Written Language Model.Transactions of the Association for Computational Linguistics13January 2025, 30-52HALback to text
- 29 articleThe Morais Dictionary: Following Best Practices in a Retro-digitized Dictionary Project.International Journal of Humanities and Arts Computing181March 2024, 125 - 147HALDOI
- 30 articleThe Impact of Word Splitting on the Semantic Content of Contextualized Word Representations.Transactions of the Association for Computational Linguistics12April 2024, 299-320HALDOIback to text
Invited conferences
- 31 inproceedingsFAIRer transcriptions: HTR-United and the possibility of a common for training data.Horizons of digital philologyNaples, ItalyApril 2024HAL
- 32 inproceedingsInitiation to Handwritten Text Recognition with eScriptorium.Horizons of digital philologyNaples, ItalyApril 2024HAL
- 33 inproceedingsMcCATMuS : retours sur la production d'un méta-dataset multilingue et multiséculaire.Le patrimoine archivistique face au virage numériqueRimouski, CanadaSeptember 2024HAL
- 34 inproceedingsKreyòl-MT: Building MT for Latin American, Caribbean and Colonial African Creole Languages.Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies2024 Annual Conference of the North American Chapter of the Association for Computational LinguisticsVolume 1: Long PapersMexico City, MexicoJune 2024, 3083–3110HAL
International peer-reviewed conferences
- 35 inproceedingsExploring Inline Lexicon Injection for Cross-Domain Transfer in Neural Machine Translation.KEMT 2024 - First International Workshop on Knowledge-Enhanced Machine TranslationProceedings of the First International Workshop on Knowledge-Enhanced Machine TranslationSheffield, United KingdomJune 2024HALback to text
- 36 inproceedingsCommon Ground, Diverse Roots: The Difficulty of Classifying Common Examples in Spanish Varieties.VarDial 2025 - Twelfth Workshop on NLP for Similar Languages, Varieties and Dialects co-located with COLING 2025Abu Dhabi, United Arab EmiratesJanuary 2025HALback to text
- 37 inproceedingsUkraiNER: A New Corpus and Annotation Scheme Towards Comprehensive Entity Recognition.Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)LREC-COLING 2024 - The 2024 Joint International Conference on Computational Linguistics, Language Resources and EvaluationTorino, ItalyELRA and ICCLMay 2024HALback to text
- 38 inproceedingsTopic-guided Example Selection for Domain Adaptation in LLM-based Machine Translation.Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop18th Conference of the European Chapter of the Association for Computational Linguistics: Student Research WorkshopSt. Julians, Malta2024HAL
- 39 inproceedingsWhen your Cousin has the Right Connections: Unsupervised Bilingual Lexicon Induction for Related Data-Imbalanced Languages.LREC-Coling 2024 - Joint International Conference on Computational Linguistics, Language Resources and EvaluationProceedings of the The 2024 Joint International Conference on Computational Linguistics, Language Resources and EvaluationTorino, Italy2024HAL
- 40 inproceedingsDecoding Persuasiveness in Eloquence Competitions: An Investigation into the LLM’s Ability to Assess Public Speaking.ICAART 2025 - 17th International Conference on Agents and Artificial IntelligencePorto, PortugalSCITEPRESS - Science and Technology Publications2025, 538-546HALDOI
- 41 inproceedingsEvaluer BLOOM en français.Actes de l'Atelier sur l'évaluation des modèles génératifs (LLM) et challenge d'extraction d'information few-shotEvalLLM2024 - Atelier sur l'évaluation des modèles génératifs (LLM) et challenge d'extraction d'information few-shotToulouse, FranceJuly 2024HALback to text
- 42 inproceedingsTranslate your Own: a Post-Editing Experiment in the NLP domain.Proceedings of the 25th Annual Conference of the European Association for Machine TranslationThe 25th Annual Conference of the European Association for Machine TranslationSheffield, United KingdomJune 2024HALback to text
- 43 inproceedingsTEI Specifications for a Sustainable Management of Digitized Holocaust Testimonies.Proceedings of the First Workshop on Holocaust Testimonies as Language Resources (HTRes) @LREC-COLING 2024First Workshop on Holocaust Testimonies as Language Resources (HTRes) @LREC-COLING 2024Turin, ItalyMay 2024HALback to text
- 44 inproceedingsAn ODD Schema for a Sustainable Encoding of Catalog Objects.TEI 2024 – Texts, Languages and CommunitiesBuenos Aires, ArgentinaOctober 2024HAL
- 45 inproceedingsAcoustic Characterization of Huntington's Disease Emotional Expression: An Explainable AI Approach.ACIIW 2024 - 12th International Conference on Affective Computing and Intelligent Interaction Workshops and DemosGlasgow, United KingdomSeptember 2024HAL
- 46 inproceedingsCATMuS Medieval: A multilingual large-scale cross-century dataset in Latin script for handwritten text recognition and beyond.2024 International Conference on Document Analysis and Recognition (ICDAR)14806Lecture Notes in Computer ScienceAthens, GreeceSpringer Nature Switzerland2024, 174-194HALDOIback to text
- 47 inproceedingsMolyé: A Corpus-based Approach to Language Contact in Colonial France.NLP4DH 2024 - 4th International Conference on Natural Language Processing for Digital HumanitiesMiami, United StatesAugust 2024HALback to text
- 48 inproceedingsRepurposing Holocaust-Related Digital Scholarly Editions to Develop Multilingual Domain-Specific Named Entity Recognition Tools.Proceedings of the First Workshop on Holocaust Testimonies as Language Resources (HTRes) @ LREC-COLING 2024First Workshop on Holocaust Testimonies as Language Resources (HTRes) @ LREC-COLING 2024Torino, ItalyMay 2024HAL
- 49 inproceedingsBuilding and Assessing a Named Entity Recognition Resource for Ancient Pharmacopeias.ECAI 2024ECAI 2024 - 27th European conference on artificial intelligence392Frontiers in Artificial Intelligence and ApplicationsSantiago de Compostela, SpainIOS PressOctober 2024, 2354-2361HALDOI
- 50 inproceedingsReconnaissance des écritures dans les imprimés: CATMuS print : un modèle générique, multilingue et diachronique.Humanistica 2024OCRMeknès, Morocco2024HALback to text
- 51 inproceedingsThe birth of French orthography. A computational analysis of French spelling systems in diachrony.CHR2024 – Computational Humanities Research ConferenceAarhus, DenmarkDecember 2024HALback to text
- 52 inproceedingsAnisotropy Is Inherent to Self-Attention in Transformers.EACL 2024 - 18th Conference of the European Chapter of the Association for Computational LinguisticsProceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)St Julians, MaltaMarch 2024, 35–48HALback to text
- 53 inproceedingsOn the Scaling Laws of Geographical Representation in Language Models.The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) - Main Conference ProceedingsLREC-Coling 2024 - Joint International Conference on Computational Linguistics, Language Resources and EvaluationTorino, ItalyMay 2024, 12416–12422HALback to text
- 54 inproceedingsTranscrire un manuscrit en grec ancien: Un modèle de reconnaissance automatique pour le codex Palatinus graecus 23.Humanistica 2024OCRMeknès, MoroccoMay 2024HALback to text
- 55 inproceedingsThe Curious Decline of Linguistic Diversity: Training Language Models on Synthetic Text.NAACL 2024 Findings - Annual Conference of the North American Chapter of the Association for Computational LinguisticsMexico City, MexicoApril 2024HALback to text
- 56 inproceedingsMAFALDA: A Benchmark and Comprehensive Study of Fallacy Detection and Classification.Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language TechnologiesNAACL 2024 - North American Chapter of the Association for Computational Linguistics1: Long PapersMexico City, MexicoAssociation for Computational LinguisticsJune 2024, 4810-4845HALDOIback to text
- 57 inproceedingsOn Modelling Corpus Citations in Computational Lexical Resources.Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)Turin, ItalyMay 2024, 12385--12394HAL
- 58 inproceedingsDoes Context Matter? Enhancing Handwritten Text Recognition with Metadata in Historical Manuscripts.CHR2024 – Computational Humanities Research ConferenceAarhus, DenmarkDecember 2024HALback to text
- 59 inproceedingsFindings of the WMT24 General Machine Translation Shared Task: The LLM Era is Here but MT is Not Solved Yet.Proceedings of the Ninth Conference on Machine TranslationWMT 2024 - Ninth Conference on Machine TranslationMiami, Florida, United States2024, 1–46HALback to text
- 60 inproceedingsUniversal NER: A Gold-Standard Multilingual Named Entity Recognition Benchmark.Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language TechnologiesMexico City, MexicoJune 2024HALDOIback to text
- 61 inproceedingsGenerating English Synthetic Documents with Clinical Keywords: A Privacy-Sensitive Methodology.First Workshop on Patient-Oriented Language Processing @LREC-COLING-2024 (CL4Health) - Workshop ProceedingsFirst Workshop on Patient-Oriented Language Processing (CL4Health)Torino, ItalyMay 2024HALback to text
- 62 inproceedingsConversational Grounding: Annotation and Analysis of Grounding Acts and Grounding Units.Proceedings of LREC-COLING 2024LREC-COLING 2024 - The 2024 Joint International Conference on Computational Linguistics, Language Resources and EvaluationTurin, ItalyMay 2024HALback to text
- 63 inproceedingsEvaluating the Effectiveness of Large Language Models in Establishing Conversational Grounding.EMNLP 2024 - Conference on Empirical Methods in Natural Language ProcessingMiami, United StatesAssociation for Computational Linguistics2024, 9767-9781HALDOIback to text
- 64 inproceedingsFindings of the WMT 2024 Biomedical Translation Shared Task: Test Sets on Abstract Level.Proceedings of the 9th Conference on Machine TranslationWMT24 - Ninth Conference on Machine TranslationMiami, Florida, United States2024, 124–138HALback to text
- 65 inproceedingsMultimodal models of repair in social human-agent interactions.WACAI ’24 - Proceedings of Workshop sur les “Affects, Compagnons Artificiels et Interactions” (ACAI)WACAI 2024: Workshop Affect, Compagnons Artificiels et InteractionsBordeaux, FranceJune 2024HALback to text
- 66 inproceedingsExploration of Human Repair Initiation in Task-oriented Dialogue : A Linguistic Feature-based Approach.Proceedings of the 25th Annual Meeting of the Special Interest Group on Discourse and DialogueSIGDIAL 2024 - 25th Meeting of the Special Interest Group on Discourse and DialogueKyoto, JapanSeptember 2024, 603-609HALback to text
- 67 inproceedingsMaking Sentence Embeddings Robust to User-Generated Content.Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)Torino, Italy2024, 10984--10998HALback to text
- 68 inproceedingsCloaked Classifiers: Pseudonymization Strategies on Sensitive Classification Tasks.Proceedings of the fifth Workshop on Privacy in Natural Language ProcessingFifth Workshop on Privacy in Natural Language ProcessingBangkok, ThailandAugust 2024HALback to text
- 69 inproceedingsBeyond Dataset Creation: Critical View of Annotation Variation and Bias Probing of a Dataset for Online Radical Content Detection.COLING 2025 - 31st International Conference on Computational LinguisticsAbu Dhabi, United Arab EmiratesJanuary 2025HALback to text
- 70 inproceedingsExperimenting With Generic Recognition Systems for Kuzushiji Documents: Furigana Extraction as a Use-Case.Proceedings of JADH Conference, vol. 2024JADH2024 - 13th Conference of Japanese Association for Digital Humanities “Leveraging AI and Digital Humanities for Sustainable Infrastructure”Tokyo, JapanSeptember 2024HAL
- 71 inproceedingsCamemBERT-bio: Leveraging Continual Pre-training for Cost-Effective Models on French Biomedical Data.LREC-COLING 2024 - The 2024 Joint International Conference on Computational Linguistics, Language Resources and EvaluationTorino, ItalyMay 2024HALback to textback to text
- 72 inproceedingsTree of Problems: Improving structured problem solving with compositionality.Proceedings of the 2024 Conference on Empirical Methods in Natural Language ProcessingEMNLP 2024 - Conference on Empirical Methods in Natural Language ProcessingProceedings of the 2024 Conference on Empirical Methods in Natural Language ProcessingMiami, FL, United StatesOctober 2024, 18028–18047HALback to text
- 73 inproceedingsPatentEval: Understanding Errors in Patent Generation.NAACL2024 - 2024 Annual Conference of the North American Chapter of the Association for Computational LinguisticsMexico City, MexicoJune 2024HALback to text
National peer-reviewed Conferences
- 74 inproceedingsÉvaluation de l’apport des chaînes de coréférences pour le liage d’entités.Actes de JEP-TALN-RECITAL 2024. 31ème Conférence sur le Traitement Automatique des Langues Naturelles, volume 1 : articles longs et prises de position35èmes Journées d'Études sur la Parole (JEP 2024) 31ème Conférence sur le Traitement Automatique des Langues Naturelles (TALN 2024) 26ème Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RECITAL 2024)Toulouse, FranceATALA & AFPC2024, 397-409HALback to text
- 75 inproceedingsÀ propos des difficultés de traduire automatiquement de longs documents.35èmes Journées d'Études sur la Parole (JEP 2024)35èmes Journées d'Études sur la Parole (JEP 2024) 31ème Conférence sur le Traitement Automatique des Langues Naturelles (TALN 2024) 26ème Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RECITAL 2024)1 : articles longs et prises de positionToulouse, FranceATALA & AFPC2024, 2-21HALback to text
Conferences without proceedings
- 76 inproceedingsScience ouverte et Projets numériques.Rencontre annuelle du DIM PAMIRParis, FranceDecember 2024HAL
- 77 inproceedingsStreamlining the Creation of Holocaust-related Digital Editions with Automatic Tools.EHRI Academic Conference - Researching the Holocaust in the Digital AgeWarsaw, PolandJune 2024HALback to text
- 78 inproceedingsCollaboration and Transparency: A User-Generated Documentation for eScriptorium.DH2024 Reinvention & ResponsibilityWashington D. C., United StatesAugust 2024HALback to text
- 79 inproceedingsDo (colored) backgrounds matter? An experiment on artificially augmented ground truth for handwritten text recognition applied to historical manuscripts.CSDH/SCHN 2024: Sustaining Shared FuturesMontréal, CanadaJune 2024HAL
- 80 inproceedingsLeveraging EHRI Online Editions for training automated edition tools.EHRI Workshop Natural Language Processing Meets Holocaust ArchivesPrague, Czech RepublicMarch 2024HAL
- 81 inproceedingsDistributed Texts Services: Présentation.Journées Biblissima+: Partager, décloisonner, réutiliser : outiller la recherche et développer de nouveaux usagesAubervilliers, FranceMay 2024HAL
- 82 inproceedingsLayout Analysis Dataset with SegmOnto.DH2024 - Annual conference of the Alliance of Digital Humanities OrganizationsWashington DC, United States2024HALback to text
- 83 inproceedingsThe CATMuS initiative: building large and diverse corpora for handwritten text recognition.DH AI Seminar 2024 - Digital Humanities / Artificial IntelligenceParis, FranceMay 2024HALback to text
- 84 inproceedingsLe DIM PAMIR et l’inter/pluridisciplinarité.L’interdisciplinarité en action : les projets et les infrastructuresParis, FranceMay 2024HAL
- 85 inproceedingsVers un modèle diachronique pour les mains modernes françaises.Humanistica 2024 - Colloque annuel de l'Association francophone des humanités numériquesMeknès, MoroccoMay 2024HAL
- 86 inproceedingsAutomatic retro-structuration of auction sales catalogs layout and content.DH2024 - Reinvention and ResponsibilityWashinghton DC, United StatesAugust 2024HALback to text
- 87 inproceedingsSocio-Emotional Response Generation: A Human Evaluation Protocol for LLM-Based Conversational Systems.AHRI 2024 : The 3rd Workshop on Affective Human-Robot Interaction at ACII 2024Glasgow, United KingdomSeptember 2024HAL
- 88 inproceedingsSynthetic lines from historical manuscripts: an experiment using GAN and style transfer.Visual Processing of Digital Manuscripts: Workflows, Pipelines, Best Practices. ICIAP 2023 Workshops. ICIAP 2023.14366Lecture Notes in Computer ScienceUdine, ItalySpringer Nature SwitzerlandJanuary 2024, 477-488HALDOI
Scientific book chapters
Doctoral dissertations and habilitation theses
- 90 thesis. Understanding the automatic text recognition process: model training, ground truth and prediction errors. Le Mans Université, November 2024. HAL.
- 91 thesis. Sentence Embeddings for Massively Multilingual Speech and Text Processing. Sorbonne Université, March 2024. HAL.
- 92 thesis. Improving Representations for Language Modeling. Sorbonne Université, December 2024. HAL.
- 93 thesis. Spoken Language Modeling from Raw Audio. Sorbonne Université, April 2024. HAL.
Reports & preprints
- 94 misc. AFRIDOC-MT: Document-level MT Corpus for African Languages. January 2025. HAL.
- 95 misc. CamemBERT 2.0: A Smarter French Language Model Aged to Perfection. November 2024. HAL.
- 96 misc. Les modèles Bloom pour le traitement automatique de la langue française. February 2024. HAL.
- 97 misc. How to build an Open Science Monitor based on publications? A French perspective. December 2024. HAL.
- 98 report. Chaînes d'acquisition, de traitement et de publication du texte : des images à la mise en ligne. Consortium Ariane - Axe 1, October 2024. HAL.
- 99 misc. Do (colored) backgrounds matter? An experiment on artificially augmented ground truth for handwritten text recognition applied to historical manuscripts. January 2024. HAL.
- 100 misc. BigO(Bench) - Can LLMs Generate Code with Controlled Time and Space Complexity? March 2025. HAL.
- 101 misc. Diachronic Document Dataset for Semantic Layout Analysis. November 2024. HAL.
- 102 report. Survey of Automatic Metrics for Evaluating Machine Translation at the Document Level. Inria Paris; Sorbonne Université, October 2024. HAL.
- 103 misc. KréyoLID: From Language Identification Towards Language Mining. March 2025. HAL.
- 104 misc. Towards Zero-Shot Multimodal Machine Translation. July 2024. HAL.
- 105 misc. mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus. June 2024. HAL.
- 106 misc. Why do small language models underperform? Studying Language Model Saturation via the Softmax Bottleneck. April 2024. HAL.
- 107 report. Preliminary WMT24 Ranking of General MT Systems and LLMs. WMT, July 2024. HAL.
- 108 misc. Harvesting Textual and Structured Data from the HAL Publication Repository. July 2024. HAL.
- 109 misc. Explicit Learning and the LLM in Machine Translation. March 2025. HAL.
- 110 report. Handling Very Long Contexts in Neural Machine Translation: a Survey. Deliverable D3-2.1, ANR project MaTOS, June 2024, 50 pages. HAL.
- 111 misc. Investigating Length Issues in Document-level Machine Translation. December 2024. HAL.
- 112 report. Model Cards for the MaTOS Project. ANR project MaTOS, November 2024. HAL.
- 113 misc. Compositional Translation: A Novel LLM-based Approach for Low-resource Machine Translation. March 2025. HAL.
- 114 misc. In-Context Example Selection via Similarity Search Improves Low-Resource Machine Translation. August 2024. HAL.
Other scientific publications
- 115 misc. DataCatalogue : Restructurer automatiquement les catalogues de ventes. Paris, France, January 2024. HAL.
- 116 misc. TEI Publisher: A Platform for Digital Editions. Paris / Virtual, France, January 2024. HAL.
- 117 misc. Referee report for: The artificial intelligence cooperative: READ-COOP, Transkribus, and the benefits of shared community infrastructure for automated text recognition [version 1]. 2025. HAL, DOI.
Scientific popularization
12.3 Cited publications
- 120 inproceedings. Towards a Cleaner Document-Oriented Multilingual Crawled Corpus. Proceedings of the Thirteenth Language Resources and Evaluation Conference (LREC 2022). Marseille, France, June 2022. HAL.
- 121 inproceedings. Ungoliant: An Optimized Pipeline for the Generation of a Very Large-Scale Multilingual Web Corpus. CMLC 2021 - 9th Workshop on Challenges in the Management of Large Corpora. Limerick / Virtual, Ireland, July 2021. HAL, DOI.
- 122 article. Online Extremism Detection in Textual Content: A Systematic Literature Review. IEEE Access, vol. 9, 2021, pp. 42384-42396.
- 123 inproceedings. Data-Efficient French Language Modeling with CamemBERTa. Findings of the Association for Computational Linguistics: ACL 2023. Toronto, Canada: Association for Computational Linguistics, July 2023, pp. 5174-5185. HAL, DOI.
- 124 misc. Can LLMs Really Learn to Translate a Low-Resource Language from One Grammar Book? arXiv:2409.19151, September 2024. URL: http://arxiv.org/abs/2409.19151. DOI.
- 125 inproceedings. ParaCrawl: Web-Scale Acquisition of Parallel Corpora. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association for Computational Linguistics, July 2020, pp. 4555-4567. URL: https://aclanthology.org/2020.acl-main.417. DOI.
- 126 inproceedings. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (FAccT '21), Virtual Event, Canada. New York, NY, USA: Association for Computing Machinery, 2021, pp. 610-623. URL: https://doi.org/10.1145/3442188.3445922. DOI.
- 127 article. Graph of Thoughts: Solving Elaborate Problems with Large Language Models. Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 16, March 2024, pp. 17682-17690. URL: http://dx.doi.org/10.1609/aaai.v38i16.29720. DOI.
- 128 inproceedings. Analyzing Zero-Shot transfer Scenarios across Spanish variants for Hate Speech Detection. Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023). Dubrovnik, Croatia: Association for Computational Linguistics, May 2023, pp. 1-13. HAL, DOI.
- 129 inproceedings. A Validation of DRAM RAPL Power Measurements. Proceedings of the Second International Symposium on Memory Systems (MEMSYS '16), Alexandria, VA, USA. New York, NY, USA: Association for Computing Machinery, 2016, pp. 455-470. URL: https://doi.org/10.1145/2989081.2989088. DOI.
- 130 inproceedings. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019), Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pp. 4171-4186. URL: https://www.aclweb.org/anthology/N19-1423/.
- 131 inproceedings. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics, June 2019, pp. 4171-4186. URL: https://aclanthology.org/N19-1423. DOI.
- 132 inproceedings. Training Neural Machine Translation to Apply Terminology Constraints. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence, Italy: Association for Computational Linguistics, July 2019, pp. 3063-3068. URL: https://aclanthology.org/P19-1294/. DOI.
- 133 unpublished. SONAR: Sentence-Level Multimodal and Language-Agnostic Representations. October 2023, working paper or preprint. HAL.
- 134 inproceedings. "cba to check the spelling": Investigating Parser Performance on Discussion Forum Posts. NAACL 2010. Los Angeles, California, 2010.
- 135 inproceedings. Tackling Ambiguity with Images: Improved Multimodal Machine Translation and Contrastive Evaluation. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Toronto, Canada, July 2023, pp. 5394-5413. HAL, DOI.
- 136 article. Online Extremism Detection: A Systematic Literature Review With Emphasis on Datasets, Classification Techniques, Validation Methods, and Tools. IEEE Access, vol. 9, 2021, pp. 48364-48404.
- 137 unpublished. Is Anisotropy Inherent to Transformers? October 2023, ACL-SRW 2023 (poster). HAL.
- 138 unpublished. DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing. 2023.
- 139 misc. Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets. 2022. URL: https://arxiv.org/abs/2103.12028. DOI.
- 140 article. Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets. Transactions of the Association for Computational Linguistics, vol. 10, 2022, pp. 50-72. URL: https://aclanthology.org/2022.tacl-1.4. DOI.
- 141 article. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692, 2019.
- 142 inproceedings. Evaluating the Impact of Text De-Identification on Downstream NLP Tasks. Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa). Tórshavn, Faroe Islands: University of Tartu Library, May 2023, pp. 10-16. URL: https://aclanthology.org/2023.nodalida-1.2.
- 143 unpublished. CamemBERT: a Tasty French Language Model. October 2019. Website: https://camembert-model.fr. HAL.
- 144 inproceedings. Distributed Representations of Words and Phrases and their Compositionality. Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013, Lake Tahoe, Nevada, United States, December 5-8, 2013, pp. 3111-3119. URL: http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.
- 145 inproceedings. Multilingual Auxiliary Tasks Training: Bridging the Gap between Languages for Zero-Shot Transfer of Hate Speech Detection Models. Findings of AACL-IJCNLP 2022. Online, November 2022. HAL.
- 146 misc. Linguistic Variation. 2019. URL: https://www.thoughtco.com/what-is-linguistic-variation-1691242.
- 147 phdthesis. A Data-driven Approach to Natural Language Processing for Contemporary and Historical French. Sorbonne Université, June 2022. HAL.
- 148 inproceedings. Establishing a New State-of-the-Art for French Named Entity Recognition. LREC 2020 - 12th Language Resources and Evaluation Conference. Marseille, France, May 2020 (conference cancelled due to the COVID-19 pandemic; proceedings available at http://www.lrec-conf.org/proceedings/lrec2020/index.html). HAL.
- 149 inproceedings. A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages. ACL 2020 - 58th Annual Meeting of the Association for Computational Linguistics. Seattle / Virtual, United States, July 2020. HAL, DOI.
- 150 inproceedings. Asynchronous Pipeline for Processing Huge Corpora on Medium to Low Resource Infrastructures. 7th Workshop on the Challenges in the Management of Large Corpora (CMLC-7). Cardiff, United Kingdom: Leibniz-Institut für Deutsche Sprache, July 2019. HAL, DOI.
- 151 article. SAIRUS: Spatially-aware identification of risky users in social networks. Information Fusion, vol. 92, 2023, pp. 435-449. URL: https://www.sciencedirect.com/science/article/pii/S1566253522002457. DOI.
- 152 inproceedings. CATMuS-Medieval: Consistent Approaches to Transcribing ManuScripts. Digital Humanities - DH2024, ADHO. Washington, DC, United States, August 2024. HAL.
- 153 inproceedings. The “. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Abu Dhabi, United Arab Emirates: Association for Computational Linguistics, December 2022, pp. 10671-10682. URL: https://aclanthology.org/2022.emnlp-main.731/. DOI.
- 154 article. Treebanking user-generated content: a UD based overview of guidelines, corpora and unified recommendations. Language Resources and Evaluation, vol. 57, no. 2, February 2022, pp. 493-544. HAL, DOI.
- 155 article. Green AI. Communications of the ACM, vol. 63, no. 12, November 2020, pp. 54-63. URL: https://doi.org/10.1145/3381831. DOI.
- 156 inproceedings. Energy and Policy Considerations for Deep Learning in NLP. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence, Italy: Association for Computational Linguistics, July 2019, pp. 3645-3650. URL: https://aclanthology.org/P19-1355. DOI.
- 157 inproceedings. Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Online: Association for Computational Linguistics, November 2020, pp. 9275-9293. URL: https://aclanthology.org/2020.emnlp-main.746/. DOI.
- 158 article. A survey on extremism analysis using Natural Language Processing. CoRR abs/2104.04069, 2021. URL: https://arxiv.org/abs/2104.04069.
- 159 inproceedings. CamemBERT-bio : Un modèle de langue français savoureux et meilleur pour la santé. Actes de CORIA-TALN 2023 - Actes de la 30e Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 1 : travaux de recherche originaux - articles longs. Paris, France: ATALA, June 2023, pp. 323-334. HAL.
- 160 misc. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. 2023. URL: https://arxiv.org/abs/2305.10601.