2024 Activity report
Project-Team SEMAGRAMME
RNSR: 201120979K
- Research center: Inria Centre at Université de Lorraine
- In partnership with: CNRS, Université de Lorraine
- Team name: Semantic Analysis of Natural Language
- In collaboration with: Laboratoire lorrain de recherche en informatique et ses applications (LORIA)
- Domain: Perception, Cognition and Interaction
- Theme: Language, Speech and Audio
Keywords
Computer Science and Digital Science
- A5.8. Natural language processing
- A7.2. Logic in Computer Science
- A9.4. Natural language processing
Other Research Topics and Application Domains
- B2. Health
- B9.6.8. Linguistics
- B9.9. Ethics
1 Team members, visitors, external collaborators
Research Scientists
- Philippe de Groote [Team leader, INRIA, Senior Researcher]
- Bruno Guillaume [INRIA, Researcher]
- Vincent Martin [INRIA, Researcher, from Oct 2024]
- Sylvain Pogodalla [INRIA, Researcher]
Faculty Members
- Maxime Amblard [UL, Professor]
- Karën Fort [UL, Professor, from Sep 2024]
- Karën Fort [SORBONNE UNIVERSITE, Associate Professor, until Aug 2024, UL appointed]
- Jacques Jayez [ENS DE LYON, Emeritus, UL appointed]
- Michel Musiol [UL, Professor Delegation]
PhD Students
- Clémentine Bleuze [UL, from Oct 2024]
- Hee-Soo Choi [UL, ATER, from Sep 2024]
- Hee-Soo Choi [UL, until Aug 2024]
- Marie Cousin [UL]
- Amandine Decker [UL]
- Fanny Ducel [UNIV PARIS SACLAY]
- Maxime Guillaume [YSEOP, CIFRE]
- Amandine Lecomte [Self funded]
- Siyana Pavlova [UL, from Sep 2024]
- Siyana Pavlova [UL, ATER, until Aug 2024]
- Valentin Richard [Univ Amsterdam, from Sep 2024]
- Valentin Richard [UL, until Aug 2024]
- Vincent Tourneur [UL, from Oct 2024]
Technical Staff
- Khensa Amani Daoudi [INRIA, Engineer]
- Iglika Zlatkova Nikolova-Stoupak [UL, from Oct 2024]
- Bertrand Remy [UL, until Sep 2024]
- Vincent Tourneur [INRIA, Engineer, until Aug 2024]
Interns and Apprentices
- Dylan Bettendroffer [INRIA, Intern, from Feb 2024 until Jul 2024]
- Clémentine Bleuze [UL, Intern, from Mar 2024 until Aug 2024]
- Maelle Cornely [INRIA, from Jun 2024 until Jul 2024]
- Ines Hernandez [INRIA, Apprentice, until Sep 2024]
- Zhengjian Li [LORIA, Intern, from Jul 2024 until Sep 2024]
- Emeric Licorni [UL, Intern, from Apr 2024 until Jun 2024]
- Eva Purson [UL, Intern, from Apr 2024 until Jun 2024]
- Nicolas Vincent [LORIA, Intern, from Jun 2024 until Aug 2024]
- Rémi de Vergnette [UL, Intern, from Mar 2024 until Aug 2024]
Administrative Assistants
- Juline Brevillet [UL, until Nov 2024]
- Véronique Constant [INRIA]
- Emmanuelle Deschamps [INRIA]
- Sophie Drouot [INRIA]
- Nathan Grandemange [UL]
- Ouiza Herbi [INRIA]
- Sylvie Hilbert [CNRS]
- Anne-Marie Messaoudi [UL]
- Gallown Nizard [UL, from Sep 2024]
- Cecilia Olivier [INRIA]
External Collaborator
- Mathieu Constant [UL, ATILF]
2 Overall objectives
2.1 Scientific Context
Computational linguistics is a discipline at the intersection of computer science and linguistics. On the theoretical side, it aims to provide computational models of the human language faculty. On the applied side, it is concerned with natural language processing and its practical applications.
From a structural point of view, linguistics is traditionally organized into the following sub-fields:
- Phonology, the study of language abstract sound systems.
- Morphology, the study of word structure.
- Syntax, the study of language structure, i.e., the way words combine into grammatical phrases and sentences.
- Semantics, the study of meaning at the levels of words, phrases, and sentences.
- Pragmatics, the study of the ways in which the meaning of an utterance is affected by its context.
Computational linguistics is concerned with all these fields. Consequently, various computational models, whose application domains range from phonology to pragmatics, have been developed. Among these, logic-based models play an important part, especially at the “highest” levels.
At the level of syntax, generative grammars may be seen as basic inference systems, while categorial grammars are based on substructural logics specified by Gentzen sequent calculi. Finally, model-theoretic grammars amount to sets of logical constraints to be satisfied.
At the level of semantics, the most common approaches derive from Montague grammars, which are based on the simply typed λ-calculus.
At the level of pragmatics, the situation is less clear. The word pragmatics has been introduced by Morris to designate the branch of philosophy of language that studies, besides linguistic signs, their relation to their users and the possible contexts of use. The definition of pragmatics was not quite precise, and, for a long time, several authors have considered (and some authors are still considering) pragmatics as the wastebasket of syntax and semantics. Nevertheless, as far as discourse processing is concerned (which includes pragmatic problems such as pronominal anaphora resolution), logic-based approaches have also been successful. In particular, Kamp's Discourse Representation Theory gave rise to sophisticated “dynamic” logics. The situation, however, is less satisfactory than it is at the semantic level. On the one hand, we are facing a kind of logical “tower of Babel”. The various pragmatic logic-based models that have been developed, while sharing underlying mathematical concepts, differ in several respects and are too often based on ad hoc features. As a consequence, they are difficult to compare and appear more as competitors than as collaborative theories that could be integrated. On the other hand, several phenomena related to discourse dynamics (e.g., context updating, presupposition projection and accommodation, contextual reference resolution...) are still lacking deep logical explanations. We strongly believe, however, that this situation can be improved by applying to pragmatics the same approach Montague applied to semantics, using the standard tools of mathematical logic.
Accordingly:
The overall objective of the Sémagramme project is to design and develop new unifying logic-based models, methods, and tools for the semantic analysis of natural language utterances and discourses. This includes the logical modeling of pragmatic phenomena related to discourse dynamics. Typically, these models and methods will be based on standard logical concepts (stemming from formal language theory, mathematical logic, and type theory), which should make them easy to integrate.
The project is organized along three research directions (i.e., syntax-semantics interface, discourse dynamics, and common basic resources), which interact as explained below.
Moreover, a transversal and transdisciplinary theme has been developed in the team in the past years: ethics in NLP and more generally in AI.
2.2 Syntax-Semantics Interface
The Sémagramme project intends to focus on the semantics of natural languages (in a wider sense than usual, including some pragmatics). Nevertheless, the semantic construction process is syntactically guided, that is, the constructions of logical representations of meaning are based on the analysis of the syntactic structures. We do not want, however, to commit ourselves to any specific theory of syntax. Consequently, our approach should be based on an abstract generic model of the syntax-semantics interface.
Here, an important idea of Montague comes into play, namely, the “homomorphism requirement”: semantics must appear as a homomorphic image of syntax. While this idea is almost a truism in the context of mathematical logic, it remains challenged in the context of natural languages. Nevertheless, Montague's idea has been quite fruitful, especially in the field of categorial grammars, where van Benthem showed how syntax and semantics could be connected using the Curry-Howard isomorphism. This correspondence is the keystone of the syntax-semantics interface of modern type-logical grammars. It also motivated the definition of our own Abstract Categorial Grammars 59.
Technically, an Abstract Categorial Grammar simply consists of a (linear) homomorphism between two higher-order signatures. Extensive studies have shown that this simple model allows several grammatical formalisms to be expressed, providing them with a syntax-semantics interface for free 57, 8.
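The idea of a grammar as a homomorphism between signatures can be illustrated with a small Python sketch. It is only a toy under stated assumptions: the abstract constants (`JOHN`, `MARY`, `LOVES`), the `Term` class, and the two lexicons are hypothetical names introduced for illustration, object-level terms are modeled as Python strings and functions rather than typed λ-terms, and linearity is not checked. It is not how ACGtk represents grammars.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Term:
    """An abstract term: a constant applied to zero or more argument terms."""
    head: str
    args: Tuple["Term", ...] = ()


def realize(term: Term, lexicon: Dict[str, object]):
    """Interpret `term` homomorphically: look up the head constant, then
    apply its image to the images of the arguments, one at a time."""
    value = lexicon[term.head]
    for arg in term.args:
        value = value(realize(arg, lexicon))
    return value


# Hypothetical abstract signature: JOHN, MARY : np ; LOVES : np -> np -> s.
# Each lexicon sends the same abstract constants to a different object language.
SURFACE = {
    "JOHN": "John",
    "MARY": "Mary",
    "LOVES": lambda subj: lambda obj: f"{subj} loves {obj}",
}
SEMANTICS = {
    "JOHN": "j",
    "MARY": "m",
    "LOVES": lambda subj: lambda obj: f"love({subj}, {obj})",
}

abstract = Term("LOVES", (Term("JOHN"), Term("MARY")))
surface_form = realize(abstract, SURFACE)
logical_form = realize(abstract, SEMANTICS)
print(surface_form)
print(logical_form)
```

Because both lexicons interpret the same abstract term, the surface string and the logical form are related through that shared structure, which is the sense in which the interface comes "for free".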
We intend to carry on with the development of the Abstract Categorial Grammar framework. At the foundational level, we will define and study possible type theoretic extensions of the formalism, in order to increase its expressive power and its flexibility. At the implementation level, we will continue the development of an Abstract Categorial Grammar support system.
As said above, considering the syntax-semantics interface as the starting point of our investigations allows us not to be committed to some specific syntactic theory. The Montagovian syntax-semantics interface, however, cannot be considered to be universal. In particular, it does not seem to be well adapted to dependency and model-theoretic grammars. Consequently, in order to be as generic as possible, we intend to explore alternative models of the syntax-semantics interface. In particular, we will explore relational models where several distinct semantic representations can correspond to the same syntactic structure.
2.3 Discourse Dynamics
It is well known that the interpretation of a discourse is a dynamic process. Take a sentence occurring in a discourse. On the one hand, it must be interpreted according to its context. On the other hand, its interpretation affects this context, and must therefore result in an updating of the current context. For this reason, discourse interpretation is traditionally considered to belong to pragmatics. The cut between pragmatics and semantics, however, is not that clear.
As we mentioned above, we intend to apply to some aspects of pragmatics (mainly, discourse dynamics) the same methodological tools Montague applied to semantics. The challenge here is to obtain a completely compositional theory of discourse interpretation, by respecting Montague's homomorphism requirement. We think that this is possible by using techniques coming from programming language theory, in particular, continuation semantics, and the related theories of functional control operators.
We have indeed successfully applied such techniques in order to model the way quantifiers in natural languages may dynamically extend their scope 58. We intend to tackle, in a similar way, other dynamic phenomena (typically, anaphora and referential expressions, presupposition, modal subordination...).
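As a rough illustration of the continuation-based view, the following Python sketch treats a sentence as a function of its left context and of a continuation standing for the rest of the discourse. Everything in it is an assumption made for the example: formulas are plain strings, the context is a list of variable names, and `sel` is a hypothetical resolution oracle that simply picks the most recent referent.

```python
from typing import Callable, List

Context = List[str]                      # discourse referents introduced so far
Continuation = Callable[[Context], str]  # the rest of the discourse, given a context
Sentence = Callable[[Context, Continuation], str]


def sel(ctx: Context) -> str:
    """Hypothetical anaphora oracle: choose the most recently introduced referent."""
    return ctx[-1]


def a_man_entered(ctx: Context, k: Continuation) -> str:
    x = f"x{len(ctx)}"  # fresh variable name
    # The continuation is evaluated inside the scope of the existential quantifier,
    # with a context extended by the new referent.
    return f"∃{x}.(man({x}) ∧ enter({x}) ∧ {k(ctx + [x])})"


def he_smiled(ctx: Context, k: Continuation) -> str:
    return f"smile({sel(ctx)}) ∧ {k(ctx)}"


def seq(s1: Sentence, s2: Sentence) -> Sentence:
    """Discourse composition: s2 becomes part of the continuation of s1."""
    return lambda ctx, k: s1(ctx, lambda ctx1: s2(ctx1, k))


discourse = seq(a_man_entered, he_smiled)
formula = discourse([], lambda ctx: "⊤")  # empty context, trivial final continuation
print(formula)
```

The pronoun in the second sentence ends up bound by the quantifier of the first one without any change to the meaning of either sentence; only the composition operator threads the context.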
What characterizes these different dynamic phenomena is that their interpretations need information to be retrieved from a current context. This raises the question of the modeling of the context itself. At a foundational level, we have to answer questions such as the following. What is the nature of the information to be stored in the context? What are the processes that allow implicit information to be inferred from the context? What are the primitives that allow a context to be updated? How do the structure of the discourse and the discourse relations affect the structure of the context? These questions also raise implementation issues. What are the appropriate data types? How can we keep the complexity of the inference algorithms sufficiently low?
2.4 Common Basic Resources
Even if our research primarily focuses on semantics and pragmatics, we nevertheless need syntax. More precisely, we need syntactic trees to start with. We consequently need grammars, lexicons, and parsing algorithms to produce such trees. During the last years, we have developed the notion of interaction grammar 60 and graph rewriting 3, 4 as models of natural language syntax. This includes the development of grammars for French 74, together with morphosyntactic lexicons. We intend to continue this line of research and development. In particular, we want to increase the coverage of our grammars for French, and provide our parsers with more robust algorithms.
Further primary resources are needed in order to put to work a computational semantic analysis of utterances and discourses. As we want our approach to be as compositional as possible, we must develop lexicons annotated with semantic information. This opens the quite wide research area of lexical semantics.
Finally, when dealing with logical representations of utterance interpretations, the need for inference facilities is ubiquitous. Inference is needed in the course of the interpretation process, but also to exploit the result of the interpretation. Indeed, an advantage of using formal logic for semantic representations is the possibility of using logical inference to derive new information. From a computational point of view, however, logical inference may be highly complex. Consequently, we need to investigate which logical fragments can be used efficiently for natural language oriented inference.
3 Research program
3.1 Overview
The research program of Sémagramme aims to develop models based on well-established mathematics. We seek two main advantages from this approach. On the one hand, by relying on mature theories, we have at our disposal sets of mathematical tools that we can use to study our models. On the other hand, developing various models on a common mathematical background will make them easier to integrate, and will ease the search for unifying principles.
The main mathematical domains on which we rely are formal language theory, symbolic logic, and type theory.
3.2 Formal Language Theory
Formal language theory studies the purely syntactic and combinatorial aspects of languages, seen as sets of strings (or possibly trees or graphs). Formal language theory has been especially fruitful for the development of parsing algorithms for context-free languages. We use it, in a similar way, to develop parsing algorithms for formalisms that go beyond context-freeness. Language theory also appears to be very useful in formally studying the expressive power and the complexity of the models we develop.
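For the context-free case mentioned above, a classical example of such a parsing algorithm is CYK recognition. The sketch below is a minimal Python version for a grammar in Chomsky normal form; the toy grammar and the names `lexical` and `binary` are assumptions made for the example, and the formalisms beyond context-freeness studied by the team need more elaborate algorithms than this one.

```python
from itertools import product


def cyk_recognize(words, lexical, binary, start="S"):
    """Return True if `words` is derivable from `start`.

    lexical: maps a word to the set of nonterminals that rewrite to it.
    binary:  maps a pair (B, C) to the set of nonterminals A with a rule A -> B C.
    """
    n = len(words)
    if n == 0:
        return False
    # chart[i][j] holds the nonterminals deriving words[i..j] (inclusive).
    chart = [[set() for _ in range(n)] for _ in range(n)]
    for i, word in enumerate(words):
        chart[i][i] = set(lexical.get(word, ()))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for split in range(i, j):
                for left, right in product(chart[i][split], chart[split + 1][j]):
                    chart[i][j] |= binary.get((left, right), set())
    return start in chart[0][n - 1]


# Hypothetical toy grammar: S -> NP VP, VP -> V NP.
lexical = {"John": {"NP"}, "Mary": {"NP"}, "loves": {"V"}}
binary = {("NP", "VP"): {"S"}, ("V", "NP"): {"VP"}}

print(cyk_recognize("John loves Mary".split(), lexical, binary))
print(cyk_recognize("loves John Mary".split(), lexical, binary))
```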
3.3 Symbolic Logic
Symbolic logic (and, more particularly, proof theory) is concerned with the study of the expressive and deductive power of formal systems. In a rule-based approach to computational linguistics, the use of symbolic logic is ubiquitous. As we previously said, at the level of syntax, several kinds of grammars (generative, categorial...) may be seen as basic deductive systems. At the level of semantics, the meaning of an utterance is captured by computing (intermediate) semantic representations that are expressed as logical forms. Finally, using symbolic logics allows one to formalize notions of inference and entailment that are needed at the level of pragmatics.
3.4 Type Theory and Typed Lambda-Calculus
Among the various possible logics that may be used, Church's simply typed λ-calculus
4 Application domains
4.1 Deep Semantic Analysis
Our application domains are natural language processing applications that rely on a deep semantic analysis. For instance, one may cite the following ones:
- textual entailment and inference,
- dialogue systems,
- semantic-oriented query systems,
- content analysis of unstructured documents,
- (semi) automatic knowledge acquisition,
- discourse structure analysis (argumentative relations, discourse markers),
- lexical resources.
4.2 Text Transformation
Text transformation is an application domain featuring two important sub-fields of computational linguistics:
- parsing, from surface form to abstract representation,
- generation, from abstract representation to surface form.
Text simplification and automatic summarization belong to that domain.
We aim to use the framework of Abstract Categorial Grammars that we develop to this end. It is indeed a reversible framework that allows both parsing and generation. Its underlying mathematical structure of
4.3 Types for discourse markers
While there is a rich descriptive literature on Discourse Markers (DM), for instance words/expressions like so or yet in English, the question of their representation in type systems is understudied. In addition to basic types such as individuals or events, or simple functional types (properties, etc.), DM are known to operate on domains like states of affairs, beliefs or speech acts. The entities inhabiting these domains are themselves complex. For instance, speech acts involve discourse planning in the form of a network of intentions and actions. Moreover, DM can combine with one another, forming clusters whose meaning is not always apparent from the meanings of the component DM. Within the context of the ANR CODIM, we aim at developing a typing system for (i) taking into account the array of types denoted by DM and (ii) addressing the questions of the semantic nature of their combinations.
5 Social and environmental responsibility
5.1 Footprint of research activities
ANR InExtenso:
WP4 of the project is dedicated to the evaluation of the environmental impact of LLMs. More precisely, it aims to propose a method for measuring the environmental impact of digital health, and to use it in the project evaluations and beyond.
6 Highlights of the year
The project proposal MALINCA (ERC Synergy Grant call 2024, ERC-2024-SyG) has been accepted for funding.
7 New software, platforms, open data
7.1 New software
7.1.1 ACGtk
- Name: Abstract Categorial Grammar Development Toolkit
- Keywords: Natural language processing, Functional programming, Logic programming, Lambda-calculus, OCaml
- Scientific Description: Abstract Categorial Grammars (ACG) are a grammatical formalism in which grammars are based on typed lambda-calculus. A grammar generates two languages: the abstract language (the language of parse structures), and the object language (the language of the surface forms, e.g., strings, or higher-order logical formulas), which is the realization of the abstract language. ACGtk provides two software tools to develop and to use ACGs: acgc, which is a grammar compiler, and acg, which is an interpreter of a command language that allows one, in particular, to parse and realize terms.
- Functional Description: ACGtk provides a piece of software for developing and using Abstract Categorial Grammars (ACG).
- Release Contributions: This new version of the software provides heavy internal modifications to manage and enumerate the parses of an expression, in particular when the latter is highly ambiguous. These modifications also prepare the integration of features to sort the results according to weight information associated with grammatical rules.
- URL:
- Publications:
- Contact: Sylvain Pogodalla
- Participants: Philippe de Groote, Pierre Ludmann, Jiri Marsik, Sylvain Pogodalla, Vincent Tourneur
7.1.2 Grew
- Name: Graph Rewriting
- Keywords: Semantics, Syntactic analysis, NLP, Graph rewriting
- Functional Description: Grew is a Graph Rewriting tool dedicated to applications in NLP. Grew takes into account confluent and non-confluent graph rewriting and it includes several mechanisms that help to use graph rewriting in the context of NLP applications (built-in notion of feature structures, parametrization of rules with lexical information).
- News of the Year: In 2024, a new version, 1.16, was released (together with several bug fixes). The major new features are new code for corpusbank handling and new syntax in requests and key clustering (`length` and `delta` between two nodes).
- URL:
- Publications:
- Contact: Bruno Guillaume
- Participants: Bruno Guillaume, Guy Perrier, Guillaume Bonfante
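To give an idea of what rewriting a dependency graph involves, here is a small sketch in plain Python. It does not use Grew's rule syntax or API: the graph encoding (a dict of node features plus a set of labelled edges), the function name `rewrite_dobj`, and the rule itself (relabel `dobj` as `obj` when the governor is a verb) are all assumptions chosen for illustration.

```python
def rewrite_dobj(nodes, edges):
    """Apply the rule 'relabel dobj as obj under a VERB governor' until no match remains.

    nodes: dict mapping a node id to its feature structure (a dict).
    edges: iterable of (governor, label, dependent) triples.
    Returns a new set of edges; the input is left untouched.
    """
    edges = set(edges)
    while True:
        # Pattern matching: look for one edge satisfying the rule's conditions.
        match = next(
            ((gov, label, dep) for (gov, label, dep) in sorted(edges)
             if label == "dobj" and nodes[gov].get("upos") == "VERB"),
            None,
        )
        if match is None:
            return edges
        gov, _, dep = match
        # Rewriting commands: delete the matched edge, add the relabelled one.
        edges.remove(match)
        edges.add((gov, "obj", dep))


nodes = {
    1: {"form": "John", "upos": "PROPN"},
    2: {"form": "loves", "upos": "VERB"},
    3: {"form": "Mary", "upos": "PROPN"},
}
edges = {(2, "nsubj", 1), (2, "dobj", 3)}
rewritten = rewrite_dobj(nodes, edges)
print(sorted(rewritten))
```

Each application removes one matching edge, so this particular rule always terminates and the result does not depend on the order in which matches are picked; non-confluent rule sets would require exploring several rewriting paths.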
7.1.3 HostoMytho
- Keywords: Game with a purpose, Natural language processing
- Functional Description: HostoMytho is a GWAP, or "game with a purpose", developed within the framework of the CODEINE ANR project. The aim of the game is to allow users to annotate medical files generated automatically, in order to evaluate their plausibility (quality of the language and medical semantics) and to add different layers of information (negation, hypothesis, time, etc.). HostoMytho is multiplatform.
- URL:
- Publication:
- Contact: Karën Fort
- Partners: LISN, CEA-List
7.1.4 Arborator-Grew
- Name: Arborator's Collaborative Annotation
- Keywords: Annotation tool, Syntactic analysis
- Functional Description: The online interface allows managing collaborative annotation projects in dependency syntax. It is possible to use Grew queries and also to directly rewrite graphs in the annotation tool.
- News of the Year: During 2024, we redesigned the tool's interface and refactored the code base for both frontend and backend. In parallel, we set up pipelines for automatic deployment and improved code documentation. In addition, we continued to work on improving existing functionalities and adding new ones based on user requests.
- URL:
- Publication:
- Contact: Bruno Guillaume
- Participants: Khensa Amani Daoudi, Bruno Guillaume, Gael Guibon, Kim Gerdes, Kirian Guiller
- Partners: Université Paris Nanterre, LIMSI, LISN
8 New results
8.1 Syntax-Semantics Interface
Participants: Maxime Amblard, Marie Cousin, Philippe de Groote, Bruno Guillaume, Maxime Guillaume, Sylvain Pogodalla, Siyana Pavlova, Valentin Richard, Zhengjian Li.
8.1.1 Abstract Categorial Grammars
Feature Structure
ACG has proven to be a powerful framework with well-defined theoretical properties. It was, however, lacking a facility which is useful and widely used for grammar engineering: feature structures. The latter are often used to express in a concise way some combinatorial properties related to morphosyntactic properties of expressions, for instance subject-verb agreement.
We worked on extending the ACG type system to provide a generic feature structure framework. This extension relies on a restricted addition of the product (records) and dependent types and still allows for the reduction of grammars to Datalog programs (which is used to implement ACG parsing in ACGtk, see Sec. 7). We ran an experiment with the actual Yseop proprietary grammars and the ACG system with features. It showed a significant improvement, both in the size of the grammar (decreased) and the efficiency of text generation (increased) 44.
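The role feature structures play in expressing agreement can be sketched with classical unification over nested dictionaries. This is only an illustration of the general notion: it is not the record and dependent-type extension of the ACG type system described above, and the feature names (`agr`, `num`, `pers`) and the function `unify` are assumptions introduced for the example.

```python
def unify(fs1, fs2):
    """Unify two feature structures given as (possibly nested) dicts.

    Returns the merged structure, or None if two atomic values clash.
    """
    result = dict(fs1)
    for key, value in fs2.items():
        if key not in result:
            result[key] = value
        elif isinstance(result[key], dict) and isinstance(value, dict):
            merged = unify(result[key], value)
            if merged is None:
                return None
            result[key] = merged
        elif result[key] != value:
            return None  # clash, e.g. singular vs. plural
    return result


# Hypothetical lexical entries for subject-verb agreement.
subject_she = {"cat": "np", "agr": {"num": "sg", "pers": 3}}
verb_sleeps = {"agr": {"num": "sg", "pers": 3}}
verb_sleep = {"agr": {"num": "pl"}}

agreeing = unify({"agr": subject_she["agr"]}, verb_sleeps)
clashing = unify({"agr": subject_she["agr"]}, verb_sleep)
print(agreeing)
print(clashing)
```

A single underspecified entry such as `{"agr": {"num": "pl"}}` covers all persons in the plural, which is the kind of conciseness that reduces grammar size.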
Multityped ACG (mACG) and Weighted ACG
Symbolic parsing with large coverage grammars usually leads to a combinatorial explosion of syntactic ambiguities (a single expression has many syntactic analyses). A widespread method to address this issue is to use statistics and probabilities, leading for instance to probabilistic Context Free Grammars (pCFGs) and probabilistic Tree Adjoining Grammars (pTAGs). We have introduced weighted ACGs to provide such capabilities to the ACG framework. Weighted ACGs fit well with the multityped extension (mACGs) 67 we introduced to still allow for grammar composition. We have proposed encodings of hidden Markov models, pCFGs and pTAGs (not yet published). We moreover introduced into ACGtk mechanisms to prepare for taking such weighting information into account when enumerating solutions 21 (see Section 7.1.1). These mechanisms are based on the principles of zippers 61 applied to (shared) forests, instead of trees.
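The following Python sketch illustrates what sorting the analyses of a shared forest by weight amounts to. It is a deliberately naive version under stated assumptions: the forest encoding, the rule labels, and the weights are invented for the example, and the function enumerates every derivation eagerly before sorting, whereas a zipper-based traversal can navigate the forest without building all trees first. It does not reflect the ACGtk implementation.

```python
from itertools import product


def derivations(forest, node):
    """Return all (weight, tree) pairs rooted at `node`, best weight first.

    forest: maps a node to its alternatives, each a triple
            (weight, rule label, tuple of child nodes).
    The weight of a derivation is the product of the weights of its rules.
    """
    results = []
    for weight, label, children in forest[node]:
        child_options = [derivations(forest, child) for child in children]
        # product() over an empty list yields one empty combination (leaf case).
        for combo in product(*child_options):
            total = weight
            for child_weight, _ in combo:
                total *= child_weight
            tree = (label, tuple(child_tree for _, child_tree in combo))
            results.append((total, tree))
    return sorted(results, key=lambda pair: -pair[0])


# Hypothetical packed forest for a prepositional-phrase attachment ambiguity:
# the PP sub-analysis is shared by both alternatives of VP.
FOREST = {
    "S": [(1.0, "S", ("VP",))],
    "VP": [
        (0.6, "attach-PP-to-verb", ("PP",)),
        (0.4, "attach-PP-to-noun", ("PP",)),
    ],
    "PP": [(1.0, "PP", ())],
}

ranked = derivations(FOREST, "S")
for weight, tree in ranked:
    print(round(weight, 3), tree)
```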
Encoding of Meaning-Text Theory Into ACGs
Meaning-Text Theory (MTT) is a linguistic theory geared towards generating natural language expressions from semantic representations 68. It relies on seven representation levels (e.g., semantics, deep syntax, surface syntax, etc.). Representations at each level are related to representations at the adjacent levels by rewriting devices. Each representation is made of several structures, among which the predicative and the communicative ones. MTT uses the key concept of paraphrase, especially in these rewriting devices. ACGs come with several composition modes, one of which in particular corresponds to transduction of (tree or graph) structures.
We have therefore been studying the ability of ACGs to model MTT structure transformations between adjacent levels, focusing on the structures and levels of semantics, deep syntax, and surface syntax.
In previous work 54, 53, we proposed an encoding of MTT into ACGs that used the predicative structures of the semantic level in MTT. However, MTT rewriting processes also make use of communicative structure information, decorating the predicative structures (at the semantic level) with theme and rheme information.
Indeed, the expressions "Charlie is Taylor's son" and "Charlie, the son of Taylor" share the same predicative structure but are not paraphrases of each other. While the second one is a nominal expression, the first one is a verbal expression about Charlie that states that he is Taylor's son. The difference between the two expressions, which share the same semantic predicative graph, is made by the communicative structure.
This shows the crucial role the communicative structure plays in MTT, since it determines, from a given semantic graph (i.e., predicative and communicative structures), which deep-syntactic graph is to be obtained.
We have therefore proposed to also take into account this communicative structure, using suitable types and grammatical composition as offered by the ACG framework 47.
8.1.2 Formal semantics of dependency relations
We have continued to work on our formal semantic theory of dependency grammars. This theory, which is fully compositional and robust, is based on the following principles:
- Dependency relations are represented as binary functions whose type is built from the syntactic category of the governor and the syntactic category of the governee.
- By virtue of a coherence principle, the different ways of encoding a dependency structure by means of a λ-term should all give rise to the same semantic interpretation.
- The semantic interpretation of a syntactic category cat is a type built from the Montagovian interpretation of cat and the Montagovian interpretations of the syntactic categories of the phrases whose heads can potentially be governed by the head of a phrase of category cat.
- Saturating operators allow phrases to recover their usual Montagovian interpretations.
- Verbs and verbal phrases are semantically interpreted as sets of sets of events.
We have tested these principles on advanced syntactic phenomena 45. We considered two cases: the relative clauses (which depend on the acl:relcl dependency relation) and the open clausal complements (which depend on the xcomp dependency relation).
8.1.3 Formal semantics of adnominal modification
We have proposed a treatment of adnominal modification that parallels the treatment of adverbial modification in neo-Davidsonian event semantics 42. To this end, we introduced a notion of perspective that allows nouns to be interpreted as sets of sets of perspectives. The resulting theory provides a unified compositional treatment of intersective, subsective, modal, and privative adjectives, and avoids the intensional paradoxes caused by an extensional treatment of subsective adjectives.
8.1.4 Semantic Representation
In 27, we presented the first version of YARN, a new semantic representation formalism whose aim is to unify the advantages of logic-based formalisms while retaining direct interpretation, making it widely usable. YARN is rooted in the encoding of different semantic phenomena as separate layers. The paper presents a formal definition of the mathematical structure that constitutes YARN and illustrates with concrete examples how this structure can be used in the context of semantic representation for encoding multiple phenomena (such as modality, negation and quantification) as layers built on top of a central predicate-argument structure. The benefit of YARN is that it allows for the independent annotation and analysis of different phenomena, as they are easy to “switch off”. Furthermore, the ability of YARN to encode simple interactions between phenomena is explored. The paper concludes with a discussion of some of the interesting observations made during the development of YARN so far and outlines our extensive future plans for this formalism.
Zhengjian Li conducted an M1 internship under the supervision of Siyana Pavlova, Bruno Guillaume and Maxime Amblard. He studied the semantic annotation of real data in YARN and developed an interface to automatically produce YARN representation graphs.
As part of a mobility program at Sapienza University under the supervision of Roberto Navigli, Hee-Soo Choi collaborated with the SapienzaNLP team on a multilingual semantic parsing project proposing a strategy to semi-automatically generate semantic annotations across languages from LLMs. The work was published at the international conference ACL 2024 25.
8.1.5 Syntax and semantics of questions
Natural language statements are composed not only of declarative sentences but also of interrogative ones. Moreover, sentences cannot be categorized into purely declarative or purely interrogative sentences. Typically, a declarative statement may contain a subordinated interrogative clause:
- (a) I don't know where Mary is.
In 28, a kind of French subordinated interrogative that had not been observed before is described and analyzed. It consists of an adverbial modifier clause introduced by a preposition, e.g. selon (“depending on”):
- (b) Selon comment vous vous positionnez, vous n'aurez pas tous la même perception. (“Depending on how you position yourselves, you will not all have the same perception.”)
Up to now, subordinated interrogative clauses have been claimed to only be acceptable as subjects or complements of verbs, nouns or adjectives. This discovery widens our comprehension of the potential contribution of interrogatives in discourse.
This interaction between declarative and interrogative clauses is particularly present in dialogues, where the logical notion of answerhood is as significant as the one of inference. In order to tackle this issue from a formal standpoint, we investigated the properties and possible uses of inquisitive semantics. Inquisitive semantics is a formal semantic theory based on a logic that provides a uniform treatment of both declarative and interrogative expressions.
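The uniform treatment of statements and questions can be made concrete on a finite set of possible worlds. In the sketch below, which follows the standard textbook presentation of basic inquisitive semantics rather than any specific system developed in the team, a proposition is a downward-closed set of information states (sets of worlds), and it counts as inquisitive when it has more than one maximal state. The world names and function names are assumptions made for the example.

```python
from itertools import combinations


def powerset(worlds):
    """All subsets of `worlds`, as frozensets (a downward-closed set of states)."""
    ordered = sorted(worlds)
    return {
        frozenset(combo)
        for size in range(len(ordered) + 1)
        for combo in combinations(ordered, size)
    }


def statement(true_worlds):
    """A declarative: every state that establishes it, i.e. every subset of its truth set."""
    return powerset(true_worlds)


def polar_question(true_worlds, all_worlds):
    """A polar question: states that settle the issue positively or negatively."""
    yes = set(true_worlds)
    no = set(all_worlds) - yes
    return powerset(yes) | powerset(no)


def alternatives(proposition):
    """Maximal states of the proposition (its possible complete resolutions)."""
    return {s for s in proposition if not any(s < t for t in proposition)}


def is_inquisitive(proposition):
    """In a finite model: inquisitive iff there is more than one alternative."""
    return len(alternatives(proposition)) > 1


WORLDS = {"w_rain", "w_dry"}
it_rains = statement({"w_rain"})
does_it_rain = polar_question({"w_rain"}, WORLDS)

print(is_inquisitive(it_rains))
print(is_inquisitive(does_it_rain))
```

An answer to the question is then a statement whose single alternative is included in one of the question's alternatives, which gives a set-theoretic reading of answerhood alongside entailment.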
In 29, the interaction of questions and possibility modal operators is investigated. It is shown that when an existential modal scopes over the question operator, several dynamic properties of questions are weakened or blocked. In particular, modalized singular-which-questions have weaker exhaustivity and uniqueness presuppositions, and block anaphoric reference to the wh-word, except in modal subordination. A model based on dynamic inquisitive semantics and capturing these phenomena is provided.
8.1.6 Use of semantics
Before the invention of the printing press, texts could only be reproduced through manual copying, a process prone to errors, accidents, and intentional modifications. These changes altered each manuscript and were subsequently propagated by other scribes. For philologists reconstructing text history and genealogical relationships (stemma codicum), analyzing these variants is crucial. Stemmatology methods aim to objectively construct genealogical trees of textual transmission.
At the University of Lorraine, the Écritures laboratory and MSH have focused on uncovering the genealogical lineage of Hebrew manuscripts. A joint project with Maxime Amblard seeks to improve the manual work involved in critical editions of the Hebrew Bible by applying advanced methods from applied mathematics and natural language processing to reconstruct stemmas.
Iglika Zlatkova Nikolova-Stoupak was recruited to develop stemmatology algorithms. She designs, trains and tests a learning model to automatically tag scribal variants in manuscripts. The model will classify variants (orthographic, lexical, grammatical, etc.) based on expert-provided data and suggest classifications for new string comparisons. She also develops methods to compute semantic-based distances between Hebrew words, enabling the comparison of variant meanings. This will involve creating textual embeddings and neural network-based representations of Hebrew words.
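One common way to turn word embeddings into a semantic distance is the cosine distance between their vectors. The sketch below shows that computation only; the choice of cosine distance, the three-dimensional toy vectors, and the variant names are assumptions made for illustration and are not taken from the project.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cos(u, v) for two vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical embeddings of three scribal variants of the same word slot.
embeddings = {
    "variant_a": [0.9, 0.1, 0.0],
    "variant_b": [0.8, 0.2, 0.1],   # assumed near-synonym of variant_a
    "variant_c": [0.0, 0.1, 0.9],   # assumed unrelated meaning
}

close = cosine_distance(embeddings["variant_a"], embeddings["variant_b"])
far = cosine_distance(embeddings["variant_a"], embeddings["variant_c"])
print(close < far)
```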
Maxime Amblard started a collaboration with the French company Namkin. The industry faces numerous challenges that necessitate the evolution of BtoB marketing tools, in order to develop a valuable offer and provide an enhanced customer experience. Namkin's BrainLab develops industrial marketing tools for digitalizing customer relations, evolving business models, and exploiting business and economic data for business development. One of the key challenges of marketing intelligence is to identify risks and opportunities so as to guide marketing strategies. Among the sources of information useful to detect risks and opportunities, Namkin has identified Business Events, that is, “textually reported real-world occurrences, actions, relations, and situations involving companies and firms” 62.
While modern semantic representations may contain vast quantities of information, they do not always (or necessarily) contain the information that is useful for the concrete application. For instance, significant challenges still persist in dealing with temporal relations and finely-grained negation interpretation.
Recent research has looked into the benefits of exploiting semantic representations, and in particular Abstract Meaning Representation (AMR), for low-resource scenarios and document-level event argument extraction. However, it appears that AMR has to be adapted in order to optimally support event extraction related tasks 78. One major limitation of AMR for document-level event extraction is that AMR works at the sentence level, and thus requires the aggregation of sentence-level representations. AMR also has limited expressive power for negation and universal quantification.
8.2 Discourse Dynamics
Participants: Maxime Amblard, Philippe de Groote, Amandine Decker, Jacques Jayez, Michel Musiol, Emeric Licorni, Ines Hernandez.
8.2.1 Dialogue Modeling
Discourse relation prediction arguably is the most difficult task in discourse parsing. Previous studies have generally focused on explicit or implicit discourse relation classification in monologues, leaving dialogue an under-explored domain. Facing the data scarcity issue, Chloé Braud (IRIT), Maxime Amblard and Chuyuan Li proposed to leverage self-training strategies based on a Transformer backbone. Moreover, they designed the first semi-supervised pipeline that sequentially predicts discourse structures and relations. Using 50 examples, their relation prediction module achieves 58.4 in accuracy on the STAC corpus, close to supervised state-of-the-art. Full parsing results show notable improvements compared to the supervised models both in-domain (gaming) and cross-domain (technical chat), with better stability 24.
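The self-training idea can be illustrated with a minimal sketch. It is not the pipeline of 24, which relies on a Transformer backbone: the nearest-centroid classifier, the toy 2-D features, the `threshold` and `max_rounds` parameters and all function names below are assumptions chosen to keep the example self-contained. The loop trains on a small labeled seed, labels the unlabeled pool, keeps only confident predictions as pseudo-labels, and repeats.

```python
from collections import defaultdict
import math

def centroids(labeled):
    """Mean 2-D feature vector per label (a stand-in for a real classifier)."""
    sums = defaultdict(lambda: [0.0, 0.0])
    counts = defaultdict(int)
    for (x, y), label in labeled:
        sums[label][0] += x
        sums[label][1] += y
        counts[label] += 1
    return {l: (s[0] / counts[l], s[1] / counts[l]) for l, s in sums.items()}

def predict(cents, point):
    """Return (label, confidence); needs at least two labels.
    Confidence is a margin-like score in [0.5, 1] (hypothetical choice)."""
    dists = sorted((math.dist(point, c), l) for l, c in cents.items())
    (d1, best), (d2, _) = dists[0], dists[1]
    conf = 1.0 - d1 / (d1 + d2) if d1 + d2 > 0 else 1.0
    return best, conf

def self_train(labeled, unlabeled, threshold=0.8, max_rounds=5):
    """Iteratively add confident pseudo-labels to the training set."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(max_rounds):
        cents = centroids(labeled)
        confident, rest = [], []
        for p in pool:
            label, conf = predict(cents, p)
            (confident if conf >= threshold else rest).append((p, label))
        if not confident:
            break  # nothing new to learn from
        labeled.extend(confident)
        pool = [p for p, _ in rest]
    return centroids(labeled), labeled, pool
```

In this sketch, points that never reach the confidence threshold stay in the pool unlabeled, which is one simple way to limit the propagation of wrong pseudo-labels.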
Maxime Amblard and Amandine Decker continue to work on topic segmentation 17 and topical structure modeling 18. Topics play an important role in dialogue coherence, as what is currently discussed constrains the possible contributions of the participants, and initiating a topic while the previous one is still under discussion may be confusing without appropriate signals. However, the way to actually define the notion of topic is debated in linguistics and not sufficiently discussed in dialogue modeling. A precise description of topics and topic shifts in conversation would contribute to a better understanding of what makes us judge a sequence of utterances to be coherent.
In 17, linear topic segmentation is performed on transcriptions of the Friends TV show. These contain multi-party conversations in a daily life setting which are complex to study due to their chaotic nature. Linear topic segmentation proves to be ill-suited for such complicated dialogues. Hence a new topic segmentation method must be investigated.
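To make the notion of linear topic segmentation concrete, here is a minimal sketch in the spirit of lexical-cohesion approaches. It is not the method evaluated in 17: the bag-of-words representation, the cosine similarity, the `threshold` value and the function names are all assumptions. A boundary is placed wherever two adjacent utterances are lexically dissimilar, which yields a flat partition of the dialogue.

```python
from collections import Counter
import math

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def linear_segments(utterances, threshold=0.1):
    """Split a dialogue into consecutive segments of utterance indices.
    A new segment starts when adjacent utterances fall below `threshold`."""
    if not utterances:
        return []
    bags = [Counter(u.lower().split()) for u in utterances]
    segments, current = [], [0]
    for i in range(1, len(bags)):
        if cosine(bags[i - 1], bags[i]) < threshold:
            segments.append(current)
            current = []
        current.append(i)
    segments.append(current)
    return segments

dialogue = [
    "the coffee is cold",
    "make new coffee then",
    "my sister called yesterday",
    "what did your sister say",
]
print(linear_segments(dialogue))  # [[0, 1], [2, 3]]
```

A flat partition of this kind cannot represent a return to an earlier topic or a subtopic nested inside another one, which gives some intuition for why linear segmentation struggles on multi-party conversations.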
36 presents a method to build an unsupervised topic similarity measure. It relies on the existing tree structure of Reddit threads and allows us to analyse the evolution of topics over the course of a conversation.
Ines Hernandez conducted an M2 internship under the supervision of Amandine Decker and Maxime Amblard. Her study began by enhancing a topic similarity model. Since topicality is a key feature for differentiating subordinating and coordinating relations, she then applied this improved model to the specific task of classifying these relations using a conversational dataset. To test its performance, she used three datasets, in English and French, which contain information from different sources and have different structures. Additionally, she proposed an analysis of the results by rhetorical relation, in an attempt to observe whether certain rhetorical relations are more straightforward to classify. Furthermore, she fine-tuned pre-trained language models, such as BERT, RoBERTa, DeBERTa, and Llama 2, to continue the experiments on this classification task. Notably, BERT outperformed the other models in classifying these relations, possibly due to its Next Sentence Prediction (NSP) training objective. Finally, she proposed to provide context to the models during training, since this is considered crucial for understanding how segments are connected. This yields promising results, improving the BERT model in particular.
In order to analyze different types of topic shifts, 18 proposes to create a corpus of written task-oriented conversations (discussion of the ethical dilemma of the balloon task), where the dialogues happen by message exchanges. Such a controlled setting where the main topic is fixed, and subtopics are more easily identifiable, could be very helpful when it comes to understanding how people change the topic and react to topic shifts in dialogues. Emeric Licorni conducted an L3 internship under the supervision of Amandine Decker and Maxime Amblard to explore the design of this experiment. Applying the topic shift similarity metric mentioned above to human-annotated conversations could enable us to develop a hierarchical topic segmentation model.
8.2.2 Discourse Markers
Jacques Jayez is currently working with Mathilde Dargnat (ATILF), Paola Herreño (Ph.D. candidate ATILF-LLF) and Maeva Sillaire (Ph.D. candidate ATILF) on the semantic representation of D(iscourse) M(arkers). DM are words/expressions like so or well in English which help structure discourse or communicate speakers' internal epistemic or affective states. The domain-based approach initiated in the 90s consists in defining different types (aka domains) of semantic objects, like states of affairs, beliefs or speech acts. Such objects have a rich internal structure, which calls for a sufficiently expressive representation. For instance, speech acts involve discourse planning. Moreover, DM can impact various layers of meaning (propositional content, presupposed content, etc.). Our goal is to formally implement the domain intuition by using tools from languages with a flexible subtyping mechanism (OCaml) and to investigate whether the notion of monad, familiar in Haskell, can help us to characterize some 'side effects' of DM.
At the moment (January 2025), we have constructed a library of about 600 automata in the language of the Unitex-Gramlab software, to detect occurrences of about 700 DM. For words or expressions which are lexically ambiguous, this tool helps us to identify their uses as DM. For instance, bon can be an adjective (un bon roman, a good novel) or a discourse particle like in bon, il faut qu'on parte (OK, we have to go). A comparison with the results of various taggers (probabilistic and based on LLM) is under way. We have also attacked the typing problem mentioned above by (i) clarifying the different distinctions necessary to have a clear notion of type for DM and (ii) capturing the basic information structures for DM in the language of Dialogue Game Boards of 56 (64).
Jacques Jayez is also working on the argumentative dimension of discourse, using a combination of standard Bayesian approach and game-theoretical notions 23, 63.
8.2.3 Pathological Discourse Modeling
Michel Musiol is once again part-time delegate at Inria (délégation nationale SHS) for the period 2023-2024. This proximity has enabled us to pursue an active collaboration on the modeling of pathological and clinical discourse.
In the context of the MePheSTO project (Digital Phenotyping for Psychiatric Disorders from Social Interaction - DFKI-Inria AI project), a multimodal perspective on this issue has been introduced. This includes the development of an interlocutory model of what might be termed a "therapeutic effect" in clinical/psychopathological interviewing, with regard to the cognitive-discursive profile and oculomotor behavior of participants (Amandine Lecomte's thesis forthcoming 2025; Marie-Hélène Pierre's thesis forthcoming 2025).
Indeed, based on a previous study addressing the issue of cognitive impairment bypass based on the dynamics of the repetition process 65, 66 and another modeling visual attention in the psychologist-schizophrenic patient interaction 73 as well as on the basis of new data (for example, from subjects with ultra-high risk of developing psychosis or subjects who have experienced a major depressive episode 55, in the MePheSTO project), Maxime Amblard, Michel Musiol and Amandine Lecomte have explored the dependent relationships between supportive visual behavior and interaction dynamics, on the one hand, and the relationships between mental pathology and oculomotor disorders, on the other hand.
In addition, in 70 they have sketched a methodology for analyzing speech disorders, which will have the particularity of helping to select the discontinuous sequences most likely to carry thought disorders. They have anticipated the development of a modeling system based on the principles of pragmatic linguistics and formal semantics, which, when applied to carefully selected discontinuous discourse sequences 69, will have a good chance of revealing the nature of the underlying thought disorders. They compared the conjectures with the results of an earlier study on the discovery of four "proven" types of discontinuous sequences, and showed which of these sequences can therefore be considered to carry thought disorders. The whole area of thought disorders in schizophrenic discourse might be modeled on the basis of computational and semantic tools 72, 75.
They also analyzed these sequences by testing certain principles of semantic modeling in order to identify the nature of the disorders and thought operations underlying the discontinuous sequences concerned 71, 70. They show that discursive thought disorders should not be considered simply as the expression of a dysexecutive syndrome 52, but also as a device likely to affect more complex thought operations such as inferences involved in the conversational context representation system, in semantic memory and in calculating the meaning of utterances or in calculating the meaning of the speaker.
Improving the heuristics of formal systems for recognizing speech disorders and interpreting thought disorders on the basis of more appropriate and accurate semantic modeling may lead to the development of more discriminating and effective diagnostic tools, particularly to account for Shwachman-Diamond diagnostic and cognitive dysfunction 77. The formatting of the formal systems they have achieved will make it possible to represent the interlocutory structure of the disorder more and more accurately in its natural context of expression (speech), and should lead to the development of computerized diagnostic aids.
Finally, the increased precision of formal modeling applied to communication disorders should also make it possible to test the hypothesis that certain discourse configurations are related to thought disorders (in the broad sense), while others reveal cognitive dysfunctions that have more to do with the conditions of the possibility of discourse.
In this line, Michel Musiol has also updated MOS-SF36 norms in the young French population 76. This program focuses on accurate and rapid diagnosis, as well as long-term therapeutic follow-up. These are major challenges for contemporary psychopathology. Thanks to modeling 65, 66 and computing 51, this work is developing multimodal methodologies for investigating symptoms (language, speech, neuropsychological and cognitive processes, eye movements and visual attention) that are sufficiently accurate to give rise in the medium term to the development of computerized diagnostic and therapeutic follow-up tools for the benefit of those involved in mental health.
Improving the accuracy of these tools also requires refining their targets (i.e. the definition of the symptoms). In this line, Vincent Martin has been involved in work aiming at unifying the semiology of psychiatry through data visualization 13.
Rémi De Vergnette conducted an M1 internship under the supervision of Maxime Amblard. The internship focused on developing methods for the automatic analysis of interview data. Using video recordings and eye-tracker data, the goal was to extract and utilize a list of fixations made by the psychologist and patient, along with the corresponding fixation zones. Previously, much of this work was performed manually using commercial analysis software. Automating this process was essential due to the large number of interviews to analyze (150 across two hospitals in Nice and Aix-en-Provence), with further data collections planned. Reducing manual processing allows more time to focus on in-depth analyses requiring expert input from psychologists. He made several contributions: the development of an algorithm for zone-based video segmentation; the creation of algorithms to associate fixation data with zones, accounting for biases introduced by the software and uncertainties in measurement values; the implementation of a module to extract raw eye-tracker data from acquisition files; and the design of two interfaces to facilitate collaboration with participating psychologists.
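The association of fixations with zones can be sketched as follows. This is only an illustration under stated assumptions, not the algorithms developed during the internship: zones are taken to be axis-aligned rectangles in screen coordinates, fixations are `(x, y, duration_ms)` triples, and measurement uncertainty is modeled by a single hypothetical `tolerance` margin in pixels. All names are invented for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Zone:
    """Hypothetical rectangular zone of interest in screen coordinates."""
    name: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def distance(self, x, y):
        """Euclidean distance from (x, y) to the rectangle (0 if inside)."""
        dx = max(self.x_min - x, 0.0, x - self.x_max)
        dy = max(self.y_min - y, 0.0, y - self.y_max)
        return (dx * dx + dy * dy) ** 0.5

def assign_zone(zones, x, y, tolerance=0.0):
    """Name of the closest zone within `tolerance` pixels, else None."""
    if not zones:
        return None
    best = min(zones, key=lambda z: z.distance(x, y))
    return best.name if best.distance(x, y) <= tolerance else None

def dwell_time(zones, fixations, tolerance=0.0):
    """Total fixation duration per zone; unassigned fixations go under None."""
    totals = {}
    for x, y, duration_ms in fixations:
        name = assign_zone(zones, x, y, tolerance)
        totals[name] = totals.get(name, 0) + duration_ms
    return totals
```

Keeping unassigned fixations under the `None` key, rather than dropping them, makes it easy to check how much data falls outside every zone for a given tolerance.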
8.2.4 Cognitive traces of side issues
It is by now widely believed that natural language communication operates at several levels. This means that information is distributed across several partially independent dimensions. For instance, a sentence like My stupid neighbor made noise again simultaneously conveys that my neighbor made noise (the truth-conditional content), that the speaker considers him stupid (an expressive, side issue 1) and that he had made noise before (the presupposition of again, side issue 2). While these phenomena have been extensively described from an empirical perspective, there is at present no unified framework for representing their differences and possible interactions from a formal, computational or cognitive point of view.
We examined the motor effects of presuppositions, using the convenient lexical material of factive verbs, that is, verbs which presuppose the truth of the complement clause. For example, Mary knows that Paul cheated on the exam presupposes that Paul cheated on the exam (side issue) and asserts (truth-conditional content) that she believes that. It has been shown that the oral presentation of movement-related verbs like jump or push elicits some activation in the motor cortex and finally results in an involuntary contraction of the thumb-index arc, which can be recorded by a special electromagnetic cell, called a grip force sensor.
We adapted this technique to the case of factive verbs on a series of sentences of the form Mary knows that Paul throws the ball, compared with high base-level sentences like Paul throws the ball and low base-level sentences like Paul does not throw the ball. Summarizing, our results indicate that the sentences with the factive verbs elicit a very similar response to that of high base-level ones, and, as expected, a different response from that of low base-level ones. This suggests that, at least for factive verbs, the presupposed status leaves no trace of a special cognitive treatment, which would lead, for instance, to a delayed or weaker motor response.
However, when applied to more complex negative sentences like Mary does not know that Paul throws the ball, there is no evidence of a motor trace. This is in agreement with observations in the literature suggesting that, under negation, presuppositions are more 'difficult' to process than in simple assertive sentences. More precisely, in the case of motor response, negation interacts with the presupposition, which suggests that truth-conditional content and side issue cognitive processing cannot be totally separated 9.
8.3 Common Basic Resources
Participants: Maxime Amblard, Hee-Soo Choi, Philippe de Groote, Bruno Guillaume, Sylvain Pogodalla, Karën Fort.
8.3.1 Universal Dependencies and Surface Syntactic Universal Dependencies
The Universal Dependencies project (UD) aims at building a syntactic dependency scheme which allows for similar analyses for several different languages. Bruno Guillaume is active in the UD community, and participates in the development and the improvement of the French data in this international initiative.
During 2024, he continued working, in collaboration with Sylvain Kahane, Kim Gerdes and their teams on the promotion of the Surface Syntactic Universal Dependencies (SUD) framework. SUD is an annotation scheme for syntactic dependency treebanks, that is almost isomorphic to UD (Universal Dependencies). Contrary to UD, it is based on syntactic criteria (favoring functional heads) and the relations are defined on distributional and functional bases.
This work is mainly conducted in the ANR project Autogramm (Induction of descriptive grammar from annotated corpora) started in 01 2022. The goal of this project is to automate, as far as possible, the extraction of descriptive grammars and grammatical descriptions from annotated corpora for linguistic and typological studies. The project also promotes the development of treebanks for low-resourced languages, in order to extract quantitative descriptive grammars for these languages.
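As a rough illustration of what a quantitative description extracted from an annotated corpus can look like, the sketch below counts, for each dependency relation in a CoNLL-U file (the 10-column tab-separated format used by UD and SUD treebanks), how often the dependent precedes or follows its head. This is not the Autogramm extraction method; the function names and the choice of statistic are assumptions made for the example.

```python
from collections import Counter

def read_tokens(conllu_text):
    """Yield (id, head, deprel) for each basic token of a CoNLL-U string."""
    for line in conllu_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # blank lines separate sentences, '#' lines are comments
        cols = line.split("\t")
        # Skip malformed lines, multiword ranges (e.g. 1-2), empty nodes
        # (e.g. 1.1) and tokens with an unspecified head.
        if len(cols) != 10 or not cols[0].isdigit() or not cols[6].isdigit():
            continue
        yield int(cols[0]), int(cols[6]), cols[7]

def dependent_position_counts(conllu_text):
    """Count (relation, 'before' | 'after') pairs; the root is ignored."""
    counts = Counter()
    for tok_id, head, deprel in read_tokens(conllu_text):
        if head == 0:
            continue
        side = "before" if tok_id < head else "after"
        counts[(deprel, side)] += 1
    return counts

# Hand-made example sentence with UD-style labels (not taken from a treebank).
rows = [
    ["1", "Le", "le", "DET", "_", "_", "2", "det", "_", "_"],
    ["2", "chat", "chat", "NOUN", "_", "_", "3", "nsubj", "_", "_"],
    ["3", "dort", "dormir", "VERB", "_", "_", "0", "root", "_", "_"],
]
sample = "# text = Le chat dort\n" + "\n".join("\t".join(r) for r in rows) + "\n"
print(dependent_position_counts(sample))
```

Since token ids restart at 1 in every sentence, comparing a token id with its head id is only meaningful within a sentence, which is exactly what each CoNLL-U line encodes.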
Bruno Guillaume, in collaboration with Althea Löfgren, Santiago Herrera, Sylvain Kahane and Natalia Levshina, has conducted a comparative analysis of UD treebanks to explore how they can account for linguistic diversity in linguistic typology. This work was presented in 43.
In 2024, two new versions of Universal Dependencies were released. Bruno Guillaume worked in collaboration with field linguists to produce or enhance treebanks in Surface Syntactic Universal Dependencies and to convert these treebanks to Universal Dependencies:
- Version 2.14 in May:
- new treebank for Northern Hausa (with Bernard Caron)
- new treebank for Southern Hausa (with Bernard Caron)
- Version 2.15 in November:
In 20, different ways to annotate both syntactic and morphological relations in a dependency treebank are compared. New formats, called mSUD and mUD, are proposed; they are compatible with the Universal Dependencies (UD) schema for syntactic treebanks. The paper emphasizes mSUD rather than mUD, the former being based on distributional criteria for the choice of the head of any combination, which makes it possible to clearly encode the internal structure of a word, that is, the derivational path. The authors investigate different problems posed by a morph-based annotation, concerning tokenization, the choice of the head of a morph combination, relations between morphs, and additional features needed, such as the token type differentiating roots and derivational and inflectional affixes. The annotation schema is then applied to different languages, from polysynthetic languages such as Yupik to isolating languages such as Chinese.
The two papers 32, 39 present a new phonetic resource for Nigerian Pidgin, a low-resource language of West Africa. Aiming to provide a new tool for research on intonosyntax, the authors have augmented an existing syntactic treebank of Nigerian Pidgin, associating each orthographically transcribed token with a series of syllable-level alignments and phonetizations. Syllables are further described using a set of continuous and discrete prosodic features. This new approach provides a simple tool for researchers to explore the prosodic characteristics of various syntactic phenomena. The papers present the format of the corpus, the various features added, and several explorations that can be performed using an online interface. A prosodically specified lexicon extracted using this resource is also presented. In it, each orthographic form is accompanied by the frequency of its phoneme-level variants, as well as the suprasegmental features that most frequently accompany each syllable. Finally, several additional case studies on how this corpus can be used in the study of the language’s prosody are presented.
Bruno Guillaume is a member of the core group of the UniDive COST action where he is in charge of the working groups dedicated to “corpus annotation”. This action was presented in a workshop in 31.
8.3.2 Enriching Lexical Resources
As part of her thesis on French lexical resources, Hee-Soo Choi conducted experiments on the enrichment of lexical semantic graphs using link prediction models. She proposed a resource-centric approach of the link prediction task by generating confidence-aware predictions that can complete a sparse lexical semantic graph. This work was published at the international conference LREC-COLING 2024 16 and its French version at the national conference TALN 35.
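The general shape of a link prediction task with confidence scores can be sketched as below. The models used in 16 are not described here, so this example swaps in a simple neighborhood heuristic (Jaccard overlap of neighbor sets) rather than a learned model; the `min_confidence` parameter and the function names are likewise assumptions. Each missing edge receives a score, and only candidates above the confidence threshold are proposed as completions of the graph.

```python
from itertools import combinations

def neighbors(edges):
    """Adjacency sets of an undirected graph given as (node, node) pairs."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def predict_links(edges, min_confidence=0.5):
    """Propose missing links as (node_a, node_b, score), best first.
    The score is the Jaccard overlap of the two neighbor sets."""
    adj = neighbors(edges)
    predictions = []
    for a, b in combinations(sorted(adj), 2):
        if b in adj[a]:
            continue  # link already present in the graph
        score = len(adj[a] & adj[b]) / len(adj[a] | adj[b])
        if score >= min_confidence:
            predictions.append((a, b, score))
    return sorted(predictions, key=lambda p: -p[2])
```

Exposing the score alongside each predicted link lets a resource maintainer choose between precision (a high threshold, few proposals) and coverage (a lower threshold, more proposals to review).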
8.3.3 FENEC
Karën Fort worked with Alice Millour (Université Paris 8) and Yoann Dupont (Université Paris 3) on the creation of a balanced sample corpus for French named entity recognition. The created corpus, FENEC, is freely available on GitHub, and was presented in a paper at TALN 2022. An extended version in English was presented in 2024 at LREC-COLING 26.
8.3.4 Synthetic clinical texts generation
In the context of the CODEINE ANR project and more specifically of Nicolas Hiebel's PhD thesis, Karën Fort worked with Aurélie Névéol (LISN-CNRS) and Olivier Ferret (CEA) on the generation of synthetic clinical texts.
The key idea of the project is to use confidential corpora to automatically generate anonymous synthetic texts capable of emulating real documents from the perspective of their linguistic characteristics. Nicolas Hiebel worked with Hugo Boulanger (another Olivier Ferret PhD student) on generating clinical texts using constraints. This work has been published at the Clinical NLP workshop 15 and in French at TALN 33 and in 38.
Another part of the project consists in using a Game With A Purpose to validate and then annotate the synthesized clinical texts. This game, developed by Bertrand Remy, is called HostoMytho (see Section 7.1.3), and includes various mini-games for different annotation layers, such as negation, error typing, or plausibility rating. The game is multi-platform, and therefore intended to be used on the web (see: online HostoMytho), on Android and iOS. It has been presented at the Games and NLP workshop in 2024 22.
8.4 Ethics and biases
Participants: Karën Fort, Maxime Amblard, Michel Musiol, Marc Anderson, Fanny Ducel, Clémentine Bleuze.
8.4.1 Ethics@Loria
Karën Fort originated a working group at LORIA for AI ethics (ethics@loria), involving researchers from various teams, including Maxime Amblard, Marc Anderson, Armelle Brun (BIRD), Mathieu d'Aquin (Orpailleur), Christophe Cerisara (Synalp), Anne Bonneau (Multispeech), Slim Ouni (Multispeech) and Abdessamad Imine (Pesto). Aurore Coince helped manage the group. Ethics@loria proposed the Doctoral training on Ethics "Write your dystopia" in 2022.
One major outcome of the training, which took some time to be produced, is a book collecting the six short stories written by the PhD students in three languages (French, English and Spanish) 50. It was translated into English and French and made available both as a book and online in 2024: thinkbeforeloading.loria.fr thanks to many people's (esp. Marie Baron's) efforts.
8.4.2 Evaluating Stereotypes in Masked Language Models in Many Languages
Following the creation of French CrowsPairs 7, Karën Fort contacted researchers interested in creating a CrowsPairs corpus for their language, in order to test the language models. The group grew to include 22 researchers (including Fanny Ducel, from the team) for 7 languages (German, Maltese, Spanish, Italian, Chinese, standard Arabic and Catalan). The group worked together for more than a year to produce corpora and test masked language models in these 7 new languages. The corpora are freely available, along with the code to test the language models and the guidelines we followed for the adaptation.1 It is important to note that this work has been performed without any funding. The work has been detailed in a paper presented at LREC-COLING 2024 19.
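The kind of measurement such paired corpora support can be sketched as follows. This is not the code distributed with the corpora: `score_fn` is a hypothetical stand-in for a model-based sentence score (for instance a pseudo-log-likelihood computed with a masked language model), and the function name is invented. Each item pairs a more stereotypical sentence with a minimally different, less stereotypical one, and the metric is the percentage of pairs for which the model scores the more stereotypical sentence higher. Under this reading, a value close to 50 indicates no systematic preference.

```python
def stereotype_preference(pairs, score_fn):
    """Percentage of pairs where `score_fn` prefers the stereotypical sentence.

    pairs: list of (more_stereotypical, less_stereotypical) sentences.
    score_fn: callable mapping a sentence to a number (higher = more likely);
              an assumption standing in for a real model-based scorer.
    """
    if not pairs:
        raise ValueError("no sentence pairs given")
    preferred = sum(1 for stereo, anti in pairs if score_fn(stereo) > score_fn(anti))
    return 100.0 * preferred / len(pairs)

# Toy usage with placeholder sentences and made-up scores.
toy_scores = {"s1": -10.0, "a1": -12.0, "s2": -9.0, "a2": -8.5}
toy_pairs = [("s1", "a1"), ("s2", "a2")]
print(stereotype_preference(toy_pairs, toy_scores.get))  # 50.0
```

Passing the scorer as a parameter keeps the metric independent of any particular model or library, so the same function can be reused across languages and models.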
8.4.3 Evaluating stereotypes in autoregressive language models
Fanny Ducel authored a critical literature review on the topic of stereotypical biases in language models, under the supervision of Karën Fort and Aurélie Névéol. This work, presented in an earlier version at the Workshop on Algorithmic Injustice in Amsterdam in June 2023, has been published in a longer version in the French journal "Revue TAL" 12.
Fanny Ducel, under the supervision of Karën Fort and Aurélie Névéol, measured gender stereotypical biases in cover letters generated by autoregressive language models, in French and Italian. This work has been presented and published at TALN in French 37, and in the international Language Resources and Evaluation journal 11.
In the context of the "Ethics and NLP" ("Éthique et TAL") day, organized at LORIA by Karën Fort, Aaron Boussidan, Fanny Ducel, Karën Fort and Aurélie Névéol presented an abstract on the use of ChatGPT in research on bias 34. Fanny Ducel also presented some actionable directions for bias research at the Workshop "New Perspectives on Bias and Discrimination in Language Technology" in Amsterdam in November 2024 41.
Karën Fort is PI of a new 4-year ANR project (2023-2027), InExtenso (Intrinsic and Extrinsic evaluation of biases in large language models), in collaboration with Rouen's hospital (CHU) and LISN-CNRS. The project aims at better identifying stereotyped biases in LLMs in French and, when possible, mitigating them. Within the framework of this project, Clémentine Bleuze, under the supervision of Karën Fort and Aurélie Névéol, started a thesis on the perception and evaluation of biases in medical applications of Large Language Models.
8.4.4 NLP for NLP and Ethics
Clémentine Bleuze conducted an M2 internship under the supervision of Karën Fort and Maxime Amblard, and in collaboration with Fanny Ducel. During this internship, she worked on the notion of scientific overclaiming (when researchers inadequately interpret or present elements of their research) in NLP papers. Contributions of this work include the definition of a taxonomy of relevant research claims, the constitution of a corpus of NLP claims originating from ArXiv and ACL papers (a subpart of which is human-annotated), and the training of BERT-based models to predict claim types. This work led to the presentation of a poster during the scientific days of the thematic research network LIFT 2 in October 2024 40.
8.4.5 Ethics in AI Integration into Industry
The ongoing collaboration between Karën Fort and Marc Anderson resulted in a final common article in 2024 10 about how the recommendations made by the ethics group were taken into account in the AI-Proficient project.
9 Partnerships and cooperations
9.1 European initiatives
9.1.1 Other european programs/initiatives
- Bruno Guillaume is a member of the core group of the COST action: CA21167 - Universality, diversity and idiosyncrasy in language technology (UniDive). He is the leader of the working group named "Corpus Annotation".
9.2 National initiatives
9.2.1 ANR Project: InExtenso
Participants: Karën Fort, Maxime Amblard, Michel Musiol, Fanny Ducel.
- Title: Intrinsic and Extrinsic evaluation of biases in large language models
- Duration: 10 2023–09 2027
- Coordinator: Karën Fort
- Partners: CHU Rouen, LISN, LORIA
- Participants: Maxime Amblard, Fanny Ducel, Karën Fort (coordinator), Michel Musiol, Miguel Couceiro
- Abstract:
Large Language Models (LLM) are the Swiss Army knife of today’s Natural Language Processing (NLP). They often outperform the state-of-the-art on benchmarks commonly used in the field for tasks such as part-of-speech tagging, text classification and named-entity recognition, thus paving the way to a myriad of end-user applications. However, it has been shown that LLM exhibit major ethical issues including significant environmental impact, mirroring and amplification of stereotyped biases, which in turn have a disproportionate impact on historically disadvantaged social groups. It is urgent to address the social impact of NLP as the applications we develop, such as chatGPT, are now directly made available to end users. The detection and mitigation of biases have therefore become an active area of research in the past few years, focusing mainly on Masked Language Models (MLM) such as BERT in English and the North American social context. Several sources of bias were identified in the NLP pipeline. However the interconnection between sources and overall impact of each source on downstream applications remains unclear. In this project, we want to observe the entire pipeline, from the intrinsic point of view (within the model itself), to the pre-training task point of view (in the case of autoregressive LLM, text generation), on to some real-world downstream applications. We chose to focus on two types of medical applications: mental illness diagnosis help and information extraction from clinical records for public health purposes such as patient enrollment into clinical trials. The project will provide corpora and methods for a global evaluation of bias in LLM in French as well as studies to further the understanding of biases in clinical NLP pipelines and the environmental impact of the integration of these models in digital health.
9.2.2 ANR Project: CoDeinE
Participants: Karën Fort, Bruno Guillaume, Bertrand Remy.
- Title: artificial text COrpus DEsIgNed Ethically automatic synthesis of clinical documents
- Duration: 03 2021–02 2026
- Coordinator: Aurélie Névéol (Limsi)
- Partners: CRC, CEA List, LISN, LORIA
- Participants: Bruno Guillaume, Karën Fort (local coordinator), Bertrand Remy
- Abstract:
Machine learning methods have become prevalent in language technologies. They rely on annotated corpora to train models and evaluate algorithms. The CoDeinE project proposes to address the lack of shareable corpora in sensitive domains such as health or banking. The key idea of the project is to use confidential corpora to automatically generate synthetic texts that mimic the linguistic properties of real documents while preserving confidentiality. We will use clinical documents in electronic patient records as a case study. Furthermore, the project will rely on Games With A Purpose and crowd sourcing to validate and annotate the synthesized texts.
9.2.3 ANR Project: Autogramm
Participants: Bruno Guillaume, Karën Fort, Khensa Amani Daoudi.
- Title: Induction of descriptive grammar from annotated corpora
- Duration: 01 2022–12 2025
- Coordinator: Sylvain Kahane (Université Paris Nanterre)
- Partners: MoDyCo, LACITO, LISN, Inria Nancy – Grand Est
- Participants: Bruno Guillaume (local coordinator), Karën Fort
- Abstract:
The goal of this project is to automate, as far as possible, the extraction of descriptive grammars and grammatical descriptions from annotated corpora for linguistic and typological studies. The project also promotes the development of treebanks for under-endowed languages, in order to extract quantitative descriptive grammars for these languages. The project uses the annotation scheme SUD (Surface-syntactic Universal Dependencies), the query tool Grew-match and the annotation tool ArboratorGrew.
9.2.4 ANR Project: CODIM
Participants: Maxime Amblard, Jacques Jayez.
- Title: Compositionality and discourse markers
- Duration: 01 2023–12 2026
- Coordinator: Mathilde Dargnat (Université de Lorraine and ATILF)
- Partners: ATILF, LLF, LORIA
- Participants: Maxime Amblard, Jacques Jayez
- Abstract:
The CODIM project focuses on the two main linguistic resources for organizing monologues or conversations in human languages : D(iscourse) M(arkers)(therefore/donc, well/ben, bon etc. in English/French) and prosody (in particular, intonation). It will evaluate their status with respect to two major views on communication: compositionality (the possibility of combining meaningful expressions into more complex meaningful expressions) and pattern or construction-based approaches (the idea that language users exploit partly `frozen’ strings of words). We will compare the semantic and prosodic properties of simple and complex French DM (e.g. ah + bon) found in corpora for written and spoken French, using a variety of technical tools for DM identification (category-driven text mining), clustering (statistics and Machine Learning) and research in prosody (duration and intensity measures, contour representation). The project fosters a number of collaborations between linguists and computer scientists.
10 Dissemination
Participants: Maxime Amblard, Clémentine Bleuze, Hee-Soo Choi, Marie Cousin, Fanny Ducel, Philippe de Groote, Amandine Decker, Karën Fort, Bruno Guillaume, Maxime Guillaume, Vincent Martin, Michel Musiol, Valentin Richard, Siyana Pavlova, Sylvain Pogodalla, Vincent Tourneur.
10.1 Promoting scientific activities
10.1.1 Scientific events: organisation
- Karën Fort: organizer of the Journées éthique et TAL (GDR LIFT, TAL, ATALA), Nancy, 2 April 2024 [46].
- Karën Fort: co-organizer, with Sylvain Loiseau (LACITO) and Berthold Crysmann (LLF), of AnnoDemo, the GDR LIFT summer school on annotation, from 2024-06-03 to 2024-06-07.
Member of the conference program committees
- Philippe de Groote: Seventh meeting of the Society for Computation in Linguistics, SCiL2024.
Reviewer
- Karën Fort: LREC-COLING 2024, TALN 2024, Games and NLP 2024, EvalLLM 2024, Journées LIFT 2024, Alsic 2024.
- Sylvain Pogodalla: 30th Workshop on Logic, Language, Information and Computation (WoLLIC 2024), selected papers of Logic and Engineering of Natural Language Semantics 20 (LENLS20).
- Bruno Guillaume: LREC-COLING 2024, Games and NLP 2024.
- Maxime Amblard: WoLLIC, TALN, SemDial, LREC-COLING 2024, ECAI, COLM, ISA, Éthique et TAL.
10.1.2 Journal
- Maxime Amblard: editor in chief of the Revue TAL.
Member of the editorial boards
- Philippe de Groote: Area editor of the FoLLI-LNCS series.
- Vincent Martin: Philosophie et Médecine.
- Sylvain Pogodalla: Member of the editorial board of the journal Traitement Automatique des Langues, in charge of the Résumés de thèses section.
Reviewer - reviewing activities
- Vincent Martin: Journal of Medical Internet Research, JMIR Formative Research, Schizophrenia Bulletin.
10.1.3 Invited talks
- ACL Teaching NLP Workshop, August 15th, 2024, Karën Fort, Teaching Ethics in Natural Language Processing in practice
- SophIA, Nov. 29th, 2024, Karën Fort, Why we should change the way we evaluate LLMs
- Journée de l'axe Terrains "l'IA en terrains minés", ATILF, June 25th 2024, Karën Fort and Fanny Ducel, Enjeux éthiques de l'IA par le prisme du TAL et des biais stéréotypés
- Festival IA, Avignon (France), Nov. 13th, 2024, Vincent Martin, Table ronde "IA en santé mentale, avancée ou menace pour notre avenir ?"
- Congrès du Sommeil, Lille (France), Nov. 20th, 2024, Vincent Martin, Atelier "Enjeux du recueil sémiologique et d'un vocabulaire standardisé en médecine du sommeil"
- LIFT 2 - Journées de lancement, Orléans (France), Nov. 14th, 2024, Bruno Guillaume, "Enrichissement des banques de dépendances syntaxiques : Deux expériences avec la morphologie et la prosodie"
- Symposium Santé Mentale, January 30th, 2024, Maxime Amblard, "Apprentissage multitâche pour la détection de la dépression dans les dialogues"
- LIH, Luxembourg, February 5th, 2024, Maxime Amblard and Michel Musiol, "NLP & gaze for mental disorders"
- CERCLES, Maxime Amblard
- ANR Isovote days, October 11th, 2024, Maxime Amblard, invited to the round table "Intelligence artificielle et traitement de corpus en sciences humaines et sociales : vers une révolution des pratiques scientifiques ?"
- Annual seminar ANR SELEXINI, June 19th, 2024, Hee-Soo Choi, "French Lexical Semantic Graphs: Enrichment by Link Prediction and Integration in a WSD model" [14]
10.1.4 Leadership within the scientific community
National responsibilities
- GDR LIFT 2, the follow-up to GDR LIFT (Linguistique Informatique, Formelle et de Terrain), started in 2024 and is led by Karën Fort. Several important events were organized in 2024, including:
- the AnnoDemo summer school (Banyuls-sur-mer, June 2024),
- an Ethics and NLP Journée d'études (LORIA, April 2024),
- the kick-off of LIFT 2 (Orléans, Nov. 2024),
- a speech Datathon in Paris (Nov 2024).
- Maxime Amblard: Leader of INSIGHT project (Lorraine Université d'Excellence project - PIA).
International responsibilities
- Karën Fort has been co-chair of the ACL Ethics Committee since 2021, with Min-Yen Kan (Univ. of Singapore) and Y. Tsvetkov (Univ. of Washington), then Luciana Benotti (Universidad Nacional de Córdoba).
10.1.5 Scientific expertise
- Sylvain Pogodalla: evaluation for the Inria Quadrant Programme.
- Maxime Amblard: HCERES evaluation of LIASD (Université Paris 8).
10.1.6 Research administration
- Karën Fort
- Member of the Conseil de Pôle AM2I of Université de Lorraine.
- Member of CNU 27 (Computer Science): participation in qualifications (had to step down upon becoming Professor).
- Maxime Amblard
- Member of CNU 27 (Computer Science)
- Head of the master in Natural Language Processing (master 1 and 2).
- Vincent Martin
- Member of the steering committee of the Human Language Technologies college from the French Association for Artificial Intelligence (AFIA).
- Sylvain Pogodalla:
- Elected member of the comité de centre Inria Nancy – Grand Est.
- In charge of the local commission IES (information et édition scientifique) of the Inria Nancy – Grand Est and LORIA.
- Member of the national commission IES of Inria.
10.2 Teaching - Supervision - Juries
10.2.1 Teaching
- Licence:
- Maxime Amblard, AI Introduction, 14h, L1, Université de Lorraine, France.
- Maxime Amblard and Chuyuan Li, NLP for beginners, 20h, L2, Université de Lorraine, France.
- Maxime Amblard, Discover Data Processing, 15h, L2, Université de Lorraine, France.
- Maxime Amblard and Chuyuan Li, Linguistic engineering, 20h, L3, Université de Lorraine, France.
- Maxime Amblard, Ethical aspects of NLP, 10h, L3, Université de Lorraine, France.
- Maxime Amblard, Clémentine Bleuze, Découverte du Traitement des Données Langagières, 30h, L2, Université de Lorraine, France
- Hee-Soo Choi, Advanced Databases, 20h, L2, Université de Lorraine, France.
- Hee-Soo Choi, Advanced Object-Oriented Programming in Python, 37.5h, L3 NLP major, Université de Lorraine, France.
- Marie Cousin, Programmation Orientée Objet Avancée, 20h, L2, Université de Lorraine, France.
- Amandine Decker, Introduction à la Programmation (Java), 18h, L1, Université de Lorraine, France.
- Karën Fort, Relational databases, 55h, L3, Sorbonne Université, France.
- Siyana Pavlova, Traitement Automatique des Langues, 10h, L2, Université de Lorraine, France.
- Siyana Pavlova, Formalismes de représentation et de raisonnement, 30h, L3, Université de Lorraine, France.
- Siyana Pavlova, Ingénierie des langues, 20h, L3, Université de Lorraine, France.
- Vincent Tourneur, Administration UNIX, 24h, L2, IUT Charlemagne, Université de Lorraine, France.
- Vincent Tourneur, Structures de données, 40h, L1, IUT Charlemagne, Université de Lorraine, France.
- Master:
- Maxime Amblard and Amandine Decker, Methods for NLP, 36h, M1 NLP (IDMC), Université de Lorraine, France.
- Maxime Amblard, NLP project, 20h, M1 NLP (IDMC), Université de Lorraine, France.
- Maxime Amblard and Amandine Decker, Dialogue ChatBot and Question Answering, 21h, M2 NLP (IDMC), Université de Lorraine, France.
- Maxime Amblard and Amandine Decker, Dialogue Engineering, 14h, M2 NLP (IDMC) LI, Université de Lorraine, France.
- Maxime Amblard and Marie Cousin, Math and Theoretical Computer Science, 50h, M1 NLP (IDMC), Université de Lorraine, France.
- Maxime Amblard, Introduction to NLP, M1 NLP (IDMC), Université de Lorraine, France.
- Hee-Soo Choi, Written Corpora (English), 30h, M1 NLP (IDMC), Université de Lorraine, France.
- Hee-Soo Choi, Software Projects (English), 24h, M1 NLP (IDMC), Université de Lorraine, France.
- Fanny Ducel, Python Programming (English), 37h, M1 NLP (IDMC), Université de Lorraine, France.
- Fanny Ducel, Introduction to G5K (English), 4h, M1 NLP (IDMC), Université de Lorraine, France.
- Philippe de Groote and Marie Cousin, Formal Logic, 22h, M1 NLP (IDMC), Université de Lorraine, France.
- Philippe de Groote, Formal languages, 22h, M1 NLP (IDMC), Université de Lorraine, France.
- Philippe de Groote and Amandine Decker, Semantics and Discourse, 30h, M2 NLP (IDMC), Université de Lorraine, France.
- Philippe de Groote, Computational structures and logics for natural language modeling, 12h, MPRI, Université Paris Cité, France.
- Karën Fort, Fanny Ducel, Ethics and NLP (English), 19h, M1 NLP (IDMC), Université de Lorraine, France.
- Karën Fort, Fanny Ducel, Ethics and orientation (English), 25h, M2 NLP (IDMC), Université de Lorraine, France.
- Karën Fort, Corpora, resources and tools for linguistics, 39h, M1, Sorbonne Université, France.
- Bruno Guillaume, Written Corpora (English), 15h, M1 NLP (IDMC), Université de Lorraine, France.
- Bruno Guillaume, Lexical Resources (English), 15h, M2 NLP (IDMC), Université de Lorraine, France.
- Vincent Martin, Voice biomarkers (English), 12h, M2 NLP (IDMC), Université de Lorraine, France
- Vincent Martin, NLP projects (English), 3h, M1 NLP (IDMC), Université de Lorraine, France
- Vincent Martin, Critical analysis of artificial intelligence for health (English), 6h, Master 2 Health Engineering, Université Grenoble Alpes, France
- Vincent Martin, Back to the big wide world: how to integrate digital tools into clinical practice? (English), 6h, Master 2 Health Engineering, Université Grenoble Alpes, France
- Sylvain Pogodalla, Semantics, 10h, M1 NLP (IDMC), Université de Lorraine, France
- Sylvain Pogodalla and Amandine Decker, Syntactic Models, 20h, M2 NLP (IDMC), Université de Lorraine, France
- Doctorate:
- Karën Fort, Annotation collaborative de corpus, 4h, Summer school of the GDR LIFT, AnnoDemo 06 2024 – Banyuls-sur-Mer, France.
- International Summer School:
- Bruno Guillaume with Daniel Zeman and Agata Savary, Corpus annotation infrastructure at the 1st UniDive training school, 8-12 July 2024 in Chișinău, Moldova.
10.2.2 Supervision
PhD in progress
- Vincent-Thomas Barrouillet, Le discours pathologique du sujet schizophrène, caractérisation psycholinguistique et computationnelle des déviations décisives à la logicité dialogique en étude de corpus, since 10 2019. Supervision: Michel Musiol and Maxime Amblard.
- Clémentine Bleuze, Perception et évaluation des biais dans les applications des LLM au domaine biomédical, since 10 2024. Supervision: Karën Fort and Aurélie Névéol (LISN-CNRS).
- Colleen Beaumard, Biomarqueurs vocaux collectés par des agents conversationnels pour l'aide au diagnostic et le suivi des troubles du sommeil et des troubles mentaux, since 10 2022. Supervision: Jean-Luc Rouas (Université de Bordeaux, LaBRI), Pierre Philip (Université de Bordeaux, SANPSY) and Vincent Martin.
- Hee-Soo Choi, Lier des ressources lexicales du français en vue d'une interopérabilité entre niveaux linguistiques, since 10 2021. Supervision: Karën Fort and Mathieu Constant.
- Marie Cousin, Modélisation de paraphrase dans les grammaires catégorielles abstraites, since 10 2022. Supervision: Philippe de Groote and Sylvain Pogodalla.
- Amandine Decker, Modelling Topic-level Interaction in Pathological Conversations, since 10 2022. Supervision: Maxime Amblard and Ellen Breitholtz (University of Gothenburg, Sweden).
- Fanny Ducel, Evaluating stereotyped biases in auto-regressive language models, since 10 2023. Supervision: Karën Fort and Aurélie Névéol (LISN-CNRS).
- Maxime Guillaume, Structures de traits pour les Grammaires Catégorielles Abstraites, since 07 2021. Supervision: Philippe de Groote and Raphaël Salmon (Yseop).
- Santiago Herrera, Extraction de grammaires descriptives à partir de corpus annotés en syntaxe, since 09 2022. Supervision: Sylvain Kahane (MoDyCo, Université Paris Nanterre) and Bruno Guillaume.
- Nicolas Hiebel, Création éthique de données textuelles artificielles : application au domaine biomédical, since 10 2021. Supervision: Aurélie Névéol (LISN-CNRS), Karën Fort and Olivier Ferret (CEA).
- Amandine Lecomte, Analyse longitudinale de prise en charge psychothérapeutique de patients psychiatriques et de patients atteints de maladies neurodégénératives : informatisation et modélisation dialogique des indices comportementaux associés à l’efficacité (vs échec) des stratégies de prise en charge tentées par les thérapeutes, since 10 2019. Supervision: Michel Musiol and Alexandra König.
- Siyana Pavlova, Tools and Methods for Semantic Annotation, since 11 2020. Supervision: Maxime Amblard and Bruno Guillaume.
- Valentin Richard, Aspects compositionnels et dynamiques de la sémantique inquisitrice, since 09 2021. Supervision: Philippe de Groote, Floris Roelofsen and Reinhart Muskens (Universiteit van Amsterdam, ILLC).
- Vincent Tourneur, Algorithmes d’analyse syntaxique pour les grammaires catégorielles abstraites, since 10 2024. Supervision: Philippe de Groote.
10.2.3 Juries
- Karën Fort PhD examiner and jury president for Célina Treuillier, Modélisation individuelle et multi-factorielle du phénomène de polarisation pour une personnalisation de l’apport en diversité dans les recommandations de news, Université de Lorraine, Oct. 2024.
- Karën Fort PhD examiner and jury president for Vijini Liyanage, Detection of artificially generated academic texts, Université Paris 13, May 2024.
- Karën Fort PhD reviewer (rapporteure) for Lilia Segundo Diaz, Juegos del español – A Collaborative Game-Based Approach to Building a Parsed Corpus of European Spanish Dialects, Hasselt University, Belgium, April 2024.
- Maxime Amblard PhD reviewer and jury member for Alban Petit, Structured prediction methods for semantic parsing, Université Paris-Saclay, France, February 2024.
- Maxime Amblard PhD reviewer and jury president for Navneet Agarwal, Automated depression level estimation: a study on discourse structure, input representation and clinical reliability, Université Normandie, June 2024.
- Maxime Amblard PhD examiner and jury president for Esteban Marquer, Reasoning over Data: Analogy-based and Transfer Learning to improve Machine Learning, Université de Lorraine, France, June 2024.
- Maxime Amblard PhD examiner and jury president for Florian Marchal-Bornet, Towards Explaining Recommendation with Stories inside an Augmented Territory, Université de Lorraine, France, September 2024.
- Michel Musiol PhD examiner and jury president for T. Mimault, Nouvelle perspective clinique des hommes auteurs de violences conjugales : Vers une typologie dissociative visant à mieux orienter la prévention de la récidive, joint supervision Université de Lorraine and Université du Québec à Trois-Rivières, November 2024.
10.3 Popularization
- Maxime Amblard is a member of the scientific committee of )i( interstices.
- Maxime Amblard is a member of the procès du robot team.
10.3.1 Productions (articles, videos, podcasts, serious games, ...)
- Karën Fort, Interview in Inria.fr: Pour une éthique du traitement automatique des langues : le regard de Karën Fort
- Karën Fort, Revue Vie de la recherche scientifique (VRS), 437, Les vrais dangers de l'IA
- Karën Fort, Maxime Amblard and Marc Anderson organised a doctoral training course that led to the creation of 6 short stories in 3 languages, collected in a book [50].
- Amandine Lecomte, interviews on France 3 Lorraine, April 2024, and France Info, April 2024, "Santé : Comment l'intelligence artificielle peut aider les médecins."
10.3.2 Participation in Live events
- Hee-Soo Choi: 2024-06-27, introducing linguistics and NLP to primary school children who have won the regional competition La Nouvelle de la Classe (ATILF, Nancy, France)
- Hee-Soo Choi and Fanny Ducel: 2024-01-16, presentation to high school students in the context of the "Chiche !" initiative (lycée Saint-Exupéry Fameck, France)
- Marie Cousin, Amandine Decker and Vincent Tourneur: 2024-01-11, presentation to high school students in the context of the "Chiche !" initiative (Lycée des métiers du tertiaire Jean-Victor Poncelet Saint-Avold, France)
- Marie Cousin, Amandine Decker, Karën Fort: 2024-09-21 and 2024-09-22, participation in “Journées européennes du Matrimoine” (Féru des Sciences, Nancy, France)
- Karën Fort, Invited presentation to the Association des Professeurs de Mathématiques de l'Enseignement Public (APMEP) : "Les outils de TAL : une opportunité pédagogique peut en cacher une autre", March 2024.
- Karën Fort, Apéro Scientifique du club ORION Human Interact, 10 April 2024, Les enjeux éthiques de l'IA vus par le prisme du traitement automatique des langues
- Karën Fort, Mines students, LORIA, 17 January 2024, Enjeux éthiques du Traitement Automatique des Langues (TAL) : le cas des biais stéréotypés
- Karën Fort, IHEST, LORIA, 26 June 2024, Enjeux éthiques de l'IA par le prisme du TAL
- Marie Cousin: 2024-02-02, participation in FIRST (Femmes Ingénieures, Réussir en Sciences et Technologies), presentation of research to female students in seconde, the first year of lycée (Lycée Fabert, Metz, France)
- Valentin Richard gave a talk for Université Populaire et Participative de Vandœuvre at the Fabrique des Possibles (Vandœuvre-lès-Nancy), 17 April 2024.
- Maxime Amblard gave a talk about AI for university libraries in Épinal.
- Maxime Amblard gave a talk on "Artificial Intelligence for schizophrenia" to high school students, to help them prepare a presentation to the Comité Consultatif National d’Ethique, 22 May 2024, Assemblée Nationale.
- Maxime Amblard gave a talk, "IA et éthique : connaître les biais pour fixer les règles", for the national DPO day organised by CNIL and Métropole Grand Nancy, June 12th, 2024.
- Maxime Amblard presented Artificial Intelligence at la Frugalité Heureuse, national architecture meeting, October 6th, 2024.
- Maxime Amblard presented Artificial Intelligence to high school students at Lycée Loritz, February 8th, 2024.
11 Scientific production
11.1 Major publications
- [1] inproceedings. The Elephant in the Room: Analyzing the Presence of Big Tech in Natural Language Processing Research. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, Volume 1: Long Papers, Toronto, Canada, Association for Computational Linguistics, 2023, 13141-13160.
- [2] article. Human Where? A New Scale Defining Human Involvement in Technology Communities from an Ethical Standpoint. International Review of Information Ethics, August 2022.
- [3] article. Non-size increasing Graph Rewriting for Natural Language Processing. Mathematical Structures in Computer Science 28(8), 2018, 1451-1484.
- [4] book. Application of Graph Rewriting to Natural Language Processing. Vol. 1, Logic, Linguistics and Computer Science Set, ISTE Wiley, 2018, 272.
- [5] article. "You'll be a nurse, my son!" Automatically Assessing Gender Biases in Autoregressive Language Models in French and Italian. Language Resources and Evaluation, October 2024.
- [6] article. A Note on Intensionalization. Journal of Logic, Language and Information 22(2), 2013, 173-194.
- [7] inproceedings. French CrowS-Pairs: Extending a challenge dataset for measuring social bias in masked language models to a language other than English. ACL 2022 - 60th Annual Meeting of the Association for Computational Linguistics, Dublin, Ireland, May 2022.
- [8] article. A syntax-semantics interface for Tree-Adjoining Grammars through Abstract Categorial Grammars. Journal of Language Modelling 5(3), 2017, 527-605.
- [9] article. Factives at hand: When presupposition mode affects motor response. Journal of Experimental Psychology, 2022.
11.2 Publications of the year
International journals
- [10] article. Evaluating the acceptability of ethical recommendations in industry 4.0: an ethics by design approach. AI & Society: Knowledge, Culture and Communication, January 2024.
- [11] article. "You'll be a nurse, my son!" Automatically Assessing Gender Biases in Autoregressive Language Models in French and Italian. Language Resources and Evaluation, October 2024.
National journals
- [12] article. Bias Research for Language Models is Biased: a Survey for Deconstructing Bias in Large Language Models. Revue TAL : traitement automatique des langues 64(3), September 2024, 119-143.
- [13] article. La domestication de la sémiologie : proposition d’une organisation graphique du thesaurus semeioticus psychiatrique chez l’adulte. Annales Médico-Psychologiques, Revue Psychiatrique, November 2024.
Invited conferences
- [14] inproceedings. French Lexical Semantic Graphs: Enrichment by Link Prediction and Integration in a WSD model. Séminaire annuel ANR SELEXINI, Paris, France, June 2024.
International peer-reviewed conferences
- [15] inproceedings. Using Structured Health Information for Controlled Generation of Clinical Cases in French. The 6th Clinical Natural Language Processing Workshop at NAACL 2024 (ClinicalNLP 2024), Mexico City, Mexico, June 2024.
- [16] inproceedings. Beyond Model Performance: Can Link Prediction Enrich French Lexical Graphs? The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING), Turin, Italy, May 2024.
- [17] inproceedings. With a Little Help from my (Linguistic) Friends: Topic Segmentation of Multi-party Casual Conversations. Proceedings of the 5th Workshop on Computational Approaches to Discourse (CODI 2024), Malta, Association for Computational Linguistics, March 2024, 177-188.
- [18] inproceedings. "Wait, did you mean the doctor?": Collecting a Dialogue Corpus for Topical Analysis. Proceedings of the 28th Workshop on the Semantics and Pragmatics of Dialogue (SEMDIAL 2024), Rovereto, Italy, 2024.
- [19] inproceedings. Your Stereotypical Mileage may Vary: Practical Challenges of Evaluating Biases in Multiple Languages and Cultural Contexts. The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, Turin, Italy, May 2024.
- [20] inproceedings. Joint Annotation of Morphology and Syntax in Dependency Treebanks. The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING), Turin, Italy, May 2024.
- [21] inproceedings. ACGtk: A Toolkit for Developing and Running Abstract Categorial Grammars. Functional and Logic Programming. 17th International Symposium, FLOPS 2024, Proceedings, Lecture Notes in Computer Science, LNCS 14659, Kumamoto, Japan, Springer, May 2024, 13-30.
- [22] inproceedings. Hostomytho: A GWAP for Synthetic Clinical Texts Evaluation and Annotation. Games and Natural Language Processing Workshop at LREC-COLING 2024, Turin, Italy, May 2024.
- [23] inproceedings. Argumentation et probabilités, ou pourquoi l'argumentation rationnelle n'est pas (toujours) un raisonnement. Congrès Mondial de Linguistique Française, Lausanne, Switzerland, July 2024.
- [24] inproceedings. Discourse Relation Prediction and Discourse Parsing in Dialogues with Minimal Supervision. Proceedings of the 5th Workshop on Computational Approaches to Discourse (CODI 2024), Malta, Association for Computational Linguistics, March 2024, 161-176.
- [25] inproceedings. Mitigating Data Scarcity in Semantic Parsing across Languages: the Multilingual Semantic Layer and its Dataset. The 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024), Bangkok, Thailand, August 2024.
- [26] inproceedings. Unveiling Strengths and Weaknesses of NLP Systems Based on a Rich Evaluation Corpus: the Case of NER in French. Proceedings of LREC-COLING 2024, Turin, Italy, May 2024.
- [27] inproceedings. YARN is All You Knit: Encoding Multiple Semantic Phenomena with Layers. Proceedings of The Fifth International Workshop in Designing Meaning Representation, Turin, Italy, May 2024.
- [28] inproceedings. "selon comment vous vous positionnez": Study of French Interrogative-based Adverbial Adjuncts. Actes du 9e Congrès Mondial de Linguistique Française, vol. 191, Lausanne, Switzerland, SHS Web of Conferences, June 2024, 14010.
- [29] inproceedings. Dynamic Effects of Modalized Questions. Proceedings of the 24th Amsterdam Colloquium, Amsterdam, Netherlands, December 2024, 298-307.
- [30] inproceedings. COORTE: A Toolkit for Putting a French Spelling Reform Suggestion into Practice. Actes du 9e Congrès Mondial de Linguistique Française, vol. 191, Lausanne, Switzerland, SHS Web of Conferences, June 2024, 11002.
- [31] inproceedings. UniDive: A COST Action on Universality, Diversity and Idiosyncrasy in Language Technology. Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages (SIGUL-2024) @ LREC-COLING 2024, Turin, Italy, 2024.
- [32] inproceedings. New Methods for Exploring Intonosyntax: Introducing an Intonosyntactic Treebank for Nigerian Pidgin. Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), Turin, Italy, ACL, May 2024, 12207-12216.
National peer-reviewed Conferences
- [33] inproceedings. Génération contrôlée de cas cliniques en français à partir de données médicales structurées. Actes de JEP-TALN-RECITAL 2024. Actes des 35èmes Journées d'Études sur la Parole. 35èmes Journées d'Études sur la Parole (JEP 2024), 31ème Conférence sur le Traitement Automatique des Langues Naturelles (TALN 2024), 26ème Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RECITAL 2024), Toulouse, France, ATALA & AFPC, July 2024, 435-448.
- [34] inproceedings. What ChatGPT tells us about ourselves. Journée d'étude Éthique et TAL 2024, Nancy, France, April 2024.
- [35] inproceedings. Au-delà de la performance des modèles : la prédiction de liens peut-elle enrichir des graphes lexico-sémantiques du français ? Actes de JEP-TALN-RECITAL 2024. 31ème Conférence sur le Traitement Automatique des Langues Naturelles, volume 1 : articles longs et prises de position, Toulouse, France, ATALA & AFPC, 2024, 36-49.
- [36] inproceedings. Building an Unsupervised Topical Similarity Measure for Conversation. 35èmes Journées d'Études sur la Parole (JEP 2024), 31ème Conférence sur le Traitement Automatique des Langues Naturelles (TALN 2024), 26ème Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RECITAL 2024), volume 1 : articles longs et prises de position, Toulouse, France, ATALA & AFPC, 2024, 362-375.
- [37] inproceedings. Automatically Assessing Gender Biases in Autoregressive Language Models. TALN 2024, Toulouse, France, July 2024.
- [38] inproceedings. Où la frugalité rejoint l'éthique : utilisation de données synthétiques pour la reconnaissance d'entités cliniques. Journée d'étude sur le traitement automatique des langues frugal et la recherche d'information frugale, ATALA, Paris, France, January 2024.
- [39] inproceedings. De nouvelles méthodes pour l'exploration de l'interface syntaxe-prosodie : un treebank intonosyntaxique et un système de synthèse pour le pidgin nigérian. 35èmes Journées d'Études sur la Parole (JEP 2024), 31ème Conférence sur le Traitement Automatique des Langues Naturelles (TALN 2024), 26ème Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RECITAL 2024), volume 1 : articles longs et prises de position, Toulouse, France, ATALA & AFPC, 2024, 376-383.
Conferences without proceedings
- [40] inproceedings. « Vers la création d'une super-intelligence » : un corpus pour étudier les revendications des articles de TAL. Journées de lancement LIFT 2, Orléans, France, November 2024.
- [41] inproceedings. Desiderata for Actionable Bias Research. New Perspectives on Bias and Discrimination in Language Technology, Amsterdam, Netherlands, November 2024.
- [42] inproceedings. Perspective on individuals. Sinn und Bedeutung 29, Noto, Italy, 2024.
- [43] inproceedings. Exploring Sampling Strategies for Linguistic Diversity: A Comparative Analysis of UD Treebanks. 15th International Conference of the Association for Linguistic Typology, Singapore, December 2024.
Scientific book chapters
- [44] inbook. Extending Abstract Categorial Grammars with Feature Structures: Theory and Practice. Logic and Engineering of Natural Language Semantics, 20th International Conference, LENLS20, Osaka, Japan, November 18-20, 2023, Revised Selected Papers, Lecture Notes in Computer Science, vol. 14569, Springer, 2024, 118-133.
- [45] inbook. On the Semantics of Dependencies: Relative Clauses and Open Clausal Complements. Logic and Engineering of Natural Language Semantics, 20th International Conference, LENLS20, Osaka, Japan, November 18-20, 2023, Revised Selected Papers, Lecture Notes in Computer Science, vol. 14569, Springer, 2024, 244-259.
Edition (books, proceedings, special issue of a journal)
- [46] proceedings. Ethics and NLP: 10 years after. Journée d'études ATALA "éthique et TAL : 10 ans après", 2024.
Reports & preprints
- [47] misc. Adding communicative structure to the MMT into ACG encoding. October 2024.
Scientific popularization
- [48] inbook. PENSE M'EN. THINK BEFORE LOADING, 2024.
- [49] inbook. STATE OF THE ART. THINK BEFORE LOADING, 2024.
- [50] book. Karën Fort, Marc M. Anderson, Aurore Coince, Mathieu D’aquin, Maxime Amblard, Sarah E. Carter, Ilaria Tiddi and Diane Ranville, eds. Think before loading: Ne vous en faites pas, ça va mal se passer. 2024.
11.3 Cited publications
- 51 inproceedingsMon ordinateur est-il un bon psy ? Le TAL au service du diagnostic médical.Journée du GDR TAL : Intelligence artificielle et technologies des langues : l'ordinateur passe la barrière de la langue (2021)GDR TALParis, FranceJanuary 2021HALback to text
- 52 articleL’hypothèse du continuum bipolarité-schizophrénie au risque psycholinguistique des ruptures discursives.Ann MedPsycholto appear (available online in January 2025)2025back to text
- 53 inproceedingsMeaning-Text Theory within Abstract Categorial Grammars: Towards Paraphrase and Lexical Function Modeling for Text Generation.Proceedings of the 15th International Conference on Computational Semantics (IWCS)Nancy, FranceAssociation for Computational LinguisticsJune 2023HALback to text
- 54 inproceedingsVers une implémentation de la théorie sens-texte avec les grammaires catégorielles abstraites.Actes de CORIA-TALN 2023. Actes des 16e Rencontres Jeunes Chercheurs en RI (RJCRI) et 25e Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL)Paris, FranceATALAJune 2023, 72-86HALback to text
- 55 articleDigital Phenotyping for Differential Diagnosis of Major Depressive Episode: Narrative Review.JMIR Mental Health10January 2023, e37225HALDOIback to text
- 56 bookThe Interactive Stance.OxfordOxford University Press2012back to text
- 57 articleOn the expressive power of Abstract Categorial Grammars: Representing context-free formalisms.134http://www.springerlink.com/content/1572-9583/2004, 421--438HALDOIback to text
- 58 inproceedingsTowards a Montagovian account of dynamics.Proceedings of the 16th Semantics and Linguistic Theory Conference (SALT 16)2006DOIback to text
- 59 inproceedingsTowards abstract categorial grammars.Association for Computational Linguistics, 39th Annual Meeting and 10th Conference of the European ChapterColloque avec actes et comité de lecture. internationale.Toulouse, FranceJuly 2001, 148--155HALback to text
- 60 articleInteraction Grammars.72-42009, 171--208HALDOIback to text
- 61 article. The Zipper. Journal of Functional Programming, 7(5), 1997, 549--554. DOI.
- 62 inproceedings. Economic Event Detection in Company-Specific News Text. Proceedings of the First Workshop on Economics and Natural Language Processing, Melbourne, Australia, Association for Computational Linguistics, July 2018, 1--10. DOI. URL: https://aclanthology.org/W18-3101/
- 63 inproceedings. (Innocent?) Bias in Argumentation. IMPAQTS (Implicit Manipulation in Politics -- Quantitatively Assessing the Tendentiousness of Speeches) final conference, Roma III, Rome, Italy, April 2023. HAL.
- 64 inproceedings. Discourse markers are not special (but they can be complicated). Empirical Issues in Syntax and Semantics. Selected papers from CSSP 2023, Paris, France, 2025. HAL.
- 65 incollection. L'entretien clinique avec la personne polyhandicapée : un terrain commun sciences du langage / psychiatrie. Les sciences du langage face aux défis de la disciplinarisation et de l'interdisciplinarité. Malika Temmar, Marina Krylyschin, Guy Achard-Bayle (éds). January 2021. HAL.
- 66 inproceedings. Organisations et fonctions du comportement verbal de type « backchannels » dans l'interaction clinique avec la personne souffrant de schizophrénie. 8eme Congrès mondial de linguistique française, Orléans, France, July 2022. HAL.
- 67 inproceedings. Multityped Abstract Categorial Grammars and Their Composition. WoLLIC 2022 - 28th International Workshop on Logic, Language, Information, and Computation, Lecture Notes in Computer Science 13468, Iaşi, Romania, Springer International Publishing, September 2022, 105--122. HAL, DOI.
- 68 book. Semantics: From Meaning to Text. Volume 1, Studies in Language Companion Series 129, Amsterdam/Philadelphia, John Benjamins Publishing Company, 2012.
- 69 article. Incohérence et formes psychopathologiques dans l’interaction verbale. Psychose, langage et action: Approches neuro-cognitives, 2009, 217.
- 70 article. Le problème de l'analyse des troubles de la pensée dans le discours avec la personne schizophrène : proposition méthodologique. 87(2), April 2022, 347--369. HAL, DOI.
- 71 article. La rationalité de l'incohérence en conversation schizophrène. 2006, in press. HAL, DOI.
- 72 incollection. L'analyse de l'interaction verbale « patient » - « thérapeute » par la modélisation formelle : perspectives diagnostiques et informatisation. L’avenir de la psychiatrie, accepted, Ellipse, 2025.
- 73 article. Ajustement comportemental et mouvements de saccades oculaires dans la schizophrénie. 81, 2016, 365--379. HAL, DOI.
- 74 inproceedings. A French Interaction Grammar. RANLP 2007 - International Conference on Recent Advances in Natural Language Processing, IPP & BAS & ACL-Bulgaria, Borovets, Bulgaria, INCOMA Ltd, Shoumen, Bulgaria, September 2007, 463--467. HAL.
- 75 article. A physical framework to harmonize human interaction analysis across disciplines. Current Psychology, to appear, 2025.
- 76 article. Updated norms of the MOS-SF36 in the young French population. 2022. HAL, DOI.
- 77 article. Self-Beneficial Transactional Social Dynamics for Cooperation in Shwachman-Diamond Syndrome: A Mixed-Subject Analysis using Computational Pragmatics. Frontiers in Psychology, accepted, 2025.
- 78 inproceedings. An AMR-based Link Prediction Approach for Document-level Event Argument Extraction. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, Canada, Association for Computational Linguistics, July 2023, 12876--12889. DOI. URL: https://aclanthology.org/2023.acl-long.720/