Section: Partnerships and Cooperations

National Initiatives

LabEx EFL (Empirical Foundations of Linguistics) (2011 – 2021)

Participants : Laurence Danlos, Benoît Sagot, Marie-Hélène Candito, Benoît Crabbé, Pierre Magistry, Djamé Seddah, Maximin Coavoux, Éric Villemonte de La Clergerie.

Linguistics and related disciplines addressing language have achieved much progress in the last two decades, but improved interdisciplinary communication and interaction can significantly boost this positive trend. The LabEx (cluster of excellence) EFL (Empirical Foundations of Linguistics), launched in 2011 and headed by Jacqueline Vaissière, opens new perspectives by adopting an integrative approach. It groups together some of the leading French research teams in theoretical and applied linguistics, in computational linguistics, and in psycholinguistics. Through collaborations with prestigious multidisciplinary institutions (CSLI, MIT, Max Planck Institute, SOAS...), the project aims to contribute to the creation of a Paris School of Linguistics, a novel and innovative interdisciplinary site where dialog among the language sciences can be fostered, with a special focus on empirical foundations and experimental methods and valuable expertise in technology transfer and applications.

Alpage is a very active member of the LabEx EFL together with other linguistics teams we have been increasingly collaborating with: LLF (University Paris 7 & CNRS) for formal linguistics, LIPN (University Paris 13 & CNRS) for NLP, LPNCog (University Paris 5 & CNRS) and LSCP (ENS, EHESS & CNRS) for psycholinguistics, and MII (University Paris 4 & CNRS) for Iranian and Indian studies. Alpage resources and tools have already proven relevant for research at the junction of all these areas of linguistics, both before the start of the LabEx EFL and within several EFL “scientific operations”. Moreover, the LabEx provides Alpage with opportunities for collaborating with new teams, e.g., on language resource development and empirical studies in collaboration with descriptive linguists.

The LabEx EFL's scientific activities are spread across 7 autonomous scientific “strands”. In 2016, Benoît Sagot, Marie Candito and Benoît Crabbé were respectively deputy heads of strand 6 on “Language Resources”, strand 5 on “Computational semantic analysis” and strand 2 on “Experimental grammar from a cross-linguistic perspective”. Several project members are in charge of research operations within these 3 strands.


ANR project Profiterole (2017 - 2020)

Participants : Benoît Crabbé, Éric Villemonte de La Clergerie, Benoît Sagot.

PROFITEROLE is a 4-year ANR research project led by Sophie Prévost (LATTICE) that involves computational linguists and specialists of Medieval French from LATTICE (Univ. Paris 3, CNRS, ENS), ALPAGE and ICAR (Univ. Lyon, ENS).

PROFITEROLE has three closely correlated main goals that fall within the fields of linguistics and Natural Language Processing (NLP): (1) formal and computational modeling of the phonological, morphological and syntactic aspects of the diachronic evolution of French; (2) developing a methodology to explore and annotate heterogeneous linguistic data while providing automatic analysers for various stages of the French language; (3) expanding linguistic resources for French, by building a large annotated corpus (1 million words) of Medieval French (9th-15th centuries) and morphological lexicons (plus NLP tools) covering several stages of French. Alpage members will essentially be involved in the computational and formal modeling aspects of the project and in the design of automated processing tools for lexicon and syntax.

ANR project PARSITI (2016 - 2020)

Participants : Marie-Hélène Candito, Djamé Seddah [principal investigator] , Benoît Crabbé, Éric Villemonte de La Clergerie, Benoît Sagot.

Exploiting multilingual user-generated content (UGC) for applications such as information extraction, text mining or summarization, and making it accessible to a wider audience, requires a qualitative step forward in Natural Language Understanding. This is because UGC differs from better-studied edited data in many ways, including its non-canonical syntax, highly contextualised nature and rich lexical variability. The ParSiTi ANR project focuses on three critical aspects: (1) Robust Parsing Technologies, (2) Accurate Machine Translation Engines and (3) Context-aware Methods, all backed by State-of-the-Art Morphological Analysers and Normalization tools. To showcase the different models and algorithms designed during the project, a Machine Translation System will be developed that will be able to translate UGC between French, Arabic and English.

ANR project PARSEME-FR (2016 - 2019)

Participants : Marie-Hélène Candito, Mathieu Constant [principal investigator] , Benoît Crabbé, Laurence Danlos, Éric Villemonte de La Clergerie, Djamé Seddah.

PARSEME-FR is a 4-year ANR research project headed by Mathieu Constant (LIGM, Université Paris-Est Marne-la-Vallée, currently in “délégation” at Alpage). PARSEME-FR partners are LIGM, Alpage, LI (Université de Tours), LIF (Aix-Marseille Université) and LIFO (Université d'Orléans). This project aims at improving the linguistic representativeness, precision and computational efficiency of Natural Language Processing (NLP) applications, notably parsing. The project focuses on the major bottleneck of these applications: Multi-Word Expressions (MWEs), i.e., groups of words with a certain degree of idiomaticity such as “hot dog”, “to kick the bucket”, “San Francisco 49ers” or “to take a haircut”. In particular, it aims at investigating the syntactic and semantic representation of MWEs in language resources, the integration of MWE analysis in (deep) syntactic parsing and its links to semantic processing. Expected deliverables include enhanced language resources (lexicons, grammars and annotated corpora) for French, MWE-aware (deep) parsers and tools linking predicted MWEs to knowledge bases. This project is a spin-off of the European IC1207 COST action PARSEME on the same topic.

Alpage is participating mainly in two tasks: (i) the production of an evaluation corpus annotated with MWEs and (ii) the production of MWE-aware statistical parsers, both for surface syntax and deep syntax. MWE recognition can be viewed as part of a more ambitious task of recovering the semantic units of a sentence. Combining it with deep syntactic parsing will provide a further step towards semantic parsing.

ANR project SoSweet (2015 - 2019)

Participants : Djamé Seddah, Marie-Hélène Candito, Benoît Sagot, Éric Villemonte de La Clergerie, Benoît Crabbé.

Led by Jean-Philippe Magué (ENS Lyon), the SoSweet project focuses on the synchronic variation and the diachronic evolution of the variety of French used on Twitter. Its goal is to provide a state-of-the-art sociolinguistic description of half a billion tweets collected over 5 years.

Alpage, specialized in natural language processing, takes care of the linguistic enrichment part, which provides the other partners with normalized and structurally enriched forms of text. Alpage is also responsible for providing a distributional analysis of the corpus, by means of various forms of word clustering, in order to define sociolinguistic variants in the tweets.

ANR project ASFALDA (2012 – 2016)

Participants : Marie-Hélène Candito [principal investigator] , Marianne Djemaa, Benoît Sagot, Éric Villemonte de La Clergerie, Laurence Danlos.

Alpage is the principal investigator team for the ANR project ASFALDA, led by Marie Candito. The other partners are the Laboratoire d'Informatique Fondamentale de Marseille (LIF), the CEA-List, the MELODI team (IRIT, Toulouse), the Laboratoire de Linguistique Formelle (LLF, Paris Diderot) and the company Ant'inno.

The project aims to provide both a French corpus with semantic annotations and automatic tools for shallow semantic analysis, using machine learning techniques to train analyzers on this corpus. The target semantic annotations are structured following the FrameNet framework [54] and can be characterized roughly as making explicit “who does what when and where”, in a way that abstracts away from word order and syntactic variation, and from some of the lexical variation found in natural language.

The project relies on an existing standard for semantic annotation of predicates and roles (FrameNet), and on a previous linguistic annotation effort for French (the French Treebank). The original FrameNet project provides a structured set of prototypical situations, called frames, along with a semantic characterization of the participants of these situations (called roles). We propose to take advantage of this semantic database, which has proved largely portable across languages, to build a French FrameNet, meaning both a lexicon listing which French lexemes can express which frames, and an annotated corpus in which occurrences of frames and roles played by participants are made explicit. The addition of semantic annotations to the French Treebank, which already contains morphological and syntactic annotations, will boost its usefulness both for linguistic studies and for machine-learning-based Natural Language Processing applications for French, such as content semantic annotation, text mining or information extraction.
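As a purely illustrative sketch, the two resources described above (a lexicon mapping lexemes to frames, and annotations making frames and roles explicit) can be pictured with a few data structures. The class names, the `LEXICON` table, the `annotate` helper and the lexeme-to-frame mapping below are hypothetical and do not reflect the actual ASFALDA data format; the frame and role names are given only as examples in the FrameNet style.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Frame:
    """A prototypical situation together with the roles of its participants."""
    name: str
    roles: Tuple[str, ...]


@dataclass
class FrameInstance:
    """One annotated occurrence: a frame, the lexeme evoking it, and role fillers."""
    frame: Frame
    trigger: str
    fillers: Dict[str, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Reject fillers attached to roles that the frame does not define.
        unknown = set(self.fillers) - set(self.frame.roles)
        if unknown:
            raise ValueError(f"unknown roles for {self.frame.name}: {sorted(unknown)}")


# Hypothetical frame and lexicon entries, for illustration only.
COMMERCE_BUY = Frame("Commerce_buy", ("Buyer", "Goods", "Seller"))
LEXICON: Dict[str, List[Frame]] = {"acheter": [COMMERCE_BUY]}


def annotate(trigger: str, fillers: Dict[str, str]) -> List[FrameInstance]:
    """Build one instance per frame that the lexicon lists for the trigger lexeme."""
    return [FrameInstance(f, trigger, dict(fillers)) for f in LEXICON.get(trigger, [])]


# "Marie achète un livre" and "Un livre est acheté par Marie" differ in syntax
# but receive the same frame annotation.
active = annotate("acheter", {"Buyer": "Marie", "Goods": "un livre"})
passive = annotate("acheter", {"Goods": "un livre", "Buyer": "Marie"})
```

In this sketch the active and passive sentences map to equal annotations, which is the sense in which frame annotation abstracts away from word order and syntactic variation.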

To cope with the intrinsic coverage difficulty of such a project, we adopt a hybrid strategy to obtain both exhaustive annotation for some specific selected concepts (commercial transaction, communication, causality, sentiment and emotion, time), and exhaustive annotation for some highly frequent verbs. Pre-annotation of roles will be tested, using linking information between deep grammatical functions and semantic roles.

The project is structured as follows:

  • Task 1 concerns the delimitation of the focused FrameNet substructure, and its coherence verification, in order to make the resulting structure more easily usable for inference and for automatic enrichment (with compatibility with the original model);

  • Task 2 concerns all the lexical aspects: which lexemes can express the selected frames, how they map to external resources, and how their semantic arguments can be expressed syntactically, information that can be used for automatic pre-annotation of the corpus;

  • Task 3 is devoted to the manual annotation of corpus occurrences (we target 20,000 annotated occurrences);

  • In Task 4 we will design a semantic analyzer, able to automatically make explicit the semantic annotation (frames and roles) on new sentences, using machine learning on the annotated corpus;

  • Task 5 consists in testing the integration of the semantic analysis in an industrial search engine, and in measuring its usefulness in terms of user satisfaction.

The scientific key aspects of the project are:

  • an emphasis on the diversity of ways to express the same frame, including expressions (such as discourse connectors) that cross sentence boundaries;

  • an emphasis on semi-supervised techniques for semantic analysis, to generalize over the available annotated data.

ANR project Polymnie (2012-2016)

Participants : Laurence Danlos, Éric Villemonte de La Clergerie, Timothée Bernard.

Polymnie is an ANR research project headed by Sylvain Pogodalla (Sémagramme, Inria Lorraine) with Melodi (IRIT, CNRS), Signes (LABRI, CNRS) and Alpage as partners. This project relies on the grammatical framework of Abstract Categorial Grammars (ACG). A feature of this formalism is that it provides the same mathematical perspective both on the surface forms and on the more abstract forms the latter correspond to. ACG allows for the encoding of a large variety of grammatical formalisms, in particular Tree Adjoining Grammars (TAG).

The role of Alpage in this project is to develop sentential or discursive grammars written in TAG and to participate in their conversion to ACG. Results were first achieved in 2014 concerning text generation: the G-TAG formalism, created by Laurence Danlos in the 1990s, has been rewritten in ACG [64], [65], [66]. As regards discourse analysis, the D-STAG formalism, created by Laurence Danlos in the 2000s, was also rewritten in ACG in 2015 [67] (see also [27]).

Other national initiatives

“RAPID” project VerDI (2016 – 2019)

Participants : Benoît Sagot, Héctor Martínez Alonso.

The ANR “RAPID” project VerDI focuses on the automatic identification of information dissimulation on the Internet and on social networks. Such dissimulations can be produced by omitting crucial pieces of information within documents or during written online discussions, by hiding them within a massive information flow, or by using other techniques. VerDI aims at extending an existing journalistic fact-checking tool developed by Trooclick, the company that leads the project.

FUI project COMBI (2014-2016)

Participant : Laurence Danlos.

COMBI is an “FUI 16” project. It started in February 2014 for a two-year duration. It groups 5 industrial partners (Temis, Isthma, Kwaga, Yseop and Qunb) and Alpage. Temis and Isthma work on data mining from texts and big data. Kwaga works on the interpretation and inferences that can be drawn from the data retrieved in the analysis module. Alpage and Qunb work, under the supervision of Yseop, on the production of respectively texts and graphics describing the results of the interpretation module. Currently, COMBI aims at creating the full chain for a use case concerning the weekly activity of an on-line service.

Alpage works on text generation, with the adaptation of TextElaborator, a generation system developed in the 2010s by WatchAssistance and based on G-TAG. Alpage also works on the question of when pieces of information are best described by texts, graphics or both.

Institut de Linguistique Française and Consortium CORLI within the TGIR Huma-Num

Participants : Benoît Sagot, Stéphane Riou, Djamé Seddah.

Huma-Num is a TGIR (Very Large Research Infrastructure) dedicated to digital humanities. Among Huma-Num initiatives are a dozen consortia, which bring together most members of various research communities. Among them is the CORLI consortium (which follows, among others, the Corpus Écrits consortium, in which Alpage previously participated), which is dedicated, among other topics, to all aspects related to written corpora, from NLP to corpus development, corpus specification, standardization, and others. All types of written corpora are covered (French, other languages, contemporary language, medieval language, specialized text, non-standard text, etc.). The consortium CORLI is managed by the Institut de Linguistique Française, a CNRS federation of which Alpage has been a member since June 2013, under the supervision of Franck Neveu.

Alpage is involved in various projects within this consortium, and especially in the development of corpora for CMC texts (blogs, forum posts, SMSs, textchat...) and shallow corpus annotation, especially with MElt, and in the development of a preliminary version of the future Corpus de Référence du Français (French Reference Corpus).