Section: Partnerships and Cooperations
National Initiatives
ANR project ASFALDA (2012 – 2015)
Participants : Marie-Hélène Candito [principal investigator], Marianne Djemaa, Benoît Sagot, Éric Villemonte de La Clergerie, Laurence Danlos.
Alpage is the principal investigator team for the ANR project ASFALDA, led by Marie-Hélène Candito. The other partners are the Laboratoire d'Informatique Fondamentale de Marseille (LIF), the CEA-List, the MELODI team (IRIT, Toulouse), the Laboratoire de Linguistique Formelle (LLF, Paris Diderot) and the company Ant'inno.
The project aims to provide both a French corpus with semantic annotations and automatic tools for shallow semantic analysis, using machine learning techniques to train analyzers on this corpus. The target semantic annotations are structured following the FrameNet framework [47] and can be characterized roughly as making explicit “who does what, when and where”, in a way that abstracts away from word order and syntactic variation, and from some of the lexical variation found in natural language.
The project relies on an existing standard for the semantic annotation of predicates and roles (FrameNet), and on a previous linguistic annotation effort for French (the French Treebank). The original FrameNet project provides a structured set of prototypical situations, called frames, along with a semantic characterization of the participants in these situations (called roles). We propose to take advantage of this semantic database, which has proved largely portable across languages, to build a French FrameNet, meaning both a lexicon listing which French lexemes can express which frames, and an annotated corpus in which occurrences of frames and the roles played by participants are made explicit. The addition of semantic annotations to the French Treebank, which already contains morphological and syntactic annotations, will boost its usefulness both for linguistic studies and for machine-learning-based Natural Language Processing applications for French, such as semantic annotation of content, text mining or information extraction.
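As a purely illustrative sketch of the two resources described above (a lexicon mapping lexemes to frames, and corpus occurrences annotated with frames and roles), the following Python fragment shows one possible in-memory representation. The class names, the token-span encoding and the example sentence are assumptions made for illustration and do not reflect the project's actual annotation format; Commerce_buy and its roles Buyer, Goods and Seller are taken from the original FrameNet.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Frame:
    """A prototypical situation together with the roles of its participants."""
    name: str
    roles: tuple[str, ...]


@dataclass
class FrameInstance:
    """One annotated occurrence of a frame in a tokenized sentence.

    Spans are (start, end) token offsets, end exclusive (hypothetical encoding).
    """
    frame: Frame
    trigger: tuple[int, int]
    role_fillers: dict[str, tuple[int, int]] = field(default_factory=dict)

    def add_role(self, role: str, span: tuple[int, int]) -> None:
        # Reject roles that the frame does not define.
        if role not in self.frame.roles:
            raise ValueError(f"{role!r} is not a role of frame {self.frame.name}")
        self.role_fillers[role] = span


def span_text(tokens: list[str], span: tuple[int, int]) -> str:
    """Return the surface text covered by a token span."""
    start, end = span
    return " ".join(tokens[start:end])


def describe(tokens: list[str], instance: FrameInstance) -> dict[str, str]:
    """Map each annotated role to the text that fills it."""
    return {role: span_text(tokens, span)
            for role, span in instance.role_fillers.items()}


# Lexicon side: which French lexemes can express which frames (toy entry).
COMMERCE_BUY = Frame("Commerce_buy", ("Buyer", "Goods", "Seller"))
LEXICON = {"acheter": [COMMERCE_BUY]}

# Corpus side: one annotated occurrence (invented example sentence).
tokens = ["Marie", "a", "acheté", "un", "livre", "à", "Paul"]
instance = FrameInstance(COMMERCE_BUY, trigger=(2, 3))
instance.add_role("Buyer", (0, 1))
instance.add_role("Goods", (3, 5))
instance.add_role("Seller", (5, 7))

print(span_text(tokens, instance.trigger), describe(tokens, instance))
```

Keeping the frame inventory separate from the annotated occurrences mirrors the distinction between lexicon and corpus drawn above; a real annotation scheme would additionally have to handle discontinuous spans and non-instantiated roles.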
To cope with the intrinsic coverage difficulty of such a project, we adopt a hybrid strategy to obtain both exhaustive annotation for some specific selected concepts (commercial transaction, communication, causality, sentiment and emotion, time), and exhaustive annotation for some highly frequent verbs. Pre-annotation of roles will be tested, using linking information between deep grammatical functions and semantic roles.
The project is structured as follows:
Task 1 concerns the delimitation of the focused FrameNet substructure, and its coherence verification, in order to make the resulting structure more easily usable for inference and for automatic enrichment (with compatibility with the original model);
Task 2 concerns all the lexical aspects: which lexemes can express the selected frames, how they map to external resources, and how their semantic arguments can be syntactically expressed, information that can be used for automatic pre-annotation of the corpus;
Task 3 is devoted to the manual annotation of corpus occurrences (we target 20,000 annotated occurrences);
In Task 4 we will design a semantic analyzer, able to automatically make explicit the semantic annotation (frames and roles) on new sentences, using machine learning on the annotated corpus;
Task 5 consists in testing the integration of the semantic analysis into an industrial search engine, and in measuring its usefulness in terms of user satisfaction.
The scientific key aspects of the project are:
an emphasis on the diversity of ways to express the same frame, including expressions (such as discourse connectors) that cross sentence boundaries;
an emphasis on semi-supervised techniques for semantic analysis, to generalize over the available annotated data.
ANR project EDyLex (2010 – 2013)
Participants : Benoît Sagot [principal investigator], Rosa Stern, Damien Nouvel, Virginie Mouilleron, Marion Baranes, Sarah Beniamine, Laurence Danlos.
EDyLex was an ANR project (STIC/CONTINT) headed by Benoît Sagot, which came to an end on June 30, 2013. The focus of the project was the dynamic acquisition of new entries in existing lexical resources that are used in syntactic and semantic parsing systems: how can an unknown word or a new named entity be detected and characterized in a text, and how can it be associated with phonetic, morphosyntactic, syntactic and semantic properties? Various complementary techniques were explored and combined (probabilistic and symbolic, corpus-based and rule-based...). Their application to the content produced by the AFP news agency (Agence France-Presse) provides a context that is representative of the problems of incompleteness and lexical creativity: indexing, and the creation and maintenance of ontologies (location and person names, topics), both necessary for handling and organizing a massive information flow (over 4,000 news wires per day).
The participants in the project, besides Alpage, were the LIF (Université de la Méditerranée), the LIMSI (CNRS team), two small companies, Syllabs and Vecsys Research, and the AFP.
In 2013, several important developments were achieved:
Finalization of a beta version of the first non-alpha release of the WOLF (Free French WordNet);
Improvement or development of modules for automatic detection, classification and morphological analysis of unknown words (neologisms, new named entities) in French corpora and integration within a full-featured processing pipeline (see 6.2);
Collaboration with Vocapia for interfacing the results of this pipeline with Vocapia's language models, in order to improve speech recognition systems used at AFP;
Use of an EDyLex-specific version of the NewsProcess architecture, previously developed at Alpage, for meeting the expectations of the EDyLex project in terms of lexicon extension from dynamic corpora, here AFP news wires.
ANR project Polymnie (2012 – 2015)
Participants : Laurence Danlos, Éric Villemonte de La Clergerie.
Polymnie is an ANR research project headed by Sylvain Pogodalla (Sémagramme, Inria Lorraine) with Melodi (IRIT, CNRS), Signes (LaBRI, CNRS) and Alpage as partners. This project relies on the grammatical framework of Abstract Categorial Grammars (ACG). A feature of this formalism is that it provides the same mathematical perspective both on the surface forms and on the more abstract forms the latter correspond to. As a consequence:
ACG allows for the encoding of a large variety of grammatical formalisms such as context-free grammars, Tree Adjoining Grammars (TAG), etc.
An ACG defines two languages: an abstract language for the abstract forms, and an object language for the surface forms.
The role of Alpage in this project is to develop sentential or discursive grammars written in TAG so as to study their conversion into ACG. The first results achieved in 2013 are described in 6.14.
Other national initiatives
“Investissements d'Avenir” project PACTE (2012 – 2014)
Participants : Benoît Sagot, Kata Gábor.
PACTE (Projet d'Amélioration de la Capture TExtuelle) is an “Investissements d'Avenir” project submitted within the call “Technologies de numérisation et de valorisation des contenus culturels, scientifiques et éducatifs”. It started in November 2012, although the associated funding only arrived at Alpage in July 2013.
PACTE aims at improving the performance of textual capture processes (OCR, handwriting recognition, manual capture, direct typing), using NLP tools relying on both statistical (n-gram-based, with scalability issues) and hybrid techniques (involving lexical knowledge and POS-tagging models). It specifically addresses the application domain of written heritage. The project takes place in a multilingual context, and therefore aims at developing techniques that are as language-independent as possible.
PACTE involves three companies (Numen, formerly Diadeis, which is the main partner, together with A2IA and Isako) as well as Alpage and the LIUM (University of Le Mans). It brings together business specialists, large-scale corpora, lexical resources, and the required scientific and technical expertise.
The results obtained at Alpage in 2013 within PACTE are described in 6.7.
Consortium Corpus Écrits within the TGIR Huma-Num
Participants : Benoît Sagot, Djamé Seddah.
Huma-Num is a TGIR (Very Large Research Infrastructure) dedicated to digital humanities. Among the Huma-Num initiatives are a dozen consortia, which bring together most members of various research communities. Among them is the Corpus Écrits consortium, which is dedicated to all aspects related to written corpora, from NLP to corpus development, corpus specification, standardization, and others. All types of written corpora are covered (French, other languages, contemporary language, medieval language, specialized text, non-standard text, etc.). The Corpus Écrits consortium is managed by the Institut de Linguistique Française, a CNRS federation of which Alpage has been a member since June 2013, under the supervision of Franck Neveu.
Alpage is involved in various projects within this consortium, especially in the development of corpora for CMC texts (blogs, forum posts, SMS messages, text chat...) and in shallow corpus annotation, in particular with MElt.