

Section: New Software and Platforms

Alpage's linguistic workbench, including SxPipe and MElt

Participants: Benoît Sagot [contact], Kata Gábor, Marion Baranes, Pierre Magistry, Pierre Boullier, Éric Villemonte de La Clergerie, Djamé Seddah.

See also the web page http://lingwb.gforge.inria.fr/.

Alpage's linguistic workbench is a set of packages for corpus processing and parsing. Two of these packages are of particular importance: the SxPipe pre-processing chain and the MElt part-of-speech tagger.

SxPipe [109] is a modular and customizable chain that applies a cascade of surface processing steps to raw corpora. It is used

  • as a preliminary step before Alpage's parsers (e.g., FRMG);

  • for surface processing (named entity recognition, text normalization, unknown word extraction and processing, etc.).

Developed for French and for other languages, SxPipe includes, among others, several modules for named entity recognition in raw text, a sentence segmenter and tokenizer, a spelling corrector and compound word recognizer, and an original context-free pattern recognizer used by several specialized grammars (numbers, impersonal constructions, quotations, etc.). It can now be augmented with modules developed during the former ANR EDyLex project for analysing unknown words. These include in particular (i) new tools for the automatic pre-classification of unknown words (acronyms, loan words, etc.) and (ii) new morphological analysis tools, most notably automatic tools for constructional morphology (both derivational and compositional), following the results of dedicated corpus-based studies. New local grammars for detecting new types of entities, together with improvements to existing ones, were developed in the context of the PACTE project and will soon be integrated into the standard configuration.
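To make the idea of a modular, customizable cascade concrete, here is a minimal Python sketch. It does not reflect the actual interface or implementation of the workbench: every name in it (Stage, normalize_whitespace, mark_urls, segment_sentences, run_chain, DEFAULT_CHAIN) and the {URL ...} annotation format are assumptions made for illustration only. Each stage maps a list of text units to a new list, so stages can be added, removed or reordered freely.

```python
import re
from typing import Callable, List

# Hypothetical stage type: a stage maps a list of text units to a new list.
Stage = Callable[[List[str]], List[str]]


def normalize_whitespace(units: List[str]) -> List[str]:
    """Collapse runs of whitespace into single spaces."""
    return [" ".join(u.split()) for u in units]


def mark_urls(units: List[str]) -> List[str]:
    """Toy entity recognizer on raw text: wrap URLs in a {URL ...} annotation.

    The annotation format is an assumption of this sketch.
    """
    pattern = re.compile(r"https?://\S+")
    return [pattern.sub(lambda m: "{URL " + m.group(0) + "}", u) for u in units]


def segment_sentences(units: List[str]) -> List[str]:
    """Naive segmenter: split after '.', '!' or '?' when followed by whitespace."""
    out: List[str] = []
    for unit in units:
        out.extend(s for s in re.split(r"(?<=[.!?])\s+", unit) if s)
    return out


# Illustrative default configuration; the order of stages matters.
DEFAULT_CHAIN: List[Stage] = [normalize_whitespace, mark_urls, segment_sentences]


def run_chain(text: str, stages: List[Stage]) -> List[str]:
    """Apply each stage in turn to the output of the previous one."""
    units = [text]
    for stage in stages:
        units = stage(units)
    return units


if __name__ == "__main__":
    raw = "See   http://example.org/ for details.  It has 2 parts."
    for sentence in run_chain(raw, DEFAULT_CHAIN):
        print(sentence)
```

Customizing the chain then amounts to passing a different list of stages, for instance run_chain(raw, [normalize_whitespace]) to skip entity marking and segmentation.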

MElt is a part-of-speech tagger, initially developed in collaboration with Pascal Denis (then at Alpage, now at Magnet, Inria). It was trained for French (on the French TreeBank, coupled with the Lefff), and also for English [79], Spanish [88], Italian [124], German [38], Dutch, Polish, Kurmanji Kurdish [138] and Persian [119], [120]. It is state-of-the-art for French. It is now able to handle noisy corpora (French and English only; see below). MElt also includes a lemmatization post-processing step. A preliminary version of MElt that accepts DAGs as input was developed in 2013, and is currently being heavily rewritten and improved in the context of the PACTE project (see 6.3).

MElt is distributed freely as a part of the Alpage linguistic workbench.

In 2014, additional work was carried out to improve the pre-processing of noisy input text. This covers two different scenarios:

  • user-generated content (see 6.2); two sets of tools are available for processing it: (i) very noisy computer-mediated content, such as that found on social media, forums or blogs, is addressed within the MElt part-of-speech tagger via a three-step procedure (normalisation, tagging, de-normalisation with tag redistribution); this work is carried out in connection with the CoMeRe project, funded by the Institut de Linguistique Française [14]; (ii) less noisy customer data is processed in preparation for shallow semantic analysis; this work is carried out in collaboration with the viavoo company [17].

  • output of OCR systems, in the context of the PACTE project (see 6.3).
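The three-step procedure mentioned above for very noisy content (normalisation, tagging, de-normalisation with tag redistribution) can be illustrated with a small Python sketch. This is not MElt's actual implementation: the normalisation lexicon, the dictionary-lookup "tagger", the tag names and the convention of joining redistributed tags with "+" are all assumptions of this example. It only handles the case where one noisy token expands to one or more standard tokens.

```python
from typing import Dict, List, Tuple

# Hypothetical normalisation lexicon: noisy form -> standard token(s).
NORMALISATION: Dict[str, List[str]] = {
    "u": ["you"],
    "r": ["are"],
    "gonna": ["going", "to"],
}

# Stand-in for a real tagger: a plain lookup table with illustrative tags.
TOY_TAGS: Dict[str, str] = {
    "you": "PRON",
    "are": "VERB",
    "going": "VERB",
    "to": "PART",
    "win": "VERB",
}


def normalise(tokens: List[str]) -> Tuple[List[str], List[int]]:
    """Step 1: return normalised tokens and, for each, its source token index."""
    normalised: List[str] = []
    alignment: List[int] = []
    for i, tok in enumerate(tokens):
        for form in NORMALISATION.get(tok.lower(), [tok]):
            normalised.append(form)
            alignment.append(i)
    return normalised, alignment


def tag(tokens: List[str]) -> List[str]:
    """Step 2: tag the normalised tokens (unknown words get the tag 'X')."""
    return [TOY_TAGS.get(t.lower(), "X") for t in tokens]


def denormalise(
    tokens: List[str], alignment: List[int], tags: List[str]
) -> List[Tuple[str, str]]:
    """Step 3: restore the original tokens and redistribute tags onto them.

    When one original token was expanded into several, its tags are joined
    with '+' (a convention chosen for this sketch).
    """
    grouped: List[List[str]] = [[] for _ in tokens]
    for src, t in zip(alignment, tags):
        grouped[src].append(t)
    return [(tok, "+".join(ts)) for tok, ts in zip(tokens, grouped)]


def tag_noisy(tokens: List[str]) -> List[Tuple[str, str]]:
    """Run the three steps in sequence on a tokenised noisy sentence."""
    normalised, alignment = normalise(tokens)
    return denormalise(tokens, alignment, tag(normalised))


if __name__ == "__main__":
    print(tag_noisy(["u", "r", "gonna", "win"]))
```

The point of keeping an explicit alignment between original and normalised tokens is that the output preserves the original (noisy) tokenisation while benefiting from a tagger trained on standard text.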