Joint team with LaBRI and the Department of Linguistics of Université Bordeaux 3 Michel de Montaigne — in particular with the research ministry Jeune Equipe JE 2385 TELANCO and the C.N.R.S. UMR 5610 ERSS.
LaBRI is a joint C.N.R.S. team (UMR 5800) involving Université Bordeaux 1 and the Ecole Nationale Supérieure d'Electronique, d'Informatique, et de Radiocommunications de Bordeaux (ENSEIRB).
ERSS is a joint C.N.R.S. team involving Université Toulouse-Le Mirail and Université Michel de Montaigne in Bordeaux.
TELANCO is a team of Université Michel de Montaigne, recognized by the Research Ministry as Jeune Equipe 2385.
The Signes team addresses several domains of computational linguistics, such as:
inflectional and derivational morphology
syntax
logical (or predicative) semantics
lexical semantics
discourse representation
by means of formal methods such as:
formal language theory
categorial grammars
resource logic
lambda calculus
higher order logic
Two applications illustrate this approach:
natural language tools for Sanskrit
modelling of French Sign Language grammar
We also develop the corresponding computational linguistics tools. Ultimately these tools will result in a significant generic NLP platform encompassing analysis, generation and acquisition devices. Some languages will receive particular attention: Sanskrit, French Sign Language and French.
Since the early days of computer science, natural language has been both one of its favorite application fields and a source of technical inspiration, as exemplified by the relation between formal language theory and linguistics.
Nowadays, the motivation is the need to handle large amounts of digitized textual and even spoken information, in particular on the Internet, but also the interesting mathematical and computational questions raised by computational linguistics, which can lead to other applications.
The most common natural language tools are information retrieval systems and spell checkers and, to a lesser extent, natural language generation, automatic summarization and computer aided translation.
Statistical methods and corpus linguistics have been quite successful in recent years, but there is a renewal of symbolic methods, and especially of logical ones, because of the advances in logic, the increase in computing power available for these rather slow algorithms, and above all the need for systems which handle the meaning of phrases, sentences, or discourses.
For all these applications, like queries in natural language, refined information retrieval, natural language generation, or computer aided translation, we need to relate the syntax of an utterance to its meaning. This relation, known as the syntax/semantics interface, and its automation are the center of this project. This notion is in general used for sentences, but we also work on the extension of this correspondence to discourse and dialogue.
The study of the interface between syntax and semantics makes way for interesting questions of a different nature:
As said above, this enables applications that require access to meaning and its computation.
Up to now semantics has only played a minor role in Natural Language Processing, although from a linguistic viewpoint the two sides of the linguistic sign, its signifiant and signifié, have been a central subject ever since Saussure. The link between the observable part of the sign or of the sentence and its meaning is a constant question in linguistics, both in Chomsky's Generative Grammar and in the Meaning-Text theory of Mel'cuk.
From a mathematical and algorithmic viewpoint, this interface raises several challenges: what is the link between two of the main frameworks, namely generative grammars and categorial grammars? The former are exemplified by Tree Adjoining Grammars (TAGs) or Minimalist Grammars. They enjoy efficient parsing algorithms and a broad coverage of syntactic constructs. The latter are less efficient but provide more accurate analyses. Indeed, categorial systems are used for syntax as well as for logical or predicative semantics like Montague semantics, and thus allow generation algorithms. Other models, like dependency grammars, provide a different account of the syntax/semantics interface. A comparison between the dependency model and a generative/logical one enables an assessment of the adequacy of these families of models, and this is one of the main challenges of contemporary formal linguistics.
At one end of our spectrum stands morphology and, as is often done in generative grammar, we consider it as part of syntax. It should nevertheless be observed that the computational models involved in the processing of morphology are of a different kind: finite state automata, regular transducers, etc.
At the other end, on the semantic side, we do not consider ontological aspects of semantics, or lexical semantics, but rather extend logical semantics to discourse and dialogue. This is usually done with Discourse Representation Theory, which is top-down, incremental and involves state changes.
Computational models for phonology and morphology are a traditional application of finite state technology. These models often combine symbolic or logical systems, like rewriting systems, and statistical methods like probabilistic automata, which can be learnt from corpora by means of Hidden Markov Models.
Morphology is described by means of regular transducers and regular relations; lexical databases, as well as tables of phonological and morphological rules, are compiled or interpreted by algebraic operations on automata.
The existing techniques for compiling such machinery are not widely publicized, while any naive approach leads to a combinatorial explosion. When transformation rules are local, it is possible to compile them into an invertible transducer directly obtained from the tree which encodes the lexicon.
A generic notion of sharing yields compact representations of such automata. Gérard Huet has implemented a toolkit based on this technique, which allows a very efficient automatic segmentation of a continuous phonological text.
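The idea of sharing can be illustrated with a small Python sketch. It is not the Zen toolkit's implementation (which is written in OCaml); the function names, the word list and the tuple encoding of nodes are assumptions made for illustration. A lexicon is first stored as a tree (trie), then equal subtrees are replaced by a single shared representative, which turns the tree into a compact acyclic automaton.

```python
def build_trie(words):
    """Build a plain (unshared) lexicon tree as nested dictionaries."""
    root = {"end": False, "next": {}}
    for word in words:
        node = root
        for letter in word:
            node = node["next"].setdefault(letter, {"end": False, "next": {}})
        node["end"] = True
    return root


def share(node, memo):
    """Return an immutable copy of `node` in which equal subtrees are shared.

    `memo` maps each distinct subtree to its unique representative, so
    identical suffix structures are stored only once.
    """
    children = tuple(
        (letter, share(child, memo))
        for letter, child in sorted(node["next"].items())
    )
    key = (node["end"], children)
    return memo.setdefault(key, key)


def count_trie_nodes(node):
    """Number of nodes of the unshared tree."""
    return 1 + sum(count_trie_nodes(child) for child in node["next"].values())


def contains(shared, word):
    """Look a word up in the shared structure."""
    node = shared
    for letter in word:
        for label, child in node[1]:
            if label == letter:
                node = child
                break
        else:
            return False
    return node[0]


# Hypothetical toy lexicon: the four words have common suffix structure.
WORDS = ["walked", "talked", "walking", "talking"]
trie = build_trie(WORDS)
memo = {}
dag = share(trie, memo)
```

On this toy lexicon the tree has 19 nodes while the shared structure has only 9 distinct nodes (`len(memo)`), and lookups such as `contains(dag, "talking")` behave as on the original tree.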
This study of the linear structure of language and of word structures is by itself sufficient for applications like spell checkers and text mining. Furthermore, this preprocessing is required for the analysis of other layers of natural language like syntax, semantics, pragmatics, etc.
While linear structure is in general sufficient for morphological structure, trees are needed to depict phrasal structure, and, in particular, sentence structure. Different families of syntactic models are studied in Signes: rewriting systems of the Chomsky hierarchy, including tree grammars, and deductive systems, i.e. categorial grammars.
The former grammars, rewrite systems, have excellent computational properties and quite a good descriptive adequacy. Relevant classes of grammars for natural language syntax, which generate the so-called mildly context sensitive languages, are just a bit beyond context-free languages, and they are parsable in polynomial time as well. Among these classes of grammars let us mention Tree Adjoining Grammars and Minimalist Grammars. Dependency Grammars share some properties with them but the general paradigm is quite different.
Edward Stabler introduced Minimalist Grammars (MGs) as a formalization of the most recent model of the Chomskian or generative tradition and they are quite appealing to us. They offer a uniform model for the syntax of all human languages.
There are two universal, language independent rules, called merge and move: they respectively manage the combination of phrases and the movement of phrases (or of smaller units, like heads).
Next, a language is defined by a (language dependent) lexicon which provides words with features describing their syntactic behavior: some features trigger merge and others trigger move. Indeed, features have positive and negative variants which must cancel each other during the derivation (this is rather close to resource logics and categorial grammars).
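The cancellation of features by merge can be sketched as follows. This is a deliberately reduced illustration, not a full Minimalist Grammar: move is not modelled, the toy lexicon is hypothetical, and the notation (`=x` for a feature selecting category `x`) is an assumption borrowed from the usual presentation of Stabler's grammars.

```python
from typing import List, NamedTuple, Optional


class Item(NamedTuple):
    words: str            # phonetic content built so far
    features: List[str]   # remaining syntactic features, leftmost first
    lexical: bool = True  # True for items taken straight from the lexicon


def merge(head: Item, arg: Item) -> Optional[Item]:
    """Cancel a selector '=x' on `head` against a category 'x' on `arg`."""
    if not head.features or not arg.features:
        return None
    selector, category = head.features[0], arg.features[0]
    if selector != "=" + category:
        return None
    if len(arg.features) > 1:
        # Leftover features on `arg` would require move, not modelled here.
        return None
    # A lexical head takes its first argument to its right (complement);
    # a derived head takes later arguments to its left (specifier).
    if head.lexical:
        words = head.words + " " + arg.words
    else:
        words = arg.words + " " + head.words
    return Item(words, head.features[1:], lexical=False)


# Hypothetical toy lexicon.
the = Item("the", ["=n", "d"])
cat = Item("cat", ["n"])
mary = Item("Mary", ["d"])
sees = Item("sees", ["=d", "=d", "v"])

noun_phrase = merge(the, cat)
verb_phrase = merge(sees, noun_phrase)
sentence = merge(verb_phrase, mary)
```

After the three merges, `sentence.words` is `"Mary sees the cat"` and the only remaining feature is the category `v`: every selector has been cancelled against a matching category, in the spirit of resource consumption.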
Consequently they are able to describe numerous syntactic constructs, providing the analyzed sentences with a fine grained and complete syntactic structure. The richer the syntactic structure is, the easier it is to compute a semantic representation of the sentence.
They also cover phenomena which go beyond syntax: they include morphology via inflectional categories, and they also incorporate some semantic phenomena like relations between pronouns and their possible antecedents, quantifiers, etc.
A drawback of rewrite systems, including minimalist grammars, is that they do not allow for learning algorithms which could automatically construct or enlarge grammars from structured corpora. But their main drawback comes from the absence of structure on terminals, which gives no hint about the predicative structure of the sentence.
Indeed, a strong reason for using categorial grammars, despite their poor computational properties and poor linguistic coverage, is that they provide a correspondence between syntactic analyses and semantic representations. This is explained in the next section on the syntax/semantics interface.
In order to improve the computational properties of categorial grammars, and to extend their scope, one can try to connect them to more efficient and wider formalisms, like minimalist grammars.
Why does there exist a simple and computable correspondence between syntax and semantics in categorial grammars? This is mainly due to the internal functional structure of non-terminals in categorial grammars, which yields a correspondence with semantic formulae and functions. This correspondence between syntactic and semantic categories extends to terms, or analyses, because the usual logic in use for typed lambda-calculus is an extension of the resource logic used for syntactic deductions or analyses.
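This correspondence can be made concrete with a minimal sketch of an AB categorial grammar, the simplest categorial system. The encoding of categories as tuples, the toy lexicon and the string representation of formulae are assumptions made for illustration. Each word carries a syntactic category and a semantic term; each syntactic reduction step is mirrored by a function application on the semantic side.

```python
# Categories: an atom such as "np" or "s", or a tuple
#   ("/", result, argument)   looks for `argument` on its right,
#   ("\\", argument, result)  looks for `argument` on its left.
# A sign is a pair (category, semantic term).


def forward(left, right):
    """A/B followed by B yields A; the semantics is function application."""
    (lcat, lsem), (rcat, rsem) = left, right
    if isinstance(lcat, tuple) and lcat[0] == "/" and lcat[2] == rcat:
        return (lcat[1], lsem(rsem))
    return None


def backward(left, right):
    """B followed by B\\A yields A; the semantics is function application."""
    (lcat, lsem), (rcat, rsem) = left, right
    if isinstance(rcat, tuple) and rcat[0] == "\\" and rcat[1] == lcat:
        return (rcat[2], rsem(lsem))
    return None


# Hypothetical toy lexicon.
VP = ("\\", "np", "s")
mary = ("np", "mary")
john = ("np", "john")
sleeps = (VP, lambda x: f"sleep({x})")
loves = (("/", VP, "np"), lambda y: lambda x: f"love({x},{y})")

verb_phrase = forward(loves, john)        # category np\s
sentence = backward(mary, verb_phrase)    # category s
```

Here `sentence` is `("s", "love(mary,john)")`: the syntactic analysis and the semantic representation are computed in lockstep, which is the correspondence discussed above in its most elementary form.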
Nevertheless this computational correspondence between syntax and semantics provided by categorial grammars is very limited. Firstly, for the correspondence between syntactic and semantic types to hold, we have to provide words with syntactic types which are ad hoc, and even wrong. For instance, why should the type of a determiner depend on the constituent it is involved with? Secondly, the truth-conditional aspect of Montague semantics can be discussed both from a theoretical and from a practical viewpoint. According to cognitive sciences, and even to common sense, it is unlikely that human beings develop all possible interpretations when they process and understand a sentence, and in practice such a construction of all models is definitely intractable. Thirdly, a strict compositional principle does not hold, as the famous Geach examples show.
In this project we address the first issue, which is a real limit, and the third one, in the next section on discourse. The first point is one of the motivations for studying the syntax/semantics interface for minimalist grammars. Indeed, they are rather close to categorial grammars and resource logic, and using this similarity we are able to extend the correspondence to a much richer grammatical formalism, without having strange syntactic types.
The generative lexicon is a way to represent the internal structure of the meaning of words and morphemes. Hence it is relevant not to say mandatory for computing the semantic counterpart of morphological operations. The information which depicts the sense of a word or morpheme is organized in three layers: the argument structure (related to logical semantics and syntax), the event structure, and the qualia structure.
The argument structure provides types (in the type-theoretical sense) to the arguments encoded in the qualia structure, no matter whether they are syntactically mandatory or optional. The event structure unfolds an event into several ordered sub-events with a mark on the most salient sub-event. Events are typed according to the typology of Vendler: state, process, transition, this latter type including achievement and accomplishment. The qualia structure relates the argument structure and the event structure in roles: formal, constitutive, telic, agentive.
This information and its organization in the generative lexicon allow an explanation of, for instance, polysemy and compositionality (in particular in compound words). This kind of model, which relates knowledge representation to linguistic organization, is especially useful for word sense disambiguation during (automatic) syntactic and semantic analysis.
Montague semantics has some limits. Two of them, which technically speaking concern the context, can be overcome by using Discourse Representation Theory (DRT) and its variants. Firstly, if one wants to construct the semantics of a piece of text, one has to take into account sequences of sentences, either discourse or dialogue, and to handle the context which is incrementally defined by the text. Secondly, some constructs do not obey the strict compositionality of Montague semantics, since pronouns can refer to bound variables. For instance a pronoun of the main clause can be bound in a conditional sub-clause.
For these reasons, Discourse Representation Theory was introduced. This model defines an incremental view of the construction of discourse semantics. As opposed to Montague semantics, this construction is top-down, and proceeds more like state change than like functional application, although lambda-DRT presents DRT in a Montague style.
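The state-change view can be sketched very roughly as follows. This is a strong simplification of DRT, with no embedded structures and no accessibility constraints; the class, its method names and the gender-based resolution of pronouns are assumptions made for illustration. The context is a mutable structure holding discourse referents and conditions, and each new sentence updates it, possibly after resolving its pronouns against referents introduced earlier.

```python
from dataclasses import dataclass, field


@dataclass
class DRS:
    """A flat discourse representation: referents plus conditions."""
    referents: list = field(default_factory=list)
    conditions: list = field(default_factory=list)

    def update(self, new_referents, new_conditions):
        """State change: extend the context with one more sentence."""
        self.referents.extend(new_referents)
        self.conditions.extend(new_conditions)

    def resolve(self, feature):
        """Return the most recent referent carrying `feature`, if any."""
        for referent in reversed(self.referents):
            if (feature, referent) in self.conditions:
                return referent
        return None


context = DRS()

# "A farmer owns a donkey."
context.update(
    ["x", "y"],
    [("farmer", "x"), ("male", "x"),
     ("donkey", "y"), ("neuter", "y"),
     ("own", "x", "y")],
)

# "He beats it." The pronouns are resolved against the current context.
he = context.resolve("male")
it = context.resolve("neuter")
context.update([], [("beat", he, it)])
```

The second sentence contributes the condition `("beat", "x", "y")`: its meaning depends on the state left by the first sentence, which is what distinguishes this construction from plain functional application.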
The team has developed expertise in logic and lambda-calculus. These models are commonly used in computational linguistics:
An example is categorial grammars, with their parsing-as-deduction paradigm, which use proofs in Lambek calculus or linear logic as syntactic trees.
Another example is Montague semantics which uses the Church description of higher-order logic, implemented in lambda calculus in order to have the compositionality principle of Frege.
Finally, Discourse Representation Theory also is logic, in a different syntax, and can be combined with Montague semantics to obtain lambda-DRT.
Consequently it is quite natural to develop tools in programming languages relying on logic and type theory:
The Grail syntactic and semantic parser for Multi Modal Categorial grammars, defined and implemented by Richard Moot, is written in Prolog. This is the most developed and efficient software for categorial grammars, relying on recent developments in linear logic, in particular proof nets.
Under the supervision of Yannick Le Nir and Christian Retoré, a team of students implemented in OCaML the first steps of a platform for parsing and learning categorial grammars and related formalisms.
Gérard Huet developed a toolkit for morphology, the Zen toolkit, using finite state technology, in OCaML. It achieves excellent performance, thus showing the relevance of pure functional programming for computational linguistics.
Sanskrit literature is extremely rich, and is part of the world cultural heritage. Nowadays, the Internet can give both specialists and inquiring minds access to it.
This kind of resource already exists for ancient Greek and Latin literature. For instance, Perseus (http://www.perseus.tufts.edu) provides online access to texts. A simple click on a word analyses it, and brings back the lexical item of the dictionary, possible meanings, statistics on its use, etc.
The work described in the following sections enables such computational tools for Sanskrit, some of which are already developed and made available on a web site (http://sanskrit.inria.fr). These tools efficiently and accurately assist the annotation of Sanskrit texts. Besides, a tree bank of Sanskrit examples is also under construction. Once the literature is annotated, this work will ultimately lead to a Sanskrit analogue of Perseus.
After a worldwide prohibition decided in 1880 (which lasted until the sixties in the USA and until the eighties in France), deaf people can now use sign language, and rather recently sign languages have become the object of new studies and developments: a first aspect is the social acknowledgment of sign language and of the deaf community; a second aspect is the linguistic study of this language with a different modality (visual and gestural as opposed to auditory and phonemic); the third and most recent aspect, which relies on the second, is the need for sign language processing. A first goal is computer aided learning of Sign Language for hearing people and even for deaf people without access to sign language. More challenging objectives would be computer aided translation from or to sign language, or direct communication in sign language.
Given the rarity of linguistic studies on the syntax and semantics of sign languages (there are some exceptions concerning American Sign Language), before being able to apply our methodology, our first task is to determine what the structure of the sentence is, using our personal competence as well as our relationship with the deaf community.
We intend to define methods and tools for the generation of sign language sentences. It should be noted that there is a sequence of different representations of a sentence in Sign Language, from a grammatical description with agreement features and word/sign order that we are familiar with, to a notation system like Signwriting or to a language for the synthesis of 3D images and movies. Our expertise on the interface between syntax and semantics is well suited to work on the generation of the grammatical representations.
A first application would be software for teaching Sign Language, like the CD ROM Les Signes de Mano by IBM and IVT. Indeed, at present, only dictionaries or examples of sign language videos are available on computers, but no interactive software. Our generation tools, once developed, could be useful for educational purposes.
This software has been developed by Gérard Huet for many years, initially in the project-team Cristal, and it is clearly the most significant software presented in Signes.
It is a generic toolkit extracted by Gérard Huet from his Sanskrit modeling platform, allowing the construction of lexicons, the computation of morphological derivatives and inflected forms, and the segmentation analysis of phonetic streams modulo euphony. This little library of finite state automata and transducers, called Zen for its simplicity, was implemented in an applicative kernel of Objective Caml, called Pidgin ML. Documentation in a literate programming style, using the program annotation tool Ocamlweb of Jean-Christophe Filliâtre for Ocaml, is available. The Zen toolkit is distributed as free software (under the GPL licence) on the Objective Caml Hump site. This development forms a significant symbolic manipulation software package within pure functional programming, which shows the feasibility of developing in the Ocaml system symbolic applications having good time and space performance, within a purely applicative methodology.
A number of uses of this platform outside of the Cristal team are under way. For instance, a lexicon of French inflected forms has been implemented by Nicolas Barth and Sylvain Pogodalla, in the Calligramme project-team at Loria. It is also used by Talana (University of Paris 7).
The algorithmic principles of the Zen library, based on the linear contexts datastructure (`zippers') and on the sharing functor (associative memory server), were presented as an invited lecture at the symposium Practical Aspects of Declarative Languages (PADL), New Orleans, Jan. 2003. An extended version was written as a chapter of the book ``Thirty Five Years of Automating Mathematics'', edited in honor of N. de Bruijn.
Gérard Huet's Sanskrit Site (http://sanskrit.inria.fr) provides a unique range of interactive resources concerning Sanskrit philology. These resources are built upon, among other ingredients, the Zen Toolkit (see above). The site registers thousands of visitors monthly.
The declension engine gives the declension tables for Sanskrit substantives.
The conjugation engine conjugates verbs for the various tenses and modes.
The lemmatizer tags inflected words.
A dictionary lists inflected forms of Sanskrit words. Full lists of inflected forms, in XML format (given with a specific DTD), are released as free linguistic resources available for research purposes. This database, developed in collaboration with Pr. Peter Scharf, from the Classics Department at Brown University, has been used for research experiments by the team of Pr. Stuart Shieber, at Harvard University.
The Sanskrit Reader segments simple sentences, where the (optional) finite verb form occurs in final position. This reader enhances the hand-tagged Sanskrit reader developed by Peter Scharf, which allows students to read simple texts in different ways: firstly in devanagari writing, then word by word, then in a word-to-word translation, then in a sentence-to-sentence translation.
The Sanskrit Parser eliminates many irrelevant pseudo-solutions (segmentations) listed by the Sanskrit reader.
The Sanskrit Tagger is an assistant for the tagging of a Sanskrit corpus. Given a sentence, the user chooses among different possible interpretations listed by the morpho-syntactic tools and may save the corresponding unambiguously tagged sentence on disk. The process is as follows. The user on his client machine types in a sentence, calls remotely the parser, inspects the small number of surviving taggings, then may inspect each one in order to peruse the semantic analysis, presented as a pseudo-English paraphrase. Some non-determinism may remain: typically, a given segment may be lemmatized in several ways, either by homonymy, or by morphological ambiguity. Each path in the semantic dependency matrix is shown with its bonus-malus, and the user may select the one he prefers, yielding a completely disambiguated analysis which he may then store on his client machine, as a hypertext document indexing into the Sanskrit Heritage Dictionary (our structured lexical database). This service has no equivalent worldwide.
Another on-going project is the construction of a tree bank of Sanskrit examples, in collaboration with Pr. Brendan Gillon, from McGill University in Montreal.
Within the type-logical grammar paradigm, Multi-Modal Categorial Grammars (MMCG) are one of the richest approaches. Richard Moot carefully implemented Grail, an analyzer for MMCG that is the most complete system for natural language analysis based on type logical grammars with lexicon/grammars. Several languages are supported (although with different levels of linguistic coverage): Dutch, English, French, Italian, Hindi. Grail is distributed under the GNU LGPL.
The Grail parser/theorem prover for categorial grammars, originally developed at the University of Utrecht, has been rewritten from scratch, taking into account modern insights about proof nets as well as requiring only open-source software to run. This new release also includes computational and theoretical improvements: parallel use of structural postulates (which introduce flexibility for word order, tree structure etc.) and degrees of preference, in order to reduce the complexity of the analysis due to the exponential number of choices. The parser has also been adapted to allow for a tight integration with the supertagger. Also, several new strategies for reducing the search space have been implemented, significantly improving parsing performance.
DepLin takes a syntactic dependency tree as input. The topological grammar translates such an (unordered) tree into an ordered constituent tree, called the topological tree. In the following step, this tree is simplified to a three level prosodic constituent tree (prosodic words, prosodic phrases, prosodic sentences). From this tree, a very simple sound output device can concatenate prerecorded sound files corresponding to the different prosodic words (with their prosodic markup). This allows for auditory tests of the resulting sentences in constructed communicative contexts (question-answer sets). The construction of the prerecorded files is quite time consuming; it has been tested on a small vocabulary of Modern Greek.
DepLin was developed by Kim Gerdes. It is distributed as free software (GPL) and, apart from our internal usage at the Signes group (in particular for German and Greek), is mainly used at the University of Paris 7 for the development of different grammars (in particular Arabic and French).
An editor for corpora with functional dependency annotation was developed by Kim Gerdes in collaboration with the ERSS, Toulouse. This ``corpus arborator'' is distributed under the GPL and used in Bordeaux and ERSS Toulouse.
LeFFF (Lexique des Formes Fléchies du Français) offers, under the LGPL For Linguistic Resources, a wide-coverage lexicon of inflected forms for French, which associates with each form its lemma and its morphological features (other features are under construction). It has been developed by Lionel Clément, Benoît Sagot and Bernard Lang. It is available at http://www.lefff.net/. This resource was co-developed by Lionel Clément (before he joined the SIGNES group).
XLFG is a parser prototype for research. It implements the Lexical Functional Grammar (LFG) formalism. It is used for teaching in various universities. It is distributed as free software (http://dept-info.labri.fr/~clement/xlfg/). It has been developed by Lionel Clément (before he joined the SIGNES group).
Lexed is a lexicaliser. It looks up a dictionary entry from a string. The finite automata-based algorithm is particularly fast, and offers a good alternative to hashes for large dictionaries. Lexed is distributed for Unix platforms under the GPL licence. This software has been developed by Lionel Clément (before he joined the SIGNES group).
Yab is a compiler compiler similar to YACC. With Yab it is possible to deal with ambiguities and to share semantic constructions between different analyses. Yab is distributed under the GPL licence. This software has been developed by Lionel Clément (before he joined the SIGNES group).
This software segments a text into tokens. Ambiguity between simple and compound words is represented through a directed acyclic graph (DAG). This software has been developed by Lionel Clément (before he joined the SIGNES group) and is part of Lexed (see above).
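The DAG representation of segmentation ambiguity can be sketched as follows. This is not the code of the tool described above; the function names, the compound list and the French example are assumptions made for illustration. Nodes are word boundaries, edges are tokens, and a compound word is simply an extra edge spanning several simple words.

```python
def token_dag(words, compounds):
    """Return edges (start, end, token) over word boundaries 0..len(words)."""
    edges = [(i, i + 1, word) for i, word in enumerate(words)]
    for i in range(len(words)):
        for j in range(i + 2, len(words) + 1):
            candidate = " ".join(words[i:j])
            if candidate in compounds:
                edges.append((i, j, candidate))
    return sorted(edges)


def tokenizations(edges, start, end):
    """Enumerate every path from `start` to `end`, i.e. every tokenization."""
    if start == end:
        return [[]]
    result = []
    for source, target, token in edges:
        if source == start:
            for rest in tokenizations(edges, target, end):
                result.append([token] + rest)
    return result


# Hypothetical example: "pomme de terre" (potato) is a compound word.
words = "une pomme de terre".split()
edges = token_dag(words, {"pomme de terre"})
readings = tokenizations(edges, 0, len(words))
```

The two paths of the graph correspond to the two readings, `["une", "pomme", "de", "terre"]` and `["une", "pomme de terre"]`. Keeping the graph rather than the list of paths lets a later parser choose between them without duplicating the shared tokens.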
Maxime Amblard developed a tree-drawing package in ML. This package is included as a contribution in the open-source parser for Minimalist Grammars developed and distributed by John Hale (http://www.linguistics.ucla.edu/people/stabler/hale/index.html).
This software, CGTools, is an academic prototype. It combines two Travaux d'Etude et de Recherche by 4th year students, Véronique Moriceau and Jérôme Pasquier (Université de Nantes, 2002), whose work has been reorganized and extended by Thomas Poussevin, Jean-François Deverge, Fahd Haiti, Anthony Herbé (Université Bordeaux 1, 2003). It is written in OCaML, with an interface written in Tcl/Tk, and the input and output formats are XML files (DAGs for representing analyses, proofs and trees).
Presently, the following algorithms are implemented:
learning of categorial grammars from structured sentences;
inter-translation in any possible direction between AB categorial grammars, Lambek grammars, context-free grammars in Greibach normal form, and context-free grammars in Chomsky normal form;
parsing of categorial grammars by proof search;
parsing of context-free grammars with the Cocke-Kasami-Younger algorithm.
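As an illustration of the last item, here is a minimal Python sketch of Cocke-Kasami-Younger recognition for a grammar in Chomsky normal form. It is independent of the CGTools code (which is written in OCaml); the function signature and the toy grammar are assumptions made for illustration.

```python
def cky(words, lexical, binary, start="S"):
    """Recognize `words` with a grammar in Chomsky normal form.

    lexical: maps a word to the set of non-terminals A with a rule A -> word
    binary:  maps a pair (B, C) to the set of non-terminals A with A -> B C
    """
    n = len(words)
    if n == 0:
        return False
    # table[i][j] holds the non-terminals deriving words[i:j]
    table = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        table[i][i + 1] = set(lexical.get(word, ()))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for left in table[i][k]:
                    for right in table[k][j]:
                        table[i][j] |= binary.get((left, right), set())
    return start in table[0][n]


# Hypothetical toy grammar.
LEXICAL = {"the": {"Det"}, "cat": {"N"}, "dog": {"N"}, "sees": {"V"}}
BINARY = {
    ("Det", "N"): {"NP"},
    ("V", "NP"): {"VP"},
    ("NP", "VP"): {"S"},
}

accepted = cky("the cat sees the dog".split(), LEXICAL, BINARY)
rejected = cky("the cat the dog sees".split(), LEXICAL, BINARY)
```

The table is filled by increasing span length, so the running time is cubic in the sentence length. A parser, as opposed to this recognizer, would additionally store back-pointers in each cell.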
Gérard Huet continued his work on developing a computational linguistics platform adapted to Sanskrit, based on applicative programming in Ocaml.
The main effort in 2005 concerned curbing the overgeneration of the segmenter by a semantic analysis. Each segmentation solution, represented as a list of morphological items (inflected words tagged with their lemmatization as a root entry together with a morphological generator carrying its various features), is translated into a sequence of semantic role scripts. Verbal forms become sites of actions/situations, expecting complements as role assignments. These roles depend on the regime of the verb, given its voice. For instance, a transitive verb in the active voice demands a subject in the nominative and an object in the accusative for its role saturation. Dually, nominal phrases provide the corresponding roles. Matching opposite polarities gives rise to a constraint satisfaction problem over the role features. This corresponds, in Western linguistics, to the construction of the dependency structure in the sense of Tesnière, as computed in computational systems based on dependency grammars (and having their analogues in feature logical programming platforms such as HPSG or LFG). In the terminology of Indian linguists like Pāṇini, we do the analysis of kaarakas. The constraint satisfaction problem is similar to proper typing of categorial grammar parse trees, or to the construction of a proof net in commutative linear logic. However, non-linear phenomena are frequent. For instance, agreement of an adjective and its qualifying noun is a kind of contraction.
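The matching of demanded and provided roles can be sketched as a tiny constraint satisfaction problem. This is only an illustration of the principle, not the actual engine: the function name, the role and case labels and the English glosses are assumptions, and the sketch only checks that every role demanded by the verb is filled by a distinct nominal in the required case.

```python
from itertools import permutations


def saturations(demands, nominals):
    """List the ways to fill every demanded role with a distinct nominal.

    demands:  maps a role name to the case required by the verb
    nominals: list of (word, case) pairs provided by the sentence
    """
    roles = list(demands)
    solutions = []
    for chosen in permutations(nominals, len(roles)):
        if all(demands[role] == case for role, (_, case) in zip(roles, chosen)):
            solutions.append({role: word for role, (word, _) in zip(roles, chosen)})
    return solutions


# A transitive verb in the active voice: nominative subject, accusative object.
TRANSITIVE_ACTIVE = {"subject": "nom", "object": "acc"}

good = saturations(TRANSITIVE_ACTIVE, [("Rama", "nom"), ("forest", "acc")])
bad = saturations(TRANSITIVE_ACTIVE, [("Rama", "nom"), ("Sita", "nom")])
```

The first call yields the single assignment `{"subject": "Rama", "object": "forest"}`, while the second yields no solution, so a segmentation producing those tags would be penalized or rejected.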
The constraint satisfaction engine proceeds as a sequence of stream processors applied to the tagged sentence stream, going from right to left. Tool words are treated as postfix stream combinators: they are allowed to compute only in the past of their utterance. From this work arises the notion of a linguistic tool as a feature structure stream transducer. Pronouns are linguistic tools in this sense, since their purpose is to link to their anaphoric antecedent. In Sanskrit, a case study for coordination led to the implementation of the ca tool. This postfix conjunction has the effect of merging antecedent noun phrases with three semantic upper bound operations, respectively for gender, number, and person. For instance, it has the effect of transducing the sequence of tagged items for ``two girls and one boy'' into one tag for ``several male persons'', paving the way to the proper recognition of this compound item as a proper subject to a verb conjugated in the plural. This iteration of stream combinators computes a compound bonus-malus score.
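The merging of two coordinated tags can be sketched as follows. The orderings used here are assumptions made for illustration, not the ones implemented in the platform: masculine is taken to dominate the other genders, the first person to dominate the second and the second the third, and numbers are combined by capped addition (singular, dual, plural counting as 1, 2 and 3 or more).

```python
# Assumed dominance orders (hypothetical; only "masculine over feminine"
# is suggested by the example in the text).
GENDER_RANK = {"f": 0, "n": 1, "m": 2}
PERSON_RANK = {3: 0, 2: 1, 1: 2}
NUMBER_SIZE = {"sg": 1, "du": 2, "pl": 3}
SIZE_NUMBER = {1: "sg", 2: "du", 3: "pl"}


def coordinate(left, right):
    """Merge the tags of two coordinated noun phrases into a single tag."""
    gender = max(left["gender"], right["gender"], key=GENDER_RANK.get)
    person = max(left["person"], right["person"], key=PERSON_RANK.get)
    size = min(3, NUMBER_SIZE[left["number"]] + NUMBER_SIZE[right["number"]])
    return {"gender": gender, "number": SIZE_NUMBER[size], "person": person}


two_girls = {"gender": "f", "number": "du", "person": 3}
one_boy = {"gender": "m", "number": "sg", "person": 3}
merged = coordinate(two_girls, one_boy)
```

With these assumptions, `merged` is a masculine plural third person tag, which can then serve as the subject of a verb conjugated in the plural.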
The constraint engine, still under design, demonstrates a remarkable filtering capacity. Very often, sentences with several hundred potential phonemic segmentations are processed successfully, in the sense that most segmentation candidates are rejected as dubious, their bonus-malus score being below some threshold, while the intended meaning is retained. Rejection rates of 98% are frequent. This is rather encouraging, and it is expected that by December 2005 the prototype system will be released as a Sanskrit corpus tagging assistant. This application is entirely distributed as a Web service. The user on his client machine types in a sentence, calls remotely the parser, inspects the small number of surviving taggings, then may inspect each one in order to peruse the semantic analysis, presented as a pseudo-English paraphrase. Some non-determinism may remain: typically, a given segment may be lemmatized in several ways, either by homonymy, or by morphological ambiguity. Each path in the semantic dependency matrix is shown with its bonus-malus, and the user may select the one he prefers, yielding a completely disambiguated analysis which he may then store on his client machine, as a hypertext document indexing into the Sanskrit Heritage Dictionary (our structured lexical database). This service has no equivalent worldwide.
This Sanskrit platform was presented at the ATALA workshop on "Traitement automatique des langues anciennes", on May 21st in Paris. It was the topic of an invited lecture at the 5th International Conference on Logical Aspects of Computational Linguistics (LACL 2005) in Bordeaux on April 28th.
This work builds on the Zen toolkit for lexical processing designed by the author, and distributed as a free software Ocaml library. It investigates a notion of mixed automaton or aum, first presented in 2003 in his Automata Mista article for the Manna Festschrift. This work is being pursued as a general model for the modular construction of finite state machines, possibly non-deterministic, and possibly transducing their input on an output tape, in a purely applicative inductive data type whose operations model constructions of regular relations.
This year a new generic layer was abstracted for compiling control for the reactive engine, implementing an original notion of modular transducer. The user provides a system of regular expressions over phases, as well as specific aum recognizers for each phase. A meta-programming tool, implementing the Berry-Sethi algorithm for regular expression compiling, yields a sequential dispatcher tailored to the specific application, as a stand-alone ML module, linked as a plug-in to the generic Zen toolkit. This was the topic of the summer internship of Benoît Razet for his 2nd year Master project at University Paris 6. This work led to the release of version 2 of the Zen computational linguistics toolkit, as a free software Pidgin ML library. A joint article on the design of modular transducers has been submitted for publication.
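As an illustration of the compilation step, here is a minimal Python sketch of the textbook Berry-Sethi (position automaton) construction applied to a regular expression over phase names. It is not the Zen toolkit's ML code: the tuple encoding of expressions, the function names and the phase names "noun", "preverb" and "verb" are assumptions chosen for the example.

```python
def berry_sethi(regex):
    """Build the position automaton of a regular expression.

    regex is a nested tuple: ("eps",), ("sym", name), ("alt", r, s),
    ("cat", r, s) or ("star", r).  State 0 is initial; every other state
    is the position of one symbol occurrence.
    Returns (symbols, follow, finals).
    """
    symbols = {}  # position -> phase name read when entering that position
    follow = {}   # position -> set of positions that may come next

    def walk(node):
        """Return (nullable, first, last) of node, filling symbols and follow."""
        kind = node[0]
        if kind == "eps":
            return True, set(), set()
        if kind == "sym":
            pos = len(symbols) + 1
            symbols[pos] = node[1]
            follow[pos] = set()
            return False, {pos}, {pos}
        if kind == "alt":
            n1, f1, l1 = walk(node[1])
            n2, f2, l2 = walk(node[2])
            return n1 or n2, f1 | f2, l1 | l2
        if kind == "cat":
            n1, f1, l1 = walk(node[1])
            n2, f2, l2 = walk(node[2])
            for p in l1:
                follow[p] |= f2  # the end of r can be followed by the start of s
            first = (f1 | f2) if n1 else set(f1)
            last = (l1 | l2) if n2 else set(l2)
            return n1 and n2, first, last
        if kind == "star":
            _, f, l = walk(node[1])
            for p in l:
                follow[p] |= f   # loop back from the end to the start
            return True, set(f), set(l)
        raise ValueError(f"unknown node kind: {kind}")

    nullable, first, last = walk(regex)
    follow[0] = set(first)
    finals = set(last) | ({0} if nullable else set())
    return symbols, follow, finals


def accepts(automaton, phases):
    """Run the (possibly non-deterministic) automaton on a sequence of phases."""
    symbols, follow, finals = automaton
    states = {0}
    for phase in phases:
        states = {q for p in states for q in follow[p] if symbols[q] == phase}
        if not states:
            return False
    return bool(states & finals)


# Hypothetical phase expression: (noun | preverb* verb)*
PHASES = ("star", ("alt", ("sym", "noun"),
                   ("cat", ("star", ("sym", "preverb")), ("sym", "verb"))))
dispatcher = berry_sethi(PHASES)
```

With this expression, the sequence preverb, verb, noun is accepted, whereas a lone preverb is rejected because no verb follows it. In the setting described above, each transition would call the recognizer of the corresponding phase rather than compare names.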
Kim Gerdes, Sylvain Kahane (University of Paris 10) and Hi-Yon Yoo (University of Paris 7) discussed the implications of replacing the classical "morphological" structure of the Meaning-Text Framework with the topological constituent tree. They showed that two types of topological structures are frequently found: rather descriptive structures with multiple embeddings, and flattened-out structures that form the templates actually used in language production. The variety of flattened-out structures can then be explained as a combination of different embeddings of simpler structures. The remaining theoretical question for the integration of these structures in the (linear) Meaning-Text Model is which of these structures actually appears as the intermediate representation between syntax and phonology.
Calling German a "V2" language is a simplification. In many cases, it is possible to place two constituents before the finite verb. The reasons to do so seem to depend on the semantic and the communicative structure of the sentence, and very little on the syntactic functions of the elements. Kim Gerdes showed that this apparent contradiction with the Meaning-Text modularity separating semantics from topology can be resolved by exploring the power of the communicative markup on the syntactic dependency tree.
On the basis of the Spoken Dutch Corpus (CGN, a database containing syntactic annotations for a million words of contemporary spoken Dutch), Richard Moot experimented with several strategies for automatically extracting, at different levels of detail, a type-logical treebank representing a lexicon for categorial grammars.
The size of the extracted lexicons, with an average of around 50 different formulas possible for each word in a sentence, poses a considerable challenge for parsing using the extracted grammars. By adapting methods used for Part-of-Speech tagging (notably maximum-entropy models, which currently outperform other models) to these much richer lexical items, an approach called supertagging, it is possible to find the most likely sequences of lexical lookups for a sentence. Depending on the level of detail maintained in the lexicon, the number of different formulas varies between 1000 and 7000, whereas the accuracy of supertag disambiguation varies between 72% and 80%, which is comparable to results obtained for TAGs using the (presumably cleaner) Penn Treebank.
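The search for a most likely sequence of lexical lookups can be sketched as follows. The report does not state which search procedure was used, so this example substitutes a plain Viterbi search; the lexicon, the formulas and the scoring function are hypothetical stand-ins for what a trained maximum-entropy model would provide.

```python
def best_supertag_sequence(words, candidates, log_score):
    """Viterbi search for the highest-scoring sequence of formulas.

    candidates: dict mapping each word to its list of possible formulas
                (every word is assumed to have at least one).
    log_score(prev, formula, word): log-score of assigning formula to word
                given the previous formula (None at the start of the sentence).
    """
    if not words:
        return []
    # best[f] = (score of the best path ending in formula f, that path)
    best = {f: (log_score(None, f, words[0]), [f]) for f in candidates[words[0]]}
    for word in words[1:]:
        step = {}
        for f in candidates[word]:
            score, path = max(
                ((s + log_score(prev, f, word), p) for prev, (s, p) in best.items()),
                key=lambda pair: pair[0],
            )
            step[f] = (score, path + [f])
        best = step
    return max(best.values(), key=lambda pair: pair[0])[1]


# Toy example (hypothetical formulas and scores, not taken from the CGN lexicon).
LEXICON = {
    "the": ["np/n"],
    "dog": ["n", "np"],
    "barks": ["np\\s", "n"],
}
PLAUSIBLE = {(None, "np/n"), ("np/n", "n"), ("n", "np\\s")}


def toy_log_score(prev, formula, word):
    """Stand-in for a trained model: 0.0 for plausible transitions, -5.0 otherwise."""
    return 0.0 if (prev, formula) in PLAUSIBLE else -5.0


best = best_supertag_sequence(["the", "dog", "barks"], LEXICON, toy_log_score)
```

The cost of the search grows with the square of the number of formulas per word, which is why an average of around 50 formulas per word is a practical concern.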
The types in a categorial grammar form a hierarchy (using only the derivability relation between them). This hierarchy can be exploited to treat different linguistic phenomena such as French object clitics, even with clitic climbing, and to compute semantic representations in Montague style correctly whether or not control phenomena occur.
Henri Portine showed that the problem of relative clauses in daily French use is often blurred by the blending of two problems: on the one hand, the existence of an object in a corpus and the way it can be recovered according to the properties of the corpus; on the other hand, the duality between relative clauses and complement clauses. He also showed that the analysis of relative clauses in daily French use, identified as relative clauses which in fact would be complement clauses, is based on a conception of syntax as pure machinery. He proposed an analysis of this type of relative clause, which opens onto a notional conception of antecedents.
Houda Anoun is extending her implementation of categorial grammars in Coq to categorial minimalist grammars. This interactive proof search (i.e. parsing) makes it possible to test and explore the properties of several variants of these mixed grammars proposed by Lecomte, Retoré and Vermaat.
Maxime Amblard proved the existence of a Minimalist Grammar which generates the counting dependency languages. He also presented an algorithm for the construction of the lexicon Lex_m producing these languages. This class of languages, which models sentences such as "Peter, Mary and Charles had respectively 14, 12 and 6 in math, history and sport", belongs to the context-sensitive languages in the Chomsky hierarchy. This result generalizes the presentation given by Stabler for n = 5 to the general case. It also generalizes similar earlier results by providing a simpler grammar and handling such nested counter languages.
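To make the class of languages concrete, the following Python sketch tests membership in a counting dependency language. It assumes the usual definition, in which a word consists of k blocks of equal length, one per symbol and in a fixed order; the function name and the encoding of words as strings are choices made for this example. It is a direct membership test, not a minimalist grammar.

```python
def in_counting_language(word, alphabet):
    """Return True if word = a1^n a2^n ... ak^n for some n >= 0.

    alphabet is the ordered sequence (a1, ..., ak) of distinct symbols.
    """
    k = len(alphabet)
    if k == 0 or len(set(alphabet)) != k:
        raise ValueError("alphabet must contain distinct symbols")
    if len(word) % k != 0:
        return False
    n = len(word) // k
    # Build the only word of this length in the language and compare.
    expected = [symbol for symbol in alphabet for _ in range(n)]
    return list(word) == expected
```

For instance, with the alphabet "abc" the word "aabbcc" belongs to the language while "aabbc" and "abcabc" do not. The example sentence above has the same shape with three blocks of three items: names, marks and subjects.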
On the other hand, our team also provides a critique of grammars with movement. Many concepts like "movement", "scrambling", "gapping", "right node raising", etc., have their origin in the choice of constituent structures for the representation of syntax. Kim Gerdes explored the historical work on the development of X-bar phrase structures as the central syntactic representation. He showed how this choice came into being and how it persisted in spite of all successful implementations of simple alternatives.
Quantifier scope gives rise to many ambiguities in natural languages. The different possible readings of a sentence can be expressed with CLLS (Constraint Language for Lambda Structures), which models underspecified lambda-terms. Given a syntactic analysis with Minimalist Categorial Grammars, Amblard described how to extract relevant semantic representations with CLLS, discarding spurious cases.
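The kind of ambiguity at stake can be illustrated with a small Python sketch that enumerates the scope orders of a set of quantifiers around a common nucleus. This is neither CLLS, which relies on dominance constraints over lambda structures, nor the extraction procedure mentioned above; the string templates and the example sentence "every student reads a book" are assumptions made for the illustration, and no spurious reading is filtered out.

```python
from itertools import permutations


def scope_readings(quantifiers, nucleus):
    """Enumerate one formula per ordering of the quantifiers.

    Each quantifier is a template string with a {body} slot; the first
    quantifier of an ordering takes widest scope.
    """
    readings = []
    for order in permutations(quantifiers):
        formula = nucleus
        # Wrap from the innermost (last) quantifier outwards.
        for quantifier in reversed(order):
            formula = quantifier.format(body=formula)
        readings.append(formula)
    return readings


# "Every student reads a book" (hypothetical notation).
EVERY_STUDENT = "forall x. (student(x) -> {body})"
A_BOOK = "exists y. (book(y) & {body})"
readings = scope_readings([EVERY_STUDENT, A_BOOK], "read(x, y)")
```

The two readings obtained correspond to "each student reads some book, possibly a different one" and "there is one book that every student reads". An underspecified representation describes both at once instead of listing them.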
Roberto Bonato has defined an incremental algorithm for computing the binding relationship between words, especially when the bound term is a pronoun (possible or impossible coreference with its antecedent). Up to now there was no incremental computation of this relation, which was defined as a set of constraints on a complete analysis. He now also explores alternative interpretations of the traditional Principles of Binding Theory, with special attention devoted to Reinhart's 1983 work Anaphora and Semantic Interpretation and Reinhart and Reuland's 1993 Reflexivity. He integrated these different approaches into a unified computational framework that looks very promising for deriving, from general computational principles, some of the major stipulations of these approaches, which stem from the last 30 years of the linguistic and formal semantics tradition.
Christian Bassac showed with Pierrette Bouillon that the availability of various types of anaphoric reference (via a definite determiner NP, a possessive determiner NP or a demonstrative determiner) to the modifier in N1(mod) N2(head) compounds is predictable according to the type of the relationship R that holds between N1 and N2 and the role it is encoded in. The fact that no misalignment could be found in the data of the three languages considered (English, French and Turkish) tends to show that the predictions made rest on deep-rooted aspects of the semantics of compounds (they are, so to speak, qualia-driven) and can probably be generalized to other languages.
Christian Bassac defended a strong conception of compositionality for English root compounds and showed that the analyses that have prevailed so far, such as Downing's, which plead for a completely unconstrained and unpredictable meaning of root compounds, are both overly pessimistic and linguistically poorly motivated.
Christian Bassac analysed with Mehmet Ciçek the morphology of Turkish verbal and nominal predication to show that they are not opposed but both integrate a copula, which is sometimes manifested only by second articulation phenomena such as word stress. The results of this contribution challenge the claims of Pollock's theory of functional heads and plead for lexical rules to build the highly complex verbal forms of Turkish.
Henri Portine shed light on the discrepancy between the polysemy/homonymy pair considered from a diachronic point of view and the same pair as a cognitive fact. He showed that polysemy is a chain of relations, and that cognitive homonymy is based on the breaking of this chain, which is evidence of its radical difference from diachronic homonymy, which names the absence of a relation. From a cognitive point of view, the polysemy/homonymy pair is relevant in lexical semantics.
Most work on the Generative Lexicon (GL) is informal, leading to results that are more descriptive than amenable to automation. A working group in Signes (C. Bassac, P. Henry, R. Marlet, C. Retoré, as well as J. Vanier, an intern from the Ecole Centrale de Paris) has started a foundational effort to formalize GL. The goal is, given a parsed sentence, to construct possible interpretations in the form of logical formulas along the lines of Montague semantics but focusing on lexical information. A master thesis has been written along these lines but no article has been submitted yet.
The entries of GL have been formalized, with attention to variable binding and typing. This also includes role qualification for variables, dotted types, as well as subentries for the "telic" quale ("trigger" and "result" features). The type hierarchy and the set of primitive predicates are not fixed in the formalization: they are considered as parameters, to be defined along with any given lexicon instance.
A general framework for constituent composition has been defined and the main generative mechanisms, such as coercion and co-composition, have been specified as formal algorithms. The composition mechanisms have also been extended to depend on a semantic distance between predicates, enabling combination modulo predicate similarity as well as ranking between different interpretations.
Another facet of this formalization work concerns how to abstract semantic issues that are irrelevant to GL, such as anaphora resolution or quantification originating from determiners, and how to nonetheless recover this information in the final formulas. This abstraction cleans up the syntactic representation of the sentence, only keeping simple word associations, to be used as input to the GL combination mechanisms. This leaves the focus on what GL is good at, i.e., to define how words associate to construct new meanings.
Joan Busquets has been working on vp-Ellipsis and its different semantic-discourse constraints. A comparative analysis between vp-Ellipsis and Stripping in Catalan and English shows that both types of constructions need to be clearly distinguished in Catalan, as they are in English. The evidence comes from the analysis of so-called information packaging. On the one hand, Stripping constructions are under the control of focus by means of parallel foci. On the other hand, vp-Ellipsis constructions are not constrained by information packaging, although this notion might help to disambiguate the target in certain cases. A more fine-grained analysis with some anaphoric discourse properties for both constructions will be at issue in the final published version of these results.
The anaphoric properties of the Catalan expression fer-ho (do it) in elliptical contexts have been explored from a semantic and discourse point of view. We describe the set of semantic constraints that the form fer-ho imposes on the complements it substitutes for. Moreover, we provide relevant linguistic examples to analyze these contexts as narrow ellipsis, as opposed to vp-Ellipsis as wide ellipsis.
Finally, the interaction among negation, vp-Ellipsis, and presupposition has been considered from a dynamic discourse semantics approach (Segmented Discourse Representation Theory). By means of a set of constraints related to the Contrast discourse relation, we are able to explain the difference between factive and non-factive verbs in elliptical contexts when the negation is the unique remnant in the elliptical or target proposition.
Most natural language quantifiers are vague, e.g. in French: "quelques, peu, un peu, beaucoup, certains". Moreover they suggest different kinds of inference: "logical" consequences and implicatures, in the Gricean sense. Using the Logic of Partial Information, Areski Naït-Abdallah (University of Brest) and Alain Lecomte gave a rigorous account of some pragmatic notions formerly studied by the linguist O. Ducrot.
Pierre Guitteny studied diathesis in LSF and established the existence of passive or inverse constructions in LSF on the basis of a corpus study.
Pursuing the work of Olivier De Langhe, Pierre Guitteny, Henri Portine and Christian Retoré, Emilie Voisin further experimented with sign order in LSF. She observed that verb flexion, if any, can be influenced by the subject, the object as well as personal transfer. Her analysis showed that under some circumstances, in particular when verbal flexion is influenced by the object, the sign order is SOV rather than OSV.
Gérard Huet ported his Sanskrit processing workbench as an application for the Simputer, a hand-held computing device running Linux developed in India. He visited the PicoPeta corporation in Bangalore, one of the manufacturers of the Simputer, in order to initiate a possible technology transfer towards a pocket Sanskrit machine.
The Aquitaine region is funding (together with INRIA and LaBRI-CNRS) a project on sign language processing and a PhD grant on the same topic. Given an accurate video recorder with the corresponding software and computer, our team should be able to build a very good quality corpus of spontaneous sign language speech as well as guided experiments. Contact: Christian Retoré
Signes is one of the fifteen research teams of the Groupe de Recherches 2521 (C.N.R.S.) directed by Francis Corblin (Université Paris IV). This research program is divided into Opérations: Modèles et formats de représentation pour la sémantique, Les Modèles à l'épreuve des données, Sémantique et corpus, Les interfaces de la sémantique linguistique, Sémantique computationnelle. The Signes team is part of the latter two operations, which could be translated as Interfaces of linguistic semantics and Computational semantics.
Alain Lecomte is supervising the project VALI (Vers des assistants lecteurs intelligents) in this setting. It is intended to develop tools that help a new researcher grasp the contents of a research article. To do so, the contents can be organized using linguistic theories like SDRT, and logical tools like the proof assistant Coq can be applied to deduce relationships between parts of the contents.
The Signes team is an active node of this network and, in particular, of section 6 of this network: computational logic for natural language processing, headed by Michael Moortgat. The contact person is Gérard Huet.
A research program entitled Generative grammar and deductive systems for the processing of natural language syntax and semantics has been approved for 2004 and renewed for 2005. The other team in this bilateral research program is Computational linguistics and logic, directed by Michael Moortgat at Utrecht Institute of Linguistics. The Dutch contact is Willemijn Vermaat, and the French one is Christian Retoré.
Enric Vallduvi, Joan Busquets, Pascal Amsili, Etude comparative des connecteurs et des marqueurs discursifs dans le cadre d'une sémantique dynamique du discours.
Gérard Huet has been a member of the Académie des sciences since November 2002.
Gérard Huet was invited to become a member of the International Advisory Board of NII (National Institute of Informatics) in Tokyo, Japan. He participated in the first meeting of this board on June 2nd, and was subsequently invited to write an opinion column in NII's journal.
Alain Lecomte has been on the editorial board of the journal TAL – Traitement Automatique des Langues, Editions Hermès, Paris since August 2001.
Alain Lecomte and Christian Retoré are on the editorial board of the book series Research in Logic and Formal Linguistics, Edizione Bulzoni, Roma, since 1999.
Henri Portine is on the editorial board of the journal ALSIC – Apprentissage des Langues et Systèmes d'Information et de Communication
Christian Retoré has been a reviewer for Mathematical Reviews since October 2003.
Christian Retoré has been editor in chief of the journal TAL – Traitement Automatique des Langues, Editions Hermès, Paris since April 2004 (on the editorial board since 2001).
Maxime Amblard and Renaud Marlet chaired the LACL Student Session committee, 2005.
Christian Bassac was on the program committee of International Morphology Conference, Toulouse, December 2005.
Christian Bassac was on the program committee of the 3rd international workshop on Generative Approaches to the Lexicon, Geneva, March 2005.
Christian Bassac was on the program committee of the student session of Logical Aspects of Computational Linguistics 2005 (Bordeaux)
Joan Busquets was on the program committee of the Symposium sur l'étude du Sens : Exploration et Modélisation 2005 (Biarritz)
Joan Busquets, Richard Moot and Christian Retoré were on the committee of Logical Aspects of Computational Linguistics 2005 (Bordeaux)
Richard Moot was on the reading committee of TALN 2006.
Christian Retoré was on the program committee of ESSLLI 2005 (Edinburgh).
Christian Retoré is on the reading committee of Human Language Technology / Empirical Methods in NLP 2005 (Vancouver)
Christian Retoré is on the program committee of Traitement Automatique du Langage Naturel 2006 (Leuven)
Christian Retoré is on the reading committee of Human Language Technology / North American Chapter of the ACL 2006 (New York)
Christian Bassac is a member of the hiring committee in linguistics of Université Bordeaux 3.
Joan Busquets is a member of the hiring committees in linguistics of Université Toulouse 2 and Université Bordeaux 3.
Gérard Huet is a nominated scientific personality of the board of governors of the Université Paris 7.
Renaud Marlet was a member of the hiring committee for junior research scientist at INRIA Futurs.
Henri Portine is a member of the hiring committees in linguistics of Université Paris 3 and Université Bordeaux 3.
Henri Portine is an elected member of the board of governors of the Université Bordeaux 3 and of Institut Universitaire de Formation des Maîtres d'Aquitaine.
Henri Portine is the head of the linguistic and literature faculty of Université Bordeaux 3.
Henri Portine is the head of the research team Text, Language, Cognition (JE 2385).
Christian Retoré is a member of the hiring committee in computer-science of Université Bordeaux 1.
Christian Retoré is a member of the committee of the faculty of mathematics and computer science of the Université Bordeaux 1.
Christian Bassac organised the Journées de Linguistique Anglaise, 26-27 October 2005.
Joan Busquets, Richard Moot, Christian Retoré organized the 5th international conference on Logical Aspects of Computational Linguistics, 28-30 April 2005.
Kim Gerdes and Maxime Amblard organized the weekly seminar Linguistique et informatique, Universités Bordeaux 1 et 3.
Since all its members are university staff, Signes is heavily involved in teaching, both in the computer science curriculum (University Bordeaux 1) and in the linguistics curriculum (University of Bordeaux 3). The lectures whose topic is computational linguistics are the following:
Natural language processing, Bordeaux 1, PhD students in computer science (Christian Retoré)
Structures Informatiques et Logiques pour la Modélisation Linguistique, Parisian Master of Research in Informatics (MPRI). (Gérard Huet, Philippe de Groote)
Symbolic natural language processing, Bordeaux 1, 5th year in computer science (Christian Retoré)
Utterance acts and semantics, Bordeaux 3, 5th year in linguistics (Henri Portine)
The syntax of Wh-clauses and extraction, Bordeaux 3, 5th year in linguistics (Christian Bassac)
Finite state natural language processing, Bordeaux 1, 4th year in computer science (Christian Retoré)
The principle of charity: Quine and Davidson, Bordeaux 3, 4th year in linguistics (Joan Busquets)
Pragmatics, Bordeaux 3, 4th year in linguistics (Joan Busquets)
Word order and its formalization, Bordeaux 3, 4th year in linguistics (Kim Gerdes)
Linguistic formalisms, Bordeaux 3, 4th year in linguistics (Lionel Clément, Kim Gerdes, Renaud Marlet)
Christian Retoré is reviewing the habilitation of Isabelle Tellier (Modéliser l'acquisition de la syntaxe du langage naturel via l'hypothèse de la primauté du sens, Université de Lille 1, 8-12-05).
Gérard Huet supervised the master thesis of Benoît Razet: Automates modulaires, Université Paris 7, 2005.
Christian Retoré and Kim Gerdes supervised the master thesis of Nicolas Letteron: Construction automatique d'un dictionnaire à partir de corpus, Université Bordeaux 1.
Christian Bassac, Renaud Marlet and Christian Retoré supervised the master thesis of Jules Vanier: Vers un modèle de représentation des connaissances dédié à l'analyse sémantique de la phrase, University of Paris 7 (and Ecole Centrale de Paris).
Alain Lecomte is supervising the thesis work of Tran Vu Truc, Logique d'informations partielles pour le traitement des implicites. (Université Grenoble II)
Alain Lecomte and Christian Retoré are co-supervising the thesis work of Maxime Amblard, Calcul de représentations sémantiques dans les grammaires minimalistes. (Université Bordeaux 1)
Henri Portine and Renaud Marlet are supervising the thesis work of Emilie Voisin, Génération automatique d'énoncés en Langue des Signes Française. (Université Bordeaux 3)
Henri Portine is supervising the thesis work of Pierre Guitteny, Le passif en Langue des Signes Française. (Université Bordeaux 3)
Christian Retoré and Alexandre Dikovsky (Université de Nantes) are co-supervising the thesis work of Erwan Moreau, Acquisition de grammaires catégorielles et de grammaires de dépendances. (Université de Nantes)
Christian Retoré and Denis Delfitto (Università di Verona) are co-supervising the thesis work of Roberto Bonato, Algorithmes de calcul de représentations sémantiques à partir d'analyses de type générativiste et algorithmes inverses. (cotutored PhD Université Bordeaux 1 / Università di Verona)
Lionel Clément (INRIA-Rocquencourt) visited Signes in January 2005. (seminar)
Jan van Eijck (Amsterdam) visited Signes in March 2005 (van Gogh PAI): Earley algorithm for parsing indexed grammars; definable generalised quantifiers.
Willemijn Vermaat, Matteo Capeletti (OTS, Utrecht) visited Signes in May 2005 (van Gogh PAI)
Cristiano Chiesi (Sienna & MIT) visited Signes in May 2005 (incremental parsing of minimalist grammars)
Jean-Marie Pierrel (ATILF, Nancy) visited Signes in May 2005 (seminar)
Marie-Laure Guénot (LPL, Aix) visited Signes in June 2005 (seminar)
Laurence Danlos (TALANA, Paris) visited Signes in June 2005 (seminar)
Jens Michaelis (Tuebingen/Potsdam) and Hans-Martin Gaertner (Berlin) visited Signes in September 2005 for a week (working group)
Emilie Guimer de Neef (France Telecom), Emilie Chetelat and Loic Kervajan (France Telecom & DELIC) visited Signes in November 2005.
Michael Moortgat, Matteo Capeletti (OTS, Utrecht) visited Signes in December 2005 (van Gogh PAI)
In January, G. Huet participated in the annual TECS Excellence week in Pune, India, as a member of the International Advisory Board of TRDDC (Tata Consultancy Services).
On April 13th, G. Huet was invited to give a talk at the International Conference on Rewriting Theory and Applications (RTA'05) in Nara, Japan. He talked on "Rewriting before RTA".
On April 26th, G. Huet was invited to deliver the Robin Milner lecture at University of Edinburgh. He talked on "Design of a Computational Linguistics Platform".
Christian Bassac and Richard Moot attended ESSLLI 2005 in Edinburgh.
Maxime Amblard, Pierre Guitteny, Christian Retoré and Emilie Voisin attended the conference TALN 2005, Dourdan, June 2005.