Synthèse de la parole

cordial Man-machine oral and multimodal communication COG Laurent Miclet Faculty member (Professor), Enssat Joëlle Thépault Administrative assistant, Enssat Nelly Barbot Associate Professor, Enssat Olivier Boëffard Professor, Enssat Arnaud Delhay Associate Professor, iut Marc Guyomard Professor, Enssat Jean-Christophe Pettier Associate Professor, Enssat Jacques Siroux Professor, iut Pierre Alain bourse INRIA, from October the 1st, 2003 Sabri Bayoudh bourse INRIA, from October the 1st, 2004 Ali Choumane bourse Région, from October the 1st, 2005 Karl DeVooght FTR&D, from January the 1st, 2005 Josselin Huaulmé CIFRE Télisma, from October the 1st, 2003 Erwan Livolant FTR&D, from February the 1st, 2004 Salma Mouline FTR&D, from January the 1st, 2003 Sylvie Saget bourse Région, from October the 15th, 2003 Overall Objectives Overall Objectives

The Cordial project explores several aspects of multimodal man-machine interfaces, with speech components. Its objectives are both theoretical and practical : on the one hand, no natural dialogue system can be designed without an understanding and a theory of the dialogic activity. On the other hand, the development and the test of real systems allow the evaluation of the models and the constitution of corpora.

The conception of a man-machine interface has to take into account the communication habits of the users, which have been developed within interpersonnal communication. This is particularly true for interfaces using speech, which is a medium quite performant and spontaneous. Users have great difficulties to communicate through an oral dialogue with a machine having a speech interface of mediocre quality. The dialogue phenomena are complex , involving spontaneous speech understanding, strong use of pragmatics in the dialogue process, prosodic effects, etc.

Dialogue modeling

When multimodal dialogue is involved, the interference between speech phenomena and tactile actions or mouse clicks brings up problems of interpreting the coordination of the different actions of the user.

When a user makes a communication action towards the dialogue system, he certainly has an intention; but often, this intention is not explicitely present in the communication. A major problem for the system is to extract it, in order to be able to give a satisfactory answer. This requires a theory coping with the notions of intention, background knowledge, communication between agents, etc. We modelize the dialogue phenomena by using the concepts of speech acts and dialogue acts, and we consider that a sequence of exchanges can be analyzed as the result of a planning. This model gives a satisfactory modeling of many phenomena in real dialogues, such as the coordination between different negotiation phases or the management of the user's knowledge base.

However, several points are not straightforwardly modeled in such a theory: parts of the dialogue do not carry any obvious intention or errors in understanding may mistake the planner, etc. Moreover, the extraction of the dialogue acts from the speech of the user is a complex problem, as is also the restitution of the dialogue acts of the system into synthetic speech.

Machine learning

In addition to the modeling of the core of dialogue phenomena, the Cordial project has also a particular interest in machine learning from corpora at different stages of a dialogue system. It covers the extraction of semantics from the outputs of a speech recognizer. It also tackles the problems of constructing the prosody of the machine synthetic speech or helping the dialogue engine to compute an answer. Machine learning is a field with many different facets, spanning from the inference of finite automata from symbolic sequences data to the optimization of parameters in stochastic processes. Our research in this field makes use of quite different techniques, reflecting the variety of the data and of the models met at the different stages of a dialogue system.

Speech synthesis.

The front-end part of an oral dialogue system consists in a text generator producing a sequence of words, corresponding to the message to be emitted . This part of text is then converted into an oral message through a speech synthesis system. The text to speech technology has still to increase its quality, especially in a dialogue environment, in order to produce a speech as natural as possible. This can be made partly in producing a good prosody, but also in working on the quality of the acoustic signal. In a dialogue system, the speech synthesizer can be given extra information on the semantics and the pragmatics of the situation and therefore produce a speech with special effects: delivering information, stressing on a detail, insisting on a misunderstanding, repeating an information, etc. This can influence the way the message has to be delivered, especially concerning its prosody. In the same way, a text-to-speech system built from the target application can lead to significative improvements . Lastly, an interesting problem is to diversify the synthetic voices, without having to record and index a new corpus. Acoustic voice transformation techniques can be used, but changing a voice into another requires also a modification of the segmental and prosodic characteristics.

Scientific Foundations Introduction

Our activities are distributed into four complementary domains. The first one is concerned with both the coding and the structure of interaction. It also deals with the applications. The second one deals with multimodality and system prototyping(architecture and evaluation). The third one is concerned with machine learningtechniques and their application to dialogue phenomena and speech technologies. The last area deals with speech synthesisadapted to dialogue.

Dialogue and modeling Speech Acts planning plan recognition

We use a family of dialogue models based on speech acts plans. This modeling takes into account the general framework of communication and makes easier the implementation on computer. But it does not solve some problems like extracting speech acts from utterances or the integration of different information sources and miscommunication between participants.

Man-Machine interaction can be seen as a sequence of particular actions: speech acts , called in our context dialogue actswhich support both the function of the act in the dialogue (for example: requesting, querying, ...) and a propositional content (for example: the theme of the query). These acts can also be characterized by their conditions of use which are concerned with the mental states of the participants (intention, knowledge, belief). The most accurate computerized model is the planning operator , in which preconditions and constraints as well as effects of an act can be represented. For example, the act to ask for somebody to perform one action can be modeled as follows:

Request(Speaker, Hearer, Action(A) )

precondition-intention: Want(Speaker, Request(Speaker, Hearer, Action(A)))

precondition preparatory: Want(Speaker, Action(A))

Body: Mutual Belief(Hearer, Speaker, Want(Speaker, Action(A)))

effect: Want(Hearer, Action(A))

This can be interpreted as: when an agent wants that its listener performs an action A, it can use the action labeled Requestwhose goal is to build up a consensus between participants in order to perform A. Realizing this consensus is the task of another action which is not described here. The set of actions which are necessary for reaching a goal is named a plan. This approach makes the hypothesis that each dialogue partner participates in the realization of the other's plan. This dialogue act modeling allows to consider several types of automatic reasoning in order to manage the dialogue. The first one is concerned with the contextual understanding of user's utterances by means of a mechanism so-called plan recognition. It aims at rebuilding a part of the other participant's plan; if this part is correctly identified, it allows to give an account of the explicit motivations and believes of the other participant. A second process aims at computing a relevant response by means of a planning mechanism which is able, because of the nature of the modeling itself, to take into account the known information and the possible misunderstandings. This type of modeling makes easier the implementation in some simple situations but does not deal with some important problems in various fields.

Dialog act extraction

The first problem is to translate the sentence uttered by the user into a dialogue act. This process is not a simple transcoding problem. It is necessary to take into account altogether a large collection of knowledge (mental states, presuppositions, prosody, ...) as well as some indices present in the sentence (syntactic structure, lexical items, ...). In addition, the surface form of speech sentences contents a lot of irregularities (problems of performance) which complicates the speech recognition task as well as the understanding and interpretation tasks.

System modeling

The second problem takes place in the use of the planning formalism in order to associate three points of view : the one of the application, the one of the main dialogue (which is concerned with user's intentions towards the application) and the one of the dialogue management (meta dialogue and phatic dialogue). Some partial solutions have been found but they are not well adapted to data management applications (querying data base) or applications which allow several parallel tasks and the processing of certain functions for communication management. A possible approach to deal with this problem could be a multi-agent modeling. Indeed, this conceptual framework allows to combine a prioriexclusive models and dialogue contexts in order to increase the number of dialogue problems dealt with. Therefore, the problem is partly moved from dialogue modeling towards integration modeling.

Communication errors

The third problem arises frequently in interaction: it concerns bad communication. Each of the two participants ( i.e.the human and the system) can indeed have some erroneous knowledge about the application, about the other participant's abilities and about current references used to point out objects during the interaction. One error which concerns this information, may (in the long or short run) leads to a failure, i.e. to an impossibility for the system to satisfy the user. Detecting and dealing with these errors basically requires a characterization process and a plan based modeling.

Application modeling

In an interactive system, the application has to behave as an active component. In current systems, the application modeling affords two types of main defaults. The task model may be too rigid (for example: plans in the systems for transmitting information) constraining too heavily the user's initiative. The task model may also be based on constraints (as in CAD application), allowing in this way a user's activity more free but causing a lack of co-operation for helping the user to reach its goal. We believe that the task model has to include the following elements: data and their ontology, knowledge about the use of data (operating modes) and the interface with the rest of the system. Lastly, the modeling has to be designed in order to make easier the changing of the task.

System and multimodality multimodality reference

We are studying an additional modality, a tactile screen, in order to avoid some of the problems coming from using only speech. The problems to deal with due to this new modality are concerned with integrating messages coming from the different channels, processing of references as well as evaluating systems.

The use of speech technologies in interactive systems raises problems and difficulties spanning from the design of complete softwares (including the research of the task) to the architecture design, including a particularly good quality speech synthesis and the introduction of a new modality.

Human communication is seldom monomodal: gesture and speech are often used jointly because of functional motivations (designing elements, communication reliability). In a speech environment, introducing an additional modality -in our case, gesture by means of a tactile screen- allows to overcome some speech recognition errors.

But it raises also new difficulties. The first one is that the informations come from various communication channels: at which level (syntactic, semantic, pragmatic) has the integration to be done? What kind of modeling has to be used? In the literature, few satisfactory responses can be found. We chose to lean on Maybury's works , performed in a different context (the generation of communicative acts for the system ouput). Maybury proposes several levels of communicative acts which allow to integrate at each level information coming from different modalities. We adopt this principle (which is fully coherent with our dialogue modeling) but we use it for recognizing the act: the tactile and speech modalities are processed separately as communicative acts which are merged in speech acts.

The second difficulty is the processing of references, particularly in the framework of the chosen application (querying a geographical and tourist database). Indicating the interesting objects during the dialogue is done both by means of speech sentence and gesture (pointing out, drawing a zone) and takes into account the application context (the user can follow the outline of a cartographic object with her finger).

Studies in this domain are in the linguistic field and in the artificial intelligence field. Some linguists propose very precise studies about the condition of use of prepositions (functional approach) in the designation of objects. We think that these results are interesting and we have adapted them for our parsing of sentences. In the artificial intelligence field, several modeling of spatial relations have been proposed. We use the one proposed by IRIT (Toulouse) in order to check the semantic coherency of referential expressions in the framework of our application. This modeling is based on certain characteristics (dimension, morphology, ...) of elements which govern the use of linguistic items in the expressions.

The ambition to put dialogue systems on the market needs to comply with requirement about the quality of interaction. It is necessary to be able to evaluate and compare different systems using different points of view (speech recognition rate, dialogue efficiency, language and dialogue abilities,...) in the framework of equivalent applications, and eventually for the same system, to evaluate different approaches. Various metrics have been yet proposed , (for example: length of dialogue, number of speech turns for recovering speech recognition errors), but they do not take into account all the dimensions of an interactive system. Some new solutions are currently under consideration (for example in the CLIPS labs in Grenoble): they are based on pragmatics issues such relevance, or based on the concept of system self evaluation which consists in doing process by the system, or by one part of it, pieces of dialogue which present some difficulties, giving it all necessary contextual information.

Machine learning in dialogue systems machine learning grammatical inference Kalman filter hidden Markov model speech data bases

This research theme focuses on the elaboration of machine learning methodologies in all the stages of a dialogue system.

Machine Learning can be seen as the branch of Artificial Intelligence concerned with the development of programs able to increase their performances with their experience . It is basically concerned with the problem of inductionor generalization, which is to extract a concept or a process from examples of its output. From an engineering point of view, a Machine Learning algorithm is often the search for the best element h^*in a family $Im1 $\#8459 $$ of functions, of statistical parameters or of algorithms. Such a choice is done in optimizing a continuous or a discrete function on a set of learning examples. The element h^*must capture the properties of this learning set and generalize its properties.

Machine Learning is a very active field, gathering a variety of different techniques. Grossly speaking, two families of techniques can be distinguished. On the one hand, some Machine Learning algorithms use learning sets of symbolic data and discover a concept h^*which is also symbolic. For example, Grammatical Inference learns finite automata from set of sentences. On the other hand, other Machine Learning algorithms extract numerical concepts from numerical data. Neural networks, Support Vectors Machines, Hidden Markov Models are methods of the second kind. Some methods can work in examples with both numerical and symbolic features, as Decision Trees do. Some concepts that are learned may have both a structure and a set of real values to optimize, as Bayesian Networks or stochastic automata, for example.

The Cordial project is concerned with the introduction of Machine Learning techniques at every stage of a dialogue process. This implies that we want to learn concepts which basically produce time ordered sequences. That is why we are interested in learning from sequences, either in a symbolic background or in a statistical one.

Grammatical inference.

In the frontal part on an oral dialogue system, the incoming speech is processed by a recognition device, generally producing a latticeof word hypotheses, i.e. the lexical possibilities between two instants in the sentence. Then a syntax has to be used, to help producing a sequence of words with the best conjoint lexical and syntactic likelihood.

The syntactic analysis can be realized either through a formal model, given a prioriby the designer of the system, or through a statistical model, the simplest being based on the counting of how grammatical classes follow each other in a learning corpus ( bigrammodel).

Both types of models are of interest in Machine Learning : grammatical inference is basically the theory and the algorithmics of extracting formal grammars from samples of sentences; the discovery of a statistical model from a corpus is an important problem in natural language processing. It is interesting to combine both approaches in extracting from the learning corpus a stochastic finite automaton as the language model. It has the advantages of a probabilistic model, but can also exhibit long distances dependencies reflecting a real structure in the sentences.

We have worked on grammatical inference in the recent years, especially within a contract with FTR&D between 1998 and 2001. The field is always very active in the Machine Learning community. Many progresses in grammatical inference have recently be done in the framework of Language and Speech processing ( , ).

We are now interested in the learning of a special class of finite automata called transducers. They read a sentence to produce another one, on a different alphabet. The machine learning of transducers from sets of couples of sentences is a well mastered problem (some real size experiments in language translation have been already made, , ). We want to experiment and improve these techniques in the framework of the transformation of the outputs of a speech recognizer into a sequence of dialogue acts (see ). In particular, we will consider the introduction of domain knowledge in the learning algorithm.

Nearest Neighbors learning of tree structures

Any sentence is both a sequence of words and a hierarchical organization of this sequence. The second aspect is particularly important to analyze if one wants to understand syntactic and prosodic aspects in oral speech. Producing synthetic speech in oral dialogue requires a good quality prosody generator, since much information is carried through that channel. Usually, the prosody in synthetic speech is made by rules which use syllabic, lexical, syntactic and pragmatic information to compute the pitch and the duration of every syllabe of the synthetic sentence.

An alternative issue is to consider a corpus of natural sentences and to use some machine learning algorithms. More precisely, any sentence in this learning set must be described both in terms of relevant information with regards to its prosody (syllabic, lexical, etc.) and in terms of its prosody. The machine learning task is to produce explicit or hidden rules to associate the description with the prosody. At the end of the learning procedure, a prosody can be associated to any sentence described in the same representation.

The learning methods used in the bibliography make use of neural networks or decision trees, ignoring the hierarchical nature of the organization of the syntax and the prosody, which are also known to have strong links. This is why we have represented a sentence by a tree and made use of a corpus-based learning method. In a first step, we have used the nearest-neighbour rule.

Given a learning sample of couples of trees (sentences) and labels (prosody), $Im2 ${{\#119982 ={(}t_i,p_i{)}}}$$ and a tree x, the nearest-neighbour rule finds in $Im3 $\#119982 $$ the tree $Im4 $t^\#9734 $$ which is the closest to xand adapts to xa prosody p_xdirectly deduced from $Im5 $p^\#9734 $$ .

This raises two problems: firstly to find a good description of a sentence as a tree, secondly to define a distance between trees. We have worked on these questions during the last years , .

Learning by analogy in sequences and trees structures

In the context of speech synthesis, we would like to use now a more sophisticated lazy learning method: learning by analogy. Its principle is as follows: knowing a sentence xto synthesize, look for a triplet of sentences ( b, c, d) in $Im3 $\#119982 $$ such that xis to bas cis to d.

Actually, we do not yet study learning by analogy directly on trees, but on sequences. The reason is that we use a distance between the trees and the sequences (the edit distance) which is much easier to manage on the universe of sequences.

We have firstly worked on defining what is solving an analogical equation on sequenceswhen the edit distance is introduced. In general, an analogical equation can be described as follows: find xfrom a triple a, band csuch that ais to bas cis to xand is often written by

a b c x

Solving analogical equations

The idea is to generalize the studies of Lepage and of Yvon for whom the edit distance is a trivial case. The classical definition of a: b:: c: das an analogical equation requires the satisfaction of two axioms, expressed as equivalences of this primitive equation with two others equations :

Symmetry of the 'as' relation: c: d:: a: b

Exchange of the means: a: c:: b: d

Symmetry of the 'as' relation:	c: d:: a: b
Exchange of the means:	a: c:: b: d

As a consequence of these two primitive axioms, five other equations are easy to prove equivalent to a: b:: c: d.

Another possible axiom ( determinism) requires that one of the following trivial equations has a unique solution (the other being a consequence):

a: a:: b: x $\Rightarrow$ x= b

a: b:: a: x $\Rightarrow$ x= b

a: a:: b: x $\Rightarrow$ x= b
a: b:: a: x $\Rightarrow$ x= b

We can give now a definition of a solution to an analogical equation which takes into account the axioms of analogy : xis a correct solutionto the analogical equation a: b:: c: xif xis a solution to this equation and is also a solution to the two others equations: c: x:: a: band a: c:: b: x.

Solving analogical equations between sequences has only drawn little attention in the past. Most relevant to our discussion are the works of Yves Lepage, presented in full details in and the very recent work of Yvon and Stroppa .

Our approach to solving equations on sequences is based on classical edit distance and uses deletion, insertion and substitution. We did not assume that inclusion property is true for analogy. That is where our approach generalizes the studies of Yvon and Lepage.

We consider that the relation "is to" is defined with the alignment between two sequences, and that the relation "as" requires to compare two alignments, which are themselves sequences (or more simply, "as" can be the equality).

Aims of this study

We aim at giving a sound definition of analogy in sequences as a first step, then in prosodic tree structures in a second step. With this definition of analogy, we will implement an algorithm for solving analogical equation. Then, in the learning by analogy problem, the adaptation of fast NN-algorithms, such as AESA , is necessary. AESA is interesting as it gives a nearest neighbour in constant time on average, with the cost of a pre-computation that is linear in time and space.

Learning speech units for speech synthesis

Text-to-speech synthesis(TTS) can be carried out by the concatenating of acoustic units obtained from a continuous speech database. The state-of-the-art TTS systems consist in juxtaposing pre-recorded acoustic units, typically phones, diphones or units of non-uniform length.

An alternative to the production of speech from a dictionnary of diphones consists in using a indexed corpus of continuous speech . When one has to produce a sequence of phonemes, the idea is to get in the corpus the best acoustic sequence. It is selected according to several criteria: its phonetic correspondance, its length, position in the sentence, etc. The relative importance of these criteria can be tuned by learning.

The multiple representation of these configurations at the acoustic and phonological levels enables voice quality to be improved significantly , . Furthermore, one can consider that the acoustic segments used to build an acoustic utterance no longer have a predefined linguistic definition. We consider here that the phonological units are not defined a prioriover a finite set of phonemes. Therefore, we face a combinatorial issue where linguistic units , that we no longer know, are useful to parse a phoneme sequence in order to find the best acoustic segments.

Our methodological research framework try to answer the following points :

From a linguistic point of view, how can we automatically build a set of phonological units ?

From an acoustic quality point of view, given a continuous speech database with phone labels, how can we characterize the best sequence of linguistic units?

From an algorithmic point of view, in a graph search framework, what are the best heuristics to solve this combinatorial problem ?

From a pragmatic point of view, given a target application, what could be the best set of pre-recorded speech sentences yielding the best TTS quality?

Automatic Speech Labelling and Recognition

Machine learning methodologies based on statistical approaches require databases of consequent size. The examples taken from these databases show the relationship between the numerous variables involved in the studied phenomenon. In voice synthesis as well as in voice recognition, one wishes to be able to have an explicit relation between the acoustic level and the phonological level. If an automatic labelling starting from phonetic sequences is a task which finds acceptable solutions, the process is more complex when only the text is known.

In such a context, given a speech utterance realized by a speaker and its particular phonetic transcription, the precise location of temporal marks delimiting phone boundaries on the speech is required. The state-of-the-art systems use a markovian description of the speech in an appropriate acoustic space .

Sequences of Hidden Markov Models (HMM) are built from the phonetic description of the acoustic observations. As one needs to discriminate phone boundaries, the majority of the phone segmentation systems postulates a monophone modeling hypothesis. During a learning phase, the parameters of each phone model are learned through a set of exemples using the well known EM iteration scheme . During a decoding phase, the segmentation system finds the most probable alignment between the sequence of models and the observations. The temporal marks delimiting each phone are easily recovered using the model transitions on the optimal path alignment.

We propose to weaken the previous hypothesis by relaxing the exact phonetic transcription with the exact phonemic sequence. The phonemic sequence is built automatically from the text. Under the same phonemic symbol, various acoustic realizations can be found depending on the coarticulation context of the realized phone.

Firstly, we developed a baseline speech segmentation system based on HMM – Hidden Markov Models as briefly exposed section . The scores of segmentation which we obtained are equivalent to those of the state-of-the-art systems from literature. Moreover, to analyze the behavior of this baseline segmentation system, we carried out experiments on two axes : the topological definition of the HMM and the acoustic analysis of the speech .

Secondly, we focus our activity on the automatic phone labelling from the text, not from the true phonetic sequence. Given an automatic phonemic transcription from the text, various acoustic realizations can be found depending on the coarticulation context of the realized phone and depending on the speaker. Since the HMM framework is well adapted to introduce variants of pronunciation, all is needed is to extend the graph of model hypotheses and let the decoding phase find the best alignment. The main drawback of this scenario is that as the degrees of freedom increase, the system becomes instable and less accurate.

Learning semantics from speech

Currently, many automatic systems delivering information suppose that the user of such a service must be able to adapt himself to the implicit requirements of the automatic system.

We postulate that a man-machine interface being based on the natural language must facilitate the access of the greatest number of us with this type of services . Under this assumption, the machine must make the maximal effort to adapt itself to the user.

There already exists many information systems which technology is based on a man-machine oral dialogue. In a first stage, the speech, the entry of such a system, is translated into a sequence of words. This sequence of words will be then treated by a pragmatic entity taking into account a dialogue model. In return, the machine response is stated by a speech synthesis system starting from the text or a concept modeling.

Within this experimental framework, a problem which still today does not find satisfactory scientific and technological answers is that of the semantic treatment.

A semantic function in a context of natural speech processing has a double objective. Firstly, a pragmatic treatment carried out on a sequence of concepts is more relevant than carried out directly on a sequence of words – words are sensitive to the errors of the recognition system. Secondly, a system which would be able to understandthe message can propose different alternatives, what cannot do the current speech synthesis systems starting from the text.

The proposed research framework is settled between the output of an automatic speech recognition system and the input of a dialogue management system. From the description of a statement recognized by the ASR system and translated into a lattice of words, the goal consists in providing the sequence of the underlying concepts. We propose to explore the temporal dimension of the sequence of the concepts in an oral statement starting from its relationship to syntax and more particulary the discovery of the set of thematic roles .

We will adopt a methodology based on the observation of corpus of examples within a statistical theoretical framework. However, it is difficult to find corpora annotated by semantic elements (particularly in French). The phenomenon will be described by random variables partially observed , . Then, an objective consists in determining the optimal quantity of annotated information required for training and mixing them with unlabelled corpora under an assumption of unsupervised training .

Learning prosody from speech

On the one hand, speech processing state of the art can efficiently model acoustic voice characteristics starting from a voice print. On the other hand, few studies are interested in the suprasegmental aspect of the speech, more precisely on the automatic modeling of melody contours . One could find multiple objectives, setting up automatic voice transformation systems, merging prosodic and acoustic information in a ASR system, tuning text-to-speech synthesis systems with ad hocprosodic models.

We propose a model making it possible to describe a melody contour at the sentence level built over a sequence of elementary melody contours. The difficulty of modeling is that one does not a prioriknow an alphabet made of classes of elementary contours. They thus should be estimated starting from the observations of the sentence level all while being based on assumptions of parsimony .

A melody contour is a mono-dimensional signal with real values evolving according to time. We propose to take for methodological assumption the class of dynamical state space models . We consider that the observation of a portion of the melody is explained by a stochastic state variable defining the equation of a standard Kalman filter under gaussian and linearities assumptions. We suppose that a portion of the melody curve followed by a Kalman filter corresponds to a class. The complete observation of the sentence level is governed by a time switching of Kalman filters. This switching process is modeled by an hidden Markov chain.

Learning to improve the dialogue management

We modelize the dialogue phenomena by using the concepts of speech acts and dialogue acts, and we consider that a sequence of exchanges can be analyzed as the result of planning. Machine learning can also be used to increase the efficiency of the planner. A well-known topic in Artificial Intelligence is the use of experience to increase the efficiency of inference engines, planners, generally speaking every kind of reasoning system. Often used is the framework of Case-Based Reasoning, which uses corpus of previous experience to discover "shortcuts" or memorize often used pieces of elaborated information. Another possibility is to use statistics on the sequencing of actions for making decisions informed by experience.

This work is part of the CRC Contrat de Recherche Coopérative, Cooperative Research Contract"Machine learning in man-machine interaction" between the Cordial project and France Télécom Recherche et Développement, DIH/DII. This contract is described in section .

Language learning Educational software teaching and learning languages

The aim of this study is to design and to develop educational software for helping to teach and to learn languages.

The use of Ordictéeis concerned with the primary class exercise called dictation. In this application, a speech synthesiser reads French text while the pupil writes the orthographic transcription on his keyboard. The reading speed is continuously tailored to the speed of the typing. The pupil can correct the text whenever he wants. This application is based on the design and the development of specific tools such as the alignment of the text provided by the teacher and the pupil text.

Application Domains Application Domains

The application domains for our researches are all the situations where man-machine communication requires speech or where the use of speech brings more comfort. These applications are in general complex enough to require a real dialogue situation, and would be tedious if used through a simple sequence of guided short answers.

Examples for these applications are : information services on a personal computer or on a public, booking services by telephone, computer assisted language learning.

Software Introduction

We develop our applications on the CNRT platform DORIS, to promote joints projects with industrial research. The Georalsystem is a demonstrator of touristic information services, with oral dialogue and tactile screen. We also have a "dictation" software called Ordictée, which has been experimented in primary schools.

DORIS platform Laurent Miclet correspondant Jacques Siroux Olivier Boëffard

The Cordial project aims to promote its research activities by means of technological demonstrations. To achieve this point, hardware and software ressources have been defined to build a R&D platform named DORIS and dedicated to man-machine interaction, in particular with the use vocal and dialogue technologies. The main funding comes from IRISA/INRIA, the Regional Council of Brittany and Cordial public contrat funding. DORIS, in the context of the CNRT-TIM Bretagne, has vocation to promote joint projects between institutional and industrial research.

DORIS is concerned by the different research projets like Georal(see sections and ) and Ordictée(see section ). In November 2005, an research engineer (founded by FNADT) will be full time on the DORIS platform to manage the technical aspects and to develop new softwares for the previously quoted projets.

Hardware architecture

On the powerhouse systems side, a Compaq AlphaServer system has been chosen to support the calculation power needs, especially for speech processing. In addition, the platform includes a Network Appliance file server with a storage capacity up to 350 Go.

In order to facilitate technical access for industrial partnership, the platform includes fast secure network access. DORIS inherits from the ENSSAT-Université de Rennes 1 network. We propose high internet connection with VPN access.

On the client side, PCs with an up-to-date sound configuration are used. These computers are meant for software development within DORIS. They are nowadays used by engineers, PhD and postgraduate students involved in the CORDIAL project. Touch screens have been purchased in order to facilitate the development of multimodal man-machine interfaces.

This client-server configuration is fully functional inside the ENSSAT campus. Further improvements will be focused on lightweight clients and resources sharing with external partners (see section ).

Software architecture

The DORIS platform main goal is to group research projects that deal with the man-machine interaction field. In this entity, they shall take advantage of other teams works and tools.

We first direct our efforts towards the installation of a multi-agent An agent is an independent and autonomous process that has an identity, possibly persistent, and that requires communication with other agents in order to fulfill its tasks.architecture. It satisfies the needs for modularity, quick and clean development and interoperability. To fulfill this role, we chose and installed JADE Java Agent Development Framework, a free software distributed by Telecom Italia Lab (TILAB)., a software framework fully implemented in Java language. It allows the implementation of multi-agent systems through a middle-ware that complies with the FIPA Foundation for Intelligent Physical Agents, which purpose is the promotion of emerging agent-based applications, services and equipment. This goal is pursued by making available internationally agreed specifications that maximize interoperability across agent-based applications, services and equipment.specifications. The agent platform can be distributed across machines, which do not need to share the same OS.

We made this choice to simplify the development while ensuring standard compliance. Furthermore, the Java technology allows us to use already developed libraries that are not necessarily in our sphere of competences (e.g. sound or speech coding, framing, streaming) and therefore to concentrate on the scientific interests of the team.

Today, one main project has taken place inside the DORIS platform: Georal(see sections and ).

New steps with DORIS

Thanks to the CNRT TIM Bretagne, Télisma and France Télécom R&D are actively involved within the DORISproject. Télisma proposes their software suite for speech recognition and France Télécom R&D for text-to-speech synthesis.

Several publications have reported on efforts in building such a platform and several issues need to be addressed. Among those, we focus in this work on the distributivity of the solution based on an Agent architecture, and on the use of Voice over IP solutions, and we illustrate such issues through a demonstration application built upon such an architecture. Additionally, this platform helps us to integrate different third-party solutions – speech bundles, VoIP protocols, applications, etc. – and test them in an acceptable technological environment.

A salient feature of the proposed solution is to mask the third-party API specificities behind the MRCP protocol, Media Resource Control Protocol. MRCP controls media service resources like speech synthesizers, recognizers, signal generators, signal detectors, fax servers etc. over a network. This protocol is designed to work with streaming protocols like RTSP (Real Time Streaming Protocol) which help establish control connections to external media streaming devices, and media delivery mechanisms like RTP (Real Time Protocol). RTSP protocol is a standard protocol for controlling the delivery of data with real-time properties. The main contribution of this protocol to our platform concerns the negotiation of the RTP, setup parameters (client and server port numbers, session id) and the transport of MRCP messages between client and proxy-Agents dedicated to speech ressources. We have defined half-duplex streaming. A client can initiate a session on the DORIS platform from one source, for example a PDA, and get a speech feedback from another source, for example with a cellular phone. We developed a complete stack following the MRCP specifications and other necessary protocols like RTSP and RTP. An API for MRCP clients has been developed in Java (the MRCP stack and the client API is about 12000 lines of code).

Several communications have been done during the year 2004, including presentations and demonstration with industrials, local organisations and journalists. The INRIA associate engineer managing the platform has taken part in a congress (Synerg'Etic, Nantes).

The multiagent/IP possibilities of the DORISplatform has been presented in Interspeech 2004 congress in Corea, .

Georal Jacques Siroux correspondant Marc Guyomard

Georal Tactileis a multimodal system which is able to provide information of a touristic nature to naive users. Users can ask for information about the location of places of interest (city, beach, chateau, church,...) within a region or a subregion, or distance and itinerary between two localities. Users interact with the system using three different modalities: in a visual mode by looking to the map displayed on the screen, in an oral and natural language mode (thanks to a speech recognition system) and in a gesture mode pointing to or drawing on a touch screen. The system itself uses both the oral channel (text-to-speech synthesis) and graphics such as the flashing of sites, routes and zooming in on subsections of the map, so as to best inform the user.

The Georalproject started in 1989 and is the origin of various works since then. It was fully developed in Visual Prolog 4.0. We decided to re-implement Georalmaking the most of the capabilities of the DORIS platform.

The foundation stone of this re-implementation was to split the initial Prolog modules (syntactic and semantic analysis, dialogue management and tactile screen management) taking into account the multi-agent paradigm (one module for one functionality). We assigned one agent for each specific role, agents that are written in Java. However, we kept core functions written in Prolog, in order to take advantage of the fact that this language is really convenient for tasks like natural language processing. But all peripheral functions from screen management to client-server communication have been rewritten in Java and C/C++ languages.

The call of Prolog predicates from agents written in Java was not straightforward. After a benchmarking phase, we decided to use a Java package that allows such calls ( tuProlog). This implied studying the existing Prolog files to extract useful predicates and to correct them to bring the code closer to the ISO Prolog. Furthermore, the work on the Prolog code allowed an improvement of the Georalengine capacities. A larger range of queries are now accepted by the system and some bugs have been fixed. Improvement have also been made on the gesture management side. The touch screens prove to be useful to process new kind of drawings like windings follow-up.

A text-to-speech server has been installed and a dedicated agent communicate with it. The processing time and the sound quality are very good, but we are using a local network for the moment. We have in mind to insert the Internet between the clients (possibly wireless devices) and the server. An important work on data coding and communication protocol has to be made beforehand.

Several analyses and use tests have been made on the different normalized communication protocols between agents ( FIPAnomralization). The dialogue engine has been modified to make the dialog be the most natural (taking ellipses, anaphores and interruptions into account). Student projects have been integrated to model new types of tactile acts. Agents for speech recognition and speech synthesis have been developed for communication with the server. The grammar of speech recognition of GEORALhas been written.

Finally, several improvements have been added, including the improvement of screen display, the speed up and debugging of the code, and internal facilities for software development (abstract agents, agents managing functions, simplified communication interface).

To learn more about Georalsystem, see section .

Ordictée Marc Guyomard correspondant Olivier Boëffard

As explained in section , Ordictéeis a software that allows a pupil to perform a dictation exercise on his one. It is made up of three modules: The pupil module, which, together with the pupil itself, carries out the dictation exercise, the teacher module, which allows the teacher to design his own dictation texts, and the administrator module which is devoted to set the application parameters. One of the main function of Ordictéeis to follow the typing, i.e. to adapt the reading rhythm to the typing speed. This function is based on the one hand on the hypotheses that mistakes do not affect the pronunciation, and on the other hand on the phonetic closeness of the two texts (the pupil text and the teacher one).

Semantic Parrot

>From a usability point of view, the semantic parrotthat we propose consists in taking a speech message from a standard audio input (Personal computer, PDA, Cell phone), to understand the underlying concepts and finally to generate a paraphrase using a speech ouput.

>From a technological point of view, the semantic parrot implements techniques of speech recognition, automatic speech understanding, and finally of concept to speech synthesis.

Currently, a first demonstrator is built on the DORIS platform (see section ) implementing a technology of speech recognition provided by TELISMA and a technology of speech synthesis provided by FTR&D.

New Results Dialogue and modeling Logical modeling for dialogue processing Jean-Christophe Pettier

Dialogue systems have to model a world description to answer user requests. In order to avoid the system to start an infinite loop on a query it can not answer, a finite first-order dialogue logic has been devised which permits to envision the computation of world model backbones (assignments that pertain to every models). To assess our approach on model inference mechanisms, the propositional case seems to be the relevant testbed. Indeed, in this context, comparison to state-of-the-art solvers for which performance is an highly competitive issue should provide valuable information before lifting our strategy to the first-order case.

Roughly speaking, this strategy postpones space search while the input formula can be contracted on inner conflict detections. The widely spread Conjonctive Normal Form (CNF) format is then inappropriate as it looses the instance inner structure. Another drawback of this format is that it makes implicitly solvers sensitive to encoding strategies for natural structured problems. The relevant structured format for which wide benchmarks are available is ISCAS.

For this format, a SAT solver development has been undertaken for a year and a half. In the meantime, there has been a proposal for a general structured format for which our resulting solver could be easily adapted if large benchmarks become available.

Dialogue systems evaluation Jacques Siroux

There is no new results in this theme as no manpower could have been devoted to this task since the end of the contract of our engineer.

Modeling negotiation sub-dialogues Sylvie Saget Marc Guyomard

This study is covered within the framework of a PhD thesis funded by the 'Région Bretagne'. Work began on October 15, 2003.

Modeling reference negotiation sub-dialogues is a way of handling communicative errors, by giving a dialog system, and his users, the capacity to interactively refine their understanding until a point of intelligibility is reached.

The approach chosen within the framework of this thesis is based on the explicit modeling of the collaborative aspects of dialogue in order to obtain an explicative as well as generic model. The paradigm chosen for designing our dialogue system are rational models. Indeed, such a system is seen as an intelligent agent able to take part in a dialogue. Thus, collaborative activities (and subjacent phenomena such as the necessity of a common ground, grounding, negotiation, etc.) can be seen as concerning a general social intelligence. Next, we want to instantiate these general axioms to the particular case of dialogue, and more precisely to reference resolution.

During 2005, an interdisciplinary study on the concepts of collaboration (vs cooperation), common ground, grounding, and negotiation brought us to the definition of the principles which will have to respect our model and the interrelationship between the various under-phenomena.

Hability modeling Karl DeVooght Marc Guyomard

Recently, the capability has become an emerging notion in the agent theories. On the one hand, it allows modularity and reuse of agent behaviour for the development of applications based on such system. On the other hand, natural dialog system based on a logical model can in this case take into account the agent activities as well as the control over them.

We focus our work on a particular dialog system called ARTIMIS which is developed by France Telecom R&D. This architecture results from solid theory on a BDI Agent model.

Our concern consists of proposing a model of capability and its linked notions through this architecture. According to our study, we note that few approaches include them and are mature enough for incorporating into such system. For instance, one reason is that there is no relation between the concepts of capability and intention.

We should have proposed in a short period a formal characterization as well as the principles which allows to set them through the system ARTIMIS.

System and multimodality

A study about referring phenomena in an enlarged version of Georalhad been led. We also continued activities to improve the ORDICTEE software (dealing with faults coming from phonetic, following typing).

Georal Tactileand reference Jacques Siroux Ali Choumane

Recent progresses in speech recognition allow to plan new important developments inside the dialogue system Georal Tactile . Increasing the vocabulary size gives the users the possibility to utter more complex linguistic sentences. We use this fact to enrich the application world with new elements on the map which is the support for querying. In this new framework, several issues are studied: modeling the cartographic context, linguistic and gestural of users referencing elements on the map, and at last the architecture of the system.

In a first time we have made an experiment in order to determine the linguistic behaviour of the users when they reference elements on the map. A large number of linguistic forms and of tactile built up elements (for example referencing a triangle using particular points) have been observed. A new type of gesture (following a line) has also been observed .

We have proposed a syntactic model in order to parse and filter referential expressions in the user utterances. This model is based on Vandeloise and Borillo's works , which take into consideration the spatial characteristics of the handled elements. Next we have developed a semantic model which allows to filter more precisely the output of the syntactic parser. The model is derived from the Aurnague's one which uses specific attributes of the elements (for example size, consistency, position, ...). We only use three attributes (dimension, consistency and form) but we combine them in order to take into account the possible syntactic forms.

As far as the cartography is concerned, we developed a new data model and search algorithms that are better adapted to handled elements.

Finally, we have redesigned the architecture of the system and the processing flow in order to deal with various facts: more complex gestures, references on objects which are not stored in the database and a two stages processing. By contrast with the current version, we have given priority to gesture activity over speech activity; this principle allows to progressively check and possibly correct the referential linguistic expressions, to determine referents on the map and to build up, if necessary, new elements in the database. Some of these algorithms have been implemented and we are integrating them in the system.

We began studies, firstly in order to model in uniform way the different semantic points of view (natural language, graphics) from the Pineda and Garza's work [Pin00], secondly to bring together the processing on references in Georaland the plan-based modeling of dialogue. We began to studying the use of the concept of salience taking into account the results from LORIA project-team Langue et Dialogue. We especially studied the processing of some tactile designations: those that appear when user touches the screen following the cartographic representation of roads, rivers, ... Some referring ambiguities may arise if two cartographic elements are very close or if the user's performance is fuzzy. We propose to solve these ambiguities using a salience score to choose the best candidate. Some preliminary results are encouraging but we have to experiment the algorithm with naive users in real conditions and with more complex geographic maps and elements.

We have started another study in order to design the best way for representing linguistic knowledge (from lexical level to contextual level). The best way means that the design and implementation would be on the one hand, less expensive as possible, and on the other hand, reusable and easily integrable within the system.

We complemented the above studies on referential problems by studying two complementary ways. The first one is concerned with works on written natural language understanding for applications as data mining, question answering, message understanding, etc. Some of these works , are interesting for our purpose because they are using poor knowledge and light parsing in order to solve anaphora. But, they need using corpora in order to tune up the values of the different parameters used. The second one is concerned with text generation studies [Man03]. In this thesis work, the author shows that it is necessary to use linguistic knowledge in order to generate referential relationships and that this knowledge could be deduced from experiments and corpora. It couild be interesting to merge this knowledge with the Vandeloise's results.

As a first results of these activities, we are starting a research project, agreed and partially funded by the regional council of Britany. The project REPAIMTA(Référence parole image tactile) aims at produce generic tools for dealing with resolution of references in a multimodal framework.

Ordictée Marc Guyomard Olivier Boëffard

A new algorithm for the identification of the pupil spelling mistakes is implemented. It is expected to overcome some of the major drawbacks of the usual alignment algorithms. As far as the following of the typing is concerned, new features are under investigation. They aim at a better synchronisation between the pupil text and the teacher module utterances.

Machine learning in dialogue systems Grammatical inference Erwan Livolant Laurent Miclet

A thesis has started in 2004 with the following topics : the adaptation of the actions of an agent in a communication situation. The main issue is to give the agent a capacity of analysis on the ongoing dialogue, in order to adapt dynamically its strategy if necessary.

As a first phase, a study is conducted to test the efficiency of the learning of subsequential transducers realizing the transformation of the outputs of a speech recognizer into a sequence of dialogue acts.

The work undertaken so far on this subject lead to the human labeling of dialogue corpora containing approximately 3000 sentences. A technological review on the inference of transducers has been carried during this year.

Learning by analogy on sequences Sabri Bayoudh Arnaud Delhay Laurent Miclet

The thesis of L. Blin has studied how to learn the prosody of a sentence by using a distance between trees (the sentences are represented by trees) and the nearest neighbour technique. It has been concluded at the end of year 2002, and given last results in 2003 .

We examine how the nearest neighbor method could be extended to learning by analogy. Our first aim is to define what is analogy on sequences, then to define learning by analogy on sequences. Future work will extend the study on other structures.

An analogy is described as follows : find xfrom a triple a, band csuch that xis to cas bis to aand is often written by

a b c x

Our approach uses the edit distance between sequences to define the relation "is to". Therefore it enables substitutions in sequences. We have worked on defining what is solving an analogical equation on sequences when the edit distance is introduced, to generalize the works by Lepage and Yvon (for whom the edit distance is a trivial case).

We have proposed an other way to consider solving analogies on letters by defining a letter with a set of features . Solving analogies on these sets is straightforward and the basic technique has been explained by Lepage in . Considering this approach, we can now consider an alphabet either as a cyclic group with a constrained distance or as a set of elements that are defined by binary features. It is possible to define a distance between these sets of features; one of the simplest is the Hamming distance.

Our recent studies are focused on learning by analogy on sequences and a PhD student, Sabri Bayoudh, has begun his thesis on the 1st of October 2004. In recent papers , , we have defined a notion of analogical dissimilarity (AD) between four objects in a metric space, with a special focus on dissimilarity between strings. Firstly, we have studied the case where four objects have a null analogical dissimilarity, i.e. are in an analogical relation. Secondly, when one of these objects is unknown, we have given algorithms to compute it. In particular, we have studied a new formulation of solving analogical equations on strings, based on the edit distance between strings. Thirdly, we have tackled the problem of defining analogical dissimilarity, which is a measure of how close four objects are from being in analogical relation.

What is interesting in the dissimilarity measure that we proposed, is that it respects the properties of a distance, especially the triangle inequality. This has been particularly used in adapting rapid algorithms for finding nearest neighbors, like AESA. The new algorithm that we have proposed considerably decreases the time complexity of searching the triple of objects in a learning sample which has the least analogical dissimilarity with a given object. The brute force algorithm should have given a complexity in O( n³)if nin the size of the searching space.

Current studies are focused on classification by analogy and comparing our approach with existing classification methods on reference sets (like those in UCI Machine Learning Repository).

Solving analogical equations on sequences Arnaud Cadel Master Student Arnaud Delhay Laurent Miclet

This work is focused on solving analogical equations on sequences and is complementary to the search techniques for learning by analogy presented in section .

To solve analogical equations like A: B:: C:: Xon sequences, we use the notion of edit distance. We have proposed an algorithm, called SEQUANA1, that proceeds in aligning Aand B, then Aand Cand combines the two alignments to produce an alignment of three sequences. A second method can be used: the direct alignment of the three sequences. This algorithm has been called SEQUANA2 and it has been proved that it produces the same results than SEQUANA1. The worst case complexity of both algorithms seems to indicate that SEQUANA1 is more efficient, what can be confirmed by the experiments we have done. These results are presented in .

The other point of this work was to implement a prototype solver for analogical equations in sequences. We now have a solver that works on sequences of letters taken in a cyclic group, in an alphabet defined by binary features, and the resolution can be based on Analogical Dissimilarity (AD) presented in section .

Learning speech units for speech synthesis Olivier Boëffard

This study was covered within the framework of a PhD thesis funded by a research contract with FTR&D Lannion (FTR&D/DIH/ISP). Work began on October 1, 1999 and was completed by the defence of Helene François' PhD thesis in December 2002. Since December 2002, we are involved in national research project named Néologos, 2003 Technolanguecall for proposal . Our essential participation consists in applying the methodologies developed during the Hélène François' thesis work to define speech corpora usable at the same time for speech recognition and speech synthesis. Currently, the content of the corpora is defined, their collections has taken place during 2004.

The NEOLOGOS project is a speech databases creation project for the French language subsidized by the French ministry for research in the framework of the Technolangues program. Academic laboratories (LORIA and IRISA) and industrial companies (France Telecom, ELDA and TELISMA, coordinator of the project) are collaborating in the field of speech recognition for the creation of two new kinds of speech databases : a SpeechDat-like speech database for children's voices (PAIDIALOGOS sub-project); a speech database with a novel kind of structure for adult voices (IDIOLOGOS sub-project). In both subprojects the goal is to bring to the research community new sources of telephone speech data likely to improve ASR performance : on one hand, to significantly improve speech recognition for children (with PAIDIALOGOS), on the other hand to provide speech data to support the development of advanced ASR techniques such as eigenvoices (with IDIOLOGOS). IDIOLOGOS should also provide the means of advanced studies on speakers characteristics, with a significant panel of reference speakers, including in the area of speech synthesis and speaker identification.

During 2003 and 2004, Neologos focused on the design of the linguistic databases : a bootstrap database recorded by 1,000 speakers; each speaker utterances the same set of 50 phonetically well-covered sentences and a full database recorded by 200 reference speakers chosen among the first 1,000. These databases are designed to maximize the phonological diversity of the speech material. But, they are technically built to allow text-to-speech synthesis. With the bootstrap database, one can find 1,000 small inventories, in order to synthesize 8 of the 50 sentences of each small database. We think that these 1,000 small unit dictionaries can help in finding a relation between the quality of the synthetic speech and some speaker voice quality. The 200 full databases achieve the same goal, but now we can build, thanks to the 500 sentences for each speaker, a full diphone TTS system. Both corpora of phonetically rich sentences were constructed by processing and simplifying sentences from large publicly available newspaper corpora in French. Automatic corpora reduction methods such as the greedy algorithm reported in were used to extract a subset of sentences meeting a criterion of minimal representation of all phonemes as well as a criterion of minimum representation of diphone classes. There were 99 diphone classes constructed from 10 broad phonetic classes including the silence.

During year 2005, work was focused on the search for a reference speaker set . The bootstrap database was recorded by 1,000 speakers (each one pronouncing 50 sentences). From these 1,000 speakers various methodologies were applied to find the 200 speakers most representative speakers. Reference speakers come out of a selection process which guarantees that their recorded voices are non-redundant but keep a balanced coverage of the voice space given by the bootstrap database. Finding reference speakers has been interpreted as a clustering task, which consists in partitioning the voice space in homogeneous subspaces that can be abstracted by a single reference speaker. We formulate this problem in a general framework which remains compatible with a variety of speech/speaker modeling methods, across which some lists of reference speakers can be compared and jointly optimized .

Automatic speech labeling Samir Nefti Josselin Huaulmé Olivier Boëffard

This study is covered within the framework of a PhD thesis funded by a research contract with FTR&D Lannion (FTR&D/DIH/ISP). Work began on January 1, 2000.

This work relates to the automatic segmentation of speech corpora, read or spontaneous, into phone units. Text-to-Speech synthesis systems based on concatenated acoustic units need this process. The general framework of this study is quoted in section .

Considering the wide scope of this topic, we have addressed only the detection problem. We turn out attention towards methods of scoring the confidence of this acoustic to phonologic mapping. We have conducted experimental studies to validate our choices. Compared to state-of-the-art scientific background, the original confidence measure we proposed within the HMM framework yields experimentally the best scores evaluated through DET curves – 12% Equal Error Rate, EER, for a randomly blurred test database. The next step concerns the treatment of rejected models given the speech. Within an area delimited by the confidence measure we propose to substitute the wrong sequence of phones by a local language model built on sequences of phonemes. Experiments were conducted and we concluded that this strategy can improve the performance of the speech alignment process , .

Learning prosody from speech Nelly Barbot Damien Lolive Master Student Salma Mouline Olivier Boëffard

This study is covered within the framework of two PhD thesis. One started on January 1, 2003 is funded by FTR&D, ( S. Moulines, another started on October 1, 2005 is funded by the French Ministry of National Education and Research.

The approach suggested within the framework of this thesis is based on the modeling of the prosody by a set of forms representative of the various realizations of the melody present in a reference speech corpus . Once this representation defined, we plan to segment automatically the database using these elementary forms. Each one of these segments is annotated using syntactic, phonetic and phonological labels obtained during the linguistic analysis of the corpus. Next, we want to answer the question of mapping the tagged prosodic elementary forms and the associated linguistic characteristics. Taking into account the correspondence between linguistic and prosodic parameters should make it possible to restore the style of elocution actually recorded by a speaker. Moreover, at the synthesis stage, during the selection process of the acoustic units, the prosodic targets resulting from the proposed system should better correspond to the true prosodic parameters of the recorded speaker. During 2003, we studied the MoMel parametric model and propose, within an automatic learning framework, a solution to adapt the parameters of this model to new speech corpora (different voices), .

In relation to this scientific topic , we have welcomed a MSc student during the last academic year. We focus on modelling the F₀evolution and propose a parametric representation based on B-spline model. This model has smoothing properties and local irregularities which capture the global shape of the F₀curve and the breaks of curvature and discontinuities. Moreover, few parameters are needed to characterize a B-spline curve, that are the degree of the B-spline, the number of knots, the location of knots and control points. A B-spline curve of degree mis the sum of the control points weighted by B-spline functions of degree m. Between two successive knots, these B-spline functions are non-negative polynomial functions of degree mand its degrees of continuity at a knot depend of the knot multiplicity. For a given degree (generally m= 3) and sequence of knots, the control points are estimated using the least-squares error criterion. For what concerns the knot placement and multiplicity, we propose a global optimization algorithm using a simulated annealing procedure .

The main difficulty is to discover an optimal number of knots. Experiments show that, for too few knots, the error is too high, and for too many, the model complexity is overestimated. A first means is an experimental stopping when this error is less than a given threshold, and we plan to consider this issue in a better theoretic way applying a true bayesian framework or a simpler solution like MLD (Minimum Description Length) principle.

Learning semantics from speech Pierre Alain Nelly Barbot Olivier Boëffard

This study is covered within the framework of a PhD thesis funded by INRIA within a scientific collaboration with FTR&D Lannion (FTR&D/DIH/D2I) . Work began on October 1, 2003.

We adopt a methodology based on the observation of corpus of examples within a statistical theoretical framework. However, it is difficult to find corpora annotated by semantic elements (particularly in French). We propose to handle three kind of semantic sources :

Eurowordnet is an ontology which comprises a hierarchical network of concept nodes, populated with words. The nodes in WordNet networks are termed synsets, as they contain sets of synonymous words representing a common underlying concept. Synsets offer a means of semantic generalization, both over the component words within a given synset and between synsets via hierarchical relations such as hyponymy and hypernymy.

Dictionnary of the Language. Within this traditional dictionary, each word is explained by one definition or more if the word is polysemic. We propose to exploit the sentence given as an example to illustrate the correct use of a concept.

The web. Recently, several publications refer to a methodological framework which consists in web searching to find semantic correlates and use it as learning sentences.

During 2004, we focused on the representation of the linguistic data. We adopt an XML frawework due to the great activity of this technological field. Our next step will focus on the use of eurowordnet to tag semantically text corpora with hierarchical concepts. The corollary of this work is to propose a solution to automatically enrich eurowornet with new related synsets.

During 2005, we have focused on statistical language models . Indeed, Language modelling is a crucial problem and is widely tackled in automatic processing of both written and spoken language processing.

We present a framework to evaluate heterogeneous language models. Perplexity is a measure traditionally used to evaluate the performance of a statistical model of the language. It reflects the crossed-entropy between a test set and the language model. Indeed, the calculation of the standard perplexity is not always possible. In the worst situations, perplexity can lead to erroneous decisions. The rank-based evaluation that we propose gives a new point of view for comparing language models. We made experiments in English with language models based on a vocabulary of 30 000 words. We compare traditional n-grams, with variable n-grams, and multigrams. Considering the same number of parameters, we observe that the traditional n-grams give the best results. Originally proposed as a substitute for n-grams, we show that variable n-grams outperform standard n-grams considering a perplexity measure (10 % better than n-gram scores) but are worse if we use a rank-based statistics within the Shannon's framework .

We present also a necessary condition so that a statistical Multigram Language Model outperforms standard n-grams. Multigrams are models which extend the properties of the n-gram predicting word groups, namely phrases, according to a variable length word history. However, multigram models suffer from a spatial complexity issue. Taking into account the combinatorial nature of the language and its specific distribution (power laws), it is not effective to keep the sequences of words according to a threshold cut. By means of perplexity, we propose to relativize the interest of a multigram parameter compared to its n-gram counterpart. A solution is presented by making the assumption of a 3-gram reference model. There is however no difficulty in generalizing to higher models. In our experiments, we compare a 2-multigram model (bigram of phrases) to a traditional 3-gram. We show that the 2-multigram is more interesting than the related 3-gram in the situation where the 2-gram of words crossing the history and the head of the 2-multigram has a weaker perplexity than that of the 3-gram. This selecting rule is thus applicable to a set of phrases in order to keep only units improving the perplexity of a standard n-gram model. Moreover, this article presents a linguistic analysis of this selection rule. In particular, we make a distinction between functional and lexical words .

Learning to improve the dialogue management

some collaborative work has started in 2004 in the framework of the CRC with FTR&D. No manpower is explicitly devoted to this task, since no PhD student within the CRC has been oriented towards this activity.

Contracts and Grants with Industry Néologos

The main topic of this project relates to the creation of new telephone vocal data bases for the French language.

The project has two main objectives : a multi-speaker speech database with children voices (1000 speakers) and a multi-speaker speech database with adults voices.

Cordial is mainly concerned with the second task. We aim to define, for French, a speech database of reference speakers, i.e. a speech corpus where each speaker will have pronounced sufficient statements so that one can exploit them to characterize his voice. To achieve this goal, we need more than only 50 statements to record for each speaker. We plan to record a database where 200 reference speakers have recorded over the fixed phone network 500 well defined statements to cover the main coarticulation features of the language.

In addition to speech recognition systems, such a corpus is also useful for the research and the development of the techniques of speaker identification and authentification, voice transformation, voice characteristics for Test-To-Speech systems.

The partners of the project are of three types :

Academic laboratories undertaking an active research on vocal technologies (IRISA, LORIA, and FTR&D), whose main contribution will be done on the supply of research tools and on the realization of validation tests.

Industrial partners (TELISMA and DIALOCA) marketing products of speech recognition, whose contribution will be done by the organization itself of the collection and the realization of "industrial" tests more intended to show the contribution of the corpus for the improvement of the products.

The ELDA (European Language Resources Distribution Agency) whose vocation is to distribute linguistic resources, and who leads an activity of creation of corpus.

Dialogue and Semantics

In 2003 has been finalized the CRC Contrat de Recherche Coopérative, Cooperative Research Contract"Machine learning in man-machine interaction" between the Cordial project and France Télécom Recherche et Développement, DIH/DII, Lannion.

The subject is of common interest between our two research units. The CRC federates all the manpower in both teams involved on the topic. It covers the thesis of P. Alain, described at section , another thesis at FTRD DIH/DII, started Feb 2002 and the thesis of E. Livolant started in January 2004. The total manpower in permanent researchers is of 0.125 man-year at FTRD and at Cordial (scientific management of the CRC and direction of the thesis).

Other Grants and Activities International networks and workgroups

The Cordial team is a member of the European Network of Excellence in Human Language Technologies Elsnet, and of the French-speaking network FRANCIL (Réseau FRANCophne d'Ingénierie de la Langue).

Dissemination Leadership within scientific community

Laurent Miclet has been a member of the scientific committee of the French Machine Learning Congress, Conférence d'Apprentissage CAp 2005.

Teaching at University

Olivier Boëffard teaches the course Speech Synthesisin the Master STIR, Rennes 1 (option Signal, orientation 2) and takes part in the module Data Mining ( Fouille de données) in the Master Informatique de Rennes 1.

Marc Guyomard and Jacques Siroux teach the module human-machine communicationat Enssat, Lannion (Lannion part of the Master Informatique de Rennes 1).

Laurent Miclet teaches a course in Pattern Recognition Reconnaissance des Formesin the Master STIR and a part of the module Apprentissage et Classification(AC) in the Master Informatique de Rennes 1. In the Lannion part of the Master Informatique de Rennes 1, for which he is the coordinator, he teaches a module of Machine Learning Apprentissage Artificieland takes part in the module Data Mining ( Fouille de données).

Conferences, workshops and meetings, invitations

Laurent Miclet will be

a member of the jury for the Ph.D. thesis: F. Nicart. Conception de modèles Génériques pour les machines à états finis.Thèse de l'Université de Rouen, November 2005.

a member of the jury for the Ph.D. thesis: N. Stroppa. Définitions et caractérisations de modèles à base d'analogies pour l'apprentissage automatique des langues naturelles.Thèse de l'Ecole Nationale Supérieure des Télécommunications, Paris, November 2005.

a member of the jury for the HDR: I. Tellier. Modéliser l'acquisition de la syntaxe du langage naturel via l'hypothèse de la primauté du sens.Habilitation à Diriger les Recherches de l'Université Charles de Gaulle, Lille 3, December 2005.

Laurent Miclet will be a reporter for the Ph.D. thesis: G. Valétudie. Nouvelles méthodes en data-mining et extraction de connaissances à partir de données : application au complexe Mycobacterium Tuberculosis.Thèse de l'Université des Antilles et de la Guyane, December 2005.

Graduate Student and Student intern

We have this year two Master students in a research period.

Synthèse de la parole O. Boëffard O. C. d'Alessandro C. Hermès Science

New-York

2002 Apprentissage artificiel : méthodes et algorithmes Antoine Cornuéjols A. Laurent Miclet L. Eyrolles 2002 What is the search space of the regular inference ? P. Dupont P. L. Miclet L. E. Vidal E. Grammatical Inference and Applications, Lecture notes in AI 862 Springer Verlag September 1994 Evaluation if units selection criteria in corpus-based speech synthesis H. François H. O. Boëffard O. proceedings of the Eurospeech Conference, Geneva, Switzerland 2003 1325–1328 Plans, métaplans et dialogue M. Guyomard M. P. Nerzic P. J. Siroux J. Technical report 1169 Irisa September 1998 Analogy on Sequences: a Definition and an Algorithm Laurent Miclet L. Arnaud Delhay A. Technical report 4969 INRIA October 2003 http://www.inria.fr/rrrt/rr-4969.html Confidence measure for phonetic segmmentation of continuous speech S. Nefti S. O. Boëffard O. T. Moudenc T. proceedings of the Eurospeech Conference, Geneva, Switzerland 2003 897–900 Oral and Gestural Activities of the users in the géoralSystem J. Siroux J. M. Guyomard M. F. Multon F. C. Rémondeau C. Intelligence and Multimodality in Multimedia, Research and Applications, AAAI Press John Lee (ed) 1998 A Hybrid Model for Text-to-Speech Synthesis F. Violaro F. O. Boëffard O. IEEE transactions on Speech and Audio Processing 6 5 1998 426–434 Analogie entre séquences : Définition, calcul et utilisation en apprentissage supervisé Arnaud Delhay A. Laurent Miclet L. 0992-499X RSTI - RIA in French 19/2005 2005 683–712 Évaluation des Modèles de Langage <hi rend="italic">n</hi>-gramme et <hi rend="it">n</hi>/ <hi rend="it">m</hi>-multigramme P. Alain P. O. Boëffard O. actes de la conférence TALN, Traitement Automatique de la Langue Naturelle, Dourdan, France 1 2005 353–363 A necessary condition on Multigram Language Models in order to outperform standard <hi rend="italic">n</hi>-grams P. Alain P. O. Boëffard O. proceedings of the SPECOM Conference, Patras, Greece 2005 247–250 <hi rend="it">F</hi>0stylisation with a free-knot B-spline model and simulated annealing optimization N. Barbot N. O. Boëffard O. D. Lolive D. proceedings of the Eurospeech Conference, Lisboa, Portugal 2005 325–328 Comparing Rank-based statistics and standard perplexity to evaluate statistical language models O. Boëffard O. P. Alain P. proceedings of the SPECOM Conference, Patras, Greece 2005 111–114 Neologos: an optimized database for the development of new speech processing algorithms D. Charlet D. S. Krstulovic S. F. Bimbot F. O. Boëffard O. D. Fohr D. O. Mella O. F. Korkmazsky F. D. Mostefa D. K. Choukri K. A. Vallée A. proceedings of the Eurospeech Conference, Lisboa, Portugal 2005 1549–1552 Focal Speakers: a speaker selection method able to deal with heteregeneous similarity criteria S. Krstulovic S. F. Bimbot F. D. Charlet D. O. Boëffard O. proceedings of the Eurospeech Conference, Lisboa, Portugal 2005 3057–3060 Définitions et premières expériences en apprentissage par analogie dans les séquences. Laurent Miclet L. Sabri Bayoudh S. Arnaud Delhay A. Proceedings of CAp 2005, Nice in French June 2005 31–47 Analogical Dissimilarity: definition, algorithms and first experiments in machine learning Laurent Miclet L. Arnaud Delhay A. Technical report 5694 INRIA sept 2005 http://www.inria.fr/rrrt/rr-5694.html Natural Language Understanding J. Allen J. Benjamin/Cummings Menlo Park 1987 Efficient particle filtering for jump Markov systems. Application to time-varying autoregressions C. Andrieu C. M. Davy M. A. Doucet A. IEEE transactions on Signal Processing 51 7 1770 2003 1762 Grammar fragment acquisition using syntactic and semantic clustering K. Arai K. J.H. Wright J. G. Riccardi G. A.L. Gorin A. Speech Communication 27 1 1999 43–62 A unified processing of orientation for internal and external localization Michel Aurnague M. Groupe Langue, Raisonnement, Calcul, Toulouse 1993 Quand dire c'est faire J.L. Austin J. Editions du seuil

Paris

1970 A non-CNF DIMACS style Fahiem Bacchus F. Tobby Walsh T. Technical report http://www.satcompetition.org/2005/edimacs.pdf Multilingual PSOLA Text-to-Speech system D. Bigorgne D. O. Boëffard O. B. Cherbonnel B. F. Emerard F. D. Larreur D. J. L. Le Saint-Milon J. L. I. Métayer I. C. Sorin C. S. White S. IEEE International Conference on Acoustics, Speech, and Signal Processing 2 1993 187–190 Apprentissage de structures d'arbres à partir d'exemples : application à la prosodie pour la synthèse de la parole L. Blin L. Ph. D. Thesis IRISA – Université de Rennes 1 December 2002 Génération de prosodie par apprentissage de structures arborescentes. Laurent Blin L. Actes de la Conférence d'Apprentissage, Laval, France July 2003 Le lexique de l'espace : les noms et les adjectifs de localisation interne A. Borillo A. Cahiers de grammaire 13 1988 1–22 Variable-length acoustic units inference for text-to-speech synthesis O. Boëffard O. proceedings of the Eurospeech Conference 2001 Contributions à la synthèse de la parole O. Boëffard O. Ph. D. Thesis Habilitation à diriger les recherches, IRISA – Université de Rennes 1 December 2004 Application-dependent prosodic models for text-to-speech synthesis and automatic design of learning database corpus using genetic algorithm O. Boëffard O. F. Emerard F. Eurospeech'97, Rhodes, Greece 5 September 1997 2507–2510 Modélisation d'un contexte cartographique et dialogique G. Breton G. ENSSAT Technical report DEA Informatique de Rennes 1 1998 Automatic Segmentation and Labeling of Speech based on Hidden Markov Models F. Brugnara F. D. Falavigna D. M. Omologo M. Speech Communication 12 1999 357–370 Intonation: Models and Theories E. Campione E. D. Hirst D. J. Véronis J. Stylisation and symbolic coding of F0 : comparison of five models Kluwer Academic Publishers 2000 185–208 Strategies for Oral Dialogue Control A. Cozannet A. J. Siroux J. Proceedings of International Conference on Spoken Language Processing (ICSLP) 94, Yokohama, Japon 2 1994 963–966 Spoken Dialogue with Computers Renato de Mori R. ISBN 0122090551 Academic Press 1998 Context-dependent probability adaptation in speech understanding E.W. Drenth E. B. Ruber B. Computer Speech and Language 11 3 1997 225–252 L'inférence grammaticale régulière : fondements théoriques et principaux algorithmes P. Dupont P. Laurent Miclet L. Technical report 3449 INRIA July 1998 http://www.inria.fr/rrrt/rr-3449.html Synthèse de la parole par concaténation d'unités acoustique : construction et exploitation d'une base de parole continue H. François H. Ph. D. Thesis IRISA – Université de Rennes 1 December 2002 Machine Learning Journal - Special Issue on grammatical inference 1-2 V. Honavar V. C. de la Higuera C. 44 2001 DORIS, a multiagent/IP platform for multimodal dialogue applications J. L'Hour J. O. Boëffard O. J. Siroux J. L. Miclet L. F. Charpentier F. T. Moudenc T. proceedings of the International Conference on Spoken Language Processing (ICSLP'04) 2004 Saussurian analogy: a theoretical account and its application Yves Lepage Y. Shin-Ichi Ando S.-I. Proceedings of COLING-96, København August 1996 717–722 http://www.slt.atr.co.jp/~lepage/ps/coling96.ps.gz De l'analogie rendant compte de la commutation en linguistique Yves Lepage Y. Habilitation à diriger les recherches Université Joseph Fourier

Grenoble

2003 Plan Recognition and Discourse Analysis : An Integrated Approach for Understanding Dialogues D. J. Litman D. J. Ph. D. Thesis University of Rochester, TR 170 1985 Expectation maximization algorithms for MAP estimation of jump Markov linear systems A. Logothetis A. V. Krishnamurthy V. IEEE transactions on Signal Processing 47 8 1999 2139–2156 Communicative Acts for Explanation Generation M. Maybury M. International Journal of Man-machine studies 37(2) 1990 135–172 Spoken Dialogue Technology : Enabling the Conversational User Interface M.F. McTear M. ACM Computing surveys 34 1 2002 90–169 Synthesizing elaborate intonation contours in text-to-speech for french P. Mertens P. Proccedings of the Speech Prosody Conference 2002 Relation d'analogie et distance sur un alphabet défini par des traits Laurent Miclet L. Arnaud Delhay A. Technical report 5244 INRIA July 2004 http://www.inria.fr/rrrt/rr-5244.html Stochactically-based semantic analysis W. Minker W. A. Waibel A. J.G. Mariani J. Kluwer Academic Publishers 1999 Anaphora resolution Ruslan Mitkov R. Lonman 2002 Automatic Adaptation of the Momel <hi rend="it">F</hi>0Stylisation Algorithm to New Corpora S. Mouline S. O. Boëffard O. P.C. Bagshaw P. proceedings of the International Conference on Spoken Language Processing (ICSLP'04) 2004 Acoustical and topological experiments for an HMM-based speech segmentation system S. Nefti S. O. Boëffard O. proceedings of the Eurospeech Conference, Aalbord, Denmark 2001 Segmentation automatique de parole en phones S. Nefti S. Ph. D. Thesis IRISA – Université de Rennes 1 December 2004 Reprise des échecs et erreurs dans le dialogue homme-machine P. Nerzic P. M. Guyomard M. J. Siroux J. Cahiers de linguistique sociale 21 1992 35–46 Using unlabeled data to improve text classification K.P. Nigam K. Ph.D. thesis Carnegie Mellon University

Pittsburgh, PA 15213

2001 Interacting with computers by voice: automatic speech recognition and synthesis D. O'Shaughnessy D. Proceedings of the IEEE 91 9 2003 1272–1305 Learning Subsequential Transducers for Pattern Recognition and Interpretation Tasks José Oncina J. Pedro García P. Enrique Vidal E. IEEE Transactions on Pattern Analysis and Machine Intelligence 15 1993 448-458 A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition L.R. Rabiner L. proceedings of the IEEE 77 2 1989 257–286 Representing Sentence Structure in Hidden Markov Models for Information Extraction S. Ray S. M. Craven M. proceedings of the IJCAI conference 2 2001 1273–1279 An Empirically Based System for Processing Definite Descriptions Massimo Poesio Renata Vieira M. P. Computational Linguistics 26 4 2000 539-545 Combinatorial issues in TTS synthesis J. P. H. Van Santen J. P. H. V. proceedings of the Eurospeech Conference 1997 Sens et expression J.R. Searle J. Les éditions de minuit 1982 Multimodal References in Georal Tactile J. Siroux J. et al. Proceedings of the workshop Referring Phenomena in a multimedia Context and their Computational Treatment, SIGMEDIA and ACL/EACL, Madrid July 1997 39–44 Définitions et caractérisations de modèles à base d'analogies pour l'apprentissage automatique des langues naturelles Nicolas Stroppa N. Ph. D. Thesis Ecole Nationale Supérieure des Télécommunications November 2005 SUNDIAL, Prototype performance evaluation report Sundial Deliverable D3WP8 projet Sundial P2218 September 1993 The architecture of the Festival speech synthesis system P. Taylor P. A. Black A. Proceedings of the 3rd ESCA Workshop on Speech Synthesis 1998 L'espace en français Claude Vandeloise C. Éditions du seuil, Paris 1986 Learning Finaite-State Models for Machine Translation E. Vidal E. F. Casuberta F. Proceedings of the 7th International Colloquium on Grammatical Inference 2004 Sémantique des relations spatiales et inférences spatio-temporelles : une contribution à l'étude des structures formelles de l'espace en langage naturel L. Vieu L. Ph. D. Thesis Université Paul Sabatier, Toulouse 1991 Analogy-based NLP : Implementation Issues. F. Yvon F. Technical report Ecole Nationale Supérieure des Télécommunications 2003 A new version of the Nearest-Neighbour Approximating and Eliminating Search Algorithm (AESA) with linear preprocessing time and memory requirements María Luisa Micó M. L. José Oncina J. Enrique Vidal E. Pattern Recognition Letters 15 1 1992 9-17