The Cordial project explores different aspects of multimodal man-machine interfaces with speech components. Its objectives are both theoretical and practical: on the one hand, no natural dialogue system can be designed without an understanding and a theory of dialogic activity; on the other hand, developing and testing real systems allows the models to be evaluated and corpora to be built.
The design of a man-machine interface has to take into account
the communication habits of the users, which have been developed
within interpersonal communication. This is particularly true for
interfaces using speech, which is a particularly efficient and
spontaneous medium. Users find it very difficult to communicate
through an oral dialogue with a machine whose speech interface is
of mediocre quality. Dialogue phenomena are
complex.
Dialogue modeling
When multimodal dialogue is involved, the interference between speech phenomena and tactile actions or mouse clicks brings up problems of interpreting the coordination of the different actions of the user.
When a user performs a communication action towards the dialogue system, he certainly has an intention; but often, this intention is not explicitly present in the communication. A major problem for the system is to extract it, in order to be able to give a satisfactory answer. This requires a theory dealing with the notions of intention, background knowledge, communication between agents, etc. We model dialogue phenomena using the concepts of speech acts and dialogue acts, and we consider that a sequence of exchanges can be analyzed as the result of planning. This model gives a satisfactory account of many phenomena in real dialogues, such as the coordination between different negotiation phases or the management of the user's knowledge base.
However, several points are not straightforwardly modeled in such a theory: parts of the dialogue do not carry any obvious intention, errors in understanding may mislead the planner, etc. Moreover, extracting the dialogue acts from the speech of the user is a complex problem, as is rendering the dialogue acts of the system into synthetic speech.
Machine learning
In addition to modeling the core dialogue phenomena, the
Cordial project also has a particular interest in machine learning
from corpora at the different stages of a dialogue system. This covers
the latter problem: the extraction of semantics from the outputs
of a speech recognizer. It also tackles the problems of
constructing the prosody of the machine's synthetic speech and of
helping the dialogue engine to compute an answer.
Speech synthesis
The front-end part of an oral dialogue system
consists of a text generator producing the sequence of words
corresponding to the message to be emitted.
Our activities are distributed into four complementary domains.
The first one is concerned with both coding and structure
of interaction. It also deals with the applications. The second
one deals with multimodality and system prototyping
(architecture and evaluation). The third one is concerned
with machine learning techniques and their application to
dialogue phenomena and speech technologies. The last area deals
with speech synthesis adapted to dialogue.
In the Speech Act Theory initiated by Austin, a dialogue can be analyzed as a plan: a
sequence of actions which aims at the realization of an intentional goal.
Recognizing a plan from a sequence of observed actions consists of identifying the underlying relations between these actions in order to infer the goals and the possible continuations of the current plan.
Our project uses a family of dialogue models based on plans of speech acts. This modeling takes into account the general framework of communication and makes computer implementation easier. But it leaves open several problems, such as extracting speech acts from utterances, integrating different information sources, and handling miscommunication between participants.
Man-machine interaction can be seen as a sequence of particular
actions: speech acts and, more generally, dialogue acts, which carry both the function of
the act in the dialogue (for example requesting or querying)
and a propositional content (for example the theme of the query).
These acts can also be characterized by their conditions of use,
which involve the mental states of the participants
(intention, knowledge, belief). The most accurate computerized
model is the planning operator:
Request(Speaker, Hearer, Action(A))
  Preconditions: Want(Speaker, Request(Speaker, Hearer, Action(A))),
                 Want(Speaker, Action(A))
  Effects: MutualBelief(Hearer, Speaker, Want(Speaker, Action(A))),
           Want(Hearer, Action(A))
This can be interpreted as follows: when an agent wants its
listener to perform an action, it can utter a Request. A first
process, plan recognition, aims at building up a consensus between
participants: it tries to rebuild part of the other participant's
plan, and if this part is correctly identified, it gives an account of the
explicit motivations and beliefs of the other participant. A
second process aims at computing a relevant response by means of a
planning mechanism which is able, because of the nature of the
modeling itself, to take into account the known information and
the possible misunderstandings. This type of modeling eases
the implementation in some simple situations but does not deal
with some important problems in various fields.
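The planning-operator view above can be sketched in code. This is an illustrative sketch only, not the project's implementation; the predicate encoding and the function names are assumptions made for the example.

```python
# Illustrative sketch of the Request planning operator described above.
# Mental-state literals are encoded as hashable tuples; a state is a set
# of such literals. All names here are assumptions for illustration.

def fact(pred, *args):
    """Represent a mental-state literal as a hashable tuple."""
    return (pred,) + args

def request_operator(speaker, hearer, action):
    """Planning operator for Request(Speaker, Hearer, Action)."""
    return {
        "name": ("Request", speaker, hearer, action),
        "preconditions": {
            fact("Want", speaker, ("Request", speaker, hearer, action)),
            fact("Want", speaker, action),
        },
        "effects": {
            fact("MutualBelief", hearer, speaker,
                 fact("Want", speaker, action)),
            fact("Want", hearer, action),
        },
    }

def applicable(operator, state):
    """An operator applies when all its preconditions hold in the state."""
    return operator["preconditions"] <= state

def apply_op(operator, state):
    """Applying the operator adds its effects to the state."""
    return state | operator["effects"]

# Example: the system wants the user to confirm a booking.
op = request_operator("system", "user", "confirm_booking")
state = set(op["preconditions"])
assert applicable(op, state)
state = apply_op(op, state)
assert fact("Want", "user", "confirm_booking") in state
```

Plan recognition can then be phrased over such operators: given observed acts, search for a chain of operators whose preconditions and effects link them together.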
The first problem is to translate the sentence uttered by the user into a dialogue act. This process is not a simple transcoding problem. It is necessary to take into account both a large collection of knowledge (mental states, presuppositions, prosody, ...) and the cues present in the sentence (syntactic structure, lexical items, ...). In addition, the surface form of spoken sentences contains many irregularities (performance problems), which complicate the speech recognition task as well as the understanding and interpretation tasks.
The second problem lies in the use of the planning
formalism itself: a priori
exclusive models and dialogue contexts have to be integrated in order to increase the
number of dialogue problems dealt with. The problem is thus partly
moved from dialogue modeling towards integration modeling.
The third problem arises frequently in interaction: it
concerns miscommunication. Each of the two participants (i.e. human and system) can indeed have erroneous knowledge
about the application, about the other's abilities, and about current
dialogue elements such as the references used to point out objects during
the interaction. An error concerning this information may,
sooner or later, lead to a failure, i.e. to the system being unable
to satisfy the user. Detecting and dealing with
these errors basically requires a characterization process and
plan-based modeling.
In an interactive system, the application has to behave as an active component. In current systems, application modeling exhibits two main kinds of defects. The task model may be too rigid (for example, plans in information-delivery systems), constraining the user's initiative too heavily. The task model may also be based on constraints (as in CAD applications), allowing a freer user activity but lacking the co-operation needed to help the user reach their goal. We believe that the task model has to include the following elements: the data and their ontology, knowledge about the use of the data (operating modes), and the interface with the rest of the system. Lastly, the modeling has to be designed so as to make changing the task easier.
We are studying an additional modality, a tactile screen, in order to get around some of the problems arising from the use of speech. The problems raised by this new modality concern the integration of messages coming from the different channels, the processing of references, and the evaluation of systems.
The use of speech technologies in interactive systems raises problems and difficulties spanning from the design of complete software (including the study of the task) to the architecture design, and including particularly good quality speech synthesis and the introduction of a new modality.
Human communication is seldom monomodal: gesture and speech are often used jointly for functional reasons (designating elements, communication reliability). In a speech environment, introducing an additional modality (in our case, gesture by means of a tactile screen) makes it possible to overcome some speech recognition errors.
But it also raises new difficulties. The first one
concerns the way information coming from the
various communication channels is integrated: at which level
(syntactic, semantic or pragmatic) should the integration be done? What kind
of modeling should be used? Few satisfactory
answers can be found in the literature. We chose to lean on Maybury's works
The second difficulty is the processing of references, particularly in the framework of the chosen application (querying a geographical and tourist database). Pointing out the objects of interest during the dialogue is done both by means of the spoken sentence and by gesture (pointing, drawing a zone), and takes the application context into account (the user can follow the outline of a cartographic object with her finger).
Studies in this domain come both from linguistics and from
artificial intelligence. Some linguists
The ambition to put dialogue systems on the market requires complying
with requirements on the quality of interaction. It is necessary
to be able to evaluate and compare different systems from
different points of view (speech recognition rate, dialogue
efficiency, language and dialogue abilities, ...) in the framework
of equivalent applications, and possibly, for the same system, to
evaluate different approaches. Various metrics have already been
proposed
This research theme focuses on the elaboration of machine learning methodologies at all stages of a dialogue system.
Machine Learning can be seen as the branch of Artificial
Intelligence concerned with the development of programs able to
increase their performance with experience. Its central mechanism
is induction, or generalization: extracting a concept or a process
from examples of its output. From an engineering point of view, a
Machine Learning algorithm is often the search for the best
element
Machine Learning is a very active field, gathering a variety of
different techniques. Roughly speaking, two families of techniques
can be distinguished. On the one hand, some Machine Learning
algorithms use learning sets of symbolic data and discover a
concept
The Cordial project is concerned with the introduction of Machine Learning techniques at every stage of a dialogue process. This implies that we want to learn concepts which basically produce time-ordered sequences. That is why we are interested in learning from sequences, either in a symbolic framework or in a statistical one.
At the front end of an oral dialogue system, the incoming
speech is processed by a recognition device, generally producing a
lattice of word hypotheses, i.e. the lexical possibilities
between two instants in the sentence. A syntactic model then has to be
used to help produce the sequence of words with the best joint
lexical and syntactic likelihood.
The syntactic analysis can be carried out either through a formal
model, given a priori by the designer of the system, or
through a statistical model, the simplest being based on
counting how grammatical classes follow each other in a
learning corpus (bigram model).
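The bigram model just mentioned can be sketched in a few lines. In this minimal illustration plain words stand in for grammatical classes, and no smoothing is applied; both choices are simplifications for the example.

```python
# Minimal bigram language model: count how tokens follow each other
# in a learning corpus, then score a sentence as a product of
# conditional probabilities P(w_i | w_{i-1}). No smoothing is applied.
from collections import Counter

def train_bigrams(corpus):
    """Count unigrams and bigrams over sentences padded with markers."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence + ["</s>"]
        unigrams.update(tokens[:-1])
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return unigrams, bigrams

def probability(sentence, unigrams, bigrams):
    """Product of conditional bigram probabilities along the sentence."""
    tokens = ["<s>"] + sentence + ["</s>"]
    p = 1.0
    for prev, cur in zip(tokens[:-1], tokens[1:]):
        if unigrams[prev] == 0:
            return 0.0
        p *= bigrams[(prev, cur)] / unigrams[prev]
    return p

corpus = [["the", "cat", "sleeps"], ["the", "dog", "sleeps"]]
uni, bi = train_bigrams(corpus)
# "the" is always followed by a noun, each noun half of the time:
assert probability(["the", "cat", "sleeps"], uni, bi) == 0.5
```

A stochastic finite automaton, as discussed below, generalizes this: the bigram model is the special case where each state remembers exactly one preceding token.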
Both types of models are of interest in Machine Learning: grammatical inference is basically the theory and algorithmics of extracting formal grammars from samples of sentences, while the discovery of a statistical model from a corpus is an important problem in natural language processing. It is interesting to combine both approaches by extracting from the learning corpus a stochastic finite automaton as the language model: it has the advantages of a probabilistic model, but can also exhibit long-distance dependencies reflecting a real structure in the sentences.
We have worked on grammatical inference in recent years,
especially within a contract with FTR&D between 1998 and 2001.
The field remains very active in the Machine Learning community,
and much progress in grammatical inference has recently been made in
the framework of Language and Speech processing (
Any sentence is both a sequence of words and a hierarchical organization of this sequence. The second aspect is particularly important to analyze if one wants to understand syntactic and prosodic aspects of oral speech. Producing synthetic speech in oral dialogue requires a good quality prosody generator, since much information is carried through that channel. Usually, the prosody of synthetic speech is produced by rules which use syllabic, lexical, syntactic and pragmatic information to compute the pitch and the duration of every syllable of the synthetic sentence.
An alternative approach is to consider a corpus of natural sentences and to use machine learning algorithms. More precisely, every sentence in this learning set must be described both in terms of the information relevant to its prosody (syllabic, lexical, etc.) and in terms of the prosody itself. The machine learning task is to produce explicit or hidden rules associating the description with the prosody.
At the end of the learning procedure, a prosody can be associated with any sentence described in the same representation.
The learning methods used in the literature make use of neural networks or decision trees, ignoring the hierarchical nature of the organization of syntax and prosody, which are also known to be strongly linked. This is why we represent a sentence by a tree and use a corpus-based learning method. As a first step, we have used the nearest-neighbour rule.
Given a learning sample of pairs of trees (sentences) and labels
(prosody),
This raises two problems: firstly, finding a good description of a
sentence as a tree; secondly, defining a distance between trees.
We have worked on these questions in recent years
In the context of speech synthesis, we would now like to use a
more sophisticated lazy learning method: learning by
analogy. Its principle is as follows: knowing a sentence
.
Actually, we do not yet study learning by analogy directly on trees, but on sequences. The reason is that the distance we use between trees and between sequences (the edit distance) is much easier to manage in the universe of sequences.
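The edit distance in question is the classical Levenshtein distance; a minimal sketch, with unit insertion, deletion and substitution costs (the project may of course use other cost settings).

```python
# Levenshtein edit distance by dynamic programming over prefixes:
# d[i][j] is the minimum cost of turning a[:i] into b[:j] using
# unit-cost insertions, deletions and substitutions.

def edit_distance(a, b):
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        d[i][0] = i                               # delete all of a[:i]
    for j in range(1, len(b) + 1):
        d[0][j] = j                               # insert all of b[:j]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # delete a[i-1]
                          d[i][j - 1] + 1,        # insert b[j-1]
                          d[i - 1][j - 1] + cost) # match or substitute
    return d[len(a)][len(b)]

# The word pair used later in the text: three insertions (i, n, final e).
assert edit_distance("exact", "inexacte") == 3
assert edit_distance("abc", "abc") == 0
```

The optimal edit trace (the sequence of operations realizing this minimal cost) is recovered by backtracking through the same table, which is what the analogies discussed below compare.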
We first worked on defining what it means to solve an
analogical equation on sequences when the edit distance is
introduced. In general, an analogical equation can be described as
follows: find and is often written by
The idea is to generalize the studies of
Lepage
As a consequence of these two primitive axioms, five other
equations are easy to prove equivalent to
Another possible axiom (determinism) requires that one of
the following trivial equations has a unique solution (the other
being a consequence):
We can now give a definition of a solution to an analogical
equation which takes into account the axioms of analogy:
correct solution to the
analogical equation
Solving analogical equations between sequences has drawn only
little attention in the past. Most relevant to our discussion are
the works of Yves Lepage, presented in full detail in
Lepage
Yvon considers that comparing sequences for solving analogical
equations must be based only on insertions and deletions of
letters, and must satisfy Lepage's axioms. His work is based
on an algebraic approach and on an operator, the shuffle. The
shuffle of two words
Yvon constructs a finite-state transducer to compute the set of solutions of any analogical equation on strings. He also produces a refinement of this finite-state machine able to rank these analogies so as to recover an intuitive preference for simple analogies. The "simplicity" of an analogy is related to the number of entanglements produced by the shuffle operator. This corresponds to the intuition that good analogies should preserve large chunks of the original objects, the ideal analogy involving identical objects.
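The shuffle operator itself is easy to state: the shuffle of two words is the set of all interleavings that preserve the letter order of each word. A minimal enumeration (Yvon's construction is a finite-state transducer; this brute-force recursion is only for illustration):

```python
# Shuffle of two words: all interleavings keeping each word's letter order.
# Brute-force recursion for illustration; exponential in the word lengths.

def shuffle(u, v):
    """Return the set of interleavings of u and v."""
    if not u:
        return {v}
    if not v:
        return {u}
    return ({u[0] + w for w in shuffle(u[1:], v)} |
            {v[0] + w for w in shuffle(u, v[1:])})

result = shuffle("ab", "cd")
# C(4, 2) = 6 ways to interleave two 2-letter words with distinct letters.
assert len(result) == 6
assert "abcd" in result and "cdab" in result and "acbd" in result
```

The ranking of analogies by "entanglement" then amounts to preferring interleavings such as "abcd" or "cdab", where each word survives as one chunk, over "acbd", where the chunks alternate.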
Both these studies are strongly based on the property of distribution in analogy that is taken as an axiom by Lepage and
Yvon: In an analogy
Firstly, we have to define what "is to" means. If we keep the line
of the previous work, "edit trace between
Secondly, "as" requires the comparison of two traces, which are
themselves sequences (or more simply, "as" can be equality).
One problem here is to define precisely which types of edit
operations we take into account: do we allow only
deletions, insertions and substitutions of subtrees or letters,
whatever their place, or do we have to specify, for instance, that a
substitution involves two verbal groups? When a set of edit
operations is given, a second problem is to define a distance on
this set. If a distance based on edit operations is defined on
traces, we can say that there is an analogy when this distance is
null. We are still investigating approximate analogy, when
this distance is not null. For instance, "exact"
is to "inexacte" approximately as
"finitude" is to "infinité". The smaller
the distance between traces, the better the approximation.
In conclusion, we aim at giving a sound definition of analogy on
sequences as a first step, then on prosodic tree structures as a
second step. With this definition of analogy, we will implement an
algorithm for solving analogical equations. Then, for the learning
by analogy problem, we will adapt fast NN-algorithms, such as
AESA
Text-to-speech synthesis (TTS) can be carried out by concatenating acoustic units taken from a continuous speech database. State-of-the-art TTS systems juxtapose pre-recorded acoustic units, typically phones, diphones or non-uniform longer units.
An alternative to the production of speech from a dictionary of
diphones consists of using an indexed corpus of continuous speech
The multiple representation of these configurations at the
acoustic and phonological levels enables voice quality to be
improved significantly compared with a synthesis based a priori on a
finite set of phonemes. We therefore face a combinatorial issue where linguistic units
Our methodological research framework tries to answer the following points:
From an acoustic quality point of view, given a continuous speech database with phone labels, how can we characterize the best sequence of linguistic units
From an algorithmic point of view, in a graph search framework, what are the best heuristics to solve this combinatorial problem?
From a pragmatic point of view, given a target application, what could be the best set of pre-recorded speech sentences yielding the best TTS quality?
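The graph search framework mentioned in the second point can be sketched as a Viterbi pass over the candidate units for each target phone, minimizing the sum of a target cost and a concatenation cost. The cost functions below are placeholder numbers, not the actual acoustic measures used in a real unit-selection system.

```python
# Unit selection as a graph search: one column of candidate units per
# target phone; a Viterbi pass minimizes target cost + concatenation cost.
# Costs here are illustrative placeholders, not real acoustic distances.

def select_units(candidates, target_cost, concat_cost):
    """candidates: list (one per target phone) of lists of unit ids.
    Returns (best total cost, best path of unit ids)."""
    # best[u] = (total cost of best path ending in unit u, that path)
    best = {u: (target_cost(0, u), [u]) for u in candidates[0]}
    for t in range(1, len(candidates)):
        new_best = {}
        for u in candidates[t]:
            # Best predecessor for u, accounting for the join cost.
            prev, (cost, path) = min(
                ((p, best[p]) for p in best),
                key=lambda kv: kv[1][0] + concat_cost(kv[0], u))
            new_best[u] = (cost + concat_cost(prev, u) + target_cost(t, u),
                           path + [u])
        best = new_best
    return min(best.values())

# Toy example: two candidate takes per phone; joining units from the
# same recording ("take" suffix) is free, crossing takes costs 1.
cands = [["a1", "a2"], ["b1", "b2"]]
tc = lambda t, u: 0.0
cc = lambda p, u: 0.0 if p[1] == u[1] else 1.0
cost, path = select_units(cands, tc, cc)
assert cost == 0.0 and path in (["a1", "b1"], ["a2", "b2"])
```

The search is exact here; on realistic databases, the candidate lists are large enough that pruning heuristics (beam search, pre-selection of units) become the issue raised in the second point above.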
Machine learning methodologies based on statistical approaches require databases of substantial size. The examples taken from these databases exhibit the relationships between the numerous variables involved in the studied phenomenon. In synthesis as in speech recognition, one wishes to have an explicit relation between the acoustic level and the phonological level. While automatic labelling from phonetic sequences is a task for which acceptable solutions exist, the process is more complex when only the text is known.
In such a context, given a speech utterance realized by a speaker
and its particular phonetic transcription, the precise location of
the temporal marks delimiting phone boundaries in the speech signal is
required. State-of-the-art systems use a Markovian description
of the speech in an appropriate acoustic space
Sequences of Hidden Markov Models (HMM) are built from the phonetic
description of the acoustic observations. As one needs to
discriminate phone boundaries, most phone
segmentation systems postulate a monophone modeling hypothesis.
During a learning phase, the parameters of each phone model are
learned from a set of examples using the well-known EM
iteration scheme
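Once per-frame phone scores are available, the segmentation step itself is a dynamic program placing the boundaries. The sketch below is a simplification, not the project's system: synthetic numbers stand in for the HMM log-likelihoods, and each phone must cover at least one frame.

```python
# Simplified phone segmentation by dynamic programming. scores[p][f] is
# the (here synthetic) log-likelihood of frame f under phone p; the DP
# places boundaries maximizing the total score, each phone covering
# at least one frame, phones appearing in the given order.

def segment(scores):
    """Return (best total score, list of (start, end) frame spans)."""
    P, F = len(scores), len(scores[0])
    NEG = float("-inf")
    # best[p][f]: best score with phones 0..p covering frames 0..f
    best = [[NEG] * F for _ in range(P)]
    back = [[0] * F for _ in range(P)]
    acc = 0.0
    for f in range(F):                      # phone 0 covers frames 0..f
        acc += scores[0][f]
        best[0][f] = acc
    for p in range(1, P):
        for f in range(p, F):
            span = 0.0
            for s in range(f, p - 1, -1):   # phone p covers frames s..f
                span += scores[p][s]
                cand = best[p - 1][s - 1] + span
                if cand > best[p][f]:
                    best[p][f], back[p][f] = cand, s
    # Trace the boundaries back from the last frame.
    spans, f = [], F - 1
    for p in range(P - 1, 0, -1):
        s = back[p][f]
        spans.append((s, f))
        f = s - 1
    spans.append((0, f))
    return best[P - 1][F - 1], spans[::-1]

# Two phones over four frames; frames 0-1 fit phone 0, frames 2-3 phone 1.
scores = [[0.0, 0.0, -5.0, -5.0],
          [-5.0, -5.0, 0.0, 0.0]]
total, spans = segment(scores)
assert spans == [(0, 1), (2, 3)] and total == 0.0
```

In a real system the same backtracking is embedded in Viterbi decoding over the HMM states, and the EM phase re-estimates the models from the alignments it produces.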
We propose to weaken the previous hypothesis by replacing the exact phonetic transcription with the exact phonemic sequence. The phonemic sequence is built automatically from the text. Under the same phonemic symbol, various acoustic realizations can be found, depending on the coarticulation context of the realized phone.
Currently, many automatic systems delivering information suppose that the user of such a service must be able to adapt himself to the implicit requirements of the automatic system.
We postulate that a man-machine interface based on natural
language must facilitate access to this type of service for the
greatest number of people.
There already exist, of course, many information systems whose technology is based on man-machine oral dialogue. In a first stage, the speech input of such a system is translated into a sequence of words. This sequence of words is then treated by a pragmatic entity taking a dialogue model into account. In return, the machine's response is uttered by a speech synthesis system starting from text or from a concept model.
Within this experimental framework, a problem which still lacks satisfactory scientific and technological answers today is that of semantic treatment.
A semantic function in the context of natural speech processing has
a double objective. Firstly, a pragmatic treatment carried out on
a sequence of concepts is more relevant than one carried out directly
on a sequence of words, since words are sensitive to the errors of the
recognition system. Secondly, a system able to
understand the message can propose different alternative renderings,
which current text-based speech synthesis systems cannot do.
The proposed research framework sits between the output of
an automatic speech recognition system and the input of a dialogue
management system. From the description of an utterance recognized
by the ASR system and translated into a word lattice, the goal
is to provide the sequence of the underlying concepts. We
propose to explore the temporal dimension of the sequence of
concepts in an oral utterance, starting from its relationship to
syntax and more particularly from the discovery of the set of thematic
roles
We will adopt a methodology based on the observation of corpora of
examples within a statistical theoretical framework. However, it
is difficult to find corpora annotated with semantic elements
(particularly in French). The phenomenon will therefore be described by
partially observed random variables
On the one hand, the state of the art in speech processing can
efficiently model the acoustic characteristics of a voice
starting from a voice print. On the other hand, few studies
address the suprasegmental aspect of speech, more
precisely the automatic modeling of melody contours beyond ad hoc prosodic
models.
We propose a model making it possible to describe a melody contour
at the sentence level as built over a sequence of elementary melody
contours. The difficulty is that the alphabet of classes of
elementary contours is not known a priori.
These classes thus have to be estimated from
sentence-level observations, while relying on assumptions of
parsimony
A melody contour is a one-dimensional, real-valued signal
evolving over time. We propose to take as a methodological
assumption the class of dynamical state space models
We model dialogue phenomena using the concepts of speech acts and dialogue acts, and we consider that a sequence of exchanges can be analyzed as the result of planning. Machine learning can also be used to increase the efficiency of the planner. A well-known topic in Artificial Intelligence is the use of experience to increase the efficiency of inference engines, planners and, generally speaking, every kind of reasoning system. A frequently used framework is Case-Based Reasoning, which uses a corpus of previous experience to discover "shortcuts" or to memorize frequently used pieces of elaborated information. Another possibility is to use statistics on the sequencing of actions to make decisions informed by experience.
A thesis topic has been proposed this year: the adaptation of the actions of an agent in a communication situation. The main issue is to give the agent a capacity for analyzing the ongoing dialogue, in order to dynamically adapt its strategy if necessary. To achieve this goal, statistical machine learning techniques will be applied to dialogue corpora.
This work is part of the CRC (Contrat de Recherche
Coopérative, Cooperative Research Contract)
The aim of this study is to design and develop educational software to help teach and learn languages.
The use of Ordictée concerns the primary school
exercise called dictation. In this application, a speech
synthesizer reads a French text while the pupil types the
orthographic transcription on the keyboard. The reading speed is
at any moment tailored to the typing speed. The pupil can
correct the text at any time. This application relies on the
design and development of specific tools, such as the alignment
of the text provided by the teacher with the pupil's text.
The application domains for our research are all the situations where man-machine communication requires speech or where the use of speech brings more comfort. These applications are in general complex enough to require a real dialogue situation, and would be tedious if handled through a simple sequence of guided short answers.
Examples of such applications are: information services on a personal computer or a public one, booking services by telephone, and computer-assisted language learning.
We develop our applications on the CNRT platform DORIS, to promote
joint projects with industrial research. The Georal
system is a demonstrator of tourist information services, with
oral dialogue and a tactile screen. We also have a "dictation"
software package called Ordictée, which has been tested in
primary schools. Finally, a grammatical inference library,
Epigram, has been developed.
The Cordial project aims to promote its research activities by means of technological demonstrations. To this end, hardware and software resources have been brought together to build an R&D platform named DORIS, dedicated to man-machine interaction, in particular using vocal and dialogue technologies. The main funding comes from IRISA/INRIA, the Regional Council of Brittany and Cordial public contract funding. Within the CNRT-TIM Bretagne, DORIS is intended to promote joint projects between institutional and industrial research.
DORIS is involved in various research projects such as
Georal (see sections and
), Semantic Parrot (see
section ), and Ordictée (see section
). An INRIA research engineer manages the technical
aspects of the platform and develops new software for the
aforementioned projects.
On the computing side, a Compaq AlphaServer system has been chosen to support our processing power needs, especially for speech processing. In addition, the platform includes a Network Appliance file server with a storage capacity of up to 350 GB.
In order to facilitate technical access for industrial partners, the platform includes fast, secure network access. DORIS inherits from the ENSSAT-Université de Rennes 1 network. We provide a high-speed internet connection with VPN access.
On the client side, PCs with up-to-date sound configurations are used. These computers are meant for software development within DORIS. They are currently used by engineers, Ph.D. students and postgraduate students involved in the Cordial project. Touch screens have been bought in order to facilitate the development of multimodal man-machine interfaces.
This client-server configuration is fully functional inside the ENSSAT campus. Further improvements will focus on lightweight clients and resource sharing with external partners (see section ).
The DORIS platform's main goal is to group research projects in the field of man-machine interaction. Within this entity, each project can take advantage of the other teams' work and tools.
We first direct our efforts towards the installation of a
multi-agent
We made this choice to simplify the development while ensuring standards compliance. Furthermore, the Java technology allows us to use already developed libraries that are not necessarily within our sphere of competence (e.g. sound or speech coding, framing, streaming) and therefore to concentrate on our scientific interests.
Today, two projects have taken place inside the DORIS platform:
Georal (see sections and
) and the Semantic Parrot
(see section ).
In order to share our resources and to enable efficient cooperative work with extra-ENSSAT partners, we would like to allow those involved (e.g. other academic researchers or industrial partners) to connect to the DORIS platform network.
VLAN and VPN solutions are currently being studied. The main point is to be careful about security issues.
At another level, one of the guiding lines in DORIS developments is to reach a complete independence between the server and the clients, in a technological sense. A solution is to allow lightweight clients (e.g. PDAs and cellphones) to communicate with the DORIS server as fluently as a web application running on a PC.
Georal Tactile is a multimodal system able to
provide tourist information to naive users. Users
can ask for information about the location of places of interest
(city, beach, chateau, church, ...) within a region or a subregion,
or about the distance and itinerary between two localities. Users interact
with the system through three different modalities: a visual mode,
by looking at the map displayed on the screen; an oral,
natural language mode (thanks to a speech recognition system); and
a gesture mode, by pointing to or drawing on a touch screen. The
system itself uses both the oral channel (text-to-speech
synthesis) and graphics, such as flashing sites and routes and
zooming in on subsections of the map, so as to best inform the
user.
The Georal project started in 1989 and has been the origin of
various works since then. It was fully developed in Visual Prolog
4.0. We decided to re-implement Georal to make the most of
the capabilities of the DORIS platform.
The foundation stone of this re-implementation was to split the initial Prolog modules (syntactic and semantic analysis, dialogue management and tactile screen management) according to the multi-agent paradigm (one module for one functionality). We assigned one agent to each specific role; these agents are written in Java. However, we kept the core functions in Prolog, to take advantage of the fact that this language is really convenient for tasks like natural language processing. All peripheral functions, from screen management to client-server communication, have been rewritten in Java and C/C++.
Calling Prolog predicates from agents written in Java was not
straightforward. After a benchmarking phase, we decided to use a
Java package that allows such calls (tuProlog). This implied
studying the existing Prolog files to extract the useful predicates
and correcting them to bring the code closer to ISO Prolog.
Furthermore, the work on the Prolog code allowed an improvement of
the Georal engine's capacities: a larger range of queries
is now accepted by the system and some bugs have been fixed.
Improvements have also been made on the gesture management side.
The touch screens prove useful for processing new kinds of
drawings, such as following a winding route.
A text-to-speech server has been installed and a dedicated agent communicates with it. The processing time and the sound quality are very good, but we are using a local network for the moment. We plan to insert the Internet between the clients (possibly wireless devices) and the server. Substantial work on data coding and the communication protocol has to be done beforehand.
The integration of the speech recognition server is in progress. Here again, the variety of hardware and software used will require flexibility in the next developments.
Ordictée addresses the primary-school exercise known as
dictation. In this application, a speech synthesizer reads a
French text while the pupil types the orthographic transcription
on his keyboard. The reading speed is at any moment tailored to
the typing speed, and the pupil can correct the text at any time.
The application is based on the design and development of specific
tools, such as the alignment of the text provided by the teacher
with the pupil's text.
Ordictée is a piece of software that allows a pupil to perform a
dictation exercise on his own. It is made up of three modules : the
pupil module, which carries out the dictation exercise with the
pupil; the teacher module, which allows the teacher to design his
own dictation texts; and the administrator module, which is devoted
to setting the application parameters. One of the main functions of
Ordictée is to follow the typing, i.e. to adapt the reading
rhythm to the typing speed. This function is based on the
hypothesis that mistakes do not affect the pronunciation, and on
the phonetic closeness of the two texts (the pupil's text and the
teacher's).
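The alignment between the teacher's text and the pupil's typing can be illustrated by a classical edit-distance alignment. The sketch below is a minimal Python illustration, not the project's implementation (which additionally exploits phonetic closeness); the function name and the unit costs are our own assumptions.

```python
def align(teacher, pupil):
    """Align two token sequences with classical edit distance
    (Levenshtein, unit costs); return aligned pairs, with None
    marking a word skipped by the pupil or typed in excess."""
    m, n = len(teacher), len(pupil)
    # dp[i][j] = cost of aligning teacher[:i] with pupil[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = i
    for j in range(1, n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = dp[i - 1][j - 1] + (teacher[i - 1] != pupil[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    # backtrack to recover the aligned pairs
    pairs, i, j = [], m, n
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                dp[i][j] == dp[i - 1][j - 1] + (teacher[i - 1] != pupil[j - 1])):
            pairs.append((teacher[i - 1], pupil[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            pairs.append((teacher[i - 1], None))
            i -= 1
        else:
            pairs.append((None, pupil[j - 1]))
            j -= 1
    return list(reversed(pairs))
```

Pairs whose two tokens differ point at probable mistakes; following the typing then amounts to advancing the reading position along this alignment.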
From a usability point of view, the semantic parrot that we
propose takes a speech message from a standard audio input
(personal computer, PDA, cell phone), understands the underlying
concepts and finally generates a paraphrase on a speech output.
From a technological point of view, the semantic parrot implements techniques of speech recognition, automatic speech understanding and, finally, concept-to-speech synthesis.
Currently, a first demonstrator is being built on the DORIS platform (see section ), implementing a speech recognition technology provided by TELISMA and a speech synthesis technology provided by FTR&D.
The software library called Epigram (Environnement de
Programmation pour l'Inférence GRAMmaticale), has been developed
between 1997 and 2001. Epigram is a library of high
level modules enabling the development of grammatical inference
programs and applications. It has been written around a C++
environment called LEDA, a library of data types and combinatorial
algorithms developed at the Max-Planck-Institut für
Informatik, Saarbrücken, Germany. Epigram has been
written together with the team EURISE of the University of
Saint-Étienne and the former IRISA project Aïda (F. Coste,
now in the Symbiose project), as a part of a contract with France
Telecom FT R&D (CTI 97 1B 004).
Dialogue systems have to access a world description that is relevant to the interacting user. To prevent the system from entering an infinite loop on a query it cannot answer, a finite first-order dialogue logic has been devised. Besides its dialogue-oriented characteristics, this logic makes it possible to view the satisfiability-detection procedure as spatial rewritings based on an axiomatization of the inner data structures. As various axiom sets are conceivable, a solver has been prototyped, from which the rewritings can be animated on a graphical interface.
This study was completed this year; it allowed us to single out the most effective axioms and also to forge new ones that had not initially been conceived. The next step is to assess the performance of this spatial-rewriting approach. For this purpose, we are now developing a solver in what seems the most competitive context : the propositional calculus, where efficiency has been a constant issue over the past decades.
This activity was developed in the framework of the AUPEL-AUF
agency, in collaboration with several laboratories. The grant was
to last four years, but the agency ended the funding after only
two years because of internal problems, despite the interesting
results obtained. We have therefore temporarily stopped our
studies on this topic, while maintaining a scientific watch
(bibliography, workshop during the ACL conference in Toulouse in
August 2001). The design of the new Georal system will
make it possible to work on this topic again, by recording real
corpora and by testing different methods and algorithms.
A study of referring phenomena in an enlarged version of
Georal has been carried out. We also continued work to
improve the ORDICTEE software (dealing with phonetically based
mistakes, following the typing).
Recent progress in speech recognition makes it possible to plan
important new developments in the Georal Tactile dialogue
system.
First, we conducted an experiment to determine the linguistic
behaviour of users when they reference elements on the map. A
large number of linguistic forms were observed, as well as the use
of built-up elements (for example, referencing a triangle by means
of particular points). A new type of gesture (following a line)
was also observed.
We proposed a syntactic model to parse and filter referential
expressions in the users' utterances. This model is based on the
works of Vandeloise and Borillo.
As far as cartography is concerned, we developed a new data model and search algorithms that are better adapted to the elements being handled.
Finally, we redesigned the architecture of the system and the processing flow in order to deal with several issues : more complex gestures, references to objects that are not stored in the database, and two-stage processing. Contrary to what occurs in the current version, we give priority to gesture activity over speech activity; this principle makes it possible to progressively check and, if necessary, correct the referential linguistic expressions, to determine the referents on the map and to build, when needed, new elements in the database. Some of these algorithms have been implemented and we are integrating them into the system.
We began studies, firstly, to model the different semantic points
of view (natural language, graphics) in a uniform way, starting
from Pineda and Garza's work [Pin00], and secondly, to bring
together the processing of references in Georal and the
plan-based modeling of dialogue. We are also studying the use of
the concept of salience, taking into account results from the
LORIA laboratory.
A new algorithm for aligning the teacher's text and the pupil's text has been designed. Its implementation and evaluation are under way. If the results are promising, a patent will be filed.
A technology review has been done this year.
This research topic produced no new results in 2003, since its
main contributor completed his thesis, defended in December 2002.
A communication on the latest results was given at the CAp
conference, Laval, June 2003.
The thesis of L. Blin studied how to learn the prosody of a
sentence by using a distance between trees (sentences being
represented as trees) and the nearest-neighbour technique. It was
concluded at the end of 2002 and yielded its last results in
2003.
We examine how the nearest neighbour method could be extended to learning by analogy.
Before defining what learning by analogy is, let us define what
it means to solve an analogy on trees. It can be described as
follows : given three trees A, B and C, find the tree X such that
A is to B as C is to X, which is often written A : B :: C : X.
Firstly, we have to define what "is to" means. If we keep the line
of the previous work, this relation can be based on the notion of
edit trace between trees.
We do not presently work on trees but on sequences, since the
edit distance on sequences is well known and easier to implement.
We have mostly worked on defining what it means to solve an
analogical equation on sequences when the edit distance is
introduced, in order to generalize the works by Lepage.
We have produced an algorithm for this purpose.
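As a toy illustration of solving an analogical equation A : B :: C : X on sequences, the Python sketch below handles only the simple suffix-substitution case; the actual work relies on the edit distance and generalizes Lepage's approach, so this function and its limited scope are our own simplification.

```python
def solve_analogy(a, b, c):
    """Toy solver for the analogical equation a : b :: c : x on
    strings. It only covers the case where b is obtained from a by
    replacing a suffix (e.g. inflection); returns None otherwise."""
    # longest common prefix of a and b
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    sa, sb = a[i:], b[i:]  # suffix removed / suffix added by a -> b
    if not c.endswith(sa):
        return None        # the same transformation does not apply to c
    return c[:len(c) - len(sa)] + sb
```

For instance, walk : walked :: talk : X yields X = talked, while go : went :: eat : X is not solvable by this simple scheme.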
This study is covered within the framework of a PhD thesis funded by a research contract with FTR&D Lannion (FTR&D/DIH/ISP). Work began on October 1, 1999 and was completed by the defence of Hélène François' PhD thesis in December 2002.
A Text-to-Speech (TTS) synthesis system produces an acoustic speech signal corresponding to the pronunciation of a written text. Currently, state-of-the-art TTS systems assemble elementary acoustic segments in order to produce the speech signal. Most of the time, the set of elementary acoustic speech units, e.g. a set of diphones, triphones or longer units, is fixed and determined from human phonological expertise, whatever the sentences to produce.
The main objective of this work is to reformulate this assumption about the nature of the acoustic units. We consider that the acoustic continuum at the synthesis stage is built by assembling acoustic segments that are unknown a priori and subject to contextual conditions. The acoustic database from which the segments to assemble are extracted is a continuous-speech database, long enough to make these contextual choices of segments possible.
Firstly, we propose to automatically determine a minimal set of
sentences taken from a huge set of written sentences : a French
corpus from various sources (dialogue transcriptions, literature,
TV series scripts, medicine courses) containing approximately
400,000 sentences (see activity report 2002). Obviously, it is
impossible to record the speech equivalent of this first
linguistic database. With a method solving a minimal covering, we
have found a reduced linguistic database, made of only 4,000
sentences, ensuring the coverage of the targeted units.
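A minimal covering of this kind is typically approximated greedily. The following Python sketch is an illustrative greedy set-cover heuristic with invented data, not the method actually used to obtain the 4,000 sentences: at each step it keeps the sentence covering the most still-uncovered units.

```python
def greedy_cover(units_of):
    """Greedy heuristic for minimal covering.
    units_of: dict mapping a sentence id to the set of units
    (e.g. diphones in context) that the sentence contains."""
    universe = set().union(*units_of.values())
    chosen, covered = [], set()
    while covered != universe:
        # pick the sentence with the largest number of new units
        best = max(units_of, key=lambda s: len(units_of[s] - covered))
        chosen.append(best)
        covered |= units_of[best]
    return chosen
```

The greedy heuristic does not guarantee a minimal cover (the exact problem is NP-hard), but it gives a strong approximation and scales to hundreds of thousands of sentences.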
Secondly, we propose a methodology to evaluate the criteria used
in unit-selection methods. Usually, criteria are evaluated in a
comparative black-box way : the performance of a criterion is
measured relative to the performance of the other criteria. We
present a glass-box method exploring the sequences of units able
to synthesize a given target utterance. Given a continuous-speech
database, one can enumerate all the possible phonological units up
to a predefined length. Given the phonetic sequence of the text to
be synthesized, we need to parse this sequence with phonological
units. We represent the different cuts of the phonetic sequence
into unit sequences by a graph; the selection problem then comes
down to a best-path search. Obviously, a true operational
synthesis system cannot afford such a time-consuming task. The
usual approach consists in avoiding the full search by defining
selection criteria which lead, in reasonable time, to an
as-good-as-possible solution. The problem is then to evaluate how
far that solution is from the best one. From another point of
view, this work can be seen as a way to bound the heuristics used
in most TTS systems.
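The exhaustive glass-box search can be sketched as a shortest-path dynamic program over all cuts of the phonetic sequence into database units. In this illustrative Python sketch, the unit-cost table is a stand-in for the real selection criteria; its content and the function name are assumptions.

```python
def best_segmentation(phones, unit_cost):
    """Exhaustive search of the best cut of a phoneme sequence into
    database units. unit_cost maps a unit (tuple of phonemes) to its
    cost and contains only the units present in the database."""
    n, INF = len(phones), float("inf")
    best = [0.0] + [INF] * n      # best[i] = cost of covering phones[:i]
    back = [None] * (n + 1)       # back[i] = start of the last unit
    for end in range(1, n + 1):
        for start in range(end):
            unit = tuple(phones[start:end])
            if unit in unit_cost and best[start] + unit_cost[unit] < best[end]:
                best[end] = best[start] + unit_cost[unit]
                back[end] = start
    if best[n] == INF:
        return None, INF          # no full cover exists in the database
    # backtrack the best path through the cut graph
    units, i = [], n
    while i > 0:
        units.append(tuple(phones[back[i]:i]))
        i = back[i]
    return list(reversed(units)), best[n]
```

On real databases this enumeration is exactly the time-consuming search mentioned above; it is useful offline, as a reference against which heuristic selection criteria can be bounded.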
Since December 2002, we have been involved in a national research
project named Néologos (2003 Technolangue call for
proposals). Our essential participation consists in applying the
methodologies developed during Hélène François' thesis work to
define speech corpora usable both for speech recognition and for
speech synthesis. Currently, the content of the corpora is
defined; their collection must take place during 2004.
This study is covered within the framework of a PhD thesis funded by a research contract with FTR&D Lannion (FTR&D/DIH/ISP). Work began on January 1, 2000.
This work relates to the automatic segmentation of speech corpora, read or spontaneous, into phone units. Text-to-Speech synthesis systems based on concatenated acoustic units require this process. The general framework of this study is described in section .
Firstly, we developed a baseline speech segmentation system based
on HMM (Hidden Markov Models), as briefly described in section
. The segmentation scores we obtained are equivalent to those of
state-of-the-art systems from the literature. Moreover, to analyze
the behavior of this baseline segmentation system, we carried out
experiments along two axes : the topological definition of the
HMMs and the acoustic analysis of the speech.
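HMM-based segmentation essentially amounts to a Viterbi forced alignment of the known phone sequence against the signal frames. The Python sketch below is a deliberately simplified illustration (one state per phone, frame log-likelihoods given as input); real systems use multi-state HMMs with Gaussian-mixture densities.

```python
def forced_align(loglik):
    """Viterbi forced alignment of a known left-to-right phone
    sequence: assign each frame t to one phone s, with a monotone
    path from phone 0 at the first frame to the last phone at the
    last frame. loglik[t][s] = log-likelihood of frame t under s."""
    T, S = len(loglik), len(loglik[0])
    NEG = float("-inf")
    dp = [[NEG] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]
    dp[0][0] = loglik[0][0]
    for t in range(1, T):
        for s in range(S):
            stay = dp[t - 1][s]                      # remain in phone s
            move = dp[t - 1][s - 1] if s > 0 else NEG  # enter phone s
            back[t][s] = s if stay >= move else s - 1
            dp[t][s] = max(stay, move) + loglik[t][s]
    # backtrack from the last phone at the last frame
    path = [S - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))  # phone index assigned to each frame
```

The phone boundaries are read off wherever the returned path changes state; segmentation accuracy then depends directly on how well the acoustic models fit the frames.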
Secondly, we focused our activity on automatic phone labelling from the text, not from the true phonetic sequence. Given an automatic phonemic transcription of the text, various acoustic realizations can be found, depending on the coarticulation context of the realized phone and on the speaker. Since the HMM framework is well suited to introducing pronunciation variants, all that is needed is to extend the graph of model hypotheses and let the decoding phase find the best alignment. The main drawback of this scenario is that, as the degrees of freedom increase, the system becomes unstable and less accurate. We derive the following working hypotheses from these remarks :
The segmentation system operates only with text and signal. The grapheme/phoneme transcription should be carried out automatically. The risk of a mismatch between the signal and its phonemic representation can be locally very high.
Given a specific phoneme, all of the acoustical realizations are captured by a single HMM model using mixture of distributions for the observable densities.
In order to correct the automatically generated phonemic transcription and achieve good segmentation scores, local mismatches between the phonemic transcription and the acoustics must be detected, and alternative labels should be proposed by relaxing the constraints of the model description.
Considering the wide scope of this topic, we have addressed only
the detection problem. Hence, we turned our attention towards
methods for scoring the confidence of this acoustic-to-phonological
mapping. Compared with the state of the art, the original
confidence measure we proposed within the HMM framework
experimentally yields the best scores, evaluated through DET
curves (12% Equal Error Rate, EER, on a randomly blurred test
database).
Concerning the confidence measure, a patent filing took place in October 2002 in France (see activity report 2002); an extension to the US took place in October 2003.
This study is covered within the framework of a PhD thesis funded by FTR&D. Work began on January 1, 2003.
The approach suggested within the framework of this thesis is based on modeling the prosody by a set of forms representative of the various realizations of the melody present in a reference speech corpus. Once this representation is defined, we plan to automatically segment the database using these elementary forms. Each of these segments is annotated with syntactic, phonetic and phonological labels obtained during the linguistic analysis of the corpus. Next, we want to address the question of mapping the tagged prosodic elementary forms to the associated linguistic characteristics. Taking into account the correspondence between linguistic and prosodic parameters should make it possible to restore the elocution style actually recorded by a speaker. Moreover, at the synthesis stage, during the selection of acoustic units, the prosodic targets produced by the proposed system should correspond better to the true prosodic parameters of the recorded speaker.
In relation to this scientific topic, we are hosting a DEA student during the 2003/04 academic year. His work consists in proposing a model of melody contours based on a nonlinear state-space system.
This study is covered within the framework of a PhD thesis funded by INRIA within a scientific collaboration with FTR&D Lannion (FTR&D/DIH/D2I). Work began on October 1, 2003.
As indicated in section , a thesis is to be started this year in the framework of the CRC with FTR&D.
The main topic of this project is the creation of new telephone speech databases for the French language.
The project has two main objectives : a multi-speaker speech database of children's voices (1000 speakers) and a multi-speaker speech database of adults' voices.
Cordial is mainly concerned with the second task. We aim to define, for French, a speech database of reference speakers, i.e. a speech corpus in which each speaker has pronounced enough utterances to characterize his voice. To achieve this goal, we need to record many more than 50 utterances for each speaker. We plan to record a database in which 200 reference speakers record, over the fixed telephone network, 500 well-defined utterances covering the main coarticulation features of the language.
In addition to speech recognition systems, such a corpus is also useful for research and development in speaker identification and authentication, voice transformation, and the characterization of voices for Text-to-Speech systems.
The partners of the project are of three types :
Academic laboratories conducting active research on vocal technologies (IRISA, LORIA, and FTR&D), whose main contribution will be the supply of research tools and the realization of validation tests.
Industrial partners (TELISMA and DIALOCA) marketing speech recognition products, whose contribution will be the organization of the collection itself and the realization of "industrial" tests intended to show the contribution of the corpus to the improvement of the products.
ELDA (European Language Resources Distribution Agency), whose vocation is to distribute linguistic resources and which also carries out corpus-creation activities.
This year, the CRC (Contrat de Recherche Coopérative,
Cooperative Research Contract) has been finalized.
The subject is of common interest to our two research units. The CRC federates all the manpower of both teams involved in the topic. It covers the thesis of P. Alain, described in section , another thesis at FTRD DIH/DII, started in February 2002, and a thesis to begin at the end of 2003. The total manpower in permanent researchers is 0.125 man-year at FTRD and at Cordial (scientific management of the CRC and supervision of the thesis).
The Cordial team is a member of the European Network of Excellence in Human Language Technologies Elsnet, and of the French-speaking network FRANCIL (Réseau FRANCophone d'Ingénierie de la Langue).
The Cordial team is a member of the CNRS action
spécifique ASILA (machine learning and dialogue), which
brought together, in 2002 and 2003, computer scientists and
linguists from the following teams : Cordial (IRISA), the
"dialogue" group of GREYC (Université de Caen), the LIR group
(LIMSI), LIUM (Université du Mans) and Langue et Dialogue (LORIA).
The managers are Laurent Romary (LORIA) and Daniel Luzzati (LIUM).
In particular, a catalogue of dialogue corpora has been established
and a seminar has been held on the different aspects of the
learning process in dialogue. The activity report for ASILA is
available at LIUM.
Cordial is also part of a "pre-projet" in the interdisciplinary program TCAN of the CNRS, called ANALANGUE (analogies in sequences).
Olivier Boëffard has been reviewer for the IEEE transactions on Speech and Audio Processing, the Signal Processing journal, and the Speech Communication journal.
Laurent Miclet has been a member of the scientific committee of
the French Machine Learning Congress, Conférence d'Apprentissage
CAP 2003.
Jacques Siroux has been reviewer for the journal Cahiers
romans de sciences cognitives and for a special issue of the
journal Revue française d'Intelligence artificielle.
Olivier Boëffard teaches the course in Speech Synthesis in
the DEA STIR, Rennes 1 (option Signal, orientation 2)
and takes part in the module Data Mining
(Fouille de données) in the DEA
Informatique de Rennes 1.
Marc Guyomard and Jacques Siroux teach the module
human-machine communication at Enssat, Lannion (Lannion
part of the DEA Informatique de Rennes 1).
Laurent Miclet teaches a course in Pattern Recognition
(Reconnaissance des Formes) in the DEA STIR and part of the
module Apprentissage et Classification (AC) in the DEA
Informatique de Rennes 1. In the Lannion part of the DEA
Informatique de Rennes 1, for which he is the coordinator, he
teaches a module of Machine Learning (Apprentissage
Artificiel) and takes part in the module Data Mining
(Fouille de données).
Laurent Miclet has been invited to give a talk on
machine learning in dialogue systems at the 4th Sino-French
Workshop on Web Technologies, Taipei, April 2003.
Laurent Miclet has been invited to give a talk on "Apprentissage artificiel : applications à la robotique" (machine learning: applications to robotics), together with A. Cornuéjols, at the Journées Nationales de Recherche en Robotique, Clermont-Ferrand, October 2003.
Laurent Miclet has been a member of the jury for the following theses and HdR defenses:
A. Skrzyniarz. May 2003. Analyse de problèmes de décision
distribuée, évaluation de leur complexité, et conception
d'heuristiques de résolution. ENST Bretagne.
C. Godin. June 2003. Introduction aux structures
multi-échelles. Application à la représentation des plantes.
Habilitation à diriger des recherches. Montpellier 2.
D. Fredouille. October 2003. Inférence d'automates finis
non déterministes par gestion de l'ambiguïté, en vue
d'applications en bioinformatique. Rennes 1.
F. Duclaye. November 2003. Apprentissage automatique de
relations d'équivalence sémantique à partir du web. ENST Paris.
C. Blouin (as rapporteur). November 2003.
Sélection des unités en synthèse de la parole. Orsay.
Jacques Siroux participated as a rapporteur in the PhD thesis
juries of J. Goulian (Valoria) and F. Landragin (LORIA), as well
as in the HDR (Habilitation à diriger des recherches) jury of
J.-Y. Antoine (VALORIA).
This year, two DEA students are carrying out their research period with us.