The Cordial project explores different aspects of multimodal man-machine interfaces with speech components. Its objectives are both theoretical and practical: on the one hand, no natural dialogue system can be designed without an understanding and a theory of dialogic activity; on the other hand, developing and testing real systems allows the models to be evaluated and corpora to be built.
The design of a man-machine interface has to take into account
the communication habits of the users, which have been developed
within interpersonal communication. This is particularly true for
interfaces using speech, which is a particularly efficient and
spontaneous medium. Users find it very difficult to communicate
through an oral dialogue with a machine whose speech interface is
of mediocre quality. Dialogue phenomena are
complex.
Dialogue modeling
When multimodal dialogue is involved, the interference between speech phenomena and tactile actions or mouse clicks brings up problems of interpreting the coordination of the different actions of the user.
When a user performs a communication action towards the dialogue system, he certainly has an intention; but often, this intention is not explicitly present in the communication. A major problem for the system is to extract it, in order to be able to give a satisfactory answer. This requires a theory dealing with the notions of intention, background knowledge, communication between agents, etc. We model dialogue phenomena using the concepts of speech acts and dialogue acts, and we consider that a sequence of exchanges can be analyzed as the result of planning. This model gives a satisfactory account of many phenomena in real dialogues, such as the coordination between different negotiation phases or the management of the user's knowledge base.
However, several points are not straightforwardly modeled in such a theory: parts of the dialogue do not carry any obvious intention, errors in understanding may mislead the planner, etc. Moreover, extracting the dialogue acts from the speech of the user is a complex problem, as is rendering the dialogue acts of the system into synthetic speech.
Machine learning
In addition to modeling the core dialogue phenomena, the
Cordial project also has a particular interest in machine learning
from corpora at the different stages of a dialogue system. This covers
the latter problem: the extraction of semantics from the outputs
of a speech recognizer. It also tackles the problems of
constructing the prosody of the machine's synthetic speech and of
helping the dialogue engine to compute an answer.
Speech synthesis
The front-end part of an oral dialogue system
consists of a text generator producing the sequence of words
corresponding to the message to be emitted.
Our activities are distributed into four complementary domains.
The first one is concerned with both coding and structure
of interaction. It also deals with the applications. The second
one deals with multimodality and system prototyping
(architecture and evaluation). The third one is concerned
with machine learning techniques and their application to
dialogue phenomena and speech technologies. The last area deals
with speech synthesis adapted to dialogue.
In the Speech Act Theory initiated by Austin, a dialogue can be analyzed as a plan: a
sequence of actions which aims at the realization of an intentional goal.
Recognizing a plan from a sequence of observed actions consists of identifying the underlying relations between these actions in order to infer the goals and the possible continuations of the current plan.
Our project uses a family of dialogue models based on plans of speech acts. This modeling takes into account the general framework of communication and makes computer implementation easier. But it leaves open several problems, such as extracting speech acts from utterances, integrating different information sources, and handling miscommunication between participants.
Man-machine interaction can be seen as a sequence of particular
actions: speech acts and, more generally, dialogue acts, which carry both the function of
the act in the dialogue (for example requesting or querying)
and a propositional content (for example the theme of the query).
These acts can also be characterized by their conditions of use,
which involve the mental states of the participants
(intention, knowledge, belief). The most accurate computerized
model is the planning operator:
Request(Speaker, Hearer, Action(A))
  Preconditions: Want(Speaker, Request(Speaker, Hearer, Action(A))),
                 Want(Speaker, Action(A))
  Effects: MutualBelief(Hearer, Speaker, Want(Speaker, Action(A))),
           Want(Hearer, Action(A))
This can be interpreted as follows: when an agent wants its
listener to perform an action, it can utter a Request. A first
process, plan recognition, aims at building up a consensus between
participants: it tries to rebuild part of the other participant's
plan, and if this part is correctly identified, it gives an account of the
explicit motivations and beliefs of the other participant. A
second process aims at computing a relevant response by means of a
planning mechanism which is able, because of the nature of the
modeling itself, to take into account the known information and
the possible misunderstandings. This type of modeling eases
the implementation in some simple situations but does not deal
with some important problems in various fields.
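The planning-operator view above can be sketched in code. This is an illustrative sketch only, not the project's implementation; the predicate encoding and the function names are assumptions made for the example.

```python
# Illustrative sketch of the Request planning operator described above.
# Mental-state literals are encoded as hashable tuples; a state is a set
# of such literals. All names here are assumptions for illustration.

def fact(pred, *args):
    """Represent a mental-state literal as a hashable tuple."""
    return (pred,) + args

def request_operator(speaker, hearer, action):
    """Planning operator for Request(Speaker, Hearer, Action)."""
    return {
        "name": ("Request", speaker, hearer, action),
        "preconditions": {
            fact("Want", speaker, ("Request", speaker, hearer, action)),
            fact("Want", speaker, action),
        },
        "effects": {
            fact("MutualBelief", hearer, speaker,
                 fact("Want", speaker, action)),
            fact("Want", hearer, action),
        },
    }

def applicable(operator, state):
    """An operator applies when all its preconditions hold in the state."""
    return operator["preconditions"] <= state

def apply_op(operator, state):
    """Applying the operator adds its effects to the state."""
    return state | operator["effects"]

# Example: the system wants the user to confirm a booking.
op = request_operator("system", "user", "confirm_booking")
state = set(op["preconditions"])
assert applicable(op, state)
state = apply_op(op, state)
assert fact("Want", "user", "confirm_booking") in state
```

Plan recognition can then be phrased over such operators: given observed acts, search for a chain of operators whose preconditions and effects link them together.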
The first problem is to translate the sentence uttered by the user into a dialogue act. This process is not a simple transcoding problem. It is necessary to take into account both a large collection of knowledge (mental states, presuppositions, prosody, ...) and the cues present in the sentence (syntactic structure, lexical items, ...). In addition, the surface form of spoken sentences contains many irregularities (performance problems), which complicate the speech recognition task as well as the understanding and interpretation tasks.
The second problem lies in the use of the planning
formalism itself: a priori
exclusive models and dialogue contexts have to be integrated in order to increase the
number of dialogue problems dealt with. The problem is thus partly
moved from dialogue modeling towards integration modeling.
The third problem arises frequently in interaction: it
concerns miscommunication. Each of the two participants (i.e. human and system) can indeed have erroneous knowledge
about the application, about the other's abilities, and about current
dialogue elements such as the references used to point out objects during
the interaction. An error concerning this information may,
sooner or later, lead to a failure, i.e. to the system being unable
to satisfy the user. Detecting and dealing with
these errors basically requires a characterization process and
plan-based modeling.
In an interactive system, the application has to behave as an active component. In current systems, application modeling exhibits two main kinds of defects. The task model may be too rigid (for example, plans in information-delivery systems), constraining the user's initiative too heavily. The task model may also be based on constraints (as in CAD applications), allowing a freer user activity but lacking the co-operation needed to help the user reach their goal. We believe that the task model has to include the following elements: the data and their ontology, knowledge about the use of the data (operating modes), and the interface with the rest of the system. Lastly, the modeling has to be designed so as to make changing the task easier.
We are studying an additional modality, a tactile screen, in order to get around some of the problems arising from the use of speech. The problems raised by this new modality concern the integration of messages coming from the different channels, the processing of references, and the evaluation of systems.
The use of speech technologies in interactive systems raises problems and difficulties spanning from the design of complete software (including the study of the task) to the architecture design, and including particularly good quality speech synthesis and the introduction of a new modality.
Human communication is seldom monomodal: gesture and speech are often used jointly for functional reasons (designating elements, communication reliability). In a speech environment, introducing an additional modality (in our case, gesture by means of a tactile screen) makes it possible to overcome some speech recognition errors.
But it also raises new difficulties. The first one
concerns the way information coming from the
various communication channels is integrated: at which level
(syntactic, semantic or pragmatic) should the integration be done? What kind
of modeling should be used? Few satisfactory
answers can be found in the literature. We chose to lean on Maybury's works
The second difficulty is the processing of references, particularly in the framework of the chosen application (querying a geographical and tourist database). Pointing out the objects of interest during the dialogue is done both by means of the spoken sentence and by gesture (pointing, drawing a zone), and takes the application context into account (the user can follow the outline of a cartographic object with her finger).
Studies in this domain come both from linguistics and from
artificial intelligence. Some linguists
The ambition to put dialogue systems on the market requires complying
with requirements on the quality of interaction. It is necessary
to be able to evaluate and compare different systems from
different points of view (speech recognition rate, dialogue
efficiency, language and dialogue abilities, ...) in the framework
of equivalent applications, and possibly, for the same system, to
evaluate different approaches. Various metrics have already been
proposed
This research theme focuses on the elaboration of machine learning methodologies at all stages of a dialogue system.
Machine Learning can be seen as the branch of Artificial
Intelligence concerned with the development of programs able to
increase their performance with experience. Its central mechanism
is induction, or generalization: extracting a concept or a process
from examples of its output. From an engineering point of view, a
Machine Learning algorithm is often the search for the best
element
Machine Learning is a very active field, gathering a variety of
different techniques. Roughly speaking, two families of techniques
can be distinguished. On the one hand, some Machine Learning
algorithms use learning sets of symbolic data and discover a
concept
The Cordial project is concerned with the introduction of Machine Learning techniques at every stage of a dialogue process. This implies that we want to learn concepts which basically produce time-ordered sequences. That is why we are interested in learning from sequences, either in a symbolic framework or in a statistical one.
At the front end of an oral dialogue system, the incoming
speech is processed by a recognition device, generally producing a
lattice of word hypotheses, i.e. the lexical possibilities
between two instants in the sentence. A syntactic model then has to be
used to help produce the sequence of words with the best joint
lexical and syntactic likelihood.
The syntactic analysis can be carried out either through a formal
model, given a priori by the designer of the system, or
through a statistical model, the simplest being based on
counting how grammatical classes follow each other in a
learning corpus (bigram model).
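The bigram model just mentioned can be sketched in a few lines. In this minimal illustration plain words stand in for grammatical classes, and no smoothing is applied; both choices are simplifications for the example.

```python
# Minimal bigram language model: count how tokens follow each other
# in a learning corpus, then score a sentence as a product of
# conditional probabilities P(w_i | w_{i-1}). No smoothing is applied.
from collections import Counter

def train_bigrams(corpus):
    """Count unigrams and bigrams over sentences padded with markers."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence + ["</s>"]
        unigrams.update(tokens[:-1])
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return unigrams, bigrams

def probability(sentence, unigrams, bigrams):
    """Product of conditional bigram probabilities along the sentence."""
    tokens = ["<s>"] + sentence + ["</s>"]
    p = 1.0
    for prev, cur in zip(tokens[:-1], tokens[1:]):
        if unigrams[prev] == 0:
            return 0.0
        p *= bigrams[(prev, cur)] / unigrams[prev]
    return p

corpus = [["the", "cat", "sleeps"], ["the", "dog", "sleeps"]]
uni, bi = train_bigrams(corpus)
# "the" is always followed by a noun, each noun half of the time:
assert probability(["the", "cat", "sleeps"], uni, bi) == 0.5
```

A stochastic finite automaton, as discussed below, generalizes this: the bigram model is the special case where each state remembers exactly one preceding token.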
Both types of models are of interest in Machine Learning: grammatical inference is basically the theory and algorithmics of extracting formal grammars from samples of sentences, while the discovery of a statistical model from a corpus is an important problem in natural language processing. It is interesting to combine both approaches by extracting from the learning corpus a stochastic finite automaton as the language model: it has the advantages of a probabilistic model, but can also exhibit long-distance dependencies reflecting a real structure in the sentences.
We have worked on grammatical inference in recent years,
especially within a contract with FTR&D between 1998 and 2001.
The field remains very active in the Machine Learning community,
and much progress in grammatical inference has recently been made in
the framework of Language and Speech processing (
Any sentence is both a sequence of words and a hierarchical organization of this sequence. The second aspect is particularly important to analyze if one wants to understand syntactic and prosodic aspects of oral speech. Producing synthetic speech in oral dialogue requires a good quality prosody generator, since much information is carried through that channel. Usually, the prosody of synthetic speech is produced by rules which use syllabic, lexical, syntactic and pragmatic information to compute the pitch and the duration of every syllable of the synthetic sentence.
An alternative approach is to consider a corpus of natural sentences and to use machine learning algorithms. More precisely, every sentence in this learning set must be described both in terms of the information relevant to its prosody (syllabic, lexical, etc.) and in terms of the prosody itself. The machine learning task is to produce explicit or hidden rules associating the description with the prosody.
At the end of the learning procedure, a prosody can be associated with any sentence described in the same representation.
The learning methods used in the literature make use of neural networks or decision trees, ignoring the hierarchical nature of the organization of syntax and prosody, which are also known to be strongly linked. This is why we represent a sentence by a tree and use a corpus-based learning method. As a first step, we have used the nearest-neighbour rule.
Given a learning sample of pairs of trees (sentences) and labels
(prosody),
This raises two problems: firstly, finding a good description of a
sentence as a tree; secondly, defining a distance between trees.
We have worked on these questions in recent years
In the context of speech synthesis, we would now like to use a
more sophisticated lazy learning method: learning by
analogy. Its principle is as follows: knowing a sentence
.
Actually, we do not yet study learning by analogy directly on trees, but on sequences. The reason is that the distance we use between trees and between sequences (the edit distance) is much easier to manage in the universe of sequences.
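The edit distance in question is the classical Levenshtein distance; a minimal sketch, with unit insertion, deletion and substitution costs (the project may of course use other cost settings).

```python
# Levenshtein edit distance by dynamic programming over prefixes:
# d[i][j] is the minimum cost of turning a[:i] into b[:j] using
# unit-cost insertions, deletions and substitutions.

def edit_distance(a, b):
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        d[i][0] = i                               # delete all of a[:i]
    for j in range(1, len(b) + 1):
        d[0][j] = j                               # insert all of b[:j]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # delete a[i-1]
                          d[i][j - 1] + 1,        # insert b[j-1]
                          d[i - 1][j - 1] + cost) # match or substitute
    return d[len(a)][len(b)]

# The word pair used later in the text: three insertions (i, n, final e).
assert edit_distance("exact", "inexacte") == 3
assert edit_distance("abc", "abc") == 0
```

The optimal edit trace (the sequence of operations realizing this minimal cost) is recovered by backtracking through the same table, which is what the analogies discussed below compare.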
We first worked on defining what it means to solve an
analogical equation on sequences when the edit distance is
introduced. In general, an analogical equation can be described as
follows: find and is often written by
The idea is to generalize the studies of
Lepage
As a consequence of these two primitive axioms, five other
equations are easy to prove equivalent to
Another possible axiom (determinism) requires that one of
the following trivial equations has a unique solution (the other
being a consequence):
We can now give a definition of a solution to an analogical
equation which takes into account the axioms of analogy:
correct solution to the
analogical equation
Solving analogical equations between sequences has drawn only
little attention in the past. Most relevant to our discussion are
the works of Yves Lepage, presented in full detail in
Lepage
Yvon considers that comparing sequences for solving analogical
equations must be based only on insertions and deletions of
letters, and must satisfy Lepage's axioms. His work is based
on an algebraic approach and on an operator, the shuffle. The
shuffle of two words
Yvon constructs a finite-state transducer to compute the set of solutions of any analogical equation on strings. He also produces a refinement of this finite-state machine able to rank these analogies so as to recover an intuitive preference for simple analogies. The "simplicity" of an analogy is related to the number of entanglements produced by the shuffle operator. This corresponds to the intuition that good analogies should preserve large chunks of the original objects, the ideal analogy involving identical objects.
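The shuffle operator itself is easy to state: the shuffle of two words is the set of all interleavings that preserve the letter order of each word. A minimal enumeration (Yvon's construction is a finite-state transducer; this brute-force recursion is only for illustration):

```python
# Shuffle of two words: all interleavings keeping each word's letter order.
# Brute-force recursion for illustration; exponential in the word lengths.

def shuffle(u, v):
    """Return the set of interleavings of u and v."""
    if not u:
        return {v}
    if not v:
        return {u}
    return ({u[0] + w for w in shuffle(u[1:], v)} |
            {v[0] + w for w in shuffle(u, v[1:])})

result = shuffle("ab", "cd")
# C(4, 2) = 6 ways to interleave two 2-letter words with distinct letters.
assert len(result) == 6
assert "abcd" in result and "cdab" in result and "acbd" in result
```

The ranking of analogies by "entanglement" then amounts to preferring interleavings such as "abcd" or "cdab", where each word survives as one chunk, over "acbd", where the chunks alternate.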
Both these studies are strongly based on the property of distribution in analogy that is taken as an axiom by Lepage and
Yvon: In an analogy
Firstly, we have to define what "is to" means. If we keep the line
of the previous work, "edit trace between
Secondly, "as" requires the comparison of two traces, which are
themselves sequences (or more simply, "as" can be equality).
One problem here is to define precisely which types of edit
operations we take into account: do we allow only
deletions, insertions and substitutions of subtrees or letters,
whatever their place, or do we have to specify, for instance, that a
substitution involves two verbal groups? When a set of edit
operations is given, a second problem is to define a distance on
this set. If a distance based on edit operations is defined on
traces, we can say that there is an analogy when this distance is
null. We are still investigating approximate analogy, when
this distance is not null. For instance, "exact"
is to "inexacte" approximately as
"finitude" is to "infinité". The smaller
the distance between traces, the better the approximation.
In conclusion, we aim at giving a sound definition of analogy on
sequences as a first step, then on prosodic tree structures as a
second step. With this definition of analogy, we will implement an
algorithm for solving analogical equations. Then, for the learning
by analogy problem, we will adapt fast NN-algorithms, such as
AESA
Text-to-speech synthesis (TTS) can be carried out by concatenating acoustic units taken from a continuous speech database. State-of-the-art TTS systems juxtapose pre-recorded acoustic units, typically phones, diphones or non-uniform longer units.
An alternative to the production of speech from a dictionary of
diphones consists of using an indexed corpus of continuous speech
The multiple representation of these configurations at the
acoustic and phonological levels enables voice quality to be
improved significantly compared with a synthesis based a priori on a
finite set of phonemes. We therefore face a combinatorial issue where linguistic units
Our methodological research framework tries to answer the following points:
From an acoustic quality point of view, given a continuous speech database with phone labels, how can we characterize the best sequence of linguistic units
From an algorithmic point of view, in a graph search framework, what are the best heuristics to solve this combinatorial problem?
From a pragmatic point of view, given a target application, what could be the best set of pre-recorded speech sentences yielding the best TTS quality?
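The graph search framework mentioned in the second point can be sketched as a Viterbi pass over the candidate units for each target phone, minimizing the sum of a target cost and a concatenation cost. The cost functions below are placeholder numbers, not the actual acoustic measures used in a real unit-selection system.

```python
# Unit selection as a graph search: one column of candidate units per
# target phone; a Viterbi pass minimizes target cost + concatenation cost.
# Costs here are illustrative placeholders, not real acoustic distances.

def select_units(candidates, target_cost, concat_cost):
    """candidates: list (one per target phone) of lists of unit ids.
    Returns (best total cost, best path of unit ids)."""
    # best[u] = (total cost of best path ending in unit u, that path)
    best = {u: (target_cost(0, u), [u]) for u in candidates[0]}
    for t in range(1, len(candidates)):
        new_best = {}
        for u in candidates[t]:
            # Best predecessor for u, accounting for the join cost.
            prev, (cost, path) = min(
                ((p, best[p]) for p in best),
                key=lambda kv: kv[1][0] + concat_cost(kv[0], u))
            new_best[u] = (cost + concat_cost(prev, u) + target_cost(t, u),
                           path + [u])
        best = new_best
    return min(best.values())

# Toy example: two candidate takes per phone; joining units from the
# same recording ("take" suffix) is free, crossing takes costs 1.
cands = [["a1", "a2"], ["b1", "b2"]]
tc = lambda t, u: 0.0
cc = lambda p, u: 0.0 if p[1] == u[1] else 1.0
cost, path = select_units(cands, tc, cc)
assert cost == 0.0 and path in (["a1", "b1"], ["a2", "b2"])
```

The search is exact here; on realistic databases, the candidate lists are large enough that pruning heuristics (beam search, pre-selection of units) become the issue raised in the second point above.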
Machine learning methodologies based on statistical approaches require databases of substantial size. The examples taken from these databases exhibit the relationships between the numerous variables involved in the studied phenomenon. In synthesis as in speech recognition, one wishes to have an explicit relation between the acoustic level and the phonological level. While automatic labelling from phonetic sequences is a task for which acceptable solutions exist, the process is more complex when only the text is known.
In such a context, given a speech utterance realized by a speaker
and its particular phonetic transcription, the precise location of
the temporal marks delimiting phone boundaries in the speech signal is
required. State-of-the-art systems use a Markovian description
of the speech in an appropriate acoustic space
Sequences of Hidden Markov Models (HMM) are built from the phonetic
description of the acoustic observations. As one needs to
discriminate phone boundaries, most phone
segmentation systems postulate a monophone modeling hypothesis.
During a learning phase, the parameters of each phone model are
learned from a set of examples using the well-known EM
iteration scheme
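Once per-frame phone scores are available, the segmentation step itself is a dynamic program placing the boundaries. The sketch below is a simplification, not the project's system: synthetic numbers stand in for the HMM log-likelihoods, and each phone must cover at least one frame.

```python
# Simplified phone segmentation by dynamic programming. scores[p][f] is
# the (here synthetic) log-likelihood of frame f under phone p; the DP
# places boundaries maximizing the total score, each phone covering
# at least one frame, phones appearing in the given order.

def segment(scores):
    """Return (best total score, list of (start, end) frame spans)."""
    P, F = len(scores), len(scores[0])
    NEG = float("-inf")
    # best[p][f]: best score with phones 0..p covering frames 0..f
    best = [[NEG] * F for _ in range(P)]
    back = [[0] * F for _ in range(P)]
    acc = 0.0
    for f in range(F):                      # phone 0 covers frames 0..f
        acc += scores[0][f]
        best[0][f] = acc
    for p in range(1, P):
        for f in range(p, F):
            span = 0.0
            for s in range(f, p - 1, -1):   # phone p covers frames s..f
                span += scores[p][s]
                cand = best[p - 1][s - 1] + span
                if cand > best[p][f]:
                    best[p][f], back[p][f] = cand, s
    # Trace the boundaries back from the last frame.
    spans, f = [], F - 1
    for p in range(P - 1, 0, -1):
        s = back[p][f]
        spans.append((s, f))
        f = s - 1
    spans.append((0, f))
    return best[P - 1][F - 1], spans[::-1]

# Two phones over four frames; frames 0-1 fit phone 0, frames 2-3 phone 1.
scores = [[0.0, 0.0, -5.0, -5.0],
          [-5.0, -5.0, 0.0, 0.0]]
total, spans = segment(scores)
assert spans == [(0, 1), (2, 3)] and total == 0.0
```

In a real system the same backtracking is embedded in Viterbi decoding over the HMM states, and the EM phase re-estimates the models from the alignments it produces.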
We propose to weaken the previous hypothesis by replacing the exact phonetic transcription with the exact phonemic sequence. The phonemic sequence is built automatically from the text. Under the same phonemic symbol, various acoustic realizations can be found, depending on the coarticulation context of the realized phone.
Currently, many automatic systems delivering information suppose that the user of such a service must be able to adapt himself to the implicit requirements of the automatic system.
We postulate that a man-machine interface based on natural
language must facilitate access to this type of service for the
greatest number of people.
There already exist, of course, many information systems whose technology is based on man-machine oral dialogue. In a first stage, the speech input of such a system is translated into a sequence of words. This sequence of words is then treated by a pragmatic entity taking a dialogue model into account. In return, the machine's response is uttered by a speech synthesis system starting from text or from a concept model.
Within this experimental framework, a problem which still lacks satisfactory scientific and technological answers today is that of semantic treatment.
A semantic function in the context of natural speech processing has
a double objective. Firstly, a pragmatic treatment carried out on
a sequence of concepts is more relevant than one carried out directly
on a sequence of words, since words are sensitive to the errors of the
recognition system. Secondly, a system able to
understand the message can propose different alternative renderings,
which current text-based speech synthesis systems cannot do.
The proposed research framework sits between the output of
an automatic speech recognition system and the input of a dialogue
management system. From the description of an utterance recognized
by the ASR system and translated into a word lattice, the goal
is to provide the sequence of the underlying concepts. We
propose to explore the temporal dimension of the sequence of
concepts in an oral utterance, starting from its relationship to
syntax and more particularly from the discovery of the set of thematic
roles
We will adopt a methodology based on the observation of corpora of
examples within a statistical theoretical framework. However, it
is difficult to find corpora annotated with semantic elements
(particularly in French). The phenomenon will therefore be described by
partially observed random variables
On the one hand, the state of the art in speech processing can
efficiently model the acoustic characteristics of a voice
starting from a voice print. On the other hand, few studies
address the suprasegmental aspect of speech, more
precisely the automatic modeling of melody contours beyond ad hoc prosodic
models.
We propose a model making it possible to describe a melody contour
at the sentence level as built over a sequence of elementary melody
contours. The difficulty is that the alphabet of classes of
elementary contours is not known a priori.
These classes thus have to be estimated from
sentence-level observations, while relying on assumptions of
parsimony
A melody contour is a one-dimensional, real-valued signal
evolving over time. We propose to take as a methodological
assumption the class of dynamical state space models
We model dialogue phenomena using the concepts of speech acts and dialogue acts, and we consider that a sequence of exchanges can be analyzed as the result of planning. Machine learning can also be used to increase the efficiency of the planner. A well-known topic in Artificial Intelligence is the use of experience to increase the efficiency of inference engines, planners and, generally speaking, every kind of reasoning system. A frequently used framework is Case-Based Reasoning, which uses a corpus of previous experience to discover "shortcuts" or to memorize frequently used pieces of elaborated information. Another possibility is to use statistics on the sequencing of actions to make decisions informed by experience.
A thesis topic has been proposed this year: the adaptation of the actions of an agent in a communication situation. The main issue is to give the agent a capacity for analyzing the ongoing dialogue, in order to dynamically adapt its strategy if necessary. To achieve this goal, statistical machine learning techniques will be applied to dialogue corpora.
This work is part of the CRC (Contrat de Recherche
Coopérative, Cooperative Research Contract)
The aim of this study is to design and develop educational software to help teach and learn languages.
The use of Ordictée concerns the primary school
exercise called dictation. In this application, a speech
synthesizer reads a French text while the pupil types the
orthographic transcription on the keyboard. The reading speed is
at any moment tailored to the typing speed. The pupil can
correct the text at any time. This application relies on the
design and development of specific tools, such as the alignment
of the text provided by the teacher with the pupil's text.
The application domains for our research are all the situations where man-machine communication requires speech or where the use of speech brings more comfort. These applications are in general complex enough to require a real dialogue situation, and would be tedious if handled through a simple sequence of guided short answers.
Examples of such applications are: information services on a personal computer or a public one, booking services by telephone, and computer-assisted language learning.
We develop our applications on the CNRT platform DORIS, to promote
joint projects with industrial research. The Georal
system is a demonstrator of tourist information services, with
oral dialogue and a tactile screen. We also have a "dictation"
software package called Ordictée, which has been tested in
primary schools. Finally, a grammatical inference library,
Epigram, has been developed.
The Cordial project aims to promote its research activities by means of technological demonstrations. To this end, hardware and software resources have been brought together to build an R&D platform named DORIS, dedicated to man-machine interaction, in particular using vocal and dialogue technologies. The main funding comes from IRISA/INRIA, the Regional Council of Brittany and Cordial public contract funding. Within the CNRT-TIM Bretagne, DORIS is intended to promote joint projects between institutional and industrial research.
DORIS is involved in various research projects such as
Georal (see sections and
), Semantic Parrot (see
section ), and Ordictée (see section
). An INRIA research engineer manages the technical
aspects of the platform and develops new software for the
aforementioned projects.
On the computing side, a Compaq AlphaServer system has been chosen to support our processing power needs, especially for speech processing. In addition, the platform includes a Network Appliance file server with a storage capacity of up to 350 GB.
In order to facilitate technical access for industrial partners, the platform includes fast, secure network access. DORIS inherits from the ENSSAT-Université de Rennes 1 network. We provide a high-speed internet connection with VPN access.
On the client side, PCs with up-to-date sound configurations are used. These computers are meant for software development within DORIS. They are currently used by engineers, Ph.D. students and postgraduate students involved in the Cordial project. Touch screens have been bought in order to facilitate the development of multimodal man-machine interfaces.
This client-server configuration is fully functional inside the ENSSAT campus. Further improvements will focus on lightweight clients and resource sharing with external partners (see section ).
The DORIS platform's main goal is to group research projects in the field of man-machine interaction. Within this entity, each project can take advantage of the other teams' work and tools.
We first direct our efforts towards the installation of a
multi-agent
We made this choice to simplify the development while ensuring standards compliance. Furthermore, the Java technology allows us to use already developed libraries that are not necessarily within our sphere of competence (e.g. sound or speech coding, framing, streaming) and therefore to concentrate on our scientific interests.
Today, two projects have taken place inside the DORIS platform:
Georal (see sections and
) and the Semantic Parrot
(see section ).
In order to share our resources and to enable efficient cooperative work with extra-ENSSAT partners, we would like to allow those involved (e.g. other academic researchers or industrial partners) to connect to the DORIS platform network.
VLAN and VPN solutions are currently being studied. The main point is to be careful about security issues.
At another level, one of the guiding lines in DORIS developments is to reach a complete independence between the server and the clients, in a technological sense. A solution is to allow lightweight clients (e.g. PDAs and cellphones) to communicate with the DORIS server as fluently as a web application running on a PC.
Georal Tactile is a multimodal system able to
provide tourist information to naive users. Users
can ask for information about the location of places of interest
(city, beach, chateau, church, ...) within a region or a subregion,
or about the distance and itinerary between two localities. Users interact
with the system through three different modalities: a visual mode,
by looking at the map displayed on the screen; an oral,
natural language mode (thanks to a speech recognition system); and
a gesture mode, by pointing to or drawing on a touch screen. The
system itself uses both the oral channel (text-to-speech
synthesis) and graphics, such as flashing sites and routes and
zooming in on subsections of the map, so as to best inform the
user.
The Georal project started in 1989 and has been the origin of
various works since then. It was fully developed in Visual Prolog
4.0. We decided to re-implement Georal to make the most of
the capabilities of the DORIS platform.
The foundation stone of this re-implementation was to split the initial Prolog modules (syntactic and semantic analysis, dialogue management and tactile screen management) according to the multi-agent paradigm (one module for one functionality). We assigned one agent to each specific role; these agents are written in Java. However, we kept the core functions in Prolog, to take advantage of the fact that this language is really convenient for tasks like natural language processing. All peripheral functions, from screen management to client-server communication, have been rewritten in Java and C/C++.
Calling Prolog predicates from agents written in Java was not
straightforward. After a benchmarking phase, we decided to use a
Java package that allows such calls (tuProlog). This implied
studying the existing Prolog files to extract the useful predicates
and correcting them to bring the code closer to ISO Prolog.
Furthermore, the work on the Prolog code allowed an improvement of
the Georal engine's capacities: a larger range of queries
is now accepted by the system and some bugs have been fixed.
Improvements have also been made on the gesture management side.
The touch screens prove useful for processing new kinds of
drawings, such as following a winding route.
A text-to-speech server has been installed and a dedicated agent communicates with it. The processing time and the sound quality are very good, but we are using a local network for the moment. We plan to insert the Internet between the clients (possibly wireless devices) and the server. Substantial work on data coding and the communication protocol has to be done beforehand.
The integration of the speech recognition server is in progress. Here again, the variety of hardware and software used will require flexibility in the next developments.
Ordictée addresses the primary-school exercise known as
dictation. In this application, a speech synthesizer reads a
French text while the pupil types the orthographic transcription
on his keyboard. The reading speed is at any moment tailored to
the typing speed, and the pupil can correct the text at any time.
The application is based on the design and development of specific
tools, such as the alignment of the text provided by the teacher
with the pupil's text.
Ordictée is a piece of software that allows a pupil to perform a
dictation exercise on his own. It is made up of three modules : the
pupil module, which carries out the dictation exercise with the
pupil; the teacher module, which allows the teacher to design his
own dictation texts; and the administrator module, which is devoted
to setting the application parameters. One of the main functions of
Ordictée is to follow the typing, i.e. to adapt the reading
rhythm to the typing speed. This function is based on the
hypothesis that mistakes do not affect the pronunciation, and on
the phonetic closeness of the two texts (the pupil's text and the
teacher's).
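The alignment between the teacher's text and the pupil's typing can be illustrated by a classical edit-distance alignment. The sketch below is a minimal Python illustration, not the project's implementation (which additionally exploits phonetic closeness); the function name and the unit costs are our own assumptions.

```python
def align(teacher, pupil):
    """Align two token sequences with classical edit distance
    (Levenshtein, unit costs); return aligned pairs, with None
    marking a word skipped by the pupil or typed in excess."""
    m, n = len(teacher), len(pupil)
    # dp[i][j] = cost of aligning teacher[:i] with pupil[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = i
    for j in range(1, n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = dp[i - 1][j - 1] + (teacher[i - 1] != pupil[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    # backtrack to recover the aligned pairs
    pairs, i, j = [], m, n
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                dp[i][j] == dp[i - 1][j - 1] + (teacher[i - 1] != pupil[j - 1])):
            pairs.append((teacher[i - 1], pupil[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            pairs.append((teacher[i - 1], None))
            i -= 1
        else:
            pairs.append((None, pupil[j - 1]))
            j -= 1
    return list(reversed(pairs))
```

Pairs whose two tokens differ point at probable mistakes; following the typing then amounts to advancing the reading position along this alignment.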
From a usability point of view, the semantic parrot that we
propose takes a speech message from a standard audio input
(personal computer, PDA, cell phone), understands the underlying
concepts and finally generates a paraphrase on a speech output.
From a technological point of view, the semantic parrot implements techniques of speech recognition, automatic speech understanding and, finally, concept-to-speech synthesis.
Currently, a first demonstrator is being built on the DORIS platform (see section ), implementing a speech recognition technology provided by TELISMA and a speech synthesis technology provided by FTR&D.
The software library called Epigram (Environnement de
Programmation pour l'Inférence GRAMmaticale), has been developed
between 1997 and 2001. Epigram is a library of high
level modules enabling the development of grammatical inference
programs and applications. It has been written around a C++
environment called LEDA, a library of data types and combinatorial
algorithms developed at the Max-Planck-Institut für
Informatik, Saarbrücken, Germany. Epigram has been
written together with the team EURISE of the University of
Saint-Étienne and the former IRISA project Aïda (F. Coste,
now in the Symbiose project), as a part of a contract with France
Telecom FT R&D (CTI 97 1B 004).
Dialogue systems have to access a world description that is relevant to the interacting user. To prevent the system from entering an infinite loop on a query it cannot answer, a finite first-order dialogue logic has been devised. Besides its dialogue-oriented characteristics, this logic makes it possible to view the satisfiability-detection procedure as spatial rewritings based on an axiomatization of the inner data structures. As various axiom sets are conceivable, a solver has been prototyped, from which the rewritings can be animated on a graphical interface.
This study was completed this year; it allowed us to single out the most effective axioms and also to forge new ones that had not initially been conceived. The next step is to assess the performance of this spatial-rewriting approach. For this purpose, we are now developing a solver in what seems the most competitive context : the propositional calculus, where efficiency has been a constant issue over the past decades.
This activity was developed in the framework of the AUPEL-AUF
agency, in collaboration with several laboratories. The grant was
to last four years, but the agency ended the funding after only
two years because of internal problems, despite the interesting
results obtained. We have therefore temporarily stopped our
studies on this topic, while maintaining a scientific watch
(bibliography, workshop during the ACL conference in Toulouse in
August 2001). The design of the new Georal system will
make it possible to work on this topic again, by recording real
corpora and by testing different methods and algorithms.
A study of referring phenomena in an enlarged version of
Georal has been carried out. We also continued work to
improve the ORDICTEE software (dealing with phonetically based
mistakes, following the typing).
Recent progress in speech recognition makes it possible to plan
important new developments in the Georal Tactile dialogue
system.
First, we conducted an experiment to determine the linguistic
behaviour of users when they reference elements on the map. A
large number of linguistic forms were observed, as well as the use
of built-up elements (for example, referencing a triangle by means
of particular points). A new type of gesture (following a line)
was also observed.
We proposed a syntactic model to parse and filter referential
expressions in the users' utterances. This model is based on the
works of Vandeloise and Borillo.
As far as cartography is concerned, we developed a new data model and search algorithms that are better adapted to the elements being handled.
Finally, we redesigned the architecture of the system and the processing flow in order to deal with several issues : more complex gestures, references to objects that are not stored in the database, and two-stage processing. Contrary to what occurs in the current version, we give priority to gesture activity over speech activity; this principle makes it possible to progressively check and, if necessary, correct the referential linguistic expressions, to determine the referents on the map and to build, when needed, new elements in the database. Some of these algorithms have been implemented and we are integrating them into the system.
We began studies, firstly, to model the different semantic points
of view (natural language, graphics) in a uniform way, starting
from Pineda and Garza's work [Pin00], and secondly, to bring
together the processing of references in Georal and the
plan-based modeling of dialogue. We are also studying the use of
the concept of salience, taking into account results from the
LORIA laboratory.
A new algorithm for aligning the teacher's text and the pupil's text has been designed. Its implementation and evaluation are under way. If the results are promising, a patent will be filed.
A technology review has been done this year.
This research topic produced no new results in 2003, since its
main contributor completed his thesis, defended in December 2002.
A communication on the latest results was given at the CAp
conference, Laval, June 2003.
The thesis of L. Blin studied how to learn the prosody of a
sentence by using a distance between trees (sentences being
represented as trees) and the nearest-neighbour technique. It was
concluded at the end of 2002 and yielded its last results in
2003.
We examine how the nearest neighbour method could be extended to learning by analogy.
Before defining what learning by analogy is, let us define what
it means to solve an analogy on trees. It can be described as
follows : given three trees A, B and C, find the tree X such that
A is to B as C is to X, which is often written A : B :: C : X.
Firstly, we have to define what "is to" means. If we keep the line
of the previous work, this relation can be based on the notion of
edit trace between trees.
We do not presently work on trees but on sequences, since the
edit distance on sequences is well known and easier to implement.
We have mostly worked on defining what it means to solve an
analogical equation on sequences when the edit distance is
introduced, in order to generalize the works by Lepage.
We have produced an algorithm for this purpose.
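As a toy illustration of solving an analogical equation A : B :: C : X on sequences, the Python sketch below handles only the simple suffix-substitution case; the actual work relies on the edit distance and generalizes Lepage's approach, so this function and its limited scope are our own simplification.

```python
def solve_analogy(a, b, c):
    """Toy solver for the analogical equation a : b :: c : x on
    strings. It only covers the case where b is obtained from a by
    replacing a suffix (e.g. inflection); returns None otherwise."""
    # longest common prefix of a and b
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    sa, sb = a[i:], b[i:]  # suffix removed / suffix added by a -> b
    if not c.endswith(sa):
        return None        # the same transformation does not apply to c
    return c[:len(c) - len(sa)] + sb
```

For instance, walk : walked :: talk : X yields X = talked, while go : went :: eat : X is not solvable by this simple scheme.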
This study is covered within the framework of a PhD thesis funded by a research contract with FTR&D Lannion (FTR&D/DIH/ISP). Work began on October 1, 1999 and was completed by the defence of Hélène François' PhD thesis in December 2002.
A Text-to-Speech (TTS) synthesis system produces an acoustic speech signal corresponding to the pronunciation of a written text. Currently, state-of-the-art TTS systems assemble elementary acoustic segments in order to produce the speech signal. Most of the time, the set of elementary acoustic speech units, e.g. a set of diphones, triphones or longer units, is fixed and determined from human phonological expertise, whatever the sentences to produce.
The main objective of this work is to reformulate this assumption about the nature of the acoustic units. We consider that the acoustic continuum at the synthesis stage is built by assembling acoustic segments that are unknown a priori and subject to contextual conditions. The acoustic database from which the segments to assemble are extracted is a continuous-speech database, long enough to make these contextual choices of segments possible.
Firstly, we propose to automatically determine a minimal set of
sentences taken from a huge set of written sentences : a French
corpus from various sources (dialogue transcriptions, literature,
TV series scripts, medicine courses) containing approximately
400,000 sentences (see activity report 2002). Obviously, it is
impossible to record the speech equivalent of this first
linguistic database. With a method solving a minimal covering, we
have found a reduced linguistic database, made of only 4,000
sentences, ensuring the coverage of the targeted units.
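A minimal covering of this kind is typically approximated greedily. The following Python sketch is an illustrative greedy set-cover heuristic with invented data, not the method actually used to obtain the 4,000 sentences: at each step it keeps the sentence covering the most still-uncovered units.

```python
def greedy_cover(units_of):
    """Greedy heuristic for minimal covering.
    units_of: dict mapping a sentence id to the set of units
    (e.g. diphones in context) that the sentence contains."""
    universe = set().union(*units_of.values())
    chosen, covered = [], set()
    while covered != universe:
        # pick the sentence with the largest number of new units
        best = max(units_of, key=lambda s: len(units_of[s] - covered))
        chosen.append(best)
        covered |= units_of[best]
    return chosen
```

The greedy heuristic does not guarantee a minimal cover (the exact problem is NP-hard), but it gives a strong approximation and scales to hundreds of thousands of sentences.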
Secondly, we propose a methodology to evaluate the criteria used
in unit-selection methods. Usually, criteria are evaluated in a
comparative black-box way : the performance of a criterion is
measured relative to the performance of the other criteria. We
present a glass-box method exploring the sequences of units able
to synthesize a given target utterance. Given a continuous-speech
database, one can enumerate all the possible phonological units up
to a predefined length. Given the phonetic sequence of the text to
be synthesized, we need to parse this sequence with phonological
units. We represent the different cuts of the phonetic sequence
into unit sequences by a graph; the selection problem then comes
down to a best-path search. Obviously, a true operational
synthesis system cannot afford such a time-consuming task. The
usual approach consists in avoiding the full search by defining
selection criteria which lead, in reasonable time, to an
as-good-as-possible solution. The problem is then to evaluate how
far that solution is from the best one. From another point of
view, this work can be seen as a way to bound the heuristics used
in most TTS systems.
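The exhaustive glass-box search can be sketched as a shortest-path dynamic program over all cuts of the phonetic sequence into database units. In this illustrative Python sketch, the unit-cost table is a stand-in for the real selection criteria; its content and the function name are assumptions.

```python
def best_segmentation(phones, unit_cost):
    """Exhaustive search of the best cut of a phoneme sequence into
    database units. unit_cost maps a unit (tuple of phonemes) to its
    cost and contains only the units present in the database."""
    n, INF = len(phones), float("inf")
    best = [0.0] + [INF] * n      # best[i] = cost of covering phones[:i]
    back = [None] * (n + 1)       # back[i] = start of the last unit
    for end in range(1, n + 1):
        for start in range(end):
            unit = tuple(phones[start:end])
            if unit in unit_cost and best[start] + unit_cost[unit] < best[end]:
                best[end] = best[start] + unit_cost[unit]
                back[end] = start
    if best[n] == INF:
        return None, INF          # no full cover exists in the database
    # backtrack the best path through the cut graph
    units, i = [], n
    while i > 0:
        units.append(tuple(phones[back[i]:i]))
        i = back[i]
    return list(reversed(units)), best[n]
```

On real databases this enumeration is exactly the time-consuming search mentioned above; it is useful offline, as a reference against which heuristic selection criteria can be bounded.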
Since December 2002, we have been involved in a national research
project named Néologos (2003 Technolangue call for
proposals). Our essential participation consists in applying the
methodologies developed during Hélène François' thesis work to
define speech corpora usable both for speech recognition and for
speech synthesis. Currently, the content of the corpora is
defined; their collection must take place during 2004.
This study is covered within the framework of a PhD thesis funded by a research contract with FTR&D Lannion (FTR&D/DIH/ISP). Work began on January 1, 2000.
This work relates to the automatic segmentation of speech corpora, read or spontaneous, into phone units. Text-to-Speech synthesis systems based on concatenated acoustic units require this process. The general framework of this study is described in section .
Firstly, we developed a baseline speech segmentation system based
on HMM (Hidden Markov Models), as briefly described in section
. The segmentation scores we obtained are equivalent to those of
state-of-the-art systems from the literature. Moreover, to analyze
the behavior of this baseline segmentation system, we carried out
experiments along two axes : the topological definition of the
HMMs and the acoustic analysis of the speech.
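HMM-based segmentation essentially amounts to a Viterbi forced alignment of the known phone sequence against the signal frames. The Python sketch below is a deliberately simplified illustration (one state per phone, frame log-likelihoods given as input); real systems use multi-state HMMs with Gaussian-mixture densities.

```python
def forced_align(loglik):
    """Viterbi forced alignment of a known left-to-right phone
    sequence: assign each frame t to one phone s, with a monotone
    path from phone 0 at the first frame to the last phone at the
    last frame. loglik[t][s] = log-likelihood of frame t under s."""
    T, S = len(loglik), len(loglik[0])
    NEG = float("-inf")
    dp = [[NEG] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]
    dp[0][0] = loglik[0][0]
    for t in range(1, T):
        for s in range(S):
            stay = dp[t - 1][s]                      # remain in phone s
            move = dp[t - 1][s - 1] if s > 0 else NEG  # enter phone s
            back[t][s] = s if stay >= move else s - 1
            dp[t][s] = max(stay, move) + loglik[t][s]
    # backtrack from the last phone at the last frame
    path = [S - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))  # phone index assigned to each frame
```

The phone boundaries are read off wherever the returned path changes state; segmentation accuracy then depends directly on how well the acoustic models fit the frames.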
Secondly, we focused our activity on automatic phone labelling from the text, not from the true phonetic sequence. Given an automatic phonemic transcription of the text, various acoustic realizations can be found, depending on the coarticulation context of the realized phone and on the speaker. Since the HMM framework is well suited to introducing pronunciation variants, all that is needed is to extend the graph of model hypotheses and let the decoding phase find the best alignment. The main drawback of this scenario is that, as the degrees of freedom increase, the system becomes unstable and less accurate. We derive the following working hypotheses from these remarks :
The segmentation system operates only with text and signal. The grapheme/phoneme transcription should be carried out automatically. The risk of a mismatch between the signal and its phonemic representation can be locally very high.
Given a specific phoneme, all of the acoustical realizations are captured by a single HMM model using mixture of distributions for the observable densities.
In order to correct the automatically generated phonemic transcription and achieve good segmentation scores, local mismatches between the phonemic transcription and the acoustics must be detected, and alternative labels should be proposed by relaxing the constraints of the model description.
Considering the wide scope of this topic, we have addressed only
the detection problem. Hence, we turned our attention towards
methods for scoring the confidence of this acoustic-to-phonological
mapping. Compared with the state of the art, the original
confidence measure we proposed within the HMM framework
experimentally yields the best scores, evaluated through DET
curves (12% Equal Error Rate, EER, on a randomly blurred test
database).
Concerning the confidence measure, a patent filing took place in October 2002 in France (see activity report 2002); an extension to the US took place in October 2003.
This study is covered within the framework of a PhD thesis funded by FTR&D. Work began on January 1, 2003.
The approach suggested within the framework of this thesis is based on modeling the prosody by a set of forms representative of the various realizations of the melody present in a reference speech corpus. Once this representation is defined, we plan to automatically segment the database using these elementary forms. Each of these segments is annotated with syntactic, phonetic and phonological labels obtained during the linguistic analysis of the corpus. Next, we want to address the question of mapping the tagged prosodic elementary forms to the associated linguistic characteristics. Taking into account the correspondence between linguistic and prosodic parameters should make it possible to restore the elocution style actually recorded by a speaker. Moreover, at the synthesis stage, during the selection of acoustic units, the prosodic targets produced by the proposed system should correspond better to the true prosodic parameters of the recorded speaker.
In relation to this scientific topic, we are hosting a DEA student during the 2003/04 academic year. His work consists in proposing a model of melody contours based on a nonlinear state-space system.
This study is covered within the framework of a PhD thesis funded by INRIA within a scientific collaboration with FTR&D Lannion (FTR&D/DIH/D2I). Work began on October 1, 2003.
As indicated in section , a thesis is to be started this year in the framework of the CRC with FTR&D.
The main topic of this project is the creation of new telephone speech databases for the French language.
The project has two main objectives : a multi-speaker speech database of children's voices (1000 speakers) and a multi-speaker speech database of adults' voices.
Cordial is mainly concerned with the second task. We aim to define, for French, a speech database of reference speakers, i.e. a speech corpus in which each speaker has pronounced enough utterances to characterize his voice. To achieve this goal, we need to record many more than 50 utterances for each speaker. We plan to record a database in which 200 reference speakers record, over the fixed telephone network, 500 well-defined utterances covering the main coarticulation features of the language.
In addition to speech recognition systems, such a corpus is also useful for research and development in speaker identification and authentication, voice transformation, and the characterization of voices for Text-to-Speech systems.
The partners of the project are of three types :
Academic laboratories conducting active research on vocal technologies (IRISA, LORIA, and FTR&D), whose main contribution will be the supply of research tools and the realization of validation tests.
Industrial partners (TELISMA and DIALOCA) marketing speech recognition products, whose contribution will be the organization of the collection itself and the realization of "industrial" tests intended to show the contribution of the corpus to the improvement of the products.
ELDA (European Language Resources Distribution Agency), whose vocation is to distribute linguistic resources and which also carries out corpus-creation activities.
This year, the CRC (Contrat de Recherche Coopérative,
Cooperative Research Contract) has been finalized.
The subject is of common interest to our two research units. The CRC federates all the manpower of both teams involved in the topic. It covers the thesis of P. Alain, described in section , another thesis at FTRD DIH/DII, started in February 2002, and a thesis to begin at the end of 2003. The total manpower in permanent researchers is 0.125 man-year at FTRD and at Cordial (scientific management of the CRC and supervision of the thesis).
The Cordial team is a member of the European Network of Excellence in Human Language Technologies Elsnet, and of the French-speaking network FRANCIL (Réseau FRANCophone d'Ingénierie de la Langue).
The Cordial team is a member of the CNRS action
spécifique ASILA (machine learning and dialogue), which
brought together, in 2002 and 2003, computer scientists and
linguists from the following teams : Cordial (IRISA), the
"dialogue" group of GREYC (Université de Caen), the LIR group
(LIMSI), LIUM (Université du Mans) and Langue et Dialogue (LORIA).
The managers are Laurent Romary (LORIA) and Daniel Luzzati (LIUM).
In particular, a catalogue of dialogue corpora has been established
and a seminar has been held on the different aspects of the
learning process in dialogue. The activity report for ASILA is
available at LIUM.
Cordial is also part of a "pre-projet" in the interdisciplinary program TCAN of the CNRS, called ANALANGUE (analogies in sequences).
Olivier Boëffard has been reviewer for the IEEE transactions on Speech and Audio Processing, the Signal Processing journal, and the Speech Communication journal.
Laurent Miclet has been a member of the scientific committee of
the French Machine Learning Congress, Conférence d'Apprentissage
CAP 2003.
Jacques Siroux has been reviewer for the journal Cahiers
romans de sciences cognitives and for a special issue of the
journal Revue française d'Intelligence artificielle.
Olivier Boëffard teaches the course in Speech Synthesis in
the DEA STIR, Rennes 1 (option Signal, orientation 2)
and takes part in the module Data Mining
(Fouille de données) in the DEA
Informatique de Rennes 1.
Marc Guyomard and Jacques Siroux teach the module
human-machine communication at Enssat, Lannion (Lannion
part of the DEA Informatique de Rennes 1).
Laurent Miclet teaches a course in Pattern Recognition
(Reconnaissance des Formes) in the DEA STIR and part of the
module Apprentissage et Classification (AC) in the DEA
Informatique de Rennes 1. In the Lannion part of the DEA
Informatique de Rennes 1, for which he is the coordinator, he
teaches a module of Machine Learning (Apprentissage
Artificiel) and takes part in the module Data Mining
(Fouille de données).
Laurent Miclet has been invited to give a talk on
machine learning in dialogue systems at the 4th Sino-French
Workshop on Web Technologies, Taipei, April 2003.
Laurent Miclet has been invited to give a talk on "Apprentissage artificiel : applications à la robotique" (machine learning: applications to robotics), together with A. Cornuéjols, at the Journées Nationales de Recherche en Robotique, Clermont-Ferrand, October 2003.
Laurent Miclet has been a member of the jury for the following theses and HdR defenses:
A. Skrzyniarz. May 2003. Analyse de problèmes de décision
distribuée, évaluation de leur complexité, et conception
d'heuristiques de résolution. ENST Bretagne.
C. Godin. June 2003. Introduction aux structures
multi-échelles. Application à la représentation des plantes.
Habilitation à diriger des recherches. Montpellier 2.
D. Fredouille. October 2003. Inférence d'automates finis
non déterministes par gestion de l'ambiguïté, en vue
d'applications en bioinformatique. Rennes 1.
F. Duclaye. November 2003. Apprentissage automatique de
relations d'équivalence sémantique à partir du web. ENST Paris.
C. Blouin (as rapporteur). November 2003.
Sélection des unités en synthèse de la parole. Orsay.
Jacques Siroux participated as a rapporteur in the PhD thesis
juries of J. Goulian (Valoria) and F. Landragin (LORIA), as well
as in the HDR (Habilitation à diriger des recherches) jury of
J.-Y. Antoine (VALORIA).
This year, two DEA students are carrying out their research period with us.